CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20...

46
CS250, UC Berkeley Fall ‘20 Lecture 10 CS250 VLSI Systems Design Fall 2020 John Wawrzynek with Arya Reais-Parsi

Transcript of CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20...

Page 1: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

CS250VLSISystemsDesign

Fall2020

JohnWawrzynek

with

AryaReais-Parsi

Page 2: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ProjectTeamPresentations‣ Inclass,Oct1(thisThursday)

‣ Target10minutes(withdiscussion)

‣ Slideswithillustrations(powerpoint,…)

‣ Onepresentationeachfrom:config,CLB,SRAM,MACteams

‣ Twointerconnectionpresentations

‣ Threefabricteampresentations:

‣ toolsupport

‣ high-levelfabricarchitecture

‣ simulation,testing,integrationplan

‣ Followingweek,privategroupmeetingswithArya&Johntogetfeedbackandbrainstormideas

2

Page 3: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ProjectTeamPresentations‣ Thepointistogetthediscussiongoingonthefunctionand

implementationofyourpiece.

‣ Youareresponsiblefora“straw-man”/draftproposal

‣ Okaytoleavesomeissuesopenfornow

‣ Outline

‣ Onlyonepersonneedstospeakbut,introduceteammembers

‣ Describeyourproposedfunction/featuresandstructure(blockdiagram/circuit)ofyourpiece

‣ Describehowyouplantorefinethedefinitionsoffunction/structureandtooptimizethedesign

‣ Saysomethingaboutimplementationstrategy

‣ Saysomethingaboutwhatinformationyouwillneedfromotherteamsandwhatotherteamswillneedfromyou

3

Page 4: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ProjectTeams‣ Ifyouhavequestionsabouthow

youendedupinwhichteam,mailmeorsetupappointment

‣ Ifyouhavequestionsaboutyourteam’sroleandresponsibility,asknow,ormailuslater

‣ Ifyoudon’thaveemailcontactsforyourotherteammembers,asknow,ormailuslater

‣ Toprepareforthepresentationsnextweek,notnecessaryrightnowtoreachouttoothergroups,butfeelfreetodoso

4

Fabric

InterconnectConfiguration

CLB

MAC

SRAMInteraction/coordination Graph

Page 5: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

CircuitsTopics:Basic(review?)

‣ Processing/devices:planar,finfets,GDR

‣ Devicemodels:switch,RC,Vth

‣ Logiccircuits:gates,muxes,transmissiongates,FFs

‣ CircuitDelay:gatedelay,wiredelay,FETsizing

‣ CircuitPower:formulation/factors

‣ SystemDelay:factors,optimization

‣ SystemPower:factors,optimization

5

What you need to know as a VLSI Systems designer.

Page 6: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

SystemDelay‣ CriticalPath

‣ OptimizationTechniques

‣ Clockdistribution

6

Page 7: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

InGeneral...

T ≥ τclk→Q + τCL + τsetup

‣ Howdoweenumerateallpaths?– Anycircuitinputorregisteroutputtoanyregisterinputorcircuit

output?

• Note:– “setuptime”foroutputsisafunctionofwhatitconnectsto.

– “clk-to-q”forcircuitinputsdependsonwhereitcomesfrom.

7

For correct operation:

for all paths.

Page 8: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ComponentsofPathDelay

1. #oflevelsoflogic2. Internalcelldelay3. wiredelay

4. cellinputcapacitance5. cellfanout6. celloutputdrivestrength

8

How do we optimize?Tackle “critical path”

Synthesis tools approximate path delay and attempt to optimize by rearranging logic network and choosing appropriately sized cells.

“Logical Effort” method for hand sizing of transistors.

Place and route tools attempt to minimize wire delay on critical paths.

Page 9: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

Treesforoptimization

9

+ + + + + + +x0

x1 x2 x3 x4 x5 x6 x7

T = O(N)

+ +

+ + +

+

+

T = O(log N)

(( x0 + x1 ) + ( x2 + x3 )) + (( x4 + x5 ) + ( x6 + x7 ))

((((((x0 + x1 ) + x2 ) + x3 ) + x4 ) + x5 ) + x6 ) + x7

❑ What property of “+” are we exploiting? ❑ Other associate operators? Boolean operations? Division? Min/Max?

Same number of operations (N-1)

Page 10: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

Pipelining‣ Generalprinciple:

‣ CuttheCLblockintopieces(stages)andseparatewithregisters:

Assume T=8ns TFF(setup +clk→q)=1ns F = 1/9ns = 111MHz

Assume T1 = T2 = 4ns

10

Simple, except in classes of feedback (loop carry dependance)

Page 11: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

SystemPower‣ Chip/blocklevelPower

‣ Optimizationforpowerandenergyefficiency

‣ Powerdistribution

11

Page 12: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

EnergyandPower

‣ Handheldandportable(batteryoperated):❑ EnergyEfficiency-limitsbatterylife❑ Power-limitedbyheat

‣ Infrastructureandservers(connectedtopowergrid):❑ EnergyEfficiency-dictatesoperationcost❑ Power-heatremovalcontributestoTCO

12

Energy Efficiency: energy per operation

P =dWdt

Energy is the ability to do work (W).Power is rate of expending energy.

Remember: reducing power is easy - just slow down. Improving energy efficiency is difficult.

Heat is a byproduct of computation. Heat dissipated is proportional to the energy used per unit time, P.

Page 13: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

Fivelow-powerdesigntechniques

13

Power-down idle transistors

Parallelism and pipelining

Slow down non-critical paths

Clock gating

Thermal management

Page 14: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10 14

Gate delay roughly linear

with Vdd

This magic trick brought to you by Cory Hall ...

3636

Active Power ReductionActive Power Reduction

Slow Fast Slow

Lo

w S

up

ply

Vo

ltag

e

Hig

h S

up

ply

Vo

ltag

e

Multiple Supply

Voltages

Logic BlockFreq = 1

Vdd = 1

Throughput = 1

Power = 1

Area = 1

Pwr Den = 1

Vdd

Logic Block

Freq = 0.5

Vdd = 0.5

Throughput = 1

Power = 0.25

Area = 2

Pwr Den = 0.125

Vdd/2

Logic Block

Replicated DesignsAnd so, we can transform this:

Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”.

P ~ F ⨯ Vdd2

P ~ 1 ⨯ 1 2

Into this: Top block processes “left”, bottom “right”.

3636

Active Power ReductionActive Power Reduction

Slow Fast Slow

Lo

w S

up

ply

Vo

ltag

e

Hig

h S

up

ply

Vo

ltag

e

Multiple Supply

Voltages

Logic BlockFreq = 1

Vdd = 1

Throughput = 1

Power = 1

Area = 1

Pwr Den = 1

Vdd

Logic Block

Freq = 0.5

Vdd = 0.5

Throughput = 1

Power = 0.25

Area = 2

Pwr Den = 0.125

Vdd/2

Logic Block

Replicated Designs

CV2 power only

P ~ #blks ⨯ F ⨯ Vdd 2

P ~ 2 ⨯ 1/2 ⨯ 1/4 = 1/4

Page 15: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

Cell(PS3Chip):1CPU+8“SPUs”

15

PowerPC

L2 Cache512 KB

Synergistic Processing

Units(SPUs)

8

Page 16: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 103434

Circuit Techniques ReduceCircuit Techniques ReduceSource Drain LeakageSource Drain Leakage

Body BiasBody Bias

+ + VeVe

VddVddVbpVbp

VbnVbn

- - VeVe

2 - 10X2 - 10X

Sleep TransistorSleep Transistor

2 - 1000X2 - 1000X

Stack EffectStack Effect

5 - 10X5 - 10X

Logic Logic

BlockBlock

Equal LoadingEqual Loading

LeakageLeakage

ReductionReduction

Add“sleep”transistorstologic...

16

Example:Floatingpointunitlogic.

Whenrunningfixed-pointinstructions,putlogic“tosleep”.

+++When“asleep”,leakagepowerisdramaticallyreduced.

---Presenceofsleeptransistorsslowsdowntheclockratewhenthelogicblockisinuse.

Page 17: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

Fact:Mostlogiconachipis“toofast”° Aproductthat

17

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

netlist. Of these, 1 2 1 7 1 3 were top-level chip global nets,and 2 1 7 1 1 were processor-core-level global nets. Againstthis model 3 .5 million setup checks were performed in latemode at points where clock signals met data signals inlatches or dynamic circuits. The total number of timingchecks of all types performed in each chip run was9 .8 million. Depending on the configuration of the timingrun and the mix of actual versus estimated design data,the amount of real memory required was in the rangeof 1 2 GB to 1 4 GB, with run times of about 5 to 6 hoursto the start of timing-report generation on an RS/6 0 0 0 *Model S8 0 configured with 6 4 GB of real memory.Approximately half of this time was taken up by readingin the netlist, timing rules, and extracted RC networks, as

well as building and initializing the internal data structuresfor the timing model. The actual static timing analysistypically took 2 .5 –3 hours. Generation of the entirecomplement of reports and analysis required an additional5 to 6 hours to complete. A total of 1 .9 GB of timingreports and analysis were generated from each chip timingrun. This data was broken down, analyzed, and organizedby processor core and GPS, individual unit, and, in thecase of timing contracts, by unit and macro. This was onecomponent of the 2 4 -hour-turnaround time achieved forthe chip-integration design cycle. Figure 26 shows theresults of iterating this process: A histogram of the finalnominal path delays obtained from static timing for thePOWER4 processor.

The POWER4 design includes LBIST and ABIST(Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testingon pre-final POWER4 chips revealed that several circuitmacros ran slower than predicted from static timing. Thespeed of the critical paths in these macros was increasedin the final design. Typical fast ac LBIST laboratory testresults measured on POWER4 after these paths wereimproved are shown in Figure 27.

SummaryThe 1 7 4 -million-transistor !1 .3 -GHz POWER4 chip,containing two microprocessor cores and an on-chipmemory subsystem, is a large, complex, high-frequencychip designed by a multi-site design team. Theperformance and schedule goals set at the beginning ofthe project were met successfully. This paper describesthe circuit and physical design of POWER4 , emphasizingaspects that were important to the project’s success in theareas of design methodology, clock distribution, circuits,power, integration, and timing.

Figure 25

POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.

VIM

Timer files ReportsAsserts

Spice

Spice

GL/1

Reports

< 12 hr

< 12 hr

< 12 hr

< 48 hr

< 24 hr

Non-uplift timing

Noiseimpacton timing

Upliftanalysis

Capacitanceadjust

Chipbench /EinsTimer

Chipbench /EinsTimer

Extraction

Core or chipwiring

Analysis/update(wires, buffers)

Notes:• Executed 2– 3 months prior to tape-out• Fully extracted data from routed designs • Hierarchical extraction• Custom logic handled separately • Dracula • Harmony• Extraction done for • Early • Late

Extracted units (flat or hierarchical)Incrementally extracted RLMsCustom NDRsVIMs

Figure 26

Histogram of the POWER4 processor path delays.

!40 !20 0 20 40 6 0 80 100 120 140 16 0 180 200 220 240 26 0 280Timing slack (ps)

Lat

e-m

ode

timin

g ch

ecks

(th

ousa

nds)

0

50

100

150

200

IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2 J. D. WARNOCK ET AL.

47

Most wires have hundreds of picoseconds to spare.The critical path

Page 18: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

3636

Active Power ReductionActive Power Reduction

Slow Fast Slow

Lo

w S

up

ply

Vo

ltag

e

Hig

h S

up

ply

Vo

ltag

e

Multiple Supply

Voltages

Logic BlockFreq = 1

Vdd = 1

Throughput = 1

Power = 1

Area = 1

Pwr Den = 1

Vdd

Logic Block

Freq = 0.5

Vdd = 0.5

Throughput = 1

Power = 0.25

Area = 2

Pwr Den = 0.125

Vdd/2

Logic Block

Replicated Designs

Useseveralsupplyvoltagesonachip...

18

Whyusemulti-Vdd?Wecanreducedynamicpowerbyusinglow-powerVddforlogicoffthecriticalpath.

Whatifwecan’tdoamulti-Vdddesign?Inamulti-Vtprocess,wecanreduceleakagepowerontheoffcriticalpathlogicbyusinghigh-Vthtransistors.

Page 19: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ClockGatingReducesClockLoad

19

“Upto70%powersavingsattheblocklevel,forapplicablecircuits”SynopsisDataSheet

<=

Page 20: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

Keepchipcooltominimizeleakage

20

Optimizing Des igns for Power Cons umption through Changes to the FPGA Environment

WP28 5 (v1.0) February 14, 2008 www.xilinx.com 7

R

Optimizing Designs for Power Consumption through Changes to the FPGA Environment

To optimize the power consumption in any design, certain things can be done independently of the design contained within the FPGA. Knowing one's environment, e.g., operating temperature and core voltage, is therefore important.

Temperature ControlControlling temperature not only helps with reliability, as described in the “Thermal Considerations and Reliability” section, but it also reduces static power. For example, a reduction in junction temperature from 100°C to 85°C reduces static power by ~ 20%, as shown previously in Figure 1 and with greater detail in Figure 3 .The static power of Virtex-4 and Virtex-5 FPGAs is already reasonable. However, reducing it by another 20% is valuable because in some designs, the static power of the FPGA represents a sizeable portion (3 0-40%) of the total power budget. A reduction in junction temperature can be achieved by increased airflow and larger heat sinks. The reduction in junction temperature also has the added benefit of increasing reliability as shown in the “Thermal Considerations and Reliability” section.

Static power is a function of die temperature (TJ), and TJ is a function of how much power the device is consuming, the thermal properties of that device, and its package. Consequently, the FPGA’s ability to transfer the resultant heat to the surrounding environment, via the component packaging, is very important.Heat flows out of the die from the top of the FPGA and into the package balls and PCB, so it is important to understand the system model (PCB, FPGAs, heat sinks, airflow, and other components in a system). See Figure 4.

X-Ref Target - Figure 3

Figu re 3 : ICCINTQ vs . J unction Temperature with Increas e Relative to 2 5 °C

-40 -20 200 40 6 0 80 100 120 140

25°C

50°C

WP285_03_021208

25

50

80°C

100°C

I CC

INT

Q L

eaka

ge C

urre

nt(N

orm

aliz

ed to

25°

C)

Junction Temp °C

JunctionTemperature

(TJ °C)

NormalizedStatic Poweror ICCINTQ

Typical

85

100

1.00

1.46

2.50

3.14

1

2

3

4

5

6

7

Optimizing Des igns for Power Cons umption through Changes to the FPGA Environment

WP28 5 (v1.0) February 14, 2008 www.xilinx.com 7

R

Optimizing Designs for Power Consumption through Changes to the FPGA Environment

To optimize the power consumption in any design, certain things can be done independently of the design contained within the FPGA. Knowing one's environment, e.g., operating temperature and core voltage, is therefore important.

Temperature ControlControlling temperature not only helps with reliability, as described in the “Thermal Considerations and Reliability” section, but it also reduces static power. For example, a reduction in junction temperature from 100°C to 85°C reduces static power by ~ 20%, as shown previously in Figure 1 and with greater detail in Figure 3 .The static power of Virtex-4 and Virtex-5 FPGAs is already reasonable. However, reducing it by another 20% is valuable because in some designs, the static power of the FPGA represents a sizeable portion (3 0-40%) of the total power budget. A reduction in junction temperature can be achieved by increased airflow and larger heat sinks. The reduction in junction temperature also has the added benefit of increasing reliability as shown in the “Thermal Considerations and Reliability” section.

Static power is a function of die temperature (TJ), and TJ is a function of how much power the device is consuming, the thermal properties of that device, and its package. Consequently, the FPGA’s ability to transfer the resultant heat to the surrounding environment, via the component packaging, is very important.Heat flows out of the die from the top of the FPGA and into the package balls and PCB, so it is important to understand the system model (PCB, FPGAs, heat sinks, airflow, and other components in a system). See Figure 4.

X-Ref Target - Figure 3

Figu re 3 : ICCINTQ vs . J unction Temperature with Increas e Relative to 2 5 °C

-40 -20 200 40 6 0 80 100 120 140

25°C

50°C

WP285_03_021208

25

50

80°C

100°C

I CC

INT

Q L

eaka

ge C

urre

nt(N

orm

aliz

ed to

25°

C)

Junction Temp °C

JunctionTemperature

(TJ °C)

NormalizedStatic Poweror ICCINTQ

Typical

85

100

1.00

1.46

2.50

3.14

1

2

3

4

5

6

7

Optimizing Des igns for Power Cons umption through Changes to the FPGA Environment

WP28 5 (v1.0) February 14, 2008 www.xilinx.com 7

R

Optimizing Designs for Power Consumption through Changes to the FPGA Environment

To optimize the power consumption in any design, certain things can be done independently of the design contained within the FPGA. Knowing one's environment, e.g., operating temperature and core voltage, is therefore important.

Temperature ControlControlling temperature not only helps with reliability, as described in the “Thermal Considerations and Reliability” section, but it also reduces static power. For example, a reduction in junction temperature from 100°C to 85°C reduces static power by ~ 20%, as shown previously in Figure 1 and with greater detail in Figure 3 .The static power of Virtex-4 and Virtex-5 FPGAs is already reasonable. However, reducing it by another 20% is valuable because in some designs, the static power of the FPGA represents a sizeable portion (3 0-40%) of the total power budget. A reduction in junction temperature can be achieved by increased airflow and larger heat sinks. The reduction in junction temperature also has the added benefit of increasing reliability as shown in the “Thermal Considerations and Reliability” section.

Static power is a function of die temperature (TJ), and TJ is a function of how much power the device is consuming, the thermal properties of that device, and its package. Consequently, the FPGA’s ability to transfer the resultant heat to the surrounding environment, via the component packaging, is very important.Heat flows out of the die from the top of the FPGA and into the package balls and PCB, so it is important to understand the system model (PCB, FPGAs, heat sinks, airflow, and other components in a system). See Figure 4.

X-Ref Target - Figure 3

Figu re 3 : ICCINTQ vs . J unction Temperature with Increas e Relative to 2 5 °C

-40 -20 200 40 6 0 80 100 120 140

25°C

50°C

WP285_03_021208

25

50

80°C

100°C

I CC

INT

Q L

eaka

ge C

urre

nt(N

orm

aliz

ed to

25°

C)

Junction Temp °C

JunctionTemperature

(TJ °C)

NormalizedStatic Poweror ICCINTQ

Typical

85

100

1.00

1.46

2.50

3.14

1

2

3

4

5

6

7

A recipe for thermal runaway

Page 21: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

CircuitsTopics:Advanced‣ Clocksandclocking:

‣ clockdriversanddistribution

‣ skeweffects

‣ hold-time

‣ clockdomainsandsynchronization

‣ Phase-lockedLoops(PLL)/Delay-lockedLoops(DLL)

‣ GloballyAsynchronouslocallySynchronous(GALS)clocking

‣ Powersupplyanduse

‣ Powerdistributionanddecouplingcapacitors

‣ DynamicVoltageandFrequencyScaling(DVFS)

‣ voltageregulators

‣ devicestacking,powergating,clockgating,multi-threshold

‣ Multi-voltagesystems

‣ chargepumps

‣ latch-up/wellplugs

‣ Input/Output

‣ ElectrostaticDischarge(ESD)suppression/pad-drivers

‣ High-speedI/O,Serializer/Deserializer(SerDes)

‣ packaging

21

Page 22: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ClocksandClocking‣ Inmyexperiencebuildingchipsandbringingupsystems,

thegreatmajorityofproblemsrelatetoclocksandpower.

22

Page 23: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

Cycle time (max): TClk > tclk-q,max + tlogic,max + tsetup

Race margin (min): thold < tclk-q,min + tlogic,min

RevisedTimingConstraints

tclk-q,maxtclk-q,min

tlogic,maxtlogic,min

tsetup, thold

23

Page 24: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ClockNonidealities

‣ Clock skew: tSK ‣ Timedifferencebetweenthesink(receiving)andsource

(launching)clockedge;deterministic+random

‣ Clock jitter ‣ Temporalvariationsinconsecutiveedgesoftheclock

signal;modulation+randomnoise

‣ Cycle-to-cycle(short-term)tJS

‣ LongtermtJL

‣ Variation of the pulse width ‣ Importantforlevelsensitiveclocking

24

Page 25: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

SourcesofClockUncertainties

25

Page 26: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ClockSkewandJitter‣ Bothskewandjitteraffecttheeffectivecycletimeand

theracemargin

Clk

Clk’

tSK

tJS

26

Page 27: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

PositiveSkew

Launching edge arrives before the receiving edge

27

Page 28: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

NegativeSkew

Receiving edge arrives before the launching edge

CLK1

CLK2

TCLK

δ

TCLK - δ

2

1

4

3

28

Page 29: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

TimingConstraints

29

Minimum cycle time: Tclk + δ = tclk-q,max + tsetup + tlogic,max

tclk-q,maxtclk-q,min

tlogic,max

tlogic,mintsetup, thold

Skew may be negative or positive

Page 30: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

TimingConstraints

30

tclk-q,maxtclk-q,min

tlogic,maxtlogic,mintsetup, thold

Hold time constraint: t(clk-q,min) + t(logic,min) > thold + δ

Skew may be negative or positive

Page 31: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

the total wire delay is similar to the total buffer delay. Apatented tuning algorithm [1 6 ] was required to tune themore than 2 0 0 0 tunable transmission lines in these sectortrees to achieve low skew, visualized as the flatness of thegrid in the 3 D visualizations. Figure 8 visualizes four ofthe 6 4 sector trees containing about 1 2 5 tuned wiresdriving 1 /1 6 th of the clock grid. While symmetric H-treeswere desired, silicon and wiring blockages often forcedmore complex tree structures, as shown. Figure 8 alsoshows how the longer wires are split into multiple-fingeredtransmission lines interspersed with Vdd and ground shields(not shown) for better inductance control [1 7 , 1 8 ]. Thisstrategy of tunable trees driving a single grid results in lowskew among any of the 1 5 2 0 0 clock pins on the chip,regardless of proximity.

From the global clock grid, a hierarchy of short clockroutes completed the connection from the grid down tothe individual local clock buffer inputs in the macros.These clock routing segments included wires at the macrolevel from the macro clock pins to the input of the localclock buffer, wires at the unit level from the macro clockpins to the unit clock pins, and wires at the chip levelfrom the unit clock pins to the clock grid.

Design methodology and resultsThis clock-distribution design method allows a highlyproductive combination of top-down and bottom-up designperspectives, proceeding in parallel and meeting at thesingle clock grid, which is designed very early. The treesdriving the grid are designed top-down, with the maximumwire widths contracted for them. Once the contract for thegrid had been determined, designers were insulated fromchanges to the grid, allowing necessary adjustments to thegrid to be made for minimizing clock skew even at a verylate stage in the design process. The macro, unit, and chipclock wiring proceeded bottom-up, with point tools ateach hierarchical level (e.g., macro, unit, core, and chip)using contracted wiring to form each segment of the totalclock wiring. At the macro level, short clock routesconnected the macro clock pins to the local clock buffers.These wires were kept very short, and duplication ofexisting higher-level clock routes was avoided by allowingthe use of multiple clock pins. At the unit level, clockrouting was handled by a special tool, which connected themacro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed

Figure 6

Schematic diagram of global clock generation and distribution.

PLL

Bypass

Referenceclock in

Referenceclock out

Clock distributionClock out

Figure 7

3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.

Del

ay

Grid

Tunedsectortrees

Sectorbuffers

Buffer level 2

Buffer level 1

y

x

Figure 8

Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.

Del

ay Multiple-fingeredtransmissionline

yx

J. D. WARNOCK ET AL. IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2

32

Clock Tree Delays, IBM Power

clock grid was completed with a tool run at the chip level,connecting unit-level pins to the grid. At this point, theclock tuning and the bottom-up clock routing process stillhave a great deal of flexibility to respond rapidly to evenlate changes. Repeated practice routing and tuning wereperformed by a small, focused global clock team as theclock pins and buffer placements evolved to guaranteefeasibility and speed the design process.

Measurements of jitter and skew can be carried outusing the I/Os on the chip. In addition, approximately 1 0 0top-metal probe pads were included for direct probingof the global clock grid and buffers. Results on actualPOWER4 microprocessor chips show long-distanceskews ranging from 2 0 ps to 4 0 ps (cf. Figure 9). This isimproved from early test-chip hardware, which showedas much as 7 0 ps skew from across-chip channel-lengthvariations [1 9 ]. Detailed waveforms at the input andoutput of each global clock buffer were also measuredand compared with simulation to verify the specializedmodeling used to design the clock grid. Good agreementwas found. Thus, we have achieved a “correct-by-design”clock-distribution methodology. It is based on our designexperience and measurements from a series of increasinglyfast, complex server microprocessors. This method resultsin a high-quality global clock without having to usefeedback or adjustment circuitry to control skews.

Circuit designThe cycle-time target for the processor was set early in theproject and played a fundamental role in defining thepipeline structure and shaping all aspects of the circuitdesign as implementation proceeded. Early on, criticaltiming paths through the processor were simulated indetail in order to verify the feasibility of the designpoint and to help structure the pipeline for maximumperformance. Based on this early work, the goal for therest of the circuit design was to match the performance setduring these early studies, with custom design techniquesfor most of the dataflow macros and logic synthesis formost of the control logic—an approach similar to thatused previously [2 0 ]. Special circuit-analysis and modelingtechniques were used throughout the design in order toallow full exploitation of all of the benefits of the IBMadvanced SOI technology.

The sheer size of the chip, its complexity, and thenumber of transistors placed some important constraintson the design which could not be ignored in the push tomeet the aggressive cycle-time target on schedule. Theseconstraints led to the adoption of a primarily static-circuitdesign strategy, with dynamic circuits used only sparinglyin SRAMs and other critical regions of the processor core.Power dissipation was a significant concern, and it was akey factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology,

including uncertainties associated with the modelingof the floating-body effect [2 1 –2 3 ] and its impact onnoise immunity [2 2 , 2 4 –2 7 ] and overall chip decouplingcapacitance requirements [2 6 ], was another factor behindthe choice of a primarily static design style. Finally, thesize and logical complexity of the chip posed risks tomeeting the schedule; choosing a simple, robust circuitstyle helped to minimize overall risk to the projectschedule with most efficient use of CAD tool and designresources. The size and complexity of the chip alsorequired rigorous testability guidelines, requiring almostall cycle boundary latches to be LSSD-compatible formaximum dc and ac test coverage.

Another important circuit design constraint was thelimit placed on signal slew rates. A global slew rate limitequal to one third of the cycle time was set and enforcedfor all signals (local and global) across the whole chip.The goal was to ensure a robust design, minimizingthe effects of coupled noise on chip timing and alsominimizing the effects of wiring-process variability onoverall path delay. Nets with poor slew also were foundto be more sensitive to device process variations andmodeling uncertainties, even where long wires and RCdelays were not significant factors. The general philosophywas that chip cycle-time goals also had to include theslew-limit targets; it was understood from the beginningthat the real hardware would function at the desiredcycle time only if the slew-limit targets were also met.

The following sections describe how these designconstraints were met without sacrificing cycle time. Thelatch design is described first, including a description ofthe local clocking scheme and clock controls. Then thecircuit design styles are discussed, including a description

Figure 9

Global clock wav eforms showing 20 ps of measured skew.

1.5

1.0

0.5

0.0

0 500 1000 1500 2000 2500

20 ps skew

Vol

ts (

V)

Time (ps)

IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2 J. D. WARNOCK ET AL.

33

�31

Page 32: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

ClockConstraintsinEdge-TriggeredSystems

32

If launching edge is late and receiving edge is early, the data will not be too late if:

Minimum cycle time is determined by the maximum delays through the logic

tclk-q,max + tlogic,max + tsetup < TCLK – tJS,1 – tJS,2 + δ

tclk-q,max + tlogic,max + tsetup - δ + 2tJS < TCLK

Skew can be either positive or negative Jitter tJS usually expressed as peak-to-peak or n x RMS value

Page 33: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 10

DatapathwithFeedback

33

Page 34: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

ClockDistribution

!34

❑ Single clock generally used to synchronize all logic on the same chip (or region of chip) ▪ Need to distribute clock over the entire region ▪ While maintaining low skew/jitter ▪ And without burning too much power

Page 35: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

ClockDistribution

35

❑ What’s wrong with just routing wires to every point that needs a clock?

CLK

Page 36: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

SymmetricClockTrees

!36

❑ Minimizes skew by matching RC delays from drive point to all terminal points

❑ Is that sufficient?

Page 37: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

H-TreeClockDistribution

!37

❑ H-tree ensures that the relative RC delay to each terminal is matched

❑ What about absolute RC delay?

Wide wires reduce RC delay

Page 38: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

H-TreeClockDistribution

!38

❑ Tapered H-tree can ensure high frequency (low RC value on paths)

❑ Furthermore, wide wires reduces the RC variation (less skew)

❑ What about power?

❑ Adding buffers along the way reduces power and maintains high-frequency, but introduces skew!

Wide wires increases C

Page 39: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

BufferedH-Tree

39

To minimize skew, all buffers must be matched (same layout, orientation)

Page 40: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

MorerealisticASICH-tree

40

[Restle98]

Page 41: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

ClockGridAlternative

Driver

Driver

Driver D

river

GCLK GCLK

GCLK

GCLK

•No RC matching •But huge power

�41

Page 42: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

42

Clock circuits live in center column.

32 global clock wires go down the red column.

Also, 4 regional clocks (restricted functionality).

Any 10 may be sent to a clock region.

FPGA

Page 43: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

Clockshavededicatedwires(lowskew)

43

From: Xilinx Spartan 3 data sheet. Virtex is similar.

Page 44: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

Low-skewClockinginFPGAs

�44Figures from Xilinx App Notes

Page 45: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

45

Die

photo:

Xilinx

Virtex

Gold wires

are the clock tree.

Page 46: CS250 VLSI Systems Designcs250/fa20/files/lec10-circ3.pdfLecture 10 CS250, UC Berkeley Fall ‘20 Project Team Presentations ‣ The point is to get the discussion going on the function

CS250, UC Berkeley Fall ‘20Lecture 04, Reconfigurable Architectures 2

EndofLecture10

46