Timing Issues in Digital ASIC Design
This Class + Logistics
Timing: Storage elements, Clock distribution, Clock tree synthesis
Reading: Whitepapers/datasheets on STA; papers on clock tree synthesis
Schedule: MT in one week (lab/recitation fair game); Lab #2 due Mon 1/27
HW #9: As a block’s layout is compacted down to fit into a smaller and smaller region, the timing of the block at first improves, but then worsens. Explain.
HW #10: Hold time violations mean that the chip doesn’t work at any frequency. Propose several distinct methods for fixing hold time violations (guided by post-routing static timing analysis), and explain the pros and cons of each.
HW #11: Compare DEC’s first Alpha and first StrongArm processors (look up transistor counts, supply voltage, frequency, etc.). (a) How much of StrongArm’s power efficiency can be attributed to process, supply, and frequency scaling? (b) What factors might contribute to the remainder?
Slide courtesy of S. P. Levitan, U. Pittsburgh
Review
Static timing analysis (Lecture 4)
Pin-based timing graph
Directed acyclic graph (DAG) of timing arcs
Longest path in a DAG can be found in time linear in #arcs (edges)
Slack = required arrival time – actual arrival time (long path analysis)
Logic synthesis (Lecture 5)
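The longest-path and slack definitions in this review can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular STA tool; the pin names, arc delays, and the required time of 4 are made up for the example.

```python
from collections import defaultdict, deque

def arrival_times(arcs, primary_inputs):
    """Late-mode arrival times from one topological pass over the timing DAG.

    arcs: list of (src, dst, delay) timing arcs.
    primary_inputs: dict mapping each source pin to its arrival time.
    Every arc is visited exactly once, so the runtime is linear in #arcs.
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    for src, dst, delay in arcs:
        fanout[src].append((dst, delay))
        indegree[dst] += 1

    at = dict(primary_inputs)
    ready = deque(primary_inputs)
    while ready:
        pin = ready.popleft()
        for dst, delay in fanout[pin]:
            # Late mode keeps the latest (longest-path) arrival.
            at[dst] = max(at.get(dst, float("-inf")), at[pin] + delay)
            indegree[dst] -= 1
            if indegree[dst] == 0:  # all fanin arcs processed
                ready.append(dst)
    return at

# Hypothetical four-arc graph: a, b -> n1 -> z, and c -> z.
arcs = [("a", "n1", 2), ("b", "n1", 1), ("n1", "z", 3), ("c", "z", 5)]
at = arrival_times(arcs, {"a": 0, "b": 0, "c": 0})
rat_z = 4                      # assumed required arrival time at z
slack_z = rat_z - at["z"]      # slack = RAT - AAT
print(at["z"], slack_z)
```

With these made-up numbers the arrival time at z is 5, so a required time of 4 gives a slack of -1.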
Static Analysis vs. Dynamic Analysis
Why static analysis when dynamic simulation is more accurate?
Drawbacks of simulation
Requires input vectors (stimuli for circuit)
Long runtimes
Example: calculate worst-case rising delay from a to z
Exponential explosion with number of possible design input states
[Figure: gate with inputs a, b, c and output z]
        c=0          c=1
b=0     a-z delay1   a-z delay2
b=1     a-z delay3   a-z delay4
STA Terminology
(Actual) arrival time (AAT, or AT) = time at which a pin switches state
Usually 50% point on voltage curve, i.e., AT = t50
Slew time = time over which signal switches
Usually difference between 10% and 90% on voltage curve, i.e., tslew = t90 – t10
Required arrival time (RAT) = time at which a signal must arrive in order to avoid a chip fail
Slack = RAT – AAT
Positive slack good (= margin), negative slack bad
[Figure: voltage vs. time waveform marking the 10%, 50%, and 90% of Vdd points]
Example: What is slack at PO?
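[Figure: timing graph with arc delays d between 1 and 5; arrival times propagate from at=0 at the primary inputs through intermediate pins (at=1, 2, 5, 6, 8) to at=11 at the PO, where rat=10]
Slack = rat – at = 10 – 11 = -1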
Example: Incremental Timing Analysis
[Figure: the same timing graph after a local change (arcs with d=1); only the affected arrival times are recomputed (at=3, at=7), and the PO now has at=10 against rat=10]
Slack = 0
Amount of work is bounded by sizes of fanin, fanout cones of logic
Early-Mode Analysis
Definitions change as follows
RAT = lower bound on arrival time
Propagate shortest possible instead of longest possible delays
Slack = Arrival – Required
Example: negative slack because ATc is too small (early)
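[Figure: inputs a, b, c; outputs x, y; two timing arcs of delay 1]
AT_a = 0, AT_b = 1, AT_c = 0
AT_x = 1, RAT_x = 2, SL_x = 1 – 2 = -1
AT_y = 1, SL_y = 1 – 1 = 0
SL_a = 0 – 0 = 0
SL_b = 1 – 0 = 1
SL_c = 0 – 1 = -1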
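The early-mode rules (propagate minimum arrival times forward, propagate required times backward as lower bounds, slack = arrival – required) can be sketched as follows. The topology used here (a and b driving y, c driving x, each through a delay-1 arc) is an assumption; the original figure is not recoverable from the transcript, but this topology is consistent with the arrival times, required time, and slacks given in the example.

```python
def early_mode_slacks(arcs, input_at, output_rat):
    """Early-mode (short-path) slacks.

    arcs: (src, dst, delay) tuples listed in topological order of src.
    input_at: arrival times at primary inputs.
    output_rat: required (earliest allowed) arrival times at outputs.
    """
    at = dict(input_at)
    for src, dst, delay in arcs:
        # Forward pass keeps the earliest arrival.
        at[dst] = min(at.get(dst, float("inf")), at[src] + delay)

    rat = dict(output_rat)
    for src, dst, delay in reversed(arcs):
        # Backward pass keeps the tightest lower bound.
        rat[src] = max(rat.get(src, float("-inf")), rat[dst] - delay)

    return {pin: at[pin] - rat[pin] for pin in at}

# Assumed topology: a, b -> y and c -> x, all arcs with delay 1.
arcs = [("a", "y", 1), ("b", "y", 1), ("c", "x", 1)]
slacks = early_mode_slacks(arcs, {"a": 0, "b": 1, "c": 0}, {"x": 2, "y": 1})
print(slacks)
```

Under that assumed topology the slacks come out as a: 0, b: 1, c: -1, x: -1, y: 0, matching the values listed in the example: c arrives too early.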
Enhancements of STA
Incremental timing analysis
Nanometer-scale process effects – variation (probabilistic timing analysis)
Interference – crosstalk
Multiple inputs switching
Conservatism of delay propagation
(Old: HW #8: Suppose you change the size of one (combinational) gate in your design, thus invalidating the previous timing analysis. How much work must be done to regain a correct timing analysis?)
Courtesy K. Keutzer et al. UCB
Timing Correction
Driven by STA
“Incremental performance analysis backplane”
Fix electrical violations
Resize cells
Buffer nets
Copy (clone) cells
Fix timing problems
Local transforms (bag of tricks)
Path-based transforms
DAC-2002, Physical Chip Implementation
Local Synthesis Transforms
Resize cells
Buffer or clone to reduce load on critical nets
Decompose large cells
Swap connections on commutative pins or among equivalent nets
Move critical signals forward
Pad early paths
Area recovery
Transform Example
Double inverter removal
[Figure: removing a back-to-back inverter pair reduces the path delay from Delay = 4 to Delay = 2]
Resizing
[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C. A gate with inputs a, b drives loads d, e, f of 0.2, 0.2, 0.3. Delay is 0.035 with size A and 0.026 with size C]
Cloning
[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C. A gate with inputs a, b drives five loads d, e, f, g, h of 0.2 each; after cloning, the fanout is split between two copies of the gate, labeled A and B]
Buffering
[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C. A gate with inputs a, b drives five loads d, e, f, g, h of 0.2 each; after buffering, a buffer B takes over part of the fanout and presents a smaller load to the gate]
Redesign Fan-in Tree
[Figure: fan-in tree of three gates with delay 1 each, with Arr(a)=4, Arr(b)=3, Arr(c)=1, Arr(d)=0. Original tree: Arr(e)=6. Redesigned tree, with late-arriving input a moved next to the output: Arr(e)=5]
Redesign Fan-out Tree
[Figure: fan-out tree with arc delays of 1 and 3. Original tree: Longest Path = 5. Redesigned tree: Longest Path = 4, even though one buffer delay rises from 1 to 2 (slowdown of buffer due to load)]
Swap Commutative Pins
[Figure: gate with commutative input pins a, b, c that have different arrival times and pin-to-output delays, shown before and after swapping the pin assignments]
Simple sorting on arrival times and delay works
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Why Clocks?
Clocks provide the means to synchronize
By allowing events to happen at known timing boundaries, we can sequence these events
Greatly simplifies building of state machines
No need to worry about variable delay through combinational logic (CL)
All signals delayed until clock edge (clock imposes the worst case delay)
[Figure: registers and combinational logic arranged as a dataflow pipeline and as an FSM]
Clock Cycle Time
Cycle time is determined by the delay through the CL
Signal must arrive before the latching edge
If too late, it waits until the next cycle
- Synchronization and sequential order become incorrect
tcycle > tprop_delay + toverhead
Can change circuit architecture to obtain smaller Tcycle
Pipelining
For dataflow:
Instead of a long critical path, split the critical path into chunks
Insert registers to store intermediate results
This allows 2 waves of data to coexist within the CL
Can we extend this ad infinitum?
Overhead eventually limits the pipelining
- E.g., 1.5 to 2 gate delays for latch or FF
Granularity limits as well
- Minimum time quantum: delay of a gate
Unpipelined: tcycle > tpd + toverhead
Pipelined: tcycle > max(tpd1, tpd2) + toverhead
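To make the overhead limit concrete, here is a small sketch of the cycle-time bound tcycle > max(stage delays) + toverhead. The numbers (20 gate delays of logic, 2 gate delays of register overhead) and the even split across stages are illustrative assumptions, not data from the lecture.

```python
def min_cycle_time(stage_delays, overhead):
    """Lower bound on cycle time: slowest stage plus storage-element overhead."""
    return max(stage_delays) + overhead

def evenly_pipelined(tpd_total, stages, overhead):
    """Cycle-time bound when the logic is split into equal stages (idealized)."""
    return min_cycle_time([tpd_total / stages] * stages, overhead)

# Units are gate delays. An uneven 12/8 split is limited by the slower stage.
unpipelined = min_cycle_time([20], 2)
uneven_two_stage = min_cycle_time([12, 8], 2)

for stages in (1, 2, 4, 10, 20):
    print(stages, evenly_pipelined(20, stages, 2))
```

Going from 1 to 20 stages divides the logic delay per stage by 20, but the cycle time only drops from 22 to 3 gate delays, because the fixed overhead is paid in every stage.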
[Figure: unpipelined CL block A+B with delay tpd between two registers, compared with CL blocks A and B with delays tpd1 and tpd2 separated by an added pipeline register]
FO4 INV Delays Per Clock Period
[Figure: number of FO4 inverter delays per clock period (0 to 120) vs. year (1982 to 2004) for the 386, 486, DX2, DX4, Pentium, Pentium MMX, Pentium Pro, Pentium II, Celeron, Pentium III, and Pentium 4]
FO4 INV = inverter driving 4 identical inverters (no interconnect)
Half of frequency improvement has been from reduced logic stages, i.e., pipelining
Parallelism
For FSMs:
Same functionality and performance can be achieved at half the clock rate
However, the input and output signals must be doubled (to account for the outputs for each original cycle)
Instead of doubling the delay, the optimized logic is often logarithmically related to the degree of parallelism
tcycle1 > tpd + tov
tcycle2 > Ntpd + tov
tcycle3 > log(Ntpd) + tov
[Figure: original FSM with an M-bit register and CL of delay tpd; parallel version with two cascaded CL blocks and a 2*M-bit register; optimized version in which the cascaded CL is replaced by optimized CL]
Storage Elements
Latches
Level sensitive – transparent when H, hold when L
Flip-flops
Edge-triggered – data is sampled at the clock edge
[Figure: latch and flip-flop symbols (d, ck, q) with d, ck, q waveforms for each]
Latch and Flip-Flop Gates
[Figure: gate-level schematics of a rising edge flip-flop (D, clock, Q, QN) and an active high latch (in, enable, out)]
Latch and flip-flop schematics from TSMC 0.13um LV Artisan Sage-X Standard Cell Library.
Latch and Flip-Flop Behavior
[Figure: signal paths through the rising edge flip-flop and the active high latch (D, Q, QN), when clock is high and when clock is low]
Rising edge flip-flop: tCQ, 4 inverter delays
Active high latch: tDQ, 2 inverter delays
Clock Skew and Jitter
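[Figure: clock distributed to registers A and B, with waveform panels (a), (b), (c)]
Clock skew: clock at B is offset from clock at A by tsk,AB
Duty cycle jitter: the high phase shrinks from Thigh to Thigh – tduty
Cycle-to-cycle edge jitter: each edge can move by tj/2, so the period shrinks from T to T – tj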
Flip-Flop Timing Characteristics
Rising edge flip-flop
[Figure: flip-flops A and B separated by combinational logic, with ideal and non-ideal clock waveforms]
Setup time constraint: Tflip-flops ≥ tCQ,max + tcomb,max + tsu + tsk + tj
Hold time constraint: tCQ,min + tcomb,min ≥ th + tsk
Latch Setup Time and Transparency
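Active high latch
[Figure: latches A and B separated by combinational logic (tcomb); data arriving while a latch is transparent propagates after tDQ, shown against ideal and non-ideal clocks]
No penalty to clock period for setup time constraint!
[Figure: setup time constraint for latches, drawn with tCQ, tcomb,max, tsu, tsk+tj, and tduty against the non-ideal clock]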
Setup Time
Important characteristics of storage elements
Setup time, hold time, clock-to-q delay
Setup time, tsu
Time before the clock edge that the data must arrive in order for the new data to be stored
The setup time for a F/F occurs before the latching edge
The setup time for a latch occurs before the transition from transparent to hold
[Figure: d, ck, q waveforms showing tsetup]
Hold Time
A second important characteristic is the hold time, th
Time after the clock edge that the data must remain stable in order for the data to be properly held
Note that hold time (and setup time) can be negative
Why isn’t hold time just the negative of setup time?
Storage elements typically have some data dependence
- Capacitances, and devices may be faster for one data value versus another
Specify the worst case for process technology and operating condition variations
[Figure: d, ck, q waveforms showing thold]
Clocking Overhead
Inherent delay in any storage element
The delay is measured from:
Clock transition to output data transition, tc2q
Input data transition to output data transition, td2q
Flip-flop is edge triggered
The overhead is tc2q + tsu
Latch is level-sensitive
The overhead is td2q
[Figure: d, ck, q waveforms showing tc2q and td2q]
Clock Skew
[Figure: clock source (ex. PLL) driving CLK1 and CLK2; waveforms vs. time show the latency from the source and the skew between arrival times t1 and t2]
Most “high-profile” of clock network metrics
Maximum difference in arrival times of clock signal to any 2 latches/FF’s fed by the network
Skew = max | t1 – t2 |
Fig. from Zarkesh-Ha
Sylvester / Shepard, 2001
Clock Skew Causes
Designed (unavoidable) variations – mismatch in buffer load sizes, interconnect lengths
Process variation – process spread across die yielding different Leff, Tox, etc. values
Temperature gradients – changes MOSFET performance across die
IR voltage drop in power supply – changes MOSFET performance across die
Note: Delay from clock generator to fan-out points (clock latency) is not important by itself
BUT: increased latency leads to larger skew for same amount of relative variation
Clock Jitter
Clock network delay uncertainty
From one clock cycle to the next, the period is not exactly the same each time
Maximum difference in phase of clock between any two periods is jitter
Must be considered in max path (setup) timing; typically O(50ps) for high-end designs
Clock Jitter Causes
PLL oscillation frequency
Various noise sources affecting clock generation and distribution
E.g., power supply noise dynamically alters drive strength of intermediate buffer stages
Jitter reduced by minimizing IR and L*(di/dt) noise
Courtesy Cypress Semi
Clocking Methodology (Edge-Triggered)
[Figure: flip-flops separated by combinational logic, clocked with period tper]
Max(tpd) < tper – tsu – tc2q – tskew
Otherwise the delay is too long for data to be captured
Min(tpd) > th – tc2q + tskew
Otherwise the delay is too short and data can race through, skipping a state
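The two edge-triggered constraints can be turned into a small slack check. This is a sketch that applies the inequalities above literally; the function name and all numbers (in ps) are hypothetical.

```python
def edge_triggered_slacks(tpd_max, tpd_min, tper, tsu, th, tc2q, tskew):
    """Return (setup_slack, hold_slack); a negative value is a violation.

    Setup: Max(tpd) < tper - tsu - tc2q - tskew
    Hold:  Min(tpd) > th - tc2q + tskew
    """
    setup_slack = (tper - tsu - tc2q - tskew) - tpd_max
    hold_slack = tpd_min - (th - tc2q + tskew)
    return setup_slack, hold_slack

# Hypothetical values in ps.
ok = edge_triggered_slacks(tpd_max=800, tpd_min=20, tper=1000,
                           tsu=50, th=30, tc2q=80, tskew=40)

# A larger hold requirement creates a hold violation ...
bad = edge_triggered_slacks(tpd_max=800, tpd_min=20, tper=1000,
                            tsu=50, th=100, tc2q=80, tskew=40)

# ... and doubling the clock period does not change the hold slack.
slower = edge_triggered_slacks(tpd_max=800, tpd_min=20, tper=2000,
                               tsu=50, th=100, tc2q=80, tskew=40)
print(ok, bad, slower)
```

The hold slack does not depend on tper, which is why a hold violation cannot be fixed by slowing the clock.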
Example of tpdmax Violation
Suppose there is skew between the registers in a dataflow (regA after regB)
“i” gets its input values from regA at transition in Ck’
CL output “o” arrives after Ck transition due to skew
To correct this problem, can increase cycle time
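[Figure: regA (clocked by Ck’) feeds combinational logic with delay tpdmax into regB (clocked by Ck); waveforms of Ck’, Ck, i, and o show that with skew tskew the output “o” arrives too late]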
Example of tpdmin Violation: Race Through
Suppose clock skew causes regA to be clocked before regB
“i” passes through the CL with little delay (tpdmin)
“o” arrives before the rising Ck’ causes the data to be latched
Cannot be fixed by changing frequency: have rock instead of chip
[Figure: regA (clocked by Ck) feeds combinational logic with delay tpdmin into regB (clocked by Ck’); waveforms of Ck, Ck’, i, and o show that with skew tskew the output “o” arrives too early]
Time Borrowing (Cycle Stealing)
Cycle steal with flip-flops using delayed clocks
[Figure: flip-flops with combinational logic between them; intentional delay in the clock path = skew]
Time borrowing with latches
Give it back in later stages
[Figure: latches clocked by Ck alternating with combinational logic blocks]
tpd < tper + tw
Clock Distribution
General goal of clock distribution
Deliver clock to all memory elements with acceptable skew
Deliver clock edges with acceptable sharpness
Clocking network design is one of the greatest challenges in the design of a large chip
Clocks generally distributed via wiring trees (and meshes)
Low-resistance interconnect to minimize delay
Multiple drivers to distribute driver requirements
Use optimal sizing principles to design buffers
Clock lines can create significant crosstalk
Clock Distribution Problem Statement
Objective
Minimum skew (performance and hold time issues)
Minimum cell area and metal use
(sometimes) minimal latency
(sometimes) particular latency
(sometimes) intermixed gating for power reduction
(sometimes) hold to particular duty cycle: e.g., 50:50 ± 1 percent
Subject to:
Process variation from lot-to-lot
Process variation across the die
Radically different loading (ff density) around the die
Metal variation across the die
Power variation across the die (both static IR and dynamic)
Coupling (same and other layers)
Issues in Clock Distribution Network Design
Skew
Process, voltage, and temperature
Data dependence
Noise coupling
Load balancing
Power, CV^2 f – (no ½ or α)
Clock gating
Flexibility/Tunability
Compactness – fit into existing layout/design
Reliability
Electromigration
Clock Distribution Methods
RC-Tree
Less capacitance
More accuracy
Flexible wiring
Grids
Reliable
Less data dependency
Tunable (late in design)
Shown here for final stage drivers driving F/F loads
RC-Trees
[Figure: X-Tree, Binary-Tree, H-Tree]
Asymmetric trees can be and are used due to uneven sink distribution, hard macros in floorplan (hierarchical clock distribution), etc.; the basic goal is to have even RC delays
Grids
Gridded clock distribution common on earlier DEC Alpha microprocessors
Advantages:
Skew determined by grid density, not too sensitive to load position
Clock signals available everywhere
Tolerant to process variations
Usually yields extremely low skew values
Disadvantages:
Huge amount of wiring and power
To minimize such penalties, need to make grid pitch coarser, which loses the grid advantage
[Figure: pre-drivers feeding a global grid]
Trees
H-tree (Bakoglu)
One large central driver, recursive structure to match wirelengths
Halve wire width at branching points to reduce reflections
Disadvantages
Slew degradation along long RC paths
Unrealistically large central driver
- Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30° C)
Non-uniform load distribution
Inherently non-scalable (wire R growth)
Partial solution: intermediate buffers at branching points
courtesy of P. Zarkesh-Ha
Buffered Tree
[Figure: PLL feeding level-2 (L2) global buffers NGBuf, SGBuf, EGBuf, WGBuf, which feed level-3 (L3) buffers; each L3 buffer drives all clock loads within its region, with further branches going to other regions of the chip]
Buffered H-tree
Advantages
Ideally zero-skew
Can be low power (depending on skew requirements)
Low area (silicon and wiring)
CAD tool friendly (regular)
Disadvantages
Sensitive to process variations
- Devices: want same size buffers at each level of tree
- Wires: want similar segment lengths on each layer in each source-sink path!
Local clocking loads inherently non-uniform
Tree Balancing
Some techniques:
a) Introduce dummy loads
b) Snaking of wirelength to match delays
Con: Routing area often more valuable than silicon
Examples From Processor Chips
Serpentines: Intel x86 [Young ISSCC97]
Grids: DEC [Alphas]
H-Tree, Asymmetric RC-Tree (IBM)
Examples From Processor Chips
DEC-Alpha 21064 clock spines
DEC-Alpha 21164 RC delays for Global Distribution (Spine + Grid)
DEC-Alpha 21064 RC delays
DEC-Alpha 21164 RC local delays
ReShape Clocks Example (High-End ASIC)
Balanced, shielded H-tree for pre-clock distribution
Mesh for block level distribution
Pre-clock 2 Level H-tree
output mesh
All routes 5-6u M6/5, shielded with 1u grounds
~10 buffers per node
E.g., ganged BUFx20’s
Output mesh must hit every sub-block
Block Level Mesh (.18u)
Max 600u stride
1u m5 ribs every 20 - 30 u (4 to 6 rows)
Shielded input and output m6 shorting straps
Clumps of 1-6 clock buffers, surrounded by capacitor pads
Pre-clock connects to input shorting straps
Problems with Meshes
Burn more power at low frequencies
Blocks more routing resources (solution, integrated power distribution with ribs can provide shielding for ‘free’)
Difficult for ‘spare’ clock domains that will not tolerate regioning
Post placement (and routing) tuning required
No ‘beneficial skew’ possible
Problems with Meshes (#2)
Clock gating only easy at root
Fighting tools to do analysis:
Clumped buffers a problem in Static Timing Analysis tools
Large shorted meshes a problem for STA tools
What does Elmore delay calculation look like for a non-tree?
Need full extractions and spice-like simulation (e.g. Avant! Star-Sim) to determine skew
Benefits of Meshes (#3)
Deterministic since shielded all the way down to rib distribution
No ECO placement required: all buffers preplaced before block placement
Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew
ECO placements of FFs later do not require rebalance of tree
“Idealized” clocking environment for concurrent RTL design and timing convergence dance
Alternative Clock Network Strategy
Globally – Tree
Power requirements reduced relative to global grid
Smaller routing requirements, frees up global tracks
Trees balanced easily at global level
Keeps global skew low (with minimal process variation)
Vertex Locations in a Bounded-Skew Tree
[Figure: topology with source s0, internal nodes v, a, b, and sinks s1, s2, s3, s4, alongside plots of feasible node locations as the skew bound varies (0, 2, 4, 6)]
Given a skew bound, where can internal nodes of the given topology (e.g., a, b, v) be placed?
Deferred-Merge Embedding (DME) Algorithm
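[Figure: topology (source s0; internal nodes v, a, b; sinks s1, s2, s3, s4) and the sink placement with merging regions mr(a), mr(b), mr(v) for skew bound B = 4]
Bottom-Up: build tree of merging regions corresponding to given topology
Special case: skew = 0 gives merging segments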
Top-Down Embedding Phase of DME
[Figure: same topology and sink placement, with embedding points for a, b, v chosen inside their merging regions, B = 4]
Top-Down: choose embedding points within merging regions
Skew Reduction Using Package
• Most clock network latency occurs at global level (largest distances spanned)
• Latency ∝ Skew
• With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming
Skew Reduction Using Package
[Figure: system clock routed in the package substrate and delivered to the µP/ASIC through solder bumps]
⇒ Incorporate global clock distribution into the package
⇒ Flip-chip packaging allows for high density, low parasitic access from substrate to IC
• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring
• Global skew reduced
• Lower capacitance, so lower power
• Opens up global routing tracks
• Results not yet conclusive
Useful Skew (= cycle-stealing)
[Figure: timing slacks (hold, setup) for a fast path and a slow path between three FFs, under zero skew and under useful skew]
Zero skew
• Global skew constraint
• All skew is bad
Useful skew
• Local skew constraints
• Shift slack to critical paths
W. Dai, UC Santa Cruz
Skew = Local Constraint
Timing is correct as long as the signal arrives in the permissible skew range
[Figure: two FFs connected by logic with longest path D and shortest path d; the skew axis is divided into race condition, safe (permissible range), and cycle time violation]
Permissible range: -d + thold < Skew < Tperiod – D – tsetup
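The permissible range -d + thold < Skew < Tperiod – D – tsetup is easy to evaluate per register pair. The sketch below uses hypothetical delays and helper names; the strict inequalities mean the endpoints themselves count as violations.

```python
def permissible_skew_range(d_short, d_long, t_period, t_setup, t_hold):
    """Open interval (low, high) of safe skew for one FF-to-FF path pair."""
    return (-d_short + t_hold, t_period - d_long - t_setup)

def classify_skew(skew, low, high):
    """Map a skew value onto the three regions of the skew axis."""
    if skew <= low:
        return "race condition"
    if skew >= high:
        return "cycle time violation"
    return "safe"

# Hypothetical numbers in ns: d = 2, D = 6, T = 10, tsetup = thold = 1.
low, high = permissible_skew_range(2, 6, 10, 1, 1)
for skew in (-2, 0, 4):
    print(skew, classify_skew(skew, low, high))
```

With these numbers the range is (-1, 3): a skew of 0 is safe, -2 causes a race, and 4 violates the cycle time.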
Skew Scheduling for Design Robustness
Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge
Can solve a linear program to maximize robustness = determine prescribed sink skews
[Figure: three FFs in a chain with path delays of 2 ns and 6 ns, T = 6 ns, annotated with the permissible skew ranges]
“0 0 0”: at verge of violation
“2 0 2”: more safety margin
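The two schedules in this example can be compared numerically. The sketch below does not solve the linear program; it only evaluates the margins of a given schedule. It assumes that each path's shortest delay equals its longest delay, that setup and hold times are zero, and that skew is measured as launch clock arrival minus capture clock arrival. None of these assumptions is stated in the original figure.

```python
def schedule_margins(clock_arrivals, path_delays, t_period):
    """Per-path (hold_margin, setup_margin) for a chain of flip-flops.

    clock_arrivals[i] is the clock arrival time at FF i;
    path_delays[i] is the delay from FF i to FF i+1 (min = max assumed).
    """
    margins = []
    for i, delay in enumerate(path_delays):
        skew = clock_arrivals[i] - clock_arrivals[i + 1]  # launch - capture
        low, high = -delay, t_period - delay              # permissible range
        margins.append((skew - low, high - skew))
    return margins

zero_skew = schedule_margins([0, 0, 0], [2, 6], 6)
useful_skew = schedule_margins([2, 0, 2], [2, 6], 6)
print(zero_skew, min(min(m) for m in zero_skew))
print(useful_skew, min(min(m) for m in useful_skew))
```

Under these assumptions the zero-skew schedule has a worst-case margin of 0 (the 6 ns path exactly fills the 6 ns period), while "2 0 2" leaves at least 2 ns on every constraint.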
Potential Advantages of Useful Skew
Reduce peak current consumption by distributing the FF switch point in the range of permissible skew
[Figure: CLK and supply current waveforms for 0-skew and U-skew clocking]
Affords extra margin to increase clock frequency or reduce sizing (= power)
Conventional Zero-Skew Flow
Placement
Synthesis
Extraction & Delay Calculation
Static Timing Analysis
0-Skew Clock Synthesis
Clock Routing
Signal Routing
Useful-Skew Flow
Existing Placement
Extraction & Delay Calculation
Static Timing Analysis
U-Skew Clock Synthesis
Clock Routing
Signal Routing
Permissible range generation
Initial skew scheduling
Clock tree topology synthesis
Clock net routing
Clock timing verification
Clock Power
Power consumption in clocks due to:
Clock drivers
Long interconnections
Large clock loads – all clocked elements (latches, FF’s) are driven
Different components dominate
Depending on type of clock network used
Ex. Grid – huge pre-drivers & wire cap. drown out load cap.
Clock Power Is LARGE
Not only is the clock capacitance large, it switches every cycle!
P = α C Vdd^2 f
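A quick calculation shows why the activity factor matters so much for the clock. The capacitance, supply voltage, and frequency below are hypothetical round numbers, not measurements from any chip discussed here.

```python
def dynamic_power(alpha, c_farads, vdd_volts, f_hz):
    """Dynamic switching power P = alpha * C * Vdd^2 * f, in watts."""
    return alpha * c_farads * vdd_volts ** 2 * f_hz

# Hypothetical: 1 nF of switched capacitance, 1.5 V supply, 1 GHz.
clock_power = dynamic_power(1.0, 1e-9, 1.5, 1e9)   # clock switches every cycle
logic_power = dynamic_power(0.1, 1e-9, 1.5, 1e9)   # assumed 10% activity
print(clock_power, logic_power)
```

For the same capacitance, a net that toggles every cycle burns ten times the power of one with 10% activity (about 2.25 W versus 0.225 W with these numbers).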
Low-Power Clocking
Gated clocks
Prevent switching in areas of chip not being used
Easier in static designs
Edge-triggered flops in ARM rather than transparent latches in Alpha
Reduced load on clock for each latch/flop
Eliminated spurious power-consuming transitions during latch flow-through (transparency)
Clock Area
Clock networks consume silicon area (clock drivers, PLL, etc.) and routing area
Routing area is most vital
Top-level metals are used to reduce RC delays
These levels are precious resources (unscaled)
Power routing, clock routing, key global signals
Reducing area also reduces wiring capacitance and power
Typical #’s: Intel Itanium – 4% of M4/5 used in clock routing
Clock Slew Rates
To maintain signal integrity and latch performance, minimum slew rates are required
Too slow – clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew(ps)], more short-circuit power for large clock drivers
Too fast – burns too much power, overdesigned network, enhanced ground bounce
Rule-of-thumb: Trise and Tfall of clock are each between 10-20% of clock period (10% - aggressive target)
1 GHz clock; Trise = Tfall = 100-200ps
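The rule of thumb and the example setup-time relation quoted on this slide can be combined in a short sketch. The function names are hypothetical, and the setup formula is only the illustrative one given above, not a general model.

```python
def edge_time_targets_ps(f_hz, low_frac=0.10, high_frac=0.20):
    """Target range for Trise/Tfall as a fraction of the clock period."""
    period_ps = 1e12 / f_hz
    return low_frac * period_ps, high_frac * period_ps

def setup_time_ps(t_slew_ps):
    """Slide's example relation: Tsetup = 200 + 0.33 * Tslew (ps)."""
    return 200 + 0.33 * t_slew_ps

fast_edge, slow_edge = edge_time_targets_ps(1e9)
print(fast_edge, slow_edge)
print(setup_time_ps(fast_edge), setup_time_ps(slow_edge))
```

For a 1 GHz clock the targets are about 100 ps and 200 ps, and the slower edge costs roughly 33 ps of extra setup time under the quoted relation.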
Example: Alpha 21264
Grid + H-tree approach
Power = 32% of total
Wire usage = 3% of metals 3 & 4
4 major clock quadrants, each with a large driver connected to local grid structures
Power vs. Skew
Fundamental design decision
Meeting skew requirements is easy with unlimited power budget
Wide wires reduce RC product but increase total C
Driver upsizing reduces latency (which reduces skew as well) but increases buffer cap
SOC context: plastic package power limit is 2-3 W
Clock Distribution Trends
Timing
Clock period dropping fast, skew must follow
Slew rates must also scale with cycle time
Jitter – PLL’s get better with CMOS scaling but other sources of noise increase
- Power supply noise more important
- Switching-dependent temperature gradients
Materials
Cu reduces RC slew degradation, potential skew
Low-k decreases power, improves latency, skew, slews
Power
Complexity, dynamic logic, pipelining: more clock sinks
Larger chips: bigger clock networks
Gate Timing Characterization
[Figure: gate with inputs A, B, output F, and load capacitance CL]
“Extract” exact transistor characteristics from layout
Transistor width, length, junction area and perimeter
Local wire length and inter-wire distance
Compute all transistor and wire capacitances
Cell Timing Characterization
Delay tables generated using a detailed transistor-level circuit simulator SPICE (differential-equations solver)
For a number of different input slews and load capacitances simulate the circuit of the cell
Propagation time (50% Vdd at input to 50% at output)
Output slew (10% Vdd at output to 90% Vdd at output)
[Figure: input and output waveforms vs. time, marking tpd and tslew relative to Vdd]
Non-linear effects reflected in tables
[Figure: gate model taking input slew and output capacitance, and producing the delay at the gate (including intrinsic delay) and the output slew of the resulting waveform]
DG = f (CL, Sin) and Sout = f (CL, Sin)
Non-linear
Interpolate between table entries
Interpolation error is usually below 10% of SPICE
Timing Library Example (.lib)
library(my_lib) {
delay_model : table_lookup;
library_features (report_delay_calculation);
time_unit : "1ns";
voltage_unit : "1V";
current_unit : "1mA";
leakage_power_unit : 1uW;
capacitive_load_unit(1,pf);
pulling_resistance_unit : "1kohm";
default_fanout_load : 1.0;
default_inout_pin_cap : 1.0;
default_input_pin_cap : 1.0;
default_output_pin_cap : 0.0;
default_cell_leakage_power : 0.0;
nom_voltage : 1.08;
nom_temperature : 125.0;
nom_process : 1.0;
slew_derate_from_library : 0.500000;
operating_conditions("slow_125_1.08") {
process : 1.0 ;
temperature : 125 ;
voltage : 1.08 ;
tree_type : "worst_case_tree" ;
}
default_operating_conditions : slow_125_1.08 ;
lu_table_template("load") {
variable_1 : input_net_transition;
variable_2 : total_output_net_capacitance;
index_1( "1, 2, 3, 4" );
index_2( "1, 2, 3, 4" );
}
fall_transition(load) {
index_1( "0.0326, 0.1614, 0.4192, 1.5017" );
index_2( "0.0010, 0.4249, 2.1491, 8.1881" );
values ( \
"0.011974, 0.071668, 0.317800, 1.189560", \
"0.033212, 0.101182, 0.328540, 1.189562", \
"0.059282, 0.155052, 0.389900, 1.202360", \
"0.162830, 0.317380, 0.628160, 1.441260" );
}
rise_transition(load) {
index_1( "0.0375, 0.1650, 0.5455, 1.5078" );
index_2( "0.0010, 0.4449, 1.7753, 5.1139" );
values ( \
"0.016690, 0.115702, 0.418200, 1.189060", \
"0.038256, 0.139336, 0.422960, 1.189081", \
"0.076248, 0.213280, 0.491820, 1.203700", \
"0.170992, 0.353120, 0.694740, 1.384760" );
}
}
cell("INV") {
pin(A) {
max_transition : 1.500000;
direction : input;
rise_capacitance : 0.0739000;
fall_capacitance : 0.0703340;
capacitance : 0.07278646;
}
pin(Z) {
direction : output;
function : "!A";
max_transition : 1.500000;
max_capacitance : 5.1139;
timing() {
related_pin : "A";
cell_rise(load) {
index_1( "0.0375, 0.2329, 0.6904, 1.5008" );
index_2( "0.0010, 0.9788, 2.2820, 5.1139" );
values ( \
"0.013211, 0.071051, 0.297500, 0.642340", \
"0.028657, 0.110849, 0.362620, 0.707070", \
"0.053289, 0.165930, 0.496550, 0.860400", \
"0.091041, 0.234440, 0.661840, 1.091700" );
}
cell_fall(load) {
index_1( "0.0326, 0.1614, 0.5432, 1.5017" );
index_2( "0.0010, 0.4249, 3.6538, 8.1881" );
values ( \
"0.009472, 0.072284, 0.317370, 0.688390", \
"0.009992, 0.095862, 0.360530, 0.731610", \
"0.009994, 0.126620, 0.477260, 0.867670", \
"0.009996, 0.144150, 0.644140, 1.127700" );
}
} /* end timing() */
} /* end pin(Z) */
} /* end cell("INV") */
Delay Calculation
Load = 1.0pf; input transitions = 0.1ns and 0.12ns
Cell Fall
Cap\Tr  0.05  0.2   0.5
0.01    0.02  0.16  0.30
0.5     0.04  0.32  0.60
2.0     0.08  0.64  1.20
Cell Rise
Cap\Tr  0.05  0.2   0.5
0.01    0.03  0.18  0.33
0.5     0.06  0.36  0.66
2.0     0.09  0.72  1.32
Fall Transition
Cap\Tr  0.05  0.2   0.5
0.01    0.01  0.09  0.15
0.5     0.03  0.27  0.45
2.0     0.06  0.54  0.90
Fall delay = 0.178ns
Rise delay = 0.261ns
Fall transition = 0.147ns
Rise transition = …
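The table lookup in this example is a bilinear interpolation between the four surrounding entries. The sketch below reproduces the three results, assuming that the 0.1 ns input transition applies to the Cell Fall and Fall Transition tables and the 0.12 ns transition to the Cell Rise table; that pairing is inferred from the numbers rather than stated on the slide. The helper name is hypothetical.

```python
from bisect import bisect_right

def interp_table(caps, trs, values, cap, tr):
    """Bilinear interpolation; values[i][j] is the entry for caps[i], trs[j]."""
    def lower_index(axis, x):
        # Lower grid index of the cell containing x, clamped to the table.
        i = bisect_right(axis, x) - 1
        return min(max(i, 0), len(axis) - 2)

    i, j = lower_index(caps, cap), lower_index(trs, tr)
    u = (cap - caps[i]) / (caps[i + 1] - caps[i])
    v = (tr - trs[j]) / (trs[j + 1] - trs[j])
    low = values[i][j] + v * (values[i][j + 1] - values[i][j])
    high = values[i + 1][j] + v * (values[i + 1][j + 1] - values[i + 1][j])
    return low + u * (high - low)

CAPS = [0.01, 0.5, 2.0]   # pf
TRS = [0.05, 0.2, 0.5]    # ns
CELL_FALL = [[0.02, 0.16, 0.30], [0.04, 0.32, 0.60], [0.08, 0.64, 1.20]]
CELL_RISE = [[0.03, 0.18, 0.33], [0.06, 0.36, 0.66], [0.09, 0.72, 1.32]]
FALL_TRANSITION = [[0.01, 0.09, 0.15], [0.03, 0.27, 0.45], [0.06, 0.54, 0.90]]

fall_delay = interp_table(CAPS, TRS, CELL_FALL, cap=1.0, tr=0.1)
rise_delay = interp_table(CAPS, TRS, CELL_RISE, cap=1.0, tr=0.12)
fall_transition = interp_table(CAPS, TRS, FALL_TRANSITION, cap=1.0, tr=0.1)
print(round(fall_delay, 3), round(rise_delay, 3), round(fall_transition, 3))
```

Rounded to three decimals, this gives 0.178, 0.261, and 0.147 ns, matching the values in the example.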
PVT Derating: Example + Min/Typ/Max Triples
Proc_var (0.5:1.0:1.3)
Voltage (5.5:5.0:4.5)
Temperature (0:20:50)
KP = 0.80 : 1.00 : 1.30
KV = 0.93 : 1.00 : 1.08
KT = 0.80 : 1.07 : 1.35
KPVT = 0.60 : 1.07 : 1.90
Cell delay = 0.261ns
Derated delay = 0.157 : 0.279 : 0.496 {min : typical : max}
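The derating arithmetic can be checked directly: the combined factor is the product of the process, voltage, and temperature factors, applied to the nominal cell delay. The sketch assumes that KPVT is rounded to two decimals before it is applied, which is what reproduces the numbers above.

```python
KP = (0.80, 1.00, 1.30)   # process factor: min, typical, max
KV = (0.93, 1.00, 1.08)   # voltage factor
KT = (0.80, 1.07, 1.35)   # temperature factor

# Combined derating factor per corner, rounded to two decimals (assumption).
kpvt = tuple(round(p * v * t, 2) for p, v, t in zip(KP, KV, KT))

cell_delay_ns = 0.261
derated = tuple(round(cell_delay_ns * k, 3) for k in kpvt)
print(kpvt, derated)
```

This yields KPVT = (0.6, 1.07, 1.9) and derated delays of (0.157, 0.279, 0.496) ns for the min, typical, and max corners.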