Timing Issues in Digital ASIC Design

S.Sivanantham

VLSI and ES Division

VIT University

This Class + Logistics

Timing
- Storage elements, Clock distribution, Clock tree synthesis

Reading
- Whitepapers/datasheets on STA; papers on clock tree synthesis

Schedule
- MT in one week (lab/recitation fair game); Lab #2 due Mon 1/27

HW #9: As a block’s layout is compacted down to fit into a smaller and smaller region, the timing of the block at first improves, but then worsens. Explain.

HW #10: Hold time violations mean that the chip doesn’t work at any frequency. Propose several distinct methods for fixing hold time violations (guided by post-routing static timing analysis), and explain the pros and cons of each.

HW #11: Compare DEC’s first Alpha and first StrongArm processors (look up transistor counts, supply voltage, frequency, etc.). (a) How much of StrongArm’s power efficiency can be attributed to process, supply, and frequency scaling? (b) What factors might contribute to the remainder?

Slide courtesy of S. P. Levitan, U. Pittsburgh

Review

Static timing analysis (Lecture 4)
- Pin-based timing graph
- Directed acyclic graph (DAG) of timing arcs
- Longest path in a DAG is found in time linear in the number of arcs (edges)
- Slack = required arrival time – actual arrival time (long path analysis)
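The review points above can be sketched in a few lines of Python: arrival times are propagated through the timing DAG in topological order (one visit per arc), and slack is the required arrival time minus the actual arrival time. The graph, pin names, delays, and RAT below are hypothetical and are not taken from the lecture figures.

```python
from collections import defaultdict, deque

def arrival_times(arcs, primary_inputs):
    """Propagate latest (long-path) arrival times through a timing DAG.

    arcs: list of (src_pin, dst_pin, delay) timing arcs.
    primary_inputs: dict pin -> arrival time at that input.
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    pins = set(primary_inputs)
    for src, dst, delay in arcs:
        fanout[src].append((dst, delay))
        indegree[dst] += 1
        pins.update((src, dst))

    at = dict(primary_inputs)
    ready = deque(p for p in pins if indegree[p] == 0)
    while ready:  # Kahn-style topological traversal, each arc visited once
        pin = ready.popleft()
        for dst, delay in fanout[pin]:
            # late mode: keep the longest (maximum) arrival time
            at[dst] = max(at.get(dst, float("-inf")), at[pin] + delay)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return at

# Hypothetical graph: two inputs feed node n1, which feeds the primary output.
arcs = [("a", "n1", 2), ("b", "n1", 1), ("n1", "po", 3), ("c", "po", 5)]
at = arrival_times(arcs, {"a": 0, "b": 0, "c": 0})
rat_po = 4                      # assumed required arrival time at the output
slack_po = rat_po - at["po"]    # slack = RAT - AAT
print(at["po"], slack_po)
```

A negative result at the output means the longest path is later than required, which is the case the timing correction transforms later in the lecture try to repair.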

Logic synthesis (Lecture 5)

Slide courtesy of S. P. Levitan, U. Pittsburgh

Static Analysis vs. Dynamic Analysis

Why static analysis when dynamic simulation is more accurate?

Drawbacks of simulation
- Requires input vectors (stimuli for circuit)
- Long runtimes
- Exponential explosion with number of possible design input states

Example: calculate worst-case rising delay from a to z

        c=0           c=1
b=0     a-z delay1    a-z delay2
b=1     a-z delay3    a-z delay4

[Figure: gate with inputs a, b, c and output z]

STA Terminology

(Actual) arrival time (AAT, or AT) = time at which a pin switches state
- Usually the 50% point on the voltage curve, i.e., AT = t50

Slew time = time over which a signal switches
- Usually the difference between the 10% and 90% points on the voltage curve, i.e., tslew = t90 – t10

Required arrival time (RAT) = time at which a signal must arrive in order to avoid a chip fail

Slack = RAT – AAT
- Positive slack good (= margin), negative slack bad

[Figure: voltage vs. time waveform with the 10%, 50%, and 90% Vdd points marked]

Example: What is slack at PO?

[Figure: timing graph with gate delays d between 1 and 5. Arrival times propagate from at=0 at the primary inputs to at=11 at the primary output (PO), where rat=10]

Slack = rat – at = 10 – 11 = –1

Example: Incremental Timing Analysis

[Figure: the same timing graph after a change that sets some gate delays to d=1. Updated arrival times (at=3, at=7) propagate forward, giving at=10 at the primary output, where rat=10]

Slack = 0

Amount of work is bounded by the sizes of the fanin and fanout cones of logic

Early-Mode Analysis

Definitions change as follows
- RAT = lower bound on arrival time
- Propagate shortest possible instead of longest possible delays
- Slack = Arrival – Required

Example: negative slack because ATc is too small (early)

[Figure: circuit with inputs a, b, c, nodes x and y, and gate delays of 1]

$AT_a = 0$, $AT_b = 1$, $AT_c = 0$, $AT_x = 1$, $AT_y = 1$, $RAT_x = 2$

$SL_a = 0 - 0 = 0$
$SL_b = 1 - 0 = 1$
$SL_c = 0 - 1 = -1$
$SL_x = 1 - 2 = -1$
$SL_y = 1 - 1 = 0$

Enhancements of STA

Incremental timing analysis

Nanometer-scale process effects – variation (probabilistic timing analysis)

Interference – crosstalk

Multiple inputs switching

Conservatism of delay propagation

(Old: HW #8: Suppose you change the size of one (combinational) gate in your design, thus invalidating the previous timing analysis. How much work must be done to regain a correct timing analysis?)

Courtesy K. Keutzer et al. UCB

Timing Correction

Driven by STA
- “Incremental performance analysis backplane”

Fix electrical violations
- Resize cells
- Buffer nets
- Copy (clone) cells

Fix timing problems
- Local transforms (bag of tricks)
- Path-based transforms

DAC-2002, Physical Chip Implementation

Local Synthesis Transforms

Resize cells

Buffer or clone to reduce load on critical nets

Decompose large cells

Swap connections on commutative pins or among equivalent nets

Move critical signals forward

Pad early paths

Area recovery

DAC-2002, Physical Chip Implementation

Transform Example

[Figure: double inverter removal; path delay reduced from 4 to 2]

DAC-2002, Physical Chip Implementation

Resizing

[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C. A gate with inputs a, b drives loads d = 0.2, e = 0.2, f = 0.3. With size A the delay is 0.035; with size C it is 0.026]

DAC-2002, Physical Chip Implementation

Cloning

[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C. A gate with inputs a, b drives loads d, e, f, g, h of 0.2 each; after cloning, two copies of the gate (A and B) split the fanout]

DAC-2002, Physical Chip Implementation

Buffering

[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C. A gate with inputs a, b drives loads d, e, f, g, h of 0.2 each, shown before and after a buffer (B) is inserted on the net]

DAC-2002, Physical Chip Implementation

Redesign Fan-in Tree

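[Figure: fan-in tree of unit-delay gates with input arrival times Arr(a)=4, Arr(b)=3, Arr(c)=1, Arr(d)=0. Before the redesign Arr(e)=6; after restructuring so that the late input a enters closer to the output, Arr(e)=5]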

DAC-2002, Physical Chip Implementation

Redesign Fan-out Tree

[Figure: fan-out tree annotated with stage delays. Before the redesign, longest path = 5; after, longest path = 4, with one buffer slowed from 1 to 2 due to its added load]

DAC-2002, Physical Chip Implementation

Decomposition

DAC-2002, Physical Chip Implementation

Swap Commutative Pins

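[Figure: gate with inputs a, b, c that have different arrival times and pin-to-output delays, shown before and after swapping the commutative pins]

Simple sorting on arrival times and delay works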
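As a small illustration of the sorting idea on this slide, the sketch below pairs the latest-arriving signals with the fastest interchangeable pins and reports the resulting output arrival time. The function name and the numbers are hypothetical; the slide's own figure values are not reproduced here.

```python
def best_pin_assignment(arrivals, pin_delays):
    """Pair the latest-arriving signals with the fastest commutative pins.

    arrivals: dict signal -> arrival time.
    pin_delays: pin-to-output delays of the interchangeable pins.
    Returns (assignment, output_arrival); assignment maps signal -> pin delay.
    """
    late_first = sorted(arrivals, key=arrivals.get, reverse=True)
    fast_first = sorted(pin_delays)
    assignment = dict(zip(late_first, fast_first))
    output_arrival = max(arrivals[s] + d for s, d in assignment.items())
    return assignment, output_arrival

# Hypothetical example: signal a is latest, one pin is slower than the others.
arrivals = {"a": 2, "b": 1, "c": 0}
pin_delays = [2, 1, 1]
assignment, out = best_pin_assignment(arrivals, pin_delays)
print(assignment, out)
```

With these numbers the early signal c takes the slow pin, so the output arrives at time 3; connecting a to the slow pin instead would give 4.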

DAC-2002, Physical Chip Implementation

Outline

Clocking

Storage elements

Clocking metrics and methodology

Clock distribution

Package and useful-skew degrees of freedom

Clock power issues

Gate timing models

Why Clocks?

Clocks provide the means to synchronize

By allowing events to happen at known timing boundaries, we can sequence these events

Greatly simplifies building of state machines

No need to worry about variable delay through combinational logic (CL)

All signals delayed until clock edge (clock imposes the worst case delay)

[Figure: dataflow (register, combinational logic, register) and FSM (combinational logic with a feedback register)]

Clock Cycle Time

Cycle time is determined by the delay through the CL
- Signal must arrive before the latching edge
- If too late, it waits until the next cycle
  - Synchronization and sequential order become incorrect

tcycle > tprop_delay + toverhead

Can change circuit architecture to obtain smaller tcycle

Pipelining

For dataflow:
- Instead of a long critical path, split the critical path into chunks
- Insert registers to store intermediate results
- This allows 2 waves of data to coexist within the CL

Can we extend this ad infinitum?
- Overhead eventually limits the pipelining
  - E.g., 1.5 to 2 gate delays for latch or FF
- Granularity limits as well
  - Minimum time quantum: delay of a gate

Unpipelined: tcycle > tpd + toverhead
Pipelined: tcycle > max(tpd1, tpd2) + toverhead

[Figure: logic block A+B with delay tpd between two registers, split into block A (tpd1) and block B (tpd2) with an intermediate register]
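The cycle-time bounds for pipelining can be tried out numerically. The sketch below assumes a hypothetical block of 24 gate delays, a register overhead of 2 gate delays (the upper end of the 1.5 to 2 range mentioned in the slide), and a perfectly even split of the logic, which real designs rarely achieve.

```python
def min_cycle_time(stage_delays, t_overhead):
    """Lower bound on tcycle: slowest stage delay plus register overhead."""
    return max(stage_delays) + t_overhead

# Hypothetical numbers, in units of gate delays.
t_pd, t_ov = 24.0, 2.0
for n_stages in (1, 2, 4, 8, 24):
    stages = [t_pd / n_stages] * n_stages   # assumes an ideal, even split
    print(n_stages, min_cycle_time(stages, t_ov))
```

The bound drops quickly for the first few stages and then flattens toward the overhead, which is the limit the slide points to; an uneven split such as `[10.0, 14.0]` is bounded by the slower stage.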

FO4 INV Delays Per Clock Period

[Figure: number of FO4 inverter delays per clock period (0 to 120) vs. year (1982 to 2004) for Intel processors: 386, 486, DX2, DX4, Pentium, Pentium MMX, Pentium Pro, Pentium II, Celeron, Pentium III, Pentium 4]

FO4 INV = inverter driving 4 identical inverters (no interconnect)

Half of frequency improvement has been from reduced logic stages, i.e., pipelining

Parallelism

For FSMs:
- Same functionality and performance can be achieved at half the clock rate
- However, the input and output signals must be doubled (to account for the outputs for each original cycle)
- Instead of doubling the delay, the optimized logic is often logarithmically related to the degree of parallelism

tcycle1 > tpd + tov
tcycle2 > N tpd + tov
tcycle3 > log(N tpd) + tov

[Figure: original FSM (M-bit register, CL with delay tpd), unrolled FSM (2*M-bit signals, cascaded CL blocks), and optimized version (Opt. CL)]

Storage Elements

Latches
- Level sensitive – transparent when H, hold when L

Flip-flops
- Edge-triggered – data is sampled at the clock edge

[Figure: latch schematic (d, ck, ckb, q) and waveforms of ck, d, q for a latch and for a flip-flop]

Latch and Flip-Flop Gates

[Figure: gate-level schematics of a rising edge flip-flop (D, clock, Q, QN) and an active high latch (in, enable, out)]

Latch and flip-flop schematics from TSMC 0.13um LV Artisan Sage-X Standard Cell Library.

Latch and Flip-Flop Behavior

[Figure: signal paths through the rising edge flip-flop and the active high latch when clock is high and when clock is low]

Rising edge flip-flop: tCQ, 4 inverter delays

Active high latch: tDQ, 2 inverter delays

Clock Skew and Jitter

[Figure: clock waveforms at flip-flops A and B showing (a) clock skew tsk,AB between the clock at A and the clock at B, (b) duty cycle jitter tduty (high phase Thigh – tduty), and (c) cycle-to-cycle edge jitter tj (period T – tj, with tj/2 on each edge)]

Flip-Flop Timing Characteristics

Rising edge flip-flop

[Figure: flip-flops A and B with a non-ideal clock. Setup time constraint: the period T must cover tCQ,max + tcomb,max + tsu + tsk + tj. Hold time constraint: tCQ,min + tcomb,min must cover th + tsk]

Latch Setup Time and Transparency

Active high latch

[Figure: latches A and B with a non-ideal clock. Data arriving while a latch is transparent passes through after tDQ. Setup time constraint timeline: tCQ, tcomb,max, tsu, tsk + tj, tduty]

No penalty to clock period for setup time constraint!

Setup Time

Important characteristics of storage elements
- Setup time, hold time, clock-to-q delay

Setup time, tsu
- Time before the clock edge that the data must arrive in order for the new data to be stored
- The setup time for a F/F occurs before the latching edge
- The setup time for a latch occurs before the transition from transparent to hold

[Figure: waveforms of d, ck, q showing tsetup]

Hold Time

A second important characteristic is the hold time, th
- Time after the clock edge that the data must remain stable in order for the data to be properly held
- Note that hold time (and setup time) can be negative

Why isn’t hold time just the negative of setup time?
- Storage elements typically have some data dependence
  - Capacitances, and devices may be faster for one data value versus another
- Specify the worst case for process technology and operating condition variations

[Figure: waveforms of d, ck, q showing thold]

Clocking Overhead

Inherent delay in any storage element

The delay is measured from
- Clock transition to output data transition, tc2q
- Input data transition to output data transition, td2q

Flip-flop is edge triggered
- The overhead is tc2q + tsu

Latch is level-sensitive
- The overhead is td2q

[Figure: waveforms of d, ck, q showing tc2q and td2q]

Clock Skew

[Figure: clock source (ex. PLL) driving CLK1 and CLK2; waveforms vs. time show the latency from the source to arrival times t1 and t2, and the skew between them]

Most “high-profile” of clock network metrics

Maximum difference in arrival times of clock signal to any 2 latches/FF’s fed by the network

Skew = max | t1 – t2 |

Fig. from Zarkesh-Ha

Sylvester / Shepard, 2001

Clock Skew Causes

Designed (unavoidable) variations – mismatch in buffer load sizes, interconnect lengths

Process variation – process spread across die yielding different Leff, Tox, etc. values

Temperature gradients – changes MOSFET performance across die

IR voltage drop in power supply – changes MOSFET performance across die

Note: Delay from clock generator to fan-out points (clock latency) is not important by itself

BUT: increased latency leads to larger skew for same amount of relative variation

Sylvester / Shepard, 2001

Clock Jitter

Clock network delay uncertainty
- From one clock cycle to the next, the period is not exactly the same each time
- Maximum difference in phase of clock between any two periods is jitter
- Must be considered in max path (setup) timing; typically O(50ps) for high-end designs

Sylvester / Shepard, 2001

Clock Jitter Causes

PLL oscillation frequency

Various noise sources affecting clock generation and distribution

- E.g., power supply noise dynamically alters drive strength of intermediate buffer stages
- Jitter reduced by minimizing IR and L*(di/dt) noise

Courtesy Cypress Semi

Sylvester / Shepard, 2001

Clocking Methodology (Edge-Triggered)

[Figure: flip-flops separated by combinational logic, clocked with period tper]

Max(tpd) < tper – tsu – tc2q – tskew
- Otherwise the delay is too long for data to be captured

Min(tpd) > th – tc2q + tskew
- Otherwise the delay is too short and data can race through, skipping a state
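The two edge-triggered constraints can be turned into a quick slack check. This is only a sketch of the inequalities as written on the slide; the function name and all numeric values (in ns) are hypothetical.

```python
def check_edge_triggered(tpd_max, tpd_min, t_per, t_su, t_h, t_c2q, t_skew):
    """Return (setup_slack, hold_slack); a negative slack is a violation.

    Setup: Max(tpd) < tper - tsu - tc2q - tskew
    Hold:  Min(tpd) > th - tc2q + tskew
    """
    setup_slack = (t_per - t_su - t_c2q - t_skew) - tpd_max
    hold_slack = tpd_min - (t_h - t_c2q + t_skew)
    return setup_slack, hold_slack

# Hypothetical numbers in ns.
params = dict(tpd_max=3.0, tpd_min=0.1, t_su=0.2, t_h=0.15, t_c2q=0.3, t_skew=0.3)
setup, hold = check_edge_triggered(t_per=4.0, **params)
print(round(setup, 3), round(hold, 3))

# Stretching the clock period helps setup but leaves the hold slack unchanged.
setup_slow, hold_slow = check_edge_triggered(t_per=8.0, **params)
print(round(setup_slow, 3), round(hold_slow, 3))
```

The second call shows why a hold violation cannot be fixed by lowering the frequency: tper does not appear in the hold inequality.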

Example of tpdmax Violation

Suppose there is skew between the registers in a dataflow (regA after regB)

“i” gets its input values from regA at transition in Ck’

CL output “o” arrives after Ck transition due to skew

To correct this problem, can increase cycle time

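[Figure: regA (clocked by Ck’) drives combinational logic input i; output o is captured by regB (clocked by Ck). With skew tskew between Ck’ and Ck, o settles tpdmax after the Ck’ edge, which is too late for the Ck edge]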

Example of tpdmin Violation: Race Through

Suppose clock skew causes regA to be clocked before regB

“i” passes through the CL with little delay (tpdmin)

“o” arrives before the rising edge of Ck’, which causes the new data to be latched

Cannot be fixed by changing frequency: have a rock instead of a chip

[Figure: regA (clocked by Ck) drives combinational logic input i; output o is captured by regB (clocked by Ck’). With skew tskew, o changes tpdmin after the Ck edge, which is too early for the Ck’ edge]

Time Borrowing (Cycle Stealing)

Cycle steal with flip-flops using delayed clocks
- Intentional delay = skew

Time borrowing with latches
- Give it back in later stages
- tpd < tper + tw

[Figure: flip-flop pipeline with an intentionally delayed clock, and latch pipeline with combinational logic between the latches]

Clock Distribution

General goal of clock distribution
- Deliver clock to all memory elements with acceptable skew
- Deliver clock edges with acceptable sharpness

Clocking network design is one of the greatest challenges in the design of a large chip

Clocks generally distributed via wiring trees (and meshes)

Low-resistance interconnect to minimize delay

Multiple drivers to distribute driver requirements
- Use optimal sizing principles to design buffers
- Clock lines can create significant crosstalk

Clock Distribution Problem Statement

Objective
- Minimum skew (performance and hold time issues)
- Minimum cell area and metal use
- (sometimes) minimal latency
- (sometimes) particular latency
- (sometimes) intermixed gating for power reduction
- (sometimes) hold to particular duty cycle: e.g. 50:50 ± 1 percent

Subject to:
- Process variation from lot-to-lot
- Process variation across the die
- Radically different loading (ff density) around the die
- Metal variation across the die
- Power variation across the die (both static IR and dynamic)
- Coupling (same and other layers)

Issues in Clock Distribution Network Design

Skew
- Process, voltage, and temperature
- Data dependence
- Noise coupling
- Load balancing

Power, CV²f (no ½ or α)
- Clock gating

Flexibility/Tunability
- Compactness – fit into existing layout/design

Reliability
- Electromigration

Skew: Clock Delay Varies With Position

Clock Distribution Methods

RC-Tree
- Less capacitance
- More accuracy
- Flexible wiring

Grids
- Reliable
- Less data dependency
- Tunable (late in design)

Shown here for final stage drivers driving F/F loads

RC-Trees

[Figure: X-tree, binary tree, and H-tree topologies]

Asymmetric trees can be and are used due to uneven sink distribution, hard macros in the floorplan (hierarchical clock distribution), etc.; the basic goal is to have even RC delays

Grids

Gridded clock distribution common on earlier DEC Alpha microprocessors

Advantages:
- Skew determined by grid density, not too sensitive to load position
- Clock signals available everywhere
- Tolerant to process variations
- Usually yields extremely low skew values

Disadvantages:
- Huge amount of wiring and power
- To minimize such penalties, need to make grid pitch coarser → lose the grid advantage

[Figure: pre-drivers feeding a global grid]

Sylvester / Shepard, 2001

Trees

H-tree (Bakoglu)
- One large central driver, recursive structure to match wirelengths
- Halve wire width at branching points to reduce reflections

Disadvantages
- Slew degradation along long RC paths
- Unrealistically large central driver
  - Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30° C)
- Non-uniform load distribution
- Inherently non-scalable (wire R growth)

Partial solution: intermediate buffers at branching points

courtesy of P. Zarkesh-Ha

Sylvester / Shepard, 2001

Buffered Tree

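[Figure: PLL feeding a buffered tree (levels L2, L3) through global buffers WGBuf, EGBuf, NGBuf, SGBuf; each buffer drives all clock loads within its region, with further branches to other regions of the chip]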

Sylvester / Shepard, 2001

Buffered H-tree

Advantages
- Ideally zero-skew
- Can be low power (depending on skew requirements)
- Low area (silicon and wiring)
- CAD tool friendly (regular)

Disadvantages
- Sensitive to process variations
  - Devices: want same size buffers at each level of tree
  - Wires: want similar segment lengths on each layer in each source-sink path!
- Local clocking loads inherently non-uniform

Sylvester / Shepard, 2001

Tree Balancing

Some techniques:

a) Introduce dummy loads

b) Snaking of wirelength to match delays

Con: Routing area often more valuable than silicon

Sylvester / Shepard, 2001

Examples From Processor Chips

Serpentines: Intel x86 [Young ISSCC97]

Grids: DEC [Alphas]

H-Tree, Asymmetric RC-Tree (IBM)

Examples From Processor Chips

DEC-Alpha 21064 clock spines

DEC-Alpha 21164 RC delays for Global Distribution (Spine + Grid)

DEC-Alpha 21064 RC delays

DEC-Alpha 21164 RC local delays

ReShape Clocks Example (High-End ASIC)

Balanced, shielded H-tree for pre-clock distribution

Mesh for block level distribution

Pre-clock 2 Level H-tree

output mesh

All routes 5-6u M6/5, shielded with 1u grounds

~10 buffers per node
- E.g., ganged BUFx20’s

Output mesh must hit every sub-block

Block Level Mesh (.18u)

Max 600u stride

1u m5 ribs every 20 - 30 u (4 to 6 rows)

Shielded input and output m6 shorting straps

Clumps of 1-6 clock buffers, surrounded by capacitor pads

Pre-clock connects to input shorting straps

Problems with Meshes

Burn more power at low frequencies

Blocks more routing resources (solution, integrated power distribution with ribs can provide shielding for ‘free’)

Difficult for ‘spare’ clock domains that will not tolerate regioning

Post placement (and routing) tuning required

No ‘beneficial skew’ possible

Problems with Meshes (#2)

Clock gating only easy at root

Fighting tools to do analysis:
- Clumped buffers a problem in Static Timing Analysis tools
- Large shorted meshes a problem for STA tools
- What does Elmore delay calculation look like for a non-tree?

Need full extractions and spice-like simulation (e.g. Avant! Star-Sim) to determine skew

Benefits of Meshes (#3)

Deterministic since shielded all the way down to rib distribution

No ECO placement required: all buffers preplaced before block placement

Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew

ECO placements of FFs later do not require rebalance of tree

“Idealized” clocking environment for concurrent RTL design and timing convergence dance

Mesh Example

~ 100k flops

6 blocks

Clock Skew Thermal Map

Pre-tuning

Clock Skew Thermal Map #2

50ps block/ 100ps global skew, post tuning

Alternative Clock Network Strategy

Globally – Tree

Power requirements reduced relative to global grid

Smaller routing requirements, frees up global tracks

Trees balanced easily at global level

Keeps global skew low (with minimal process variation)

Sylvester / Shepard, 2001

Vertex Locations in a Bounded-Skew Tree

[Figure: topology with source s0, internal nodes v, a, b, and sinks s1 to s4, next to plots of feasible vertex locations (axes 0 to 6) for different skew bounds]

Given a skew bound, where can internal nodes of the given topology (e.g., a, b, v) be placed?

Deferred-Merge Embedding (DME) Algorithm

Bottom-Up: build tree of merging regions corresponding to given topology

Special case: skew = 0 → merging segments

[Figure: topology (source s0, internal nodes v, a, b, sinks s1 to s4) and the merging regions mr(a), mr(b), mr(v) for skew bound B = 4]

Top-Down Embedding Phase of DME

Top-Down: choose embedding points within merging regions

[Figure: same topology (source s0, sinks s1 to s4) with embedding points for a, b, v chosen inside their merging regions, B = 4]

Zero-Skew Example (555 sinks, 40 obstacles)

Skew Reduction Using Package

• Most clock network latency occurs at global level (largest distances spanned)

• Latency ∝ Skew

• With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming

Sylvester / Shepard, 2001

Skew Reduction Using Package

[Figure: system clock routed through the package substrate and delivered to the µP/ASIC through solder bumps]

⇒ Incorporate global clock distribution into the package

⇒ Flip-chip packaging allows for high density, low parasitic access from substrate to IC

• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring

• Global skew reduced

• Lower capacitance → lower power

• Opens up global routing tracks

• Results not yet conclusive

Sylvester / Shepard, 2001

Useful Skew (= cycle-stealing)

Zero skew
• Global skew constraint
• All skew is bad

Useful skew
• Local skew constraints
• Shift slack to critical paths

[Figure: three flip-flops with a fast path followed by a slow path, and the hold/setup timing slacks under zero skew and under useful skew]

W. Dai, UC Santa Cruz

Skew = Local Constraint

Timing is correct as long as the signal arrives in the permissible skew range

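D: longest path; d: shortest path (between the two FFs)

$-d + t_{hold} < \text{Skew} < T_{period} - D - t_{setup}$

Permissible range: skew inside these bounds is safe. Below the lower bound: race condition. Above the upper bound: cycle time violation.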
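The permissible range can be computed directly from the inequality on this slide. The sketch below uses hypothetical path delays and flip-flop parameters (in ns); the midpoint is printed because a clock arrival near the middle of the range leaves margin on both sides.

```python
def permissible_skew_range(d_short, d_long, t_period, t_setup, t_hold):
    """Bounds from: -d + thold < skew < Tperiod - D - tsetup."""
    lower = -d_short + t_hold
    upper = t_period - d_long - t_setup
    return lower, upper

def classify(skew, lower, upper):
    """Label a skew value relative to the permissible range."""
    if skew <= lower:
        return "race condition"
    if skew >= upper:
        return "cycle time violation"
    return "safe"

# Hypothetical values in ns.
lower, upper = permissible_skew_range(d_short=1.0, d_long=5.0,
                                      t_period=6.0, t_setup=0.2, t_hold=0.1)
midpoint = (lower + upper) / 2
print(round(lower, 3), round(upper, 3), round(midpoint, 3))
for skew in (-1.5, 0.0, 1.0):
    print(skew, classify(skew, lower, upper))
```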

W. Dai, UC Santa Cruz

Skew Scheduling for Design Robustness

Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge

Can solve a linear program to maximize robustness = determine prescribed sink skews

[Figure: three FFs connected by paths of 2 ns and 6 ns, with T = 6 ns. Skew schedule “0 0 0” is at the verge of violation; schedule “2 0 2” gives more safety margin]

W. Dai, UC Santa Cruz

Potential Advantages of Useful Skew

Reduce peak current consumption by distributing the FF switch point in the range of permissible skew

[Figure: current drawn on CLK edges with 0-skew vs. U-skew]

Affords extra margin to increase clock frequency or reduce sizing (= power)

W. Dai, UC Santa Cruz

Conventional Zero-Skew Flow

Placement
Synthesis
Extraction & Delay Calculation
Static Timing Analysis
0-Skew Clock Synthesis
Clock Routing
Signal Routing

Useful-Skew Flow

Existing Placement
Extraction & Delay Calculation
Static Timing Analysis
U-Skew Clock Synthesis
Clock Routing
Signal Routing
Permissible range generation
Initial skew scheduling
Clock tree topology synthesis
Clock net routing
Clock timing verification

W. Dai, UC Santa Cruz

Clock Power

Power consumption in clocks due to:
- Clock drivers
- Long interconnections
- Large clock loads – all clocked elements (latches, FF’s) are driven

Different components dominate
- Depending on type of clock network used
- Ex. Grid – huge pre-drivers & wire cap. drown out load cap.

Sylvester / Shepard, 2001

Clock Power Is LARGE

Not only is the clock capacitance large, it switches every cycle!

P = α C Vdd² f
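A quick numeric sketch of the power formula, assuming the clock net has activity factor α = 1 because it switches every cycle. The capacitance, supply, and frequency are hypothetical round numbers, not figures for any particular chip.

```python
def dynamic_power(alpha, c_farads, vdd_volts, f_hz):
    """P = alpha * C * Vdd^2 * f, in watts."""
    return alpha * c_farads * vdd_volts ** 2 * f_hz

# Hypothetical: 1 nF of switched capacitance, 1.5 V supply, 1 GHz clock.
p_clock = dynamic_power(alpha=1.0, c_farads=1e-9, vdd_volts=1.5, f_hz=1e9)
# Same capacitance in ordinary logic with an assumed 10% activity factor.
p_logic = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=1.5, f_hz=1e9)
print(round(p_clock, 3), round(p_logic, 3))
```

For equal capacitance the clock net dissipates 1/α times more than logic, which is why clock power takes such a large share of the total.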

Sylvester / Shepard, 2001

Low-Power Clocking

Gated clocks
- Prevent switching in areas of chip not being used
- Easier in static designs

Edge-triggered flops in ARM rather than transparent latches in Alpha
- Reduced load on clock for each latch/flop
- Eliminated spurious power-consuming transitions during latch flow-through (transparency)

Sylvester / Shepard, 2001

Clock Area

Clock networks consume silicon area (clock drivers, PLL, etc.) and routing area

Routing area is most vital

Top-level metals are used to reduce RC delays
- These levels are precious resources (unscaled)
- Power routing, clock routing, key global signals

Reducing area also reduces wiring capacitance and power

Typical #’s: Intel Itanium – 4% of M4/5 used in clock routing

Sylvester / Shepard, 2001

Clock Slew Rates

To maintain signal integrity and latch performance, minimum slew rates are required

Too slow – clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew(ps)], more short-circuit power for large clock drivers

Too fast – burns too much power, overdesigned network, enhanced ground bounce

Rule-of-thumb: Trise and Tfall of clock are each between 10-20% of clock period (10% - aggressive target)

1 GHz clock; Trise = Tfall = 100-200ps

Sylvester / Shepard, 2001

Example: Alpha 21264

Grid + H-tree approach

Power = 32% of total

Wire usage = 3% of metals 3 & 4

4 major clock quadrants, each with a large driver connected to local grid structures

Sylvester / Shepard, 2001

Alpha 21264 Skew Map

Ref: Compaq, ASP-DAC00

Sylvester / Shepard, 2001

Power vs. Skew

Fundamental design decision
- Meeting skew requirements is easy with unlimited power budget

Wide wires reduce RC product but increase total C

Driver upsizing reduces latency (→ reduces skew as well) but increases buffer cap

SOC context: plastic package power limit is 2-3 W

Sylvester / Shepard, 2001

Clock Distribution Trends

Timing
- Clock period dropping fast, skew must follow
- Slew rates must also scale with cycle time
- Jitter – PLL’s get better with CMOS scaling but other sources of noise increase
  - Power supply noise more important
  - Switching-dependent temperature gradients

Materials
- Cu reduces RC slew degradation, potential skew
- Low-k decreases power, improves latency, skew, slews

Power
- Complexity, dynamic logic, pipelining → more clock sinks
- Larger chips → bigger clock networks

Sylvester / Shepard, 2001

Gate Timing Characterization

[Figure: gate with inputs A and B driving output F with load capacitance CL]

“Extract” exact transistor characteristics from layout
- Transistor width, length, junction area and perimeter
- Local wire length and inter-wire distance

Compute all transistor and wire capacitances

Cell Timing Characterization

Delay tables generated using a detailed transistor-level circuit simulator SPICE (differential-equations solver)

For a number of different input slews and load capacitances simulate the circuit of the cell

- Propagation time (50% Vdd at input to 50% at output)
- Output slew (10% Vdd at output to 90% Vdd at output)

[Figure: input and output waveforms vs. time with tpd and tslew marked relative to Vdd]

Delay and Transition Measurement

[Figure: cell delay measured at the 50% point; transition measured between the 20% and 80% points]

Non-linear effects reflected in tables

[Figure: lookup tables indexed by input slew and output capacitance give the delay at the gate (including intrinsic delay) and the output slew of the resulting waveform]

DG = f (CL, Sin) and Sout = f (CL, Sin)
- Non-linear

Interpolate between table entries

Interpolation error is usually below 10% of SPICE

Timing Library Example (.lib)

library(my_lib) {
  delay_model : table_lookup;
  library_features (report_delay_calculation);
  time_unit : "1ns";
  voltage_unit : "1V";
  current_unit : "1mA";
  leakage_power_unit : "1uW";
  capacitive_load_unit(1,pf);
  pulling_resistance_unit : "1kohm";
  default_fanout_load : 1.0;
  default_inout_pin_cap : 1.0;
  default_input_pin_cap : 1.0;
  default_output_pin_cap : 0.0;
  default_cell_leakage_power : 0.0;
  nom_voltage : 1.08;
  nom_temperature : 125.0;
  nom_process : 1.0;
  slew_derate_from_library : 0.500000;

  /* slow corner: high temperature, low voltage */
  operating_conditions("slow_125_1.08") {
    process : 1.0 ;
    temperature : 125 ;
    voltage : 1.08 ;
    tree_type : "worst_case_tree" ;
  }
  default_operating_conditions : slow_125_1.08 ;

  /* 4x4 lookup table: input transition x output load */
  lu_table_template("load") {
    variable_1 : input_net_transition;
    variable_2 : total_output_net_capacitance;
    index_1( "1, 2, 3, 4" );
    index_2( "1, 2, 3, 4" );
  }

  cell("INV") {
    pin(A) {
      max_transition : 1.500000;
      direction : input;
      rise_capacitance : 0.0739000;
      fall_capacitance : 0.0703340;
      capacitance : 0.07278646;
    }
    pin(Z) {
      direction : output;
      function : "!A";
      max_transition : 1.500000;
      max_capacitance : 5.1139;
      timing() {
        related_pin : "A";
        /* propagation delay tables */
        cell_rise(load) {
          index_1( "0.0375, 0.2329, 0.6904, 1.5008" );
          index_2( "0.0010, 0.9788, 2.2820, 5.1139" );
          values ( \
            "0.013211, 0.071051, 0.297500, 0.642340", \
            "0.028657, 0.110849, 0.362620, 0.707070", \
            "0.053289, 0.165930, 0.496550, 0.860400", \
            "0.091041, 0.234440, 0.661840, 1.091700" );
        }
        cell_fall(load) {
          index_1( "0.0326, 0.1614, 0.5432, 1.5017" );
          index_2( "0.0010, 0.4249, 3.6538, 8.1881" );
          values ( \
            "0.009472, 0.072284, 0.317370, 0.688390", \
            "0.009992, 0.095862, 0.360530, 0.731610", \
            "0.009994, 0.126620, 0.477260, 0.867670", \
            "0.009996, 0.144150, 0.644140, 1.127700" );
        }
        /* output transition (slew) tables */
        fall_transition(load) {
          index_1( "0.0326, 0.1614, 0.4192, 1.5017" );
          index_2( "0.0010, 0.4249, 2.1491, 8.1881" );
          values ( \
            "0.011974, 0.071668, 0.317800, 1.189560", \
            "0.033212, 0.101182, 0.328540, 1.189562", \
            "0.059282, 0.155052, 0.389900, 1.202360", \
            "0.162830, 0.317380, 0.628160, 1.441260" );
        }
        rise_transition(load) {
          index_1( "0.0375, 0.1650, 0.5455, 1.5078" );
          index_2( "0.0010, 0.4449, 1.7753, 5.1139" );
          values ( \
            "0.016690, 0.115702, 0.418200, 1.189060", \
            "0.038256, 0.139336, 0.422960, 1.189081", \
            "0.076248, 0.213280, 0.491820, 1.203700", \
            "0.170992, 0.353120, 0.694740, 1.384760" );
        }
      }
    }
  }
}

Delay Calculation

Load capacitance = 1.0pf; input transition times = 0.1ns and 0.12ns

Cell Fall
Cap\Tr   0.05   0.2    0.5
0.01     0.02   0.16   0.30
0.5      0.04   0.32   0.60
2.0      0.08   0.64   1.20

Cell Rise
Cap\Tr   0.05   0.2    0.5
0.01     0.03   0.18   0.33
0.5      0.06   0.36   0.66
2.0      0.09   0.72   1.32

Fall Transition
Cap\Tr   0.05   0.2    0.5
0.01     0.01   0.09   0.15
0.5      0.03   0.27   0.45
2.0      0.06   0.54   0.90

Fall delay = 0.178ns
Rise delay = 0.261ns
Fall transition = 0.147ns
Rise transition = …
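The delay values on this slide can be reproduced by bilinear interpolation between table entries. The sketch below is one way to do that lookup; the helper name `interp_table` is hypothetical. Pairing the 0.1 ns input transition with the fall tables and the 0.12 ns transition with the rise table is an assumption, chosen because it matches the numbers printed on the slide.

```python
from bisect import bisect_right

def interp_table(caps, trs, values, cap, tr):
    """Bilinear interpolation; rows are indexed by caps, columns by trs."""
    def bracket(axis, x):
        # Index i with axis[i] <= x <= axis[i + 1], clamped to the table edge.
        i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        frac = (x - axis[i]) / (axis[i + 1] - axis[i])
        return i, frac

    r, fr = bracket(caps, cap)
    c, fc = bracket(trs, tr)
    # Interpolate along the transition axis on the two bracketing rows,
    # then along the capacitance axis between those two results.
    low = values[r][c] + fc * (values[r][c + 1] - values[r][c])
    high = values[r + 1][c] + fc * (values[r + 1][c + 1] - values[r + 1][c])
    return low + fr * (high - low)

CAPS = [0.01, 0.5, 2.0]    # pf
TRS = [0.05, 0.2, 0.5]     # ns
CELL_FALL = [[0.02, 0.16, 0.30], [0.04, 0.32, 0.60], [0.08, 0.64, 1.20]]
CELL_RISE = [[0.03, 0.18, 0.33], [0.06, 0.36, 0.66], [0.09, 0.72, 1.32]]
FALL_TRANSITION = [[0.01, 0.09, 0.15], [0.03, 0.27, 0.45], [0.06, 0.54, 0.90]]

fall_delay = interp_table(CAPS, TRS, CELL_FALL, cap=1.0, tr=0.10)
rise_delay = interp_table(CAPS, TRS, CELL_RISE, cap=1.0, tr=0.12)
fall_tran = interp_table(CAPS, TRS, FALL_TRANSITION, cap=1.0, tr=0.10)
print(round(fall_delay, 3), round(rise_delay, 3), round(fall_tran, 3))
# prints: 0.178 0.261 0.147
```

Points outside the table are extrapolated linearly from the nearest cell in this sketch; production delay calculators may handle out-of-range lookups differently.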

PVT (Process, Voltage, Temperature) Derating

Actual cell delay = Original delay x KPVT

PVT Derating: Example + Min/Typ/Max Triples

Proc_var (0.5:1.0:1.3)
Voltage (5.5:5.0:4.5)
Temperature (0:20:50)

KP = 0.80 : 1.00 : 1.30
KV = 0.93 : 1.00 : 1.08
KT = 0.80 : 1.07 : 1.35

KPVT = 0.60 : 1.07 : 1.90

Cell delay = 0.261ns
Derated delay = 0.157 : 0.279 : 0.496 {min : typical : max}
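The derating arithmetic can be checked in a few lines. The sketch assumes KPVT is the product KP x KV x KT, rounded to two decimals before it is applied to the cell delay; that rounding step is an assumption, chosen because it reproduces the slide's triples.

```python
KP = (0.80, 1.00, 1.30)   # process factor, min : typ : max
KV = (0.93, 1.00, 1.08)   # voltage factor
KT = (0.80, 1.07, 1.35)   # temperature factor

def derate(delay_ns, kp, kv, kt):
    """Return (K_PVT triple, derated delay triple)."""
    k_pvt = [round(p * v * t, 2) for p, v, t in zip(kp, kv, kt)]
    return k_pvt, [round(delay_ns * k, 3) for k in k_pvt]

k_pvt, derated = derate(0.261, KP, KV, KT)
print(k_pvt)     # min : typ : max derating factors
print(derated)   # min : typ : max derated delays in ns
```

Run as written, this gives factors of 0.60, 1.07, 1.90 and delays of 0.157, 0.279, 0.496 ns, so the slow corner is roughly three times slower than the fast corner.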

Conservatism of Gate Delay Modeling

True gate delay depends on input arrival time patterns

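STA will assume that only 1 input is switching
- Will use worst slope among several inputs

[Figure: waveforms vs. time comparing tpd from input to output F when inputs A and B switch together and when only A switches, for a gate with inputs A, B, output F, and load CL]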