lec17-clk-pwr - University of California, Berkeleyeecs151/sp19/files/lec17-clk-pwr.pdfClk Q D Q Clk...

48
EE141 EECS151/251A Spring 2019 Digital Design and Integrated Circuits Instructor: John Wawrzynek Lecture 17: Clock and Power Distribution

Transcript of lec17-clk-pwr - University of California, Berkeleyeecs151/sp19/files/lec17-clk-pwr.pdfClk Q D Q Clk...

  • EE141

    EECS151/251A
Spring2019 
DigitalDesignandIntegratedCircuitsInstructor:JohnWawrzynek

    Lecture 17:
Clock and Power Distribution

  • EE141

    Outline❑ Clock non-idealities ❑ Clock Distribution ❑ Power Distribution

    2

  • EE141

    Synchronous Timing - Review

  • EE141

    Synchronous Timing

    4

  • EE141

    Register Timing Parameters

    D

    Clk

    Q

    D

    Q

    Clk

    tclk-q

    thold

    T

    tsetup

    Output delays can be different for rising and falling data transitions

    5

  • EE141

    Timing Constraints

    tclk-q.maxtclk-q.min

    tlogic,maxtlogic,min

    tsetup, thold

    6

  • EE141

    Cycle time (max): TClk > tclk-q,max + tlogic,max + tsetup

    Race margin (min): thold < tclk-q,min + tlogic,min

    Timing Constraints

    tclk-q,maxtclk-q,min

    tlogic,maxtlogic,min

    tsetup, thold

    7

  • EE141

    Clock Nonidealities

  • EE141

    Clock Nonidealities❑ Clock skew: tSK ▪ Time difference between the sink (receiving) and

    source (launching) edge ▪ Spatial variation in temporally equivalent clock

    edges; deterministic + random ❑ Clock jitter ▪ Temporal variations in consecutive edges of the

    clock signal; modulation + random noise ▪ Cycle-to-cycle (short-term) tJS ▪ Long term tJL

    ❑ Variation of the pulse width ▪ Important for level sensitive clocking

    9

  • EE141

    Clock UncertaintiesSources of clock uncertainty

    10

  • EE141

    Clock Skew and Jitter

    ❑ Both skew and jitter affect the effective cycle time and the race margin

    Clk

    Clk’

    tSK

    tJS

    11

  • EE141

    Positive Skew

    Launching edge arrives before the receiving edge 12

  • EE141

    Negative Skew

    Receiving edge arrives before the launching edge

    CLK1

    CLK2

    TCLK

    δ

    TCLK - δ

    2

    1

    4

    3

    13

  • EE141

    Timing Constraints

    14

    Minimum cycle time: Tclk + δ = tclk-q,max + tsetup + tlogic,max

    tclk-q,maxtclk-q,min

    tlogic,maxtlogic,mintsetup, thold

  • EE141

    Timing Constraints

    15

    tclk-q,maxtclk-q,min

    tlogic,maxtlogic,mintsetup, thold

    Hold time constraint: t(clk-q,min) + t(logic,min) > thold + δ

  • EE141

    Longest Logic Path in Edge-Triggered Systems

    16

    Clk2tsetup

    Latest point 
of launching

    Earliest arrival
of next cycle

    tJS2 - δ

    Clk1

    TCLK

    tclk-Q,maxtlogic,max

    tJS1

  • EE141

    Clock Constraints in 
Edge-Triggered Systems

    17

    If launching edge is late and receiving edge is early, the data will not be too late if:

    Minimum cycle time is determined by the maximum delays through the logic

    tclk-q,max + tlogic,max + tsetup < TCLK – tJS,1 – tJS,2 + δ

    tclk-q,max + tlogic,max + tsetup - δ + 2tJS < TCLKSkew can be either positive or negative Jitter tJS usually expressed as peak-to-peak or n x RMS value

  • EE141

    Datapath with Feedback

    18

  • EE141

    Clock Distribution

  • EE141

    Clock Distribution

    !20

    ❑ Single clock generally used to synchronize all logic on the same chip (or region of chip) ▪ Need to distribute clock over the entire region ▪ While maintaining low skew/jitter ▪ And without burning too much power

  • EE141

    Clock Distribution

    21

    ❑ What’s wrong with just routing wires to every point that needs a clock?

  • EE141

    H-Tree

    22

    Equal wire length/number of buffers to get to every location

  • EE141

    More realistic H-tree

    23

    [Restle98]

  • EE141

    Clock Grid

    Driver

    Driver

    Driver Driver

    GCLK GCLK

    GCLK

    GCLK

    •No RC matching •But huge power

    24

  • EE141

    Example: DEC Alpha 21164 (1995)

    ❑ 2 phase single wire clock, distributed globally

    ❑ 2 distributed driver channels ▪ Reduced RC delay/skew ▪ Improved thermal distribution ▪ 3.75nF clock load, 20W power ▪ 58 cm final driver width

    ❑ Local inverters for latching ❑ Conditional clocks in caches to

    reduce power

    trise = 350ps tskew = 150ps

    tcycle= 3.3ns

    Clock waveform

    Location of clock driver on die

    pre-driver

    final drivers

    25

  • EE141

    26

    Clock Drivers

  • EE141

    Clock Skew in Alpha Processor

    27

  • EE141

    ❑ 2 Phase, with multiple conditional buffered clocks ▪ 2.8 nF clock load ▪ 40 cm final driver width

    ❑ Local clocks can be gated “off” to save power ❑ Reduced load/skew ❑ Reduced thermal issues ❑ Multiple clocks complicate race checking

    trise = 350ps tskew = 50ps

    tcycle= 1.67ns

    EV6 (Alpha 21264) Clocking 600 MHz – 0.35 micron CMOS

    Global clock waveform

    28

  • EE141 29

    EV6 Clock Results

    GCLK Skew (at Vdd/2 Crossings)

    ps 5

    10 15 20 25 30 35 40 45 50

    ps 300 305 310 315 320 325 330 335 340 345

    GCLK Rise Times (20% to 80% Extrapolated to 0% to 100%)

  • UC Regents Fall 2013 © UCBCS 250 L3: Timing

    the total wire delay is similar to the total buffer delay. Apatented tuning algorithm [1 6 ] was required to tune themore than 2 0 0 0 tunable transmission lines in these sectortrees to achieve low skew, visualized as the flatness of thegrid in the 3 D visualizations. Figure 8 visualizes four ofthe 6 4 sector trees containing about 1 2 5 tuned wiresdriving 1 /1 6 th of the clock grid. While symmetric H-treeswere desired, silicon and wiring blockages often forcedmore complex tree structures, as shown. Figure 8 alsoshows how the longer wires are split into multiple-fingeredtransmission lines interspersed with Vdd and ground shields(not shown) for better inductance control [1 7 , 1 8 ]. Thisstrategy of tunable trees driving a single grid results in lowskew among any of the 1 5 2 0 0 clock pins on the chip,regardless of proximity.

    From the global clock grid, a hierarchy of short clockroutes completed the connection from the grid down tothe individual local clock buffer inputs in the macros.These clock routing segments included wires at the macrolevel from the macro clock pins to the input of the localclock buffer, wires at the unit level from the macro clockpins to the unit clock pins, and wires at the chip levelfrom the unit clock pins to the clock grid.

    Design methodology and resultsThis clock-distribution design method allows a highlyproductive combination of top-down and bottom-up designperspectives, proceeding in parallel and meeting at thesingle clock grid, which is designed very early. The treesdriving the grid are designed top-down, with the maximumwire widths contracted for them. Once the contract for thegrid had been determined, designers were insulated fromchanges to the grid, allowing necessary adjustments to thegrid to be made for minimizing clock skew even at a verylate stage in the design process. The macro, unit, and chipclock wiring proceeded bottom-up, with point tools ateach hierarchical level (e.g., macro, unit, core, and chip)using contracted wiring to form each segment of the totalclock wiring. At the macro level, short clock routesconnected the macro clock pins to the local clock buffers.These wires were kept very short, and duplication ofexisting higher-level clock routes was avoided by allowingthe use of multiple clock pins. At the unit level, clockrouting was handled by a special tool, which connected themacro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed

    Figure 6

    Schematic diagram of global clock generation and distribution.

    PLL

    Bypass

    Referenceclock in

    Referenceclock out

    Clock distributionClock out

    Figure 7

    3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.

    Del

    ayGrid

    Tunedsectortrees

    Sectorbuffers

    Buffer level 2

    Buffer level 1

    y

    x

    Figure 8

    Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.

    Del

    ay Multiple-fingeredtransmissionline

    yx

    J. D. WARNOCK ET AL. IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2

    32

    Clock Tree Delays,

    IBM “Power” CPU

    Del

    ay

    30

  • UC Regents Fall 2013 © UCBCS 250 L3: Timing

    the total wire delay is similar to the total buffer delay. Apatented tuning algorithm [1 6 ] was required to tune themore than 2 0 0 0 tunable transmission lines in these sectortrees to achieve low skew, visualized as the flatness of thegrid in the 3 D visualizations. Figure 8 visualizes four ofthe 6 4 sector trees containing about 1 2 5 tuned wiresdriving 1 /1 6 th of the clock grid. While symmetric H-treeswere desired, silicon and wiring blockages often forcedmore complex tree structures, as shown. Figure 8 alsoshows how the longer wires are split into multiple-fingeredtransmission lines interspersed with Vdd and ground shields(not shown) for better inductance control [1 7 , 1 8 ]. Thisstrategy of tunable trees driving a single grid results in lowskew among any of the 1 5 2 0 0 clock pins on the chip,regardless of proximity.

    From the global clock grid, a hierarchy of short clockroutes completed the connection from the grid down tothe individual local clock buffer inputs in the macros.These clock routing segments included wires at the macrolevel from the macro clock pins to the input of the localclock buffer, wires at the unit level from the macro clockpins to the unit clock pins, and wires at the chip levelfrom the unit clock pins to the clock grid.

    Design methodology and resultsThis clock-distribution design method allows a highlyproductive combination of top-down and bottom-up designperspectives, proceeding in parallel and meeting at thesingle clock grid, which is designed very early. The treesdriving the grid are designed top-down, with the maximumwire widths contracted for them. Once the contract for thegrid had been determined, designers were insulated fromchanges to the grid, allowing necessary adjustments to thegrid to be made for minimizing clock skew even at a verylate stage in the design process. The macro, unit, and chipclock wiring proceeded bottom-up, with point tools ateach hierarchical level (e.g., macro, unit, core, and chip)using contracted wiring to form each segment of the totalclock wiring. At the macro level, short clock routesconnected the macro clock pins to the local clock buffers.These wires were kept very short, and duplication ofexisting higher-level clock routes was avoided by allowingthe use of multiple clock pins. At the unit level, clockrouting was handled by a special tool, which connected themacro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed

    Figure 6

    Schematic diagram of global clock generation and distribution.

    PLL

    Bypass

    Referenceclock in

    Referenceclock out

    Clock distributionClock out

    Figure 7

    3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.

    Del

    ay

    Grid

    Tunedsectortrees

    Sectorbuffers

    Buffer level 2

    Buffer level 1

    y

    x

    Figure 8

    Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.

    Del

    ay Multiple-fingeredtransmissionline

    yx

    J. D. WARNOCK ET AL. IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2

    32

    Clock Tree Delays, IBM Power

    clock grid was completed with a tool run at the chip level,connecting unit-level pins to the grid. At this point, theclock tuning and the bottom-up clock routing process stillhave a great deal of flexibility to respond rapidly to evenlate changes. Repeated practice routing and tuning wereperformed by a small, focused global clock team as theclock pins and buffer placements evolved to guaranteefeasibility and speed the design process.

    Measurements of jitter and skew can be carried outusing the I/Os on the chip. In addition, approximately 1 0 0top-metal probe pads were included for direct probingof the global clock grid and buffers. Results on actualPOWER4 microprocessor chips show long-distanceskews ranging from 2 0 ps to 4 0 ps (cf. Figure 9). This isimproved from early test-chip hardware, which showedas much as 7 0 ps skew from across-chip channel-lengthvariations [1 9 ]. Detailed waveforms at the input andoutput of each global clock buffer were also measuredand compared with simulation to verify the specializedmodeling used to design the clock grid. Good agreementwas found. Thus, we have achieved a “correct-by-design”clock-distribution methodology. It is based on our designexperience and measurements from a series of increasinglyfast, complex server microprocessors. This method resultsin a high-quality global clock without having to usefeedback or adjustment circuitry to control skews.

    Circuit designThe cycle-time target for the processor was set early in theproject and played a fundamental role in defining thepipeline structure and shaping all aspects of the circuitdesign as implementation proceeded. Early on, criticaltiming paths through the processor were simulated indetail in order to verify the feasibility of the designpoint and to help structure the pipeline for maximumperformance. Based on this early work, the goal for therest of the circuit design was to match the performance setduring these early studies, with custom design techniquesfor most of the dataflow macros and logic synthesis formost of the control logic—an approach similar to thatused previously [2 0 ]. Special circuit-analysis and modelingtechniques were used throughout the design in order toallow full exploitation of all of the benefits of the IBMadvanced SOI technology.

    The sheer size of the chip, its complexity, and thenumber of transistors placed some important constraintson the design which could not be ignored in the push tomeet the aggressive cycle-time target on schedule. Theseconstraints led to the adoption of a primarily static-circuitdesign strategy, with dynamic circuits used only sparinglyin SRAMs and other critical regions of the processor core.Power dissipation was a significant concern, and it was akey factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology,

    including uncertainties associated with the modelingof the floating-body effect [2 1 –2 3 ] and its impact onnoise immunity [2 2 , 2 4 –2 7 ] and overall chip decouplingcapacitance requirements [2 6 ], was another factor behindthe choice of a primarily static design style. Finally, thesize and logical complexity of the chip posed risks tomeeting the schedule; choosing a simple, robust circuitstyle helped to minimize overall risk to the projectschedule with most efficient use of CAD tool and designresources. The size and complexity of the chip alsorequired rigorous testability guidelines, requiring almostall cycle boundary latches to be LSSD-compatible formaximum dc and ac test coverage.

    Another important circuit design constraint was thelimit placed on signal slew rates. A global slew rate limitequal to one third of the cycle time was set and enforcedfor all signals (local and global) across the whole chip.The goal was to ensure a robust design, minimizingthe effects of coupled noise on chip timing and alsominimizing the effects of wiring-process variability onoverall path delay. Nets with poor slew also were foundto be more sensitive to device process variations andmodeling uncertainties, even where long wires and RCdelays were not significant factors. The general philosophywas that chip cycle-time goals also had to include theslew-limit targets; it was understood from the beginningthat the real hardware would function at the desiredcycle time only if the slew-limit targets were also met.

    The following sections describe how these designconstraints were met without sacrificing cycle time. Thelatch design is described first, including a description ofthe local clocking scheme and clock controls. Then thecircuit design styles are discussed, including a description

    Figure 9

    Global clock wav eforms showing 20 ps of measured skew.

    1.5

    1.0

    0.5

    0.0

    0 500 1000 1500 2000 2500

    20 ps skew

    Vol

    ts (

    V)

    Time (ps)

    IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2 J. D. WARNOCK ET AL.

    33

    31

  • EE141

    Clock circuits live in center column.

    32 global clock wires go down the red column.

    Also, 4 regional clocks (restricted functionality).

    Any 10 may be sent to a clock region.

    FPGA

  • EE141

    Clocks have dedicated wires (low skew)

    From: Xilinx Spartan 3 data sheet. Virtex is similar.

  • EE141

    Low-skew Clocking in FPGAs

    Figures from Xilinx App Notes

  • EE141

    Die photo: Xilinx Virtex

    Gold wires are the clock tree.

  • EE141

    Power Distribution

  • EE141

    Power Supply Impedance❑ No voltage source is ideal - ||Z|| > 0 ❑ Two principal elements increase Z: ▪ Resistance of supply lines (IR drop) ▪ Inductance of supply lines (L⋅di/dt drop)

    37

  • EE141

    Technology (µm)0.10.20.30.40.50.6

    10-3

    10-2

    10-1

    100

    Scaling and Supply Impedance❑ Typical target for supply impedance is to get 5-10%

    voltage variation of nominal supply (e.g., 100mV for 1V supply)

    ❑ In traditional scaling Vdd drops while power stays constant

    ❑ This forced drastic drop in supply impedance ▪ Vdd ↓, Idd ↑ à |Zrequired| ↓↓

    ❑ Today’s chips: ▪ |Zrequired| ≈ 1 mΩ! ▪ Vdd = 1V, P=100W => Idd=100A ▪ For ΔVdd,max = 100mV, 


    Zdd,max = 100mV/100A = 1mΩ!

    Impedance Requirements of High-Performance Processors

    Req

    uire

    d Im

    peda

    nce

    (Ω)

    38

  • EE141

    IR Drop Example❑ Intel Pentium 4: ~103W at ~1.275V ▪ Idd = 81Amps

    ❑ For 10% IR drop, total distribution resistance must be less than 1.6mΩ

    ❑ On-chip wire R ≈ 20mΩ/sq. (thick metal) ▪ Can’t meet R requirement even with multiple,

    complete layers dedicated to power ▪ Main motivation for flip-chip packaging

    39

  • EE141

    Power Delivery

    ❑ Achieving such low impedance requires a lot of resources: ▪ ~70% of package pins just for power ▪ Top 2-3 (thick) metal layers 40

    Chip Metal Layers

  • EE141

    Not Just Impedance - Electromigration

    ❑ On-chip wires: current limited to ~1mA/µm for 5-7 year lifetime

    41

  • EE141

    On-Chip Power Distribution❑ Power network usually follows pre-defined

    template (often referred to as “power grid”)

    VDD Gnd VDD Gnd

    VDD

    Gnd

    VDD

    Gnd 42

  • EE141

    3 Metal Layer Approach (EV4)3rd “coarse and thick” metal layer added to the

    technology for EV4 design Power supplied from two sides of the die via 3rd metal layer

    2nd metal layer used to form power grid 90% of 3rd metal layer used for power/clock routing

    Metal 3

    Metal 2Metal 1

    Courtesy Compaq 43

  • EE141

    4 Metal Layers Approach (EV5)4th “coarse and thick” metal layer added to the

    technology for EV5 design Power supplied from four sides of the die Grid strapping done all in coarse metal

    90% of 3rd and 4th metals used for power/clock routing

    Metal 3

    Metal 2Metal 1

    Metal 4

    Courtesy Compaq 44

  • EE141

    2 reference plane metal layers added to the technology for EV6 design Solid planes dedicated to Vdd/Vss

    Lowers on-chip inductance

    6 Metal Layer Approach – EV6

    Metal 4

    Metal 2Metal 1

    RP2/Vdd

    RP1/Vss

    Metal 3

    Courtesy Compaq 45

  • EE141

    SUPPLY

    Boardwiring

    Bondingwire

    Decouplingcapacitor

    CHIPCd

    1

    2

    Decoupling Capacitors

    ❑ On the board (right under the supply pins) ❑ On the chip (under the supply straps, near large buffers)

    Decoupling capacitors are added:

    46

  • EE141

    Decoupling Capacitors

    ❑ Under the die

    47

  • EE141

    Pin Inductance Example❑ Processor transient current is 100A in

    20ps from 1V supply ❑ C4 bump inductance is 25pH ❑ How many C4 bumps do we need to get

    supply noise spike of less than 10%?

    ❑ With wirebond inductance of 1nH (1nH/mm) how many wirebonds are needed?

    48

    V = L ⋅dIdt