7a.3.09


  • 8/7/2019 7a.3.09

    1/41

    1

    Power-efficient scalable multi-core and high-speed IO clocking architecture

    Nasser Kurd

    Praveen Mosalikanti

    Intel Corporation

    Session 7A

    CMOSET Sep 25, 09


    2

    Introduction

    Clocking has significant impact on power and performance

    Large percentage of power

    Deep state exit latencies & voltage/frequency transitions

    Clock skew margins

    IO timing: QPI, PCIe, DDR

    Adaptive techniques to reduce power and improve margins

    Talk covers: clock circuit innovations enabling the power-efficient, scalable and modular Intel Core i7 and i5 (Nehalem) family


    3

    Outline

    High level Nehalem overview

    Clock Generation

    Clock Distribution

    Adaptive Frequency System

    Intel QuickPath Technology Clocking

    Conclusion


    4

    The First Nehalem Processor

    A Modular Design for Flexibility

    [Figure: die block diagram showing four cores, queue, shared L3 cache, memory controller, QPI0, QPI1 and two Misc IO blocks]

    QPI: Intel QuickPath Interconnect, BW up to ~25.6GB/s

    Memory BW up to 32GB/s

    Nehalem: Next Generation Intel Microarchitecture
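    The two bandwidth figures on this slide can be reproduced with simple arithmetic. The sketch below is a rough check, not taken from the slides: it assumes a QPI link moves 2 bytes of payload per transfer in each of two directions at 6.4GT/s, and that memory is three 8-byte-wide DDR3 channels at 1333MT/s. The function names and default widths are illustrative assumptions.

```python
def qpi_bandwidth_gb_s(rate_gt_s, payload_bytes=2, directions=2):
    """Aggregate link bandwidth in GB/s (payload width and direction count are assumed)."""
    return rate_gt_s * payload_bytes * directions


def ddr_bandwidth_gb_s(rate_mt_s, channels=3, bus_bytes=8):
    """Peak memory bandwidth in GB/s (channel count and bus width are assumed)."""
    return rate_mt_s * 1e6 * bus_bytes * channels / 1e9


qpi_bw = qpi_bandwidth_gb_s(6.4)    # highest QPI rate listed in the deck
ddr_bw = ddr_bandwidth_gb_s(1333)   # highest DDR rate listed in the deck

print(f"QPI: {qpi_bw:.1f} GB/s, memory: {ddr_bw:.1f} GB/s")
```

    Under these assumptions the results land on the "~25.6GB/s" and "up to 32GB/s" values quoted above.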


    5

    Clock Generation Architecture


    6

    Clock Generation Design Goals


    Modular & scalable

    Decoupled frequency and voltages

    Power efficient clocking architecture

    [Figure: die floorplan showing QPI0, QPI1, memory controller, four cores and LLC]


    7

    PLL Architecture

    Local PLL placement

    On-die LVR per PLL

    [Figure: PLL placement on the die. Labels: BCLK 133MHz, FPLL, UPLL, DPLL, LPLL, four CPLLs, two QPLLs]

    CPLL: Core PLL

    QPLL: QPI PLL

    FPLL: Filter PLL

    DPLL: DDR PLL

    UPLL: Un-core PLL

    Rates labeled in the figure: 4.8, 5.9, 6.4GT/s; 800, 1066, 1333MT/s; 667-multi GHz; 266, 533MHz
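    One way to read the rates on this slide is as multiples of the 133MHz BCLK reference. The sketch below checks that reading under two assumptions that are not stated in the deck: that "133MHz" means 133.33MHz (400/3), and that each listed rate is a rounded integer multiple of it. The helper name `nearest_multiple` is hypothetical.

```python
from fractions import Fraction

BCLK_MHZ = Fraction(400, 3)  # assumption: "133MHz" taken as 133.33MHz

# Rates listed on the slide, in MHz or MT/s (5.9GT/s entered as 5900).
RATES = [266, 533, 667, 800, 1066, 1333, 4800, 5900, 6400]


def nearest_multiple(rate):
    """Return (nearest integer multiple of BCLK, relative error in percent)."""
    ratio = Fraction(rate) / BCLK_MHZ
    nearest = round(ratio)
    error_pct = float(abs(ratio - nearest) / nearest) * 100.0
    return nearest, error_pct


multiples = [nearest_multiple(r)[0] for r in RATES]
errors_pct = [nearest_multiple(r)[1] for r in RATES]

for rate, mult, err in zip(RATES, multiples, errors_pct):
    print(f"{rate:>5} ~= {mult:>2} x BCLK (off by {err:.2f}%)")
```

    Every listed rate falls within 1% of an integer multiple, which is consistent with a single filtered reference feeding all local PLLs, though the actual divider settings are not given in the slides.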


    8

    PLL Loop

    Filter PLL: higher sampling frequencies

    Clock distribution in PLL loop

    Adaptive duty cycle adjust loop

    Adaptive clocking system

    [Figure: PLL loop block diagram. Labels: central filter PLL, feedback divider, ref ck (1X, 2X, 4X), local adaptive PLL, analog supply, digital supply, core feedback divider, fb ck, global clock dist., local clocking, DCS, duty cycle adjust, duty cycle sentinel]


    9

    Measured Lock Time And Jitter

    30% jitter reduction

    56% lock time reduction

    [Figure: normalized lock time and long term jitter for 1X, 2X and 4X reference clock; data labels 1, 0.75, 0.44, 0.8, 0.7]
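    The headline percentages follow from the normalized chart values. As a small check, assuming 0.44 is the lock time and 0.7 the long term jitter at the 4X reference clock, both relative to 1.0 at 1X:

```python
def reduction_pct(baseline, improved):
    """Percentage reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline * 100.0


lock_time_reduction = reduction_pct(1.0, 0.44)  # assumed 4X lock time
jitter_reduction = reduction_pct(1.0, 0.7)      # assumed 4X long term jitter

print(f"lock time: {lock_time_reduction:.0f}%, jitter: {jitter_reduction:.0f}%")
```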


    10

    Why Adaptive clocking

    [Figure: a PLL on the analog supply drives the clock distribution and a flop-to-flop data path on the digital supply. Labels: Fixed Freq, Varying Core Digital Supply, Varying Freq, Varying Latency, Setup problem]
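    The setup problem can be illustrated with a toy timing model. This is only a sketch: it assumes gate delay scales inversely with the digital supply, and all voltages, delays and the setup time are hypothetical numbers chosen for illustration. With a fixed clock period a supply droop eats the setup slack; if the clock period stretches with the same supply, the slack is preserved.

```python
def path_delay_ps(nominal_delay_ps, v_nominal, v_supply):
    """Crude assumption: data path delay scales inversely with supply voltage."""
    return nominal_delay_ps * v_nominal / v_supply


def setup_slack_ps(period_ps, delay_ps, setup_ps=20.0):
    """Positive slack means the receiving flop captures the data in time."""
    return period_ps - delay_ps - setup_ps


# Hypothetical operating point, not taken from the slides.
V_NOM, V_DROOP = 1.10, 1.00
PERIOD_PS, DELAY_PS = 333.0, 300.0

nominal_slack = setup_slack_ps(PERIOD_PS, DELAY_PS)

# Fixed frequency: the period stays put while the data path slows down.
drooped_delay = path_delay_ps(DELAY_PS, V_NOM, V_DROOP)
fixed_slack = setup_slack_ps(PERIOD_PS, drooped_delay)

# Adaptive frequency: the period stretches by the same factor as the data path.
adaptive_period = PERIOD_PS * V_NOM / V_DROOP
adaptive_slack = setup_slack_ps(adaptive_period, drooped_delay)

print(f"nominal {nominal_slack:+.1f}ps, fixed {fixed_slack:+.1f}ps, "
      f"adaptive {adaptive_slack:+.1f}ps")
```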


    11

    Why 1st droop



    12

    Adaptive Frequency System (AFS)

    Digital supply noise resistive coupling: 1st droop

    Voltage Compare And Track (VCAT): DC tracking

    [Figure: AFS block diagram. Labels: on-board VRM, on-die LVR, supply control, analog supply, digital supply, R1, R2, mixer, VCAT, adaptive PLL control, PFD, CP, VCO, clock, clock frequency control, core; inset plot of frequency and voltage vs. time]
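    The resistive coupling idea can be sketched as a two-resistor mixer between the analog and digital supplies. The topology below (analog supply, R1, mixed node, R2, digital supply), the linear frequency-versus-supply model and all component values are assumptions made for illustration; the slides do not give the circuit details.

```python
def mixed_supply(v_analog, v_digital, r1, r2):
    """Node voltage of an assumed divider: v_analog --R1-- node --R2-- v_digital."""
    return (v_analog * r2 + v_digital * r1) / (r1 + r2)


def coupling_fraction(r1, r2):
    """Fraction of a digital supply change that appears at the mixed node."""
    return r1 / (r1 + r2)


def tracked_frequency(f_nominal, v_mixed, v_mixed_nominal):
    """Assumption: oscillator frequency is proportional to its supply voltage."""
    return f_nominal * v_mixed / v_mixed_nominal


# Hypothetical values, not taken from the slides.
V_ANALOG, V_DIGITAL_NOM, V_DIGITAL_DROOP = 1.10, 1.10, 1.00
R1, R2 = 1.0, 3.0
F_NOM_GHZ = 3.0

v_nom = mixed_supply(V_ANALOG, V_DIGITAL_NOM, R1, R2)
v_droop = mixed_supply(V_ANALOG, V_DIGITAL_DROOP, R1, R2)
f_droop = tracked_frequency(F_NOM_GHZ, v_droop, v_nom)

print(f"coupling {coupling_fraction(R1, R2):.2f}, "
      f"mixed supply {v_droop:.3f}V, frequency {f_droop:.3f}GHz")
```

    In this model a larger R1/(R1+R2) ratio makes the clock follow the digital supply more closely during a droop.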


    13

    Adaptive Frequency Benefit

    [Figure: core voltage vs. core current, showing the DC load line, the 1st droop transient and the AFS benefit]


    14

    Measured AFS Frequency Upside

    Higher sensitivity increases benefit

    Dependent on voltage, temp, & cores

    [Figure: distribution plot (0.1%, 50%, 99.9%) of frequency upside (0%, 2.5%, 5%) for low and higher sensitivity settings]


    15

    Summary of Clock Generation

    Scalable performance and power efficient architecture are enabled by

    Filter PLL

    Fast lock PLL

    Local PLL with decoupled frequency and voltages

    Adaptive duty cycle correction

    Adaptive Frequency

    Improves top bin yield

    Up to 5% frequency improvement at same voltage

    Lower power at same frequency


    17

    Core Clock Distribution Design Metrics

    Low power

    High level of automation

    Scalable to next process generation

    Approach: pseudo-Grid topology


    18

    Core Clock Distribution

    [Figure: core clock distribution layout. Labels: PLL, vertical spine, horizontal spine, M8 grid wire]


    19

    Un-Core Clock Distribution Issues

    Reality of Un-Core

    Long routes & large variation in clock density

    Multiple clock and voltage domains

    Difficult to fully automate

    Un-Core Approach

    Hybrid clocking

    Custom solution per domain

    Clock grid in highly loaded regions

    Point to point clock distribution in lightly loaded regions

    Adaptive clock compensation


    20

    Un-Core Distribution Architecture

    [Figure: un-core clock distribution. Labels: UPLL, L. Grid, LLC Spine, R. Grid, CORE-0, CORE-1, CORE-2, CORE-3]


    21

    Clock Distribution Summary

    Extensive power/performance tradeoffs in all clocks

    Un-core custom solutions save routing

    Trade higher skew for lower power

    High degree of automation in core

    Quickly retune for changes

    Generates all required schematics/layout


    22

    Configurable Intel QuickPath Technology Clocking Architecture


    23

    I/O Clock Design Goals

    Enable very high bandwidth interfaces

    Tight clock specs

    Accumulated jitter

    Jitter amplification

    Duty cycle

    Scalable clocking

    Performance and power


    24

    Intel QuickPath Interconnect (Intel QPI) TX/RX Clock Architecture

    TX: low jitter PLL, duty cycle correction, shallow dist.

    RX: TA-DLL, low swing distribution

    [Figure: TX/RX clocking block diagram with 20 data pairs and 1 clock pair. Labels: TX PLL, DCC, CLK Driver, TX Data [20], TX CLK, RX CLK, CLK Amp/DCC, DLL, PI, RX Data [20], full-swing, low-swing, phase dist, bias]


    26

    Reduced I/O PLL Jitter

    Lower VCO gain

    Adjustable VCO range

    Capacitive and load tuning

    Decrease noise

    Increase current

    Improve PSRR

    On-die VR & exploit higher voltages


    27

    Transmit Duty Cycle Correction

    Analog DCC integrated into transmit PLL

    [Figure: transmit PLL with integrated DCC. Labels: refclk, fbclk, PFD, CP + LPF, VCO, /N, Corrector, ck, ckb, Detector, err, errb, DCC]
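    A duty cycle correction loop of this kind can be modeled as a detector that measures the deviation from 50% and an integrator that steers a corrector. The discrete-time sketch below is a behavioral illustration only; the loop gain, step count and the function name `run_dcc_loop` are assumptions, and the analog circuit on the slide is not modeled.

```python
def run_dcc_loop(raw_duty, gain=0.25, steps=40):
    """Simulate an integrating duty cycle correction loop.

    raw_duty: uncorrected duty cycle as a fraction (0.5 means 50%).
    Returns the corrected duty cycle observed at each step.
    """
    correction = 0.0
    history = []
    for _ in range(steps):
        duty = raw_duty + correction   # corrector applies the current offset
        error = duty - 0.5             # detector compares against 50%
        correction -= gain * error     # integrator accumulates the error
        history.append(duty)
    return history


history = run_dcc_loop(raw_duty=0.55)
print(f"start {history[0]:.3f}, end {history[-1]:.6f}")
```

    With a gain between 0 and 1 the error shrinks by a constant factor each step, so the output settles at 50% regardless of the starting offset.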


    29

    IO DLL

    Self-Biased DLL (SBDLL)

    22.5 degree resolution

    Frequency-based capacitive load tuning

    improve performance

    Time-Averaging

    reduce jitter & restore duty cycle

    Low swing distribution with PVT tracking
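    The 22.5 degree resolution translates into a number of phase steps per clock period and a time step. The sketch below assumes a 3.2GHz clock, which would correspond to 6.4GT/s if data is transferred on both clock edges; that clock frequency is an assumption, not a figure from the slides.

```python
def phase_steps_per_period(resolution_deg=22.5):
    """Number of equal phase steps in one 360 degree clock period."""
    return round(360.0 / resolution_deg)


def phase_step_ps(clock_ghz, resolution_deg=22.5):
    """Time span of one phase step in picoseconds."""
    period_ps = 1000.0 / clock_ghz
    return period_ps * resolution_deg / 360.0


steps = phase_steps_per_period()
step_ps = phase_step_ps(3.2)  # assumed clock frequency

print(f"{steps} steps per period, {step_ps:.2f}ps per step")
```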


    30

    DLL Delay Element

    Frequency-based capacitive load tuning (FCT)

    Further extends delay range

    [Figure: delay element circuit. Labels: in, inb, out, outb, pbias, nbias, FCTenb[0], FCTenb[1]]


    31

    DLL Time AVG Concept 1

    [Figure: timing diagram. Labels: CLK cycle n, CLK cycle n-1, t1, t1/2, CLK TA]

    Phase mix adjacent cycles

    Average HF jitter
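    The effect of mixing adjacent cycles can be sketched by treating each clock edge as a timing error and averaging it with the previous one. This is an idealized model that assumes a perfect 50/50 mixer; the function name `time_average` and the sample values are illustrative.

```python
def time_average(edge_errors_ps):
    """Average each edge's timing error with that of the previous cycle."""
    return [
        (edge_errors_ps[i] + edge_errors_ps[i - 1]) / 2.0
        for i in range(1, len(edge_errors_ps))
    ]


# A single edge displaced by t1 comes out displaced by t1/2.
t1 = 10.0
single = time_average([0.0, 0.0, t1, 0.0, 0.0])

# Jitter that alternates every cycle (the highest frequency) cancels.
alternating = time_average([5.0, -5.0, 5.0, -5.0, 5.0])

print(single)
print(alternating)
```

    Slow wander passes through nearly unchanged in this model, since adjacent edges then carry almost the same error; only the high-frequency component is averaged out.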


    32

    DLL Time AVG Concept 2

    [Figure: Ph n-1 and Ph n are mixed to produce Ph out]

    Phase mix adjacent clock phases

    Uniform clock phases
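    Mixing adjacent phases evens out the spacing between them, because each output spacing becomes the average of two neighboring input spacings. The sketch below assumes an ideal 50/50 phase mixer and uses hypothetical phase values with alternating 40 and 50 degree spacing.

```python
def mix_adjacent_phases(phases_deg):
    """Each output phase is the midpoint of two neighboring input phases."""
    return [
        (phases_deg[i - 1] + phases_deg[i]) / 2.0
        for i in range(1, len(phases_deg))
    ]


def spacings(phases_deg):
    """Differences between consecutive phases."""
    return [b - a for a, b in zip(phases_deg, phases_deg[1:])]


uneven = [0.0, 40.0, 90.0, 130.0, 180.0]  # hypothetical non-uniform phases
mixed = mix_adjacent_phases(uneven)

print(spacings(uneven))
print(spacings(mixed))
```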


    33

    Time Average (Continued)

    [Figure: time-averaging mixer circuit. Labels: in1, in1#, in2, in2#, out, out#, pbias, nbias]


    34

    SBDLL + TA1

    [Figure: delay line phases Ph1 to Ph8 and their complements Ph1# to Ph8# feed four TA1 blocks, which output Ck0/Ck180, Ck45/Ck225, Ck90/Ck270 and Ck135/Ck315]


    35

    SBDLL + TA1 + TA2

    [Figure: four TA1 blocks followed by four TA2 blocks, with outputs Ck0, Ck45, Ck90, Ck135, Ck180, Ck225, Ck270 and Ck315]


    36

    TA-DLL Jitter Attenuation Simulation

    DLL jitter attenuation ~27%

    Final attenuation at the receiver ~20%

    [Figure: simulated jitter at the Delay Line, TA1, TA2, BFR and PI stages]


    37

    TA-DLL Duty Cycle Correction Simulation

    DLL +/-15% duty cycle correction

    [Figure: simulated duty cycle (%) at the Delay Line, TA1, TA2 and BFR stages for 65% and 35% inputs, both reaching the TA 50% output]
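    One way time averaging can restore a 50% duty cycle is by mixing a clock with an inverted copy of itself delayed by half a period. The sketch below works through that edge arithmetic with the period normalized to 1. Whether the TA stages on the slides pair the phases exactly this way is an assumption; the model also assumes an ideal 50/50 mixer.

```python
def ta_duty_cycle(input_duty):
    """Duty cycle after mixing a clock with its inverted half-period-delayed copy.

    Times are normalized to one clock period.
    """
    # Original clock: rises at 0, falls at input_duty.
    rise_a, fall_a = 0.0, input_duty
    # Copy delayed by 0.5 and inverted: its rising edge is the delayed falling
    # edge (input_duty + 0.5, taken one period earlier), and its falling edge
    # is the delayed rising edge.
    rise_b, fall_b = input_duty - 0.5, 0.5
    # Ideal mixer: each output edge sits midway between the two input edges.
    rise = (rise_a + rise_b) / 2.0
    fall = (fall_a + fall_b) / 2.0
    return fall - rise


for duty in (0.35, 0.65):
    print(f"input {duty:.0%} -> output {ta_duty_cycle(duty):.0%}")
```

    In this idealized model the distortion moves both output edges by the same amount, so the high time stays at half a period for the 35% and 65% inputs shown on the slide.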


    38

    Jitter Measurement: TA Disabled

    PP jitter: 69.8ps


    39

    Jitter Measurement: TA Enabled

    PP jitter reduction ~16%


    40

    I/O Summary

    High speed requires optimum clocking

    At transmit:

    Shallow TX differential clock distribution

    Optimally tuned transmit PLL

    Transmit duty cycle correction

    At receive: innovative receive DLL

    27% jitter attenuation

    +/-15% receive duty cycle correction

    Low-swing clock distribution for better PSRR

    Continuous PVT tracking


    41

    Conclusion

    Clock innovations are a key enabler of modular & scalable processors

    Power efficiency

    Chip frequency adapts to power supply voltage and droops

    Fast power state transitions with faster PLL lock time

    Duty cycle adapts to transistor variation and lifetime stress

    Dynamic clock skew compensation

    High speed IO: QPI/DDR/PCIe

    Optimized power, PLL and clock delivery

    Jitter attenuating techniques