7a.3.09

8/7/2019 7a.3.09


Power-efficient scalable multi-core and high-speed IO clocking architecture

Nasser Kurd

Praveen Mosalikanti

Intel Corporation

Session 7A

CMOSET, Sep 25, 09


Introduction

Clocking has a significant impact on power and performance

Large percentage of power

Deep state exit latencies & voltage/frequency transitions

Clock skew margins

IO timing: QPI, PCIe, DDR

Adaptive techniques to reduce power and improve margins

This talk covers: clock circuit innovations enabling the power-efficient, scalable and modular Intel Core i7 and i5 (Nehalem) family


Outline

High level Nehalem overview

Clock Generation

Clock Distribution

Adaptive Frequency System

Intel QuickPath Technology Clocking

Conclusion


The First Nehalem Processor

A Modular Design for Flexibility

[Figure: Nehalem die block diagram. Blocks shown: four cores, queue, shared L3 cache, memory controller, QPI0, QPI1, and two misc IO blocks]

QPI: Intel QuickPath Interconnect, BW up to ~25.6GB/s

Memory BW up to 32GB/s

Nehalem: Next Generation Intel Microarchitecture


Clock Generation Architecture


Clock Generation Design Goals

Nehalem: Next Generation Intel Microarchitecture

Modular & scalable

Decoupled frequency and voltages

Power efficient clocking architecture

[Figure: die floorplan. Blocks shown: QPI0, QPI1, memory controller, four cores, and LLC]


PLL Architecture

Local PLL placement

On-die LVR per PLL

[Figure: PLL architecture. Blocks shown: BCLK (133MHz), FPLL, UPLL, CPLL (x4), QPLL (x2), DPLL, LPLL. Rates labeled in the figure: 4.8, 5.9, 6.4GT/s; 800, 1066, 1333MT/s; 667MHz to multi-GHz; 266, 533MHz]

CPLL: Core PLL

QPLL: QPI PLL

FPLL: Filter PLL

DPLL: DDR PLL

UPLL: Un-core PLL


PLL Loop

Filter PLL: higher sampling frequencies

Clock distribution in PLL loop

Adaptive duty cycle adjust loop

Adaptive clocking system

[Figure: PLL loop block diagram. Labels: central filter PLL, feedback divider, ref ck (1X, 2X, 4X), local adaptive PLL, core feedback divider, fb ck, global clock dist., local clocking, analog supply, digital supply, duty cycle adjust, duty cycle sentinel (DCS)]


Measured Lock Time And Jitter

30% jitter reduction

56% lock time reduction

[Figure: bar chart of normalized lock time and long term jitter for 1X, 2X, and 4X; plotted values: 0.75, 0.44, 1, 0.7, 0.8]
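The two headline percentages can be cross-checked against the plotted values. The sketch below assumes, since the extracted slide does not state the mapping, that 0.44 is the best normalized lock time and 0.7 the best normalized long term jitter, each relative to a baseline of 1. The helper name is hypothetical.

```python
def percent_reduction(baseline: float, value: float) -> int:
    """Reduction of `value` relative to `baseline`, rounded to a whole percent."""
    return round((baseline - value) / baseline * 100)

# Assumed mapping of the plotted values (not confirmed by the slide text).
lock_time_reduction = percent_reduction(1.0, 0.44)
jitter_reduction = percent_reduction(1.0, 0.7)
print(lock_time_reduction, jitter_reduction)
```

Under that assumption the numbers agree with the 56% and 30% figures quoted on the slide.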


Why Adaptive clocking

Fixed Freq

Varying Core Digital Supply

Varying Freq

Varying Latency

Setup problem

[Figure: a PLL on the analog supply drives the clock distribution, which clocks two flops on either side of a data path; the clock distribution and data path run on the digital supply]
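A toy timing model can illustrate the setup problem. It assumes, purely for illustration, that data path delay scales inversely with the digital supply, while a PLL on a quiet analog supply holds the clock period fixed. All function names and numbers are hypothetical and are not taken from the slides.

```python
def path_delay_ps(nominal_delay_ps: float, v_nominal: float, v_digital: float) -> float:
    """Crude model: logic delay grows as the digital supply droops."""
    return nominal_delay_ps * v_nominal / v_digital

def setup_slack_ps(period_ps: float, delay_ps: float, setup_ps: float = 20.0) -> float:
    """Positive slack means the receiving flop still captures the data in time."""
    return period_ps - setup_ps - delay_ps

V_NOM, V_DROOP = 1.10, 1.00   # assumed supply levels in volts
PERIOD_PS = 333.0             # assumed fixed clock period (about 3GHz)
NOMINAL_DELAY_PS = 300.0      # assumed data path delay at nominal supply

# Fixed frequency: the period ignores the droop, so slack goes negative.
slack_nominal = setup_slack_ps(PERIOD_PS, path_delay_ps(NOMINAL_DELAY_PS, V_NOM, V_NOM))
slack_fixed = setup_slack_ps(PERIOD_PS, path_delay_ps(NOMINAL_DELAY_PS, V_NOM, V_DROOP))

# Adaptive frequency: the period stretches with the droop, so slack recovers.
adaptive_period = PERIOD_PS * V_NOM / V_DROOP
slack_adaptive = setup_slack_ps(adaptive_period, path_delay_ps(NOMINAL_DELAY_PS, V_NOM, V_DROOP))

print(slack_nominal, slack_fixed, slack_adaptive)
```

In this sketch the fixed-frequency case fails setup during the droop, while the case whose period tracks the supply keeps positive slack.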


Why 1st droop


Adaptive Frequency System (AFS)

Digital supply noise resistive coupling: 1st droop

Voltage Compare And Track (VCAT): DC tracking

[Figure: frequency and voltage versus time]

[Figure: AFS block diagram. Labels: on-board VRM, on-die LVR, R1, R2, mixer, VCAT, supply control, analog supply, digital supply, adaptive PLL (PFD, CP, VCO), control, clock frequency control, clock, core]
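The resistive coupling idea can be sketched as a two-resistor summing node that mixes part of the digital supply noise into the PLL supply, so that the VCO slows down during a droop. The slide does not give the actual R1/R2 connection or the VCO characteristic, so the topology, the linear VCO model, and every value below are assumptions.

```python
def mixed_supply(v_lvr: float, v_digital: float, r1: float, r2: float) -> float:
    """Unloaded summing node: v_lvr feeds it through r1, v_digital through r2."""
    return (v_lvr * r2 + v_digital * r1) / (r1 + r2)

def vco_frequency_ghz(v_supply: float, f_nominal_ghz: float = 3.0,
                      v_nominal: float = 1.1, gain: float = 1.0) -> float:
    """Assumed linear VCO: fractional frequency change = gain * fractional supply change."""
    return f_nominal_ghz * (1.0 + gain * (v_supply - v_nominal) / v_nominal)

# Hypothetical first droop: digital supply dips from 1.1V to 1.0V.
v_pll_quiet = mixed_supply(1.1, 1.1, r1=1.0, r2=1.0)
v_pll_droop = mixed_supply(1.1, 1.0, r1=1.0, r2=1.0)

f_quiet = vco_frequency_ghz(v_pll_quiet)
f_droop = vco_frequency_ghz(v_pll_droop)
print(f_quiet, f_droop)
```

With equal resistors, half of the droop reaches the VCO supply; a different R1/R2 ratio would change how strongly the clock frequency follows the digital supply.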


Adaptive Frequency Benefit

[Figure: core voltage versus core current, showing the DC load line, the 1st droop transient, and the AFS benefit]


Measured AFS Frequency Upside

Higher sensitivity increases benefit

Dependent on voltage, temp, & cores

[Figure: distribution of measured frequency upside for low and higher sensitivity settings; axis ticks at 0%, 2.5%, 5% and at 0.1%, 50%, 99.9%]


Summary of Clock Generation

Scalable performance and power efficient architecture are enabled by

Filter PLL

Fast lock PLL

Local PLL with decoupled frequency and voltages

Adaptive duty cycle correction

Adaptive Frequency

Improves top bin yield

Up to 5% frequency improvement at same voltage

Lower power at same frequency


Core Clock Distribution Design Metrics

Low power

High level of automation

Scalable to next process generation

Approach: pseudo-Grid topology


Core Clock Distribution

[Figure: core clock distribution. Labels: PLL, vertical spine, horizontal spine, M8 grid wire]


Un-Core Clock Distribution Issues

Reality of Un-Core

Long routes & large variation in clock density

Multiple clock and voltage domains

Difficult to fully automate

Un-Core Approach

Hybrid clocking

Custom solution per domain

Clock grid in highly loaded regions

Point to point clock distribution in lightly loaded regions

Adaptive clock compensation


Un-Core Distribution Architecture

[Figure: un-core clock distribution. Labels: UPLL, L. Grid, LLC Spine, R. Grid, CORE-0, CORE-1, CORE-2, CORE-3]


Clock Distribution Summary

Extensive power/performance tradeoffs in all clocks

Un-core custom solutions save routing

Trade higher skew for lower power

High degree of automation in core

Quickly retune for changes

Generates all required schematics/layout


Configurable Intel QuickPath Technology Clocking Architecture


I/O Clock Design Goals

Enable very high bandwidth interfaces

Tight clock specs

Accumulated jitter

Jitter amplification

Duty cycle

Scalable clocking

Performance and power


Intel QuickPath Interconnect (Intel QPI) TX/RX Clock Architecture

TX: low jitter PLL, duty cycle correction, shallow dist.

RX: TA-DLL, low swing distribution

[Figure: TX/RX clock architecture with 20 data pairs and 1 clock pair between TX and RX. Labels: TX PLL, DCC, CLK Driver, CLK Amp/DCC, DLL, PI, D/Q flops, TX Data [20], TX CLK, RX CLK, RX Data [20], full-swing, low-swing, phase dist, bias]


Reduced I/O PLL Jitter

Lower VCO gain

Adjustable VCO range

Capacitive and load tuning

Decrease noise

Increase current

Improve PSRR

On-die VR & exploit higher voltages


Transmit Duty Cycle Correction

Analog DCC integrated into transmit PLL

[Figure: transmit PLL with integrated DCC. Labels: refclk, fbclk, PFD, CP + LPF, VCO, /N divider, Corrector, Detector, DCC, ck, ckb, err, errb]


IO DLL

Self-Biased DLL (SBDLL)

22.5 degree resolution

Frequency-based capacitive load tuning: improve performance

Time-Averaging: reduce jitter & restore duty cycle

Low swing distribution with PVT tracking


DLL Delay Element

Frequency-based capacitive load tuning (FCT)

Further extends delay range

[Figure: delay element schematic. Labels: pbias, nbias, in, inb, out, outb, FCTenb[0], FCTenb[1]]
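A first-order RC estimate shows why switching extra load capacitance onto a delay stage extends its delay range. This is only a sketch: the capacitor sizes, the effective resistance, and the reading of FCTenb as active-low enables are assumptions, not values from the slide.

```python
def stage_delay_ps(r_eff_kohm: float, c_base_ff: float,
                   c_switch_ff: list, fct_enb: list) -> float:
    """First-order delay estimate 0.69 * R * C (kohm * fF gives ps).

    A switched capacitor is counted when its enable-bar bit is 0
    (assumed active-low, based only on the 'enb' suffix).
    """
    enabled = sum(c for c, enb in zip(c_switch_ff, fct_enb) if enb == 0)
    return 0.69 * r_eff_kohm * (c_base_ff + enabled)

C_SWITCH_FF = [5.0, 10.0]  # hypothetical switched load capacitors

# Delay for each FCTenb[1:0] setting, from no extra load to full extra load.
delays = [stage_delay_ps(2.0, 10.0, C_SWITCH_FF, bits)
          for bits in ([1, 1], [0, 1], [1, 0], [0, 0])]
print(delays)
```

In this model each setting adds to the maximum delay available at a given bias, which is one way to cover lower operating frequencies with the same delay line.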


DLL Time AVG Concept 1

[Figure: waveforms for CLK cycle n, CLK cycle n-1, and CLK TA; an edge displacement of t1 appears as t1/2 on CLK TA]

Phase mix adjacent cycles

Average HF jitter
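The averaging of adjacent cycles can be sketched numerically. The model below assumes an ideal 50/50 phase mixer that combines each clock edge with the previous edge delayed by one nominal period; the function names and edge times are hypothetical.

```python
def time_average(edges_ps: list, period_ps: float) -> list:
    """Mix each edge with the previous edge delayed by one nominal period."""
    return [(edges_ps[n] + edges_ps[n - 1] + period_ps) / 2.0
            for n in range(1, len(edges_ps))]

def timing_errors(edges_ps: list, period_ps: float, first_index: int = 0) -> list:
    """Deviation of each edge from its ideal position."""
    return [t - (first_index + i) * period_ps for i, t in enumerate(edges_ps)]

PERIOD = 100.0

# One edge displaced by t1 = 10ps: the averaged clock sees t1/2 on two edges.
single = [0.0, 100.0, 210.0, 300.0, 400.0]
single_err = timing_errors(time_average(single, PERIOD), PERIOD, first_index=1)

# Cycle-to-cycle alternating (high frequency) jitter of +/-5ps cancels.
alternating = [5.0, 95.0, 205.0, 295.0]
alt_err = timing_errors(time_average(alternating, PERIOD), PERIOD, first_index=1)

print(single_err, alt_err)
```

The isolated error is halved, matching the t1 to t1/2 annotation, and jitter that alternates every cycle averages out entirely in this idealized model.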


DLL Time AVG Concept 2

[Figure: phases Ph n-1 and Ph n are mixed to produce Ph out]

Phase mix adjacent clock phases

Uniform clock phases
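Mixing adjacent phases can be illustrated the same way. The sketch assumes an ideal mixer that outputs the midpoint of two neighboring phases, and uses made-up phase errors on a nominal 45 degree spacing.

```python
def mix_adjacent(phases_deg: list) -> list:
    """Average each phase with its predecessor (ideal 50/50 phase mixer)."""
    return [(phases_deg[i - 1] + phases_deg[i]) / 2.0
            for i in range(1, len(phases_deg))]

def spacings(phases_deg: list) -> list:
    """Step between consecutive phases."""
    return [b - a for a, b in zip(phases_deg, phases_deg[1:])]

# Nominal 45 degree steps with a hypothetical +4 degree error on every other phase.
raw = [0.0, 49.0, 90.0, 139.0, 180.0, 229.0]
mixed = mix_adjacent(raw)

print(spacings(raw))    # uneven steps of 49 and 41 degrees
print(spacings(mixed))  # even 45 degree steps
```

Errors that alternate from phase to phase cancel in the mixed outputs, so the spacing becomes uniform; other error patterns would be reduced rather than removed.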


Time Average (Continued)

[Figure: time-averaging stage schematic. Labels: pbias, nbias, in1, in1#, in2, in2#, out, out#]


SBDLL + TA1

[Figure: SBDLL phases Ph1 to Ph8 and their complements Ph1# to Ph8# feed four TA1 stages, which output Ck0/Ck180, Ck45/Ck225, Ck90/Ck270, and Ck135/Ck315]


SBDLL + TA1 + TA2

[Figure: four TA1 stages followed by four TA2 stages; clock pairs Ck0/Ck180, Ck45/Ck225, Ck90/Ck270, and Ck135/Ck315 are shown at both the TA1 and TA2 outputs]


TA-DLL Jitter Attenuation Simulation

DLL jitter attenuation ~27%

Final attenuation at the receiver ~20%

[Figure: simulated normalized jitter at each stage (Delay Line, TA1, TA2, BFR, PI), vertical axis 0.2 to 1.4; labeled values: 1.0, 0.6, 0.4, 1.4]


TA-DLL Duty Cycle Correction Simulation

DLL +/-15% duty cycle correction

[Figure: duty cycle (axis 30 to 75) at each stage (Delay Line, TA1, TA2, BFR) for a 65% input and a 35% input; the TA output reaches 50%]


Jitter Measurement: TA Disabled

PP jitter: 69.8ps


Jitter Measurement: TA Enabled

PP jitter reduction ~16%


I/O Summary

High speed requires optimum clocking

At transmit:

Shallow TX differential clock distribution

Optimally tuned transmit PLL

Transmit duty cycle correction

At receive: innovative receive DLL

27% jitter attenuation

+/-15% receive duty cycle correction

Low-swing clock distribution for better PSRR

Continuous PVT tracking


Conclusion

Clock innovations are a key enabler for modular & scalable processors

Power efficiency

Chip frequency adapts to power supply voltage and droops

Fast power state transitions with faster PLL lock time

Duty cycle adapts to transistor variation and lifetime stress

Dynamic clock skew compensation

High speed IO: QPI/DDR/PCIe

Optimized power, PLL and clock delivery

Jitter attenuating techniques
