Digital Design for Low Power Systems - Stanford University

40
Digital Design for Low Power Systems Shekhar Borkar Intel Corp.

Transcript of Digital Design for Low Power Systems - Stanford University

Page 1: Digital Design for Low Power Systems - Stanford University

Digital Design for

Low Power Systems

Shekhar Borkar

Intel Corp.

Page 2: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar2

Outline• Low Power—Outlook & Challenges• Circuit solutions for leakage

avoidance, control, & tolerance• Microarchitecture for Low Power• System considerations• Technology, Circuits, Architecture,

and System co-optimization

Page 3: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar3

Unconstrained SD Leakage

1

10

100

1000

10000

30 80 130

Temp (C)

Ioff

(na/

u)

0.25

90nm

45nm

Page 4: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar4

Unconstrained SD Leakage Power

0

1

10

100

1,000

0.25u 0.18u 0.13u 90nm 65nm 45nm

SD L

eaka

ge (W

atts

)

1.5X Tr Growth

2X Tr Growth

Temp = 100C

2V1.7V

1.5V

1.2V

1V 0.9V

Page 5: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar5

Product Cost Pressure

0200400600800

100012001400

1998 2000 2002 2004 2006

Des

k To

p A

SP ($

)

Other

Thermal $10

Power Delivery

$25

Platform Budget

Shrinking ASP & budget for power

Page 6: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar6

Unconstrained Total Power

0

200

400

600

800

1000

1200

1400Po

wer

(W),

Pow

er D

ensi

ty (W

/cm2 )

SiO2 LkgSD LkgActive

10 mm Die

90nm 65nm 45nm 32nm 22nm 16nm

Technology, Circuits, & Architecture to constrain power

Page 7: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar7

Active Power Reduction

Slow Fast Slow

Low

Sup

ply

Volta

ge

Hig

h Su

pply

Vo

ltage

Multiple Supply Voltages

Multiple supply voltages mimic Vdd scalingIssues:Performance loss due to interface circuitsMultiple Vdd routing overhead

Page 8: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar8

SD Leakage Avoidance—Dual Vt

Delay

# of

Pat

hs High High VtVt

0%

20%

40%

60%

80%

100%

0% 5% 10% 15% 20% 25%% T scaling from all high-Vt design

Low

-Vtt

rans

isto

r wid

th(%

of t

otal

tran

sist

or w

idth

)

Full Low Vt Performance

0%

20%

40%

60%

80%

100%

0% 5% 10% 15% 20% 25%% T scaling from all high-Vt design

Low

-Vtt

rans

isto

r wid

th(%

of t

otal

tran

sist

or w

idth

)

Full Low Vt Performance

# of

Pat

hs Low Low VtVt

Delay

Delay

# of

Pat

hs Dual Dual VtVt Leakage 3X smallerLeakage 3X smaller(Active & Standby)(Active & Standby)No performance lossNo performance loss

Page 9: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar9

Leakage Control CircuitsBody Bias

VddVbp

Vbn-Ve

+Ve

Sleep Transistor

Logic Block

22--1000X1000XReductionReduction

Stack Effect

Equal Loading

22--10X10XReduction

55--10X10XReductionReduction Reduction

Page 10: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar10

Body BiasReverse BB Forward BB

Increases Vt

Reduces Ioff

Reduces Vt

Increases Ioff

To inactive circuits to reduce leakage

To active circuits to improve performance

VddVbp

Vbn-Ve

+Ve

Tradeoff & analysis required• Applying BB consumes

active power• Switching delay between

body bias

Page 11: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar11

Reverse Body Bias100

Leak

age

Red

uctio

n Fa

ctor

(X)

110C0.5V RBB110C0.5V RBB

Lower VTHigher VT10

Shorter L

10.01 0.1 1 10 100 100

0Target Ioff (nA/µm)

RBB less effective at shorter L and lower Vt

* A. Keshavarzi et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED)

Page 12: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar12

Scalability of RBB

000.0E+0

100.0E-9

200.0E-9

300.0E-9

400.0E-9

-1.0 -0.8 -0.6 -0.4 -0.2 0.0Body Bias (V)

Pow

er (W

atts

)

Total Leakage Power

SD leakage

Ibn junction leakageIbp junction leakage

Optimum

0.35 µm 0.18 µmOptimum RBB

2V 0.5V

Ioff Red. 1000X 10X

I/O circuit

Microprocessor critical path circuit

I/O circuit

Microprocessor critical path circuit

Total Leakage PowerMeasured on Test Chip

0.18 µm

RBB Less effective with technology scaling* A. Keshavarzi et. al., 1999 Int. Symp. Low Power Electronics & Design (ISLPED)

Page 13: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar13

Forward Body Bias

0

0.5

1

1.5

0 200 400 600Forward body bias (mV)

Nor

mal

ized

op

erat

ing

freq

uenc

y

450mV

0

0.5

1

1.5

0 200 400 600Forward body bias (mV)

Nor

mal

ized

op

erat

ing

freq

uenc

y

450mV

250

500

750

1000

1250

1500

1750

2000

0.9 1.1 1.3 1.5 1.7Vcc (V)

Fmax

(MH

z)

Body bias chipwith 450 mV FBB

NBB chip& body biaschip withZBBTj ~ 60°C

250500750

10001250150017502000

0 5 10 15 20Active power (W)

Fmax

(MH

z)

Body bias chipwith 450 mV FBB

Body biaschip withZBB

Tj ~ 60°C250500750

10001250150017502000

0 5 10 15 20Active power (W)

Fmax

(MH

z)

Body bias chipwith 450 mV FBB

Body biaschip withZBB

Tj ~ 60°C

Router chip with body bias

Digital Core

PLL

CB

G

24 LBGs

I/O: S-Links

I/O: F-LinksDigital Core

PLLPLL

CB

GC

BG

24 LBGs

I/O: S-LinksI/O: S-Links

I/O: F-LinksI/O: F-Links

* S. Narendra et al, ISSCC 2002, 16.4

Die size 10.1 X 10.1 mm2Technology 150nm CMOSTransistors 6.6 millionArea overhead 2%Power overhead 1%

* S. Narendra et al, ISSCC 2002, 16.4

FBB increases frequency & SD leakage

Page 14: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar14

Adaptive Body Bias Experiment

Resistor Network

4.5 mm

5.3

mm

Multiplesubsites PD & Counter Resistor

Network

CUT Bias Amplifier

Delay

1.6 X 0.24 mm, 21 sites per die150nm CMOSTechnology 150nm CMOS

Number of subsites per die 21

Body bias range 0.5V FBB to 0.5V RBB

Bias resolution 32 mV

Die frequency: Min(F1..F21)Die power: Sum(P1..P21)

Page 15: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar15

Adaptive Body Bias Results

0%

20%

60%

100%noBB

100% yield

ABB

Higher Frequency

Num

ber o

f die

s

Frequency

too slow

ftarget

too leaky

ftarget

ABB

FBB RBB

Num

ber o

f die

s

Frequency

too slow

ftarget

too leaky

ftarget

ABB

FBB RBB

97% highest bin

within die ABB

For given Freq and Power densityFor given Freq and Power density•• 100% yield with ABB 100% yield with ABB •• 97% highest freq bin with ABB for 97% highest freq bin with ABB for within die variability within die variability

Acc

epte

d di

e

Page 16: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar16

Transistor Stack

00.20.40.60.8

11.2

0 0.5 1 1.5Vint(V)

Nor

mal

ized

Cur

rentVdd

Vint

Stack leakage is ~5-10X smaller

* S. Narendra et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED)

Page 17: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar17

Scalability of Stack Effect

0

5

10

15

20

0 0.5 1 1.5 2

Stac

k ef

fect

fact

or X Zero body bias

80 C

0

5

10

15

20

0 0.5 1 1.5 2

Stac

k ef

fect

fact

or X

Zero body bias80 C

Vdd (V)

0.13 µm LVT

0.13 µm HVT

0.18 µm

Model

0.18 0.13 0.09 0.06

Technology generation (µm)

Zero body bias80 C

1.8 V @ 0.18 µm

0.18 0.13 0.09 0.06

Technology generation

Stac

k ef

fect

fact

or

(µm)

Zero body bias80 C

0

10

20

30

40

50

0

10

20

30

40

50

X 1.8 V @ 0.18 µm

10% Vdd scaling

20% Vdd scaling

Stack effects becomes stronger with scaling

Page 18: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar18

Exploiting Natural Stacks32-bit Kogge-Stone adder

0%10%20%30%

5.0 5.6 6.2 6.8 7.4 105 120 135

vect

ors High VT Low VT

% o

f inp

ut

Standby leakage current (µA)

High VT Low VT

Energy Overhead

1.64 nJ 1.84 nJ

Savings 2.2 µA 38.4 µAMin time in Standby

84 µS 5.4 µS

Reduction Avg WorstHigh VT 1.5X 2.5XLow VT 1.5X 2X

* Y. Ye et. al., 1998 Symp. VLSI Circuits

Page 19: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar19

Stack Forcing

0

1

2

3

4

5

1.E-05 1.E-04 1.E-03 1.E-02 1.E-01 1.E+00

Ioff (Relative)

Del

ay (R

elat

ive)

High Vt Low Vt

Leakage ReductionPerformance Loss

Equal Loading

Circuit technique provides additional Vt’s

Page 20: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar20

Sleep Transistor Technique

ALU core

Vddexternal

DynamicALU 32

Vssexternal

Sleep ALU

ScanFIFO

Control

Body bias

Output scan

Non-sleep ALU

Sleep ALU

ScanFIFO

Control

Body bias

Output scan

Non-sleep ALU

0

2

4

6

8

10

12

Clock gating +body bias

Clock gatingonly

Clock gating +sleep transistor

Tota

pow

er (m

W)

Switching Leakage Overhead4GHz, 75Cα=0.1 10.3%9.6%

Frequency change

Leakage savings

Area over-head

PMOS sleep transistor -2% 13X 11%

PMOS body bias 0% 2X 2%

J. Tschanz et al, ISSCC 2003, 6.1

Page 21: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar21

Leakage Tolerant Register File

* R. Krishnamurthy et. al., 2001 Symp. VLSI Circuits

Local Bit-line

RS0

D0 • •• •

RS15

D15

Φ1

N0LBL0

LBL1M1

M2

Local Bit-line

RS0

D0 • •• •

RS15

D15

Φ1

N0LBL0

LBL1M1

M2

Domino: leakage-sensitive

Pseudo-static: leakage-tolerant

• • • •

Φ1

RS0

Vs

D0#

Pk

M1

M2

LBL0

LBL1

Px

RS15

D15#

• • • •

Φ1

RS0

Vs

D0#

Pk

M1

M2

LBL0

LBL1

Px

RS15

D15#

0

0.05

0.1

0.15

0.2

0.25

150 160 170 180 190 200Read Delay (ps)

LBL

DC

Noi

se R

obus

tnes

s

6GHz design

This work

Dual-Vt

Low-VtNoise floor

Keeper upsizing

Improves delayImproves noise marginLeakage tolerant

Page 22: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar22

Employing Leakage ControlN

umbe

r of P

aths

Low Vt

Low Vt +Stack Forcing

Dual Vt

Dual Vt +Stack Forcing

TargetDelay

Page 23: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar23

Optimum Frequency

Pipeline Depth

02468

10

1 2 3 4 5 6 7 8 9 10Relative Pipeline Depth

Power Efficiency

Optimum

Process Technology

02468

10

1 2 3 4 5 6 7 8 9 10Relative Frequency

Sub-threshold Leakage increases

exponentially

Pipeline & Performance

02468

10

1 2 3 4 5 6 7 8 9 10Relative Frequency (Pipelining)

Performance

DiminishingReturn

Maximum performance with• Optimum pipeline depth • Optimum frequency

Page 24: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar24

Efficiency of Microarchitectures

0

1

2

3

4

S-Scalar Dynamic DeepPipeline

Gro

wth

(X) f

rom

pre

viou

s uA

rch

Die AreaPerformancePower

Same Process Technology

0%

20%

40%

S-Scalar Dynamic DeepPipeline

Red

uctio

n in

MIP

S/W

att Same Process Technology

Enegry efficiency drops ~20%

Employ efficient microarchitecture and design

Page 25: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar25

Memory Latency

MemoryCPU Cache

Small~few Clocks Large

50-100ns1

10

100

1000

100 1000 10000

Freq (MHz)M

emor

y La

tenc

y (C

lock

s)Assume: 50ns Memory latency

Cache miss hurts performanceWorse at higher frequency

Page 26: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar26

Increase on-die Memory

0%

25%

50%

75%

100%

1u 0.5u 0.25u 0.13u 65nm

Cac

he %

of T

otal

Are

a

486 Pentium®

Pentium® III

Pentium® 4

Pentium® M

Power Density of Logic ≈ 10X of Memory

Increased Data Bandwidth & Reduced LatencyHence, higher performance for much lower power

Page 27: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar27

Multi-threading

0%

20%

40%

60%

80%

100%

100% 98% 96%

Cache Hit %

Perf

orm

ance

1 GHz

2 GHz

3 GHz

ST Wait for Mem

Single ThreadSingle Thread

MT1 Wait for Mem

MT2 Wait

MT3

MultiMulti--ThreadingThreading

Full HW Utilization

Thermals & Power Delivery Thermals & Power Delivery designed for full HW utilizationdesigned for full HW utilization

Multi-threading improves performance without impacting thermals & power delivery

Page 28: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar28

Throughput Oriented Design

Logic Block

VddFreq = 1Throughput = 1Active Power = 1SD Lkg Power = 1

0.7 x Vdd

Logic Block Freq = 0.7Throughput = 1.4Active Power = 0.7SD Lkg Power = 0.7

Logic Block

Higher logic throughput, yet lower power

Page 29: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar29

Special Purpose HardwareTCP Offload EngineTCP Offload Engine

1.E+02

1.E+03

1.E+04

1.E+05

1.E+06

1995 2000 2005 2010 2015M

IPS

GP MIPS@75W

TOE MIPS@~2WTC

B ExecCore

PLL

OOO

ROM

CA

M1

TCB Exec

Core

PLL

ROB

ROM

CLB

Inputseq

Sendbuffer

Opportunities:Network processing enginesMPEG Encode/Decode enginesSpeech engines

2.23 mm X 3.54 mm, 260K transistors

Special Purpose HW—Best Mips/Watt

Page 30: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar30

Chip Level Multi-processingVoltage Frequency Power Performance

1% 1% 3% 0.66%Rule of thumb:

In the same process technology…

Core

Cache

Core

Cache

Core

Voltage = -15%Freq = -15%Area = 2Power = 1Perf = ~1.8

Voltage = 1Freq = 1Area = 1Power = 1Perf = 1

Page 31: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar31

Multi-CorePower

Large Core

CachePower = 1/4

C1 C2

C3 C4

Cache

1

2

3

4 Performance Performance = 1/2

SmallCore1

2

1 1

1

2

3

4

1

2

3

4 MultiMulti--Core:Core:Power efficientPower efficient

Better power and Better power and thermal managementthermal management

Page 32: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar32

From Multi to Many…

0

5

10

15

20

25

30

TPT OneApp

TwoApp

FourApp

EightApp

Syst

em P

erfo

rman

ceLargeMedSmall

Single Core Performance

1

0.5

0.3

0

0.2

0.4

0.6

0.8

1

1.2

Larg

e

Med

Smal

l

Rel

ativ

e Pe

rfor

man

ce

13mm, 100W, 48MB Cache, 4B Transistors, in 22nm12 Cores 48 Cores 144 Cores

Page 33: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar33

Future Multi-core Platform

GP GP

GP

GP GP

GP

GP

GP GP

GP

GP GP

General Purpose CoresSP SP

SP SPSpecial Purpose HW

CC

CC

CC

CC

CC

CC

CC

CC Interconnect fabric

Heterogeneous MultiHeterogeneous Multi--Core PlatformCore Platform

Page 34: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar34

Low Power & PlatformBreakdown of Platform Power

CPU MCH, ICH

Gfx

Memory

CLK Gen

Com

Disk

Other

Display

Mobile

CPU MCH, ICH

Gfx

Disk

Other

Power Supply Loss

Desktop

CPU & electronics power in a platform is lowMust reduce power of other platform ingredients

Page 35: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar35

Efficiency of Power Delivery

70%

75%

80%

85%

90%

0 10 20 30 40 50

I (amp)

Effic

ienc

y (%

)Workload %

Design PointLoss

Difficult to maintain high efficiency across workloadsPower supply design needs to exploit new technologies

Page 36: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar36

Typical Power Delivery System

Mobile Platform, ~50W

PowerDeliveryElectronics

Page 37: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar37

Fine-grain Power Management

Usage

Power Supply Response

TimeImprove response timeBring power supply closer

Fast Slow

Slow FastCache

Provide multiple supply voltagesFine grain Vdd and Freq scaling

Cache Core Hopping for hot spot &power density management

Page 38: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar38

Software & Low PowerProvide guidance to compilers and operating system through policyApplications

Add hooks for fine grain power management, and assist OS by providing checkpoints

Compilers

Fine grain power management algorithmControl HW through firmwareOperating System

Hardware controlDetect conditions, take action, notify OSFirmware

Fine grain power management is possible only by cooperation of the entire software stack

Page 39: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar39

Co-optimizationTechnology Circuits µArch Platform Software

SD Leakage (Ioff)Number of Vt’sGate dielectric (Tox)

Leakage avoidance, control & tolerance

Optimal pipelineThroughput oriented micro-architecture

Fine grain power management hardware & software

Supply voltage (Vdd)

Multiple Vddcircuits

Partitioning of logicError correction

Efficient power delivery & management of multiple Vdd’s

Transistor density

Tradeoff frequency for transistors

Throughput oriented micro-architecture

Software and applications to utilize logic throughput

Interconnect RC delay tolerance Interconnect cost vsefficiency of power delivery

Page 40: Digital Design for Low Power Systems - Stanford University

Digital Design for Low Power Systems/S. Borkar40

Summary• Power and energy will limit integration• Employ circuit techniques for SD leakage

avoidance, control, & tolerance• Microarchitecture for Low Power• Opportunities to save power at the platform

level—Hardware & Software• Technology, Circuits, Architecture, and

System co-optimization will be key