EE241 - Spring 2005bwrcs.eecs.berkeley.edu/Classes/icdesign/ee241_s05/... · 2005. 3. 1. · •...

1

EE241 - Spring 2005: Advanced Digital Integrated Circuits

Lecture 11: Voltage Optimization

2

Power/Energy Optimization Space

Columns: Design Time, Sleep Mode, Run Time (column groups: Constant Throughput/Latency, Variable Throughput/Latency)

Active energy
• Design time: logic design, scaled VDD, transistor sizing, multi-VDD
• Sleep mode: clock gating
• Run time: DFS, DVS

Leakage energy
• Design time: stack effects, + multi-VT
• Sleep mode: sleep T’s, multi-VDD, variable VT, + input control
• Run time: + variable VT

3

Design Time Optimization of Active Power

♦ Design Parameters
• Circuit (sizing, supply, threshold)
• Logic style (domino, pass-gate, …)
• Block topology (adder: CLA, CSA, …)
• Micro-architecture (parallel, pipelined)

[Figure: energy/op versus delay curves for topology A and topology B]

Source: B. Nikolic

4

Sizing, Supply, Threshold Optimization

Transistor sizing can yield large power savings with small delay penalties

Gate sizing

Beta-ratio adjustments

Stack resizing

IBM EinsTuner

Supply voltage affects both active and leakage energy

Threshold voltage affects primarily the leakage

5

Sizing, Supply, Threshold Optimization

There exists optimal supply + threshold for each function

At this optimum, ESw/ELk ≈ 2

Depends on logic depth, activity, function

Technology is not optimal for all blocks

Adjust during the design: multiple supplies, thresholds

Variable throughput applications: variable supplies, thresholds

6

Multi-dimensional search

[Figure: energy (normalized to Enom) versus delay (normalized to Dnom)]

• Well-defined optimization problem
• Can get pretty close to optimum with only 2 variables
• Getting the minimum speed or delay is very expensive

7

Example: W-VDD Optimization for LA Adder

♦ Reference: all paths are critical

♦ Internal energy ⇒ W more effective than Vdd

• W: E (−54%), 2Vdd: E (−27%) at dinc = 10%

[Figure: energy versus delay, marking the nominal design (D = Dmin), sizing at E (−54%), and 2Vdd at E (−27%), both at dinc = 10%]

8

Multiple Supply Voltages

Block-level supply assignment
• Higher throughput/lower latency functions are implemented in higher VDD
• Slower functions are implemented with lower VDD
• “Voltage islands,” as IBM calls them
• Separate supply grids, level conversion performed at block boundaries

Multiple supplies inside a block
• Non-critical paths moved to lower supply voltage
• Level conversion within the block
• Physical design challenging

9

Multiple Supplies in a Block

[Figure: conventional design versus CVS structure, each drawn as flip-flops (FF) around logic with the critical path marked. The CVS structure uses level-shifting F/Fs; the lower VDD portion is shaded.]

M. Takahashi, ISSCC’98, “Clustered voltage scaling”

10

Pulsed LCFF

M/S and pulsed half-latch LCFFs (MSHL, PHL)
• Smaller # of MOSFETs / clock loading
• Faster level conversion using half-latch structure
• Shorter D-Q path from pulsed circuit

[Figure: schematics of the two LCFFs, each with its level conversion stage marked]

Ishihara, ISLPED’03

11

Pulsed LCFF

Pulsed precharge LCFF (PPR)

Fast level conversion by precharge structure

Suppressed charge/discharge toggle by conditional capture

Short D-Q path

Ishihara, ISLPED’03

12

Multiple Supply Voltages

Two supply voltages per block are optimal

Optimal ratio between the supply voltages is 0.7

Level conversion is performed on the voltage boundary, using a level-converting flip-flop (LCFF)

An option is to use an asynchronous level converter

More sensitive to coupling and supply noise

13

Three VDD’s

V1 = 1.5V, VTH = 0.3V, p(t): lambda

[Figure: power reduction ratio plotted over V2 (V) and V3 (V), with supply axes running from 0.4 V to 1.4 V; a + marks a point on the plot]

From Kuroda

14

Optimum Numbers of Supplies

[Figure: three panels for the supply sets { V1, V2 }, { V1, V2, V3 }, and { V1, V2, V3, V4 }. Each plots the supply voltage ratios (V2/V1, V3/V1, V4/V1) and the power dissipation ratio (P2/P1, P3/P1, P4/P1) against V1 from 0.5 V to 1.5 V.]

• The more VDD’s, the less power, but the effect saturates.
• Power reduction effect will be decreased as VDD’s are scaled.
• Optimum V2/V1 is around 0.7.

[Hamada, CICC’01]

15

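ALU Block Diagram

[Figure: ALU block diagram. Labelled blocks: 9:1 MUXes, 5:1 MUX, 2:1 MUX, gp gen., carry gen., partial sum, logical unit, sum sel., clock gen., INV1, INV2. Labelled signals: ain, bin, ain0, carry, s0/s1, sum, clk, and sumb (long loop-back bus, 0.5pF). The legend distinguishes VDDH circuits from VDDL circuits.]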

16

Low Swing Bus & Level Converter

[Figure: schematic of the sum/sumb bus with INV1, INV2, and the domino level converter (9:1 MUX) with keeper and pc; VDDH and VDDL supplies and the ain0 and sel (VDDH) inputs are marked]

• Delay of INV1 does not increase
• INV2 is placed near 9:1 MUX to increase noise immunity
• Level conversion is done by a domino 9:1 MUX

17

Measured Results: Energy & Delay

[Figure: energy [pJ] (200 to 800) versus TCYCLE [ns] (0.6 to 1.6) at room temperature, comparing single-supply with shared well (VDDH = 1.8V); a 1.16GHz point is marked]

• VDDL = 1.4V: energy −25.3%, delay +2.8%
• VDDL = 1.2V: energy −33.3%, delay +8.3%

The dual-supply technique expands the power-delay optimization space

18

Distributing Multiple Supply Voltages

[Figure: layouts of a VDDH circuit (i1, o1) and a VDDL circuit (i2, o2) with VDDH, VDDL, and VSS rails, in two styles: conventional and shared N-well]

19

Conventional

[Figure: VDDH circuit and VDDL circuit with VDDH, VDDL, and VSS rails and N-well isolation]

(a) Dedicated row: separate VDDL rows and VDDH rows

(b) Dedicated region: a VDDH region and a VDDL region

20

Shared-Well

[Figure: VDDH circuit and VDDL circuit with VDDH, VDDL, and VSS rails in a shared N-well]

(a) Floor plan image showing VDDL circuits and VDDH circuits

21

Reducing the Supply Voltage: Concurrency versus Clock Speed

Example: reference datapath

22

Parallel Data Path

23

Pipelined Data Path

24

A Simple Data Path: Summary

25

Architecture Choices

[Figure: flexibility versus energy for architecture choices: embedded processor (lpArm), DSP (e.g. TI 320CXX), reconfigurable processors (Maia), embedded FPGA, and direct mapped hardware. Efficiency labels on the plot: .5-5 MIPS/mW, 10-100 MOPS/mW, 100-1000 MOPS/mW; overall span: factor of 100-1000. Block labels: MAC Unit, Addr Gen, µP, Prog Mem.]

Brodersen & Rabaey

26

Two Types of Processing

Fixed-rate processing (e.g. signal processing for multimedia or communications)

Stream-based computation

No advantage in obtaining throughput in excess of the real-time constraint

Variable-rate or burst-mode computation (e.g. general purpose computation)

Mostly idle (or low-load) with bursts of computation

Faster is better

27

Variable-rate processing: voltage as a design variable

[Figure: plot with axes labelled Workload and Energy; a third label reads VTmax/VDD]

Adapting voltage to workload yields cubic reduction!
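The cubic claim can be reproduced from the first-order model P ∝ f·VDD² if one additionally assumes that the achievable clock frequency scales linearly with VDD. That linear assumption ignores the threshold voltage, so it is optimistic at low supplies. A minimal sketch under those assumptions (function names are illustrative, not from the lecture):

```python
def normalized_power_fixed_vdd(workload: float) -> float:
    """Power relative to full speed when only f_CLK is lowered (VDD fixed)."""
    return workload  # P ∝ f·VDD² with VDD unchanged


def normalized_power_scaled_vdd(workload: float) -> float:
    """Power relative to full speed when VDD tracks f_CLK."""
    vdd = workload  # assumption: VDD scales linearly with frequency
    return workload * vdd ** 2


for w in (1.0, 0.5, 0.25):
    print(w, normalized_power_fixed_vdd(w), normalized_power_scaled_vdd(w))
```

At half the workload the fixed-VDD design still draws half the power, while the voltage-scaled design draws one eighth in this idealized model.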

28

Common Design Approach: Fixed VDD

Compute ASAP:
• Always high throughput
• Excess throughput

Clock Frequency Reduction:
• fCLK reduced
• Energy/operation remains unchanged… while throughput scales down with fCLK

[Figure: delivered throughput versus time for each case]

29

Dynamic Voltage Scaling (DVS)

[Figure: delivered throughput versus time; fCLK and VDD are varied to dynamically adapt]

• Dynamically scale energy/operation with throughput.
• Always minimize speed → minimize average energy/operation.
• Extend battery life up to 10x with the exact same hardware!

Burd, ISSCC’00
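To see why running slower at a lower VDD reduces energy per operation, the sketch below finds the lowest supply that still meets a required relative speed and reports the resulting energy ratio (energy/operation ∝ VDD²). The alpha-power-law delay model and the values VT = 0.8 V and α = 1.5 are assumptions chosen for illustration; only the 3.8 V maximum is taken from the measured processor later in this lecture.

```python
def relative_speed(vdd, vt=0.8, alpha=1.5, vdd_max=3.8):
    """Max clock rate at vdd relative to vdd_max (assumed alpha-power law)."""
    def rate(v):
        return (v - vt) ** alpha / v
    return rate(vdd) / rate(vdd_max)


def min_vdd_for_speed(speed, vt=0.8, alpha=1.5, vdd_max=3.8):
    """Bisection for the lowest VDD that delivers `speed` (0 < speed <= 1)."""
    lo, hi = vt, vdd_max
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if relative_speed(mid, vt, alpha, vdd_max) < speed:
            lo = mid
        else:
            hi = mid
    return hi


def energy_per_op_ratio(speed, vdd_max=3.8):
    """Energy/operation relative to full speed, using E ∝ VDD²."""
    return (min_vdd_for_speed(speed, vdd_max=vdd_max) / vdd_max) ** 2


for s in (1.0, 0.5, 0.1):
    print(s, round(min_vdd_for_speed(s), 2), round(energy_per_op_ratio(s), 3))
```

A fixed-VDD design would return a ratio of 1.0 at every speed, which is the "energy/operation remains unchanged" case from the previous slide.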

30

Adaptive Supply Voltages

31

Variable Algorithmic Workload

32

Typical MPEG IDCT Histogram

33

Dynamic Power Reduction Through Software-Hardware Cooperation

[Figure: software reports the required speed to a controller, which sets the clock & VDD of the processor hardware. Plot: normalized power P ∝ ƒV² (0 to 1) versus required speed ∝ ƒ (0 to 1); the curve is super-linear.]

If you don’t need to hustle, relax and save power.

S. Lee et al., DAC, June 2000

34

Processor: Converter Loop Sets VDD, fCLK

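[Figure: block diagram of the loop. A register set by the O.S. holds the desired frequency FDES (7 bits); a counter and latch, with reset RST and reference f1MHz, measure the ring oscillator output as FMEAS; a summing node forms FERR, which feeds the digital loop filter; the filter drives PENAB and NENAB of the buck converter (L, CDD, supplied from VBAT), whose output VDD powers the processor (IDD) and the ring oscillator that generates fCLK.]

• Feedback loop sets VDD so that FERR → 0.
• Ring oscillator delay-matched to CPU critical paths.
• Custom loop implementation → Can optimize CDD.

Burd, ISSCC’00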
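The behaviour of the loop (adjust VDD until the measured frequency matches the desired one) can be sketched with a simple discrete-time integrator. Everything numeric here is hypothetical: the ring-oscillator model, its constants, the loop gain, and the 1.2 V to 3.8 V clamp are illustration values, and the real loop works on quantized counter outputs and a switching converter rather than a continuous update.

```python
def ring_osc_mhz(vdd, vt=0.8, k=40.0, alpha=1.5):
    """Hypothetical ring-oscillator frequency (MHz) as a function of VDD."""
    return k * max(vdd - vt, 0.0) ** alpha / vdd


def settle_vdd(f_des_mhz, vdd=1.2, gain=0.01, steps=500,
               vdd_min=1.2, vdd_max=3.8):
    """Integrate the frequency error F_ERR = F_DES - F_MEAS into VDD."""
    for _ in range(steps):
        f_err = f_des_mhz - ring_osc_mhz(vdd)
        vdd = min(vdd_max, max(vdd_min, vdd + gain * f_err))
    return vdd


vdd_final = settle_vdd(30.0)
print(round(vdd_final, 3), round(ring_osc_mhz(vdd_final), 3))
```

Because the oscillator is assumed to track the critical paths, driving FERR to zero also sets the lowest VDD at which the processor meets the requested clock rate.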

35

Measured System Performance & Energy

[Figure: Dhrystone 2.1 MIPS (0 to 100) versus energy (mW/MIPS, 0 to 6), comparing static VDD with dynamic VDD]

• 85 MIPS @ 5.6 mW/MIPS (3.8V)
• 6 MIPS @ 0.54 mW/MIPS (1.2V)
• Dynamic operation can increase energy efficiency > 10x.

Burd, ISSCC’00
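A quick arithmetic check of the "> 10x" figure, using only the two operating points quoted on this slide (variable names are illustrative):

```python
# (MIPS, mW/MIPS) at each supply voltage, as quoted on the slide
points = {"3.8V": (85, 5.6), "1.2V": (6, 0.54)}

# Ratio of energy per instruction between the two operating points
efficiency_gain = points["3.8V"][1] / points["1.2V"][1]

# Total power at each point = throughput x energy per instruction
power_mw = {v: mips * e for v, (mips, e) in points.items()}

print(f"{efficiency_gain:.1f}x")  # about 10.4x
print(power_mw)
```

The same numbers imply roughly 476 mW at 3.8 V and about 3.2 mW at 1.2 V.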

36

DVS for Real Applications

[Figure: two traces of VDD (∝ fCLK), scale marks 1.0 and 3.5, at 200ms/div. Compute ASAP: max. speed alternating with idle. With voltage scheduler: low speed & idle, with increased speed for shorter process deadlines.]

• User-interface process: very bursty computation.
• High-latency computation done @ low speed/energy.

Burd, ISSCC’00

37

Measured Benchmark Energy Consumption

Benchmarks (normalized energy):

Algorithm       MPEG     UI       AUDIO
Compute ASAP    100 %    100 %    100 %
Optimal         67 %     25 %     16 %
ZERO            89 %     30 %     22 %

• ZERO is the implemented heuristic algorithm.
• Difficult to optimize compute-intensive code (MPEG).
• Big drop in energy when less speed required (3.3-4.5x).

Burd, ISSCC’00
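The 3.3-4.5x range follows directly from the normalized energies reported on this slide. A small sketch that recomputes the savings factor relative to compute-ASAP (the dictionary layout and function name are illustrative):

```python
# Normalized energy in percent, per algorithm and benchmark (from the slide)
normalized_energy = {
    "Compute ASAP": {"MPEG": 100, "UI": 100, "AUDIO": 100},
    "Optimal": {"MPEG": 67, "UI": 25, "AUDIO": 16},
    "ZERO": {"MPEG": 89, "UI": 30, "AUDIO": 22},
}


def savings_factor(algorithm: str, benchmark: str) -> float:
    """Energy of compute-ASAP divided by energy of the given algorithm."""
    baseline = normalized_energy["Compute ASAP"][benchmark]
    return baseline / normalized_energy[algorithm][benchmark]


for bench in ("MPEG", "UI", "AUDIO"):
    print(bench, round(savings_factor("ZERO", bench), 1))
```

For ZERO this gives about 1.1x on MPEG, 3.3x on UI, and 4.5x on AUDIO.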

38

Recent DVS-Enabled Microprocessors

Xscale: 180nm 1.8V bulk-CMOS [Intel00]
0.7-1.75V, 200-1000MHz, 55-1500mW (typ)
Max. Energy Efficiency: ~23 MIPS/mW

PowerPC: 180nm 1.8V bulk-CMOS [Nowka02]
0.9-1.95V, 11-380MHz, 53-500mW (typ)
Max. Energy Efficiency: ~11 MIPS/mW

Crusoe: 130nm 1.5V bulk-CMOS [Transmeta03]
0.8-1.3V, 300-1000MHz, 0.85-7.5W (peak)

Pentium M: 130nm 1.5V bulk-CMOS [Intel03]
0.95-1.5V, 600-1600MHz, 4.2-31W (peak)

39

VDD-Hopping

[Figure: normalized power (0 to 1) for MPEG-4 encoding versus # of frequency levels (1, 2, 3, 8); transition time between ƒ levels = 200µs]

[Figure: timeline of application slices #n and #n+1, marking where the n-th slice finished and the next milestone]

Application slicing and software feedback guarantee real-time operation.

Two hopping levels are sufficient.
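A rough way to see why two levels already capture much of the benefit: hop between a low and a high frequency so that the time-averaged speed equals the required speed, and average the power of the two levels. The sketch assumes P ∝ f³ (VDD proportional to f), levels at 0.5 and 1.0 of full speed, and zero transition cost; all of these are illustration choices, not values from the slide.

```python
def hop_schedule(required, f_low=0.5, f_high=1.0):
    """Fraction of time spent at f_high so the average speed is `required`."""
    if required <= f_low:
        return 0.0  # simplification: stay at f_low the whole time
    return (required - f_low) / (f_high - f_low)


def average_power(required, f_low=0.5, f_high=1.0):
    """Time-averaged normalized power of the two-level schedule."""
    x = hop_schedule(required, f_low, f_high)

    def power(f):
        return f ** 3  # assumption: VDD ∝ f, so P ∝ f·VDD² = f³

    return x * power(f_high) + (1 - x) * power(f_low)


required = 0.75
print(hop_schedule(required), average_power(required), required ** 3)
```

For a required speed of 0.75 the two-level schedule averages 0.5625 of full power, compared with 0.421875 for an ideal continuously variable supply and 0.75 for running at full speed and idling.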

40

Challenge: Design over Wide Range of Voltages

• Circuit design constraints. (Functional verification)

• Circuit delay variation. (Timing verification)

• Noise margin reduction. (Power grid, coupling)

• Delay sensitivity. (Local power distribution)

Design verification complexity similar to high-performance processor design @ fixed VDD

41

Relative Delay Variation

[Figure: percent delay variation (−20 to +40), relative to the ring oscillator, versus VDD from VT to 4VT]

Four extreme cases of critical paths:
• Gate
• Interconnect
• Diffusion
• Series

All vary monotonically with VDD.

• Timing verification only needed at min. & max. VDD.
• Should also consider Vdd variations

Burd, ISSCC’00

42

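Delay Sensitivity

[Figure: normalized ∂Delay/Delay (0 to 1) versus VDD from VT to 4VT]

\[
\frac{\partial \mathrm{Delay}}{\mathrm{Delay}} \approx \frac{\partial \mathrm{Delay}}{\partial V_{DD}} \cdot \frac{\Delta V_{DD}}{\mathrm{Delay}(V_{DD})}, \qquad \Delta V_{DD} = I(V_{DD}) \cdot R
\]

• Design of local power grid (for timing constraints) only needs to consider VDD ≈ 2VT.

Burd, ISSCC’00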
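The sensitivity expression on this slide can be evaluated numerically once a delay model and a current model are chosen. The sketch below assumes an alpha-power-law delay, Delay ∝ VDD/(VDD − VT)^α with α = 1.3, and a supply current proportional to VDD/Delay (switched charge per cycle); voltages are in units of VT and R is in arbitrary units. These are illustration choices, not the models behind the measured curve. With these particular values the magnitude peaks a little below 2VT inside the VT to 4VT range; other parameter choices move or remove the peak.

```python
def delay(vdd, vt=1.0, alpha=1.3):
    """Assumed alpha-power-law gate delay (arbitrary units)."""
    return vdd / (vdd - vt) ** alpha


def relative_delay_increase(vdd, r=1.0, vt=1.0, alpha=1.3, dv=1e-6):
    """|dDelay/dVDD| * dVDD / Delay(VDD), with dVDD = I(VDD) * R."""
    slope = (delay(vdd + dv, vt, alpha) - delay(vdd - dv, vt, alpha)) / (2 * dv)
    current = vdd / delay(vdd, vt, alpha)  # assumed I ∝ C·VDD·f, f ∝ 1/Delay
    delta_vdd = current * r                # magnitude of the IR drop
    return abs(slope) * delta_vdd / delay(vdd, vt, alpha)


# Sweep VDD from 1.05 VT to 4 VT and locate the largest sensitivity
grid = [1.05 + 0.01 * i for i in range(296)]
peak = max(grid, key=relative_delay_increase)
print(round(peak, 2))
```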

43

Design for Dynamically Varying VDD

Circuits continue to properly operate as VDD changes

• Static CMOS logic.
• Ring oscillator.
• Dynamic logic (& tri-state busses).
• Sense amp (& memory cell).

Max. allowed |dVDD/dt| → Min. CDD = 100nF (0.6µm)

44

Static CMOS Logic

[Figure: inverter with Vin = 0 and Vout = VDD; the output node (load CL) connects to VDD through rds of the PMOS]

• Static CMOS robustly operates with varying VDD.
• max. τ = 4ns
• 0.6µm CMOS: |dVDD/dt| < 200V/µs

45

Ring Oscillator

• Output fCLK instantaneously adapts to new VDD.

[Figure: fCLK and VDD waveforms, 0 to 4 volts over 60 to 260 ns, simulated with dVDD/dt = 20V/µs]

46

Dynamic Logic

[Figure: dynamic gate with supply VDD, input Vin, clock clk, and output Vout; waveforms of Vout and VDD versus time with clk = 1, for supply changes of +∆VDD and −∆VDD]

Errors
• False logic low: ∆VDD > VTP
• Latch-up: ∆VDD > Vbe

• Cannot gate clock in evaluation state.
• Tri-state busses fail similarly → Use hold circuit.

0.6µm CMOS: |dVDD/dt| < 20V/µs

47

System-Level Issues: Reducing Waste

• Locality of reference
• Demand-driven / Data-driven computation
• Application-specific processing
• Preservation of data correlations
• Distributed processing

Avoid switching any capacitance unnecessarily

Sharing increases capacitance

48

Clock gating

Requires careful skew control ...

Fortunately well handled in today’s EDA tools

49

Clock-gating efficiently reduces power

MPEG4 decoder

[Figure: chip blocks DSP/HIF, DEU, MIF, VDE, and 896Kb SRAM; bar chart of power [mW]]

• Without clock gating: 30.6mW
• With clock gating: 8.5mW
• 90% of F/F’s were clock-gated.
• 70% power reduction by clock-gating alone.

Courtesy M. Ohashi, Matsushita, ISSCC 2002, Paper #22.1

50

Pre-computation

Inputs xi … xn are not applied if pre-computing holds

Other options:
• guarded evaluation
• set output directly

51

Circuit-level Activity Reduction

52

Circuit-Level Activity Encoding

Conditional inversion coding for interconnect
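One well-known form of conditional inversion is bus-invert coding: if sending the next word would toggle more than half of the bus wires, send its complement instead and raise an extra "invert" wire. The slide gives only the title, so the following is a sketch of that general technique and not necessarily the exact scheme shown in lecture. The 8-bit width and the all-zero initial bus state are assumptions.

```python
def bus_invert_encode(words, width=8):
    """Return (bus_value, invert_flag) pairs for a sequence of words."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial state of the bus wires
    encoded = []
    for word in words:
        toggles = bin((word ^ prev) & mask).count("1")
        if toggles > width // 2:
            bus, invert = ~word & mask, 1  # complement flips fewer wires
        else:
            bus, invert = word & mask, 0
        encoded.append((bus, invert))
        prev = bus
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words)
print(encoded, bus_invert_decode(encoded) == words)
```

With this encoding no transfer toggles more than width/2 data wires, at the cost of one extra wire and the inversion logic at both ends.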

53

Eliminating Redundant Computations

54

Eliminating Redundant Computations

55

Number Representation

56

Number Representation - Accumulator Example

57

Two’s Complement vs Sign-Magnitude
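The usual argument for sign-magnitude is that a small signal crossing zero toggles only the sign bit and a few low-order bits, whereas two's complement toggles all the sign-extension bits. The slide content is not preserved here, so the sketch below simply counts bit transitions for a made-up sample sequence in both representations (8-bit width and the samples are assumptions).

```python
def twos_complement(value, width=8):
    """Two's-complement code of a signed integer."""
    return value & ((1 << width) - 1)


def sign_magnitude(value, width=8):
    """Sign-magnitude code; assumes abs(value) < 2**(width - 1)."""
    sign = 1 << (width - 1) if value < 0 else 0
    return sign | abs(value)


def total_toggles(samples, encode, width=8):
    """Count bit transitions between consecutive encoded samples."""
    codes = [encode(s, width) for s in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


samples = [1, -1, 2, -2, 1, -1]  # small signal oscillating around zero
print(total_toggles(samples, twos_complement),
      total_toggles(samples, sign_magnitude))
```

For this sequence two's complement toggles 35 bits and sign-magnitude toggles 9. The advantage shrinks for signals that rarely change sign.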

58

Reducing Activity by Reordering Inputs

59

Resource Sharing Can Increase Activity

60

Application-Specific Processing

[Figure: datapath with two Add blocks and a Register]

Application Specific Processing Reduces “Implementation Overhead”

61

The Architectural Trade-off

Energy

Architecture              64-point FFT                  16-State Viterbi Decoder
                          Energy per Transform (nJ)     Energy per Decoded bit (nJ)
High-Performance DSP      1700                          108
Low-Power DSP             436                           19.6
FPGA                      683                           5.5
Direct-Mapped Hardware    1.78                          0.022

Area

Architecture              64-point FFT                  16-State Viterbi Decoder
                          Transforms per second per     Decode rate per unit area
                          unit area (Trans/ms/mm2)      (kb/s/mm2)
High-Performance DSP      10                            150
Low-Power DSP             4.3                           50
FPGA                      1.8                           100
Direct-Mapped Hardware    2,200                         200,000

(numbers taken from vendor-published benchmarks)

Orders of magnitude lower efficiency even for an optimized processor architecture
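The "orders of magnitude" statement can be checked against the energy numbers on this slide. The pairing of values with architectures below is a reconstruction from the flattened slide text (each list read in the order high-performance DSP, low-power DSP, FPGA, direct-mapped hardware), so treat it as an assumption; the names in the code are illustrative.

```python
import math

# Energy in nJ per 64-point FFT transform and per decoded Viterbi bit
energy_nj = {
    "High-Performance DSP": {"fft": 1700, "viterbi": 108},
    "Low-Power DSP": {"fft": 436, "viterbi": 19.6},
    "FPGA": {"fft": 683, "viterbi": 5.5},
    "Direct-Mapped Hardware": {"fft": 1.78, "viterbi": 0.022},
}


def orders_of_magnitude(architecture: str, kernel: str) -> float:
    """log10 of the energy ratio relative to direct-mapped hardware."""
    reference = energy_nj["Direct-Mapped Hardware"][kernel]
    return math.log10(energy_nj[architecture][kernel] / reference)


for arch in ("High-Performance DSP", "Low-Power DSP", "FPGA"):
    print(arch,
          round(orders_of_magnitude(arch, "fft"), 1),
          round(orders_of_magnitude(arch, "viterbi"), 1))
```

Under this pairing every programmable option is between two and four orders of magnitude less energy-efficient than the direct-mapped implementation.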

62

Towards Heterogeneous Architectures for SOC

Xilinx Virtex Pro

Janus Chip - ST Micro and Parades

Berkeley Pleiades

63

Reducing Active Dissipation: Summary

• Voltage as a design variable
  Match voltage and frequency to required performance

• Minimize waste (or reduce switching capacitance)
  Match computation and architecture
  Preserve locality inherent in algorithm
  Exploit signal statistics
  Energy (performance) on demand

More easily accomplished in application-specific than programmable devices
