ELEN 468 Advanced Logic Design, Lecture 29: Low Power Design



Power Dissipation

[Figure: power (Watts, log scale from 0.1 to 100) versus year (1971 to 2000) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6.]

Power increases despite Vdd decrease

Courtesy, Intel


Power Density

[Figure: power density (W/cm², log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle.]

Courtesy, Intel


Why Power Increased

Growing die size, fast frequency scaling

[Figure: clock frequency (MHz, log scale from 10 to 10000) versus year, 1985 to 2005.]


Gate Power Dissipation

• Leakage power
• Dynamic power
• Short circuit power


Dynamic Power

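• Occurs at each switching event

$$P_d = C_L \cdot V_{dd}^2 \cdot f_p$$

• $f_p$: switching frequency

[Figure: two inverter schematics (labels: Vdd, out) with the conducting transistor marked Saturation and Linear.]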
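As a quick numeric check of the dynamic power relation Pd = CL · Vdd² · fp, the following Python sketch evaluates it for two supply voltages. The load capacitance, supply voltages, and switching frequency are illustrative assumptions, not values from the lecture.

```python
def dynamic_power(c_load, vdd, f_switch):
    """Dynamic power Pd = CL * Vdd^2 * fp, in watts (all inputs in SI units)."""
    return c_load * vdd ** 2 * f_switch


# Assumed example values: 1 nF of total switched capacitance at 100 MHz.
C_LOAD = 1e-9      # farads (assumption)
F_SWITCH = 100e6   # hertz (assumption)

for vdd in (1.5, 1.2):
    print(f"Vdd = {vdd} V -> Pd = {dynamic_power(C_LOAD, vdd, F_SWITCH):.3f} W")
```

Because Vdd enters squared, lowering it from 1.5 V to 1.2 V at the same frequency cuts the dynamic power by 36% in this example.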


Leakage Power

• Static
• Leakage current = a · Vdd
• Leakage current = b / Vt
• Killer to CMOS technology

[Figure: two inverter schematics (labels: Vdd, out, Saturation, Linear) with the leakage path marked in each.]


Short Circuit Power

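• During switching, there is a short moment when both PMOS and NMOS are partially on

$$P_s = Q \cdot (V_{dd} - V_t)^3 \cdot t_r \cdot f_p$$

• $t_r$: rise time

[Figure: two inverter schematics (labels: Vdd, out), one for a rising input and one for a falling input.]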


Where Does Power Go?

[Figure: two stacked-bar charts of power percentages (0% to 100%), split into core transistor leakage, gate leakage, cache leakage, and active power.]

• Scalable X86 CPU Design for 90nm: low VT devices are <1% of total non-memory transistor width [J. Schultz and C. Webb, ISSCC 2004]
• Total chip power based on ITRS roadmap: in 2004, we are just breaking even [Kim, et al, Computer 2003]


Energy – Performance Space

Every design is a point on a 2-D plane

[Figure: 2-D plane with axes Performance and Energy.]


Low Power Design

Reduce dynamic power
• Switching activity: clock gating, sleep mode
• C: small transistors (esp. on clock), short wires
• VDD: lowest suitable voltage
• f: lowest suitable frequency

Reduce static power
• Selectively use low Vt devices
• Power gating, MTCMOS
• Stacked devices
• Body bias


Clock Gating

Gate off clock to idle functional units
• e.g., floating point units
• need logic to generate disable signal
  • increases complexity of control logic
  • consumes power
  • timing critical to avoid clock glitches at OR gate output
• additional gate delay on clock signal
  • gating OR gate can replace a buffer in the clock distribution tree

[Figure: register (Reg) feeding a functional unit, with the clock and disable signals combined to drive the register clock.]
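The OR-based gating described on this slide can be sketched as a small behavioral model in Python. The waveforms below are hypothetical, the register is assumed to be rising-edge triggered, and `gated_clock` and `rising_edges` are illustrative helper names, not part of the lecture.

```python
def gated_clock(clock, disable):
    """OR-based gating: the register sees clock OR disable, so it is held high while disabled."""
    return [c | d for c, d in zip(clock, disable)]


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that would clock the register."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Eight clock cycles, sampled twice per cycle (low phase, then high phase).
clock = [0, 1] * 8
# Assumed disable waveform: it changes only at samples where the clock is high,
# so the OR gate output never glitches.
disable = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(rising_edges(clock))                        # edges without gating
print(rising_edges(gated_clock(clock, disable)))  # edges seen by the gated register
```

In this example the ungated clock has 8 rising edges and the gated register sees only 4, so half of the register clocking activity is suppressed while the unit is idle.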


Active Power Reduction - Supply Voltage Reduction

Static

Pros:
• Always active in saving

Cons:
• Additional power delivery network
• Needs special care of interface between power domains
• Signals close to Vt: excessive leakage and reduced noise margins

Dynamic

Adjusting operation voltage and frequency to performance requirements:
• High performance: high Vdd & frequency
• Power saving: low Vdd & frequency

Pros:
• Doesn’t limit performance

Cons:
• Penalty of transition between different power states can be high (in performance and power)
• Additional control logic

[Figure: fast and slow blocks assigned to high and low supply voltages.]


Voltage Islands (Multi-Vdd)

• Allow both macro and cell voltage assignment
• Allow different voltage islands in the same circuit row
• Lift unnatural layout restrictions
• Minimal placement disturbance

[Figure: layouts with Vddh and Vddl regions, labeled Lackey+ ICCAD’02, Usami+ JSSC’98, and GVI DAC’03.]


Level Converter

Interface circuit when Vddl drives Vddh to avoid leakage

[Figure: a VddL gate driving a VddH gate, with one transistor marked "weak on!".]

[Figure: conventional dual supply level converter (Vddh, Vddl, IN, OUT) and new single supply level converter (Vddh, IN, OUT).]


Adjacency Metrics for Clustering

• Logic adjacency metric (LAM): Vddl fanin cone of level shifter without going through Vddh

[Figure: example netlists with level converters LC1, LC2, and LC3 between Vddl and Vddh cells.]

• Physical adjacency metric (PAM): for each candidate Vddl cell, compute total size of its neighbor Vddl cells
• LAM to guide logic aware voltage assignment
• PAM to guide placement aware voltage re-assignment


Level Converter Optimizations

Logic replacement (or gate sizing)

[Figure: example with blocks ZMUX1, ZMUX2, and DEC, inputs A and B, and level converters (LC) before and after optimization.]

LC/Buffer co-optimization


Placement to Form Voltage Islands with Power Grid Co-design

• Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand
• Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies

[Figure: power grids on demand, with alternating Vddl and Vddh rails.]


Example of Voltage Islands

• Vddl = 1.2 V
• Vddh = 1.5 V
• IBM Cu11, 0.13 µm, 400 MHz (courtesy IBM)
• No timing degradation, no area increase!


Dynamic Frequency and Voltage Scaling

Always run at the lowest supply voltage that meets the timing constraints
• DFS (dynamic frequency scaling) saves only power
• DVS (dynamic voltage scaling) + DFS saves both energy and power

A DVS+DFS system requires the following
• A programmable clock generator (PLL)
  • PLL from 200 MHz to 700 MHz in increments of 33 MHz
• A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
  • 32 levels of VDD from 1.1 V to 1.6 V
• An operating system that sets the required frequency + supply voltage to meet the task completion deadlines
  • heavier load: ramp up VDD, when stable speed up clock
  • lighter load: slow down clock, when PLL locks onto new rate, ramp down VDD
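A minimal Python sketch of how an operating system might pick an operating point from the PLL and VDD ranges quoted on this slide. The linear frequency-to-voltage map in `min_vdd_for`, the cycle count, and the deadline are assumptions made for illustration; a real system would use characterized voltage/frequency pairs.

```python
import math

FREQS_MHZ = list(range(200, 701, 33))                  # 200, 233, ..., 695 MHz
VDD_LEVELS = [1.1 + i * 0.5 / 31 for i in range(32)]   # 32 levels, 1.1 V to 1.6 V


def min_vdd_for(freq_mhz):
    """Hypothetical linear frequency-to-voltage map, rounded up to an available level."""
    frac = (freq_mhz - FREQS_MHZ[0]) / (FREQS_MHZ[-1] - FREQS_MHZ[0])
    index = math.ceil(frac * 31 - 1e-9)
    return VDD_LEVELS[index]


def pick_operating_point(cycles, deadline_s):
    """Lowest PLL frequency (and its VDD) that finishes `cycles` before the deadline."""
    needed_mhz = cycles / deadline_s / 1e6
    for f in FREQS_MHZ:
        if f >= needed_mhz:
            return f, min_vdd_for(f)
    raise ValueError("deadline cannot be met even at the highest frequency")


def transition_steps(old_f, new_f):
    """Ordering from the slide: raise VDD before speeding up, lower it after slowing down."""
    if new_f > old_f:
        return ["ramp up VDD", "wait until VDD is stable", "speed up clock"]
    if new_f < old_f:
        return ["slow down clock", "wait for PLL lock", "ramp down VDD"]
    return []


f, vdd = pick_operating_point(cycles=3_000_000, deadline_s=0.01)
print(f, round(vdd, 3), transition_steps(200, f))
```

The ordering in `transition_steps` keeps the supply at or above what the current clock rate needs at every moment of the transition.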


Leakage Reduction Techniques

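[Figure: four leakage reduction techniques]
• Stack effect: stacked devices (widths Wu and Wl) below the pullup (Vdd), with intermediate node Vx
• Dual Vt partitioning: high Vt devices and low Vt devices
• Variable threshold (VTCMOS): Vnwell ≥ Vdd, Vpwell ≤ 0
• Multi-threshold (MTCMOS): low Vt logic between virtual Vdd and virtual Gnd, with HVT sleep transistors connecting to the supply rails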


Natural Transistor Stacks

• Reduce the leakage by stacking the devices

How?
• Reduced Vds
• Negative Vgs
• Negative Vbs


Design with Dual Vth

Dual Vth design
• Two flavors of transistors: slow (high Vth), fast (low Vth)
• Low Vth devices are faster, but have ≈10X leakage

Dual Vth evaluation


Impacts of Variable VT

• Reducing the VT increases the sub-threshold leakage current (exponentially)

$$V_T = V_{T0} + \gamma \left( \sqrt{\left| -2\phi_F + V_{SB} \right|} - \sqrt{\left| -2\phi_F \right|} \right)$$

where $V_{T0}$ is the threshold voltage at $V_{SB} = 0$, $V_{SB}$ is the source-bulk (substrate) voltage, $\gamma$ is the body-effect coefficient, and $\phi_F$ is the Fermi potential

• But reducing VT decreases gate delay (increases performance)
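To get a feel for the body-effect relation, the Python sketch below evaluates VT = VT0 + γ(√|−2φF + VSB| − √|−2φF|) over a range of substrate biases. The parameter values (VT0 = 0.45 V, γ = 0.4 V^0.5, φF = −0.3 V) are assumed textbook-style NMOS numbers, not data from the lecture, and `v_sb` is taken as the source voltage minus the bulk voltage, so pulling the substrate below the source gives a positive `v_sb`.

```python
import math


def threshold_voltage(v_sb, vt0=0.45, gamma=0.4, phi_f=-0.3):
    """VT = VT0 + gamma * (sqrt(|-2*phi_F + VSB|) - sqrt(|-2*phi_F|)).

    vt0 (V), gamma (V^0.5) and phi_f (V) are assumed NMOS values, not lecture data.
    """
    return vt0 + gamma * (math.sqrt(abs(-2 * phi_f + v_sb)) - math.sqrt(abs(-2 * phi_f)))


# From 0 V (substrate tied to the source) to 2.5 V of reverse substrate bias.
for v_sb in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(f"VSB = {v_sb:.1f} V -> VT = {threshold_voltage(v_sb):.3f} V")
```

With these assumed parameters VT rises from 0.45 V with no bias to roughly 0.84 V at 2.5 V of reverse bias.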


Variable VT through Body Bias

[Figure: VT (V), from 0.4 to 0.9, versus substrate bias (V), from -2.5 to 0 (axis labeled VSB); bias terminals VSB,p and VSB,n.]

• For NMOS, the substrate is normally tied to ground (VSB = 0)
• A negative bias on the substrate (bulk below source) causes VT to increase
• Adjusting the substrate bias at runtime is called adaptive body-biasing (ABB) or dynamic threshold scaling (DTS)
• Requires a triple well fab process


Forward/Reverse Body Biasing

RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.

Disadvantages:
• Increases PN junction reverse leakage
• Technology scaling worsens short channel effects and weakens the Vth modulation capability

FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.

Disadvantages:
• Larger junction capacitance
• High body effect for stacked devices


Implementation of Dynamic Vth Scaling (DTS)

• The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
• When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.

How?
• When the critical path replica frequency is less than the reference CLK, adjust the bias to decrease Vth.
• Otherwise adjust the bias to increase Vth.
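The feedback rule above (compare the critical path replica against the reference clock, then move the bias) can be sketched as a simple step controller in Python. The step size, bias range, and the linear replica model in `replica_frequency` are hypothetical and chosen only to make the loop runnable.

```python
def adjust_body_bias(replica_mhz, ref_clk_mhz, bias_mv, step_mv=100, max_reverse_mv=2000):
    """One control step; returns the new reverse body bias in mV (0 mV = NBB)."""
    if replica_mhz < ref_clk_mhz:
        # Replica too slow: apply less reverse bias, which lowers Vth and speeds the path up.
        return max(0, bias_mv - step_mv)
    # Replica meets the clock: apply more reverse bias, which raises Vth and cuts leakage.
    return min(max_reverse_mv, bias_mv + step_mv)


def replica_frequency(bias_mv):
    """Hypothetical replica model: 500 MHz at NBB, losing 1 MHz per 20 mV of reverse bias."""
    return 500 - bias_mv // 20


bias = 0
for _ in range(20):
    bias = adjust_body_bias(replica_frequency(bias), ref_clk_mhz=450, bias_mv=bias)
print(bias)
```

In this toy model the loop climbs to the largest reverse bias at which the replica still meets the 450 MHz reference and then toggles by one step around that point.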

Results:


Power Gating Using Sleep Transistors

• Can reduce leakage by gating the supply rails when the circuit is in sleep mode
  • in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
  • in sleep mode, sleep = 1 and the transistor stack effect reduces leakage by orders of magnitude
• Or can eliminate leakage by switching off the power supply (but lose the memory state)


Example of Power Gating

[Figure: layout with embedded power switches, rows of standard cells, and power switch control signals.]

• Can reduce power 1000X
• Smaller voltage swing (IR drop on sleep transistors)
• Lower performance
• Increased noise coupling
• Local power grid design


Power Dissipation on Variation Tolerance

Conventional variation tolerance
• Using large timing safety margin
• Implies aggressive timing target
• Greater power dissipation

Observation
• Near-worst-case variations occur rarely
• Safety margin is applied continuously to guard against the small chance of variations
• Poor power efficiency


Question..

Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?

How much can we speed up the circuit while keeping the error rate in a manageable range?


Fault tolerant system

Begin with reference values

Introduce redundancy
• Hardware: Triple Modular Redundancy
• Time: repeated processing
• Information: coding
• Software: various algorithms

What about delay faults? How do we detect (and possibly correct) errors?


Delay fault tolerant system

Delay fault detection
• Redundant timing margin in signal path
  • +: second sampling at an increased clock period
  • -: decreased delay of the reference signal between pipeline registers

[Figure: timing diagram with path delays t1 and t2, clock period t, the timing margin, and the 2nd sampling point.]
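The second-sampling idea can be illustrated with a small behavioral model in Python: a signal is sampled once at the clock period and again after the extra timing margin, and a mismatch flags a delay fault. The period, margin, and path delays are assumed numbers, and `sample` and `detect_delay_fault` are illustrative names rather than anything defined in the lecture.

```python
def sample(arrival_time, sample_time, old_value, new_value):
    """Value captured at sample_time for a signal that switches at arrival_time."""
    return new_value if arrival_time <= sample_time else old_value


def detect_delay_fault(arrival_time, period, margin, old_value=0, new_value=1):
    """Compare the normal sample at `period` with a second sample at `period + margin`."""
    first = sample(arrival_time, period, old_value, new_value)
    second = sample(arrival_time, period + margin, old_value, new_value)
    return first != second


PERIOD, MARGIN = 10.0, 2.0          # assumed values in nanoseconds
for t_path in (8.0, 11.0, 13.0):    # on time, late but within the margin, beyond the margin
    print(t_path, detect_delay_fault(t_path, PERIOD, MARGIN))
```

Only the 11 ns path is flagged: the 8 ns path is on time, and the 13 ns path arrives after both samples, which shows that detection is limited to delays that fall inside the timing margin.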


Delay fault tolerant system

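Delay fault removal
• Reference signal (SR)
• Reprocessing at slower clock period (t’)

[Figure: timing diagram with path delays t1 and t2, the timing margin, clock period t, reference signal SR, and slower clock period t’.]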


Delay fault tolerant system: Example

RAZOR*
• Dynamic voltage scaling design
• Reduce supply voltage down to a manageable failure rate

[Figure: timing diagram with path delays t1 and t2 and the timing margin.]

* D. Ernst et al., "Razor: a low-power pipeline based on circuit-level timing speculation," 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003


Delay fault tolerant system: Example

RAZOR continued
• Implemented at 120 MHz clock frequency
• But for high speed circuits…
  • Managing two clocks
  • Minimum path delay constraint
  • Delay of MUX


Delay fault tolerant system: Example

Parity coding
• Parity generation based on output correlation
• Avoid well-correlated outputs for pairing

[Figure: timing diagram showing the timing margin within clock period t.]


Now.. Let’s look at delay distribution(s)


Clock speed achieved for contained error rate


Delay fault tolerant system: Example

Parity coding (continued)
• Complexity
• Example: C499 ISCAS benchmark


Recently Proposed Design

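Fault detection
• Partial hardware and time redundancy

[Figure: pipeline stage between latches Ln and Ln+1, with gates g0 to gm split into FL and BL, plus BL' feeding L'n+1; timing diagram showing the timing margin within clock period t.]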


Proposed Design

Fault removal
• Pipeline flush & reprocessing at lower clock

[Figure: pipeline stage between latches Ln and Ln+1, with gates g0 to gm split into FL and BL, plus BL' feeding L'n+1.]


Proposed Design

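Division of FL and BL

[Figure: block diagram from PI through FL, a latch, and BL to PO, with a second BL and a comparison block (CP) producing the "Error?" signal.]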


Proposed Design

Division of FL and BL: considerations
• The effects on the original circuit should be minimal.
• Maximize delay fault detection coverage
• Minimize added complexity


Proposed Design

Division of FL and BL
• First, POs to BL
• Gate with longest delay to gate with shortest delay
• For the gates connected to BL, choose the gate with maximum delay
• Then, any gate whose number of fanouts > number of fanins


Proposed Design

Delay fault detection coverage
• $d_{FL}$: delay from PI to any gate in FL
• $d_i$: delay from PI to any gate in original circuit

$$C_F = 1 - \frac{\max\{d_{FL}\}}{\max\{d_i\}}$$
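Reading the coverage expression on this slide as C_F = 1 − max{d_FL} / max{d_i}, a small Python sketch can evaluate it for a toy circuit. The gate names, arrival times, and the FL/BL split below are hypothetical and only illustrate the arithmetic.

```python
def fault_detection_coverage(d_fl, d_all):
    """C_F = 1 - max(d_FL) / max(d_i), with delays measured from the primary inputs."""
    return 1.0 - max(d_fl) / max(d_all)


# Hypothetical arrival times (ns) at the gate outputs of a small circuit.
delays = {"g0": 1.0, "g1": 2.0, "g2": 3.0, "g3": 4.0, "g4": 5.0}
front_logic = ["g0", "g1"]   # assumed FL; the remaining gates form BL

c_f = fault_detection_coverage([delays[g] for g in front_logic], delays.values())
print(round(c_f, 3))
```

Moving more of the late gates out of FL raises the coverage in this model, at the cost of a larger BL.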

Add graphical view


Proposed Design

Delay simulation
• SPICE simulation
  • TSMC 0.18 µm tech., Vcc = 1.6 V
  • Gate delay for rising and falling signals
  • Load: inverter
  • Different input combinations are considered

Delay simulation
• Randomly generated test vectors
  • 10^6 to 10^8, according to the number of primary inputs (PI)


Proposed Design

Area complexity
• $N_{gate}$: number of gates in the original circuit
• $N_{ff}$: number of flip-flops in each pipeline, $(N_{PI} + N_{PO})/2$
• $N_{gate\_BL}$: number of gates in BL
• $N_{gate\_CP}$: number of gates in comparison block
• $N_{Latch}$: number of latches = number of connections between FL and BL
• $w$: complexity ratio of flip-flop to gate

$$C_A = \frac{N_{gate\_BL} + N_{gate\_CP} + N_{Latch}}{N_{gate} + w \cdot N_{ff}}$$
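Assuming the area complexity on this slide reads C_A = (N_gate_BL + N_gate_CP + N_Latch) / (N_gate + w · N_ff), with N_ff = (N_PI + N_PO) / 2, the Python sketch below evaluates it. All counts and the value of w are hypothetical and are not taken from the benchmark results.

```python
def area_complexity(n_gate_bl, n_gate_cp, n_latch, n_gate, n_pi, n_po, w):
    """Added area relative to the original pipeline stage (gates plus weighted flip-flops)."""
    n_ff = (n_pi + n_po) / 2          # flip-flops per pipeline, as defined on the slide
    return (n_gate_bl + n_gate_cp + n_latch) / (n_gate + w * n_ff)


# Hypothetical circuit: 200 gates, 40 inputs, 32 outputs, flip-flop = 5 gate equivalents.
c_a = area_complexity(n_gate_bl=40, n_gate_cp=12, n_latch=8,
                      n_gate=200, n_pi=40, n_po=32, w=5)
print(round(c_a, 3))
```

Here 60 added elements are weighed against 380 gate equivalents, giving an added complexity of about 0.158.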


Fault Coverage vs. Complexity

[Figure: four plots of added complexity CA (vertical axis) versus fault detection coverage CF (horizontal axis), one each for the C499, C432, C880, and C6288 benchmarks.]


Complexity

Effective complexity penalty
• Depends on application
• More than half of area is cache
• Speed critical part: integer unit

$$C_{AE} = C_A \cdot \frac{\text{Applicable area}}{\text{Total chip area}} < 0.5\, C_A$$
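Taking the effective complexity to be the added complexity C_A scaled by the fraction of the chip the technique applies to, the following Python sketch shows the scaling. The areas and the C_A value are hypothetical, and `effective_complexity` is an illustrative name.

```python
def effective_complexity(c_a, applicable_area, total_chip_area):
    """C_AE = C_A * (applicable area / total chip area)."""
    if not 0 <= applicable_area <= total_chip_area:
        raise ValueError("applicable area must lie between 0 and the total chip area")
    return c_a * applicable_area / total_chip_area


# Hypothetical chip: the technique applies to 30 mm^2 out of 100 mm^2, with C_A = 0.20.
c_ae = effective_complexity(0.20, applicable_area=30.0, total_chip_area=100.0)
print(round(c_ae, 3))
```

When more than half of the die is cache, the applicable fraction stays below one half, so the chip-level penalty stays below half of C_A.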


Estimation of Complexity

[Figure: Intel® Pentium® 4 Processor on 90 nm process, with labeled blocks including the data cache, AGU, align mux, registers, and ALUs.]


Conclusion

• Delay fault tolerant design is proposed
• Possible operation clock frequency gain is estimated from modeling and experiments
• Delay fault detection coverage and complexity are analyzed for optimal implementation
• It shows that 10% clock frequency gain is possible with proposed design at a moderate (8-25%) complexity increase