
Reducing Leakage Power in Peripheral Circuits of L2 Caches

Houman Homayoun and Alex VeidenbaumDept. of Computer Science, UC Irvine

{hhomayou, alexv}@ics.uci.edu

ICCD 2007

Page 2

L2 Caches and Power

L2 caches in high-performance processors are large: 2 to 4 MB is common

They are typically accessed relatively infrequently

Thus the L2 cache dissipates most of its power via leakage; much of it used to be in the SRAM cells

Many architectural techniques proposed to remedy this

Today, there is also significant leakage in the peripheral circuits of an SRAM (cache), in part because the cell design has already been optimized

Page 3

The problem

How to reduce power dissipation in the peripheral circuits of the L2 cache?

Seek an architectural solution with a circuit assist

Approach: Reduce peripheral leakage when circuits are unused

By applying "sleep transistor" techniques

Use architectural techniques to minimize "wakeup" time

During an L2 miss service, for instance

Will assume that the SRAM cell design is already optimized, and will not attempt to save additional power in the cells

Page 4

Benchmark  DL1 miss rate  L2 miss rate  % loads    Benchmark  DL1 miss rate  L2 miss rate  % loads
ammp       0.046          0.1872        0.22       lucas      0.097          0.6657        0.15
applu      0.056          0.6572        0.26       mcf        0.239          0.4284        0.34
apsi       0.027          0.2778        0.22       mesa       0.003          0.2674        0.26
art        0.414          0.0001        0.17       mgrid      0.036          0.4587        0.30
bzip2      0.017          0.0417        0.24       parser     0.020          0.0688        0.22
crafty     0.002          0.0087        0.28       perlbmk    0.005          0.4576        0.31
eon        0.000          1.0           0.26       sixtrack   0.012          0.0012        0.22
equake     0.017          0.6727        0.25       swim       0.089          0.6308        0.21
facerec    0.034          0.3121        0.21       twolf      0.054          0.0003        0.23
galgel     0.037          0.0057        0.22       vortex     0.003          0.2314        0.24
gap        0.007          0.5506        0.21       vpr        0.023          0.1476        0.30
gcc        0.046          0.0367        0.21       wupwise    0.012          0.674         0.17
gzip       0.007          0.0468        0.20       Average    0.052          0.3132        0.24

Miss rates and load frequencies

SPEC2K benchmarks, 128KB L1 cache

5% average L1 miss rate; loads are 25% of instructions

In many benchmarks the L2 is mostly idle; in some, the L1 miss rate is high

Much waiting for data: are the L2 and the CPU idle?
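As a rough back-of-the-envelope check (using the average figures from the table above; the per-benchmark numbers vary widely, and this ignores stores and instruction fetches), the fraction of instructions that actually reach the L2 is the load fraction times the DL1 miss rate:

```python
# Rough estimate of how often the L2 is accessed, using the average
# figures from the table above. Assumption for illustration: only
# loads miss in the DL1; stores and instruction fetches are ignored.
loads_per_instr = 0.24   # ~24% of instructions are loads
dl1_miss_rate = 0.052    # ~5.2% of loads miss in the L1 D-cache

l2_accesses_per_instr = loads_per_instr * dl1_miss_rate
print(f"{l2_accesses_per_instr:.4f}")  # ~0.0125: about 1.2% of instructions
```

So on average only about one instruction in eighty touches the L2, which is why its peripherals sit idle, leaking, most of the time.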

Page 5

SRAM Leakage Sources

SRAM cells, sense amps, multiplexers, local and global drivers (including the wordline driver), and the address decoder

[Figure: SRAM array organization: address input global drivers, decoder, predecoder and global wordline drivers, global and local wordlines, bitlines, sense amps, and global output drivers]

Page 6

Leakage Energy Breakdown in L2 Cache

global data input drivers: 25%
global data output drivers: 24%
local data output drivers: 20%
global address input drivers: 14%
others: 9%
global row predecoder: 7%
local row decoders: 1%

Large, more leaky transistors are used in peripheral circuits; high-Vth, less leaky transistors in the memory cells

Page 7

Circuit Techniques for Leakage Reduction

Gated-Vdd, Gated-Vss
Voltage Scaling (DVFS)
ABB-MTCMOS
Forward Body Biasing (FBB), RBB

These typically target the cache SRAM cell design, but are also applicable to peripheral circuits

Page 8

Architectural Techniques

Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, or read the tag first

Drowsy Cache: keep cache lines in a low-power state, with data retention

Cache Decay: evict lines not used for a while, then power them down

Applying DVS, Gated-Vdd, or Gated-Vss to the memory cell, with much architectural support proposed to do so

All of these target the cache SRAM memory cells

Page 9

What else can be done?

Architectural Motivation: A load miss in the L2 cache takes a long time to service

This prevents dependent instructions from being issued


Page 10

When dependent instructions cannot issue, after a number of cycles the instruction window is full

ROB, Instruction Queue, Store Queue

Processor issue stalls and performance is lost

At the same time, energy is lost as well! This is an opportunity to save energy

Page 11

IPC during an L2 miss

Cumulative over the L2 miss service time for a program; it decreases significantly compared to the program average

[Figure: issue rate (0 to 3.25) per SPEC2K benchmark and on average: average issue rate during cache miss periods vs. program average issue rate]

Page 12

A New Technique

Idle-time Management (IM): assert an L2 sleep signal (SLP) after an L2 cache miss

Puts L2 peripheral circuits into a low-power state; the L2 cannot be accessed while in this state

De-assert SLP when the cache miss completes

Could also apply to the CPU (using SLP to trigger DVFS, for instance), but L2 idle time is only 200 to 300 clocks, and DVFS currently takes longer than that

Page 13

A Problem

• Disabling the L2 as soon as the miss is detected
• Prevents the issue of independent instructions
• In particular, of loads that may hit or miss in the L2
• This may impact performance significantly: up to a 50% performance loss

[Figure: performance loss (0 to 60%) per SPEC2K benchmark and on average when the L2 is disabled as soon as a miss is detected]

Page 14

What are independent instructions?

Independent instructions do not depend on the load miss, or on any other miss occurring during the L2 miss service; they can execute during miss service

[Figure: percentage of instructions independent of an L2 miss, per SPEC2K benchmark and on average, on a logarithmic scale (0.001% to 100%)]

Page 15

Two Idle Mode Algorithms

Static algorithm (SA): put the L2 in stand-by mode N cycles after a cache miss occurs; enable it again M cycles before the miss is expected to complete

Independent instructions execute during the L2 miss service; the L2 can be accessed during the N+M cycles

L1 misses are buffered in an L2 buffer during stand-by

Adaptive algorithm (AA): monitor the issue logic and functional units of the processor after an L2 miss

Put the L2 into stand-by mode if no instructions have issued AND the functional units have not executed any instructions in K cycles

The algorithm attempts to detect that there are no more instructions that may access the L2
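A minimal sketch of the two policies (the cycle-driven framing, function names, and the miss-latency estimate are illustrative assumptions; the slides do not give an implementation):

```python
# Illustrative sketch of the two idle-mode policies described above.
# N, M, K, and the estimated miss latency are tuning parameters.

def static_algorithm(cycle, miss_start, miss_latency, N, M):
    """SA: L2 peripherals sleep from N cycles after the miss starts
    until M cycles before the miss is expected to complete."""
    sleep_start = miss_start + N
    wake_start = miss_start + miss_latency - M
    return sleep_start <= cycle < wake_start  # True -> SLP asserted

def adaptive_algorithm(issued_in_last_k, executed_in_last_k):
    """AA: sleep once neither the issue logic nor the functional
    units have done any work for K consecutive cycles."""
    return not issued_in_last_k and not executed_in_last_k

# Example: a 300-cycle miss starting at cycle 1000, with N=20, M=30
assert not static_algorithm(1010, 1000, 300, 20, 30)  # still awake
assert static_algorithm(1100, 1000, 300, 20, 30)      # asleep
assert not static_algorithm(1285, 1000, 300, 20, 30)  # re-enabled early
```

SA trades simplicity for a fixed window; AA instead infers from processor activity that no further L2 accesses are in flight.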

Page 16

A Second Leakage Reduction Technique

Sometimes the L2 is not accessed much and is mostly idle

In this case it is best to use the Stand-By Mode (SM): start the L2 cache in stand-by, low-power mode; "wake it up" on an L1 cache miss and service the miss; return the L2 to stand-by mode right after the L2 access

However, this is likely to lead to performance loss: L1 misses are often clustered, and there is a wake-up delay

A better solution: keep the L2 awake for J cycles after it was turned on; this increases energy consumption, but improves performance
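The SM policy can be sketched as follows (the class and its interface are illustrative; the wake-up delay itself is not modeled here):

```python
# Illustrative sketch of Stand-by Mode (SM): the L2 sleeps by default,
# wakes on an L1 miss, and stays awake for J cycles after each access
# to absorb clustered misses. J is a tuning parameter.

class StandbyModeL2:
    def __init__(self, J):
        self.J = J
        self.awake_until = -1  # cycle until which the L2 stays awake

    def on_l1_miss(self, cycle):
        # Wake the L2 (the wake-up delay is not modeled) and keep it
        # awake for J cycles after this access.
        self.awake_until = cycle + self.J

    def is_awake(self, cycle):
        return cycle <= self.awake_until

l2 = StandbyModeL2(J=100)
l2.on_l1_miss(1000)
assert l2.is_awake(1050)      # within the J-cycle window
assert not l2.is_awake(1200)  # back in stand-by
```

A clustered burst of L1 misses keeps pushing `awake_until` forward, so the L2 pays the wake-up cost once per burst rather than once per miss.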

Page 17

Hardware Support

Add appropriately sized sleep transistors in the global drivers

Add a delayed-access buffer to the L2, which allows L1 misses to be issued and stored in this buffer at the L2

[Figure: L2 cache organization: the SLP signal gates the predecoder, write buffer, read buffer, and drivers around the cell array; while SLP is asserted, forthcoming loads and stores are inserted into a 10-entry Delayed Access Buffer (10 x 8 bits) and access the L2 when it is enabled again]
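The buffer's behavior can be sketched like this (the class, its capacity-handling, and the request format are illustrative; only the 10-entry size comes from the slides):

```python
# Illustrative sketch of the delayed-access buffer: while SLP is
# asserted, incoming L1 misses are queued instead of accessing the
# sleeping L2; they are replayed once the L2 is re-enabled.
from collections import deque

class DelayedAccessBuffer:
    def __init__(self, capacity=10):  # 10 entries, per the slides
        self.buf = deque()
        self.capacity = capacity

    def enqueue(self, request):
        if len(self.buf) >= self.capacity:
            return False  # buffer full: the requester must stall
        self.buf.append(request)
        return True

    def drain(self, access_l2):
        # Called when SLP is de-asserted: replay queued requests in order.
        while self.buf:
            access_l2(self.buf.popleft())

dab = DelayedAccessBuffer()
assert dab.enqueue(("load", 0x1000))
served = []
dab.drain(served.append)
assert served == [("load", 0x1000)]
```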

Page 18

System Description

L1 I-cache 128KB, 64 byte/line, 2 cycles

L1 D-cache 128KB, 64 byte/line, 2 cycles, 2 R/W ports

L2 cache 4MB, 8 way, 64 byte/line, 20 cycles

Issue 4-way, out of order

Branch predictor 64K-entry g-share, 4K-entry BTB

Reorder buffer 96 entry

Instruction queue 64 entry (32 INT and 32 FP)

Register file 128 integer and 128 floating point

Load/store queue 32 entry load and 32 entry store

Arithmetic unit 4 integer, 4 floating point units

Complex unit 2 INT, 2 FP multiply/divide units

Pipeline 15 cycles (some stages are multi-cycle)

Page 19

Performance Evaluation

% Time L2 Turned ON

[Figure: fraction of total execution time the L2 cache was active (0-100%) under SM_200, SM_500, SM_750, SM_1000, SM_1500, IM/SA, and IM/AA, for INT and FP benchmark averages]

IPC Degradation

[Figure: IPC loss (0-8%) due to the L2 not being accessible under SM_200, SM_500, SM_750, SM_1000, SM_1500, IM/SA, and IM/AA, for INT and FP benchmark averages]

Page 20

Power-Performance Trade Off

[Figure, three panels comparing IM/SA, IM/AA, and SM: (a) leakage power savings (0-100%), (b) total energy-delay reduction (0-100%), (c) performance degradation (0-20%)]

IM: 18 to 22% leakage power reduction with a 1% performance loss; SM: 25% leakage power reduction with a 2% performance loss
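As a sanity check on how such numbers combine (illustrative arithmetic only, not the paper's methodology), a leakage power saving with a small runtime stretch yields an energy-delay reduction for the leakage component roughly like:

```python
# Illustrative energy-delay arithmetic using the SM numbers quoted
# above: 25% leakage power reduction, 2% performance (IPC) loss.
leakage_saving = 0.25
ipc_loss = 0.02

power_ratio = 1 - leakage_saving         # leakage power drops to 0.75x
delay_ratio = 1 / (1 - ipc_loss)         # runtime stretches by ~2%
energy_ratio = power_ratio * delay_ratio # energy = power x time
ed_ratio = energy_ratio * delay_ratio    # energy-delay product
print(f"ED reduction: {1 - ed_ratio:.1%}")  # prints "ED reduction: 21.9%"
```

The small IPC loss eats only slightly into the leakage saving, which is why both IM and SM come out well ahead on energy-delay.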

Page 21

Conclusions

Studied the breakdown of leakage across L2 cache components, showing that the peripheral circuits leak considerably.

Prior architectural techniques address leakage in the memory cells.

Presented an architectural study of what happens after an L2 cache miss occurs.

Presented two architectural techniques to reduce leakage in the L2 peripheral circuits: IM and SM. IM achieves an 18 to 22% average leakage power reduction with a 1% average IPC reduction; SM achieves a 25% average saving with a 2% average IPC reduction.

The two techniques benefit different benchmarks, which indicates the possibility of adaptively selecting the best technique. This is the subject of our ongoing research.