PATMOS’02 Energy-Efficient Design of the Reorder Buffer* *supported in part by DARPA through the PAC-C program and NSF Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower 12th Workshop on Power and Timing Modeling, Optimization and Simulation, Sevilla, Spain, September 12, 2002

Page 1:

PATMOS’02

Energy-Efficient Design of the Reorder Buffer*

*supported in part by DARPA through the PAC-C program and NSF

Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose

Department of Computer Science

State University of New York

Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

12th Workshop on Power and Timing Modeling, Optimization and Simulation, Sevilla, Spain, September 12, 2002

Page 2:

Presentation Outline

ROB complexities and sources of power dissipation

Low-power ROB design:

Dynamic ROB resizing

Use of energy-efficient comparators

Use of zero-byte encoding

Results

Concluding remarks

Page 3:

What This Work is All About

In some of today’s processors, physical registers are implemented as the Reorder Buffer (ROB) slots

Example: Pentium III

Consequences:

The ROB is a complex, multi-ported structure, dissipating a non-trivial fraction of the total chip power

Main goal of this work: Reduce power dissipation of the ROB without sacrificing performance

Page 4:

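Superscalar Datapath

[Figure: block diagram of the superscalar datapath. Labeled elements: fetch stages (F1, F2), decode/dispatch stages (D1, D2), instruction dispatch, instruction queue (IQ), instruction issue, function units (FU1, FU2, ..., FUm), execute stage (EX), load/store queue (LSQ), D-cache, ROB, Architectural Register File (ARF), and result/status forwarding buses.]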

Page 5:

ROB Structures and Complexities

The Reorder Buffer (ROB) is used for:

Supporting precise interrupts

Maintaining speculative register values

A large number of read and write ports is required. For a W-way CPU:

W write ports to set up entries

W read ports for instruction commitment

2W read ports for reading the source operands

W write ports for writing the results
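The port breakdown for a W-way CPU can be tallied with a few lines of Python. This is only an illustrative sketch: the function name and dictionary keys are made up here, and the per-category counts are taken directly from the list above.

```python
def rob_port_counts(width: int) -> dict:
    """Tally ROB ports for a W-way machine using the breakdown listed above."""
    if width < 1:
        raise ValueError("issue width must be at least 1")
    ports = {
        "entry_setup_write": width,        # W write ports to set up entries
        "commit_read": width,              # W read ports for commitment
        "source_operand_read": 2 * width,  # 2W read ports for source operands
        "result_write": width,             # W write ports for results
    }
    # The right-hand side is evaluated before the "total" key is added.
    ports["total"] = sum(ports.values())
    return ports


print(rob_port_counts(4)["total"])  # prints 20, i.e. 5W for W = 4
```

The total is always 5W, and 2W of those ports serve source operand reads.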

Page 6:

Sources of ROB Power Dissipation

Establishment of ROB entries for dispatched instructions

Readout of the valid sources from the ROB, including the associative search

Writing the results into the ROB slots

Instruction commitment

Clearing the ROB on mispredictions (this is small)

Page 7:

Sources of ROB Power Dissipation

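[Figure: pie chart of ROB power dissipation by source. Slice labels: operand reads, writeback, commitment, entry setup. Slice values shown: 21.5%, 34.5%, 35.7%, 8.1% (the pairing of labels and values is not preserved in this transcript).]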

Page 8:

ROBs in Modern CPUs: Summary

80 entries or more in current implementations

5W ports for a W-way CPU

Large fraction of total chip power is dissipated within the ROB (27% according to Folegnani and Gonzalez, ISCA’01).

It is important to explore mechanisms for minimizing ROB power

Page 9:

What Do We Propose?

Three relatively independent techniques to reduce the power dissipation within the ROB:

Dynamic ROB resizing

Use of energy-efficient comparators

Use of zero-byte encoding

Page 10:

ROB Usage in Superscalar Datapath: Example (fpppp)

[Figure: occupancy changes in the ROB. Vertical axis: 0 to 80; horizontal axis: Simulation Cycle (M), 1 to 177.]

Main idea: When the ROB is underutilized, parts of it can be turned off to save power.

Page 11:

Incremental ROB Allocation/Deallocation

The ROB is implemented as a set of independent partitions

Each partition is a register file, complete with its own sensing and precharge/write logic, multiple ports and through busses

All partitions have associative addressing logic

Page 12:

Partitioned ROB Organization

[Figure: partitioned ROB organization with three partitions (Partition 1 to Partition 3). Each partition has a precharger array, input/output drivers, a bypass switch array, an associative part and a non-associative part. Bitlines and address lines within a partition connect to through lines via bypass switches.]

Page 13:

Sampling and Downsizing Strategies

Downsizing decisions are taken at the end of each update period

Update periods have a fixed duration of UP cycles

Within an update period, multiple samples of the occupancies are taken at regular intervals of SP cycles

[Figure: timeline in cycles showing sample periods (SP) within an update period (UP).]
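The sampling scheme can be sketched in Python as follows. The slides state only that occupancy is sampled every SP cycles and that a downsizing decision is made at the end of each update period; the "Avg." label in the resizing example suggests the samples are averaged. The rule used below for how many partitions to keep, the 8-entry partition size, and all function names are assumptions made for illustration, not the authors' exact policy.

```python
import math


def sample_occupancy(occupancy_by_cycle, sample_period):
    """Take one occupancy sample at the end of every `sample_period` cycles (SP)."""
    return occupancy_by_cycle[sample_period - 1::sample_period]


def downsized_partitions(samples, active_partitions, partition_size):
    """Return the number of partitions to keep after an update period.

    Assumed rule (not stated on the slides): keep just enough whole
    partitions to hold the average sampled occupancy, never fewer than
    one and never more than are currently active.
    """
    average = sum(samples) / len(samples)
    needed = max(1, math.ceil(average / partition_size))
    return min(active_partitions, needed)


# One update period with SP=4 and UP=16; occupancy values are made up.
occupancy = [20, 18, 15, 12, 10, 9, 8, 7, 6, 6, 5, 5, 4, 4, 4, 4]
samples = sample_occupancy(occupancy, sample_period=4)
print(samples)                               # [12, 7, 5, 4]
print(downsized_partitions(samples, 4, 8))   # 1
```

Because the function never returns more than the current partition count, it can only shrink the ROB; growth is handled separately by the overflow mechanism.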

Pages 14-18:

A Resizing Example (SP=4, UP=16)

[Figure, animated across pages 14 to 18: actual occupancy and allocated entries (each on a 0 to 32 scale) plotted over cycles 0 to 34, with SP and UP boundaries marked along the cycle axis. The final frame numbers the four samples of an update period (1 to 4) and marks their average (Avg.).]

Page 19:

Upsizing Strategy

Count the number of cycles when dispatch blocks because the ROB is full.

If the counter exceeds OT (Overflow Threshold), add one partition

Upsizing is more aggressive than downsizing, which reduces the performance hit

Reset the overflow counter to 0 at the beginning of a new UP (Update Period)
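A minimal sketch of the overflow counter described above, in Python. The class and parameter names are hypothetical. Two details are assumptions not given on the slide: the counter is also cleared right after a partition is added (so a single burst of stalls does not add a partition on every following cycle), and upsizing stops at a fixed maximum number of partitions.

```python
class OverflowMonitor:
    """Counts cycles in which dispatch blocks because the ROB is full."""

    def __init__(self, overflow_threshold, update_period, max_partitions):
        self.overflow_threshold = overflow_threshold  # OT
        self.update_period = update_period            # UP, in cycles
        self.max_partitions = max_partitions          # assumed upper bound
        self.overflow_count = 0
        self.cycle_in_period = 0

    def tick(self, dispatch_blocked, active_partitions):
        """Advance one cycle and return the (possibly larger) partition count."""
        if dispatch_blocked:
            self.overflow_count += 1
        if (self.overflow_count > self.overflow_threshold
                and active_partitions < self.max_partitions):
            active_partitions += 1
            self.overflow_count = 0  # assumption: restart counting after upsizing
        self.cycle_in_period += 1
        if self.cycle_in_period == self.update_period:
            # A new update period begins: reset the overflow counter.
            self.cycle_in_period = 0
            self.overflow_count = 0
        return active_partitions


monitor = OverflowMonitor(overflow_threshold=4, update_period=16, max_partitions=4)
partitions = 2
for _ in range(5):  # five consecutive blocked cycles
    partitions = monitor.tick(dispatch_blocked=True, active_partitions=partitions)
print(partitions)  # 3: the fifth blocked cycle pushes the counter past OT=4
```

Because the check runs every cycle rather than only at the end of the update period, the ROB can grow as soon as the threshold is crossed, which is what makes upsizing more aggressive than downsizing.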

Pages 20-31:

A Resizing Example (SP=4, UP=16, OT=4)

[Figure, animated across pages 20 to 31: the same plot of actual occupancy and allocated entries (each on a 0 to 32 scale) over cycles 0 to 34, with SP and UP boundaries marked, now illustrating upsizing. Successive frames count overflow cycles (1, 2, 3, 4) up to the threshold OT = 4.]

Page 32:

Summary of the Control Strategy

Only three parameters are used for control:

OT (Overflow Threshold)

UP (Update Period)

SP (Sample Period)

Less than 1% power overhead for control logic

Advantages:

Can easily achieve a desired power/performance tradeoff by adjusting OT and UP

Monitoring on a cycle-by-cycle basis is avoided; it is done once every SP cycles

Page 33:

The Use of Energy-Efficient Comparators

Traditional pull-down comparators dissipate energy (through discharging the output node) on a mismatch in any bit position.

If mismatches are much more frequent than matches, this is energy-inefficient

We proposed a number of dissipate-on-match comparator designs (Kucuk et al., ISLPED’01 and Ergin et al., ICCD’02)

Page 34:

The Use of Energy-Efficient Comparators

If associative addressing is used within the ROB, the architectural register IDs are compared.

Number of bits matching (% of total cases)

                    2 LSBs    4 LSBs    All 6 bits
Avg. SPECint 95     23%       14%       12%
Avg. SPECfp 95      26%       16%       11%
Avg. all SPEC 95    25%       15%       11.5%

Page 35:

The Use of Energy-Efficient Comparators

As seen from the distribution, mismatches occur more frequently than matches if the architectural register addresses are compared within the ROB

To exploit this, we used the design of Kucuk et al., ISLPED’01.

Two-stage Domino logic

The first stage compares the 4 LSBs. Unless they match (only 12% of the cases), no dissipation occurs

Significant energy reduction results!

Other designs can be used to speed things up by avoiding domino-style logic (Ergin et al., ICCD’02).
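The difference between the two comparator styles can be illustrated with a purely behavioral Python model. It only counts how many comparisons dissipate energy; it does not model the circuits or the energy per event, and the function names and register IDs below are made up for the example.

```python
def traditional_dissipates(a: int, b: int) -> bool:
    """Pull-down comparator: the output node discharges on any bit mismatch."""
    return a != b


def two_stage_dissipates(a: int, b: int, lsb_bits: int = 4) -> bool:
    """Dissipate-on-match model: energy is spent only if the low bits match."""
    mask = (1 << lsb_bits) - 1
    return (a & mask) == (b & mask)


def count_dissipating(pairs, dissipates):
    """Count the comparisons in `pairs` for which `dissipates` returns True."""
    return sum(1 for a, b in pairs if dissipates(a, b))


# One 6-bit source register ID compared against eight ROB entries
# (hypothetical values; 21 and 37 share their 4 LSBs with 5).
source = 5
rob_ids = [5, 21, 37, 6, 7, 12, 13, 30]
pairs = [(source, rob_id) for rob_id in rob_ids]

print(count_dissipating(pairs, traditional_dissipates))  # 7
print(count_dissipating(pairs, two_stage_dissipates))    # 3
```

In this toy case the traditional design dissipates on seven of eight comparisons, while the two-stage model dissipates only on the three whose 4 LSBs match.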

Page 36:

The Use of Zero-Byte Encoding

A large percentage of bytes travelling on result, commit and dispatch buses contain all zeroes.

This can be exploited by not writing such bytes into the ROB and not reading them from the ROB.

A separate bit (Zero Indicator Bit, ZIB) is used to distinguish such bytes.

If a byte contains all zeroes, only the ZIB is read and written instead of 8 bits.

Circuits are similar to those presented in Ghose et al. (Koolchips, 2000) and Zhang et al. (MICRO’00).
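The zero-byte idea can be sketched in software as follows. This is a data-level illustration only, not the circuit: the function names are hypothetical, 32-bit values are assumed, and the bit count assumes the ZIB of every byte is always accessed, with the 8 data bits accessed only for nonzero bytes.

```python
def zero_byte_encode(value: int, width_bytes: int = 4):
    """Split a value into per-byte zero indicator bits and its nonzero bytes."""
    if not 0 <= value < (1 << (8 * width_bytes)):
        raise ValueError("value does not fit in the given width")
    data = value.to_bytes(width_bytes, "little")
    zibs = [byte == 0 for byte in data]            # True marks an all-zero byte
    stored = [byte for byte in data if byte != 0]  # only these bytes are written
    return zibs, stored


def zero_byte_decode(zibs, stored):
    """Rebuild the value: zero bytes come from the ZIBs, the rest from storage."""
    remaining = iter(stored)
    data = bytes(0 if is_zero else next(remaining) for is_zero in zibs)
    return int.from_bytes(data, "little")


def bits_accessed(zibs):
    """One ZIB per byte, plus 8 data bits for each nonzero byte (assumed model)."""
    return len(zibs) + 8 * sum(1 for is_zero in zibs if not is_zero)


zibs, stored = zero_byte_encode(300)       # 300 = 0x0000012C
print(zibs)                                # [False, False, True, True]
print(bits_accessed(zibs))                 # 20, versus 32 without encoding
print(zero_byte_decode(zibs, stored))      # 300
```

Small values such as 300 have zero upper bytes, so only two of the four bytes are actually read or written.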

Page 37:

Percentage of Bytes Containing All 0’s: Results

On average across all SPEC 95 benchmarks:

In sources being read from the ROB: 43%

In the result values written into the ROB and committed from the ROB: 41.5%

Page 38:

Percentage of Bytes Containing All 0’s: Results

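[Figure: bar chart of the percentage of all-zero bytes (axis 0 to 80) on the dispatch bus, result bus and commit bus for each SPEC 95 benchmark (compress, vortex, m88ksim, gcc, go, ijpeg, lisp, perl, turb3d, fpppp, apsi, applu, hydro2d, mgrid, su2cor, swim, tomcatv, wave5) and for the SPECint 95, SPECfp 95 and SPEC 95 averages.]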

Page 39:

Experimental Setup (AccuPower, DATE’02)

[Figure: AccuPower tool flow. Labeled elements: compiled SPEC benchmarks, datapath specs, microarchitectural simulator, performance stats, transition counts and context information, VLSI layout data, SPICE deck, SPICE, SPICE measures of energy per transition, energy/power estimator, power/energy stats.]

Page 40:

Summary of the Results (SPEC 95 averages)

Dynamic ROB resizing:

UP=2048 cycles

SP=32 cycles

IPC drop 0.06% for OT=128

IPC drop 3.14% for OT=2048

Power savings range from 56% to 63%

Page 41:

Summary of the Results (SPEC 95 averages)

Comparators: 41% comparator power savings and 13% overall ROB power savings

Zero-byte encoding: 17% power savings

Three techniques combined: 70-76% power savings with negligible impact on performance

Page 42:

Concluding Remarks

Significant power reduction within the ROB can be realized by:

Dynamic ROB resizing

Use of dissipate-on-match comparators

Use of zero-byte encoding

Combined power savings are in the range of 70-76% with very small impact on performance.

Finally, all three techniques increase the ROB complexity. Can the ROB complexity be reduced?

Page 43:

Concluding Remarks

Yes! With a small IPC drop, we can completely eliminate the ROB read ports needed for reading the source operand values.

These account for 2W of the 5W ROB ports

Details are in Kucuk, Ponomarev and Ghose, ICS’02.

Combined, the techniques presented here and the solution of ICS’02 can make the case for reconsidering the architecture that integrates the physical register file and the ROB as a choice for implementing future high-performance microprocessors.