PATMOS’02
Energy-Efficient Design of the Reorder Buffer*
*supported in part by DARPA through the PAC-C program and NSF
Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
12th Workshop on Power and Timing Modeling, Optimization and Simulation, Sevilla, Spain, September 12, 2002
Presentation Outline
ROB complexities and sources of power dissipation
Low-power ROB design:
Dynamic ROB resizing
Use of energy-efficient comparators
Use of zero-byte encoding
Results
Concluding remarks
What This Work is All About
In some of today’s processors, physical registers are implemented as Reorder Buffer (ROB) slots
Example: Pentium III
Consequences:
The ROB is a complex, multi-ported structure that dissipates a non-trivial fraction of the total chip power
Main goal of this work: Reduce the power dissipation of the ROB without sacrificing performance
Superscalar Datapath
[Figure: block diagram of a superscalar datapath. Blocks shown: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), result/status forwarding buses, ROB, LSQ, D-cache, and the Architectural Register File (ARF).]
ROB Structures and Complexities
The Reorder Buffer (ROB) is used for:
Supporting precise interrupts
Maintaining speculative register values
A large number of read and write ports is required. For a W-way CPU:
W write ports to set up entries
W read ports for instruction commitment
2W read ports for reading the source operands
W write ports for writing the results
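The port breakdown above can be tallied with a short sketch. The function name `rob_port_counts` and the dictionary keys are illustrative, not taken from the slides; only the per-category counts come from the list above.

```python
def rob_port_counts(width: int) -> dict[str, int]:
    """Port counts for a W-way CPU, following the breakdown on this slide."""
    if width < 1:
        raise ValueError("issue width must be at least 1")
    return {
        "entry_setup_write": width,        # W write ports to set up entries
        "commit_read": width,              # W read ports for commitment
        "source_operand_read": 2 * width,  # 2W read ports for source operands
        "result_write": width,             # W write ports for results
    }


ports = rob_port_counts(4)
total_ports = sum(ports.values())  # equals 5 * W
print(ports, total_ports)
```

For a 4-way machine this gives 20 ports in total, that is, 5W.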
Sources of ROB Power Dissipation
Establishment of ROB entries for dispatched instructions
Readout of the valid sources from the ROB, including the associative search
Writing the results into the ROB slots
Instruction commitment
Clearing the ROB on mispredictions (this is small)
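Sources of ROB Power Dissipation
[Figure: pie chart of the ROB power breakdown. Segment labels: operand reads, writeback, commitment, entry setup. Segment values: 21.5%, 34.5%, 35.7%, 8.1%. The label-to-value pairing is not preserved in this transcript.]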
ROBs in Modern CPUs: Summary
80 entries or more in current implementations
5W ports for a W-way CPU
A large fraction of total chip power is dissipated within the ROB (27% according to Folegnani and Gonzalez, ISCA’01).
It is therefore important to explore mechanisms for minimizing ROB power
What Do We Propose?
Three relatively independent techniques to reduce the power dissipation within the ROB:
Dynamic ROB resizing
Use of energy-efficient comparators
Use of zero-byte encoding
ROB Usage in Superscalar Datapath: Example (fpppp)
[Figure: occupancy changes in the ROB for fpppp. x-axis: simulation cycle (M), 1 to 177; y-axis: ROB occupancy, 0 to 80 entries.]
Main idea: When the ROB is underutilized, parts of it can be turned off to save power.
Incremental ROB Allocation/Deallocation
The ROB is implemented as a set of independent partitions
Each partition is a register file, complete with its own sensing and precharge/write logic, multiple ports and through busses
All partitions have associative addressing logic
Partitioned ROB Organization
[Figure: the ROB drawn as three partitions (Partition 1 to Partition 3). Each partition has a precharger array, input/output drivers, a bypass switch array, an associative part, and a non-associative part, with bitlines or address lines local to the partition. Labels also mark a through line and a bypass switch.]
Sampling and Downsizing Strategies
Downsizing decisions are taken at the end of each update period
Update periods have a fixed duration of UP cycles
Within an update period, multiple samples of the occupancies are taken at regular intervals of SP cycles
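The sampling scheme can be sketched as follows. The slides state when samples are taken and when the decision is made, but not the decision rule itself; the rule used here (keep just enough whole partitions to hold the average sampled occupancy, never fewer than one) is an assumption, as are the function name and the partition size of 8 entries.

```python
import math


def downsize_decision(occupancy_by_cycle, allocated, sp=4, up=16, partition=8):
    """Return the ROB allocation (in entries) after one update period.

    occupancy_by_cycle holds the ROB occupancy for each of the `up` cycles.
    The release rule below is an assumption, not taken from the slides.
    """
    if len(occupancy_by_cycle) != up:
        raise ValueError("expected exactly one update period of data")
    # Sample once every `sp` cycles: cycles sp, 2*sp, ..., up (1-indexed).
    samples = [occupancy_by_cycle[c - 1] for c in range(sp, up + 1, sp)]
    average = sum(samples) / len(samples)
    # Smallest whole number of partitions that holds the average occupancy.
    needed = max(1, math.ceil(average / partition)) * partition
    # This function only shrinks the allocation; growth is handled elsewhere.
    return min(allocated, needed)


period = [20, 18, 14, 12, 12, 11, 10, 12, 13, 12, 11, 12, 10, 9, 12, 12]
new_size = downsize_decision(period, allocated=32)
print(new_size)
```

With SP=4 and UP=16, only four of the sixteen cycles are inspected, which is what keeps the monitoring cost low.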
[Figure: timeline in cycles showing sample periods (SP) nested within an update period (UP).]
A Resizing Example (SP=4, UP=16)
[Figure: animation over cycles 0 to 34 plotting actual occupancy and allocated entries (both y-axes 0 to 32 in steps of 8), with SP and UP boundaries marked. The final frames label samples 1 to 4 and their average (Avg.).]
Upsizing Strategy
Count the number of cycles when dispatch blocks because the ROB is full.
If the counter exceeds the overflow threshold (OT), add one partition
- Upsizing is more aggressive than downsizing, which limits the performance hit
Reset the overflow counter to 0 at the beginning of a new UP (Update Period)
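The upsizing rule can be sketched as a small per-cycle monitor. The class name `OverflowMonitor`, the partition size of 8, the maximum size of 32, and the choice to restart the count right after growing are all assumptions made for illustration; the slides specify only the counter, the threshold test, and the reset at the start of each update period.

```python
class OverflowMonitor:
    """Counts cycles in which dispatch blocks because the ROB is full."""

    def __init__(self, overflow_threshold, update_period, partition=8, max_size=32):
        self.ot = overflow_threshold
        self.up = update_period
        self.partition = partition
        self.max_size = max_size
        self.cycle = 0
        self.overflows = 0

    def tick(self, rob_full_stall, allocated):
        """Advance one cycle and return the (possibly larger) allocation."""
        self.cycle += 1
        if rob_full_stall:
            self.overflows += 1
        if self.overflows > self.ot and allocated < self.max_size:
            allocated = min(self.max_size, allocated + self.partition)
            self.overflows = 0  # assumption: restart counting after growing
        if self.cycle % self.up == 0:
            self.overflows = 0  # a new update period begins
        return allocated


monitor = OverflowMonitor(overflow_threshold=4, update_period=16)
size = 16
sizes = []
for stalled in [True] * 5 + [False] * 3:
    size = monitor.tick(stalled, size)
    sizes.append(size)
print(sizes)
```

In this sketch a partition is added as soon as the count exceeds OT, without waiting for the end of the update period, which is one way to read "more aggressive than downsizing".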
A Resizing Example (SP=4, UP=16, OT=4)
[Figure: continuation of the animation over cycles 0 to 34, plotting actual occupancy and allocated entries (both y-axes 0 to 32 in steps of 8) with SP and UP boundaries marked. Successive frames count overflow cycles 1 to 4 and mark OT = 4.]
Summary of the Control Strategy
Only three parameters are used for control:
OT (Overflow Threshold)
UP (Update Period)
SP (Sample Period)
Less than 1% power overhead for control logic
Advantages:
A desired power/performance tradeoff can be achieved easily by adjusting OT and UP
Cycle-by-cycle monitoring is avoided; occupancy is sampled once every SP cycles
The Use of Energy-Efficient Comparators
Traditional pull-down comparators dissipate energy (through discharging the output node) on a mismatch in any bit position.
If mismatches are much more frequent than matches, this is energy-inefficient
We proposed a number of dissipate-on-match comparator designs (Kucuk et al., ISLPED’01 and Ergin et al., ICCD’02)
The Use of Energy-Efficient Comparators
If associative addressing is used within the ROB, the architectural register ids are compared.
Number of bits matching (% of total cases):
                  2 LSBs   4 LSBs   All 6 bits
Avg. SPECint 95   23%      14%      12%
Avg. SPECfp 95    26%      16%      11%
Avg. all SPEC 95  25%      15%      11.5%
The Use of Energy-Efficient Comparators
As the distribution shows, mismatches occur much more frequently than matches when architectural register addresses are compared within the ROB
To exploit this, we used the design of Kucuk et al., ISLPED’01:
Two-stage domino logic
The first stage compares the 4 LSBs. Unless they match (only 12% of the cases), no dissipation occurs
Significant energy reduction results
Other designs can be used to speed things up by avoiding domino-style logic (Ergin et al., ICCD’02).
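A behavioral sketch can make the contrast concrete. It counts how many comparisons would dissipate under each scheme; it is not a circuit or energy model, and the function names and the sample register id pairs are invented for illustration.

```python
def traditional_dissipates(a: int, b: int) -> bool:
    """Pull-down comparator: the output node discharges on any bit mismatch."""
    return a != b


def two_stage_dissipates(a: int, b: int, lsb_bits: int = 4) -> bool:
    """Dissipate-on-match sketch: energy is spent only if the low bits match."""
    mask = (1 << lsb_bits) - 1
    return (a & mask) == (b & mask)


def count_events(pairs):
    """Return (traditional, two-stage) counts of dissipating comparisons."""
    trad = sum(traditional_dissipates(a, b) for a, b in pairs)
    two = sum(two_stage_dissipates(a, b) for a, b in pairs)
    return trad, two


# Hypothetical 6-bit register id pairs, chosen so that most of them mismatch.
pairs = [(5, 5), (5, 21), (5, 6), (5, 7), (12, 3), (40, 41), (9, 30), (2, 1)]
print(count_events(pairs))
```

On this made-up sample the traditional comparator dissipates on 7 of the 8 comparisons, while the two-stage scheme dissipates on only the 2 pairs whose 4 LSBs agree.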
The Use of Zero-Byte Encoding
A large percentage of bytes travelling on result, commit and dispatch buses contain all zeroes.
This can be exploited by not writing such bytes into the ROB and not reading them from the ROB.
A separate bit (Zero Indicator Bit, ZIB) is used to distinguish such bytes. If a byte contains all zeroes, only the ZIB is read and written instead of 8 bits.
Circuits are similar to those presented in Ghose et al. (Koolchips, 2000) and Zhang et al. (MICRO’00).
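The encoding can be sketched in software as follows. This is an illustration of the idea only, not the circuit: the 32-bit word width, the function names, and the accounting that charges one ZIB access per byte plus 8 data bits for each nonzero byte are assumptions.

```python
def zero_byte_encode(value: int, width_bytes: int = 4):
    """Split a word into per-byte zero indicator bits (ZIBs) and nonzero bytes."""
    data = value.to_bytes(width_bytes, "little")
    zibs = [b == 0 for b in data]           # True marks an all-zero byte
    payload = [b for b in data if b != 0]   # only these bytes are stored
    return zibs, payload


def zero_byte_decode(zibs, payload):
    """Rebuild the original word from the ZIBs and the stored nonzero bytes."""
    stored = iter(payload)
    data = bytes(0 if z else next(stored) for z in zibs)
    return int.from_bytes(data, "little")


def bits_accessed(zibs):
    """One ZIB per byte is always accessed; 8 data bits only for nonzero bytes."""
    return len(zibs) + 8 * sum(not z for z in zibs)


zibs, payload = zero_byte_encode(0x00001200)
print(zibs, payload, bits_accessed(zibs), hex(zero_byte_decode(zibs, payload)))
```

For the value 0x00001200, three of the four bytes are zero, so under these assumptions 12 bits are accessed instead of 32.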
Percentage of Bytes Containing All 0’s: Results
On the average across all SPEC 95 benchmarks:
In sources being read from the ROB: 43%
In the result values written into the ROB and committed from the ROB: 41.5%
[Figure: bar chart of the percentage of bytes containing all zeroes (y-axis 0 to 80) on the dispatch bus, result bus, and commit bus for each SPEC 95 benchmark (compress, vortex, m88ksim, gcc, go, ijpeg, lisp, perl, turb3d, fpppp, apsi, applu, hydro2d, mgrid, su2cor, swim, tomcatv, wave5) and for the SPECint 95, SPECfp 95, and SPEC 95 averages.]
Experimental Setup (AccuPower, DATE’02)
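[Figure: AccuPower tool flow. Compiled SPEC benchmarks and datapath specs feed the microarchitectural simulator, which produces performance stats together with transition counts and context information. VLSI layout data is turned into a SPICE deck, and SPICE produces measures of energy per transition. The energy/power estimator combines these to produce power/energy stats.]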
Summary of the Results (SPEC 95 averages)
Dynamic ROB resizing:
UP=2048 cycles
SP=32 cycles
IPC drop 0.06% for OT=128
IPC drop 3.14% for OT=2048
Power savings range from 56% to 63%
Summary of the Results (SPEC 95 averages)
Comparators: 41% comparator power savings and 13% overall ROB power savings
Zero-byte encoding: 17% power savings
Three techniques combined: 70-76% power savings with negligible impact on performance
Concluding Remarks
Significant power reduction within the ROB can be realized by:
Dynamic ROB resizing
Use of dissipate-on-match comparators
Use of zero-byte encoding
Combined power savings are in the range of 70-76% with very small impact on performance.
Finally, all three techniques increase the ROB complexity. Can the ROB complexity be reduced?
Concluding Remarks
Yes! With a small IPC drop, we can completely eliminate the ROB read ports needed for reading the source operand values.
These account for 2W of the 5W ROB ports
Details are in Kucuk, Ponomarev and Ghose, ICS’02.
Combined, the techniques presented here and the ICS’02 solution make a case for reconsidering an architecture that integrates the physical register file with the ROB as a choice for implementing future high-performance microprocessors.