Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko...
Transcript of Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko...
![Page 1: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/1.jpg)
Pentium Pro Case Study
Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti
![Page 2: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/2.jpg)
Pentium Pro Case Study
• Microarchitecture
– Order-3 Superscalar
– Out-of-Order execution
– Speculative execution
– In-order completion
• Design Methodology
• Performance Analysis
• Retrospective
![Page 3: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/3.jpg)
Goals of P6 Microarchitecture
IA-32 Compliant
Performance (Frequency - IPC)
Validation
Die Size
Schedule
Power
![Page 4: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/4.jpg)
P6 – The Big Picture
MOB
DCU
IEU1AGU0 IEU0
Fadd
Fmul
Imul
Div
AGU1
01234
Reservation Station (20)
Dispatch
Decode
Fetch
2 Cycles
4 Cycles
2 Cycles
BTB/ICU
BAC/Rename
Allocation
2 Cycles
ROB
RRF
JEU(40x157)
2 cyc
![Page 5: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/5.jpg)
Memory Hierarchy
• Level 1 instruction and data caches - 2 cycle access time
• Level 2 unified cache - 6 cycle access time
• Separate level 2 cache and memory address/data bus
ICache(8KB)
DCache(8Kb)
BIUL2 Cache(256Kb)
Main Memory PCI
CPU
64 bit
16bytes
![Page 6: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/6.jpg)
Instruction Fetch
Cache
I n s t
. B
u f
I n s t
. R
o t a
t o r
Inst. Length
Decoder
Length
Inst.
Marks
P r e
d i c
t i o
n
M a r k
s
Instruction
Victim
ICache
Stream
TLB
Buffer
Physical Addr.
L2 Cache (256Kb)
F e t
c h A
d d r e
s s
Next Addr. Logic
Other Fetch
Requests
Branch Target Buffer (512) P r e
d i c
t i o
n
To Decode
1 6 b
y t e
s
1 6 b
y t e
s +
m a r
k s
I n s t
r u c t
i o n D
a t a
M u x
I n s t
r u c t
i o n D
a t a
(8Kb)
2 cycle
Branch Target
![Page 7: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/7.jpg)
Instruction Cache Unit
Stream Buffer
ICache(8 Kb)
Victim Cache
BusInterface
Unit
Data MuxInstructionTag Array
ITLBHit/MissInstruction
Data
Fetch Address
Lower 12 bits
Lower 12 bits
Upper20 bits
![Page 8: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/8.jpg)
Branch Target Buffer
F e t c
h A
d d
r . T
a g
4 - b
i t B
H R
B r .
O f f
s e t
4 - b
i t B
H R
s p
e c .
T a r
g e t
A d
d r .
Way 0 1
2 8
S e
t s
F e t c
h A
d d
r . T
a g
4 - b
i t B
H R
B r .
O f f
s e t
4 - b
i t B
H R
s p
e c .
T a r
g e t
A d
d r .
Way 1
F e t c
h A
d d
r . T
a g
4 - b
i t B
H R
B r .
O f f
s e t
4 - b
i t B
H R
s p
e c .
T a r
g e t
A d
d r .
Way 3
Tag Compare
PHT
1 6
e n
t r i e
s / s e
t
Return Stack
Prediction Control Logic
Prediction &
Target Addr.
Fetch Address
Pattern History Table (PHT) is not speculatively updated
A speculative Branch History Register (BHR) and prediction state is maintained
Uses speculative prediction state if it exist for that branch
![Page 9: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/9.jpg)
Branch Prediction Algorithm
Current prediction updates the speculative history prior to the next instance of the branch instruction
Branch History Register (BHR) is updated during branch execution
Branch recovery flushes front-end and drains the execution core
Branch mis-prediction resets the speculative branch history state to match BHR
0 0 1 0
1
0 1 0 1
Speculative History
Br. History
0000
0001
0010
0011
0100
0101
0110
1110
1111
1 0
1
.
.
.
Pattern TableStateMachine
0101
0010
11
10
Spec. Pred.
Branch Execution
Br. Pred.
![Page 10: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/10.jpg)
Instruction Decode - 1
Branch instruction detection
Branch address calculation - Static prediction and branch always execution
One branch decode per cycle (break on branch)
Instruction Buffer 16 bytes
Macro-Instruction Bytes from IFU
Decoder0
Decoder1
Decoder2
BranchAddress
Calc.
To NextAddress
Calc.
4 uops 1 uop 1 uop
Up to 3 uops Issued to dispatch
uop Queue (6)
uROM
![Page 11: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/11.jpg)
Instruction Decode - 2
Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled
Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0
Instruction Buffer 16 bytes
Macro-Instruction Bytes from IFU
Decoder0
Decoder1
Decoder2
BranchAddress
Calc.
To NextAddress
Calc.
4 uops 1 uop 1 uop
Up to 3 uops Issued to dispatch
uop Queue (6)
uROM
![Page 12: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/12.jpg)
What is a uop?
Small two-operand instruction - Very RISC like.
IA-32 instruction
add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx)
Uop decomposition:
ld guop0, (eax) guop0 <- MEM(eax)
ld guop1, (ebx) guop1 <- MEM(ebx)
add guop0,guop1 guop0 <- guop0 + guop1
sta eax
std guop0 MEM(eax) <- guop0
![Page 13: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/13.jpg)
Instruction Dispatch
Register Renaming
Allocation requirements
“3-or-none” Reorder buffer entries
Reservation station entry
Load buffer or store buffer entry
Dispatch buffer “probably” dispatches all 3 uops before re-fill
Renaming
Allocator
u o
P Q
u e u
e ( 6
)
D i s
p a
t c h
B u
f f e r
( 3 )
Mux Logic
To Reservation
Station
Retirement Info
2 cycles
![Page 14: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/14.jpg)
Register Renaming - 1
Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags
The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register
Execution results are stored in the ROB
Integer RATEAXEBXECX
Floating Point RATFST0FST1FST2
FST7
GuoP0GuoP1
Real Register File (RRF) Reorder Buffer (ROB)0123456789
39
EAXEBXECX
FST0FST1
GuoP0GuoP1
IuoP(0-3)CC/Events
8
8
12
49
![Page 15: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/15.jpg)
Register Renaming - Example
© Shen, Lipasti 15
Integer RATEAXEBXECX
Floating Point RATFST0FST1FST2
FST7
GuoP0GuoP1
Real Register File (RRF) Reorder Buffer (ROB)0123456789
39
EAXEBXECX
FST0FST1
GuoP0GuoP1
IuoP(0-3)CC/Events
8
8
12
49
Dispatching:
add eax, ebx
add eax, ecx
fxch f0, f1
Completing:
sub eax, ecx
AllocComp sub
![Page 16: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/16.jpg)
Challenges to Register Renaming Integer RAT
EAXEBXECX
Floating Point RATFST0FST1FST2
FST7
GuoP0GuoP1
Real Register File (RRF) Reorder Buffer (ROB)0123456789
39
EAXEBXECX
FST0FST1
GuoP0GuoP1
IuoP(0-3)CC/Events
8
8
12
49
8-bit code
mov AL, #data1
mov AH, #data2
add AL, #data3
add AL, #data4
Byte addressable registers
![Page 17: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/17.jpg)
Out-of-Order Execution Engine
• In-order branch issue and execution
• In-order load/store issue to address generation units
• Instruction execution and result bus scheduling
• Is the reservation station “truly” centralized & what is “binding”?
IEU1AGU0 IEU0
Fadd
Fmul
Imul
Div
AGU1
MOB
DCU (8Kb)
01234
Reservation Station (20) 2 Cycles
JEU
RSbypass
![Page 18: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/18.jpg)
Reservation Station
• Cycle 1
– Order checking
– Operand availability
• Cycle 2
– Writeback bus scheduling
Cycle 1
Cycle 2
To Execution Units
Port 0Port 1Port 2Port 3Port 4
From Dispatch Queue
![Page 19: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/19.jpg)
Memory Ordering Buffer (MOB)
• Load buffer retains loads until completed, for coherency checking
• Store forwarding out of store buffers
• 2 cycle latency through MOB
• “Store Coloring” - Load instructions are tagged by the last store
ConflictLogic
Load Buffer Store DataBuffer (12)
Store AddressBuffer (12)(16)
Data Cache Unit (8Kb)
BypassLogicControl
Load Data Result
AGU0 AGU1 R/S
MOB
2 cycle
2 cycle
![Page 20: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/20.jpg)
Instruction Completion
• Handles all exception/interrupt/trap conditions
• Handles branch recovery
– OOO core drains out right-path instructions, commits to RRF
– In parallel, front end starts fetching from target/fall-through
– However, no renaming is allowed until OOO core is drained
– After draining is done, RAT is reset to point to RRF
– Avoids checkpointing RAT, recovering to intermediate RAT state
• Commits execution results to the architectural state in-order
– Retirement Register File (RRF)
– Must handle hazards to RRF (writes/reads in same cycle)
– Must handle hazards to RAT (writes/reads in same cycle)
• “Atomic” IA-32 instruction completion
– uops are marked as 1st or last in sequence
– exception/interrupt/trap boundary
• 2 cycle retirement
![Page 21: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/21.jpg)
Pentium Pro Design Methodology - 1
Logic Design
Structural Model
Conception
Silicon Debug
Behavioral Model
Circuit Design
Layout
Performance Model
Microarchitecture
Start 1990
Finish 4Q
1 year
1.75 years
1 year
1994
0.75 years
![Page 22: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/22.jpg)
Pentium Pro Performance Analysis
• Observability
– On-chip event counters
– Dynamic analysis
• Benchmark Suite
– BAPco Sysmark32 - 32-bit Windows NT applications
– Winstone97 - 32-bit Windows NT applications
– Some SPEC95 benchmarks
![Page 23: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/23.jpg)
Performance – Run Times
avs
microstation
Pwave
PhotoShop
MSVCCorel excel
PageMaker
paradoxPowerPnt
word
WordProSYSexcelSYSword7SYSCorel6
SYSflw96SYSexcel7
SYSpowerpnt7SYSParadox7
SYSpagemaker6
compress
ligo
0
2E+10
4E+10
6E+10
8E+10
1E+11
1.2E+11
avs
microstation
Pwave
PhotoShop
MSVCCorel excel
PageMaker
paradoxPowerPnt
word
WordProSYSexcelSYSword7SYSCorel6
SYSflw96SYSexcel7
SYSpowerpnt7SYSParadox7
SYSpagemaker6
compress
ligo
Unhalted User-Mode Processor Cycles
Total of 27.5 billion cycles
User-Mode Processor Cycles
![Page 24: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/24.jpg)
Performance – IPC vs. uPC
WS-AVS
WS-Microstation
WS-PWave
WS-PhotoShop
WS-MSVC++
WS-CorelDraw
WS-Excel
WS-PageMaker
WS-Paradox
WS-PowerPnt
WS-Word
WS-WordPro
SYS-ExcelSYS-Word
SYS-Core lDraw
SYS-FlW96SYS-Excel
SYS-PowerPnt
SYS-Paradox
SYS-PagemakerSPEC-Compress
SPEC-Li SPEC-Go
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
# Retired Per Cycle
WS-AVS
WS-Microstation
WS-PWave
WS-PhotoShop
WS-MSVC++
WS-CorelDraw
WS-Excel
WS-PageMaker
WS-Paradox
WS-PowerPnt
WS-Word
WS-WordPro
SYS-ExcelSYS-Word
SYS-Core lDraw
SYS-FlW96SYS-Excel
SYS-PowerPnt
SYS-Paradox
SYS-PagemakerSPEC-Compress
SPEC-Li SPEC-Go
Instructions and UOPs Retired Per Cycle
Inst retired
UOPS retired
2 uops/instruction
Instructions and Uops retired per cycle
![Page 25: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/25.jpg)
Performance – IPC vs. uPC
3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst.
0
10
20
30
40
50
60
70
80
90
100
Percent of Cycles where n=0-3 UOPs or
Instructions were Retired
3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst.
Inst. or UOPS retired per cycle
AVS
PageMaker
Paradox
PowerPoint
WordPro
Inst. or Uops retired per cycle
![Page 26: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/26.jpg)
Performance – Cache Misses
WS-AVS
WS-Microstation
WS-PWave
WS-PhotoShop
WS-MSVC++
WS-CorelDraw
WS-Excel
WS-PageMaker
WS-ParadoxWS-PowerPnt
WS-Word
WS-WordPro
SYS-ExcelSYS-Word
SYS-CorelDraw
SYS-FlW96SYS-Excel
SYS-PowerPntSYS-Paradox
SYS-PagemakerSPEC-Compress
SPEC-LiSPEC-Go
0
0.5
1
1.5
2
2.5
% Unhalted Cycles
WS-AVS
WS-Microstation
WS-PWave
WS-PhotoShop
WS-MSVC++
WS-CorelDraw
WS-Excel
WS-PageMaker
WS-ParadoxWS-PowerPnt
WS-Word
WS-WordPro
SYS-ExcelSYS-Word
SYS-CorelDraw
SYS-FlW96SYS-Excel
SYS-PowerPntSYS-Paradox
SYS-PagemakerSPEC-Compress
SPEC-LiSPEC-Go
Cache Misses Per Cycle
IFU Misses / Cycle
DCU Misses / Cycle
L2 Misses / Cycle
IL1 - 0.51%DL1 - 1.15%L2 - 0.25%
Cache Misses Per Cycle
![Page 27: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/27.jpg)
Performance – Branch Prediction
W S-AVS
W S-M icrostat ion
W S-PW ave
WS-P hoto Shop
WS -MS VC+ +
W S-C orelD raw
W S-Exce l
W S-Pag eMa ker
W S-P aradoxWS- Pow erPnt
W S-Wo rd
W S-W ordPro
SYS -ExcelSY S-W ord
SYS-C orelD raw
SYS-F lW 96SYS-E xcel
SYS -Pow erPntSY S-Parado x
S YS-Page makerSPEC -Co mpre ss
SPEC -LiSPEC -Go
0%
5%
10%
15%
20%
25%
30%
35%
40%
45%
% Branches that Miss in BTB
W S-AVS
W S-M icrostat ion
W S-PW ave
WS-P hoto Shop
WS -MS VC+ +
W S-C orelD raw
W S-Exce l
W S-Pag eMa ker
W S-P aradoxWS- Pow erPnt
W S-Wo rd
W S-W ordPro
SYS -ExcelSY S-W ord
SYS-C orelD raw
SYS-F lW 96SYS-E xcel
SYS -Pow erPntSY S-Parado x
S YS-Page makerSPEC -Co mpre ss
SPEC -LiSPEC -Go
BTB Miss Rate
WS-AVS
WS-Microstation
WS-PWave
WS-PhotoShop
WS-MSVC++
WS-CorelDraw
WS-Excel
WS-PageMaker
WS-ParadoxWS-PowerPnt
WS-Word
WS-WordPro
SYS-ExcelSYS-Word
SYS-CorelDraw
SYS-FlW96SYS-Excel
SYS-PowerPntSYS-Paradox
SYS-PagemakerSPEC-Compress
SPEC-LiSPEC-Go
0%
2%
4%
6%
8%
10%
12%
14%
16%
% Branches Mispredicted
WS-AVS
WS-Microstation
WS-PWave
WS-PhotoShop
WS-MSVC++
WS-CorelDraw
WS-Excel
WS-PageMaker
WS-ParadoxWS-PowerPnt
WS-Word
WS-WordPro
SYS-ExcelSYS-Word
SYS-CorelDraw
SYS-FlW96SYS-Excel
SYS-PowerPntSYS-Paradox
SYS-PagemakerSPEC-Compress
SPEC-LiSPEC-Go
Branch Mispredict Rate
BTB Miss Rate
Branch Mispredict Rate
6.8% avg
![Page 28: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/28.jpg)
Conclusions
IA-32 Compliant
Performance (Frequency - IPC) 366.0 ISpec92
283.2 FSpec92
8.09 SPECint95
6.70 SPECfp95
Validation
Die Size - Fabable Schedule - 1 year late Power -
![Page 29: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/29.jpg)
Retrospective • Most commercially successful
microarchitecture in history
• Evolution
– Pentium II/III, Xeon, etc.
• Derivatives with on-chip L2, ISA extensions, etc.
– Replaced by Pentium 4 as flagship in 2001
• High frequency, deep pipeline, extreme speculation
– Resurfaced as Pentium M in 2003
• Initially a response to Transmeta in laptop market
• Pentium 4 derivative (90nm Prescott) delayed, slow, hot
– Core Duo, Core 2 Duo, Core i7 replaced Pentium 4 © Shen, Lipasti 29
![Page 30: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/30.jpg)
Microarchitectural Updates • Pentium M (Banias), Core Duo (Yonah)
– Micro-op fusion (also in AMD K7/K8) • Multiple uops in one: (add eax,[mem] => ld/alu), sta/std
• These uops decode/dispatch/commit once, issue twice
– Better branch prediction • Loop count predictor
• Indirect branch predictor
– Slightly deeper pipeline (12 stages) • Extra decode stage for micro-op fusion
• Extra stage between issue and execute (for RS/PLRAM read)
– Data-capture reservation station (payload RAM) • Clock gated for 32 (int) , 64 (fp), and 128 (SSE) operands
© Shen, Lipasti 30
![Page 31: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/31.jpg)
Microarchitectural Updates • Core 2 Duo (Merom)
– 64-bit ISA from AMD K8
– Macro-op fusion • Merge uops from two x86 ops
• E.g. cmp, jne => cmpjne
– 4-wide decoder (Complex + 3x Simple) • Peak x86 decode throughput is 5 due to macro-op fusion
– Loop buffer • Loops that fit in 18-entry instruction queue avoid fetch/decode
overhead
– Even deeper pipeline (14 stages)
– Larger reservation station (32), instruction window (96)
– Memory dependence prediction © Shen, Lipasti 31
![Page 32: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/32.jpg)
Microarchitectural Updates • Nehalem (Core i7/i5/i3)
– RS size 36, ROB 128
– Loop cache up to 28 uops
– L2 branch predictor
– L2 TLB
– I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M
– Simultaneous multithreading
– RAS now renamed (repaired)
– 6 issue, 48 load buffers, 32 store buffers
– New system interface (QPI) – finally dropped front-side bus
– Integrated memory controller (up to 3 channels)
– New STTNI instructions for string/text handling
© Shen, Lipasti 32
![Page 33: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/33.jpg)
Microarchitectural Updates • Sandybridge/Ivy Bridge (2nd-3rd generation Core i7)
– On-chip integrated graphics (GPU)
– Decoded uop cache up to 1.5K uops, handles loops, but more general
– 54-entry RS, 168-entry ROB
– Physical register file: 144 FP, 160 integer
– 256-bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle
– 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128-bit load path from L1 D$
© Shen, Lipasti 33
![Page 34: Pentium Pro Case Study - ECE/CS 752 Fall 2019 · 2019-11-25 · Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed76ff99ab714631510fd3f/html5/thumbnails/34.jpg)
Microarchitectural Updates • Haswell/Broadwell/Skylake: wider & deeper
– 8-wide issue (up from 6 wide)
– 4th integer ALU, third AGU, second branch unit
– 60-entry RS, 192-entry ROB
– 72-entry load queue/42-entry store queue
– Physical register file: 168 FP, 168 integer
– Doubled FP throughput (32 SP/16 DP)
– Load/store bandwidth to L1 doubled (64B/32B)
– TSX (transactional memory)
– Integrated voltage regulator
© Shen, Lipasti 34