IBM's POWER5 Microprocessor Design and Methodology
Transcript of IBM's POWER5 Microprocessor Design and Methodology
© 2003 IBM Corporation
IBM's POWER5 Microprocessor Design and Methodology
Ron KallaIBM Systems Group
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Outline
§ Motivation§ Background§ Threading Fundamentals§ Enhanced SMT Implementation in POWER5§ Memory Subsystem Enhancements § Additional SMT Considerations§ Summary
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Microprocessor Design Optimization Focus Areas
§ Memory latency4 Increased processor speeds make memory appear further away
4 Longer stalls possible§ Branch processing
4 Mispredict more costly as pipeline depth increases resulting in stalls and wasted power
4 Predication drives increased power and larger chip area§ Execution Unit Utilization
4 Currently 20-25% execution unit utilization common§ Simultaneous multi-threading (SMT) and POWER architecture
address these areas
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
POWER4 --- Shipped in Systems December 2001§ Technology: 180nm lithography,
Cu, SOI4 POWER4+ shipping in 130nm today
4 267mm2 185M transistors§ Dual processor core§ 8-way superscalar
4 Out of Order execution
4 2 Load / Store units
4 2 Fixed Point units
4 2 Floating Point units
4 Logical operations on Condition Register
4 Branch Execution unit§ > 200 instructions in flight§ Hardware instruction and data
prefetch
L3
Dir
ecto
ry/C
on
tro
lL2 L2 L2
LSU LSUIFUBXU
IDU IDU
IFUBXU
FPU FPU
FX
U
FXUISU ISU
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
POWER5 --- The Next Step
§ Technology: 130nm lithography, Cu, SOI§ 389mm2 276M Transistors§ Dual processor core§ 8-way superscalar§ Simultaneous multithreaded
(SMT) core4 Up to 2 virtual processors per
real processor
4 Natural extension to POWER4 design
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Multi-threading Evolution
Thread 1 ExecutingThread 0 Executing No Thread Executing
FX0FX1FP0FP1LS0LS1BRXCRL
Single ThreadFX0FX1FP0FP1LS0LS1BRXCRL
Coarse Grain Threading
FX0FX1FP0FP1LS0LS1BRXCRL
Fine Grain Threading
FX0FX1FP0FP1LS0LS1BRXCRL
Simultaneous Multi-Threading
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
BR, CRLUnits
FPUsFPUs
FXUs,LSUs
CR, LR,CTR
FPRs
GPRsIntegerIssue Qs
FPIssue Qs
BR/CRUIssue Qs
RegisterRename
FetchUnit
I-Cache
Decode
PC
BR, CRLUnits
FPUsFPUs
FXUs,LSUs
CR, LR,CTR
FPRs
GPRs
PC
IntegerIssue Qs
FPIssue Qs
BR/CRUIssue Qs
RegisterRename
FetchUnit
I-Cache
Decode
Changes Going From ST to SMT Core§ SMT easily added to Superscalar Micro-architecture
4 Second Program Counter (PC) added to share I-fetch bandwidth4 GPR/FPR rename mapper expanded to map second set of registers
(High order address bit indicates thread)4 Completion logic replicated to track two threads4 Thread bit added to most address/tag buses
DataCache
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Resource Sizes
Results based on simulation of an online transaction processing applicationVertical axis does not originate at 0
~~
§ Analysis done to optimize every micro-architectural resource size4GPR/FPR rename pool size
4I-fetch buffers
4Reservation Station
4SLB/TLB/ERAT
4I-cache/D-cache§Many Workloads examined§ Associativity also examined
50 60 70 80 90 100 110 120 13080 120
Number of GPR Renames
IPC
ST SMT
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
POWER5 Resources Size Enhancements§ Enhanced caches and translation resources
4 I-cache: 64 KB, 2-way set associative, LRU
4 D-cache: 32 KB, 4-way set associative, LRU
4 First level Data Translation: 128 entries, fully associative, LRU
4 L2 Cache: 1.92 MB, 10-way set associative, LRU§ Larger resource pools
4 Rename registers: GPRs, FPRs increased to 120 each4 L2 cache coherency engines: increased by 100%
§ Enhanced data stream prefetching§ Memory controller moved on chip
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Resource Sharing
Results based on simulation of an online transaction processing application
§ Threads share many resources
4 Global Completion Table, BHT, TLB, . . .
§ Higher performance realized when resources balanced across threads
4 Tendency to drift toward extremes accompanied by reduced performance
§ Solution: Dynamically adjust resource utilization
0
2
4
6
8
10
Rel
ativ
e O
ccu
rren
ce
0 5 10 15 20
05
1015
20
Thread 0
Threa
d 1
With dynamic resource utilization
adjustment
0
2
4
6
8
10
Rel
ativ
e O
ccu
rren
ce
0 5 10 15 20
05
1015
20
Thread 0
Threa
d 1
Without dynamic resource utilization
adjustment
Global Completion Table Occupancy
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Thread Priority§ Instances when unbalanced
execution desirable4 No work for opposite thread
4 Thread waiting on lock
4 Software determined non uniform balance
4 Power management
4 …
§ Solution: Control instruction decode rate
4 Software/hardware controls 8 priority levels for each thread
0
0
0
1
1
1
1
1
2
2
IPC
-5 -3 -1 0 1 3 50,7 7,0 1,1
Thread 1 Priority - Thread 0 Priority
Thread 0 IPC Thread 1 IPCPower Save Mode
Single Thread Mode
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Dynamic Thread Switching
§ Used if no task ready for second thread to run§ Allocates all machine resources to
one thread§ Initiated by software§ Dormant thread wakes up on:
4 External interrupt
4 Decrementer interrupt
4 Special instruction from active thread
Active
Dormant
Null
software
hardware or software
software
software
Thread States
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Single Thread Operation
§ Advantageous for execution unit limited applications
4 Floating or fixed point intensive workloads
§ Execution unit limited applications provide minimal performance leverage for SMT4 Extra resources necessary for SMT
provide higher performance benefit when dedicated to single thread
§ Determined dynamically on a per processor basis
PO
WE
R4+
PO
WE
R5
ST
Matrix Multiply
IPC
PO
WE
R5
SM
T
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Modifications to POWER4 System Structure
P P
L2
Memory
Fab Ctl
P P
L2
Memory
Fab Ctl
L3 L3
Mem Ctl Mem Ctl
L3
P P
L2
Memory
Fab Ctl
P P
L2
L3
Memory
Fab Ctl
Mem Ctl Mem Ctl
POWER4Systems
POWER5Systems
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Memory
POWER5L3
POWER5 POWER5 POWER5
Memory Memory
MCM
I/O I/O I/O I/O
L3 L3 L3
Memory
16-way Building Block
MCM
I/O
POWER5
Memory
L3
BookI/OMemory
POWER5L3
I/O
POWER5
Memory
L3
I/O
POWER5
Memory
L3
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
POWER5 Multi-chip Module
§ 95mm % 95mm§ Four POWER5 chips§ Four cache chips§ 4,491 signal I/Os§ 89 layers of metal
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
64-way SMP Interconnection
Interconnection exploits enhanced distributed switch§ All chip interconnections operate at half processor frequency
and scale with processor frequency
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
POWER4 and POWER5 Storage Hierarchy
Processor speed½ processor speedIntra-MCM data buses
TypeEnhanced distributed switch
Distributed switch
10-way, LRU8-way, LRUAssociativity, replacement
1.92 MB, 128 B line1.44 MB, 128 B lineCapacity, line size
1024 GB (1 TB) maximum
512 GB maximumMemory
½ processor speed½ processor speedInter-MCM data buses
Chip interconnect
12-way, LRU8-way, LRUAssociativity, replacement
36 MB, 256 B line32 MB, 512 B lineCapacity, line size
Off-chip L3 Cache
L2 Cache
POWER5POWER4
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Other SMT Considerations
§ Power Management4 SMT Increases execution unit utilization
4 Dynamic power management does not impact performance§ Debug tools / Lab bring-up
4 Instruction tracing
4 Hang detection
4 Forward progress monitor§ Performance Monitoring§ Serviceability
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Autonomic Computing Enhancements
2001
POWER4 2006*
POWER6
65 nm
L2 caches
Ultra high frequency cores
AdvancedSystem Features
Chip Multi Processing- Distributed Switch- Shared L2Dynamic LPARs (16)
180 nm
1.3 GHzCore
1.3 GHzCore
Distributed Switch
Shared L2
2002-3POWER4+
1.7 GHz Core
1.7 GHz Core
130 nm
Reduced sizeLower powerLarger L2More LPARs (32)
Shared L2
Distributed Switch
2004*POWER5
2005*POWER5+
Simultaneous multi-threadingSub-processor partitioningDynamic firmware updatesEnhanced scalability, parallelismHigh throughput performanceEnhanced memory subsystem
90 nm
Shared L2
>> GHz Core
>> GHz Core
Distributed Switch
130 nm
> GHzCore
> GHz Core
Distributed Switch
Shared L2
POWER Server Roadmap
**Planned to be offered by IBM. All statements about IBM’s future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.
© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003
IBM’s POWER5 Micro Processor Design and Methodology
Summary
§ POWER5 SMT implementation is more than SMT4 Good ROI for silicon area: Performance gain > Area increase
4 Resource sizes optimized
4 Dynamic feedback enhances instruction throughput
4 Software controlled priority exploits machine architecture
4 Dynamic ST to/from SMT mode capability optimizes system resources
§ SMT impacts pervasive throughout chip
§ Storage subsystem scalable to 64 Processor/ 128 Threads
§ Operating in laboratory4 AIX, Linux and OS/400 booted and running