IBM's POWER5 Microprocessor Design and Methodology

© 2003 IBM Corporation

IBM's POWER5 Microprocessor Design and Methodology

Ron KallaIBM Systems Group

© 2003 IBM CorporationUT Computer Architecture Seminar, November 2003

IBM’s POWER5 Micro Processor Design and Methodology

Outline

§ Motivation§ Background§ Threading Fundamentals§ Enhanced SMT Implementation in POWER5§ Memory Subsystem Enhancements § Additional SMT Considerations§ Summary



Microprocessor Design Optimization Focus Areas

§ Memory latency4 Increased processor speeds make memory appear further away

4 Longer stalls possible§ Branch processing

4 Mispredict more costly as pipeline depth increases resulting in stalls and wasted power

4 Predication drives increased power and larger chip area§ Execution Unit Utilization

4 Currently 20-25% execution unit utilization common§ Simultaneous multi-threading (SMT) and POWER architecture

address these areas



POWER4 --- Shipped in Systems December 2001§ Technology: 180nm lithography,

Cu, SOI4 POWER4+ shipping in 130nm today

4 267mm2 185M transistors§ Dual processor core§ 8-way superscalar

4 Out of Order execution

4 2 Load / Store units

4 2 Fixed Point units

4 2 Floating Point units

4 Logical operations on Condition Register

4 Branch Execution unit§ > 200 instructions in flight§ Hardware instruction and data

prefetch

L3

Dir

ecto

ry/C

on

tro

lL2 L2 L2

LSU LSUIFUBXU

IDU IDU

IFUBXU

FPU FPU

FX

U

FXUISU ISU



POWER5 --- The Next Step

§ Technology: 130nm lithography, Cu, SOI§ 389mm2 276M Transistors§ Dual processor core§ 8-way superscalar§ Simultaneous multithreaded

(SMT) core4 Up to 2 virtual processors per

real processor

4 Natural extension to POWER4 design



Multi-threading Evolution

Thread 1 ExecutingThread 0 Executing No Thread Executing

FX0FX1FP0FP1LS0LS1BRXCRL

Single ThreadFX0FX1FP0FP1LS0LS1BRXCRL

Coarse Grain Threading


Fine Grain Threading


Simultaneous Multi-Threading



BR, CRLUnits

FPUsFPUs

FXUs,LSUs

CR, LR,CTR

FPRs

GPRsIntegerIssue Qs

FPIssue Qs

BR/CRUIssue Qs

RegisterRename

FetchUnit

I-Cache

Decode

PC

BR, CRLUnits

FPUsFPUs

FXUs,LSUs

CR, LR,CTR

FPRs

GPRs

PC

IntegerIssue Qs

FPIssue Qs

BR/CRUIssue Qs

RegisterRename

FetchUnit

I-Cache

Decode

Changes Going From ST to SMT Core§ SMT easily added to Superscalar Micro-architecture

4 Second Program Counter (PC) added to share I-fetch bandwidth4 GPR/FPR rename mapper expanded to map second set of registers

(High order address bit indicates thread)4 Completion logic replicated to track two threads4 Thread bit added to most address/tag buses

DataCache



Resource Sizes

Results based on simulation of an online transaction processing applicationVertical axis does not originate at 0

~~

§ Analysis done to optimize every micro-architectural resource size4GPR/FPR rename pool size

4I-fetch buffers

4Reservation Station

4SLB/TLB/ERAT

4I-cache/D-cache§Many Workloads examined§ Associativity also examined

50 60 70 80 90 100 110 120 13080 120

Number of GPR Renames

IPC

ST SMT



POWER5 Resources Size Enhancements§ Enhanced caches and translation resources

4 I-cache: 64 KB, 2-way set associative, LRU

4 D-cache: 32 KB, 4-way set associative, LRU

4 First level Data Translation: 128 entries, fully associative, LRU

4 L2 Cache: 1.92 MB, 10-way set associative, LRU§ Larger resource pools

4 Rename registers: GPRs, FPRs increased to 120 each4 L2 cache coherency engines: increased by 100%

§ Enhanced data stream prefetching§ Memory controller moved on chip



Resource Sharing

Results based on simulation of an online transaction processing application

§ Threads share many resources

4 Global Completion Table, BHT, TLB, . . .

§ Higher performance realized when resources balanced across threads

4 Tendency to drift toward extremes accompanied by reduced performance

§ Solution: Dynamically adjust resource utilization

0

2

4

6

8

10

Rel

ativ

e O

ccu

rren

ce

0 5 10 15 20

05

1015

20

Thread 0

Threa

d 1

With dynamic resource utilization

adjustment

0

2

4

6

8

10

Rel

ativ

e O

ccu

rren

ce

0 5 10 15 20

05

1015

20

Thread 0

Threa

d 1

Without dynamic resource utilization

adjustment

Global Completion Table Occupancy



Thread Priority§ Instances when unbalanced

execution desirable4 No work for opposite thread

4 Thread waiting on lock

4 Software determined non uniform balance

4 Power management

4 …

§ Solution: Control instruction decode rate

4 Software/hardware controls 8 priority levels for each thread

0

0

0

1

1

1

1

1

2

2

IPC

-5 -3 -1 0 1 3 50,7 7,0 1,1

Thread 1 Priority - Thread 0 Priority

Thread 0 IPC Thread 1 IPCPower Save Mode

Single Thread Mode



Dynamic Thread Switching

§ Used if no task ready for second thread to run§ Allocates all machine resources to

one thread§ Initiated by software§ Dormant thread wakes up on:

4 External interrupt

4 Decrementer interrupt

4 Special instruction from active thread

Active

Dormant

Null

software

hardware or software

software

software

Thread States



Single Thread Operation

§ Advantageous for execution unit limited applications

4 Floating or fixed point intensive workloads

§ Execution unit limited applications provide minimal performance leverage for SMT4 Extra resources necessary for SMT

provide higher performance benefit when dedicated to single thread

§ Determined dynamically on a per processor basis

PO

WE

R4+

PO

WE

R5

ST

Matrix Multiply

IPC

PO

WE

R5

SM

T



Modifications to POWER4 System Structure

P P

L2

Memory

Fab Ctl

P P

L2

Memory

Fab Ctl

L3 L3

Mem Ctl Mem Ctl

L3

P P

L2

Memory

Fab Ctl

P P

L2

L3

Memory

Fab Ctl

Mem Ctl Mem Ctl

POWER4Systems

POWER5Systems



Memory

POWER5L3

POWER5 POWER5 POWER5

Memory Memory

MCM

I/O I/O I/O I/O

L3 L3 L3

Memory

16-way Building Block

MCM

I/O

POWER5

Memory

L3

BookI/OMemory

POWER5L3

I/O

POWER5

Memory

L3

I/O

POWER5

Memory

L3



POWER5 Multi-chip Module

§ 95mm % 95mm§ Four POWER5 chips§ Four cache chips§ 4,491 signal I/Os§ 89 layers of metal



64-way SMP Interconnection

Interconnection exploits enhanced distributed switch§ All chip interconnections operate at half processor frequency

and scale with processor frequency



POWER4 and POWER5 Storage Hierarchy

Processor speed½ processor speedIntra-MCM data buses

TypeEnhanced distributed switch

Distributed switch

10-way, LRU8-way, LRUAssociativity, replacement

1.92 MB, 128 B line1.44 MB, 128 B lineCapacity, line size

1024 GB (1 TB) maximum

512 GB maximumMemory

½ processor speed½ processor speedInter-MCM data buses

Chip interconnect

12-way, LRU8-way, LRUAssociativity, replacement

36 MB, 256 B line32 MB, 512 B lineCapacity, line size

Off-chip L3 Cache

L2 Cache

POWER5POWER4



Other SMT Considerations

§ Power Management4 SMT Increases execution unit utilization

4 Dynamic power management does not impact performance§ Debug tools / Lab bring-up

4 Instruction tracing

4 Hang detection

4 Forward progress monitor§ Performance Monitoring§ Serviceability



Autonomic Computing Enhancements

2001

POWER4 2006*

POWER6

65 nm

L2 caches

Ultra high frequency cores

AdvancedSystem Features

Chip Multi Processing- Distributed Switch- Shared L2Dynamic LPARs (16)

180 nm

1.3 GHzCore

1.3 GHzCore

Distributed Switch

Shared L2

2002-3POWER4+

1.7 GHz Core

1.7 GHz Core

130 nm

Reduced sizeLower powerLarger L2More LPARs (32)

Shared L2

Distributed Switch

2004*POWER5

2005*POWER5+

Simultaneous multi-threadingSub-processor partitioningDynamic firmware updatesEnhanced scalability, parallelismHigh throughput performanceEnhanced memory subsystem

90 nm

Shared L2

>> GHz Core

>> GHz Core

Distributed Switch

130 nm

> GHzCore

> GHz Core

Distributed Switch

Shared L2

POWER Server Roadmap

**Planned to be offered by IBM. All statements about IBM’s future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.



Summary

§ POWER5 SMT implementation is more than SMT4 Good ROI for silicon area: Performance gain > Area increase

4 Resource sizes optimized

4 Dynamic feedback enhances instruction throughput

4 Software controlled priority exploits machine architecture

4 Dynamic ST to/from SMT mode capability optimizes system resources

§ SMT impacts pervasive throughout chip

§ Storage subsystem scalable to 64 Processor/ 128 Threads

§ Operating in laboratory4 AIX, Linux and OS/400 booted and running

IBM's POWER5 Microprocessor Design and Methodology

Documents

Transcript of IBM's POWER5 Microprocessor Design and Methodology