Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems...

15
© 2003 IBM Corporation Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Transcript of Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems...

Page 1: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM Corporation

Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor

Ron Kalla, Balaram Sinharoy, Joel TendlerIBM Systems Group

Page 2: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Outline

MotivationBackgroundThreading FundamentalsEnhanced SMT Implementation in POWER5 Additional SMT ConsiderationsSummary

Page 3: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Microprocessor Design Optimization Focus Areas

Memory latencyIncreased processor speeds make memory appear further away

Longer stalls possibleBranch processing

Mispredict more costly as pipeline depth increases resulting in stalls and wasted power

Predication drives increased power and larger chip areaExecution Unit Utilization

Currently 20-25% execution unit utilization commonSimultaneous multi-threading (SMT) and POWER architecture address these areas

Page 4: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

POWER4 --- Shipped in Systems December 2001

Technology: 180nm lithography, Cu, SOI

POWER4+ shipping in 130nm todayDual processor core8-way superscalar

Out of Order execution

2 Load / Store units

2 Fixed Point units

2 Floating Point units

Logical operations on Condition Register

Branch Execution unit> 200 instructions in flightHardware instruction and data prefetch L3

Dire

ctor

y/C

ontr

ol

L2 L2 L2

LSU LSUIFUBXU

IDU IDU

IFUBXU

FPU FPU

FXU

FXUISU ISU

Page 5: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

POWER5 --- The Next Step

Technology: 130nm lithography, Cu, SOIDual processor core8-way superscalarSimultaneous multithreaded (SMT) core

Up to 2 virtual processors per real processor

24% area growth per core for SMT

Natural extension to POWER4 design

Page 6: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Multi-threading Evolution

FX0FX1FP0FP1LS0LS1BRXCRL

Single ThreadFX0FX1FP0FP1LS0LS1BRXCRL

Coarse Grain Threading

FX0FX1FP0FP1LS0LS1BRXCRL

Fine Grain Threading

FX0FX1FP0FP1LS0LS1BRXCRL

Simultaneous Multi-Threading

Thread 1 ExecutingThread 0 Executing No Thread Executing

Page 7: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Changes Going From ST to SMT CoreSMT easily added to Superscalar Micro-architecture

Second Program Counter (PC) added to share I-fetch bandwidthGPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread)Completion logic replicated to track two threadsThread bit added to most address/tag buses

BR, CRLUnits

FPUsFPUs

FXUs,LSUs

CR, LR,CTR

FPRs

GPRsIntegerIssue Qs

FPIssue Qs

BR/CRUIssue Qs

RegisterRename

FetchUnit

I-Cache

Decode

PC

BR, CRLUnits

FPUsFPUs

FXUs,LSUs

CR, LR,CTR

FPRs

GPRs

PC

IntegerIssue Qs

FPIssue Qs

BR/CRUIssue Qs

RegisterRename

FetchUnit

I-Cache

Decode

DataCache

Page 8: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Resource Sizes

~~

Analysis done to optimize every micro-architectural resource size

GPR/FPR rename pool size

I-fetch buffers

Reservation Station

SLB/TLB/ERAT

I-cache/D-cacheMany Workloads examinedAssociativity also examined

50 60 70 80 90 100 110 120 13080 120

Number of GPR Renames

IPC

ST SMT

Results based on simulation of an online transaction processing applicationVertical axis does not originate at 0

Page 9: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Resource Sharing

Results based on simulation of an online transaction processing application

Threads share many resources Global Completion Table, BHT, TLB, . . .

Higher performance realized when resources balanced across threads

Tendency to drift toward extremes accompanied by reduced performance

Solution: Dynamically adjust resource utilization

0

2

4

6

8

10

Rel

ativ

e O

ccur

renc

e

0 5 10 15 20

05

1015

20

Thread 0

Thread 1

With dynamic resource utilization

adjustment

0

2

4

6

8

10

Rela

tive

Occ

urre

nce

0 5 10 15 20

05

1015

20

Thread 0

Thread 1

Without dynamic resource utilization

adjustment

Global Completion Table Occupancy

Page 10: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Resource Sharing

0

2

4

6

8

10

Re

lati

ve

Oc

cu

rre

nc

e

0 5 10 15 20

05

1015

20

Thread 0

Threa

d 1

0

2

4

6

8

10

Re

lati

ve

Oc

cu

rre

nc

e

0 5 10 15 20

05

1015

20

Thread 0

Thr

ead

1

Threads share many resources Global Completion Table, BHT, TLB, . . .

Higher performance realized when resources balanced across threads

Tendency to drift toward extremes accompanied by reduced performance

Solution: Dynamically adjust resource utilization

Global Completion Table Occupancy

Results based on simulation of an online transaction processing application

Page 11: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Thread PriorityInstances when unbalanced execution desirable

No work for opposite thread

Thread waiting on lock

Software determined non uniform balance

Power management

…Solution: Control instruction decode rate

Software/hardware controls 8 priority levels for each thread

0001111122

IPC

-5 -3 -1 0 1 3 50,7 7,0 1,1

Thread 1 Priority - Thread 0 Priority

Thread 0 IPC Thread 1 IPCPower Save Mode

Single Thread Mode

Page 12: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Dynamic Thread Switching

Thread StatesUsed if no task ready for second thread to runAllocates all machine resources to one threadInitiated by softwareDormant thread wakes up on:

External interruptDecrementer interruptSpecial instruction from active thread

Active

Dormant

Null

software

hardware or software

software

software

Page 13: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Single Thread Operation

Advantageous for execution unit limited applications

Floating or fixed point intensive workloads

Execution unit limited applications provide minimal performance leverage for SMT

Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread

Determined dynamically on a per processor basis

POW

ER4+

POW

ER5

ST

POW

ER5

SMT

IPC

Matrix Multiply

Page 14: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Other SMT Considerations

Power ManagementSMT Increases execution unit utilization

Dynamic power management does not impact performanceDebug tools / Lab bring-up

Instruction tracing

Hang detection

Forward progress monitorPerformance MonitoringServiceability

Page 15: Power5: IBM's Next Generation Power Microprocessor · 2013-05-17 · POWER4 --- Shipped in Systems December 2001 Technology: 180nm lithography, Cu, SOI POWER4+ shipping in 130nm today

© 2003 IBM CorporationHotchips 15, August 2003

SMT Implementation in POWER5

Summary

POWER5 SMT implementation is more than SMTGood ROI for silicon area: Performance gain > Area increaseResource sizes optimized Dynamic feedback enhances instruction throughputSoftware controlled priority exploits machine architectureDynamic ST to/from SMT mode capability optimizes system resources

SMT impacts pervasive throughout chipOperating in laboratory

AIX, Linux and OS/400 booted and running