POWER5, Federation and Linux: System Architecture


Transcript of POWER5, Federation and Linux: System Architecture

Page 1: POWER5, Federation and Linux: System Architecture


© 2004 IBM

Charles Grassl
IBM

August, 2004

POWER5, Federation and Linux: System Architecture

2 © 2004 IBM Corporation

Agenda

Architecture
– Processors
  – POWER4
  – POWER5
– System
  – Nodes
  – Systems

High Performance Switch

Page 2: POWER5, Federation and Linux: System Architecture


3 © 2004 IBM Corporation

Processor Design

POWER4
– Two processors per chip
– L1: 32 kbyte data, 64 kbyte instructions
– L2: 1.45 Mbyte per processor pair
– L3: 32 Mbyte per processor pair, shared by all processors

POWER5
– Two processors per chip
– L1: 32 kbyte data, 64 kbyte instructions
– L2: 1.9 Mbyte per processor pair
– L3: 36 Mbyte per processor pair, shared by the processor pair, an extension of the L2 cache
– Simultaneous multithreading
– Enhanced memory loads and stores
– Additional rename registers

4 © 2004 IBM Corporation

POWER4: Multi-processor Chip

Each processor:
– L1 caches
– Registers
– Functional units

Each chip (shared):
– L2 cache
– L3 cache: shared across the node, path to memory

[Diagram: processors and shared L2 cache]

Page 3: POWER5, Federation and Linux: System Architecture


5 © 2004 IBM Corporation

POWER5: Multi-processor Chip

Each processor:
– L1 caches
– Registers
– Functional units

Each chip (shared):
– L2 cache
– L3 cache: NOT shared across the node, path to memory
– Memory controller

[Diagram: processors, shared L2 cache, and on-chip memory controller]

6 © 2004 IBM Corporation

POWER4: Features

Multi-processor chip
High clock rate
– High computation burst rate

Three cache levels
– Bandwidth
– Latency hiding

Shared memory
– Large memory size

Page 4: POWER5, Federation and Linux: System Architecture


7 © 2004 IBM Corporation

POWER5 Features

Performance
– Registers
  – Additional rename registers
– Cache improvements
  – L3 runs at ½ frequency (POWER4: ⅓ frequency)
– Simultaneous Multi-Threading (SMT)
– Memory bandwidth

Virtualization
– Partitioning

Dynamic power management

8 © 2004 IBM Corporation

POWER4, POWER5 Differences

| Feature | POWER4 Design | POWER5 Design | Benefit |
|---|---|---|---|
| Size | 412 mm² | 389 mm² | 50% more transistors in the same space |
| Processor Addressing | 1 processor | 1/10 of processor | Better usage of processor resources |
| Simultaneous Multi-Threading | No | Yes | Better processor utilization |
| Memory Bandwidth | 4 Gbyte/s per chip | ~16 Gbyte/s per chip | 4X improvement; faster memory access |
| L3 Cache | 32 MB, 8-way associative, 118 clock cycles | 36 MB, 12-way associative, ~80 clock cycles | Better cache performance |
| L2 Cache | 1.44 MB, 8-way associative | 1.9 MB, 10-way associative | Fewer L2 cache misses; better performance |
| L1 Cache | 2-way associative, FIFO | 4-way associative, LRU | Improved L1 cache performance |

Page 5: POWER5, Federation and Linux: System Architecture


9 © 2004 IBM Corporation

Modifications to POWER4 System Structure

[Diagram: POWER4 chip (two processors, shared L2, fabric controller) with the L3 sitting between the chip and memory, and off-chip memory controllers, versus POWER5 chip with the memory controller on chip and the L3 attached directly to the chip beside the path to memory]

10 © 2004 IBM Corporation

POWER5 Challenges

Scaling from 32-way to logical 128-way
– Large system effects, 2 Tbyte of memory

1.9 MB L2 shared across 4 threads (SMT)
– L2 miss rates

36 Mbyte private L3 versus 128 Mbyte shared L3
– Will have to watch the multi-programming level
– Read-only data will get replicated

Local/remote memory latencies
– Worst-case remote memory latency is 60% higher than local (see the sketch below)
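That last gap is directly measurable. Below is a minimal sketch (not from the original slides) that compares local and remote memory latency with a dependent pointer chase, using libnuma on Linux; the node numbers, buffer size, and the sequential chain (which hardware prefetch will partially hide, so real tests randomize it) are assumptions of the sketch.

```c
/* Minimal sketch: local vs. remote memory latency via a dependent
 * pointer chase. Buffers are placed on NUMA nodes 0 and 1 with libnuma
 * (link with -lnuma); the thread is pinned to node 0, so node 1 is
 * "remote". A production test would randomize the chain to defeat
 * hardware prefetch; this sequential ring understates the gap. */
#include <numa.h>
#include <stdio.h>
#include <time.h>

#define N (1L << 22)                  /* 4M pointers = 32 Mbyte */

static volatile void *sink;           /* keeps the chase live */

static double chase_ns(void **buf)
{
    for (long i = 0; i < N - 1; i++)  /* build a pointer ring */
        buf[i] = &buf[i + 1];
    buf[N - 1] = &buf[0];

    struct timespec t0, t1;
    void **p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        p = (void **)*p;              /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N;
}

int main(void)
{
    if (numa_available() < 0) return 1;
    numa_run_on_node(0);              /* pin execution to node 0 */
    void **local  = numa_alloc_onnode(N * sizeof(void *), 0);
    void **remote = numa_alloc_onnode(N * sizeof(void *), 1);
    if (!local || !remote) return 1;
    printf("local:  %.1f ns/load\n", chase_ns(local));
    printf("remote: %.1f ns/load\n", chase_ns(remote));
    numa_free(local,  N * sizeof(void *));
    numa_free(remote, N * sizeof(void *));
    return 0;
}
```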

Page 6: POWER5, Federation and Linux: System Architecture


11 © 2004 IBM Corporation

Effect of Features: Registers

Additional rename registers
– POWER4: 72 total registers
– POWER5: 110 total registers

Matrix multiply:
– POWER4: 65% of burst rate
– POWER5: 90% of burst rate

Enhances performance of computationally intensive kernels
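A register-hungry kernel makes the point concrete. The sketch below (illustrative only, not IBM's benchmark code) unrolls a matrix-multiply inner loop into four independent accumulators; the more physical (rename) registers the core provides, the deeper such unrolling can go before the compiler is forced to spill to memory.

```c
/* Illustrative sketch: a register-blocked matrix-multiply inner kernel.
 * The four independent accumulators keep several multiply-add chains in
 * flight at once; extra rename registers let the hardware (and compiler)
 * sustain deeper unrolling without spills. Layout is row-major;
 * dimensions are hypothetical, for illustration only. */
void matmul_kernel(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            for (int k = 0; k < n; k += 4) {        /* assumes n % 4 == 0 */
                s0 += a[i*n + k]     * b[(k)*n + j];
                s1 += a[i*n + k + 1] * b[(k+1)*n + j];
                s2 += a[i*n + k + 2] * b[(k+2)*n + j];
                s3 += a[i*n + k + 3] * b[(k+3)*n + j];
            }
            c[i*n + j] = (s0 + s1) + (s2 + s3);
        }
    }
}
```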

12 © 2004 IBM Corporation

Effect of Rename Registers

[Bar chart: Mflop/s (0 to 7000) for Polynomial and MATMUL kernels on 1.5 GHz POWER4 and 1.65 GHz POWER5]

Page 7: POWER5, Federation and Linux: System Architecture


13 © 2004 IBM Corporation

Effect of Rename Registers

"Tight" algorithms were previously register bound
Increased floating point efficiency
Examples:
– Matrix multiply (LINPACK)
– Polynomial calculation (Horner's Rule; see the sketch below)
– FFTs
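As a concrete example of such a "tight" kernel, here is Horner's rule in C (a minimal sketch, not taken from the slides): every iteration is one multiply-add that depends on the previous one, so performance hinges on keeping the whole recurrence in registers.

```c
/* Illustrative sketch: Horner's rule for evaluating a degree-(n-1)
 * polynomial with coefficients c[0..n-1], c[0] being the constant term.
 * The loop is a single dependent chain of multiply-adds -- the kind of
 * "tight", register-bound kernel the slide refers to. */
double horner(const double *c, int n, double x)
{
    double p = c[n - 1];
    for (int i = n - 2; i >= 0; i--)
        p = p * x + c[i];
    return p;
}
```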

14 © 2004 IBM Corporation

Memory Bandwidth

[Chart: Mbyte/s (0 to 14000) versus vector length (0 to 400,000 bytes) for the reduction s = s + A(i), on a 1.65 GHz POWER5 and a 1.5 GHz POWER4]
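A C analogue of the s = s + A(i) measurement behind the chart might look like the sketch below; the timing method, array size, and repeat strategy are assumptions of the sketch, not details from the slides.

```c
/* Illustrative sketch of the bandwidth test behind the chart: stream
 * through an array of doubles summing the elements (the Fortran idiom
 * s = s + A(i)) and report Mbyte/s. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double sum(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

int main(void)
{
    long n = 4000000;                     /* ~32 Mbyte: past the caches */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < n; i++) a[i] = 1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile double s = sum(a, n);        /* volatile: keep the loop */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("sum=%g  %.0f Mbyte/s\n", s, n * sizeof(double) / sec / 1e6);
    free(a);
    return 0;
}
```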

Page 8: POWER5, Federation and Linux: System Architecture


15 © 2004 IBM Corporation

Memory Bandwidth: L1 – L2 Cache Regime

[Chart: Mbyte/s (0 to 14000) versus vector length (0 to 400,000 bytes) for POWER5 and POWER4, with the L1 and L2 cache regions marked]

16 © 2004 IBM Corporation

Memory Bandwidth: L2 Cache Regime

[Chart: Mbyte/s (0 to 14000) versus vector length (0 to 4,000,000 bytes) for POWER5 and POWER4, entirely within the L2 cache regime]

Page 9: POWER5, Federation and Linux: System Architecture


17 © 2004 IBM Corporation

L2 and L3 Cache Improvement

L3 cache enhances bandwidth
– Previously it only hid latency
– The new L3 cache is like a "2.5 cache"
– Larger blocking sizes (see the sketch below)

[Chart: Mbyte/s (0 to 14000) versus vector length (0 to 4,000,000 bytes) in the L2 cache regime]
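Cache blocking is how software exploits that larger effective cache. The sketch below (illustrative; BS is a hypothetical tile size, not a tuned POWER5 value) tiles a matrix multiply so each tile's working set stays cache-resident; an L3 that behaves like an extension of L2 permits larger tiles.

```c
/* Illustrative sketch of cache blocking: process the matrices in
 * BS x BS tiles so the tile working set stays resident in cache.
 * A larger effective cache (L2 plus its L3 "extension") permits a
 * larger BS. Assumes c[] is zero-initialized. */
#define BS 128   /* tile edge; pick so ~3*BS*BS*8 bytes fits in cache */

void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double aik = a[i*n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
}
```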

18 © 2004 IBM Corporation

Simultaneous Multi-Threading

Two threads per processor
– Threads execute concurrently

Threads share:
– Caches
– Registers
– Functional units
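To make the sharing concrete, the sketch below (workloads invented for illustration, not from the slides) runs two POSIX threads with complementary instruction mixes. On an SMT processor, the floating-point-heavy thread and the load/store-heavy thread can issue to different functional units in the same cycles, which is exactly the overlap the next slides picture.

```c
/* Illustrative sketch: two POSIX threads with complementary instruction
 * mixes. Under SMT, they can co-schedule well because they compete for
 * different functional units. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define N (1 << 20)
static double buf[N];

static void *fp_work(void *arg)          /* exercises the FP units */
{
    double x = 1.000001;
    for (long i = 0; i < 50L * N; i++)
        x = x * 1.000001 + 0.000001;
    *(double *)arg = x;
    return NULL;
}

static void *mem_work(void *arg)         /* exercises the load/store units */
{
    double s = 0.0;
    for (int r = 0; r < 50; r++)
        for (long i = 0; i < N; i++)
            s += buf[i];
    *(double *)arg = s;
    return NULL;
}

int main(void)
{
    double r1 = 0, r2 = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, fp_work, &r1);
    pthread_create(&t2, NULL, mem_work, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("fp=%g mem=%g\n", r1, r2);
    return 0;
}
```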

Page 10: POWER5, Federation and Linux: System Architecture


19 © 2004 IBM Corporation

POWER5 Simultaneous Multi-Threading (SMT)

[Diagram: functional-unit issue slots (FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL) across processor cycles, for POWER4 (single threaded) and POWER5 (simultaneous multi-threaded); legend: Thread 0 active, Thread 1 active, no thread active]

● Presents SMP programming model to software
● Natural fit with superscalar out-of-order execution core

20 © 2004 IBM Corporation

Conventional Multi-Threading

Threads alternate
– Nothing shared

[Diagram: functional units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time; Thread 0 and Thread 1 occupy alternating time slices]

Page 11: POWER5, Federation and Linux: System Architecture


21 © 2004 IBM Corporation

Simultaneous Multi-Threading

Simultaneous execution
– Shared registers
– Shared functional units

[Diagram: functional units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL); Thread 0 and Thread 1 issue in the same cycles]

22 © 2004 IBM Corporation

Manipulating SMT

Administrator function; dynamic; system-wide.

| Function | Command |
|---|---|
| View | /usr/sbin/smtctl |
| On | /usr/sbin/smtctl -m on now |
| Off | /usr/sbin/smtctl -m off now |

Page 12: POWER5, Federation and Linux: System Architecture


23 © 2004 IBM Corporation

Dynamic Power Management

[Thermal-camera photos, taken while a prototype POWER5 chip was undergoing tests: single thread with no power management versus simultaneous multi-threading with dynamic power management]

24 © 2004 IBM Corporation

Multi-Chip Modules (MCMs)

Four processor chips are mounted on a module
– Unit of manufacture
– Smallest freestanding system

Multiple modules
– Node

Also available: Single Chip Modules
– Lower cost
– Lower bandwidth

Page 13: POWER5, Federation and Linux: System Architecture


25 © 2004 IBM Corporation

POWER4 Multi-Chip Module (MCM)

4 chips
– 8 processors (2 processors/chip)

>35 Gbyte/s in each chip-to-chip connection
Chip-to-chip busses clocked at 2:1 of processor MHz
6 connections per MCM
Logically shared L2's, L3's
Data bus (memory, inter-module): 3:1 of processor MHz, MCM to MCM

26 © 2004 IBM Corporation

POWER5 Multi-Chip Module (MCM)

95 mm × 95 mm
Four POWER5 chips
Four cache chips
4,491 signal I/Os
89 layers of metal

Page 14: POWER5, Federation and Linux: System Architecture


27 © 2004 IBM Corporation

Multi Chip Module (MCM) Architecture

POWER4:
– 4 processor chips (2 processors per chip)
– 8 off-module L3 chips
– L3 cache is controlled by the MCM and logically shared across the node
– 4 memory control chips
– 16 chips in total

POWER5:
– 4 processor chips (2 processors per chip)
– 4 L3 cache chips
– L3 cache is used by the processor pair, an "extension" of L2
– 8 chips in total

28 © 2004 IBM Corporation

System Design with POWER4 and with POWER5

Multi Chip Module (MCM)
– Multiple MCMs per system node

Other issues:
– L3 cache sharing
– Memory controller

Page 15: POWER5, Federation and Linux: System Architecture


29 © 2004 IBM Corporation

POWER4 Processor Chip

[Diagram: two cores with store caches share a 1.4 Mbyte L2 cache; fabric bus controller (FBC) with X, A, Z, Y links; GX and NCU; memory controller (MC) leading through a 32 Mbyte L3 cache (a portion of 128 Mbyte shared across the node) to memory]

30 © 2004 IBM Corporation

POWER4 Multi-chip Module

[Diagram: four POWER4 chips (cpu 0 through cpu 7), each pair sharing a 1.5 Mbyte L2 cache, joined by chip-chip communication buses; L2 and L3 directories per chip; a shared 128 MB L3 below; memory on either side; a GX bus per chip; MCM-to-MCM links at each corner]

Page 16: POWER5, Federation and Linux: System Architecture


31 © 2004 IBM Corporation

POWER4 32-way Plane Topology

[Diagram: four MCMs (MCM 0 to MCM 3), each containing chips S, T, U, V, connected in a ring]

Uni-directional links
2-beat address @ 850 MHz = 425 M addresses/s
8-byte data @ 850 MHz (2:1) = 7 GB/s

32 © 2004 IBM Corporation

POWER5 Processor Chip

[Diagram: two cores, each running two threads (thread 0, thread 1), with store caches feeding a shared 1.9 Mbyte L2 cache; fabric bus controller (FBC) with X, A, Z, Y links and A, B ports; memory controller, GX, and NCU on chip; a 36 Mbyte L3 cache (a victim cache for L2) beside the chip; path to memory]

Page 17: POWER5, Federation and Linux: System Architecture


33 © 2004 IBM Corporation

POWER5 Multi Chip Module

[Diagram: four POWER5 chips (logical CPUs 0 through 15), each with a shared L2, an on-chip memory controller, and an L3 directory, joined by chip-chip communication buses; a 36 MB L3 and memory attached per chip; GX buses leading to RIO-2, the High Performance Switch, and InfiniBand; MCM-to-MCM / book-to-book links at each corner]

34 © 2004 IBM Corporation

POWER5 64-way Plane Topology

Uni-directional links
2-beat address @ 1 GHz = 500 M addresses/s
8-byte data @ 1 GHz (2:1) = 8 GB/s

Vertical node links are used to reduce latency and increase bandwidth for data transfers, and as an alternate route for address operations during dynamic ring re-configuration (e.g. concurrent repair/upgrade).

[Diagram: eight MCMs (MCM 0 to MCM 7), each containing chips S, T, U, V, arranged in two rings of four joined by vertical node links]

Page 18: POWER5, Federation and Linux: System Architecture


35 © 2004 IBM Corporation

Upcoming POWER5 Systems

Early introductions are iSeries
– Commercial

Next:
– Small nodes: 2, 4, 8 processors
– SF, ML

Next, next:
– iH, H
– Up to 64 processors

36 © 2004 IBM Corporation


Planned POWER5 iH

2/4/6/8-way POWER5
2U rack chassis
Rack: 24" × 43" deep

Page 19: POWER5, Federation and Linux: System Architecture


37 © 2004 IBM Corporation

Auxiliary

38 © 2004 IBM Corporation

Switch Technology

Internal network
– In lieu of, e.g., Gigabit Ethernet

Multiple links per node
– Match number of links to number of processors

| Processor Generation | Switch |
|---|---|
| POWER2 | HPS Switch |
| POWER2 → POWER3 | SP Switch |
| POWER3 → POWER4 | SP Switch 2 (Colony) |
| POWER4 → POWER5 | HPS (Federation) |

Page 20: POWER5, Federation and Linux: System Architecture


39 © 2004 IBM Corporation

HPS Software

MPI-LAPI (PE V4.1)
– Uses LAPI as the reliable transport
– Library uses threads, not signals, for async activities

Existing applications are binary compatible
New performance characteristics
– Improved collective communication

40 © 2004 IBM Corporation

High Performance Switch (HPS)

Also known as “Federation”
Follow-on to SP Switch2
– Also known as “Colony”

Specifications:
– 2 Gbyte/s (bidirectional)
– 5 microsecond latency

Configuration:
– Four adaptors per node
– 2 links per adaptor
– 16 Gbyte/s per node (4 adaptors × 2 links × 2 Gbyte/s)

Page 21: POWER5, Federation and Linux: System Architecture


41 © 2004 IBM Corporation

HPS Specifications

| Switch | Latency [microsec.] | Bandwidth, single [Mbyte/s] | Bandwidth, multiple [Mbyte/s] |
|---|---|---|---|
| HPS (“Federation”) | 5 | 1700 | 2000 |
| SP Switch 2 (“Colony”) | 15 | 350 | 550 |

42 © 2004 IBM Corporation

HPS Packaging

4U, 24-inch drawer
– Fits in the bottom position of a p655 or p690 frame

Switch-only frame (7040-W42 frame)
– Up to eight HPS switches

16 ports for server-to-switch connections
16 ports for switch-to-switch connections
Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI)
– Up to two links per pSeries 655
– Up to eight links per pSeries 690

Page 22: POWER5, Federation and Linux: System Architecture


43 © 2004 IBM Corporation

HPS Scalability

Supports up to 16 p690 or p655 servers and 32 links at GA
Increased support for up to 64 servers
– 32 can be pSeries 690 servers
– 128 links planned for July, 2004

Higher scalability limits available by special order

44 © 2004 IBM Corporation

HPS Performance: Single and Multiple Messages

[Chart: Mbyte/s (0 to 2500) versus message size (0 to 1,000,000 bytes), single versus multiple messages]
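Curves like these are conventionally measured with a ping-pong test. Below is a minimal MPI sketch of the single-message case (message sizes and repetition counts are choices of the sketch, not details from the slides).

```c
/* Illustrative sketch of a ping-pong bandwidth test: rank 0 sends a
 * message to rank 1 and waits for it to come back; bandwidth is total
 * data moved over elapsed time. Run with two MPI tasks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 1024; bytes <= 1 << 20; bytes *= 2) {
        char *buf = malloc(bytes);
        int reps = 100;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%8d bytes  %8.0f Mbyte/s\n", bytes,
                   2.0 * bytes * reps / (t1 - t0) / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```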

Page 23: POWER5, Federation and Linux: System Architecture


45 © 2004 IBM Corporation

HPS Performance: POWER5 Shared Memory MPI

[Chart: Mbyte/s (0 to 4000) versus message size (0 to 1,000,000 bytes) for MPI over shared memory on POWER5]

46 © 2004 IBM Corporation

Summary

Processor technology: POWER4 → POWER5
– Multiprocessor chips
– Enhanced caches
– SMT

Cluster systems
– p655
– p690