Page 1

Architecture of Parallel Computers, CSC / ECE 506

Summer 2006

Scalable Multiprocessors, Lecture 10

6/19/2006

Dr Steve Hunter

Page 2

What is a Multiprocessor?

• A collection of communicating processors
  – Goals: balance load, reduce inherent communication and extra work

• A multi-cache, multi-memory system
  – The role of these components is essential regardless of programming model
  – Programming model and communication abstraction affect specific performance tradeoffs

[Figure: multiple processors, each with a cache and a node controller, connected through an interconnect]

Page 3

Scalable Multiprocessors

• Study of machines that scale from hundreds to thousands of processors.

• Scalability has implications at all levels of system design, and all aspects must scale

• Areas emphasized in text:
  – Memory bandwidth must scale with the number of processors
  – Communication network must provide scalable bandwidth at reasonable latency
  – Protocols used for transferring data and synchronization techniques must scale

• A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system.

For example (a rough throughput/latency model is sketched below):
  – How does the bandwidth/throughput of the system change when adding processors?
  – How does the latency or time per operation increase?
  – How does the cost of the system increase?
  – How are the systems packaged?
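A minimal way to make the first two questions concrete is a toy scaling model: assume each added processor demands a fixed bandwidth, and compare an interconnect whose total bandwidth is fixed with one whose bandwidth grows with the machine. All numbers below are illustrative assumptions, not figures from the text.

```python
# Toy scalability model (illustrative numbers): delivered throughput and
# per-operation latency as processors are added, for a fixed-bandwidth
# (bus-like) interconnect vs. one whose bandwidth grows with the machine.

def delivered_throughput(p, demand_per_proc, supplied_bw):
    """Delivered throughput is capped by what the interconnect can supply."""
    return min(p * demand_per_proc, supplied_bw)

def per_op_latency(p, base_latency, demand_per_proc, supplied_bw):
    """Once offered load nears the supplied bandwidth, queueing inflates latency."""
    utilization = min(p * demand_per_proc / supplied_bw, 0.99)
    return base_latency / (1.0 - utilization)

for p in (8, 64, 512):
    fixed = delivered_throughput(p, 1.0, supplied_bw=32.0)       # bus-like: 32 GB/s total
    scaled = delivered_throughput(p, 1.0, supplied_bw=4.0 * p)   # scalable: 4 GB/s per processor
    lat = per_op_latency(p, base_latency=1.0, demand_per_proc=1.0, supplied_bw=32.0)
    print(f"{p:4d} procs: fixed net {fixed:6.1f} GB/s (latency x{lat:5.1f}), "
          f"scalable net {scaled:7.1f} GB/s")
```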

Page 4

Scalable Multiprocessors

• Basic metrics affecting the scalability of a computer system from an application perspective are (Hwang 93):

  – Machine size: the number of processors
  – Clock rate: determines the basic machine cycle
  – Problem size: amount of computational workload or the number of data points
  – CPU time: the actual CPU time in seconds
  – I/O demand: the input/output demand in moving the program, data, and results
  – Memory capacity: the amount of main memory used in a program execution
  – Communication overhead: the amount of time spent for interprocessor communication, synchronization, remote access, etc.
  – Computer cost: the total cost of hardware and software resources required to execute a program
  – Programming overhead: the development overhead associated with an application program

• Power (watts) and cooling are also becoming inhibitors to scalability

Page 5

Scalable Multiprocessors

• Some other recent trends:
  – Multi-core processors on a single socket
  – Reduced focus on increasing the processor clock rate
  – System-on-Chip (SoC) combining processor cores, integrated interconnect, cache, high-performance I/O, etc.
  – Geographically distributed applications utilizing Grid and HPC technologies
  – Standardization of high-performance interconnects (e.g., InfiniBand, Ethernet) and a focus by the Ethernet community on reducing latency
  – For example, Force10's recently announced 10Gb Ethernet switch:
    » The S2410 data center switch has set industry benchmarks for 10 Gigabit price and latency
    » Designed for high-performance clusters, 10 Gigabit Ethernet connectivity to the server, and Ethernet-based storage solutions, the S2410 supports 24 line-rate 10 Gigabit Ethernet ports with ultra-low switching latency of 300 nanoseconds at an industry-leading price point
    » The S2410 eliminates the need to integrate InfiniBand or proprietary technologies into the data center and opens the high-performance storage market to 10 Gigabit Ethernet technology. Standardizing on 10 Gigabit Ethernet in the data core, edge, and storage radically simplifies management and reduces total network cost

Page 6

Bandwidth Scalability

• What fundamentally limits bandwidth?
  – Number of wires, clock rate

• Must have many independent wires or a high clock rate
• Connectivity through bus or switches (a back-of-envelope comparison follows the figure below)

[Figure: four processor–memory (P–M–M) nodes connected through switches (S); typical switches include a bus, multiplexers, and a crossbar]
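As a rough illustration of why the switching structure matters, compare the aggregate bandwidth of a shared bus with that of a crossbar as nodes are added. The per-link speed used here is an arbitrary illustrative figure, not one from the lecture.

```python
# Aggregate bandwidth of a shared bus vs. a crossbar as nodes are added.
# A bus offers one shared path; a crossbar lets every node drive its own link.

link_bw_gbs = 3.2  # hypothetical per-link bandwidth in GB/s

for nodes in (4, 16, 64):
    bus_total = link_bw_gbs                 # all nodes share a single path
    crossbar_total = nodes * link_bw_gbs    # one independent path per node
    print(f"{nodes:3d} nodes: bus {bus_total:5.1f} GB/s total, "
          f"crossbar {crossbar_total:6.1f} GB/s total")
```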

Page 7

Some Memory Models

[Figure: three memory models — (1) Shared cache: processors P1…Pn connect through a switch to a shared first-level cache and interleaved main memory; (2) Centralized memory (dance hall, UMA): processors with private caches connect through an interconnection network to centralized, interleaved memory modules; (3) Distributed memory (NUMA): each processor has a private cache and a local memory module, with nodes joined by an interconnection network]
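To make the UMA/NUMA distinction concrete, here is a small sketch of how a physical address might map to a memory module under the two organizations. The block size, node count, and mapping scheme are hypothetical illustrations, not taken from any particular machine.

```python
# Sketch: which memory module / node "owns" a physical address under the two models.

NODES = 4
BLOCK = 64                  # bytes per interleaved block (hypothetical)
MEM_PER_NODE = 1 << 30      # 1 GB of local memory per node (hypothetical)

def uma_interleaved_module(addr):
    """Centralized, interleaved memory: consecutive blocks rotate across modules,
    so every processor is (roughly) equidistant from every address."""
    return (addr // BLOCK) % NODES

def numa_home_node(addr):
    """Distributed memory: each node owns a contiguous region; local accesses are
    fast, accesses to another node's region cross the interconnection network."""
    return addr // MEM_PER_NODE

addr = 3 * MEM_PER_NODE + 4096
print("UMA module:", uma_interleaved_module(addr))   # depends only on low-order bits -> 0
print("NUMA home node:", numa_home_node(addr))       # depends on the owning region -> 3
```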

Page 8

Generic Distributed Memory Organization

[Figure: generic distributed memory organization — each node contains a processor (P), cache ($), memory (M), and a communication assist (CA) attached to a switch; switches form a scalable network]

• Network bandwidth requirements?
  – For independent processes?
  – For communicating processes?

• Latency?

Page 9

Some Examples

Page 10

AMD Opteron Processor Technology

Page 11

AMD Opteron Architecture

• AMD Opteron™ Processor Key Architectural Features
  – Single-Core and Dual-Core AMD Opteron processors
  – Direct Connect Architecture
  – Integrated DDR DRAM Memory Controller
  – HyperTransport™ Technology
  – Low Power

Page 12

AMD Opteron Architecture

• Direct Connect Architecture

  – Addresses and helps reduce the real challenges and bottlenecks of system architectures
  – Memory is directly connected to the CPU, optimizing memory performance
  – I/O is directly connected to the CPU for more balanced throughput and I/O
  – CPUs are connected directly to CPUs, allowing for more linear symmetric multiprocessing

• Integrated DDR DRAM Memory Controller
  – Changes the way the processor accesses main memory, resulting in increased bandwidth, reduced memory latencies, and increased processor performance
  – Available memory bandwidth scales with the number of processors
  – 128-bit wide integrated DDR DRAM memory controller capable of supporting up to eight (8) registered DDR DIMMs per processor
  – Available memory bandwidth up to 6.4 GB/s (with PC3200) per processor

• HyperTransport™ Technology
  – Provides a scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets
  – Support for up to three (3) coherent HyperTransport links, providing up to 24.0 GB/s peak bandwidth per processor
  – Up to 8.0 GB/s bandwidth per link, providing sufficient bandwidth for supporting new interconnects including PCI-X, DDR, InfiniBand, and 10G Ethernet
  – Offers low power consumption (1.2 volts) to help reduce a system's thermal budget
  (A quick check of these bandwidth figures appears below.)
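The memory and HyperTransport figures above follow from simple width × rate arithmetic; the short check below just reproduces them. The 400 MT/s transfer rate for PC3200 DDR is a standard figure rather than one stated on the slide.

```python
# Reproducing the Opteron bandwidth figures quoted above.

# 128-bit (16-byte) DDR interface with PC3200 memory at 400 MT/s:
mem_bw = 16 * 400e6 / 1e9
print(f"Memory bandwidth: {mem_bw:.1f} GB/s")              # 6.4 GB/s

# Three coherent HyperTransport links at 8.0 GB/s each:
ht_bw = 3 * 8.0
print(f"Peak HyperTransport bandwidth: {ht_bw:.1f} GB/s")  # 24.0 GB/s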

Page 13

AMD Processor Architecture

• Low-Power Processors

  – The AMD Opteron processor HE offers industry-leading performance per watt, making it an ideal solution for rack-dense 1U servers or blades in datacenter environments, as well as cooler, quieter workstation designs.

– The AMD Opteron processor EE provides maximum I/O bandwidth currently available in a single-CPU controller making it a good fit for embedded controllers in markets such as NAS and SAN.

• Other features of the AMD Opteron processor include:
  – 64-bit wide key data and address paths that incorporate a 48-bit virtual address space and a 40-bit physical address space
  – ECC (Error Correcting Code) protection for L1 cache data, L2 cache data and tags, and DRAM, with hardware scrubbing of all ECC-protected arrays
  – 90nm SOI (Silicon on Insulator) process technology for lower thermal output levels and improved frequency scaling
  – Support for all instructions necessary to be fully compatible with SSE2 technology
  – Two (2) additional pipeline stages (compared to AMD's seventh-generation architecture) for increased performance and frequency scalability
  – Higher IPC (Instructions per Clock) achieved through additional key features, such as larger TLBs (Translation Look-aside Buffers), flush filters, and an enhanced branch prediction algorithm

Page 14

AMD vs Intel

• Performance

– SPECint® rate2000 – the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz processor by 28 percent

– SPECfp® rate2000 – The Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz processor by 76 percent

– SPECjbb®2005 – The Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz by 13 percent

• Processor Power (Watts)
  – Dual-Core AMD Opteron™ processors, at 95 watts, consume far less than the competition's dual-core x86 server processors, which according to their published data have a thermal design power of 135 watts and a max power draw of 150 watts.
  – Can result in 200 percent better performance per watt than the competition.
  – Even greater performance per watt can be achieved with lower-power (55 watt) processors.

Page 15

IBM POWER Processor Technology

Page 16

IBM POWER4+ Processor Architecture

Page 17

IBM POWER4+ Processor Architecture

• Two processor cores on one chip, as shown
• Clock frequency of the POWER4+ is 1.5–1.9 GHz
• The L2 cache modules are connected to the processors by the Core Interface Unit (CIU) switch, a 2×3 crossbar with a bandwidth of 40 B/cycle per port
• This enables shipping 32 B to either the L1 instruction cache or the data cache of each of the processors and storing 8 B values at the same time
• Also, for each processor there is a Non-cacheable Unit that interfaces with the Fabric Controller and takes care of non-cacheable operations
• The Fabric Controller is responsible for the communication with the three other chips that are embedded in the same Multi-Chip Module (MCM), with the L3 cache, and with other MCMs
• The bandwidths at 1.7 GHz are 13.6, 9.0, and 6.8 GB/s, respectively (a bytes-per-cycle check of these figures appears below). The chip further contains a variety of devices: the L3 cache directory and the L3 and memory controller, which should bring down the off-chip latency considerably
• The GX Controller is responsible for the traffic on the GX bus, which transports data to/from the system and in practice is used for I/O. The maximum size of the L3 cache is 32 MB
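These bandwidths are consistent with a bytes-per-cycle view of the links: at 1.7 GHz, an 8 B/cycle path yields 13.6 GB/s and a 4 B/cycle path yields 6.8 GB/s. The per-cycle widths below are inferred from that arithmetic rather than stated on the slide.

```python
# Bandwidth = bytes per cycle x clock frequency, at the POWER4+'s 1.7 GHz.
clock_hz = 1.7e9

for label, bytes_per_cycle in [("chip-to-chip within the MCM", 8), ("to other MCMs", 4)]:
    print(f"{label}: {bytes_per_cycle} B/cycle -> {bytes_per_cycle * clock_hz / 1e9:.1f} GB/s")
# 13.6 GB/s and 6.8 GB/s; the quoted 9.0 GB/s L3 figure presumably reflects an
# interface running at a different ratio of the core clock.
```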

Page 18

IBM POWER5 Processor Architecture

Page 19

IBM POWER5 Processor Architecture

• Like the POWER4(+), the POWER5 has two processor cores on a chip
• Clock frequency of the POWER5 is 1.9 GHz
• However, because of the higher density on the chip (the POWER5 is built in 130 nm technology instead of the 180 nm used for the POWER4+), more devices could be placed on the chip and they could also be enlarged

• The L2 caches of two neighboring chips are connected and the L3 caches are directly connected to the L2 caches.

• Both are larger than their respective counterparts of the POWER4: 1.875 MB against 1.5 MB for the L2 cache and 36 MB against 32 MB for the L3 cache.

• In addition, the latency of the L3 cache has improved from about 120 cycles to 80 cycles. Also, the associativity of the caches has improved: from 2-way to 4-way for the L1 cache, from 8-way to 10-way for the L2 cache, and from 8-way to 12-way for the L3 cache.

• A big difference is also the improved bandwidth from memory to the chip: it has increased from 4 GB/s for the POWER4+ to approximately 16 GB/s for the POWER5

Page 20

Intel (Future) Processor Technology

Page 21

DP Server Architecture

Constantly analyzing the requirements, the technologies, and the tradeoffs
(*Graphics not representative of actual die photo or relative size)

[Figure: Bensley platform with the Blackford chipset and FB-DIMM advanced memory buffers (AMB); 17 GB/s memory bandwidth, 64 GB memory capacity, FSB scaling at 800 / 1067 / 1333 MHz, with energy and performance indicators]

Platform performance: it's all about bandwidth and latency
  – Point-to-point interconnect
  – Consistent local and remote memory latencies
  – Central coherency resolution
  – Sustained and balanced throughput
  – Easy capacity expansion
  – Large shared caches

Page 22

Energy Efficient Performance – High End

Datacenter "energy label" (computational efficiency):

• ASC Purple (Source: LLNL)
  – 6 MWatt, 100 TFlops goal, 12K+ CPUs (POWER5), $230M
  – 17,066 Flops/Watt, 467 Flops/Dollar

• NASA Columbia (Source: NASA)
  – 2 MWatt, 60 TFlops goal, 10,240 CPUs (Itanium II), $50M
  – 30,720 Flops/Watt, 1,288 Flops/Dollar

Page 23

Core™ Microarchitecture Advances With Quad Core

[Chart: DP performance per watt, compared with SPECint_rate at the platform level, rising from 1X to 4X across H1 '05 through H1 '07 — server parts: Irwindale, Paxville DP, Dempsey MV, Woodcrest, Clovertown (quad core, H1 '07); desktop part: Kentsfield (quad core). Source: Intel. *Graphics not representative of actual die photo or relative size]

Energy Efficient Performance

Page 24

Woodcrest for Servers

  – 35% lower power and 80% higher performance, relative to the Intel® Xeon® 2.8GHz 2x2MB
  – Source: Intel, based on estimated SPECint*_rate_base2000 and thermal design power

Page 25

Multi-Core Energy-Efficient Performance

[Chart: power and performance relative to a single core at max frequency and Vcc (1.00x) —
  – Over-clocked (+20% frequency/Vcc): ~1.73x power, ~1.13x performance
  – Dual-core, each core at -20% frequency/Vcc: ~1.02x power, ~1.73x performance]

(A sketch of the voltage/frequency scaling behind these numbers follows.)
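These figures follow from the classic dynamic-power model P ≈ C·V²·f with supply voltage scaled roughly in proportion to frequency, so power grows roughly with the cube of frequency while single-core performance grows only linearly. The sketch below reproduces the chart's power numbers under that assumption; the performance figures are idealized.

```python
# Dynamic power scales roughly as V^2 * f; if V tracks f, power ~ f^3.

def relative_power(freq_scale):
    return freq_scale ** 3          # voltage scaled with frequency

# Over-clocking a single core by 20%:
print(f"+20% clock: power {relative_power(1.2):.2f}x, performance at best 1.20x")
# -> ~1.73x power for at best 1.2x (the chart shows ~1.13x) performance

# Two cores, each under-clocked by 20%:
print(f"two cores at -20%: power {2 * relative_power(0.8):.2f}x, "
      f"throughput up to {2 * 0.8:.2f}x on parallel work")
# -> ~1.02x power for up to ~1.6x throughput in this simple model;
#    the chart's 1.73x presumably reflects additional gains beyond pure clocking
```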

Page 26

Intel Multi-Core Trajectory

[Figure: dual-core in 2006, quad-core in 2007]

Page 27

Blade Architectures - General

[Figure: multiple blade servers connected by a common interconnect]

• Blades interconnected by common fabrics
  – InfiniBand, Ethernet, Fibre Channel are most common
  – Redundant interconnect available for failover
  – Links from interconnect provide external connectivity

• Each blade contains multiple processors, memory, and network interfaces
  – Some options may exist, such as for memory, network connectivity, etc.

• Power, cooling, management overhead optimized within chassis
  – Multiple chassis connected together for greater number of nodes

Page 28

IBM BladeCenter H Architecture

• I/O Bridge
  – e.g., Ethernet, Fibre Channel, Passthru
  – Dual 4x (16-wire) wiring internally to each HSSM

• High-speed Switch
  – Ethernet or InfiniBand
  – 4x (16-wire) blade links
  – 4x (16-wire) bridge links
  – 1x (4-wire) Mgmt links
  – Uplinks: up to 12x links for IB and at least four 10Gb links for Ethernet

[Figure: chassis with Blades 1–14 connected to Switch Modules 1 and 2, Management Modules 1 and 2, High-Speed Switches 1–4, and I/O Bridges (bridges 3 and 4 sharing bays with Switch Modules 3 and 4)]

Page 29

[Figure: multiple BladeCenter chassis, each with Blades 1–14, joined by an external interconnect]

• External high performance interconnect(s) for multiple chassis

• Independent scaling of blades and I/O

• Scales for large clusters

• Architecture used for the Barcelona Supercomputing Center (MareNostrum, #8)

IBM BladeCenter H Architecture

Page 30

Cray (OctigaBay) Blade Architecture

• MPI offloaded in hardware: throughput 2900 MB/s and latency 1.6 µs (a message-time sketch follows the figure below)
• Processor and communication interface is HyperTransport
• Dedicated link and communication chip per processor
• FPGA accelerator available for additional offload

[Figure: blade with two Opterons, each with dedicated memory (5.4 GB/s, DDR 333), connected by HyperTransport (6.4 GB/s) to two Rapid Array Communications Processors (RAP, including MPI hardware offload capabilities) and an FPGA accelerator for application offload; 8 GB/s per external link]
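A common way to interpret the throughput and latency figures above is the simple linear communication-cost model T(n) ≈ latency + n / bandwidth. The sketch below applies that model to the quoted 1.6 µs / 2900 MB/s numbers; the message sizes are arbitrary illustrations.

```python
# Estimated MPI message time under a linear cost model: T(n) = latency + n / bandwidth.

latency_s = 1.6e-6          # 1.6 microseconds (from the slide)
bandwidth_Bps = 2900e6      # 2900 MB/s (from the slide)

for n in (8, 1_024, 65_536, 1_048_576):   # message sizes in bytes
    t = latency_s + n / bandwidth_Bps
    regime = "latency-bound" if n / bandwidth_Bps < latency_s else "bandwidth-bound"
    print(f"{n:>9} B message: ~{t * 1e6:8.1f} us ({regime})")
```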

Page 31

[Figure: chassis board options — monitoring & control ASIC with control system and power, one or two switches (second switch card optional), SATA via 8111 HT/SATA bridges, and 4 PCI-Express slots via two 8131 HT/PCI bridges, each attached to one blade]

Cray Blade Architecture

Blade Characteristics
• Two 2.2 GHz Opteron processors
  – Dedicated memory per processor
• Two Rapid Array Communication Processors
  – One dedicated link each
  – One redundant link each
• Application Accelerator FPGA
• Local hard drive

Shelf Characteristics
• One or two IB 4x switches
• Twelve or twenty-four external links
• Additional I/O:
  – Three high-speed I/O links
  – Four PCI-X bus slots
  – 100Mb Ethernet for management
• Active management system

Page 32

Cray Blade Architecture

• Six blades per 3U shelf
• Twelve 4x IB external links for the primary switch
• An additional twelve links are available with the optional redundant switch

[Figure: shelf-level view — each blade's two Opterons (5.4 GB/s DDR 333 memory, 6.4 GB/s HyperTransport) connect through RAPs (with MPI offload capabilities) and an accelerator to two Rapid Array Interconnects (24 x 24 IB 4x switches) at 8 GB/s per link, plus an active management system, 100 Mb Ethernet, and high-speed I/O / PCI-X]

Page 33

Cray Blade Architecture

• With up to 24 external links per OctigaBay 12K shelf, a variety of configurations can be achieved depending on the applications

• OctigaBay suggests interconnecting shelves by meshes, tori, fat trees, and fully-connected shelves for systems that fit in one rack. Fat-tree configurations require extra switches, which OctigaBay terms "spine switches."

• Mellanox InfiniBand technology is used for the interconnect
• Up to 25 shelves can be directly connected, yielding a 300-Opteron system

[Figure: multiple shelves connected by an interconnect]

Page 34

IBM BlueGene/L Architecture – Compute Card

• The BlueGene/L is the first in a new generation of systems made by IBM for massively parallel computing.

• The individual speed of the processor has been traded in favor of very dense packaging and low power consumption per processor. The basic processor in the system is a modified PowerPC 440 at 700 MHz.

• Two of these processors reside on a chip together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. The processors have two load ports and one store port from/to the L2 caches at 8 bytes/cycle. This is half of the bandwidth required by the two floating-point units (FPUs) and as such quite high.

• The CPUs have 32 KB of instruction cache and 32 KB of data cache on board. In favorable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s because the two FPUs can perform fused multiply-add operations (the arithmetic behind this peak is shown below). Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast.
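The 2.8 Gflop/s peak follows directly from the clock rate and the fused multiply-add capability; a quick check using only figures from the slide:

```python
# Peak per-core flop rate for the BlueGene/L compute chip.
clock_hz = 700e6     # 700 MHz
fpus = 2             # two floating-point units per core
flops_per_fma = 2    # a fused multiply-add counts as two floating-point operations

peak = clock_hz * fpus * flops_per_fma
print(f"Peak per core: {peak / 1e9:.1f} Gflop/s")   # 2.8 Gflop/s
```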

Page 35

IBM BlueGene/L Architecture

Page 36

IBM BlueGene/L Overview

• BlueGene/L boasts a peak speed of over 360 teraOPS, a total memory of 32 tebibytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet. The full system has 65,536 dual-processor compute nodes. Multiple communications networks enable extreme application scaling:

• Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications

• A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes

• Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds

• 1,024 gigabit-per-second links to a global parallel file system to support fast input/output to disk

• The BlueGene/L possesses no fewer than five networks, two of which are of interest for inter-processor communication: a 3-D torus network and a tree network.

• The torus network is used for most general communication patterns.
• The tree network is used for frequently occurring collective communication patterns such as broadcasts, reduction operations, etc. (a small sketch of torus addressing follows).
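To illustrate the 32 x 32 x 64 torus mentioned above, the sketch below computes a node's six nearest neighbors with wrap-around in each dimension and confirms the node count; the coordinate scheme is a generic illustration, not IBM's actual addressing.

```python
# Nearest neighbors in a 3D torus with wrap-around links, as in BlueGene/L's 32 x 32 x 64 torus.

DIMS = (32, 32, 64)
print("Total nodes:", DIMS[0] * DIMS[1] * DIMS[2])   # 65,536

def neighbors(x, y, z):
    """Each node has six neighbors: +/-1 in each dimension, modulo the torus size."""
    nx, ny, nz = DIMS
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

print(neighbors(0, 0, 63))   # wrap-around puts (0, 0, 63) next to (0, 0, 0)
```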

Page 37

IBM’s X3 Architecture

Page 38

IBM System x

X3 Chipset – Scalable Intel MP Server

[Figure: four EM64T (Xeon MP) processors connected to the X3 memory controller and scalability controller, with scalability ports to other Xeon MP processors, eight memory interfaces to RAM DIMMs, and I/O bridges providing PCI-X 2.0 at 266 MHz]

Page 39

IBM System x

X3 Chipset – Low Latency

[Figure: same X3 chipset diagram, highlighting a 108 ns memory access latency (presumably the local case)]

Page 40

IBM System x

X3 Chipset – Low Latency

[Figure: same X3 chipset diagram, highlighting a 222 ns memory access latency (presumably the remote case, across the scalability ports)]

Page 41

IBM System x

X3 Chipset – High Bandwidth

[Figure: same X3 chipset diagram, annotated with link bandwidths of 21.3 GB/s, 15 GB/s, 10.6 GB/s, and 6.4 GB/s]

Page 42

IBM System x

X3 Chipset – Snoop Filter

[Figure: two four-processor EM64T systems compared on an internal cache miss — "Others" (conventional node and memory controller) vs. the X3 chipset with its Hurricane controller]

• Others
  ! Cache from EACH processor must be snooped
  ! Creates traffic along the FSB

• X3
  ! Cache from EACH processor is mirrored on Hurricane (no traffic on the FSB)
  ! Relieves traffic on the FSB
  ! Faster access to main memory

Page 43

IBM System x

X3 Chipset – Snoop Filter

[Figure: repeat of the previous snoop-filter comparison — "Others": the cache from each processor must be snooped, creating traffic along the FSB; X3: the cache from each processor is mirrored on Hurricane, relieving FSB traffic and giving faster access to main memory]

Page 44

IBM System x


Multi-node Scalability – Putting it together

* Snoop filter and Remote Directory work together in multi-node configurations
* A local processor cache miss is broadcast to all memory controllers
* Only the node owning the latest copy of the data responds
* Maximizes system bus bandwidth

[Figure: four Hurricane controllers — the requester broadcasts the miss; the node whose main memory maps the cached address and the node owning the requested cache line respond with data, while the other nodes reply null. A sketch of this request/response flow follows.]
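A minimal sketch of the broadcast-and-respond flow described above, with each node holding a remote directory of lines it owns. The data structures and node count are hypothetical simplifications, not IBM's actual protocol state.

```python
# Sketch: multi-node cache-miss resolution with per-node remote directories.
# Each node records which cache lines it currently owns (has the latest copy of).

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.owned_lines = set()     # remote directory: lines this node owns

    def snoop(self, line_addr):
        """Respond with data only if this node owns the latest copy, else null."""
        return f"data from node {self.node_id}" if line_addr in self.owned_lines else None

nodes = [Node(i) for i in range(4)]
nodes[2].owned_lines.add(0x80)       # pretend node 2 holds the latest copy of line 0x80

def resolve_miss(requester, line_addr):
    """Requester broadcasts the miss; at most one owner responds, others reply null."""
    responses = [n.snoop(line_addr) for n in nodes if n.node_id != requester]
    return next((r for r in responses if r is not None), "fetch from home memory")

print(resolve_miss(requester=0, line_addr=0x80))   # -> data from node 2
print(resolve_miss(requester=0, line_addr=0x40))   # -> fetch from home memory
```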

Page 45

IBM System x

X3 Chipset – Scalability Ports

[Figure: four 4-way nodes joined by cabled scalability ports into a 16-way, single-OS-image MP system]

• X3 scales to 32-way; dual-core capable – 64 cores

Page 46

The End