Page 1

Architecture of Parallel Computers, CSC / ECE 506

Summer 2006

Scalable Multiprocessors, Lecture 10

6/19/2006

Dr Steve Hunter

Page 2

What is a Multiprocessor?

• A collection of communicating processors
  – Goals: balance load, reduce inherent communication and extra work

• A multi-cache, multi-memory system
  – The role of these components is essential regardless of programming model
  – Programming model and communication abstraction affect specific performance tradeoffs

[Figure: multiple processors, each with a cache and a node controller, connected through an interconnect]

Page 3

Scalable Multiprocessors

• Study of machines that scale from hundreds to thousands of processors.

• Scalability has implications at all levels of system design, and all aspects must scale

• Areas emphasized in text:
  – Memory bandwidth must scale with the number of processors
  – Communication network must provide scalable bandwidth at reasonable latency
  – Protocols used for transferring data and synchronization techniques must scale

• A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system.

For example (a rough throughput/latency model is sketched below):
  – How does the bandwidth/throughput of the system change when adding processors?
  – How does the latency or time per operation increase?
  – How does the cost of the system increase?
  – How are the systems packaged?
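A minimal way to make the first two questions concrete is a toy scaling model: assume each added processor demands a fixed bandwidth, and compare an interconnect whose total bandwidth is fixed with one whose bandwidth grows with the machine. All numbers below are illustrative assumptions, not figures from the text.

```python
# Toy scalability model (illustrative numbers): delivered throughput and
# per-operation latency as processors are added, for a fixed-bandwidth
# (bus-like) interconnect vs. one whose bandwidth grows with the machine.

def delivered_throughput(p, demand_per_proc, supplied_bw):
    """Delivered throughput is capped by what the interconnect can supply."""
    return min(p * demand_per_proc, supplied_bw)

def per_op_latency(p, base_latency, demand_per_proc, supplied_bw):
    """Once offered load nears the supplied bandwidth, queueing inflates latency."""
    utilization = min(p * demand_per_proc / supplied_bw, 0.99)
    return base_latency / (1.0 - utilization)

for p in (8, 64, 512):
    fixed = delivered_throughput(p, 1.0, supplied_bw=32.0)       # bus-like: 32 GB/s total
    scaled = delivered_throughput(p, 1.0, supplied_bw=4.0 * p)   # scalable: 4 GB/s per processor
    lat = per_op_latency(p, base_latency=1.0, demand_per_proc=1.0, supplied_bw=32.0)
    print(f"{p:4d} procs: fixed net {fixed:6.1f} GB/s (latency x{lat:5.1f}), "
          f"scalable net {scaled:7.1f} GB/s")
```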

Page 4

Scalable Multiprocessors

• Basic metrics affecting the scalability of a computer system from an application perspective are (Hwang 93):

  – Machine size: the number of processors
  – Clock rate: determines the basic machine cycle
  – Problem size: amount of computational workload or the number of data points
  – CPU time: the actual CPU time in seconds
  – I/O demand: the input/output demand in moving the program, data, and results
  – Memory capacity: the amount of main memory used in a program execution
  – Communication overhead: the amount of time spent for interprocessor communication, synchronization, remote access, etc.
  – Computer cost: the total cost of hardware and software resources required to execute a program
  – Programming overhead: the development overhead associated with an application program

• Power (watts) and cooling are also becoming inhibitors to scalability

Page 5

Scalable Multiprocessors

• Some other recent trends:
  – Multi-core processors on a single socket
  – Reduced focus on increasing the processor clock rate
  – System-on-Chip (SoC) combining processor cores, integrated interconnect, cache, high-performance I/O, etc.
  – Geographically distributed applications utilizing Grid and HPC technologies
  – Standardization of high-performance interconnects (e.g., InfiniBand, Ethernet) and a focus by the Ethernet community on reducing latency
  – For example, Force10's recently announced 10Gb Ethernet switch:
    » The S2410 data center switch has set industry benchmarks for 10 Gigabit price and latency
    » Designed for high-performance clusters, 10 Gigabit Ethernet connectivity to the server, and Ethernet-based storage solutions, the S2410 supports 24 line-rate 10 Gigabit Ethernet ports with ultra-low switching latency of 300 nanoseconds at an industry-leading price point
    » The S2410 eliminates the need to integrate InfiniBand or proprietary technologies into the data center and opens the high-performance storage market to 10 Gigabit Ethernet technology. Standardizing on 10 Gigabit Ethernet in the data core, edge, and storage radically simplifies management and reduces total network cost

Page 6

Bandwidth Scalability

• What fundamentally limits bandwidth?
  – Number of wires, clock rate

• Must have many independent wires or a high clock rate
• Connectivity through bus or switches (a back-of-envelope comparison follows the figure below)

[Figure: four processor–memory (P–M–M) nodes connected through switches (S); typical switches include a bus, multiplexers, and a crossbar]
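As a rough illustration of why the switching structure matters, compare the aggregate bandwidth of a shared bus with that of a crossbar as nodes are added. The per-link speed used here is an arbitrary illustrative figure, not one from the lecture.

```python
# Aggregate bandwidth of a shared bus vs. a crossbar as nodes are added.
# A bus offers one shared path; a crossbar lets every node drive its own link.

link_bw_gbs = 3.2  # hypothetical per-link bandwidth in GB/s

for nodes in (4, 16, 64):
    bus_total = link_bw_gbs                 # all nodes share a single path
    crossbar_total = nodes * link_bw_gbs    # one independent path per node
    print(f"{nodes:3d} nodes: bus {bus_total:5.1f} GB/s total, "
          f"crossbar {crossbar_total:6.1f} GB/s total")
```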

Page 7

Some Memory Models

[Figure: three memory models — (1) Shared cache: processors P1…Pn connect through a switch to a shared first-level cache and interleaved main memory; (2) Centralized memory (dance hall, UMA): processors with private caches connect through an interconnection network to centralized, interleaved memory modules; (3) Distributed memory (NUMA): each processor has a private cache and a local memory module, with nodes joined by an interconnection network]
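To make the UMA/NUMA distinction concrete, here is a small sketch of how a physical address might map to a memory module under the two organizations. The block size, node count, and mapping scheme are hypothetical illustrations, not taken from any particular machine.

```python
# Sketch: which memory module / node "owns" a physical address under the two models.

NODES = 4
BLOCK = 64                  # bytes per interleaved block (hypothetical)
MEM_PER_NODE = 1 << 30      # 1 GB of local memory per node (hypothetical)

def uma_interleaved_module(addr):
    """Centralized, interleaved memory: consecutive blocks rotate across modules,
    so every processor is (roughly) equidistant from every address."""
    return (addr // BLOCK) % NODES

def numa_home_node(addr):
    """Distributed memory: each node owns a contiguous region; local accesses are
    fast, accesses to another node's region cross the interconnection network."""
    return addr // MEM_PER_NODE

addr = 3 * MEM_PER_NODE + 4096
print("UMA module:", uma_interleaved_module(addr))   # depends only on low-order bits -> 0
print("NUMA home node:", numa_home_node(addr))       # depends on the owning region -> 3
```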

Page 8

Generic Distributed Memory Organization

[Figure: generic distributed memory organization — each node contains a processor (P), cache ($), memory (M), and a communication assist (CA) attached to a switch; switches form a scalable network]

• Network bandwidth requirements?
  – For independent processes?
  – For communicating processes?

• Latency?

Page 9

Some Examples

Page 10

AMD Opteron Processor Technology

Page 11

AMD Opteron Architecture

• AMD Opteron™ Processor Key Architectural Features
  – Single-Core and Dual-Core AMD Opteron processors
  – Direct Connect Architecture
  – Integrated DDR DRAM Memory Controller
  – HyperTransport™ Technology
  – Low Power

Page 12

AMD Opteron Architecture

• Direct Connect Architecture

  – Addresses and helps reduce the real challenges and bottlenecks of system architectures
  – Memory is directly connected to the CPU, optimizing memory performance
  – I/O is directly connected to the CPU for more balanced throughput and I/O
  – CPUs are connected directly to CPUs, allowing for more linear symmetric multiprocessing

• Integrated DDR DRAM Memory Controller
  – Changes the way the processor accesses main memory, resulting in increased bandwidth, reduced memory latencies, and increased processor performance
  – Available memory bandwidth scales with the number of processors
  – 128-bit wide integrated DDR DRAM memory controller capable of supporting up to eight (8) registered DDR DIMMs per processor
  – Available memory bandwidth up to 6.4 GB/s (with PC3200) per processor

• HyperTransport™ Technology
  – Provides a scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets
  – Support for up to three (3) coherent HyperTransport links, providing up to 24.0 GB/s peak bandwidth per processor
  – Up to 8.0 GB/s bandwidth per link, providing sufficient bandwidth for supporting new interconnects including PCI-X, DDR, InfiniBand, and 10G Ethernet
  – Offers low power consumption (1.2 volts) to help reduce a system's thermal budget
  (A quick check of these bandwidth figures appears below.)
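The memory and HyperTransport figures above follow from simple width × rate arithmetic; the short check below just reproduces them. The 400 MT/s transfer rate for PC3200 DDR is a standard figure rather than one stated on the slide.

```python
# Reproducing the Opteron bandwidth figures quoted above.

# 128-bit (16-byte) DDR interface with PC3200 memory at 400 MT/s:
mem_bw = 16 * 400e6 / 1e9
print(f"Memory bandwidth: {mem_bw:.1f} GB/s")              # 6.4 GB/s

# Three coherent HyperTransport links at 8.0 GB/s each:
ht_bw = 3 * 8.0
print(f"Peak HyperTransport bandwidth: {ht_bw:.1f} GB/s")  # 24.0 GB/s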

Page 13

AMD Processor Architecture

• Low-Power Processors

  – The AMD Opteron processor HE offers industry-leading performance per watt, making it an ideal solution for rack-dense 1U servers or blades in datacenter environments, as well as cooler, quieter workstation designs.

– The AMD Opteron processor EE provides maximum I/O bandwidth currently available in a single-CPU controller making it a good fit for embedded controllers in markets such as NAS and SAN.

• Other features of the AMD Opteron processor include:
  – 64-bit wide key data and address paths that incorporate a 48-bit virtual address space and a 40-bit physical address space
  – ECC (Error Correcting Code) protection for L1 cache data, L2 cache data and tags, and DRAM, with hardware scrubbing of all ECC-protected arrays
  – 90nm SOI (Silicon on Insulator) process technology for lower thermal output levels and improved frequency scaling
  – Support for all instructions necessary to be fully compatible with SSE2 technology
  – Two (2) additional pipeline stages (compared to AMD's seventh-generation architecture) for increased performance and frequency scalability
  – Higher IPC (Instructions per Clock) achieved through additional key features, such as larger TLBs (Translation Look-aside Buffers), flush filters, and an enhanced branch prediction algorithm

Page 14

AMD vs Intel

• Performance

– SPECint® rate2000 – the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz processor by 28 percent

– SPECfp® rate2000 – The Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz processor by 76 percent

– SPECjbb®2005 – The Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz by 13 percent

• Processor Power (Watts)
  – Dual-Core AMD Opteron™ processors, at 95 watts, consume far less than the competition's dual-core x86 server processors, which according to their published data have a thermal design power of 135 watts and a max power draw of 150 watts.
  – Can result in 200 percent better performance per watt than the competition.
  – Even greater performance per watt can be achieved with lower-power (55 watt) processors.

Page 15

IBM POWER Processor Technology

Page 16

IBM POWER4+ Processor Architecture

Page 17

IBM POWER4+ Processor Architecture

• Two processor cores on one chip, as shown
• Clock frequency of the POWER4+ is 1.5–1.9 GHz
• The L2 cache modules are connected to the processors by the Core Interface Unit (CIU) switch, a 2×3 crossbar with a bandwidth of 40 B/cycle per port
• This enables shipping 32 B to either the L1 instruction cache or the data cache of each of the processors and storing 8 B values at the same time
• Also, for each processor there is a Non-cacheable Unit that interfaces with the Fabric Controller and takes care of non-cacheable operations
• The Fabric Controller is responsible for the communication with the three other chips that are embedded in the same Multi-Chip Module (MCM), with the L3 cache, and with other MCMs
• The bandwidths at 1.7 GHz are 13.6, 9.0, and 6.8 GB/s, respectively (a bytes-per-cycle check of these figures appears below). The chip further contains a variety of devices: the L3 cache directory and the L3 and memory controller, which should bring down the off-chip latency considerably
• The GX Controller is responsible for the traffic on the GX bus, which transports data to/from the system and in practice is used for I/O. The maximum size of the L3 cache is 32 MB
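These bandwidths are consistent with a bytes-per-cycle view of the links: at 1.7 GHz, an 8 B/cycle path yields 13.6 GB/s and a 4 B/cycle path yields 6.8 GB/s. The per-cycle widths below are inferred from that arithmetic rather than stated on the slide.

```python
# Bandwidth = bytes per cycle x clock frequency, at the POWER4+'s 1.7 GHz.
clock_hz = 1.7e9

for label, bytes_per_cycle in [("chip-to-chip within the MCM", 8), ("to other MCMs", 4)]:
    print(f"{label}: {bytes_per_cycle} B/cycle -> {bytes_per_cycle * clock_hz / 1e9:.1f} GB/s")
# 13.6 GB/s and 6.8 GB/s; the quoted 9.0 GB/s L3 figure presumably reflects an
# interface running at a different ratio of the core clock.
```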

Page 18

IBM POWER5 Processor Architecture

Page 19

IBM POWER5 Processor Architecture

• Like the POWER4(+), the POWER5 has two processor cores on a chip
• Clock frequency of the POWER5 is 1.9 GHz
• However, because of the higher density on the chip (the POWER5 is built in 130 nm technology instead of the 180 nm used for the POWER4+), more devices could be placed on the chip and they could also be enlarged

• The L2 caches of two neighboring chips are connected and the L3 caches are directly connected to the L2 caches.

• Both are larger than their respective counterparts of the POWER4: 1.875 MB against 1.5 MB for the L2 cache and 36 MB against 32 MB for the L3 cache.

• In addition, the latency of the L3 cache has improved from about 120 cycles to 80 cycles. Also, the associativity of the caches has improved: from 2-way to 4-way for the L1 cache, from 8-way to 10-way for the L2 cache, and from 8-way to 12-way for the L3 cache.

• A big difference is also the improved bandwidth from memory to the chip: it has increased from 4 GB/s for the POWER4+ to approximately 16 GB/s for the POWER5

Page 20

Intel (Future) Processor Technology

Page 21

DP Server Architecture

Constantly analyzing the requirements, the technologies, and the tradeoffs
(*Graphics not representative of actual die photo or relative size)

[Figure: Bensley platform with the Blackford chipset and FB-DIMM advanced memory buffers (AMB); 17 GB/s memory bandwidth, 64 GB memory capacity, FSB scaling at 800 / 1067 / 1333 MHz, with energy and performance indicators]

Platform performance: it's all about bandwidth and latency
  – Point-to-point interconnect
  – Consistent local and remote memory latencies
  – Central coherency resolution
  – Sustained and balanced throughput
  – Easy capacity expansion
  – Large shared caches

Page 22

Energy Efficient Performance – High End

Datacenter "energy label" (computational efficiency):

• ASC Purple (Source: LLNL)
  – 6 MWatt, 100 TFlops goal, 12K+ CPUs (POWER5), $230M
  – 17,066 Flops/Watt, 467 Flops/Dollar

• NASA Columbia (Source: NASA)
  – 2 MWatt, 60 TFlops goal, 10,240 CPUs (Itanium II), $50M
  – 30,720 Flops/Watt, 1,288 Flops/Dollar

Page 23

Core™ Microarchitecture Advances With Quad Core

[Chart: DP performance per watt, compared with SPECint_rate at the platform level, rising from 1X to 4X across H1 '05 through H1 '07 — server parts: Irwindale, Paxville DP, Dempsey MV, Woodcrest, Clovertown (quad core, H1 '07); desktop part: Kentsfield (quad core). Source: Intel. *Graphics not representative of actual die photo or relative size]

Energy Efficient Performance

Page 24

Woodcrest for Servers

  – 35% lower power and 80% higher performance, relative to the Intel® Xeon® 2.8GHz 2x2MB
  – Source: Intel, based on estimated SPECint*_rate_base2000 and thermal design power

Page 25

Multi-Core Energy-Efficient Performance

[Chart: power and performance relative to a single core at max frequency and Vcc (1.00x) —
  – Over-clocked (+20% frequency/Vcc): ~1.73x power, ~1.13x performance
  – Dual-core, each core at -20% frequency/Vcc: ~1.02x power, ~1.73x performance]

(A sketch of the voltage/frequency scaling behind these numbers follows.)
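These figures follow from the classic dynamic-power model P ≈ C·V²·f with supply voltage scaled roughly in proportion to frequency, so power grows roughly with the cube of frequency while single-core performance grows only linearly. The sketch below reproduces the chart's power numbers under that assumption; the performance figures are idealized.

```python
# Dynamic power scales roughly as V^2 * f; if V tracks f, power ~ f^3.

def relative_power(freq_scale):
    return freq_scale ** 3          # voltage scaled with frequency

# Over-clocking a single core by 20%:
print(f"+20% clock: power {relative_power(1.2):.2f}x, performance at best 1.20x")
# -> ~1.73x power for at best 1.2x (the chart shows ~1.13x) performance

# Two cores, each under-clocked by 20%:
print(f"two cores at -20%: power {2 * relative_power(0.8):.2f}x, "
      f"throughput up to {2 * 0.8:.2f}x on parallel work")
# -> ~1.02x power for up to ~1.6x throughput in this simple model;
#    the chart's 1.73x presumably reflects additional gains beyond pure clocking
```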

Page 26

Intel Multi-Core Trajectory

[Figure: dual-core in 2006, quad-core in 2007]

Page 27

Blade Architectures - General

[Figure: multiple blade servers connected by a common interconnect]

• Blades interconnected by common fabrics
  – InfiniBand, Ethernet, Fibre Channel are most common
  – Redundant interconnect available for failover
  – Links from interconnect provide external connectivity

• Each blade contains multiple processors, memory, and network interfaces
  – Some options may exist, such as for memory, network connectivity, etc.

• Power, cooling, management overhead optimized within chassis
  – Multiple chassis connected together for greater number of nodes

Page 28

IBM BladeCenter H Architecture

• I/O Bridge
  – e.g., Ethernet, Fibre Channel, Passthru
  – Dual 4x (16-wire) wiring internally to each HSSM

• High-speed Switch
  – Ethernet or InfiniBand
  – 4x (16-wire) blade links
  – 4x (16-wire) bridge links
  – 1x (4-wire) Mgmt links
  – Uplinks: up to 12x links for IB and at least four 10Gb links for Ethernet

[Figure: chassis with Blades 1–14 connected to Switch Modules 1 and 2, Management Modules 1 and 2, High-Speed Switches 1–4, and I/O Bridges (bridges 3 and 4 sharing bays with Switch Modules 3 and 4)]

Page 29

[Figure: multiple BladeCenter chassis, each with Blades 1–14, joined by an external interconnect]

• External high performance interconnect(s) for multiple chassis

• Independent scaling of blades and I/O

• Scales for large clusters

• Architecture used for the Barcelona Supercomputing Center (MareNostrum, #8)

IBM BladeCenter H Architecture

Page 30

Cray (OctigaBay) Blade Architecture

• MPI offloaded in hardware: throughput 2900 MB/s and latency 1.6 µs (a message-time sketch follows the figure below)
• Processor and communication interface is HyperTransport
• Dedicated link and communication chip per processor
• FPGA accelerator available for additional offload

[Figure: blade with two Opterons, each with dedicated memory (5.4 GB/s, DDR 333), connected by HyperTransport (6.4 GB/s) to two Rapid Array Communications Processors (RAP, including MPI hardware offload capabilities) and an FPGA accelerator for application offload; 8 GB/s per external link]
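A common way to interpret the throughput and latency figures above is the simple linear communication-cost model T(n) ≈ latency + n / bandwidth. The sketch below applies that model to the quoted 1.6 µs / 2900 MB/s numbers; the message sizes are arbitrary illustrations.

```python
# Estimated MPI message time under a linear cost model: T(n) = latency + n / bandwidth.

latency_s = 1.6e-6          # 1.6 microseconds (from the slide)
bandwidth_Bps = 2900e6      # 2900 MB/s (from the slide)

for n in (8, 1_024, 65_536, 1_048_576):   # message sizes in bytes
    t = latency_s + n / bandwidth_Bps
    regime = "latency-bound" if n / bandwidth_Bps < latency_s else "bandwidth-bound"
    print(f"{n:>9} B message: ~{t * 1e6:8.1f} us ({regime})")
```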

Page 31

[Figure: chassis board options — monitoring & control ASIC with control system and power, one or two switches (second switch card optional), SATA via 8111 HT/SATA bridges, and 4 PCI-Express slots via two 8131 HT/PCI bridges, each attached to one blade]

Cray Blade Architecture

Blade Characteristics
• Two 2.2 GHz Opteron processors
  – Dedicated memory per processor
• Two Rapid Array Communication Processors
  – One dedicated link each
  – One redundant link each
• Application Accelerator FPGA
• Local hard drive

Shelf Characteristics
• One or two IB 4x switches
• Twelve or twenty-four external links
• Additional I/O:
  – Three high-speed I/O links
  – Four PCI-X bus slots
  – 100Mb Ethernet for management
• Active management system

Page 32

Cray Blade Architecture

• Six blades per 3U shelf
• Twelve 4x IB external links for the primary switch
• An additional twelve links are available with the optional redundant switch

[Figure: shelf-level view — each blade's two Opterons (5.4 GB/s DDR 333 memory, 6.4 GB/s HyperTransport) connect through RAPs (with MPI offload capabilities) and an accelerator to two Rapid Array Interconnects (24 x 24 IB 4x switches) at 8 GB/s per link, plus an active management system, 100 Mb Ethernet, and high-speed I/O / PCI-X]

Page 33

Cray Blade Architecture

• With up to 24 external links per OctigaBay 12K shelf, a variety of configurations can be achieved depending on the applications

• OctigaBay suggests interconnecting shelves by meshes, tori, fat trees, and fully-connected shelves for systems that fit in one rack. Fat-tree configurations require extra switches, which OctigaBay terms "spine switches."

• Mellanox InfiniBand technology is used for the interconnect
• Up to 25 shelves can be directly connected, yielding a 300-Opteron system

[Figure: multiple shelves connected by an interconnect]

Page 34

IBM BlueGene/L Architecture – Compute Card

• The BlueGene/L is the first in a new generation of systems made by IBM for massively parallel computing.

• The individual speed of the processor has been traded in favor of very dense packaging and low power consumption per processor. The basic processor in the system is a modified PowerPC 440 at 700 MHz.

• Two of these processors reside on a chip together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. The processors have two load ports and one store port from/to the L2 caches at 8 bytes/cycle. This is half of the bandwidth required by the two floating-point units (FPUs) and as such quite high.

• The CPUs have 32 KB of instruction cache and 32 KB of data cache on board. In favorable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s because the two FPUs can perform fused multiply-add operations (the arithmetic behind this peak is shown below). Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast.
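The 2.8 Gflop/s peak follows directly from the clock rate and the fused multiply-add capability; a quick check using only figures from the slide:

```python
# Peak per-core flop rate for the BlueGene/L compute chip.
clock_hz = 700e6     # 700 MHz
fpus = 2             # two floating-point units per core
flops_per_fma = 2    # a fused multiply-add counts as two floating-point operations

peak = clock_hz * fpus * flops_per_fma
print(f"Peak per core: {peak / 1e9:.1f} Gflop/s")   # 2.8 Gflop/s
```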

Page 35

IBM BlueGene/L Architecture

Page 36

IBM BlueGene/L Overview

• BlueGene/L boasts a peak speed of over 360 teraOPS, a total memory of 32 tebibytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet. The full system has 65,536 dual-processor compute nodes. Multiple communications networks enable extreme application scaling:

• Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications

• A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes

• Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds

• 1,024 gigabit-per-second links to a global parallel file system to support fast input/output to disk

• The BlueGene/L possesses no fewer than five networks, two of which are of interest for inter-processor communication: a 3-D torus network and a tree network.

• The torus network is used for most general communication patterns.
• The tree network is used for frequently occurring collective communication patterns such as broadcasts, reduction operations, etc. (a small sketch of torus addressing follows).
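To illustrate the 32 x 32 x 64 torus mentioned above, the sketch below computes a node's six nearest neighbors with wrap-around in each dimension and confirms the node count; the coordinate scheme is a generic illustration, not IBM's actual addressing.

```python
# Nearest neighbors in a 3D torus with wrap-around links, as in BlueGene/L's 32 x 32 x 64 torus.

DIMS = (32, 32, 64)
print("Total nodes:", DIMS[0] * DIMS[1] * DIMS[2])   # 65,536

def neighbors(x, y, z):
    """Each node has six neighbors: +/-1 in each dimension, modulo the torus size."""
    nx, ny, nz = DIMS
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

print(neighbors(0, 0, 63))   # wrap-around puts (0, 0, 63) next to (0, 0, 0)
```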

Page 37

IBM’s X3 Architecture

Page 38

IBM System x

X3 Chipset – Scalable Intel MP Server

[Figure: four EM64T (Xeon MP) processors connected to the X3 memory controller and scalability controller, with scalability ports to other Xeon MP processors, eight memory interfaces to RAM DIMMs, and I/O bridges providing PCI-X 2.0 at 266 MHz]

Page 39

IBM System x

X3 Chipset – Low Latency

[Figure: same X3 chipset diagram, highlighting a 108 ns memory access latency (presumably the local case)]

Page 40

IBM System x

X3 Chipset – Low Latency

[Figure: same X3 chipset diagram, highlighting a 222 ns memory access latency (presumably the remote case, across the scalability ports)]

Page 41

IBM System x

X3 Chipset – High Bandwidth

[Figure: same X3 chipset diagram, annotated with link bandwidths of 21.3 GB/s, 15 GB/s, 10.6 GB/s, and 6.4 GB/s]

Page 42

IBM System x

X3 Chipset – Snoop Filter

[Figure: two four-processor EM64T systems compared on an internal cache miss — "Others" (conventional node and memory controller) vs. the X3 chipset with its Hurricane controller]

• Others
  ! Cache from EACH processor must be snooped
  ! Creates traffic along the FSB

• X3
  ! Cache from EACH processor is mirrored on Hurricane (no traffic on the FSB)
  ! Relieves traffic on the FSB
  ! Faster access to main memory

Page 43

IBM System x

X3 Chipset – Snoop Filter

[Figure: repeat of the previous snoop-filter comparison — "Others": the cache from each processor must be snooped, creating traffic along the FSB; X3: the cache from each processor is mirrored on Hurricane, relieving FSB traffic and giving faster access to main memory]

Page 44

IBM System x


Multi-node Scalability – Putting it together

* Snoop filter and Remote Directory work together in multi-node configurations
* A local processor cache miss is broadcast to all memory controllers
* Only the node owning the latest copy of the data responds
* Maximizes system bus bandwidth

[Figure: four Hurricane controllers — the requester broadcasts the miss; the node whose main memory maps the cached address and the node owning the requested cache line respond with data, while the other nodes reply null. A sketch of this request/response flow follows.]
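A minimal sketch of the broadcast-and-respond flow described above, with each node holding a remote directory of lines it owns. The data structures and node count are hypothetical simplifications, not IBM's actual protocol state.

```python
# Sketch: multi-node cache-miss resolution with per-node remote directories.
# Each node records which cache lines it currently owns (has the latest copy of).

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.owned_lines = set()     # remote directory: lines this node owns

    def snoop(self, line_addr):
        """Respond with data only if this node owns the latest copy, else null."""
        return f"data from node {self.node_id}" if line_addr in self.owned_lines else None

nodes = [Node(i) for i in range(4)]
nodes[2].owned_lines.add(0x80)       # pretend node 2 holds the latest copy of line 0x80

def resolve_miss(requester, line_addr):
    """Requester broadcasts the miss; at most one owner responds, others reply null."""
    responses = [n.snoop(line_addr) for n in nodes if n.node_id != requester]
    return next((r for r in responses if r is not None), "fetch from home memory")

print(resolve_miss(requester=0, line_addr=0x80))   # -> data from node 2
print(resolve_miss(requester=0, line_addr=0x40))   # -> fetch from home memory
```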

Page 45

IBM System x

X3 Chipset – Scalability Ports

[Figure: four 4-way nodes joined by cabled scalability ports into a 16-way, single-OS-image MP system]

• X3 scales to 32-way; dual-core capable – 64 cores

Page 46

The End