Sgi Hpc Day Kiev 2009 10 Uv

Project Ultraviolet Overview

Company Confidential

Clusters vs. Shared Memory Architecture

• Each system has own memory and OS

• Batch, not interactive user interface

• Coding required for parallel code execution

• Great for capacity workflows

• SGI® Altix XE x86-64 clusters, Rackable BTO

• All nodes operate on one large shared memory space

• Cache Coherency

• Eliminates data passing between nodes

• Big data sets fit entirely in memory

• Less memory per node required

• Simpler to program

• High Performance, Low Cost, Easy to Deploy

[Diagram: Small Node x86 Clusters. Each node pairs its own memory with its own system and OS (mem, system+OS); the nodes are linked by a Commodity Interconnect.]

[Diagram: SGI® Altix™ 4000 Family, UV. Systems joined by the SGI® NUMAflex™ Interconnect run under one OS with Global shared memory.]

Infiniband vs. Numalink™ Interconnect

Interconnect Type             Bandwidth (each direction)
Infiniband 4xDDR              2.0 GBytes/s
Infiniband 4xQDR              4.0 GBytes/s
Numalink4 (Altix 4700/450)    3.2 GBytes/s
Numalink5 (UV)                7.5 GBytes/s
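
To put the per-direction bandwidths above in perspective, the following sketch estimates the ideal one-way transfer time of a payload over a single link at each peak rate. The 60 GB payload and the helper name `transfer_seconds` are illustrative assumptions, not figures from the slides, and the estimate ignores latency and protocol overhead.

```python
# Peak bandwidth per direction in GB/s, taken from the interconnect comparison.
BANDWIDTH_GB_PER_S = {
    "Infiniband 4xDDR": 2.0,
    "Infiniband 4xQDR": 4.0,
    "Numalink4 (Altix 4700/450)": 3.2,
    "Numalink5 (UV)": 7.5,
}

PAYLOAD_GB = 60.0  # hypothetical payload size, chosen only for illustration


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal one-way transfer time at peak bandwidth (no latency, no overhead)."""
    return payload_gb / bandwidth_gb_per_s


for name, bandwidth in BANDWIDTH_GB_PER_S.items():
    seconds = transfer_seconds(PAYLOAD_GB, bandwidth)
    print(f"{name:28s} {seconds:6.2f} s")
```

Real transfers would be slower than these lower bounds, but the ratios between the links follow directly from the peak rates.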


Independent Scaling

CPU

Memory

I/O


SGI® Modularity Evolution

Modules: Origin 2000

Bricks: Origin 3000, Altix

SGI Blades: Altix 4700/450

Timeline, 1997 to 2006: Modules, Bricks, Blades


SGI Scalable ccNUMA Architecture

Basic Node Structure and Interconnect

[Diagram: two nodes, each with two CPUs and their caches attached through an interface chip to physical memory. The NUMAlink Interconnect joins the interface chips, so the physical memories together form the shared memory.]


SGI Scalable ccNUMA Architecture

Scaling to Large Node Counts

[Diagram: many nodes, each with two CPUs and their caches attached through an interface chip to (local) physical memory, connected by NUMAlink and routers. Shared Memory within an SSI: OpenMP. Globally Addressable Memory (GAM) within a NUMAlinked system: MPI.]

Why MOE (MPI Offload Engine)?

[Chart: Applications on Altix 3000, Communication vs. Computation. Axis from 0% to 100%; legend: Computation, Communication. Applications as labeled: Nastran/4, Pam-Crash/32, Ls-Dyna/48p, Radioss/96, Fluent/64, StarHPC/32, Fire/32, Gamess/32, Amber/8, CASTEP/128, ADF/32, HOMME/1944, MM5/96, HIRLAM/128, CCM3/64, IFS/120, GeoDepth, Eclipse/52, VIP/32. Segment labels: CSM, CCM, CWO, RES, CFD, BIO, SPI.]


SGI® Project Ultraviolet Overview

Extraordinary Capability in an x86 Architecture

• Performance and Productivity for Demanding Workloads

• Highly Data-Efficient – up to Many Terabytes of Data in Memory

• Scales to 2048 Cores and 16TB in a Single x86 System

• Scales IO to >1TB/s

• Advanced Reliability

• Hardware-enabled Fault Detection, Prevention, Containment

• Enhanced Monitoring and Serviceability

• Low TCO

• X86-64 and Linux Economics

• Industry Leading Rack-level Energy Efficiency

• Easiest System to Administer and Productively Use


UV Architectural Scalability

• 16,384 Nodes (scaling supported by NUMAlink5 node ID)

– 16,384 UV_HUBs

– 32,768 Sockets / 262,144 Cores (with 8 cores per socket)

– >2 pflop

• Coherent shared memory

– Xeon: 16TB (44 bits socket PA)

• 8PB coherent get/put memory (53 bits PA w/GRU)

• 16 DIMMs per node (2 DIMMs per channel)

• Intel coherence scheme within node

• SGI coherence scheme between nodes
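
The scaling limits on this slide follow from simple arithmetic, sketched below. The figure of two sockets per node is an assumption inferred from the two-socket compute blade; the node count, cores per socket, and physical address (PA) widths come from the slide. With these inputs the totals work out to 32,768 sockets and 262,144 cores.

```python
NODES = 16_384          # from the slide (NUMAlink5 node ID limit)
SOCKETS_PER_NODE = 2    # assumption: two sockets per UV_HUB, as on the compute blade
CORES_PER_SOCKET = 8    # from the slide

TIB = 2 ** 40           # bytes in one binary terabyte
PIB = 2 ** 50           # bytes in one binary petabyte


def addressable_bytes(pa_bits: int) -> int:
    """Bytes reachable with a physical address of the given width."""
    return 2 ** pa_bits


sockets = NODES * SOCKETS_PER_NODE
cores = sockets * CORES_PER_SOCKET
coherent_tb = addressable_bytes(44) // TIB   # Xeon socket PA: 44 bits
get_put_pb = addressable_bytes(53) // PIB    # PA with GRU: 53 bits

print(f"sockets: {sockets:,}")
print(f"cores: {cores:,}")
print(f"coherent shared memory: {coherent_tb} TB")
print(f"coherent get/put memory: {get_put_pb} PB")
```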


UV Accelerated Performance

For Distributed or Shared Memory Programming

MPI Offload Engine (MOE) frees the CPU from MPI activity

- MPI reductions 2-3X faster than competitive clusters/MPPs

- Barriers up to 80X+ faster

NUMAlink advances: industry’s most efficient interconnect

- 2-3X MPI latency improvement

Massively memory-mapped I/O

- Big speedup for I/O-bound apps

Hold massive datasets in memory

- Up to 16TB per OS system image, to petascale across systems

- Up to 10X+ speedup for data-intensive applications

[Chart: Longest Path MPI Latency vs. Destination CPU (0 to 2000) for Altix 4700, Altix ICE, and UV; vertical axis 0 to 6.]


UV Low TCO

Economical to own and operate

• Excellent Price/performance

– x86 economics plus UV performance advantages

– 3-5X compared to today’s Altix

– Can take the place of multiple systems

• Leading Rack-level Power Efficiency

– UV stretch goal = 80%

• Most Economical System

– to administer and use

[Chart: Delivered Rack-Level Power Efficiency; axis 55% to 80%; systems: Origin 2000, Origin 3000, Altix 3000, Altix 4000, Carlsbad, Ultraviolet (UV); data labels 60%, 65%, 70%, 75%, 78%.]


Project Ultraviolet Product Design

• Bladed Node Package

– Memory or compute-dense blades

• Variety of IO expansion options

• Mix/match resources

• Expand or reconfigure when needed

• Industry-leading Scalability

• Run standard Linux Distros

– RedHat, SLES


IRU (Chassis) Packaging and Topology

N+1 PS

1+1 PS for Blowers

24” EIA

Compute node with IO expansion capability

(8) NUMAlink 5 Ports per Router Cabled to Network

Paired Nodes (Dual NUMAlink 5 Cross-Linked)

(8) NUMAlink 5 Fan-In Ports per Router

24” IRU Topology

18U: 16 blade IRU for 24” rack

3U: 2 blade IRU for 19” rack


Ultraviolet Rack

• Blade-based packaging

• Air-Cooled electronics

• N+1 12VDC Power Supplies

• N+1 Axial Fans

• (2) 60A 200VAC-240VAC 3-Φ IEC 60309 plugs provide 17.3 kVA each

• Rack Nameplate 34.5 kVA max

• Optional water-cooling

• Leverages SGI® Altix® ICE 8200

• (64) Intel® Xeon® Sockets

• (512) Intel Xeon Cores

• (512) DDR3 RDIMMs

• 128GB / node (w/ 8GB DIMMs)

• 4TB / rack (w/ 8GB DIMMs)

• Integrated BaseIO & Boot HDDs

• Integrated or External IO Expansion

• SGI® NUMAlink™ 5 network

• (1) System Management Node per up to 4-racks

• IO Expansion for higher power or larger form factor cards
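
The per-rack core, DIMM, and memory totals can be cross-checked with a few multiplications. This is a minimal sketch: two sockets per node and 8 cores per socket are assumptions carried over from the compute blade and the architectural scalability figures, and the 16 DIMMs per node comes from the same scalability slide.

```python
SOCKETS_PER_RACK = 64   # from the rack slide
SOCKETS_PER_NODE = 2    # assumption: two-socket compute blades throughout
CORES_PER_SOCKET = 8    # assumption: 8-core sockets
DIMMS_PER_NODE = 16     # from the architectural scalability slide
DIMM_GB = 8             # 8GB DIMMs, as stated on the rack slide

nodes_per_rack = SOCKETS_PER_RACK // SOCKETS_PER_NODE
cores_per_rack = SOCKETS_PER_RACK * CORES_PER_SOCKET
dimms_per_rack = nodes_per_rack * DIMMS_PER_NODE
gb_per_node = DIMMS_PER_NODE * DIMM_GB
tb_per_rack = dimms_per_rack * DIMM_GB / 1024

print(f"nodes per rack:  {nodes_per_rack}")
print(f"cores per rack:  {cores_per_rack}")
print(f"DIMMs per rack:  {dimms_per_rack}")
print(f"memory per node: {gb_per_node} GB")
print(f"memory per rack: {tb_per_rack:.0f} TB")
```

Under these assumptions the results agree with the slide: 512 cores, 512 RDIMMs, 128GB per node, and 4TB per rack.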


UV System Packaging Options

High Performance

42U, 24 inch rack

64 skts, 512c per rack

4TB memory (8GB DIMM)

Up to 4.65 tflop

Fat Tree, 7.5GB/s/skt bisection

NL scalable to 16K sockets

Up to 2048-core SSI supported

Price-performance

42U, 24 inch rack, routerless

64 skts, 512c

Up to 4.65 tflop

2D Torus, 1.25GB/s/skt bisection

Can be clustered with IB, Gig-e

Midrange Capability, 19” rack

20U, 19 inch rack: 24 skts, 192c, up to 1.8 tflop per short rack

40U, 19 inch rack: up to 50 skts, 400c, up to 3.5 tflop per rack

[Diagram labels: IO Expansion, 16 blade chassis, 2 blade 24/32 core chassis, Quad Router, Storage, Admin Node, Short Rack]


Capability Comparisons

UV-Midrange Offers More Headroom

[Chart: UV-Midrange compared with Scalable x86 (IBM, Bull, Unisys), 8S Glueless, and IBM P6 570, 575, HP Integrity on three axes: System Scale (SSI sockets), Max Memory (TB), and Max IO (PCIe slots/system). Data labels: 96, 6, 64+.]


UV Nehalem-EX Node Board - Compute Blade

[Block diagram: two Nehalem-EX sockets, each with (8) DDR3 RDIMMs and (4) Millbrook Memory Buffers; QPI links; a UV HUB with (4) NUMAlink 5 ports, (2) Directory FB-DIMMs, and RLDRAM (Snoop Acceleration); a Boxboro IOH with an Optional I/O Riser.]

• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate

• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4 GT/s)

• Directory FBD1 = 6.4 GB/s Read + 3.2 GB/s Write (800MHz DIMMs)

• Millbrook Memory Buffers = 8.53 GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket

• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket

Each Blade: 8-16 Xeon cores, up to 145 gflop, up to 128GB

Single-Socket Memory/IO Expansion Blade also available
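
The bandwidth figures listed for the compute blade can be reproduced from transfer rates and bus widths. The sketch below assumes the standard widths for these interfaces (8 bytes per transfer for a DDR3 channel, 2 bytes of payload per transfer per direction for QPI) and takes the NUMAlink 5 per-direction rate of 7.5 GB/s from the interconnect comparison; the helper name `gb_per_s` is illustrative.

```python
def gb_per_s(transfers_per_s: float, bytes_per_transfer: int) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) for a given transfer rate and width."""
    return transfers_per_s * bytes_per_transfer / 1e9


# DDR3-1066 runs at a nominal 1066.67 MT/s (the slide rounds this to 1067MHz)
# over an 8-byte-wide channel. The width is an assumption based on standard DDR3.
ddr3_channel = gb_per_s(1066.67e6, 8)
read_per_socket = 4 * ddr3_channel          # 4 channels per socket

# QPI at 6.4 GT/s, assuming 2 bytes of payload per transfer in each direction.
qpi_per_direction = gb_per_s(6.4e9, 2)
qpi_aggregate = 2 * qpi_per_direction       # both directions

# NUMAlink 5: 7.5 GB/s each direction, so the aggregate doubles it.
numalink5_aggregate = 2 * 7.5

print(f"DDR3 channel:        {ddr3_channel:.2f} GB/s")
print(f"Read per socket:     {read_per_socket:.1f} GB/s")
print(f"QPI aggregate:       {qpi_aggregate:.1f} GB/s")
print(f"NUMAlink 5 aggregate: {numalink5_aggregate:.1f} GB/s")
```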


UV Single Socket or Memory Expansion Blade

[Block diagram, Memory Expansion Blade: a single Nehalem-EX socket with (16) DDR3 RDIMMs and (4) Millbrook Memory Buffers; QPI links; a UV HUB with (4) NUMAlink 5 ports; a Boxboro IOH with an Optional I/O Riser.]

• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate

• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4 GT/s)

• Millbrook Memory Buffers = 8.53 GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket

• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket


UV Nehalem-EX Node Board

UV_HUB

Socket

Socket

Mezzanine Connector (2) Quick Path links to I/O Riser

11.2-in W x 19.5-in L (1/2 Panel Board)

RLDRAM


(4) Integrated IO Riser Options

BaseIO

Externalized IO

(2) PCIe Gen2 x16 Cable Connections to IO Expansion Chassis

Integrated PCIe Gen2

(1) x16 low-profile

(1) x8 low-profile

Node Blade or Memory Expansion Blade

(2) Hot plug 2.5” Boot HDD


UV IO Expansion Chassis in Development

For Full-height and High-Power Card Support

One x16 PCIe G2.0 input connector

1U

Each unit supports up to 4 slots, either PCI-X or PCIe


P0- Node, in Engr Test


P0- BaseIO


18U 24-in EIA Individual Rack Unit (IRU)

Front (Node Blade) View Rear (Router Blade) View


Power, Cooling and Facilities


SGI Altix ICE 8200 Water-Cooled Coils

Target Heat Rejection 95% water / 5% air

3/4” (1.91 cm) Coupling

(4) Individual Coils

Chilled-Water Supply 45°F to 60°F (7.2°C to 15.6°C)

14.4 gpm (3.3 m3/hr) Max.

Swivel Coupling to Supply Hose

Branch Feed to Individual Coil

Condensate Drain Pan


UV Rack w/ Top-Feed Water-Cooled Coil

Target Heat Rejection 95% water / 5% air

Chilled-Water Supply 45°F to 65°F (7.2°C to 18.3°C)

16.0 gpm (3.6 m3/hr) Max.

1” (2.54 cm) Coupling

UV Enhancements:

- Reduce water-side pressure drop

- Increase allowable water supply temp to 65°F (18.3°C)

- Enable top-feed water


80 Plus® Organization

Ultraviolet Power Supplies Planned to be Gold Certified

• Mission

– Unique forum that is uniting electric utilities, the computer industry and consumers in a groundbreaking effort to bring energy efficient power supplies to desktop computers and servers

• N+0 desktop power supply certification available today

– SGI worked with 80 Plus to draft N+1 server power supply specification

• http://www.80plus.org/

80 Plus           Bronze     Silver     Gold
CSCI              Year 1     Year 2     Year 3
                  July-07    July-08    July-09
20% PSU Load      81%        85%        88%
50% PSU Load      85%        89%        92%
100% PSU Load     81%        85%        88%
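
The efficiency thresholds translate into wall power as input = output / efficiency. The sketch below applies the Gold column to a supply with a hypothetical 1000 W rating; the rating and the helper name `input_watts` are assumptions for illustration, not figures from the slides.

```python
# Gold-level efficiency at each load point, from the 80 Plus table.
GOLD_EFFICIENCY = {0.20: 0.88, 0.50: 0.92, 1.00: 0.88}

RATED_OUTPUT_W = 1000.0  # hypothetical supply rating, not from the slides


def input_watts(output_w: float, efficiency: float) -> float:
    """AC input power needed to deliver output_w of DC power at the given efficiency."""
    return output_w / efficiency


for load, efficiency in GOLD_EFFICIENCY.items():
    out_w = RATED_OUTPUT_W * load
    in_w = input_watts(out_w, efficiency)
    print(f"{load:4.0%} load: {out_w:6.0f} W out, {in_w:7.1f} W in, {in_w - out_w:6.1f} W lost")
```

The loss column is the heat the supply itself adds to the rack, which is why supply efficiency feeds directly into rack-level efficiency.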


Energy Efficiency : Rack Level

[Chart: Net (all-in) Rack Energy Efficiency Roadmap (N.B. even higher efficiency if no water-coil); axis 55% to 80%; systems: Origin 2000, Origin 3000, Altix 3000, Altix 4000, Carlsbad, Ultraviolet; data labels 60%, 65%, 70%, 75%, 78%, with a stretch goal marked.]


UV Rack Power

• 34.5kVA Rack Nameplate

– Used for facilities wire-sizing

• 33.3kW Power Model Roll-Up

– 130W TDP sockets, full memory, fans at altitude with water-coil impedance

• 30.0kW Estimate Running Linpack

– 90% of Power Model

– “Maximum Measured”

• 22.5kW Estimate Running Applications

– ~75% of Linpack Power

– Used for energy consumption planning (kWh)
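
The chain of estimates above can be reproduced step by step, and the application figure extended into an annual energy number for planning. The power model value and the two percentages come from the slide; continuous operation over a 365-day year is an assumption made only for this sketch.

```python
POWER_MODEL_KW = 33.3        # power model roll-up, from the slide
LINPACK_FRACTION = 0.90      # Linpack estimate is 90% of the power model
APPLICATION_FRACTION = 0.75  # applications run at ~75% of Linpack power
HOURS_PER_YEAR = 24 * 365    # assumption: continuous operation, non-leap year

# The slide rounds each step to one decimal place, so the sketch does the same.
linpack_kw = round(POWER_MODEL_KW * LINPACK_FRACTION, 1)
application_kw = round(linpack_kw * APPLICATION_FRACTION, 1)
annual_kwh = application_kw * HOURS_PER_YEAR

print(f"Linpack estimate:     {linpack_kw} kW")
print(f"Application estimate: {application_kw} kW")
print(f"Annual energy:        {annual_kwh:,.0f} kWh per rack")
```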


Projected UV Performance Advances

MPI and HPCC, Barriers Speedups with GRU

Excellent BW/latency Profile for Large Jobs

[Chart: Longest Path MPI Latency vs. Destination CPU (0 to 2000) for Altix 4700, Altix ICE, and UV; vertical axis 0 to 6. A further panel plots IB, NL4, and UV-NL 5 against Destination CPU.]

[Chart: HPCC Benchmarks (ptrans, FFTE, Random Access), UV with GRU vs. UV no GRU.]

[Chart: Single element MPI_reduce. Time for MPI_Reduce (us), 0 to 25, vs. number of threads, 256 to 262,144, for IB and UV; annotated 3X.]

Barrier Latency <1usec (4096 thread)

[Chart: MPI Bandwidth vs Message Size (Bytes), UV vs. Typical Cluster Systems.]

Source: Qlogic, Inc.


MPI Latency

UV MPI Half Ping-Pong Latencies, Longest Path

[Chart: latency in ns (0 to 2000) vs. cores (32 to 256K).]


UV_HUB / Node Controller Technologies

Processor Interface

• Snoop Acceleration

• Large Number of In-Flight References

Globally Addressable Memory

• Large Shared Address Space

• Extremely Large Coherent Get/Put Space

• AMOs in Coherent Memory

• Coherence Directory

RAS

• Redundant Real-Time Clock

• Built-In Debug and Performance Monitors

• Internal/External Datapath Protection

• Alpha-immune Flip-Flops

Active Memory Unit

• Rich set of Atomic Operations

• AMO cache at memory home

• Multicast

• Message Queues in Coherent Memory

• Page Initialization

GRU Global Reference Unit

• High-BW, Low-Latency Socket Communication

• Update Cache for many AMOs

• Scatter/Gather Operations

• BCOPY Operations

• External TLB with Large Page Support


© 2008 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, Altix, XFS, the SGI logo, NUMAflex and the Silicon Graphics cube are registered trademarks and NUMAlink, CXFS are trademarks of SGI in the U.S. and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds in several countries. Intel, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All other trademarks mentioned herein are the property of their respective owners.

Thank You!


Paired UV Nehalem-EX Nodes

[Block diagram: two node boards, each with two Nehalem-EX sockets ((8) DDR3 RDIMMs and (4) Millbrook Memory Buffers per socket), QPI links, a UV HUB with (2) NUMAlink 5 ports, and a Boxboro IOH with an Optional I/O Riser. The pair is joined by (2) SGI NUMAlink 5 links on the backplane.]

• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate

• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4 GT/s)

• Millbrook Memory Buffers = 8.53 GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket

• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket


SGI Flagship Platform Evolution

SGI’s Flagship Product Line has 4 Characteristics:

1. GAM

2. SSI

3. x/core, where x={I/O, Memory}

4. SWAP (and cooling)

UV - 3 things to know:

1. Xeons into the Flagship Product Line WITHOUT COMPROMISE

2. MOE (MPI Offload Engine)

3. Topology Options:

- Selectable Fat-tree sizes

- Vertices within a Torus

- Paired Node Routerless or Routed

- Constellations


UV HUB/Node Controller Features

Extended Capability

• Enabling Enterprise-class scalability and reliability on x86-64

– Cache-coherence across nodes

– Fault resiliency: mirror thru block devices in memory, survive OS crash

– Extensive fault isolation, datapath protection, monitoring/debug functions

• Accelerating Large-scale workloads

– Fast Message-Passing (without cpu cache-line delays)

– Extends cpu capability for load requests

– System scale to 256+ sockets, 2048+ cores on standard Linux

• Accelerating Data-intensive applications

– Extended physical memory address to peta-scale (8PB)

– Extended “Super” TLB page size (1TB, map up to 4PB)

– Avoid TLB misses for large, random data references

– Very fast locking mechanism for highly contended data (no cache-line delay)

– Off-load add, compare, swap instructions

• HUB/Node controller directly exposed to user for easy utilization

– No system calls


System Management

• UV maintains the hierarchical system management approach.

– Origin/Altix: L1/L2/L3

– ICE/UV: BMC, CMC, Leader Node/SMN

– Command line interface at L2 & CMC very similar

• Unified approach to system management wrapped into SGI Cluster Manager

• SNMP used extensively across product lines including UV

– Hardware inventories & sensor values stored in MIB format

– SNMP data coalesced at SMN, available via SGI provided RAS software or through SNMP queries by 3rd party or customer developed apps
