Sgi Hpc Day Kiev 2009 10 Uv

Project Ultraviolet Overview


Transcript of Sgi Hpc Day Kiev 2009 10 Uv

Page 1: Sgi Hpc Day Kiev 2009 10 Uv

Project Ultraviolet Overview

Page 2: Sgi Hpc Day Kiev 2009 10 Uv

Company Confidential

Clusters vs. Shared Memory Architecture

Small Node x86 Clusters (Commodity Interconnect)

• Each system has own memory and OS

• Batch, not interactive user interface

• Coding required for parallel code execution

• Great for capacity workflows

• SGI® Altix XE x86-64 clusters, Rackable BTO

Global Shared Memory (SGI® NUMAflex™ Interconnect)

• All nodes operate on one large shared memory space

• Cache Coherency

• Eliminates data passing between nodes

• Big data sets fit entirely in memory

• Less memory per node required

• Simpler to program

• High Performance, Low Cost, Easy to Deploy

• SGI® Altix™ 4000 Family, UV

[Diagram: a cluster of small nodes, each with its own memory and OS, joined by a commodity interconnect, beside several systems sharing one global memory under a single OS over the SGI® NUMAflex™ interconnect]
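The difference between the two programming models on this slide can be sketched in a few lines. This is only a toy illustration, assuming Python threads stand in for nodes; the function names are hypothetical and nothing here models SGI hardware. In the first function every worker updates one shared total, while in the second each worker must send its partial result as an explicit message.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Workers add their partial sums straight into one shared total."""
    total = [0]                  # state visible to every worker
    lock = threading.Lock()      # guards the shared update

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(chunks):
    """Workers keep private data and send partial sums as explicit messages."""
    inbox = queue.Queue()

    def worker(chunk):
        inbox.put(sum(chunk))    # explicit send of the partial result

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in chunks)   # explicit receives


data = list(range(1000))
chunks = [data[i::4] for i in range(4)]       # four hypothetical workers
print(shared_memory_sum(chunks), message_passing_sum(chunks))
```

Both functions return the same total; the message-passing version simply makes the data movement visible in the code.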

Page 3: Sgi Hpc Day Kiev 2009 10 Uv

Infiniband vs. NUMAlink™ Interconnect

Interconnect Type             Bandwidth (each direction)
Infiniband 4xDDR              2.0 GBytes/s
Infiniband 4xQDR              4.0 GBytes/s
NUMAlink4 (Altix 4700/450)    3.2 GBytes/s
NUMAlink5 (UV)                7.5 GBytes/s
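To put the per-direction bandwidths in perspective, the sketch below computes the ideal one-way streaming time for a payload over a single link. The 60 GB payload and the function name are assumptions chosen for illustration; real transfers would also pay latency and protocol overhead, which this ignores.

```python
# Per-direction link bandwidths from the slide, in GBytes/s.
BANDWIDTH_GB_S = {
    "Infiniband 4xDDR": 2.0,
    "Infiniband 4xQDR": 4.0,
    "NUMAlink 4": 3.2,
    "NUMAlink 5": 7.5,
}


def transfer_seconds(payload_gb, link):
    """Ideal time to stream payload_gb gigabytes one way over one link."""
    return payload_gb / BANDWIDTH_GB_S[link]


PAYLOAD_GB = 60.0  # hypothetical payload, not from the slide
for link in BANDWIDTH_GB_S:
    print(f"{link:18s} {transfer_seconds(PAYLOAD_GB, link):6.2f} s")

# Ratio of NUMAlink 5 to Infiniband 4xQDR per-direction bandwidth.
nl5_vs_qdr = BANDWIDTH_GB_S["NUMAlink 5"] / BANDWIDTH_GB_S["Infiniband 4xQDR"]
print(f"NUMAlink 5 / IB 4xQDR = {nl5_vs_qdr:.3f}")
```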

Page 4: Sgi Hpc Day Kiev 2009 10 Uv


Independent Scaling

CPU

Memory

I/O

Page 5: Sgi Hpc Day Kiev 2009 10 Uv

SGI® Modularity Evolution

Modules: Origin 2000

Bricks: Origin 3000, Altix

SGI Blades: Altix 4700/450

Timeline: 1997 to 2006 (Modules, Bricks, Blades)

Page 6: Sgi Hpc Day Kiev 2009 10 Uv

SGI Scalable ccNUMA Architecture

Basic Node Structure and Interconnect

[Diagram labels: two nodes, each with two CPUs, two caches, an interface chip, and physical memory; NUMAlink Interconnect between the nodes; Shared Memory across both]

Page 7: Sgi Hpc Day Kiev 2009 10 Uv

SGI Scalable ccNUMA Architecture

Scaling to Large Node Counts

[Diagram labels: four nodes, each with two CPUs, two caches, an interface chip, and (local) physical memory, connected through NUMAlink and routers]

Shared Memory (within an SSI): OpenMP

Globally Addressable Memory (GAM) within a NUMAlinked system: MPI

Page 8: Sgi Hpc Day Kiev 2009 10 Uv

Applications on Altix 3000: Communication vs. Computation

[Chart: share of run time (0% to 100%) split between computation and communication for each application, grouped by segment (CSM, CCM, CWO, RES, CFD, BIO, SPI)]

Applications shown: Nastran/4, Pam-Crash/32, Ls-Dyna/48p, Radioss/96, Fluent/64, StarHPC/32, Fire/32, Gamess/32, Amber/8, CASTEP/128, ADF/32, HOMME/1944, MM5/96, HIRLAM/128, CCM3/64, IFS/120, GeoDepth, Eclipse/52, VIP/32

Page 9: Sgi Hpc Day Kiev 2009 10 Uv

Why MOE (MPI Offload Engine)?

Page 10: Sgi Hpc Day Kiev 2009 10 Uv

SGI® Project Ultraviolet – Overview

Extraordinary Capability in an x86 Architecture

• Performance and Productivity for Demanding Workloads

• Highly Data-Efficient – up to Many Terabytes of Data in Memory

• Scales to 2048 Cores and 16TB in a Single x86 System

• Scales IO to >1TB/s

• Advanced Reliability

• Hardware-enabled Fault Detection, Prevention, Containment

• Enhanced Monitoring and Serviceability

• Low TCO

• X86-64 and Linux Economics

• Industry Leading Rack-level Energy Efficiency

• Easiest System to Administer and Productively Use

Page 11: Sgi Hpc Day Kiev 2009 10 Uv

UV Architectural Scalability

• 16,384 Nodes (scaling supported by NUMAlink5 node ID)

– 16,384 UV_HUBs

– 32,768 Sockets / 262,144 Cores (with 8 cores per socket)

– >2 pflop

• Coherent shared memory

– Xeon: 16TB (44 bits socket PA)

• 8PB coherent get/put memory (53 bits PA w/GRU)

• 16 DIMMs per node (2 DIMMs per Channel)

• Intel coherence scheme within node

• SGI coherence scheme between nodes
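The scaling limits on this slide follow from simple arithmetic, which the sketch below checks. Two sockets per node is an assumption inferred from the socket and core totals, and the address-space figures are read as binary units (TiB, PiB); neither is stated explicitly on the slide.

```python
NODES = 16_384            # from the slide (one UV_HUB per node)
SOCKETS_PER_NODE = 2      # assumption: two sockets behind each UV_HUB
CORES_PER_SOCKET = 8      # from the slide

sockets = NODES * SOCKETS_PER_NODE
cores = sockets * CORES_PER_SOCKET

TIB = 2**40
PIB = 2**50
coherent_tib = 2**44 // TIB   # 44-bit socket physical address
get_put_pib = 2**53 // PIB    # 53-bit physical address with the GRU

# Per-core rate needed for the ">2 pflop" figure at full scale.
gflops_per_core_for_2pf = 2e15 / cores / 1e9

print(f"sockets={sockets:,} cores={cores:,}")
print(f"coherent={coherent_tib} TiB get/put={get_put_pib} PiB")
print(f"per-core rate for 2 pflop: {gflops_per_core_for_2pf:.2f} gflop")
```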

Page 12: Sgi Hpc Day Kiev 2009 10 Uv

UV Accelerated Performance

For Distributed or Shared Memory Programming

MPI Offload Engine (MOE) frees CPU from MPI activity

- MPI Reductions 2-3X faster than competitive clusters/MPPs

- Barriers up to 80X+ faster

NUMAlink Advances – industry’s most efficient interconnect

Massively Memory-mapped I/O

- Big speedup for I/O bound apps

Hold massive datasets in memory

- To 16TB per OS system image, to petascale across systems

Page 13: Sgi Hpc Day Kiev 2009 10 Uv

UV Accelerated Performance

For Distributed or Shared Memory Programming

MPI Offload Engine (MOE) frees CPU from MPI activity

- MPI Reductions 2-3X faster than competitive clusters/MPPs

- Barriers up to 80X+ faster

NUMAlink Advances

- 2-3X MPI latency improvement

Massively Memory-mapped I/O

- Big speedup for I/O bound apps

Hold massive datasets in memory

- To 16TB per OS image, to petascale across systems

- Up to 10X+ speedup for data-intensive applications

[Chart: longest-path MPI latency (axis 0 to 6) vs. destination CPU (0 to 2000) for Altix 4700, Altix ICE, and UV]

Page 14: Sgi Hpc Day Kiev 2009 10 Uv

UV Low TCO

Economical to own and operate

• Excellent Price/performance

– x86 economics plus UV performance advantages

– 3-5X compared to today’s Altix

– Can take the place of multiple systems

• Leading Rack-level Power Efficiency

– UV stretch goal = 80%

• Most Economical System

– to administer and use

[Chart: Delivered Rack-Level Power Efficiency (axis 55% to 80%) for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Carlsbad, and Ultraviolet (UV); data labels 60%, 65%, 70%, 75%, 78%]

Page 15: Sgi Hpc Day Kiev 2009 10 Uv

Project Ultraviolet Product Design

• Bladed Node Package

– Memory or compute-dense blades

• Variety of IO expansion options

• Mix/match resources

• Expand or reconfigure when needed

• Industry-leading Scalability

• Run standard Linux Distros

– RedHat, SLES

Page 16: Sgi Hpc Day Kiev 2009 10 Uv

IRU (Chassis) Packaging and Topology

• 16 blade IRU for 24” rack (18U, 24” EIA)

• 2 blade IRU for 19” rack (3U)

• N+1 PS

• 1+1 PS for Blowers

• Compute node with IO expansion capability

24” IRU Topology

• Paired Nodes (Dual NUMAlink 5 Cross-Linked)

• (8) NUMAlink 5 Fan-In Ports per Router

• (8) NUMAlink 5 Ports per Router Cabled to Network

Page 17: Sgi Hpc Day Kiev 2009 10 Uv

Ultraviolet Rack

• Blade-based packaging

• Air-Cooled electronics

• N+1 12VDC Power Supplies

• N+1 Axial Fans

• (2) 60A 200VAC-240VAC 3-Φ IEC 60309 plugs provide 17.3 kVA each

• Rack Nameplate 34.5 kVA max

• Optional water-cooling

• Leverages SGI® Altix® ICE 8200

• (64) Intel® Xeon® Sockets

• (512) Intel Xeon Cores

• (512) DDR3 RDIMMs

• 128GB / node (w/ 8GB DIMMs)

• 4TB / rack (w/ 8GB DIMMs)

• Integrated BaseIO & Boot HDDs

• Integrated or External IO Expansion

• SGI® NUMAlink™ 5 network

• (1) System Management Node per up to 4 racks

• IO Expansion for higher power or larger form factor cards
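The rack figures above are mutually consistent, as the following sketch shows. The memory numbers come straight from the slide; the power calculation rests on two assumptions that the slide does not state: a 208 V line-to-line supply (within the quoted 200-240 VAC range) and an 80% continuous-load derating of each 60 A plug.

```python
import math

SOCKETS = 64
CORES = 512
DIMMS = 512
DIMM_GB = 8
SOCKETS_PER_NODE = 2      # assumption: two-socket node blades

nodes = SOCKETS // SOCKETS_PER_NODE
cores_per_socket = CORES // SOCKETS
gb_per_node = (DIMMS // nodes) * DIMM_GB
tb_per_rack = DIMMS * DIMM_GB / 1024


def three_phase_kva(volts_line_to_line, amps):
    """Apparent power of a balanced three-phase feed, in kVA."""
    return math.sqrt(3) * volts_line_to_line * amps / 1000


# Assumed values: 208 V line-to-line, 60 A plug derated to 80%.
feed_kva = three_phase_kva(208, 0.8 * 60)

print(f"{cores_per_socket} cores/socket, {gb_per_node} GB/node, {tb_per_rack} TB/rack")
print(f"per-feed {feed_kva:.1f} kVA, two feeds {2 * feed_kva:.1f} kVA")
```

Under those assumptions each feed works out to about 17.3 kVA, in line with the per-plug figure and the 34.5 kVA nameplate.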

Page 18: Sgi Hpc Day Kiev 2009 10 Uv

UV System Packaging Options

High Performance

• 42U, 24 inch rack

• 64 skts, 512c per rack

• 4TB memory (8GB DIMM)

• Up to 4.65 tflop

• Fat Tree, 7.5GB/s/skt bisection

• NL Scalable to 16K sockets

• Up to 2048 core SSI supported

Price-performance

• 42U, 24 inch rack, routerless

• 64 skts, 512c

• 4TB memory (8GB DIMM)

• Up to 4.65 tflop

• 2D Torus, 1.25GB/s/skt bisection

• Can be clustered with IB, Gig-e

Midrange Capability (19” rack)

• 20U, 19 inch short rack: 24 skts, 192c; 3TB memory (8GB DIMM); up to 1.8 tflop per short rack

• 40U, 19 inch rack: up to 50 skts, 400c; 3TB memory (8GB DIMM); up to 3.5 tflop per rack

[Rack diagram labels: IO Expansion, 16 blade chassis, Admin Node, 2 blade 24/32 core chassis, Storage, Quad Router, Short Rack]

Page 19: Sgi Hpc Day Kiev 2009 10 Uv

Capability Comparisons

UV-Midrange Offers More Headroom

Systems compared: UV-Midrange; Scalable x86 (IBM, Bull, Unisys); 8S Glueless; IBM P6 570, 575, HP Integrity

Metrics for each system: System Scale (Sockets), Max Memory (TB), Max IO (PCIe slots/system)

[Chart: axes SSI (sockets), Max Memory (TB), Max IO (slots); labeled values 96, 6, 64+]

Page 20: Sgi Hpc Day Kiev 2009 10 Uv

UV Nehalem-EX Node Board - Compute Blade

[Block diagram labels: two Nehalem-EX sockets; QPI links; UV HUB with (4) NUMAlink 5; (2) Directory FB-DIMMs; RLDRAM (Snoop Acceleration); Boxboro IOH; Optional I/O Riser; (8) DDR3 RDIMMs & (4) Millbrook Memory Buffers per socket]

• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate

• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4GT/s)

• Directory FBD1 = 6.4GB/s Read + 3.2GB/s Write (800MHz DIMMs)

• Millbrook Memory Buffers = 8.53GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket

• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket

Each Blade: 8-16 Xeon cores, up to 145 gflop, up to 128GB

Single-Socket Memory/IO Expansion Blade also Available
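The aggregate bandwidths quoted for this blade can be rebuilt from per-link figures, as sketched below. The bytes-per-transfer values and the four double-precision flops per cycle per core are assumptions about the underlying parts rather than numbers taken from the slide, so the implied clock at the end is only an inference.

```python
# Figures quoted in the deck.
NUMALINK5_PER_DIRECTION_GB_S = 7.5   # per-direction figure from the interconnect table
QPI_GT_S = 6.4
DDR3_MT_S = 1066.67                  # "1067MHz" DDR3, in mega-transfers per second
MEMORY_CHANNELS = 4
BLADE_PEAK_GFLOPS = 145
BLADE_CORES = 16

# Assumptions not stated on the slide.
QPI_BYTES_PER_TRANSFER = 2           # 16 data bits per transfer, per direction
DDR3_BYTES_PER_TRANSFER = 8          # 64-bit data bus per channel
FLOPS_PER_CYCLE = 4                  # double precision per core per cycle

numalink5_aggregate = 2 * NUMALINK5_PER_DIRECTION_GB_S
qpi_aggregate = QPI_GT_S * QPI_BYTES_PER_TRANSFER * 2   # both directions
channel_gb_s = DDR3_MT_S * DDR3_BYTES_PER_TRANSFER / 1000
socket_read_gb_s = channel_gb_s * MEMORY_CHANNELS
implied_clock_ghz = BLADE_PEAK_GFLOPS / (BLADE_CORES * FLOPS_PER_CYCLE)

print(f"NUMAlink 5 aggregate: {numalink5_aggregate:.1f} GB/s")
print(f"QPI aggregate:        {qpi_aggregate:.1f} GB/s")
print(f"Memory read / socket: {socket_read_gb_s:.1f} GB/s")
print(f"Implied core clock:   {implied_clock_ghz:.2f} GHz")
```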

Page 21: Sgi Hpc Day Kiev 2009 10 Uv

UV Single Socket or Memory Expansion Blade

[Block diagram labels: one Nehalem-EX socket; QPI links; UV HUB with (4) NUMAlink 5; Boxboro IOH; Optional I/O Riser; (16) DDR3 RDIMMs & (4) Millbrook Memory Buffers per Single-Socket Memory Expansion Blade]

• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate

• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4GT/s)

• Millbrook Memory Buffers = 8.53GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket

• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket

Page 22: Sgi Hpc Day Kiev 2009 10 Uv


UV Nehalem-EX Node Board

UV_HUB

Socket

Socket

Mezzanine Connector (2) Quick Path links to I/O Riser

11.2-in W x 19.5-in L (1/2 Panel Board)

RLDRAM

Page 23: Sgi Hpc Day Kiev 2009 10 Uv


(4) Integrated IO Riser Options

BaseIO

Externalized IO

(2) PCIe Gen2 x16 Cable Connections to IO Expansion Chassis

Integrated PCIe Gen2

(1) x16 low-profile, (1) x8 low-profile

Node Blade or Memory Expansion Blade

(2) Hot plug 2.5” Boot HDD

Page 24: Sgi Hpc Day Kiev 2009 10 Uv

UV IO Expansion Chassis in Development

For Full-height and High-Power Card Support

One x16 PCIe G2.0 input connector

1U

Each unit supports up to 4 slots, either PCI-X or PCIe

Page 25: Sgi Hpc Day Kiev 2009 10 Uv


P0- Node, in Engr Test

Page 26: Sgi Hpc Day Kiev 2009 10 Uv


P0- BaseIO

Page 27: Sgi Hpc Day Kiev 2009 10 Uv


18U 24-in EIA Individual Rack Unit (IRU)

Front (Node Blade) View Rear (Router Blade) View

Page 28: Sgi Hpc Day Kiev 2009 10 Uv


Power, Cooling and Facilities

Page 29: Sgi Hpc Day Kiev 2009 10 Uv

SGI Altix ICE 8200 Water-Cooled Coils

Target Heat Rejection 95% water / 5% air

3/4” (1.91 cm) Coupling

(4) Individual Coils

Chilled-Water Supply 45°F to 60°F (7.2°C to 15.6°C)

14.4 gpm (3.3 m³/hr) Max.

Swivel Coupling to Supply Hose

Branch Feed to Individual Coil

Condensate Drain Pan

Page 30: Sgi Hpc Day Kiev 2009 10 Uv

UV Rack w/ Top-Feed Water-Cooled Coil

Target Heat Rejection 95% water / 5% air

Chilled-Water Supply 45°F to 65°F (7.2°C to 18.3°C)

16.0 gpm (3.6 m³/hr) Max.

1” (2.54 cm) Coupling

UV Enhancements:

- Reduce water-side pressure drop

- Increase allowable water supply temp to 65°F (18.3°C)

- Enable top-feed water

Page 31: Sgi Hpc Day Kiev 2009 10 Uv

80 Plus® Organization

Ultraviolet Power Supplies Planned to be Gold Certified

• Mission

– Unique forum that is uniting electric utilities, the computer industry and consumers in a groundbreaking effort to bring energy efficient power supplies to desktop computers and servers

• N+0 desktop power supply certification available today

– SGI worked with 80 Plus to draft N+1 server power supply specification

• http://www.80plus.org/

80 Plus          Bronze              Silver              Gold
CSCI             Year 1 (July-07)    Year 2 (July-08)    Year 3 (July-09)
20% PSU Load     81%                 85%                 88%
50% PSU Load     85%                 89%                 92%
100% PSU Load    81%                 85%                 88%
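As a rough illustration of what these efficiency levels mean at the wall, the sketch below converts a DC load into AC input power using the percentages as listed on the slide. The 1000 W load and the function name are hypothetical.

```python
# Efficiency by certification level and PSU load (%), as listed on the slide.
EFFICIENCY = {
    "Bronze": {20: 0.81, 50: 0.85, 100: 0.81},
    "Silver": {20: 0.85, 50: 0.89, 100: 0.85},
    "Gold":   {20: 0.88, 50: 0.92, 100: 0.88},
}


def wall_watts(dc_watts, level, load_percent):
    """AC input power needed to deliver dc_watts at the given efficiency."""
    return dc_watts / EFFICIENCY[level][load_percent]


DC_LOAD_W = 1000.0  # hypothetical DC load, not from the slide
bronze = wall_watts(DC_LOAD_W, "Bronze", 50)
gold = wall_watts(DC_LOAD_W, "Gold", 50)
saved = bronze - gold

print(f"Bronze: {bronze:.1f} W, Gold: {gold:.1f} W, saved: {saved:.1f} W")
```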

Page 32: Sgi Hpc Day Kiev 2009 10 Uv

Energy Efficiency: Rack Level

[Chart: Net (all-in) Rack Energy Efficiency Roadmap (axis 55% to 80%) for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Carlsbad, and Ultraviolet; data labels 60%, 65%, 70%, 75%, 78%; one point marked "stretch goal"]

(N.B. even higher efficiency if no water-coil)

Page 33: Sgi Hpc Day Kiev 2009 10 Uv

UV Rack Power

• 34.5kVA Rack Nameplate

– Used for facilities wire-sizing

• 33.3kW Power Model Roll-Up

– 130W TDP sockets, full memory, fans at altitude with water-coil impedance

• 30.0kW Estimate Running Linpack

– 90% of Power Model

– “Maximum Measured”

• 22.5kW Estimate Running Applications

– ~75% of Linpack Power

– Used for energy consumption planning (kWh)
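The chain of estimates on this slide can be reproduced directly, and the application figure extended to an annual energy number for planning. Continuous operation for 8,760 hours per year is an assumption added here, not a figure from the slide.

```python
POWER_MODEL_KW = 33.3                     # power model roll-up from the slide

linpack_kw = 0.90 * POWER_MODEL_KW        # "90% of Power Model"
application_kw = 0.75 * 30.0              # "~75% of Linpack Power", using the rounded 30.0 kW

HOURS_PER_YEAR = 24 * 365                 # assumption: the rack runs all year
annual_kwh = application_kw * HOURS_PER_YEAR

print(f"Linpack estimate:     {linpack_kw:.1f} kW")
print(f"Application estimate: {application_kw:.1f} kW")
print(f"Annual energy:        {annual_kwh:,.0f} kWh")
```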

Page 34: Sgi Hpc Day Kiev 2009 10 Uv

Projected UV Performance Advances

MPI and HPCC, Barriers Speedups with GRU

Excellent BW/latency Profile for Large Jobs

[Chart: longest-path MPI latency (axis 0 to 6) vs. destination CPU (0 to 2000) for Altix 4700, Altix ICE, and UV]

[Chart: second plot against destination CPU with series IB, NL4, and UV-NL 5]

[Chart: HPCC Benchmarks (ptrans, FFTE, Random Access), UV with GRU vs. UV no GRU]

[Chart: Single element MPI_Reduce, time for MPI_Reduce (us, 0 to 25) vs. number of threads (256 to 262,144), IB vs. UV; annotated "3X"]

[Chart: Barrier Latency <1usec (4096 thread), UV vs. Typical Cluster Systems]

[Chart: MPI Bandwidth vs. Message Size (Bytes)]

Source: Qlogic, Inc.

Page 35: Sgi Hpc Day Kiev 2009 10 Uv

MPI Latency

UV MPI Half Ping-Pong Latencies

[Chart: longest-path latency in ns (axis 0 to 2000) vs. cores (32 to 256K)]
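For readers unfamiliar with the metric, half ping-pong latency is the round-trip time of a small message divided by two, averaged over many exchanges. The sketch below shows only that definition, assuming two Python threads and in-process queues in place of MPI ranks; the numbers it produces say nothing about NUMAlink or UV.

```python
import queue
import threading
import time


def half_ping_pong_ns(iterations=1000):
    """Average one-way latency: total round-trip time / (2 * iterations)."""
    ping, pong = queue.Queue(), queue.Queue()

    def echo():
        # Responder: return every message it receives.
        for _ in range(iterations):
            pong.put(ping.get())

    responder = threading.Thread(target=echo)
    responder.start()

    start = time.perf_counter_ns()
    for i in range(iterations):
        ping.put(i)     # ping
        pong.get()      # wait for the pong
    elapsed = time.perf_counter_ns() - start

    responder.join()
    return elapsed / (2 * iterations)


print(f"half ping-pong latency: {half_ping_pong_ns():.0f} ns")
```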

Page 36: Sgi Hpc Day Kiev 2009 10 Uv


UV_HUB / Node Controller Technologies

Processor Interface

• Snoop Acceleration

• Large Number of In-Flight References

Globally Addressable Memory

• Large Shared Address Space

• Extremely Large Coherent Get/Put Space

• AMOs in Coherent Memory

• Coherence Directory

RAS

• Redundant Real-Time Clock

• Built-In Debug and Performance Monitors

• Internal/External Datapath Protection

• Alpha-immune Flip-Flops

Active Memory Unit

• Rich set of Atomic Operations

• AMO cache at memory home

• Multicast

• Message Queues in Coherent Memory

• Page Initialization

GRU (Global Reference Unit)

• High-BW, Low-Latency Socket Communication

• Update Cache for many AMOs

• Scatter/Gather Operations

• BCOPY Operations

• External TLB with Large Page Support

Page 37: Sgi Hpc Day Kiev 2009 10 Uv

© 2008 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, Altix, XFS, the SGI logo, NUMAflex and the Silicon Graphics cube are registered trademarks and NUMAlink, CXFS are trademarks of SGI in the U.S. and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds in several countries. Intel, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All other trademarks mentioned herein are the property of their respective owners.

Thank You!

Page 38: Sgi Hpc Day Kiev 2009 10 Uv

Paired UV Nehalem-EX Nodes

[Block diagram labels: two node boards, each with two Nehalem-EX sockets, QPI links, a UV HUB with (2) NUMAlink 5, a Boxboro IOH, and an Optional I/O Riser; (2) SGI NUMAlink 5 on Backplane; (8) DDR3 RDIMMs & (4) Millbrook Memory Buffers per socket]

• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate

• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4GT/s)

• Millbrook Memory Buffers = 8.53GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket

• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket

Page 39: Sgi Hpc Day Kiev 2009 10 Uv

SGI Flagship Platform Evolution

SGI’s Flagship Product Line has 4 Characteristics:

1. GAM

2. SSI

3. x/core, where x={I/O, Memory}

4. SWAP (and cooling)

UV - 3 things to know:

1. Xeons into the Flagship Product Line WITHOUT COMPROMISE

2. MOE (MPI Offload Engine)

3. Topology Options:

- Selectable Fat-tree sizes

- Vertices within a Torus

- Paired Node Routerless or Routed

- Constellations

Page 40: Sgi Hpc Day Kiev 2009 10 Uv

UV HUB/Node Controller Features

Extended Capability

• Enabling Enterprise-class scalability and reliability on x86-64

– Cache-coherence across nodes

– Fault resiliency – mirror thru block devices in memory – survive OS crash

– Extensive fault isolation, datapath protection, monitoring/debug functions

• Accelerating Large-scale workloads

– Fast Message-Passing (without cpu cache-line delays)

– Extends cpu capability for load requests

– System scale to 256+ sockets, 2048+ cores on standard Linux

• Accelerating Data-intensive applications

– Extended physical memory address to peta-scale (8PB)

– Extended “Super” TLB page size (1TB, map up to 4PB)

– Avoid TLB misses for large, random data references

– Very fast locking mechanism for highly contended data (no cache-line delay)

– Off-load add, compare, swap instructions

• HUB/Node controller directly exposed to user for easy utilization

– No system calls
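The benefit of the 1TB "Super" TLB page size is easiest to see by counting the translations needed to cover the 4PB mapping quoted above. The sketch assumes binary units and compares against 4 KiB and 2 MiB pages, which are common x86-64 page sizes chosen here for contrast; they do not appear on the slide.

```python
KIB = 2**10
MIB = 2**20
TIB = 2**40
PIB = 2**50


def pages_needed(region_bytes, page_bytes):
    """Number of pages (translations) needed to map the whole region."""
    return -(-region_bytes // page_bytes)   # ceiling division


REGION = 4 * PIB   # "map up to 4PB"

super_pages = pages_needed(REGION, TIB)
large_pages = pages_needed(REGION, 2 * MIB)
small_pages = pages_needed(REGION, 4 * KIB)

print(f"1 TiB pages: {super_pages:,}")
print(f"2 MiB pages: {large_pages:,}")
print(f"4 KiB pages: {small_pages:,}")
```

With 1 TiB pages the whole region needs only a few thousand translations, which is why large, random references can avoid TLB misses.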

Page 41: Sgi Hpc Day Kiev 2009 10 Uv

System Management

• UV maintains the hierarchical system management approach.

– Origin/Altix: L1/L2/L3

– ICE/UV: BMC, CMC, Leader Node/SMN

– Command line interface at L2 & CMC very similar

• Unified approach to system management wrapped into SGI Cluster Manager

• SNMP used extensively across product lines including UV

– Hardware inventories & sensor values stored in MIB format

– SNMP data coalesced at SMN, available via SGI provided RAS software or through SNMP queries by 3rd party or customer developed apps