SGI HPC Day Kiev 2009 10 UV
Project Ultraviolet Overview
2Company Confidential
Clusters vs. Shared Memory Architecture
Clusters (small-node x86 clusters on a commodity interconnect)
• Each system has its own memory and OS
• Batch, not interactive, user interface
• Coding required for parallel code execution
• Great for capacity workflows
• SGI® Altix XE x86-64 clusters, Rackable BTO
Shared memory (SGI® Altix™ 4000 family and UV on the SGI® NUMAflex™ interconnect)
• All nodes operate on one large shared memory space
• Cache coherency
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High performance, low cost, easy to deploy
[Diagram: cluster nodes, each with its own memory and OS (mem, system+OS), joined by a commodity interconnect, versus systems sharing one global shared memory under a single OS over the NUMAflex interconnect.]
InfiniBand vs. NUMAlink™ Interconnect
Interconnect Type              Bandwidth (each direction)
InfiniBand 4x DDR              2.0 GBytes/s
InfiniBand 4x QDR              4.0 GBytes/s
NUMAlink 4 (Altix 4700/450)    3.2 GBytes/s
NUMAlink 5 (UV)                7.5 GBytes/s
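To put the per-direction figures in perspective, the short Python sketch below estimates how long a fixed payload would take to move at each peak rate. The 1 TB payload and the function name are illustrative assumptions, and the result ignores latency and protocol overhead, so it is a lower bound rather than a measurement.

```python
# Peak per-direction bandwidth in GBytes/s, taken from the comparison above.
BANDWIDTH_GB_PER_S = {
    "InfiniBand 4x DDR": 2.0,
    "InfiniBand 4x QDR": 4.0,
    "NUMAlink 4 (Altix 4700/450)": 3.2,
    "NUMAlink 5 (UV)": 7.5,
}

PAYLOAD_GB = 1000.0  # hypothetical 1 TB payload (decimal GB), chosen for illustration


def transfer_seconds(payload_gb, bandwidth_gb_per_s):
    """Ideal one-way transfer time: payload divided by peak bandwidth."""
    return payload_gb / bandwidth_gb_per_s


for name, bandwidth in BANDWIDTH_GB_PER_S.items():
    print(f"{name:30s} {transfer_seconds(PAYLOAD_GB, bandwidth):7.1f} s")
```

Under these assumptions the NUMAlink 5 figure is 3.75 times the InfiniBand 4x DDR figure, so the ideal transfer time shrinks by the same factor.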
Independent Scaling
CPU
Memory
I/O
SGI® Modularity Evolution
• Modules: Origin 2000
• Bricks: Origin 3000, Altix
• SGI Blades: Altix 4700/450
[Timeline: 1997 to 2006, from modules to bricks to blades]
SGI Scalable ccNUMA Architecture
Basic Node Structure and Interconnect
[Diagram: two nodes, each with two CPUs and their caches, an interface chip, and physical memory, joined by the NUMAlink interconnect to form shared memory.]
SGI Scalable ccNUMA Architecture
Scaling to Large Node Counts
[Diagram: four nodes, each with (local) physical memory, two CPUs with caches, and an interface chip, connected through NUMAlink and routers.]
• Shared memory within an SSI: OpenMP
• Globally Addressable Memory (GAM) within a NUMAlinked system: MPI
Applications on Altix 3000: Communication vs. Computation
[Bar chart: share of run time (0% to 100%) spent in computation versus communication for each application, labeled application/processor count and grouped under the segment labels CSM, CCM, CWO, RES, CFD, BIO, SPI: Nastran/4, Pam-Crash/32, Ls-Dyna/48p, Radioss/96, Fluent/64, StarHPC/32, Fire/32, Gamess/32, Amber/8, CASTEP/128, ADF/32, HOMME/1944, MM5/96, HIRLAM/128, CCM3/64, IFS/120, GeoDepth, Eclipse/52, VIP/32.]
Why MOE (MPI Offload Engine)?
SGI® Project Ultraviolet – Overview
Extraordinary Capability in an x86 Architecture
• Performance and Productivity for Demanding Workloads
– Highly data-efficient: up to many terabytes of data in memory
– Scales to 2048 cores and 16TB in a single x86 system
– Scales IO to >1TB/s
• Advanced Reliability
– Hardware-enabled fault detection, prevention, containment
– Enhanced monitoring and serviceability
• Low TCO
– x86-64 and Linux economics
– Industry-leading rack-level energy efficiency
– Easiest system to administer and productively use
UV Architectural Scalability
• 16,384 Nodes (scaling supported by NUMAlink5 node ID)
– 16,384 UV_HUBs
– 32,768 Sockets / 262,144 Cores (with 8 cores per socket)
– >2 pflop
• Coherent shared memory
– Xeon: 16TB (44 bits socket PA)
• 8PB coherent get/put memory (53 bits PA w/GRU)
• 16 DIMMs per node (2 DIMMs per channel)
• Intel coherence scheme within node
• SGI coherence scheme between nodes
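The scaling limits on this slide follow from simple arithmetic, which the Python sketch below reproduces. The two-sockets-per-node value is an assumption inferred from the slide's own ratio of sockets to nodes; the other inputs are the slide's figures, with TB and PB read as binary units.

```python
NODES = 16_384           # maximum node count from the slide
SOCKETS_PER_NODE = 2     # assumption: implied by 32,768 sockets / 16,384 nodes
CORES_PER_SOCKET = 8     # from the slide

sockets = NODES * SOCKETS_PER_NODE
cores = sockets * CORES_PER_SOCKET

TIB = 2 ** 40
PIB = 2 ** 50
coherent_bytes = 2 ** 44  # 44-bit socket physical address
get_put_bytes = 2 ** 53   # 53-bit physical address with the GRU

print(f"{sockets:,} sockets, {cores:,} cores")
print(f"coherent shared memory: {coherent_bytes // TIB} TB")
print(f"coherent get/put space: {get_put_bytes // PIB} PB")
```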
UV Accelerated Performance
For Distributed or Shared Memory Programming
MPI Offload Engine (MOE) frees CPU from MPI activity
- MPI reductions 2-3X faster than competitive clusters/MPPs
- Barriers up to 80X+ faster
NUMAlink Advances
- Industry’s most efficient interconnect
Massively Memory-mapped I/O
- Big speedup for I/O-bound apps
Hold massive datasets in memory
- To 16TB per OS system image, to petascale across systems
UV Accelerated Performance
For Distributed or Shared Memory Programming
MPI Offload Engine (MOE) frees CPU from MPI activity
- MPI reductions 2-3X faster than competitive clusters/MPPs
- Barriers up to 80X+ faster
NUMAlink Advances
- 2-3X MPI latency improvement
Massively Memory-mapped I/O
- Big speedup for I/O-bound apps
Hold massive datasets in memory
- To 16TB per OS image, to petascale across systems
- Up to 10X+ speedup for data-intensive applications
[Chart: longest-path MPI latency (axis 0 to 6) versus destination CPU (0 to 2000) for Altix 4700, Altix ICE, and UV.]
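The slides do not describe how the MOE works internally. As background on the kind of work being offloaded, the sketch below shows a generic pairwise (tree) reduction in plain Python: every participant contributes one value, and the values are combined in roughly log2(N) rounds. It is an illustration of the collective operation only, not SGI's implementation, and the function name is hypothetical.

```python
def tree_reduce(values, op):
    """Combine values pairwise; return (result, number of rounds)."""
    current = list(values)
    rounds = 0
    while len(current) > 1:
        # Combine neighbours (0,1), (2,3), ... in one round.
        combined = [op(current[i], current[i + 1])
                    for i in range(0, len(current) - 1, 2)]
        if len(current) % 2:
            combined.append(current[-1])  # odd element carries over unchanged
        current = combined
        rounds += 1
    return current[0], rounds


# One value per hypothetical rank, summed as a sum-reduction would do.
total, rounds = tree_reduce(range(4096), lambda a, b: a + b)
print(total, rounds)
```

With 4096 participants the software version needs 12 combining rounds, each of which costs CPU time and a message exchange on a conventional cluster; that per-round cost is what an offload engine aims to take off the processors.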
UV Low TCO
Economical to own and operate
• Excellent Price/performance
– x86 economics plus UV performance advantages
– 3-5X compared to today’s Altix
– Can take the place of multiple systems
• Leading Rack-level Power Efficiency
– UV stretch goal = 80%
• Most Economical System
– to administer and use
[Chart: Delivered Rack-Level Power Efficiency (axis 55% to 80%) for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Carlsbad, and Ultraviolet (UV); data labels 60%, 65%, 70%, 75%, 78%.]
Project Ultraviolet Product Design
• Bladed Node Package
– Memory or compute-dense blades
• Variety of IO expansion options
• Mix/match resources
• Expand or reconfigure when needed
• Industry-leading Scalability
• Run standard Linux Distros
– Red Hat, SLES
IRU (Chassis) Packaging and Topology
[Diagram: 24” IRU topology in a 24” EIA rack, with N+1 power supplies and 1+1 power supplies for blowers.]
• Compute node with IO expansion capability
• Paired nodes (dual NUMAlink 5 cross-linked)
• (8) NUMAlink 5 ports per router cabled to network
• (8) NUMAlink 5 fan-in ports per router
• 16-blade IRU (18U) for 24” rack
• 2-blade IRU (3U) for 19” rack
Ultraviolet Rack
• Blade-based packaging
• Air-Cooled electronics
• N+1 12VDC Power Supplies
• N+1 Axial Fans
• (2) 60A 200VAC-240VAC 3-Φ IEC 60309 plugs provide 17.3 kVA each
• Rack Nameplate 34.5 kVA max
• Optional water-cooling
– Leverages SGI® Altix® ICE 8200
• (64) Intel® Xeon® Sockets
• (512) Intel Xeon Cores
• (512) DDR3 RDIMMs
• 128GB / node (w/ 8GB DIMMs)
• 4TB / rack (w/ 8GB DIMMs)
• Integrated BaseIO & Boot HDDs
• Integrated or External IO Expansion
• SGI® NUMAlink™ 5 network
• (1) System Management Node per up to 4 racks
• IO Expansion for higher power or larger form factor cards
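The per-rack figures above are mutually consistent, as the following Python check suggests. The two-sockets-per-node value is an assumption (it matches the two-socket compute blade shown elsewhere in the deck); the remaining inputs are taken from the slide.

```python
SOCKETS_PER_RACK = 64
CORES_PER_RACK = 512
DIMMS_PER_RACK = 512
DIMM_GB = 8
SOCKETS_PER_NODE = 2   # assumption: two-socket compute blades
PLUG_KVA = 17.3        # each of the two 60A three-phase plugs

cores_per_socket = CORES_PER_RACK // SOCKETS_PER_RACK
nodes_per_rack = SOCKETS_PER_RACK // SOCKETS_PER_NODE
rack_memory_gb = DIMMS_PER_RACK * DIMM_GB
node_memory_gb = rack_memory_gb // nodes_per_rack
supply_kva = 2 * PLUG_KVA

print(f"{cores_per_socket} cores/socket, {nodes_per_rack} nodes/rack")
print(f"{node_memory_gb} GB/node, {rack_memory_gb // 1024} TB/rack")
print(f"{supply_kva:.1f} kVA available from the two plugs")
```

The two plugs together supply about 34.6 kVA, which is in line with the 34.5 kVA rack nameplate maximum.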
UV System Packaging Options
High Performance (42U, 24-inch rack)
• 64 skts, 512c per rack
• 4TB memory (8GB DIMM)
• Up to 4.65 tflop
• Fat Tree, 7.5GB/s/skt bisection
• NL scalable to 16K sockets
• Up to 2048-core SSI supported
Price-performance (42U, 24-inch rack, routerless)
• 64 skts, 512c
• 4TB memory (8GB DIMM)
• Up to 4.65 tflop
• 2D Torus, 1.25GB/s/skt bisection
• Can be clustered with IB, Gig-E
Midrange Capability (19-inch rack, 2-blade 24/32-core chassis)
• Short rack (20U, 19-inch): 24 skts, 192c, 3TB memory (8GB DIMM), up to 1.8 tflop per short rack
• 40U, 19-inch rack: up to 50 skts, 400c, 3TB memory (8GB DIMM), up to 3.5 tflop per rack
[Diagram labels: 16-blade chassis, IO Expansion, Quad Router, Admin Node, Storage]
Capability Comparisons
UV-Midrange Offers More Headroom
[Chart: system scale (sockets), max memory (TB), and max IO (PCIe slots/system) compared for UV-Midrange, scalable x86 (IBM, Bull, Unisys), 8S glueless, and IBM P6 570/575 and HP Integrity; axes SSI (sockets), max memory (TB), max IO (slots); values shown: 96, 6, 64+.]
UV Nehalem-EX Node Board - Compute Blade
[Block diagram: two Nehalem-EX sockets linked by QPI to each other, to the UV HUB, and to a Boxboro IOH with optional I/O riser; (8) DDR3 RDIMMs and (4) Millbrook memory buffers per socket; UV HUB with (4) NUMAlink 5 ports, (2) directory FB-DIMMs, and RLDRAM (snoop acceleration).]
• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate
• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4GT/s)
• Directory FBD1 = 6.4GB/s Read + 3.2GB/s Write (800MHz DIMMs)
• Millbrook Memory Buffers = 8.53GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket
• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket
Each blade: 8-16 Xeon cores, up to 145 gflop, up to 128GB
Single-socket memory/IO expansion blade also available
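Several of the bandwidth figures on this slide can be rebuilt from transfer rates and bus widths, as sketched below in Python. The bus widths are assumptions based on the usual definitions (64-bit DDR3 and FB-DIMM data paths, 2 bytes per transfer per direction on QPI, FB-DIMM writes at half the read rate); the slide itself states only the resulting numbers.

```python
# DDR3-1067: about 1066.67 mega-transfers/s on an assumed 8-byte channel.
ddr3_channel_gb_s = 1066.67e6 * 8 / 1e9        # about 8.53 GB/s
read_per_socket_gb_s = ddr3_channel_gb_s * 4   # four channels, about 34.1 GB/s

# QPI at 6.4 GT/s, assumed 2 bytes per transfer, counted in both directions.
qpi_aggregate_gb_s = 6.4 * 2 * 2               # 25.6 GB/s

# NUMAlink 5: 7.5 GB/s each direction, counted in both directions.
numalink5_aggregate_gb_s = 7.5 * 2             # 15.0 GB/s

# Directory FB-DIMM at 800 MT/s: assumed 8-byte reads, writes at half that rate.
fbd_read_gb_s = 800e6 * 8 / 1e9                # 6.4 GB/s
fbd_write_gb_s = fbd_read_gb_s / 2             # 3.2 GB/s

print(round(ddr3_channel_gb_s, 2), round(read_per_socket_gb_s, 1))
print(qpi_aggregate_gb_s, numalink5_aggregate_gb_s)
print(fbd_read_gb_s, fbd_write_gb_s)
```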
UV Single Socket or Memory Expansion Blade
[Block diagram: single Nehalem-EX socket with (16) DDR3 RDIMMs and (4) Millbrook memory buffers, linked by QPI to the UV HUB with (4) NUMAlink 5 ports and to a Boxboro IOH with optional I/O riser; also shown as a memory expansion blade.]
• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate
• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4GT/s)
• Millbrook Memory Buffers = 8.53GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket
• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket
UV Nehalem-EX Node Board
[Board photo labels: UV_HUB, two sockets, RLDRAM]
• Mezzanine connector: (2) Quick Path links to I/O riser
• 11.2-in W x 19.5-in L (1/2 panel board)
(4) Integrated IO Riser Options
BaseIO
Externalized IO
(2) PCIe Gen2 x16 Cable Connections to IO Expansion Chassis
Integrated PCIe Gen2
(1) x16 low-profile, (1) x8 low-profile
Node Blade or Memory Expansion Blade
(2) Hot plug 2.5” Boot HDD
UV IO Expansion Chassis in Development
For Full-height and High-Power Card Support
• One x16 PCIe G2.0 input connector
• 1U
• Each unit supports up to 4 slots, either PCI-X or PCIe
P0: Node, in Engineering Test
P0: BaseIO
18U 24-in EIA Individual Rack Unit (IRU)
Front (Node Blade) View Rear (Router Blade) View
Power, Cooling and Facilities
SGI Altix ICE 8200 Water-Cooled Coils
Target Heat Rejection 95% water / 5% air
3/4” (1.91 cm) Coupling
(4) Individual Coils
Chilled-Water Supply 45°F to 60°F (7.2°C to 15.6°C)
14.4 gpm (3.3 m³/hr) Max.
Swivel Coupling to Supply Hose
Branch Feed to Individual Coil
Condensate Drain Pan
UV Rack w/ Top-Feed Water-Cooled Coil
Target Heat Rejection 95% water / 5% air
Chilled-Water Supply 45°F to 65°F (7.2°C to 18.3°C)
16.0 gpm (3.6 m³/hr) Max.
1” (2.54 cm) Coupling
UV Enhancements:
- Reduce water-side pressure drop
- Increase allowable water supply temp to 65°F (18.3°C)
- Enable top-feed water
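The unit conversions on the two water-cooling slides, and the water temperature rise they imply, can be checked with a short Python sketch. The water properties are standard approximate values; the 33.3 kW rack load is an assumption borrowed from the rack power model quoted later in the deck, and the helper names are hypothetical.

```python
GALLON_L = 3.785411784   # litres per US gallon
WATER_CP = 4186.0        # J/(kg*K), approximate specific heat of liquid water
WATER_DENSITY = 1000.0   # kg/m^3, approximate


def f_to_c(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def gpm_to_m3_per_hr(gpm):
    """Convert US gallons per minute to cubic metres per hour."""
    return gpm * GALLON_L * 60.0 / 1000.0


def water_delta_t(heat_kw, flow_gpm):
    """Temperature rise (K) of the water for a given heat load and flow."""
    mass_flow_kg_s = gpm_to_m3_per_hr(flow_gpm) * WATER_DENSITY / 3600.0
    return heat_kw * 1000.0 / (mass_flow_kg_s * WATER_CP)


heat_to_water_kw = 0.95 * 33.3  # 95% of an assumed 33.3 kW rack load
print(round(f_to_c(45), 1), round(f_to_c(65), 1))
print(round(gpm_to_m3_per_hr(14.4), 1), round(gpm_to_m3_per_hr(16.0), 1))
print(round(water_delta_t(heat_to_water_kw, 16.0), 1), "K rise at 16.0 gpm")
```

Under these assumptions the water warms by roughly 7 to 8 K across the coil at the maximum flow rate.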
80 Plus® Organization
Ultraviolet Power Supplies Planned to be Gold Certified
• Mission
– Unique forum uniting electric utilities, the computer industry, and consumers in a groundbreaking effort to bring energy-efficient power supplies to desktop computers and servers
• N+0 desktop power supply certification available today
– SGI worked with 80 Plus to draft N+1 server power supply specification
• http://www.80plus.org/
80 Plus level     Bronze     Silver     Gold
CSCI              Year 1     Year 2     Year 3
                  July-07    July-08    July-09
20% PSU Load      81%        85%        88%
50% PSU Load      85%        89%        92%
100% PSU Load     81%        85%        88%
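To show what the certification levels mean in practice, the Python sketch below converts the efficiencies in the table into wall power for a given DC output. The 1000 W output is a hypothetical load chosen for illustration, and the function name is an assumption.

```python
# Efficiency by certification level and PSU load (percent), from the table.
EFFICIENCY = {
    "Bronze": {20: 0.81, 50: 0.85, 100: 0.81},
    "Silver": {20: 0.85, 50: 0.89, 100: 0.85},
    "Gold":   {20: 0.88, 50: 0.92, 100: 0.88},
}


def input_watts(output_watts, level, load_pct):
    """AC input power needed to deliver output_watts of DC power."""
    return output_watts / EFFICIENCY[level][load_pct]


OUTPUT_W = 1000.0  # hypothetical DC load at 50% of PSU rating
bronze_w = input_watts(OUTPUT_W, "Bronze", 50)
gold_w = input_watts(OUTPUT_W, "Gold", 50)
print(f"Bronze: {bronze_w:.0f} W, Gold: {gold_w:.0f} W, "
      f"saving {bronze_w - gold_w:.0f} W")
```

For this load a Gold supply draws roughly 90 W less from the wall than a Bronze one.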
Energy Efficiency: Rack Level
[Chart: Net (all-in) Rack Energy Efficiency Roadmap (axis 55% to 80%) for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Carlsbad, and Ultraviolet; data labels 60%, 65%, 70%, 75%, 78%, with a stretch goal marked. N.B. even higher efficiency if no water-coil.]
UV Rack Power
• 34.5kVA Rack Nameplate
– Used for facilities wire-sizing
• 33.3kW Power Model Roll-Up
– 130W TDP sockets, full memory, fans at altitude with water-coil impedance
• 30.0kW Estimate Running Linpack
– 90% of Power Model
– “Maximum Measured”
• 22.5kW Estimate Running Applications
– ~75% of Linpack Power
– Used for energy consumption planning (kWh)
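The chain of estimates above can be reproduced, and extended to an annual energy figure, with the Python sketch below. The percentages are the slide's; continuous operation for all 8760 hours of a year is an assumption made only for this example.

```python
POWER_MODEL_KW = 33.3                   # power model roll-up from the slide

linpack_kw = round(0.90 * POWER_MODEL_KW, 1)   # 90% of the power model
application_kw = round(0.75 * linpack_kw, 1)   # ~75% of Linpack power

HOURS_PER_YEAR = 24 * 365               # assumption: rack runs all year
annual_kwh = application_kw * HOURS_PER_YEAR

print(f"Linpack estimate:     {linpack_kw} kW")
print(f"Application estimate: {application_kw} kW")
print(f"Annual energy:        {annual_kwh:,.0f} kWh")
```

Under that assumption one rack running applications would consume about 197,100 kWh per year.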
Projected UV Performance Advances
MPI and HPCC, Barriers: Speedups with GRU
Excellent BW/Latency Profile for Large Jobs
Source: Qlogic, Inc.
[Charts: longest-path MPI latency (axis 0 to 6) versus destination CPU (0 to 2000), with series Altix 4700, Altix ICE, and UV, and series IB, NL4, and UV-NL 5; MPI bandwidth versus message size (bytes).]
[Chart: HPCC benchmarks (ptrans, FFTE, RandomAccess), UV with GRU versus UV without GRU.]
[Chart: single-element MPI_Reduce, time for MPI_Reduce (us, 0 to 25) versus number of threads (256 to 262,144) for IB and UV; annotated “3X”.]
[Chart: barrier latency, UV versus typical cluster systems; UV barrier latency <1 usec (4096 threads).]
MPI Latency
[Chart: UV MPI half ping-pong latencies, longest path; latency (ns, 0 to 2000) versus cores (32 to 256K).]
UV_HUB / Node Controller Technologies
Processor Interface
• Snoop Acceleration
• Large Number of In-Flight References
Globally Addressable Memory
• Large Shared Address Space
• Extremely Large Coherent Get/Put Space
• AMOs in Coherent Memory
• Coherence Directory
RAS
• Redundant Real-Time Clock
• Built-In Debug and Performance Monitors
• Internal/External Datapath Protection
• Alpha-immune Flip-Flops
Active Memory Unit
• Rich set of Atomic Operations
• AMO cache at memory home
• Multicast
• Message Queues in Coherent Memory
• Page Initialization
GRU (Global Reference Unit)
• High-BW, Low-Latency Socket Communication
• Update Cache for many AMOs
• Scatter/Gather Operations
• BCOPY Operations
• External TLB with Large Page Support
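For readers unfamiliar with atomic memory operations (AMOs), the sketch below emulates two common ones, fetch-and-add and compare-and-swap, in plain Python. It illustrates the semantics only: the class and method names are hypothetical, and the lock used here is exactly the kind of software overhead that a hardware AMO unit at the memory home is meant to avoid.

```python
import threading


class AtomicWord:
    """Software stand-in for one memory word updated by atomic operations."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        """Add delta and return the value held before the addition."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def compare_and_swap(self, expected, new):
        """Store new only if the word still holds expected; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    @property
    def value(self):
        with self._lock:
            return self._value


def run(threads=8, increments=1000):
    """Have several threads increment one shared word; return the final value."""
    word = AtomicWord()

    def worker():
        for _ in range(increments):
            word.fetch_add(1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return word.value


print(run())
```

Because every update is atomic, no increment is lost however the threads interleave.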
© 2008 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, Altix, XFS, the SGI logo, NUMAflex and the Silicon Graphics cube are registered trademarks and NUMAlink, CXFS are trademarks of SGI in the U.S. and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds in several countries. Intel, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All other trademarks mentioned herein are the property of their respective owners.
Thank You!
Paired UV Nehalem-EX Nodes
[Block diagram: two node boards, each with two Nehalem-EX sockets, a UV HUB with (2) NUMAlink 5 ports, and a Boxboro IOH with optional I/O riser, all linked by QPI; the pair is joined by (2) SGI NUMAlink 5 links on the backplane; (8) DDR3 RDIMMs and (4) Millbrook memory buffers per socket.]
• SGI® NUMAlink™ 5 = 15.0 GB/s aggregate
• Intel® Quick Path Interconnect (QPI) = 25.6 GB/s aggregate (6.4GT/s)
• Millbrook Memory Buffers = 8.53GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s Read / Socket
• Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket
SGI Flagship Platform Evolution
SGI’s Flagship Product Line has 4 Characteristics:
1. GAM
2. SSI
3. x/core, where x={I/O, Memory}
4. SWAP (and cooling)
UV - 3 things to know:
1. Xeons into the Flagship Product Line WITHOUT COMPROMISE
2. MOE (MPI Offload Engine)
3. Topology Options:
- Selectable Fat-tree sizes
- Vertices within a Torus
- Paired Node Routerless or Routed
- Constellations
UV HUB/Node Controller Features
Extended Capability
• Enabling Enterprise-class scalability and reliability on x86-64
– Cache-coherence across nodes
– Fault resiliency: mirror through block devices in memory, survive OS crash
– Extensive fault isolation, datapath protection, monitoring/debug functions
• Accelerating Large-scale workloads
– Fast Message-Passing (without cpu cache-line delays)
– Extends cpu capability for load requests
– System scale to 256+ sockets, 2048+ cores on standard Linux
• Accelerating Data-intensive applications
– Extended physical memory address to peta-scale (8PB)
– Extended “Super” TLB page size (1TB, map up to 4PB): avoid TLB misses for large, random data references
– Very fast locking mechanism for highly contended data (no cache-line delay)
– Off-load add, compare, swap instructions
• HUB/Node controller directly exposed to user for easy utilization
– No system calls
System Management
• UV maintains the hierarchical system management approach.
– Origin/Altix: L1/L2/L3
– ICE/UV: BMC, CMC, Leader Node/SMN
– Command line interface at L2 & CMC very similar
• Unified approach to system management wrapped into SGI Cluster Manager
• SNMP used extensively across product lines including UV
– Hardware inventories & sensor values stored in MIB format
– SNMP data coalesced at SMN, available via SGI-provided RAS software or through SNMP queries by 3rd-party or customer-developed apps