Code Migration to CPU-GPU Hybrid: An Economic Approach
Patrick Van Reeth
Lugano, March 22, 2011
BUILD your competitive advantage with HP
Current Computing Challenges
Insatiable demand for more flops and bigger data sets drives today's constraints:
• Power
• Cooling
• Density
• Price/performance
GPUs: a supplement to multi-core CPUs, with more silicon devoted to computation
Migrate a CPU application to a CPU + GPU Hybrid Platform
Application Speed-up: Key Factors
The tool box:
• Multi-core CPUs
• Memory bandwidth / latency
• Interconnect
• GPUs
• SW environments: C & Fortran compilers, CUDA, HMPP, OpenCL, Allinea, TotalView…
Understand the existing code (a timing sketch follows this list):
• Track the most computationally expensive areas (inner loops…)
• Probe the load balancing across servers, CPUs & cores
• Examine the data set splits
• Probe the communications
Be agnostic first !!!
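Before touching any GPU code, it helps to measure. A minimal, hypothetical timing harness (the loop body and sizes below are placeholders, not from the study) shows the shape of the exercise; a profiler such as gprof gives the same per-function picture:

#include <stdio.h>
#include <sys/time.h>

/* wall-clock seconds, good enough to rank candidate hotspots */
static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    enum { N = 1 << 22 };
    static double a[N];                 /* placeholder data set */
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double t0 = now(), sum = 0.0;
    for (int i = 0; i < N; i++)         /* candidate inner loop */
        sum += a[i] * 1.000001;
    double t1 = now();

    printf("inner loop: %.4f s (sum=%g)\n", t1 - t0, sum);
    return 0;
}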
GPU Success Factors
TRACK THE MINES !!!
• Redesign the code architecture to scale at the right levels: node, CPU, core, GPU
• With a GPU:
  • Embarrassingly parallel is good!
  • Minimize memory access versus calculation: too few calculations per memory read/write is bad
  • Use the memory hierarchy! Use cache when possible
  • Thread/core mapping is very important
  • Minimize CPU-GPU transfers
  • Partition the data sets to stay within the on-board memory boundary
  • Data set split: shift along the correct X, Y, Z axis!
  • Hide communications (see the sketch after this list)
  • Balance CPU (e.g. summations, fewer flops…) and GPU (e.g. convolutions, transpositions…) execution time appropriately
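As a concrete illustration of "minimize CPU-GPU transfers" and "hide communications", here is a minimal CUDA sketch (the kernel, sizes and chunking are illustrative assumptions, not code from the study): pinned host memory enables asynchronous copies, and two streams let one chunk's transfers overlap another chunk's compute, while chunking keeps the working set inside the on-board memory boundary.

#include <cuda_runtime.h>
#include <stdio.h>

/* trivial placeholder kernel; a real code would do far more work per byte */
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 22, CHUNK = 1 << 20;        /* partition to fit GPU memory */
    float *h, *d;
    cudaStream_t s[2];
    cudaMallocHost((void**)&h, N * sizeof(float)); /* pinned => async copies */
    cudaMalloc((void**)&d, N * sizeof(float));
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    for (int off = 0, k = 0; off < N; off += CHUNK, k ^= 1) {
        /* while stream k moves this chunk, the other stream is computing */
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<CHUNK / 256, 256, 0, s[k]>>>(d + off, CHUNK);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f (expect 2.0)\n", h[0]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}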
Determine the Right Platform
• The CPU/GPU execution-time ratio will determine the server type, the # of cores and the # of GPUs per node (see the sketch after this list)
• PCI-E communication between CPU & GPU will determine whether a link can be shared by multiple GPUs
• Single or double precision will determine the GPU type
• Data set size will determine the node type
• $$$
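A back-of-the-envelope way to read the first bullet (all numbers below are assumed, not measurements from the study): once the CPU part and the single-GPU part of an iteration have been timed, the GPU count that balances the two sides falls out directly.

#include <stdio.h>

int main(void) {
    double cpu_s = 20.0;    /* assumed CPU-side time per iteration */
    double gpu_s = 90.0;    /* assumed GPU-side time on ONE GPU */
    /* with the two sides overlapped, iteration time = max(cpu, gpu/g);
       stop adding GPUs once the CPU side dominates */
    for (int g = 1; g <= 8; g++) {
        double gpu_part = gpu_s / g;
        double iter = cpu_s > gpu_part ? cpu_s : gpu_part;
        printf("%d GPU(s)/node -> %.1f s per iteration\n", g, iter);
    }
    return 0;
}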
GPU-enabled servers

System                       Max Mem        Enabled devices
DL160(se)G6                  192GB/288GB    FX3800/Q4000; ext. Quadroplex & Tesla
DL380G7                      192GB          2x FX3800 or Q4000; ext. Quadroplex & Tesla
DL370G6                      144GB          3x [FX5800, Q4000, Q5000, Q6000, C1060, C2050, C2070]; ext. Quadroplex & Tesla
DL2000G6 (2x DL170hG6)       2x 192GB       2x [FX5800, Q4000, Q5000, Q6000]; 2x [C2050, C2070]; ext. Quadroplex & Tesla
DL580G7 / DL585G7 / DL980G7  1TB/512GB/2TB  3x [FX5800, Q4000, Q5000, Q6000]; 3x [C1060, C2050, C2070]; ext. Quadroplex & Tesla
SL390G7                      192GB          2U SL390: 3x [M2050, M2070, M2070Q]; 4U SL390: 8x [M2050, M2070, M2070Q]
WS460cG6                     192GB          1 or 2x [FX880M, FX2800M, FX3600M]; 2x [FX3800, FX4800, FX5800, Q5000, Q6000, C1060]
ECONOMIC APPROACH
ON 2 APPLICATIONS
Joint study with CAPS entreprise (www.caps-entreprise.com), providing worldwide solutions for application deployment on manycore systems: tools, porting, methodology, services.

Services
• Assessment of GPU speed-up
• Code porting to hybrid architectures
• CPU & GPU code optimization
• Training in parallelism & hybrid computing (OpenMP, MPI, CUDA, OpenCL, HMPP, porting methodology)

Tools
• HMPP Workbench source-to-source hybrid compiler
• GPU migration tools suite
• CAPEX (capital expenses):
  • System acquisition costs:
    • All systems included: servers, GPUs, interconnect
    • Systems amortized over a 3-year period
    • (SW license price not included)
  • Migration costs:
    • Expressed in man-months
    • Migration effort done once for 10 years
• OPEX (operational expenses):
  • Energy costs (system consumption + cooling)
  • Maintenance costs (system maintenance, support)
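One plausible formalization of this accounting (our reading; the deck itself only states the ingredients), with t the execution time of one run, P the measured system power, C_sys the acquisition cost and C_mig the migration cost:

  Cost per run = (t / T_3y) · C_sys + (t / T_10y) · C_mig            [CAPEX]
               + 1.5 · P · t · 0.07 €/kWh + (t / T_3y) · 0.10 · C_sys  [OPEX]

where T_3y and T_10y are the 3-year and 10-year amortization periods, the factor 1.5 adds the 50% cooling overhead stated in the conditions below, and maintenance is taken as 10% of computer costs.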
Conditions
– 4-node system (4x SL390)
  • CPU: 2x Intel Xeon X5670, 6 cores at 2.93 GHz
  • (CPU option: 2x Intel Xeon E5506, 4 cores at 2.13 GHz)
  • Main memory: 48 GB per node
  • GPUs: up to 3 Tesla C2050 per node
  • IB on board
– Base energy power
  • System power measured with an ammeter during code execution, rated at 0.07€/kWh
  • Cooling costs = 50% of server consumption
  • (Interconnect is neglected)
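Putting the conditions into numbers: a small C sketch of the per-run cost model (all euro amounts and the power draw below are placeholders, not the measured values behind the tables that follow):

#include <stdio.h>

int main(void) {
    double exec_s   = 1744.0;    /* run time in seconds (placeholder) */
    double sys_cost = 40000.0;   /* servers + GPUs + interconnect, EUR (assumed) */
    double mig_cost = 8000.0;    /* 1 man-month of porting, EUR (assumed rate) */
    double power_kw = 2.0;       /* measured with an ammeter (assumed value) */

    double sec_3y  = 3.0  * 365 * 24 * 3600;   /* system amortization period */
    double sec_10y = 10.0 * 365 * 24 * 3600;   /* migration amortization period */

    double capex  = exec_s / sec_3y * sys_cost + exec_s / sec_10y * mig_cost;
    double energy = 1.5 * power_kw * (exec_s / 3600.0) * 0.07;  /* +50% cooling */
    double maint  = exec_s / sec_3y * 0.10 * sys_cost;          /* 10% of cost */

    printf("CAPEX %.2f EUR, energy %.2f EUR, maintenance %.2f EUR, total %.2f EUR\n",
           capex, energy, maint, capex + energy + maint);
    return 0;
}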
Applications Used
Application 1
• Field: Monte Carlo simulation for thermal radiation
• MPI code
• Migration cost: 1 man-month
Application 2
• Field: astrophysics, hydrodynamics
• MPI code
• Requires 3 GPUs per node to have enough memory space
• Migration cost: 2 man-months
Application 1: Power consumption results
Application 2: Power consumption results
CAPEX-OPEX Overview
• Comparison on an equivalent workload
• CAPEX = system costs + migration costs
• OPEX = energy cost + computer maintenance cost (10% of computer costs)

Application 1 (migration cost = 1 man-month)
Configuration       Exec. time (s)  System costs  Maintenance costs  Energy costs  CAPEX+OPEX
4 nodes             6862            1.87€         0.19€              0.37€         2.43€
4 nodes + 4 GPUs    1744            0.71€         0.07€              0.12€         0.90€
4 nodes + 8 GPUs    1000            0.51€         0.05€              0.08€         0.64€
4 nodes + 12 GPUs   731             0.45€         0.04€              0.08€         0.57€

Application 2 (migration cost = 2 man-months)
Configuration                 Exec. time (s)  System costs  Maintenance costs  Energy costs  CAPEX+OPEX
4 nodes                       713             0.19€         0.02€              0.025€        0.239€
4 nodes + 12 GPUs             485             0.30€         0.03€              0.034€        0.358€
4 nodes (slow clk) + 12 GPUs  500 (estim.)    0.24€         0.02€              0.034€        0.302€

Notes: *** 0.72 of a 3-year period; * no InfiniBand cost counted; **** uses GPU mainly – projection, not an actual run
[Charts: Cost per run. Application 1: no GPU / 1 / 2 / 3 GPU per node, y-axis 0–3.00€. Application 2: no GPU / 3 GPU per node / 3 GPU per node (slow clock), y-axis 0–0.45€. Stacked bars of migration costs (4 nodes), energy costs (power + cooling), maintenance costs (10% of computer costs) and system costs.]
Conclusions
– Be agnostic first… GPUs can do miracles… CPUs too…
– Hybrid system architecture is key (server, CPU, # of GPUs, and the associated connection)
– Do consider TCO (CAPEX + OPEX) !!!
– Application 1: the GPU extra cost is easily amortized
– Application 2: the price/perf improvement is not compensated by the speed-up (1.5x); better on the next GPU generation?
– ISVs' licensing policy can drastically change the rules!
– On a large machine, migration cost can be negligible
Thank you !
What is HMPP? (Hybrid Manycore Parallel Programming)
• A directive-based multi-language programming environment
  o Helps keep software independent from hardware targets
  o Provides an incremental tool to exploit GPUs in legacy applications
  o Avoids exit cost; can be a future-proof solution
• HMPP provides
  o Code generators from C and Fortran to GPU (CUDA or OpenCL)
  o A compiler driver that handles all low-level details of GPU compilers
  o A runtime to allocate and manage GPU resources
• Source-to-source compiler
  o CPU code does not require a compiler change
  o Complements existing parallel APIs (OpenMP or MPI)
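A minimal HMPP illustration (directive spelling follows the CAPS HMPP 2.x codelet/callsite style; exact clauses vary by version, so treat this as a sketch rather than vendor-verified code). The same C source still builds unchanged with an ordinary compiler, which is the "incremental, no exit cost" point above:

#include <stdio.h>

/* codelet: HMPP generates a CUDA version of this function */
#pragma hmpp saxpy codelet, target=CUDA, args[y].io=inout
void saxpy(int n, float a, float x[n], float y[n]) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];      /* offloaded loop */
}

int main(void) {
    enum { N = 1024 };
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
#pragma hmpp saxpy callsite
    saxpy(N, 3.0f, x, y);            /* runs on the GPU when HMPP is active */
    printf("y[0] = %f (expect 5.0)\n", y[0]);
    return 0;
}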
WHERE ARE GPUs WORKING?
– Oil and gas:
  • Seismic exploration and reservoir modeling
  • Enables reverse time migration
– Financial services (FSI banks):
  • Option pricing and risk modeling
– Bioscience:
  • Genetic sequencing and chemistry
  • Molecular dynamics
  • Drug discovery (protein docking)
– Government:
  • Searching and encryption engines
Tokyo Institute of Technology TSUBAME 2.0
TSUBAME 2.0 Overview
• Compute nodes: 2.4 PFlops (CPU+GPU)
• 1408 SL390s G7 thin nodes, each with 2 Westmere-EP CPUs and 3 NVIDIA M2050 GPUs
  − 1347 with 54GB and 2x 60GB SSD; 41 with 96GB and 2x 120GB SSD
  − SUSE Linux Enterprise Server or Windows HPC Server
• DL580 G7 medium (24) and fat (10) nodes, with 2 NVIDIA S1070
  − Medium: 128GB plus 4x 120GB SSD
  − Fat: 256GB plus 4x 120GB SSD
• QDR InfiniBand, full bisection, non-blocking
  • Spine: 12x Voltaire Grid Director 4700, 324 ports
  • Edge: 179x Voltaire Grid Director 4036, 36 ports, and 6x 4036E, 34 ports + 2x 10GbE
• Storage: 7.13PB
  • Lustre file system, 5.93PB: DDN SFA 10000 (10 per rack, 5 racks) and 30x DL360 G6
  • Home file system, 1.2PB: DDN SFA 10000 (10 per rack, 1 rack), 2x BlueArc Mercury 100 and 4x DL360 G6
• Press release (Japanese): http://www.gsic.titech.ac.jp/sites/default/files/pdf/TSUBAME/press.pdf
TSUBAME 2.0 System Overview

Compute nodes: 2.4 PFlops (CPU+GPU)
• "Thin" nodes: 1408x SL390s G7 (32 nodes x 44 racks). CPU: Intel Westmere-EP 2.93GHz (Turbo Boost 3.196GHz), 12 cores/node. Memory: 54GB (1347 nodes) or 96GB (41 nodes). GPU: 3x NVIDIA M2050 (515 GFlops) per node. SSD: 2x 60GB = 120GB (54GB nodes) or 2x 120GB = 240GB (96GB nodes). OS: SUSE Linux Enterprise Server or Windows HPC Server.
• "Med" nodes: 24x DL580 G7. CPU: Intel Nehalem-EX 2.0GHz, 32 cores/node. Memory: 137GB (= 128GiB). SSD: 4x 120GB = 480GB. OS: SUSE Linux Enterprise Server. CPU total: 6.14 TFlops.
• "Fat" nodes: 10x DL580 G7. CPU: Intel Nehalem-EX 2.0GHz, 32 cores/node. Memory: 274GB (= 256GiB) on 8 nodes, 549GB (= 512GiB) on 2 nodes. SSD: 4x 120GB = 480GB. OS: SUSE Linux Enterprise Server. CPU total: 2.56 TFlops.
• Thin-node totals: CPU 215.99 TFlops (Turbo Boost 3.196GHz); CPU+GPU 2391.35 TFlops; memory 80.55TB; SSD 173.88TB.
• GSIC: NVIDIA Tesla S1070 GPUs, PCI-E gen2 x16, 2 slots/node.

Interconnect: full bisection, non-blocking
• Core switches: 12x Voltaire Grid Director 4700 (IB QDR, 324 ports)
• Edge switches: 179x Voltaire Grid Director 4036 (IB QDR, 36 ports)
• Edge switches with 10GbE ports: 6x Voltaire Grid Director 4036E (IB QDR, 34 ports; 10GbE, 2 ports)

Storage system: 7.13PB total (Lustre + home)
• Lustre file system, 5.93PB: 5x DDN SFA10000 (10 enclosures each); MDS/OSS on 30x HP DL360 G6 (20 OSS, 10 MDS); Lustre (5 file systems), OST 5.9PB, MDT 30TB
• Home directory region, 1.2PB: 1x DDN SFA10000 (10 enclosures); storage servers 4x HP DL380 G6 (for NFS/CIFS) and 2x BlueArc Mercury 100 (for NFS/CIFS/iSCSI)
• Plus management nodes, links to the existing (tape) system and to the external networks SupreSinet3 and SupreTitenet
Why use a GPU?
– A GPU offers a massively parallel environment
– A GPU offers a high (potential) peak performance
  • 515 GFlops DP for Fermi, compared to 140 GFlops for a dual-socket X5670 @ 2.93 GHz
– GPU internal bandwidth is really high
  • 86.4 GB/s for Fermi vs 40 GB/s on Intel Westmere or 53 GB/s on AMD Magny-Cours
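Taken together, these figures put Fermi's machine balance at roughly 515/86.4 ≈ 6 double-precision flops per byte of on-board bandwidth, versus about 140/40 = 3.5 for the Westmere node: on either side, a kernel needs several floating-point operations per memory access before peak flops matter, which is why "minimize memory access versus calculation" appears among the success factors.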
HP PORTFOLIO
[GPU+GRAPHIC]-ENABLED SERVERS **

System                       Max Mem        Enabled devices
DL160(se)G6                  192GB/288GB    FX3800/Q4000; ext. Quadroplex & Tesla
DL380G7                      192GB          2x FX3800 or Q4000; ext. Quadroplex & Tesla
DL370G6                      144GB          3x [FX5800, Q4000, Q5000, Q6000, C1060, C2050, C2070]; ext. Quadroplex & Tesla
DL2000 (2x DL170eG6)         2x 192GB       2x [FX5800, Q4000, Q5000, Q6000]; 2x [C2050, C2070]; ext. Quadroplex & Tesla
DL580G7 / DL585G7 / DL980G7  1TB/512GB/2TB  3x [FX5800, Q4000, Q5000, Q6000]; 3x [C1060, C2050, C2070]; ext. Quadroplex & Tesla
SL390G7                      192GB          2U SL390: 3x [M2050, M2070, M2070Q]; 4U SL390: 8x [M2050, M2070, M2070Q]
WS460cG6                     192GB          1 or 2x [FX880M, FX2800M, FX3600M]; 2x [FX3800, FX4800, FX5800, Q5000, Q6000, C1060]

** All combinations may not be qualified yet; please consult HP.
GPU-BASED SYSTEM SET-UP
HP servers attach GPUs in three ways:
• External Tesla S1070 & S2050 units, 4 GPUs (2+2), connected via HIC or D-HIC adapters
• OR internal Tesla C1060, C2050, C2070 cards
• OR Tesla M1060, M2050, M2070, M2070Q modules
HP ProLiant SL390s G7 Server
PURPOSE BUILT SOLUTIONS FOR SCALE
DL
• Design center: rack
• Design focus: versatility & value
• Management: essential and advanced management
• Density optimized for the data center

BL
• Design center: blade enclosure in rack
• Design focus: integrated & optimized, maximum redundancy
• Management: HP Insight Dynamics advanced management – accelerated service delivery & change in minutes
• Shared infrastructure for accelerated service delivery
• #1 energy efficiency, #1 blade density, #1 virtualization

SL
• Design center: rack
• Design focus: ROI optimized for extreme scale-out
• Management: in-house developed management tools; basic management via IPMI/DCMI
• Extreme scale-out datacenters with lean management
HIGHLY FLEXIBLE S6500 CHASSIS
Multi-node, shared power & cooling architecture
Benefits: low-cost, high-efficiency chassis
• 4U chassis for deployment flexibility
  • Standard 19" racks, with front I/O cabling
  • Unrestricted airflow (no mid-plane or I/O connectors)
  • Reduced weight
• Individually serviceable nodes
  • Variety of optimized node modules
• SL Advanced Power Manager support
  • Dynamic power capping
  • Power monitoring
  • Node-level power off/on
• Shared power & fans
  • Optional hot-plug redundant PSU
  • Energy-efficient hot-plug fans
  • 3-phase load balancing
  • 94% Platinum common-slot power supplies
  • N+1 capable power supplies
HP S6000 CHASSIS HOT PLUG FANS & POWER SUPPLIES
HP ProLiant SL6500 – 300 Series trays: ideal environments

HP ProLiant SL390s G7 1U half width
• Dense, fabric-heavy HPC with manageability and SNS
• Ideal application: HPC compute-intensive, highly managed nodes

HP ProLiant SL390s G7 2U half width
• Dense GPU-offloaded HPC with heavy fabric, manageability and SNS
• Ideal application: balanced GPU & compute-intensive HPC

HP ProLiant SL390s G7 4U half width
• Maximum GPU-offloaded HPC
• Ideal application: extreme GPU
HP ProLiant SL390s G7 Server Trays (1U / 2U / 4U half width)

Processor / chipset (all trays): 2P Intel (Westmere); Intel® 5520 chipset (Tylersburg rev C2)
Max memory (all trays): 12 DDR3 slots, R & U DIMMs; max 192GB
Storage: HP Smart Array B110i SATA SW RAID [standard]; HP Smart Array SAS controller family [upgrade]
• 1U half width: 2 LFF SATA/SAS, or 4 SFF SATA/SAS
• 2U half width: 4 SFF HP SATA/SAS, plus 2 SFF SATA/SAS
• 4U half width: 8 SFF HP SATA/SAS
Networking (all trays): dual-port 1GbE NIC [standard]; 10GbE SFP+ (via ConnectX-2) [standard]; QDR IB QSFP (via ConnectX-2) [upgrade]
Slots / GPU:
• 1U half width: 1x x16 LP PCIe G2, 1x x8 LP PCIe G2
• 2U half width: 3x x16 PCIe G2 for up to 3 GPUs, plus 1x x8 LP PCIe G2
• 4U half width: 8x x16 PCIe G2 via PLX for up to 8 GPUs
Remote management: Integrated Lights-Out 3 (supports IPMI 2.0 and DCMI 1.0), sideband or dedicated port
Additional features: high-efficiency shared PS (>94% @ 50%+ load); shared fan topology
Density: 8 servers in 4U (1U tray); 4 servers in 4U (2U tray); 2 servers in 4U (4U tray)
Form factor: ½ width, 1U/2U/4U, SNS
HP PROLIANT SL390s FAST FABRIC COMPUTE TRAY
½ width, 1U tall; 2 hot-plug nodes in 1U
[Photo callouts: 2 LFF (3.5") or 4 SFF (2.5") quick-release HDDs; dual 1GbE (Cx2 LOM); dedicated management iLO3 LAN port; 2 USB ports; VGA; UID LED & button; health LED; serial (RJ45); power button; QSFP (QDR IB); SFP+; PCIe Gen2 x8 LP slot]
HP PROLIANT SL390s, 0 TO 3 GPUs
½ width, 2U tall (effective density = 1U); 4 trays per 4U chassis
[Photo callouts: 4 hot-plug SFF (2.5") HDDs; 2 non-hot-plug SFF (2.5") HDDs; 1 GPU module in the rear lower 1U; 2 GPU modules in the upper 1U; dual 1GbE; dedicated management iLO3 LAN & 2 USB ports; VGA; UID LED & button; health LED; serial (RJ45); power button; QSFP (QDR IB); SFP+; PCIe Gen2 x8 LP slot]
The standard IOH drives 1 GPU; a 2nd IOH drives GPUs 2 & 3 with dedicated x16 lanes – 3 GPUs per 1U with the node.
SL390s Block Diagram, 0 to 3 GPUs
[Diagram: two Westmere CPUs, each with 6 DDR3 DIMMs, linked by QPI to one or two Tylersburg-36D IOHs. The standard IOH feeds a Mellanox ConnectX-2 (10GbE SFP+ & QDR IB QSFP over x8 Gen2), an x8 Gen2 LP slot and an x16 PCIe riser; an optional 2nd IOH on a QPI riser adds 2 PCIe x16 G2 slots for GPUs 2 & 3. The ICH10 southbridge provides 6x SATA, USB, internal micro SD, Intel 82576 (Kawela) dual 1GbE NICs, RN50/ES1000 video/DVI, and iLO3 management over a dedicated 10/100 PHY.]
HP ProLiant SL390s G7 4U half width tray
[Photo callouts: 8 hot-plug SFF (2.5") HDDs; 4 GPU modules in the front and 4 in the back; dual 1GbE; dedicated management iLO3 LAN & 2 USB ports; VGA; UID LED & button; health LED; serial (RJ45); power button; QSFP (QDR IB); SFP+]
SL390s 8-GPU Block Diagram
[Diagram: the same two-CPU / two-Tylersburg-36D base as the 0-3 GPU tray (12 DDR3 DIMMs, Mellanox ConnectX-2 10G or IB, ICH10 with 6x SATA, USB, micro SD via Alcor, Intel 82576 dual 1GbE, RN50 video on PCI-33, iLO3). The x16 Gen2 links fan out through a PLX PEX8647 into two PLX PEX8664 switches, each feeding 4 x16 GPU PCIe slots, for 8 GPUs in total.]
Cluster Management with Integrated GPU Support
An easily customizable utility with a full GUI and command-line interface
• Scalable provisioning
• Configurable monitoring
• Cluster commands
Well adapted to large-scale Linux clusters
Proven and scalable: hundreds of customers, including Top500 sites
Enhanced for NVIDIA Tesla modules:
• Monitoring of GPU health & temperature
• Handling of ECC
• Parallel installation/update of GPU drivers
• Installation of CUDA tools
• CUDA- and GPU-aware job scheduling
• Etc.
Thank you