© 2010 IBM Corporation
POWER7 Performance Overview
Raj Panda
SPXXL 2010 – San Francisco, 9 -16 May 2010
IBM Power7 Product (HPC) Offerings
Blades
4U 32 Core Air Cooled
256 Core Nodes Water Cooled
POWER7
[Figure: POWER7 chip layout. Labels: eight P7 cores, each paired with an L2; Memory Interface; GX; SMP Fabric; POWER Bus; Memory++; L3 Cache (32MB).]
Core options: 8 (for HPC)
Die size: 567 mm²
Technology:
– 45nm lithography, Cu, SOI, eDRAM
Transistors: 1.2 B
– Equivalent function of 2.7B
– eDRAM efficiency
Eight processor cores
– 12 execution units per core
– 4 Way SMT per core
– 32 Threads per chip
– 256 KB L2 per core
32MB on chip eDRAM shared L3
Dual DDR3 Memory Controllers
– 100 GB/s Memory bandwidth per chip
Scalability up to 32 Sockets
– 360 GB/s SMP bandwidth/chip
– 20,000 coherent operations in flight
Advanced pre-fetching: Data and Instruction
POWER7: Core
64-bit PowerPC architecture v2.07
Out of Order Execution
Execution Units
• 2 Fixed Point Units
• 2 Load Store Units
• 4 Double Precision Floating Point Units
• 1 VMX Unit
• 1 Decimal Floating Point Unit
• 1 Branch
• 1 Condition Register
• 6 Wide Dispatch
• Units include distributed Recovery Function
[Figure: POWER7 core floorplan. Labels: L2 Cache, IFU, CRU/BRU, ISU, DFU, FXU, VSX FPU, LSU.]
POWER7 continues to support VMX
Extends SIMD support with VSX
– 2 VSX units that can each handle 2 Double-Precision FP instructions
– 8 FLOPs per cycle
– VSX units can also handle 4 Single Precision instructions per cycle
– VSX instruction set support for vector and scalar instructions
POWER7 Vector/Scalar Unit
64 Entry Vector/Scalar Register File
– 128-bit wide registers
– Used for 32b/64b scalar as well as 4x32b/2x64b SIMD instructions
Floating point instructions and issue rates:
– Up to two instructions can be issued to the VSU in a given cycle – one for each pipeline
– Instructions executed by pipe0 can be a 128-bit simple fixed point operation (Altivec), 128-bit complex fixed point operation (Altivec), 4-way SIMD single-precision FPU operation (Altivec), a 2-way SIMD double-precision FPU operation (VSX) or a scalar floating point (single or double precision) operation.
– Instructions executed by pipe1 can be a 128-bit permute (Altivec or VSX permute), a store, a scalar floating point (single or double precision) operation, or a 2-way SIMD double-precision FPU operation (VSX).
– So there can be two simultaneous VSX instructions executing at once, each handling 2 double-precision FP operations. Since each operation can be a FP multiply-add (FMA) that gives a peak of 2x2x2=8 double-precision FP operations per cycle.
– Different from previous implementations, the scalar and vector FP operations are all executed within the VSU.
– Because there are two scalar FXU pipelines independent of the VSU, two additional FXU operations, for logical operations and/or array indexing, can be executed at the same time as VSU operations
Floating Point Operations are ANSI/IEEE standard 754-1985 Compliant
Scalar: 4 ops/cycle (DP or SP)
Vector (VSX): 8 ops/cycle (DP)
Vector (Altivec): 8 ops/cycle (SP)
How to use vector capability in Power7
Using compiler: compiler versions that recognize the POWER7 architecture are XL C/C++ 11.1 and XL Fortran 13.1.
– For C:
• xlc -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qsimd
– For Fortran:
• xlf -qarch=pwr7 -qtune=pwr7 -O3 -qhot
Using ESSL libraries with vectorization support:
– Select routines have vector analogs in the library
– Key FFT, BLAS routines
Code constructs that prevent vectorization…
Loop carried data dependencies
for (i = 1; i < N; i++)
    dc[i] = dc[i-1] + af[i];
Unresolved aliases
double sub(double *a, int *b, double *c) {
    for (i = 0; i < N; i++)
        a[i] = a[b[i]] + c[i];
    :
}
Non-stride-1 accesses
for (i = 0; i < N; i+=4)
    a[i] = b[i] + c[i];
Power 755: SIMD (AltiVec™) SP Floating Point Performance
NOTES:
1. Non-library version of FFT
2. Library versions (ESSL, FFTW) would show similar behavior with increasing vector length
Power 755: SIMD (VSX) performance on Financial Services
SMT support in POWER7
SMT is a processor technology that allows separate instruction streams (threads) to run concurrently on the same physical processor, improving overall throughput
P7 supports 2-way and 4-way SMT
– SMT4 gains for commercial applications range from 1.7x to 2.2x of single-thread performance
– SMT4 gains are limited by resource constraints such as fewer FP rename registers per thread, fewer instruction buffer entries per thread, …
Not all applications benefit from SMT
– Cases where performance may not improve and may even degrade:
• applications with execution-unit-limited performance (LINPACK for example)
• applications that consume all the chip's memory bandwidth (STREAM for example)
SMT gain on P7
– Less than on P6 due to out-of-order architecture
When to use SMT on Power7
Capacity-oriented workload: many serial jobs
– Both SMT2 and SMT4 are likely to improve performance
Capability-oriented workload: turnaround time is critical
– SMT2 may show some benefit
– SMT4 is very unlikely to show benefit
Capability-capacity workload: many parallel jobs
– SMT2 may show up to 10% benefit
– SMT4 may show a few percent more than SMT2
The above heuristics depend heavily on
– performance characteristics of the individual apps in the job stream
– mileage may vary, but worth trying
SMT gains on Power7
Power 755 4-Socket HPC System
Architecture: POWER7, 4 processor sockets = 32 cores, 8 cores per socket @ 3.3 GHz
Operating systems: AIX 5.3 / 6.1, RHEL / SLES
DDR3 Memory: 128 GB / 256 GB, 32 DIMM slots
DASD / Bays: Up to 8 SFF SAS DASD (2.4 TB), 73 / 146 / 300 GB @ 15K (Opt: RAID)
Expansion: PCIe x8: 3 slots (1 shared); PCI-X DDR: 2 slots; GX++ bus
Integrated Ports: 3 USB, 2 Serial, 2 HMC
Integrated Ethernet: Quad 1Gb Copper (Opt: Dual 10Gb Cu or Fiber)
Media Bays: 1 Slim-line (no tape support)
Cluster: (64 nodes) Ethernet or IB-DDR
Redundant Power: Yes (AC or DC power), single phase 240 VAC or -48 VDC
Certifications: NEBS / ETSI for harsh environments
EnergyScale: Active Thermal Power Management, Dynamic Energy Save & Capping
Peak: Up to 8.4 TFlops per rack (10 nodes per rack)
Form factor: 4U x 28.8" depth
Rack to Rack: Power 755 Compared to Power 575 (POWER6)
                          Power 755   Power 575
Cores/chip                8           2
Total cores               32          32
Frequency                 3.3 GHz     4.7 GHz
Memory (max)              256 GB      256 GB
Cooling                   Air         Water
Cores/rack                320         448
Rack type                 19"         24"
Power (Watts) (Linpack)   1650        5400
Each Power 755 node offers the same core count as Power 575 with:
– 40-50% improvement in performance
– Air cooling vs. water cooling
– 1/3 of the energy consumption
– 37% improvement in floor space for a 64 node configuration
Green500 ~ 592 MFlops/Watt
Power 755 vs Power 575: Standard Benchmarks
Power 755, AIX 6.1
Peak performance (GFLOPS) (*)   844
STREAM (triad) (GB/s)           122
Linpack (HPL) (GFLOPS)          820
SPECfp_rate2006                 825
SPECint_rate2006                1010
(*) – at nominal frequency of 3.3 GHz. Using DPS-FP mode, 755 can be run at a higher frequency
Power 755 vs Power 575: Gaussian Application Performance
• Small DFT: Density Functional Theory frequency calculation on a small molecule
• Medium Force: Split Basis Function force calculation on a medium sized molecule
• Large DFT: Density Functional Theory frequency calculation on a large molecule
• Binary built with XLF V9, VAC v7, -qarch=pwr4 -qtune=pwr4
Source: Balaji Atyam, Carlos Sosa, Tony Pirraglia
PRELIMINARY results; final to be published by Gaussian, Inc.
Power 755: SIMD (VSX) Performance on ABAQUS
1. ABAQUS was tested using the standard version of ESSL and a VSX enabled (pre-GA) version of ESSL
2. ABAQUS Standard Benchmark cases (S2a, S4a, etc.) were used
3. ABAQUS uses the DGEMM routine from ESSL
4. Performance benefit varies depending on the size of matrices and the DGEMM content
5. There are benchmark cases (e.g. S5) with no performance improvement with VSX
6. VSX exploitation REQUIRES Power7 capable ESSL and XLF/XLC runtime environments.
Source: Balaji Atyam, Tony Pirraglia
HMMER benchmark
Cores   Scalar (secs)   Vector (Altivec) (secs)   Vector/Scalar Ratio
1       2879            502                       5.74
2       1442            252                       5.72
4       724             140                       5.17
8       365             80                        4.56
16      205             59                        3.47
Notes: table entries are elapsed time in seconds
Power 755 with 32 cores at 3.3 GHz, xlc 11.1 beta compiler
PS701, PS702 Blades
POWER7 Blades
Architecture: POWER7, 8 cores per socket; single or double wide
L2 & L3 Cache: On chip
DDR3 Memory: Up to 128 GB / 256 GB
DASD / Bays: 0 - 2 SSD per side; 0 - 1 SAS per side
Daughter Card Options: Legacy, SFF, or High speed PCIe
Integrated Options: Dual Port 10/100/1000 Ethernet; Ethernet, USB
Fiber Support: Yes (via BladeCenter)
Media Bays: 1 (BladeCenter)
Clustering: 10 Gbit Ethernet
Redundant Power: Yes (BladeCenter)
Redundant Cooling: Yes (BladeCenter)
Service Processor: Yes
Power & Thermal: POWER Save / Power Cap
Cores: 8 / 16
Peak: Up to 2.7 TF / BladeCenter, 10.75 TF / Rack (14 blades per chassis)
Blade Performance: Power6 vs Power7
                          JS22     JS23     PS701
CPU                       Power6   Power6   Power7
Core frequency (GHz)      4        4.2      3
Cores/Single wide blade   4        4        8
RAM                       4x4GB    8x4GB    16x4GB
DIMM speed (MHz)          667      677      1066
Blade Performance: Linpack, STREAM, SPEC CPU
Active Energy Management: EnergyScale on POWER7
Active Energy Manager is configurable using IBM Systems Director
Offers 3 modes of energy management
Static Power Saver (SPS)
– Static: Active processor frequency set at 30% below nominal (2.31 GHz)
– Folding will set idle cores to Nap (1.65 GHz) or to Sleep (0 GHz)
– Maximum energy savings, used for long periods of low utilization
Dynamic Power Saver (DPS)
– Processor frequency is set based on processor core utilization
– Un-utilized cores set to 1.65 GHz and ramped up as utilization increases to maximum 90% of nominal frequency (2.97 GHz)
– This feature prefers power savings over performance
Dynamic Power Saver – Favor Performance (DPS-FP)
– Processor frequency is set based on processor core utilization
– Un-utilized cores set to 1.65 GHz and ramped up as utilization increases to maximum 107% of nominal frequency (3.53 GHz)
– This feature prefers maximum performance over power savings
SPEC Benchmark   Performance Characteristic
416.gamess       Core intensive
433.milc         Mem. bandwidth intensive
435.gromacs      Core intensive
437.leslie3d     Mem. bandwidth intensive
444.namd         Core intensive
459.GemsFDTD     Mem. bandwidth intensive
[Figure: Correlation of performance and power consumption]
DPS-FP on Power 755
Linpack Benchmark with Active Energy Manager
AEM "Over-Clocking" support with DPS-FP
– Dynamic Power Save - Favor Performance
Test environment
– Power 755 32 core @ 3.3 GHz
– XLC V11.0 beta
– ESSL V5.1 beta
– PE V5.2
In case of AEM=OFF
– 692.7 Gflops* (82.0% efficiency)
In case of AEM=ON (DPS-FP)
– 753.6 Gflops* (89.2% efficiency)
Using AEM: 8.8% Performance Gain
– 753.6 Gflops / 692.7 Gflops
* Power 755 performance projected from actual Power 750 results
Power 755: 16-node (512-core) POWER7 3.3 GHz cluster with InfiniBand: 11.15 TFlops Linpack Rmax / 13.517 TFlops Rpeak, 18.8 kW (Source: IBM measurements)
Sun x6440: 6440-core 2.5 GHz Opteron cluster with InfiniBand: 51.88 TFlops Linpack Rmax / 64.64 TFlops Rpeak, 152 kW (Source: November 2009 Little Green500 list, www.green500.org)
1.7 times the Mflops/watt vs. Sun x6440 cluster
IBM Power 755
Power 755 cluster delivers superior Linpack Mflops per watt performance compared to Sun 4-socket x86 cluster
[Figure: bar chart of Megaflops/watt for the systems below. The number shown in each column is the November 2009 TOP500 rank. Annotation: "Power7 is there!!!" Power labels on the chart: 6.9 MW, 2.3 MW, 2.2 MW.]
Source: www.top500.org
Notes: Mflops/watt is calculated by dividing "Rmax" by "Power"; Green500 will not publish until a later date.
* Kraken power is scaled down from Jaguar; NUDT and Sandia/Sun did not provide power numbers.
Rank  Site       Mfgr  System                                Rmax (TFlops)  MF/w  Relative
1     ORNL       Cray  Jaguar XT5 HE 2.6 GHz 6C Opteron      1759           253   1.76
2     LANL       IBM   Roadrunner QS22/LS21                  1042           444   1
3     U of Tenn  Cray  Kraken XT5 HE 2.6 GHz 6C Opteron *    831.7          253   1.76
4     Juelich    IBM   Blue Gene/P                           825.5          364   1.22
5     NUDT       Self  Intel Nehalem/AMD Radeon GPU *        563.1
6     NASA Ames  SGI   QC 3.0 Xeon                           544.3          232   1.92
7     LLNL       IBM   Blue Gene/L                           478.2          205   2.16
8     ANL        IBM   Blue Gene/P                           458.6          364   1.22
9     TACC       Sun   2.3 GHz QC Opteron                    433.2          217   2.05
10    Sandia     Sun   Red Sky 2.93 Nehalem *                423.9
Summary
Power 755 targets Divisional and Departmental HPC Segments
– 4S single node systems
– Clusters with GigE and IB networks
Power 755 provides an ideal migration path for currently deployed systems such as:
– System p5 550 and Power 550 (POWER6)
– System p5 575 and Power 575 (POWER6)
– JS21 Blade Clusters
Power 755 brings great improvement over Power 575 with:
– 2x improvement in price performance
– 3x improvement in power consumption (Air cooled vs. Water cooled)
– 1.7x improvement in density
– But, with lower interconnect bandwidth
Application segments
– Weather
– Reservoir modeling
– Financial Services
– Computational Chemistry/Molecular Dynamics
...any Questions?
Thank you...