
© 2010 IBM Corporation

POWER7 Performance Overview

Raj Panda

SPXXL 2010 – San Francisco, 9-16 May 2010



2

IBM Power7 Product (HPC) Offerings

Blades

4U 32 Core Air Cooled

256 Core Nodes Water Cooled



3

POWER7

[Figure: POWER7 chip floorplan. Eight P7 cores, each with its own L2; shared L3 Cache (32MB); Memory Interface; GX, SMP Fabric, and POWER Bus links; Memory++.]

•  Core options: 8 (for HPC)
•  567 mm2
•  Technology:
   –  45nm lithography, Cu, SOI, eDRAM
•  Transistors: 1.2 B
   –  Equivalent function of 2.7B
   –  eDRAM efficiency
•  Eight processor cores
   –  12 execution units per core
   –  4-way SMT per core
   –  32 threads per chip
   –  256 KB L2 per core
•  32MB on-chip eDRAM shared L3
•  Dual DDR3 memory controllers
   –  100 GB/s memory bandwidth per chip
•  Scalability up to 32 sockets
   –  360 GB/s SMP bandwidth/chip
   –  20,000 coherent operations in flight
•  Advanced pre-fetching: data and instruction



4

POWER7: Core

64-bit PowerPC architecture v2.06
Out-of-order execution
Execution units:
•  2 Fixed Point Units
•  2 Load Store Units
•  4 Double Precision Floating Point Units
•  1 VMX Unit
•  1 Decimal Floating Point Unit
•  1 Branch
•  1 Condition Register
•  6-wide dispatch
•  Units include distributed recovery function

[Figure: POWER7 core floorplan showing the IFU, CRU/BRU, ISU, DFU, FXU, VSX/FPU, LSU, and L2 cache.]

•  POWER7 continues to support VMX and extends SIMD support with VSX
   –  2 VSX units that can each handle 2 double-precision FP instructions
   –  8 FLOPs per cycle
   –  VSX units can also handle 4 single-precision instructions per cycle
   –  VSX instruction set support for vector and scalar instructions



5

POWER7 Vector/Scalar Unit

•  64-entry Vector/Scalar Register File
   –  128-bit wide registers
   –  Used for 32b/64b scalar as well as 4x32b/2x64b SIMD instructions
•  Floating point instructions and issue rates:
   –  Up to two instructions can be issued to the VSU in a given cycle, one for each pipeline.
   –  Instructions executed by pipe0 can be a 128-bit simple fixed point operation (Altivec), a 128-bit complex fixed point operation (Altivec), a 4-way SIMD single-precision FPU operation (Altivec), a 2-way SIMD double-precision FPU operation (VSX), or a scalar floating point (single or double precision) operation.
   –  Instructions executed by pipe1 can be a 128-bit permute (Altivec or VSX permute), a store, a scalar floating point (single or double precision) operation, or a 2-way SIMD double-precision FPU operation (VSX).
   –  So two VSX instructions can execute at once, each handling 2 double-precision FP operations. Since each operation can be a FP multiply-add (FMA), that gives a peak of 2x2x2=8 double-precision FP operations per cycle.
   –  Unlike previous implementations, the scalar and vector FP operations are all executed within the VSU.
   –  Because there are two scalar FXU pipelines independent of the VSU, two additional FXU operations, for logical operations and/or array indexing, can be executed at the same time as VSU operations.
•  Floating point operations are ANSI/IEEE Standard 754-1985 compliant

   Scalar:            4 ops/cycle (DP or SP)
   Vector (VSX):      8 ops/cycle (DP)
   Vector (Altivec):  8 ops/cycle (SP)
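The per-cycle peaks on this slide can be read as a product of three factors: the number of pipelines that can issue the instruction type, the SIMD lanes per instruction, and two operations per multiply-add. The C sketch below only illustrates that arithmetic. The function name `peak_flops_per_cycle` is hypothetical, and the Altivec case assumes (from the pipe0/pipe1 description) that only pipe0 issues 4-way single-precision operations and that each one is a multiply-add.

```c
#include <assert.h>

/* Peak FP operations per cycle =
   issuing pipelines x SIMD lanes x operations per instruction.
   A fused multiply-add (FMA) counts as 2 operations. */
int peak_flops_per_cycle(int pipes, int lanes, int ops_per_instr)
{
    assert(pipes > 0 && lanes > 0 && ops_per_instr > 0);
    return pipes * lanes * ops_per_instr;
}
```

Under those assumptions, `peak_flops_per_cycle(2, 2, 2)` gives 8 for VSX double precision, `peak_flops_per_cycle(2, 1, 2)` gives 4 for scalar code, and `peak_flops_per_cycle(1, 4, 2)` gives 8 for Altivec single precision, matching the three figures quoted on the slide.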



6

How to use vector capability in Power7

•  Using the compiler: the compiler versions that recognize the POWER7 architecture are XL C/C++ 11.1 and XL Fortran 13.1.
   –  For C:
      •  xlc -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qsimd
   –  For Fortran:
      •  xlf -qarch=pwr7 -qtune=pwr7 -O3 -qhot
•  Using ESSL libraries with vectorization support:
   –  Select routines have vector analogs in the library
   –  Key FFT, BLAS routines



7

Code constructs that prevent vectorization…

•  Loop-carried data dependencies

    for (i = 1; i < N; i++)
        dc[i] = dc[i-1] + af[i];

•  Unresolved aliases

    double sub(double *a, int *b, double *c) {
        for (i = 0; i < N; i++)
            a[i] = a[b[i]] + c[i];
        :
    }

•  Non-stride-1 accesses

    for (i = 0; i < N; i += 4)
        a[i] = b[i] + c[i];
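For contrast, here is a minimal C99 sketch of a loop that avoids all three obstacles, next to one that cannot. The names `vec_add` and `running_sum` are illustrative and not from the slides. Whether a given compiler actually emits SIMD code for the first loop depends on its version and options, such as the XL C flags shown on the previous slide.

```c
#include <assert.h>
#include <stddef.h>

/* Vectorization-friendly: stride-1 accesses, no dependence between
   iterations, and restrict promises that a, b, and c do not overlap. */
void vec_add(size_t n, double *restrict a,
             const double *restrict b, const double *restrict c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: iteration i needs the result of
   iteration i-1, so the iterations cannot run in SIMD lanes as written. */
void running_sum(size_t n, double *dc, const double *af)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        dc[i] = dc[i - 1] + af[i];
}
```

The `restrict` qualifiers are a promise by the caller, not something the compiler checks. Passing overlapping arrays to `vec_add` is undefined behavior.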



8

Power 755: SIMD (AltiVec™) SP Floating Point Performance

NOTES:
1.  Non-library version of FFT
2.  Library versions (ESSL, FFTW) would show similar behavior with increasing vector length



9

Power 755: SIMD (VSX) performance on Financial Services



10

SMT support in POWER7

•  SMT is a processor technology that allows
   –  separate instruction streams (threads) to run concurrently on the same physical processor, improving overall throughput
•  P7 supports 2-way and 4-way SMT
   –  SMT4 gains for commercial applications range from 1.7x to 2.2x of single-thread performance
   –  SMT4 gains are limited by resource constraints such as fewer FP rename registers per thread, fewer instruction buffer entries per thread, …
•  Not all applications benefit from SMT
   –  Cases where performance may not improve and may even degrade:
      •  applications with execution-unit-limited performance (LINPACK, for example)
      •  applications that consume all the chip's memory bandwidth (STREAM, for example)
•  SMT gain on P7 is less than on P6 because of P7's out-of-order architecture



11

When to use SMT on Power7

•  Capacity-oriented workload: many serial jobs
   –  Both SMT2 and SMT4 are likely to improve performance
•  Capability-oriented workload: turnaround time is critical
   –  SMT2 may show some benefit
   –  SMT4 is very unlikely to show benefit
•  Capability-capacity workload: many parallel jobs
   –  SMT2 may show up to 10% benefit
   –  SMT4 may show a few percent more than SMT2
•  The above heuristics depend heavily on
   –  the performance characteristics of the individual apps in the job stream
   –  mileage may vary, but SMT is worth trying



12

SMT gains on Power7



13

Power 755: 4-Socket HPC System

POWER7 Architecture    4 processor sockets = 32 cores; 8 cores per socket @ 3.3 GHz
DDR3 Memory            128 GB / 256 GB, 32 DIMM slots
DASD / Bays            Up to 8 SFF SAS DASD (2.4TB); 73 / 146 / 300GB @ 15K (Opt: RAID)
Expansion              PCIe x8: 3 slots (1 shared); PCI-X DDR: 2 slots; GX++ Bus
Integrated Ports       3 USB, 2 Serial, 2 HMC
Integrated Ethernet    Quad 1Gb Copper (Opt: Dual 10Gb Cu or Fiber)
Media Bays             1 Slim-line (no tape support)
Cluster (64 nodes)     Ethernet or IB-DDR
Redundant Power        Yes (AC or DC power); single phase 240 VAC or -48 VDC
Certifications         NEBS / ETSI for harsh environments
EnergyScale            Active Thermal Power Management; Dynamic Energy Save & Capping
Operating systems      AIX 5.3 / 6.1; Linux RHEL / SLES

Up to 8.4 TFlops per rack (10 nodes per rack)

4U x 28.8" depth



14

Rack to Rack: Power 755 Compared to Power 575 (POWER6)

                           Power 755    Power 575
Cores/chip                 8            2
Total cores                32           32
Frequency                  3.3 GHz      4.7 GHz
Memory (max)               256 GB       256 GB
Cooling                    Air          Water
Cores/rack                 320          448
Rack type                  19"          24"
Power (Watts) (Linpack)    1650         5400

Each Power 755 node offers the same core count as Power 575 with:

•  40-50% improvement in performance
•  Air cooling vs. water cooling
•  1/3 of the energy consumption
•  37% improvement in floor space for a 64-node configuration
•  Green500 ~ 592 MFlops/Watt



15

Power 755 vs Power 575: Standard Benchmarks

                                   Power 755 (AIX 6.1)
Peak performance (GFLOPS) (*)      844
STREAM (triad) (GB/s)              122
Linpack (HPL) (GFLOPS)             820
SPECfp_rate2006                    825
SPECint_rate2006                   1010

(*) At nominal frequency of 3.3 GHz. Using DPS-FP mode, the Power 755 can be run at a higher frequency.
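The peak figure can be reproduced from numbers given elsewhere in the deck: 32 cores, 3.3 GHz nominal, and 8 double-precision operations per core per cycle. The sketch below shows that arithmetic and the fraction of peak reached by the Linpack result. The function names are hypothetical, and the calculation assumes all 32 cores run at exactly the nominal frequency.

```c
#include <assert.h>

/* Nominal peak in GFLOPS = cores x frequency (GHz) x FP ops per cycle. */
double node_peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    assert(cores > 0 && ghz > 0.0 && flops_per_cycle > 0);
    return cores * ghz * flops_per_cycle;
}

/* Measured result expressed as a fraction of a given peak. */
double fraction_of_peak(double measured_gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return measured_gflops / peak_gflops;
}
```

`node_peak_gflops(32, 3.3, 8)` evaluates to about 844.8, consistent with the 844 in the table. Against that nominal peak, the 820 GFLOPS Linpack figure works out to roughly 0.97. The footnote says the system can run above nominal frequency in DPS-FP mode, so that ratio may not describe a run at 3.3 GHz.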



16

Power 755 vs Power 575: Gaussian Application Performance

•  Small DFT: Density Functional Theory frequency calculation on a small molecule
•  Medium Force: Split basis function force calculation on a medium-sized molecule
•  Large DFT: Density Functional Theory frequency calculation on a large molecule
•  Binary built with XLF V9, VAC v7, -qarch=pwr4 -qtune=pwr4

Source: Balaji Atyam, Carlos Sosa, Tony Pirraglia

PRELIMINARY results; final results to be published by Gaussian, Inc.



17

Power 755: SIMD (VSX) Performance on ABAQUS

1.  ABAQUS was tested using the standard version of ESSL and a VSX-enabled (pre-GA) version of ESSL
2.  ABAQUS Standard Benchmark cases (S2a, S4a, etc.) were used
3.  ABAQUS uses the DGEMM routine from ESSL
4.  Performance benefit varies depending on the size of matrices and the DGEMM content
5.  There are benchmark cases (e.g. S5) with no performance improvement with VSX
6.  VSX exploitation REQUIRES Power7-capable ESSL and XLF/XLC runtime environments

Source: Balaji Atyam, Tony Pirraglia



18

HMMER benchmark

Cores   Scalar (secs)   Vector (Altivec) (secs)   Vector/Scalar speedup ratio
1       2879            502                       5.74
2       1442            252                       5.72
4       724             140                       5.17
8       365             80                        4.56
16      205             59                        3.47

Notes:
•  Table entries are elapsed time in seconds
•  Power 755 with 32 cores at 3.3 GHz, xlc 11.1 beta compiler
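The ratio column is the scalar time divided by the Altivec time at the same core count. A short C sketch, using only the times in the table, reproduces that column and also computes parallel efficiency for each build, which suggests why the ratio shrinks as cores are added. The function and array names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define HMMER_ROWS 5

/* Elapsed times in seconds, copied from the HMMER table. */
static const int    hmmer_cores[HMMER_ROWS]  = {1, 2, 4, 8, 16};
static const double hmmer_scalar[HMMER_ROWS] = {2879, 1442, 724, 365, 205};
static const double hmmer_vector[HMMER_ROWS] = {502, 252, 140, 80, 59};

/* Speedup of the Altivec build over the scalar build at the same core count. */
double simd_speedup(size_t row)
{
    assert(row < HMMER_ROWS);
    return hmmer_scalar[row] / hmmer_vector[row];
}

/* Parallel efficiency relative to the 1-core time of the same build:
   t[0] / (cores * t[row]); 1.0 would be perfect scaling. */
double parallel_efficiency(const double *t, size_t row)
{
    assert(row < HMMER_ROWS);
    return t[0] / (hmmer_cores[row] * t[row]);
}
```

At 16 cores the scalar build keeps a parallel efficiency of about 0.88, while the Altivec build drops to about 0.53. The falling ratio therefore reflects weaker scaling of the already much faster vector code, not a loss of SIMD benefit per core.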



19

PS701, PS702 Blades

POWER7 Blades

Architecture             POWER7: 8 cores per socket; single or double wide
L2 & L3 Cache            On chip
DDR3 Memory              Up to 128GB / 256 GB
DASD / Bays              0 - 2 SSD per side; 0 - 1 SAS per side
Daughter Card Options    Legacy, SFF, or high-speed PCIe
Integrated Options       Dual Port 10/100/1000 Ethernet; Ethernet, USB
Fiber Support            Yes (via BladeCenter)
Media Bays               1 (BladeCenter)
Clustering               10Gb Ethernet
Redundant Power          Yes (BladeCenter)
Redundant Cooling        Yes (BladeCenter)
Service Processor        Yes
Power & Thermal          POWER Save / Power Cap

8 cores (single wide), 16 cores (double wide)

Up to 2.7 TF / BladeCenter; 10.75 TF / rack (14 blades per chassis)



20

Blade Performance: Power6 vs Power7

                           JS22      JS23      PS701
CPU                        Power6    Power6    Power7
Core frequency (GHz)       4         4.2       3
Cores/single-wide blade    4         4         8
RAM                        4x4GB     8x4GB     16x4GB
DIMM speed (MHz)           667       677       1066



21

Blade Performance: Linpack, STREAM, SPEC CPU



22

Blade Performance: Linpack, STREAM, SPEC CPU



23

Active Energy Management: EnergyScale on POWER7

•  Active Energy Manager is configurable using IBM Systems Director
•  Offers 3 modes of energy management
•  Static Power Saver (SPS)
   –  Static: active processor frequency set at 30% below nominal (2.31 GHz)
   –  Folding will set idle cores to Nap (1.65 GHz) or to Sleep (0 GHz)
   –  Maximum energy savings; used for long periods of low utilization
•  Dynamic Power Saver (DPS)
   –  Processor frequency is set based on processor core utilization
   –  Un-utilized cores are set to 1.65 GHz and ramped up as utilization increases, to a maximum of 90% of nominal frequency (2.97 GHz)
   –  This mode prefers power savings over performance
•  Dynamic Power Saver - Favor Performance (DPS-FP)
   –  Processor frequency is set based on processor core utilization
   –  Un-utilized cores are set to 1.65 GHz and ramped up as utilization increases, to a maximum of 107% of nominal frequency (3.53 GHz)
   –  This mode prefers maximum performance over power savings
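The frequencies quoted for each mode are fixed percentages of the 3.3 GHz nominal clock. The C sketch below makes that mapping explicit. The constant and function names are hypothetical, and the 50% figure for Nap is inferred from 1.65 GHz / 3.3 GHz, not stated on the slide.

```c
#include <assert.h>

/* Percent of nominal frequency per mode, as read from the slide.
   NAP_PCT is an inference (1.65 / 3.3), not a quoted value. */
enum { NAP_PCT = 50, SPS_PCT = 70, DPS_MAX_PCT = 90, DPS_FP_MAX_PCT = 107 };

/* Frequency in GHz at a given percentage of the nominal frequency. */
double scaled_ghz(double nominal_ghz, int percent)
{
    assert(nominal_ghz > 0.0 && percent >= 0);
    return nominal_ghz * percent / 100.0;
}
```

With a nominal frequency of 3.3 GHz, the four percentages give approximately 1.65, 2.31, 2.97, and 3.53 GHz, in line with the values on the slide.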

SPEC Benchmark    Performance Characteristic
416.gamess        Core intensive
433.milc          Mem. bandwidth intensive
435.gromacs       Core intensive
437.leslie3d      Mem. bandwidth intensive
444.namd          Core intensive
459.GemsFDTD      Mem. bandwidth intensive

Correlation of performance and power consumption



24

DPS-FP on Power 755

•  Linpack benchmark with Active Energy Manager (AEM)
•  AEM "over-clocking" support with DPS-FP
   –  Dynamic Power Saver - Favor Performance
•  Test environment
   –  Power 755, 32 cores @ 3.3 GHz
   –  XLC V11.0 beta
   –  ESSL V5.1 beta
   –  PE V5.2
•  With AEM=OFF
   –  692.7 Gflops* (82.0% efficiency)
•  With AEM=ON (DPS-FP)
   –  753.6 Gflops* (89.2% efficiency)
•  Using AEM: 8.8% performance gain
   –  753.6 Gflops / 692.7 Gflops

* Power 755 performance projected from actual Power 750 results
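The gain and efficiency figures can be cross-checked with two small calculations. This is a sketch with hypothetical function names, using only the numbers on the slide. Dividing each result by its stated efficiency recovers the peak the efficiencies appear to be measured against.

```c
#include <assert.h>

/* Percentage gain of an improved result over a baseline. */
double percent_gain(double baseline, double improved)
{
    assert(baseline > 0.0);
    return (improved - baseline) / baseline * 100.0;
}

/* Peak implied by a measured result and its stated efficiency (0..1). */
double implied_peak(double measured_gflops, double efficiency)
{
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return measured_gflops / efficiency;
}
```

`percent_gain(692.7, 753.6)` is about 8.8. Both `implied_peak(692.7, 0.820)` and `implied_peak(753.6, 0.892)` come out near 845 Gflops, which suggests both efficiencies are quoted against the nominal 3.3 GHz peak even though DPS-FP can raise the clock above nominal.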



25

Power 755 cluster delivers superior Linpack Mflops per watt performance compared to Sun 4-socket x86 cluster

•  Power 755 was a 16-node (512-core) POWER7 3.3 GHz cluster with InfiniBand: 11.15 TFlops Linpack Rmax / 13.517 TFlops Rpeak, 18.8 kW (Source: IBM measurements)
•  Sun x6440 was a 6440-core 2.5 GHz Opteron cluster with InfiniBand: 51.88 TFlops Linpack Rmax / 64.64 TFlops Rpeak, 152 kW (Source: November 2009 Little Green500 list, www.green500.org)

[Chart: IBM Power 755 delivers 1.7 times the Mflops/watt of the Sun X6440 cluster]
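The 1.7x claim follows from dividing each cluster's Linpack Rmax by its power draw. Below is a minimal C sketch of that unit conversion; the function name `mflops_per_watt` is illustrative.

```c
#include <assert.h>

/* Linpack energy efficiency: Rmax in TFlops and power in kW
   converted to MFlops per watt (1 TFlops = 1e6 MFlops, 1 kW = 1e3 W). */
double mflops_per_watt(double rmax_tflops, double power_kw)
{
    assert(power_kw > 0.0);
    return (rmax_tflops * 1.0e6) / (power_kw * 1.0e3);
}
```

With the figures above, `mflops_per_watt(11.15, 18.8)` is about 593 for the Power 755 cluster and `mflops_per_watt(51.88, 152)` is about 341 for the Sun cluster, a ratio of roughly 1.74.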



26

Source: www.top500.org. Mflops/watt is calculated by dividing "Rmax" by "Power"; Green500 will not publish until a later date.

[Chart: Megaflops/watt by system. Number shown in column is November 2009 TOP500 rank. Callouts: "Power7 is there !!!"; 6.9 MW, 2.3 MW, 2.2 MW.]

Notes: * Kraken power is scaled down from Jaguar; NUDT, Sandia/Sun did not provide power numbers.

Rank  Site       Mfgr  System                                Rmax    MF/w  Relative
1     ORNL       Cray  Jaguar XT5 HE 2.6 GHz 6C Opteron      1759    253   1.76
2     LANL       IBM   Roadrunner QS22/LS21                  1042    444   1
3     U of Tenn  Cray  Kraken XT5 HE 2.6 GHz 6C Opteron *    831.7   253   1.76
4     Juelich    IBM   Blue Gene/P                           825.5   364   1.22
5     NUDT       Self  Intel Nehalem/AMD Radeon GPU *        563.1
6     NASA Ames  SGI   QC 3.0 Xeon                           544.3   232   1.92
7     LLNL       IBM   Blue Gene/L                           478.2   205   2.16
8     ANL        IBM   Blue Gene/P                           458.6   364   1.22
9     TACC       Sun   2.3 GHz QC Opteron                    433.2   217   2.05
10    Sandia     Sun   Red Sky 2.93 Nehalem *                423.9



27

Summary

•  Power 755 targets divisional and departmental HPC segments
   –  4S single-node systems
   –  Clusters with GigE and IB networks
•  Power 755 provides an ideal migration path for currently deployed systems such as:
   –  System p5 550 and Power 550 (POWER6)
   –  System p5 575 and Power 575 (POWER6)
   –  JS21 blade clusters
•  Power 755 brings great improvement over Power 575 with:
   –  2x improvement in price/performance
   –  3x improvement in power consumption (air cooled vs. water cooled)
   –  1.7x improvement in density
   –  But with lower interconnect bandwidth
•  Application segments
   –  Weather
   –  Reservoir modeling
   –  Financial services
   –  Computational chemistry/molecular dynamics



28

...any Questions?

Thank you...
