Benchmark Performance of Different Compilers on a Cray XE6

Mike Stewart and Helen He, NERSC User Services Group

May 23-26, CUG 2011

Outline

•  Introduction
•  Available Compilers on Hopper
•  Recommended Compiler Options
•  Benchmarks Used in the Study
•  Performance Results from Each Compiler
•  Summary and Recommendations

Hopper

•  Cray XE6, 6,384 nodes, 153,216 cores.
•  Each node has two twelve-core AMD Magny-Cours 2.1 GHz processors.
•  1.28 Pflop/s peak, 212 TB memory.

Available Compilers on Hopper

•  Portland Group Compilers –  This is the default compiler on Hopper

•  Pathscale Compilers –  % module swap PrgEnv-pgi PrgEnv-pathscale

•  Cray Compilers –  % module swap PrgEnv-pgi PrgEnv-cray

•  GNU Compilers –  % module swap PrgEnv-pgi PrgEnv-gnu

Compile Codes on Hopper

•  Codes are cross-compiled on the login nodes to build executables that run on the compute nodes.

•  To use a particular compiler, first swap to the corresponding PrgEnv.

•  Then use the compiler wrappers:
–  ftn for Fortran codes
–  cc for C codes
–  CC for C++ codes

•  The wrappers can find the proper system and MPI libraries.
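The build steps above can be sketched as a small Python helper that only assembles the shell commands as strings. It does not run anything, since `module` and the ftn/cc/CC wrappers exist only on the Cray system. The function name `build_commands` and the file names `mycode.f90` and `mycode.x` are hypothetical and chosen for illustration.

```python
# Sketch only: builds the command strings described on this slide.
# Nothing is executed; the commands are meant to be run on a Hopper login node.

# Programming environment module for each compiler (names from the slides).
PRGENV = {
    "pgi": "PrgEnv-pgi",  # default environment on Hopper
    "pathscale": "PrgEnv-pathscale",
    "cray": "PrgEnv-cray",
    "gnu": "PrgEnv-gnu",
}

# Compiler wrapper for each source language (names from the slides).
WRAPPER = {"fortran": "ftn", "c": "cc", "c++": "CC"}


def build_commands(compiler, language, source, output="a.out", flags=""):
    """Return the shell commands needed to build `source` with `compiler`."""
    commands = []
    if compiler != "pgi":
        # PGI is loaded by default, so only the other compilers need a swap.
        commands.append(f"module swap PrgEnv-pgi {PRGENV[compiler]}")
    parts = [WRAPPER[language], flags, "-o", output, source]
    # Drop the flags entry when it is empty to avoid a double space.
    commands.append(" ".join(p for p in parts if p))
    return commands


if __name__ == "__main__":
    # Hypothetical Fortran source built with the Cray compiler at -O3.
    for line in build_commands("cray", "fortran", "mycode.f90", "mycode.x", "-O3"):
        print(line)
```

Because the wrappers locate the system and MPI libraries themselves, the assembled command line carries no explicit MPI include or link flags.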

Compiler Flags Comparison

PGI              Pathscale     Cray               GNU                Explanation
-fast            -Ofast        -O3                -O3                High-level optimization
-mp=nonuma       -mp           -h omp (default)   -fopenmp           Enable OpenMP
-byteswapio      -byteswapio   -h byteswapio      -fconvert=swap     Read files in big-endian format
-Mfixed          -fixedform    -f fixed           -ffixed-form       Fixed-form source
-Mfree           -freeform     -f free            -ffree-form        Free-form source
-V               -dumpversion  -V                 --version          Show version info
not implemented  -zerouv       -e 0               -finit-local-zero  Zero-fill uninitialized values
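When a Makefile or build script has to work with more than one compiler, the flag equivalents above can be kept in a lookup table. The following is a minimal Python sketch under that assumption. The flag strings are transcribed from the comparison table, while the option keys (`optimize`, `openmp`, and so on) and the function name `translate` are hypothetical names introduced here.

```python
# Flag equivalents transcribed from the comparison table.
# None marks an option the table lists as not implemented.
FLAGS = {
    "optimize":   {"pgi": "-fast", "pathscale": "-Ofast", "cray": "-O3", "gnu": "-O3"},
    "openmp":     {"pgi": "-mp=nonuma", "pathscale": "-mp", "cray": "-h omp", "gnu": "-fopenmp"},
    "byteswap":   {"pgi": "-byteswapio", "pathscale": "-byteswapio",
                   "cray": "-h byteswapio", "gnu": "-fconvert=swap"},
    "fixed_form": {"pgi": "-Mfixed", "pathscale": "-fixedform",
                   "cray": "-f fixed", "gnu": "-ffixed-form"},
    "free_form":  {"pgi": "-Mfree", "pathscale": "-freeform",
                   "cray": "-f free", "gnu": "-ffree-form"},
    "version":    {"pgi": "-V", "pathscale": "-dumpversion", "cray": "-V", "gnu": "--version"},
    "zero_init":  {"pgi": None, "pathscale": "-zerouv",
                   "cray": "-e 0", "gnu": "-finit-local-zero"},
}


def translate(option, compiler):
    """Return the flag for a generic option name under the given compiler."""
    flag = FLAGS[option][compiler]
    if flag is None:
        raise ValueError(f"{option!r} is not implemented for {compiler}")
    return flag


if __name__ == "__main__":
    for compiler in ("pgi", "pathscale", "cray", "gnu"):
        print(compiler, translate("openmp", compiler))
```

Raising an error for the missing PGI zero-fill option, rather than returning an empty string, keeps a build from silently dropping a flag the code may depend on.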

Recommended Options: PGI Compiler

•  NERSC recommends: -fast or -fastsse

•  PGI User Documentation: “-fast -Mipa=fast” is a good set of options.

•  Cray recommends: -fast -Mipa=fast. If precision requirements are flexible, also try -Mfprelaxed.

Recommended Options: Pathscale Compiler

•  NERSC recommends:   -Ofast

•  Pathscale User Documentation: Start with -O2, then -O3, then -O3 -OPT:Ofast, then -Ofast.

•  Cray recommends:  -Ofast

Recommended Options: Cray Compiler

•  NERSC recommends:   -O3

•  Cray recommends: Use the default -O2, which is equivalent to -O3 or -fast in other compilers. Also try -O3,fp3 (or -O3 -hfp3).
–  -O3 is only slightly better than -O2.
–  -hfp3 gives maximum freedom in floating-point optimization and may not conform to the IEEE standard.

Recommended Options: GNU Compiler

•  NERSC recommends:   -O3

•  Cray recommends: -O3 -ffast-math -funroll-loops
–  -ffast-math may not conform to the IEEE standard.

NERSC6 Application Benchmarks

Benchmark   Science                      Algorithm                                Concurrency             Language
GTC         Fusion                       PIC, finite difference                   2048 (weak scaling)
IMPACT-T    Accelerator Physics          PIC, FFT                                 1024 (strong scaling)
MAESTRO     Astrophysics                 Block structured-grid multiphysics       2048 (weak scaling)
MILC        Lattice Gauge Physics (QCD)  Conjugate gradient, sparse matrix, FFT   1024 (weak scaling)     C, Assembly
PARATEC     Material Science             DFT, FFT, BLAS                           1024 (strong scaling)

NPB 3.3 Benchmarks

Benchmark   Full Name                           Level   Concurrency
BT          Block Tridiagonal                   D       256
CG          Conjugate Gradient                  E       256
EP          Embarrassingly Parallel             E       256
FT          Fast Fourier Transform              D       256
LU          Lower-Upper Symmetric Gauss-Seidel  E       256
MG          MultiGrid                           E       256
SP          Scalar Pentadiagonal                D       256

PGI Compiler Results

•  The other 3 options do not significantly improve performance over “-fast”.

•  The NPB FT case D is an exception.

Pathscale Compiler Results

•  -O2 performs worse than the other 3 options.
•  -O3 optimizes almost all benchmarks well.
•  Extra options on top of -O3 do not improve performance significantly.

Cray Compiler Results

•  Only one benchmark shows a significant improvement with -Ofp3 over the default -O2.

GNU Compiler Results

•  -O3 generally gives a good level of optimization.
•  The -ffast-math option is worth trying. It improves performance significantly in some cases.

Overall Compilers Comparison

•  Pathscale fastest: 6 out of 12.
•  Cray fastest: 3 out of 12.
•  PGI fastest: 2 out of 12.
•  GNU fastest: 1 out of 12.
•  Mean against PGI: Cray 0.96, Pathscale 0.94, GNU 0.99
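The slides do not state how the "fastest" counts and the mean against PGI were computed. The sketch below shows one plausible way to do it, assuming that each benchmark has one run time per compiler, that the fastest compiler is the one with the lowest run time, and that the mean is an arithmetic mean of per-benchmark run-time ratios to PGI (so values below 1 mean faster than PGI). The function names, the benchmark names, and all timings are made up for illustration and are not results from the study.

```python
from statistics import mean


def fastest_counts(times):
    """Count how often each compiler has the lowest run time.

    `times` maps benchmark name -> {compiler name: run time in seconds}.
    """
    counts = {}
    for per_compiler in times.values():
        winner = min(per_compiler, key=per_compiler.get)
        counts[winner] = counts.get(winner, 0) + 1
    return counts


def mean_ratio_to_baseline(times, baseline="pgi"):
    """Arithmetic mean of run-time ratios to the baseline compiler.

    Assumes every benchmark has a time for every compiler.
    """
    others = sorted({c for per in times.values() for c in per if c != baseline})
    return {
        c: mean(per[c] / per[baseline] for per in times.values())
        for c in others
    }


if __name__ == "__main__":
    # Made-up timings in seconds, NOT measurements from the study.
    example_times = {
        "bench_a": {"pgi": 100.0, "cray": 50.0, "gnu": 100.0},
        "bench_b": {"pgi": 200.0, "cray": 300.0, "gnu": 150.0},
    }
    print(fastest_counts(example_times))
    print(mean_ratio_to_baseline(example_times))
```

A geometric mean is a common alternative for averaging ratios. The arithmetic mean is used here only because it is the simplest reading of "mean".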

Summary and Recommendations

•  Users should experiment with different compilers and compiler options to tune their application performance on Hopper.

•  On average, the Pathscale and Cray compilers produce somewhat faster code on Hopper (or another Cray system), since they are specifically designed for these processors. In addition, the Cray compilers make use of the Cray math libraries at compile time to further optimize codes.

•  PGI compilers are available on a wide variety of platforms other than Cray machines. Many existing codes have PGI-targeted Makefiles, and the PGI compilers can generate very good performance.

•  Using the GNU compilers allows you to compile on virtually every Unix and Linux system. Although the performance on Hopper for some codes with GNU compilers is quite good, there is no guarantee of optimal performance on other platforms.
