Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and...

42
1 © NEC Corporation 2004 Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US Jan Fredin, Ph.D. NEC Solutions America March 5, 2004

Transcript of Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and...

Page 1: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

1

1© NEC Corporation 2004

Performance Optimization and Tuningfor NEC TX7

includingMOLPRO, MOLCAS, AMBER, GAMESS-US

Jan Fredin, Ph.D.NEC Solutions America

March 5, 2004

Page 2: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

2

2© NEC Corporation 2004

Talk Overview

NEC TX7 hardwareGeneral TX7 performance tuning techniquesCase Studies: examples of tuning efforts on Chemistry applications:

MOLPROMOLCASAMBERGAMESS-USUser Molecular Dynamics Code

Page 3: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

3

3© NEC Corporation 2004

NEC TX7 Servers

Second generation technical computing server with Intel’s latest 64-bit CPU “Itanium® 2 processor”.

Configured with up to 32 CPUs per node. NEC was the first company to release Itanium® 2 systems with 32 processors.

Designed and built with supercomputer technology from NEC.

Ideal systems for scientific and engineering applications.

Page 4: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

4

4© NEC Corporation 2004

TX7 Internal Structure

Memory ControllerMemory ControllerMemory Controller

Memory ControllerMemory ControllerMemory Controller

DDR DIMMs

CellCell

Up to 8 Cells

PCIPCI--XXUp to 112slots

Cross-bar interconnectCrossCross--bar interconnectbar interconnect

PCI-X bridge PCI-X bridge14 PCI-X slots

PCI-X bridge PCI-X bridge14 PCI-X slots

PCI-X bridge PCI-X bridge14 PCI-X slots

PCI-X bridge PCI-X bridge14 PCI-X slots

PCI-X bridge PCI-X bridge14 PCI-X slots

PCI-X bridge PCI-X bridge14 PCI-X slots

PCI-X bridge PCI-X bridge14 PCI-X slots

PCI-X bridgePCIPCI--X bridgeX bridge PCI-X bridgePCIPCI--X bridgeX bridge14 PCI-X slots

CellController

CellCellControllerController

Itanium2Itanium2

Itanium2Itanium2Itanium2Itanium2

Itanium2Itanium2

Total band width>100GB/s

• ccNUMA architecture• Near-flat 32-way memory access• Flexible partitioning Cell Photo

Itanium2NEC Chipset DIMMs memory

TX7 / i6010, i6510, i9510

Page 5: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

5

5© NEC Corporation 2004

Itanium 2 Processor Memory Hierarchy

1.5 GHz Itanium® 2 Processor

L2256KB8-way128B lines5-7 CLKSBanked

48 GB/s

L36MB24-way128B lines7-14 CLKS

ExternalMemory

6.4 GB/s

48 GB/s

48 GB/s

L1D16KB64B lines1 CLK

L1I16KB64B lines1 CLK

128 FP Registers

128 General Registers

Core Pipeline (functional units)

Page 6: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

6

6© NEC Corporation 2004

NEC IPF Server RoadmapPe

rfor

man

ce /

Scal

abili

ty

16 way

Itanium® 2 / MadisonItaniumItanium®® 2 / Madison2 / Madison4 way

Express5800/1160Xb(AzusA)

2001 2002 2003

Itanium®ItaniumItanium®®

TX7 Series

32 way

Page 7: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

7

7© NEC Corporation 2004

● Subsidiaries ▲ Partner■ Branch

.

NEC AustraliaPty. Ltd.

SingaporeNEC SingaporePty. Ltd.

Australia●

JAPANHMD

U.S.A.

Brasil●

NEC doBrasil

S.A.

Germany

Groupe Bull

NEC France

● ●●▲

France

KoreaNECSupport Office

Canada

Cray

NECSAM

NEC HPCE

NEC Worldwide HPC Support andDevelopment Organizations

Page 8: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

8

8© NEC Corporation 2004

Software Support CyclePort - successfully run code for all QA available

typical compile efc -O2 -ftzreduce optimization for problem areas

Tune - improve performanceidentify time consuming routines for typical workloadmeasure performance using UNIX gprof, NEC compiler add-of ftrace, Intel VTUNE or timing callstest impact of changes in compile options, compile directives and source code restructurecan be limited by ISV code maintenance rules

Page 9: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

9

9© NEC Corporation 2004

Software Support Cycle

Update, supporting software vendor and user sitestest and verify full QA correctnessprovide changes through vendor distribution or release notesupdate for:- new vendor software release- compiler update- user problems, feedback and benchmarks- hardware upgrades

Repeat the tune and update steps regularly

Page 10: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

10

10© NEC Corporation 2004

Tuning Challengekeep fast functional units busy, providing all data required

2 FMA / clock - 6GFLOPS / sec for Itanium2 1.5GHzmaximize loop optimization

provide compiler with loops that can achieve high performancesoftware pipelining - similar to vectorizationinstructional parallelism - 6 (2 bundles) / clockprefetching - avoid waiting at the L3 cache bottleneck

high performance is achieved by balance in the loopscomputational complexitydata localitynumber of variables

Page 11: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

11

11© NEC Corporation 2004

Software Tuning Techniquesknow the good application input and algorithm choices

memory, processors, site defaultsuse tuned kernels like NEC MathKeisanchoose effective compile optionsrestructure loops help compiler optimize through source-code directives

Page 12: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

12

12© NEC Corporation 2004

Solution of large scale eigenvalue problemsARPACK

Direct solver for sparse symmetric systemsSOLVER

Parallel Matrix/Graph ordering and partition library. Uses MPI.ParMETIS

Matrix/Graph ordering and partitioning libraryMETIS

Fast Fourier TransformsFFT

Sparse BLAS (from ACM Algorithm 692)SBLAS

C interface to BLASCBLAS

Basic Linear Algebra Communication Subprograms. Uses MPI.BLACS

Scalable Linear Algebra PACKage (contains PBLAS)ScaLAPACK

Linear Algebra PACKage for high performance computers.LAPACK

Basic Linear Algebra SubprogramsBLAS

DescriptionName

MathKeisan is a set of NEC highly tuned math libraries.

NEC MathKeisan

Page 13: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

13

13© NEC Corporation 2004

DGEMM Performance

0

1000

2000

3000

4000

5000

6000

10 50 100 500 1000 1500Matrix size

MFL

OP/

s

Fortran -O3

MathKeisan 1.3

MathKeisan 1.5 ( target release 2Q CY2004 )

MathKeisan 1.5 Improvements

1.5 GHz Madison, 6 MB L3

Page 14: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

14

14© NEC Corporation 2004

Compile Choices

NEC compiler is a superset of the Intel compilerNEC compiler provides ftrace function for code section performance tracing

Compiler optimization level-O3 highest level of optimization-software pipelining on-activates prefetching-activates other loop optimizations

-O2 safe optimization level- software pipelining on

-O0 no optimization for debugging

Page 15: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

15

15© NEC Corporation 2004

Other Compile Options

-ftz flush-to-zero denormalized valuesshould always usedefault at compiler level -O3

-opt_report compiler optimization reportmessages about pipelining, dependencies, unrolling, other loop restructuring

-ip interprocedural optimization within object files-prof_gen and -prof_use profile generated optimization-restrict / -ansi_alias enable strict pointer aliasing-i8/-r8 integer and real default word size

Page 16: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

16

16© NEC Corporation 2004

Loop Restructuring

code developers know overall structure provide compiler with loops that can achieve maximum performancemulti-purpose scientific and engineering codes often have heavy usage of long complex loops with conditionals and indirect indexinggoal of hand recoding is to isolate sections that are hard for the compiler to optimization

too complex to analyzeexternal callsdata dependenciesconditionals

Page 17: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

17

17© NEC Corporation 2004

Restructuring Example - Loop Blockingbreaks up multi-level loop into blocks that fit into cachebest blocking sizes determined experimentallyblocked sections are prime targets for parallelism

Modified Codeparameter (kblock = 1024)do ks = 1, m, kblockdo j = 1, ndo i = 1, pdo k = ks, min(ks+kblock-1,m)c(k,j)=c(k,j)-a(k,i)*b(i,j)

enddoenddo

enddoenddo

Original Code

do j = 1, ndo i = 1, pdo k = 1, mc(k,j)=c(k,j)-a(k,i)*b(i,j)

enddoenddo

enddo

Page 18: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

18

18© NEC Corporation 2004

Compile Automated Loop Restructuring

Loop Unrolling Unroll and JamReduction OptimizationLoop InterchangeLoop FusionLoop Fission

Page 19: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

19

19© NEC Corporation 2004

Loop Unrollingperform n iterations of the inner loop on each passactives at compile level -O3, compiler sets unroll depth n

!DIR$ unroll (n) or !DIR$ nounrolldirective controls unroll depth n or switchs off

Modified Codesubroutine vadd(n, a, b)integer n, idouble precision a(*), b(*)

c explicitely unrolled by 4do i = 1,n,4

a(i) = a(i) + b(i)a(i+1) = a(i+1) + b(i+1)a(i+2) = a(i+2) + b(i+2)a(i+3) = a(i+3) + b(i+3)

enddodo i=i,n

a(i) = a(i) + b(i)enddoend

Original Codesubroutine vadd(n, a, b)integer n, idouble precision a(*),b(*)

!DIR$ UNROLL(4)do i = 1,n

a(i) = a(i) + b(i)enddoend

Output from -O3 -opt_reportBlock, Unroll, Jam Report:(loop line numbers, unroll factors and type of transformation)Loop at line 6 unrolled with remainder by 4

Page 20: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

20

20© NEC Corporation 2004

Modified CodeOriginal Codedo j = 1, ndo k = 1, pdo i = 1, mc(i,j)=c(i,j)+a(i,k)*b(k,j)

enddoenddo

enddoOutput from -opt_report at -O3: Block, Unroll, Jam Report:(loop line numbers, unroll factors and type of transformation)Loop at line 1 unrolled and jammed by 4Loop at line 2 unrolled and jammed by 4

do j = 1, n-3, 4do k = 1, pdo i = 1, mc(i,j+0)=c(i,j+0)+a(i,k)*b(k,j+0)c(i,j+1)=c(i,j+1)+a(i,k)*b(k,j+1)c(i,j+2)=c(i,j+2)+a(i,k)*b(k,j+2)c(i,j+3)=c(i,j+3)+a(i,k)*b(k,j+3)

enddoenddo

enddodo j = j, ndo k = 1, pdo i = 1, mc(i,j)=c(i,j)+a(i,k)*b(k,j)

enddoenddo

enddo

“jam” n iterations outer loop into each pass of inner loopactives at compile level -O3, no user control

Unroll and Jam

Page 21: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

21

21© NEC Corporation 2004

Reduction Optimizationsplit sum into n partial sums then collect partial sumsactives at compiler level -O3, no user control

Modified Codes1 = 0.0d0s2 = 0.0d0s3 = 0.0d0s4 = 0.0d0do i = 1, n-3, 4

s1 = s1 + a(i)s2 = s2 + a(i+1)s3 = s3 + a(i+2)s4 = s4 + a(i+3)

enddodo i = i, n

s1 = s1 + a(i)enddos = s1 + s2 + s3 + s4

Original Codes = 0.0d0do i = 1, ns = s + a(i)

enddo

Output from -opt_report at -O3Block, Unroll, Jam Report:(loop line numbers, unroll factors and type of transformation)Loop at line 2 unrolled with remainder by 8 (reduction)

Page 22: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

22

22© NEC Corporation 2004

Loop Interchangechange in nesting order of 2 or more loopsunit stride on loads to reuse data in cacheactivates at compile level -O3, no user control

Modified Codedo k = 1, pdo j = 1, ndo i = 1, mc(i,j)=c(i,j)-a(i,k)*b(k,j)

enddoenddo

enddo

Original Codedo i = 1, mdo j = 1, ndo k = 1, pc(i,j)=c(i,j)-a(i,k)*b(k,j)

enddoenddo

enddo

Output from -opt_report using -O3LOOP INTERCHANGE in test_ at line 1LOOP INTERCHANGE in test_ at line 2LOOP INTERCHANGE in test_ at line 3

Page 23: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

23

23© NEC Corporation 2004

Loop Fusioncombine 2 or more loops with same iteration count into 1reduces overhead, can promote data sharing and computational complexitycan block pipelining if too many registers get usedactivates at compile level -O3, no user control

subroutine vmul(n,alpha,x,y)integer n, idouble precision x(*), y(*)do i = 1, n

x(i) = x(i) * alphaenddodo i = 1, n

y(i) = y(i) * alphaenddoendOutput from -opt_report at -O3Fused Loops: ( 5 8 )

subroutine vmul(n,alpha,x,y)integer n, idouble precision x(*), y(*)do i = 1, n

x(i) = x(i) * alphay(i) = y(i) * alpha

enddoend

Original Code Modified Code

Page 24: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

24

24© NEC Corporation 2004

Loop Fissionalso called loop distribution or loop splittingsmaller loops for optimization like prefetching and pipeliningactivates at compile level -O3

!DIR DISTRIBUTE POINT focuses compiler on trying fusion-outside loop, compiler sets location-inside loop, programmer sets location

Modified Codedo i = lft, llt

fail(i)=1.0hgener(i)=0.0diagm(i)=1.e+31sieu(i)=ies(nnm1+i)

enddodo i = lft, lltx1(i)=x(1,ix1(i))y1(i)=x(2,ix1(i))z1(i)=x(3,ix1(i))

enddo

Original Codedo i = lft, llt

fail(i)=1.0hgener(i)=0.0diagm(i)=1.e+31sieu(i)=ies(nnm1+i)

!dir$ distribute pointx1(i)=x(1,ix1(i))y1(i)=x(2,ix1(i))z1(i)=x(3,ix1(i))

enddoOutput of -opt_report for -O3LOOP DISTRIBUTION in test_ at line 1Distributed for large ii at __ performed in test_ at line 7

Page 25: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

25

25© NEC Corporation 2004

Compile Directive - Variable Dependenciescompiler cannot determine array references overlap so pipelining is blocked!dir$ IVDEP - fortran loop level#pragma IVDEP - C loop levelC restrict pointer - routine call level

subroutine test(nu1,nu2, tx, sxx, dd1, dd2, np)integer nu1, nu2, i3, k, np(*)double precision tx(3,*), sxx(9,*), dd1, dd2

!dir$ ivdepdo k = nu1, nu2

i3 = np(k)tx(1,i3) = tx(1,i3) + sxx(1,k)*dd1 + sxx(2,k)*dd2tx(2,i3) = tx(2,i3) + sxx(4,k)*dd1 + sxx(5,k)*dd2tx(3,i3) = tx(3,i3) + sxx(7,k)*dd1 + sxx(8,k)*dd2

enddoend

Page 26: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

26

26© NEC Corporation 2004

Other Compile Directivesprefetch

data values are loaded into cache ahead of access to reduce latency in loading datatoo much prefetching can cause cache misses!dir$ prefetch a!dir$ noprefetch b

software pipeliningdefault in compile level -O2 and -O3directive to switch on/off software pipelining!dir$ swp!dir$ noswp

loopcounthint for compiler to optimize loop!dir$ maxloopcount_value

Page 27: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

27

27© NEC Corporation 2004

Case Studies

Examples of tuning efforts on Chemistry applications:

MOLPROMOLCASAMBERGAMESSUser Molecular Dynamics (MD) Code

Page 28: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

28

28© NEC Corporation 2004

MOLPRO 2002.6MOLPRO is designed and maintained by H.-J. Werner (University of Stuttgart) and P. J. Knowles (University of Birmingham)

http://www.molpro.net/

MOLPRO is a complete system of ab initio programs for molecular electronic structure calculations with an emphasis on highly accurate computations and extensive treatment of the electron correlation.

Mixed Fortran and C using PNNL Global Arrays (GA) for shared memory and parallel MPI tasks. Driver program checks license and exec’s computational executable. 64 bit real, integer, pointer

normal_dft datasetadrenaline SVWN, BLYP from MOLPRO bench directory

Page 29: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

29

29© NEC Corporation 2004

MOLPRO 2002.6upgrade source from MolPro2002.5 to MolPro2002.6compiler initial optimization -O2 and < 20 routines –O0upgrade MathKeisan for blas and lapackprofile top routines using MolPro development environment to run computational unit stand-alone: normal_dft

21.8% uncompress_double9.0% dfti_block7.8% dft_rho5.2% dform5.0% dgenrl

aioint, compress=0 turns off compressionat end of molpro2002.6/bin/molproi.rc add:nocompress

dft/dfti.f -O3 gives correct numerical results for all QA another customer test profile showed top routines addmx, mxmas, fzero tuned with -O3 (3 files) gave 22% improvement

Page 30: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

30

30© NEC Corporation 2004

MOLPRO 2002.6

0

200

400

600

800

1000

1200

initial final uncompress

real

sec

onds

1cpu2cpu4cpu

normal_dft

Page 31: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

31

31© NEC Corporation 2004

MOLCAS 5.4MOLCAS is developed and distributed from the Department of Theoretical Chemistry at Lund University

http://www.teokem.lu.se/molcas/

MOLCAS is a quantum chemistry software package with a focus on methods that will allow an accurate ab initio treatment of very general electronic structure problems for molecular systems in both ground and excited states. MOLCAS is a research product developed by scientists to be used by scientists.

Fortran with some C systems control routines, MPI parallel and shared memory through PNNL GA. 64 bit real, integer, pointer. Independent modules are called in sequence from MOLCAS-generated scripts.

test902 performance test provided in the MOLCAS distribution. Disk-based Hartree-Fock using modules seward and scf

Page 32: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

32

32© NEC Corporation 2004

MOLCAS 5.4initial compile optimization -O2 with 4 routines at -O0, static linkserial focus due to limited parallel implementation in MOLCAS 5.4

parallel support targeted for next release.4 main modules:

46% seward26% caspt216% scf12% rasscf

add MathKeisan blas – 38% improvement profile showed pack / nopack are in top routinesadded nopack keyword for Seward module input

Page 33: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

33

33© NEC Corporation 2004

MOLCAS 5.4

0100200300400500600700800900

initial final nopack

real

sec

onds rasscf

caspt2scfseward

test902

modules

Page 34: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

34

34© NEC Corporation 2004

AMBER 7.0Distributed by University of California, San Francisco

http://amber.scripps.edu/

About 60 programs that work reasonably well together solving molecular modeling and dynamics problems.

sander- Simulated annealing with NMR-derived energy restraints. - main program used for molecular dynamics simulations.

Mostly Fortran with C for dynamic memory and file handling and MPI parallel. Each module is a separate executable. Sander is built from 98 source files consisting of 408 subroutines.

sander gb_mb dataset provided in AMBER benchmarks directoryGeneralized Born myoglobin simulation, 2492 atoms in 100,000 waters20 Ang. cutoff, salt 0.2Molar, long-range forces every 4 cycles

Page 35: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

35

35© NEC Corporation 2004

AMBER 7.0initial distribution compile optimization

compile level2=-O2compile level3=-O2 used for 36 out of 96 files

profile shows 98% of time spent in egbsubroutine egb is composed of 3 large loops

upgrade MathKeisan blas and lapackcompile tuning

level3=-O3, QA shows egb has numerical errorslevel3=-O3 used for all other level3 files, used -O2 for egb

code restructuringtried splitting egb_loop1 and egb_loop2 - same performancetested impact of !dir$ IVDEP in many locations; no improvement

Page 36: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

36

36© NEC Corporation 2004

AMBER 7.0

Sander gb_mb

010203040506070

initial final

real

sec

onds

other

egb_loop3

egb_loop2

egb_loop1

Routines

Page 37: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

37

37© NEC Corporation 2004

GAMESS-USGAMESS is maintained by the members of the Gordon research group at Iowa State University

http://www.msg.ameslab.gov/GAMESS/GAMESS.html

The General Atomic and Molecular Electronic Structure System (GAMESS) is a program for ab initio quantum chemistry.

Fortran single executable using MPI over sockets for parallel. 64bit real, integer, pointer

gms_hf datasetnicotine Hartree Fock SCF single point calculationone of GAMESS benchmarks distributed in early 1990’s

Page 38: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

38

38© NEC Corporation 2004

GAMESS-USprofile gms_hf:

42.0% hstar9.5% dspdfs3.7% forms

initial compiler optimization -O2-O3 top 5 routines: slower-O1 hstar: slowerhstar: ivdep on nonpipelining loops: slowerturn off packing: sloweruse vector code for dspdfc & forms: slowerupdate source, compiler, MathKeisansaw some tests with 12% time in __libc_read: comes from compile libraries on cross-mounted disk: use static link and no shared object (lib*.so)

Page 39: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

39

39© NEC Corporation 2004

GAMESS-US

0

50

100

150

200

250

300

350

initial final

real

sec

onds

1cpu2cpu4cpu

gms_hf

Page 40: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

40

40© NEC Corporation 2004

User MD Tuning - compile optionscustomer provided molecular dynamics application, written in about 12,000 lines of C with customer test inputoriginal compile flags

ineffective with -mp (maintain precision) severly limits performanceecc -O3 -tpp2 -Zp16 -static -mp -IPF_fma -IPF_fltacc

tuned compile flagsecc -O3 -tpp2 -Zp16 -static -restrict -ftz -ip-restrict (or -ansi_alias) tells compiler that pointers labeled restrict in routine call have independent locations

- restrict pointers provides information about potential dependencies that can enable pipelining

- restrict pointers declared on most of MD’s malloc’ed locations -ip turns on inter-procedural optimizing within the file that is being compiled

Page 41: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

41

41© NEC Corporation 2004

User MD Tuning - code modificationsuse tuned library call

replace original code with MathKeisan function vdSinCoscomputes (sin(a[i],cos(a[i]) for i=0…npipelines multiple sin and cos calculations

programmer level loop recodingsplit main loop of important routine into 3

- non-bonded pair list generation- calculation of force potentials across all pairs- application of forces across all pairs

enables pipelining

Page 42: Performance Optimization and Tuning for NEC TX7 · 2009. 4. 3. · Performance Optimization and Tuning for NEC TX7 including MOLPRO, MOLCAS, AMBER, GAMESS-US ... help compiler optimize

42

42© NEC Corporation 2004

User MD

0

5

10

15

20

25

real

sec

onds

originalcompile flagsrestrict pointersvdSinCos()split loop