CODE TUNING AND OPTIMIZATION Kadin Tseng Boston University Scientific Computing and Visualization.
CODE TUNING AND OPTIMIZATION
Kadin Tseng
Boston University
Scientific Computing and Visualization
Outline
• Introduction
• Timing
• Example Code
• Profiling
• Cache
• Tuning
• Parallel Performance
Introduction
• Timing
  • Where is most time being used?
• Tuning
  • How to speed it up
  • Often as much art as science
• Parallel Performance
  • How to assess how well parallelization is working
Timing
Timing
• When tuning or parallelizing a code, you need to assess the effectiveness of your efforts
• You can time the whole code and/or specific sections
• Some types of timers
  • Unix time command
  • function/subroutine calls
  • profiler
CPU Time or Wall-Clock Time?
• CPU time
  • How much time the CPU is actually crunching away
  • User CPU time
    • Time spent executing your source code
  • System CPU time
    • Time spent in system calls such as I/O
• Wall-clock time
  • What you would measure with a stopwatch
CPU Time or Wall-Clock Time? (cont’d)
• Both are useful
• For serial runs without interaction from keyboard, CPU and wall-clock times are usually close
  • If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not
CPU Time or Wall-Clock Time? (3)
• Parallel runs
  • Want wall-clock time, since CPU time will stay about the same or even increase as the number of processors is increased
  • Wall-clock time may not be accurate if sharing processors
  • Wall-clock timings should always be performed in batch mode
Unix Time Command
• easiest way to time code
• simply type time before your run command
• output differs between C-type shells (csh, tcsh) and Bourne-type shells (sh, bash, ksh)
Unix Time Command (cont’d)

    katana:~ % time mycode
    1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

• 1.570u: user CPU time (s)
• 0.010s: system CPU time (s)
• 0:01.77: wall-clock time (min:s)
• 89.2%: (u+s)/wc
• 75+1450k: avg. shared + unshared text space
• 0+0io: input + output operations
• 64pf+0w: page faults + no. of times proc. was swapped
Unix Time Command (3)
• Bourne shell results

    $ time mycode
    real 0m1.62s
    user 0m1.57s
    sys  0m0.03s

  • real: wall-clock time
  • user: user CPU time
  • sys: system CPU time
Example Code
Example Code
• Simulation of response of eye to stimuli (CNS Dept.)
• Based on Grossberg & Todorovic paper
  • Contains 6 levels of response
  • Our code only contains levels 1 through 5
  • Level 6 takes a long time to compute, and would skew our timings!
Example Code (cont’d)
• All calculations done on a square array
• Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
Level 1 Equations
• Computational domain is a square
• Defines square array I over domain (initial condition)
[Figure: the square domain, with bright and dark regions]
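Level 2 Equations

$$C_{pqij} = C \exp\{-\alpha^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$E_{pqij} = E \exp\{-\beta^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$x_{ij} = \frac{\sum_{p,q} (B\,C_{pqij} - D\,E_{pqij})\, I_{pq}}{A + \sum_{p,q} (C_{pqij} + E_{pqij})\, I_{pq}}$$

$$X_{ij} = \max(x_{ij}, 0)$$

$I_{pq}$ = initial condition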
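Level 3 Equations

$$G_{pqij} = \exp\{-\gamma^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$H^{(k)}_{pqij} = \exp\{-\gamma^{-2}[(p-i-m_k)^2 + (q-j-n_k)^2]\}$$

$$F^{(k)}_{pqij} = G_{pqij} - H^{(k)}_{pqij}$$

$$m_k = \sin\frac{2\pi k}{K}, \qquad n_k = \cos\frac{2\pi k}{K}$$

$$y_{ijk} = \sum_{p,q} X_{pq}\, F^{(k)}_{pqij}$$

$$Y_{ijk} = \max(y_{ijk}, 0)$$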
Level 4 Equations

$$z_{ijk} = Y_{ijk} + Y_{ij(k+K/2)}$$

$$Z_{ijk} = \max(z_{ijk} - L, 0)$$
Level 5 Equation

$$Z_{ij} = \sum_k Z_{ijk}$$
Exercise 1
• Copy files from /scratch disk
    katana% cp /scratch/kadin/tuning/* .
• Choose C (gt.c and gt.h) or Fortran (gt.f90)
• Compile with no optimization (-O0 is capital oh followed by zero; -o is small oh):
    pgcc -O0 -o gt gt.c
    pgf90 -O0 -o gt gt.f90
• Submit rungt script to batch queue
    katana% qsub -b y rungt
Exercise 1 (cont’d)
• Check status
    qstat -u username
• After the run has completed, a file named rungt.o?????? will appear, where ?????? represents the process number
  • File contains result of time command
• Write down wall-clock time
• Re-compile using -O3
• Re-run and check time
Function/Subroutine Calls
• often need to time part of code
• timers can be inserted in source code
• language-dependent
cpu_time
• intrinsic subroutine in Fortran
• returns user CPU time (in seconds)
  • no system time is included

    real :: t1, t2
    call cpu_time(t1)   ! Start timer
    ! ... perform computation here ...
    call cpu_time(t2)   ! Stop timer
    print*, 'CPU time = ', t2-t1, ' sec.'
system_clock
• intrinsic subroutine in Fortran
• good for measuring wall-clock time
system_clock (cont’d)
• t1 and t2 are tic counts
• count_rate is an optional argument containing tics/sec.

    integer :: t1, t2, count_rate
    call system_clock(t1, count_rate)   ! Start clock
    ! ... perform computation here ...
    call system_clock(t2)               ! Stop clock
    print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'
times
• can be called from C to obtain CPU time

    #include <stdio.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(void) {
        long tics_per_sec;
        clock_t tic1, tic2;
        struct tms timedat;

        tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock tics per second */
        times(&timedat);                       /* start clock */
        tic1 = timedat.tms_utime;
        /* ... perform computation here ... */
        times(&timedat);                       /* stop clock */
        tic2 = timedat.tms_utime;
        printf("CPU time = %5.2f\n",
               (float)(tic2 - tic1) / (float)tics_per_sec);
        return 0;
    }

• can also get system time with tms_stime
gettimeofday
• can be called from C to obtain wall-clock time

    #include <stdio.h>
    #include <sys/time.h>

    int main(void) {
        struct timeval t;
        double t1, t2;

        gettimeofday(&t, NULL);                /* start clock */
        t1 = t.tv_sec + 1.0e-6*t.tv_usec;
        /* ... perform computation here ... */
        gettimeofday(&t, NULL);                /* stop clock */
        t2 = t.tv_sec + 1.0e-6*t.tv_usec;
        printf("wall-clock time = %5.3f\n", t2-t1);
        return 0;
    }
MPI_Wtime
• convenient wall-clock timer for MPI codes
MPI_Wtime (cont’d)
• Fortran

    double precision t1, t2
    t1 = mpi_wtime()   ! Start clock
    ! ... perform computation here ...
    t2 = mpi_wtime()   ! Stop clock
    print*, 'wall-clock time = ', t2-t1

• C

    double t1, t2;
    t1 = MPI_Wtime();   // start clock
    // ... perform computation here ...
    t2 = MPI_Wtime();   // stop clock
    printf("wall-clock time = %5.3f\n", t2-t1);
omp_get_wtime
• convenient wall-clock timer for OpenMP codes
• resolution available by calling omp_get_wtick()
omp_get_wtime (cont’d)
• Fortran

    double precision t1, t2, omp_get_wtime
    t1 = omp_get_wtime()   ! Start clock
    ! ... perform computation here ...
    t2 = omp_get_wtime()   ! Stop clock
    print*, 'wall-clock time = ', t2-t1

• C

    double t1, t2;
    t1 = omp_get_wtime();   // start clock
    // ... perform computation here ...
    t2 = omp_get_wtime();   // stop clock
    printf("wall-clock time = %5.3f\n", t2-t1);
Timer Summary

              CPU         Wall
    Fortran   cpu_time    system_clock
    C         times       gettimeofday
    MPI                   MPI_Wtime
    OpenMP                omp_get_wtime
Exercise 2
• Put wall-clock timer around each “level” in the example code
• Print time for each level
• Compile and run
PROFILING
Profilers
• profile tells you how much time is spent in each routine
  • gives a level of granularity not available with previous timers
  • e.g., function may be called from many places
• various profilers available, e.g.
  • gprof (GNU) -- function level profiling
  • pgprof (Portland Group) -- function and line level profiling
gprof
• compile with -pg
• when you run executable, file gmon.out will be created
• gprof executable > myprof
  • this processes gmon.out into myprof
• for multiple processes (MPI), copy or link gmon.out.n to gmon.out, then run gprof
gprof (cont’d)
    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

      %   cumulative    self                self     total
     time   seconds   seconds      calls  ms/call  ms/call  name
     20.5     89.17     89.17         10  8917.00 10918.00  .conduct [5]
      7.6    122.34     33.17        323   102.69   102.69  .getxyz [8]
      7.5    154.77     32.43                               .__mcount [9]
      7.2    186.16     31.39     189880     0.17     0.17  .btri [10]
      7.2    217.33     31.17                               .kickpipes [12]
      5.1    239.58     22.25  309895200     0.00     0.00  .rmnmod [16]
      2.3    249.67     10.09        269    37.51    37.51  .getq [24]
gprof (3)
    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                      called/total      parents
    index  %time    self descendents  called+self    name        index
                                      called/total      children

                    0.00      340.50       1/1        .__start [2]
    [1]     78.3    0.00      340.50       1         .main [1]
                    2.12      319.50      10/10       .contrl [3]
                    0.04        7.30      10/10       .force [34]
                    0.00        5.27       1/1        .initia [40]
                    0.56        3.43       1/1        .plot3da [49]
                    0.00        1.27       1/1        .data [73]
pgprof
• compile with Portland Group compiler
  • pgf90 (pgf95, etc.)
  • pgcc
  • -Mprof=func
    • similar to -pg
• run code
• pgprof -exe executable
  • pops up window with flat profile
pgprof (cont’d)
pgprof (3)
• To save profile data to a file:
  • re-run pgprof using -text flag
  • at command prompt type p > filename
    • filename is the name you want to give the profile file
  • type quit to get out of profiler
• Close pgprof as soon as you’re through
  • Leaving window open ties up a license (only a few available)
Line-Level Profiling
• Times individual lines
• For pgprof, compile with the flag
    -Mprof=line
• Optimizer will re-order lines
  • profiler will lump lines in some loops or other constructs
  • you may or may not want to compile without optimization
• In flat profile, double-click on function to get line-level data
Line-Level Profiling (cont’d)
CACHE
Cache
• Cache is a small chunk of fast memory between the main memory and the registers
[Figure: memory hierarchy: registers, primary cache, secondary cache, main memory]
Cache (cont’d)
• If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
• Variables are moved from main memory to cache in lines
  • L1 cache line sizes on our machines
    • Opteron (katana cluster) 64 bytes
    • Xeon (katana cluster) 64 bytes
    • Power4 (p-series) 128 bytes
    • PPC440 (Blue Gene) 32 bytes
    • Pentium III (linux cluster) 32 bytes
Cache (3)
• Why not just make the main memory out of the same stuff as cache?
  • Expensive
  • Runs hot
  • This was actually done in Cray computers
    • Liquid cooling system
Cache (4)
• Cache hit
  • Required variable is in cache
• Cache miss
  • Required variable not in cache
  • If cache is full, something else must be thrown out (sent back to main memory) to make room
• Want to minimize number of cache misses
Cache (5)

    for(i=0; i<10; i++) x[i] = i;

[Figure: main memory holds x[0] through x[9], followed by a, b, …; a “mini” cache holds 2 lines, 4 words each]
Cache (6)

    for(i=0; i<10; i++) x[i] = i;

• will ignore i for simplicity
• need x[0], not in cache: cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits
[Figure: cache now holds the line x[0], x[1], x[2], x[3]]
Cache (7)

    for(i=0; i<10; i++) x[i] = i;

• need x[4], not in cache: cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits
[Figure: cache now holds two lines: x[0], x[1], x[2], x[3] and x[4], x[5], x[6], x[7]]
Cache (8)

    for(i=0; i<10; i++) x[i] = i;

• need x[8], not in cache: cache miss
• load line from memory into cache
• no room in cache!
• replace old line
[Figure: the line x[0], x[1], x[2], x[3] has been replaced; cache now holds x[4], x[5], x[6], x[7] and x[8], x[9], a, b]
Cache (9)
• Contiguous access is important
• In C, multidimensional array is stored in memory as
    a[0][0]
    a[0][1]
    a[0][2]
    …
Cache (10)
• In Fortran and Matlab, multidimensional array is stored the opposite way:
    a(1,1)
    a(2,1)
    a(3,1)
    …
Cache (11)
• Rule: Always order your loops appropriately
  • will usually be taken care of by optimizer
  • suggestion: don’t rely on optimizer

C:

    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }

Fortran:

    do j = 1, n
        do i = 1, n
            a(i,j) = 1.0
        enddo
    enddo
TUNING TIPS
Tuning Tips
• Some of these tips will be taken care of by compiler optimization
  • It’s best to do them yourself, since compilers vary
• Two important rules
  • minimize number of operations
  • access cache contiguously
Tuning Tips (cont’d)
• Access arrays in contiguous order
  • For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab

Bad (C):

    for(j=0; j<N; j++){
        for(i=0; i<N; i++){
            a[i][j] = 1.0;
        }
    }

Good (C):

    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }
Tuning Tips (3)
• Eliminate redundant operations in loops

Bad:

    for(i=0; i<N; i++){
        x = 10;
        …
    }

Good:

    x = 10;
    for(i=0; i<N; i++){
        …
    }
Tuning Tips (4)
• Eliminate or minimize if statements within loops
  • if may inhibit pipelining

Bad:

    for(i=0; i<N; i++){
        if(i == 0)
            perform i=0 calculations
        else
            perform i>0 calculations
    }

Good:

    perform i=0 calculations
    for(i=1; i<N; i++){
        perform i>0 calculations
    }
Tuning Tips (5)
• Divides are expensive
  • Intel x86 clock cycles per operation
    • add 3-6
    • multiply 4-8
    • divide 32-45

Bad:

    for(i=0; i<N; i++) {
        x[i] = y[i]/scalarval;
    }

Good:

    qs = 1.0/scalarval;
    for(i=0; i<N; i++) {
        x[i] = y[i]*qs;
    }
Tuning Tips (6)
• There is overhead associated with a function call

Bad:

    for(i=0; i<N; i++)
        myfunc(i);

Good:

    myfunc();

    void myfunc(){
        for(int i=0; i<N; i++){
            do stuff
        }
    }
Tuning Tips (7)
• Minimize calls to math functions

Bad:

    for(i=0; i<N; i++)
        z[i] = log(x[i]) + log(y[i]);

Good:

    for(i=0; i<N; i++)
        z[i] = log(x[i] * y[i]);
Tuning Tips (8)
• recasting may be costlier than you think

Bad:

    sum = 0.0;
    for(i=0; i<N; i++)
        sum += (float) i;

Good:

    isum = 0;
    for(i=0; i<N; i++)
        isum += i;
    sum = (float) isum;
Exercise 3 (not in class)
• The example code provided is written in a clear, readable style that also happens to violate many of the tuning tips we have just reviewed.
• Examine the line-level profile. What lines are using the most time? Is there anything we might be able to do to make it run faster?
• We will discuss options as a group
  • come up with a strategy
  • modify code
  • re-compile and run
  • compare timings
• Re-examine line level profile, come up with another strategy, repeat procedure, etc.
Speedup Ratio and Parallel Efficiency
• S is the ratio of T_1 to T_N, the elapsed times with 1 and N workers.
• f is the fraction of T_1 due to code sections that are not parallelizable.

$$S = \frac{T_1}{T_N} \le \frac{T_1}{\left(f + \frac{1-f}{N}\right) T_1} = \frac{1}{f + \frac{1-f}{N}} < \frac{1}{f} \quad \text{as } N \to \infty$$

• Amdahl’s Law above states that a code with its parallelizable component comprising 90% of total computation time can at best achieve a 10X speedup with lots of workers. A code that is 50% parallelizable speeds up at most two-fold with lots of workers.
• The parallel efficiency is E = S / N
  • A program that scales linearly (S = N) has parallel efficiency 1.
  • A task-parallel program is usually more efficient than a data-parallel program.
  • Parallel codes can sometimes achieve super-linear behavior due to efficient cache usage per worker.
Example of Speedup Ratio & Parallel Efficiency