Post on 01-Apr-2015
The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology
Basic Intel software requirements for Bureau applications specialists for development and
optimisation on SUN HPC platform
Ilia Bermous
Senior ITO, CAWCR
22 January 2010
Stages in Optimisation Process
Compilation and building an executable
Execution
Optimisation or code restructuring for better performance
Compilation and Building Executable (1)
Are any static analysis tools similar in flavour to Cray cflint available?
The user should be able to identify easily what has been, and how it has been,
Optimised
Vectorised
Parallelised
by the compiler. An example is the transformation and formatted listings
provided with the NEC SX compilers and cross-compilers
Example
      subroutine vec_par(a,b,c,n,m)
      real, dimension (m,n,n) :: a,b,c   ! first extent is m, the range of k
!cdir concur
      do i=1,n
        do k=1,m
          do j=1,n
            a(k,j,i)=b(k,j,i) + c(k,j,i)
          end do
        end do
      end do
      end
(!cdir concur is the parallelisation directive)
Transformation Listing
.     do i = 1, n
.      if (n .gt. 0) then
.       J1 = and(n,3)
.       do j = 1, J1
. !CDIR NODEP
.        do k = 1, m
.         a(k,j,i) = b(k,j,i) + c(k,j,i)
.        end do
.       end do
.       do j = 1, n/4
. !CDIR NODEP
.        do k = 1, m
.         a(k,(j-1)*4+J1+1,i) = b(k,(j-1)*4+J1+1,i) + c(k,(j-1)*
.    1    4+J1+1,i)
.         a(k,(j-1)*4+2+J1,i) = b(k,(j-1)*4+2+J1,i) + c(k,(j-1)*
.    1    4+2+J1,i)
.         a(k,(j-1)*4+3+J1,i) = b(k,(j-1)*4+3+J1,i) + c(k,(j-1)*
.    1    4+3+J1,i)
.         a(k,j*4+J1,i) = b(k,j*4+J1,i) + c(k,j*4+J1,i)
.        end do
.       end do
.      endif
.     end do
.     end do
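loop interchange: the k loop is moved innermost
loop unrolling: the j loop is unrolled by 4, with the and(n,3) leftover iterations done first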
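The unrolling pattern in the listing (remainder iterations first, then blocks of four) can be sketched in C as follows. This is only an illustration under simplifying assumptions: the arrays are one-dimensional, indices start at 0, and the function names add_simple and add_unrolled4 are hypothetical, not compiler output.

```c
#include <stddef.h>

/* Reference loop: a[j] = b[j] + c[j] for every j. */
void add_simple(double *a, const double *b, const double *c, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + c[j];
}

/* The same loop unrolled by 4, mirroring the listing's structure. */
void add_unrolled4(double *a, const double *b, const double *c, size_t n)
{
    size_t j1 = n & 3;                 /* n mod 4, like J1 = and(n,3) */

    /* Remainder loop: the first j1 elements, one at a time. */
    for (size_t j = 0; j < j1; j++)
        a[j] = b[j] + c[j];

    /* Main loop: n/4 trips, four elements per trip. */
    for (size_t blk = 0; blk < n / 4; blk++) {
        size_t j = j1 + 4 * blk;
        a[j]     = b[j]     + c[j];
        a[j + 1] = b[j + 1] + c[j + 1];
        a[j + 2] = b[j + 2] + c[j + 2];
        a[j + 3] = b[j + 3] + c[j + 3];
    }
}
```

Both functions should fill a with identical values for any n, which is the property a user would want to confirm when reading a transformation listing.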
Formatted Listing
   .
   .
   .
 4: P------>  do i=1,n
 5: |X----->    do k=1,m
 6: ||+---->      do j=1,n
 7: |||             a(k,j,i)=b(k,j,i) + c(k,j,i)
 8: ||+----       end do
 9: |X-----     end do
10: P------   end do
   .
   .
   .
the loops on lines 5-9 are interchanged and vectorised
the loop on lines 4-10 is parallelised
Compilation and Building Executable (2)
Requirement for more robust Fortran/C compilers. We have been affected by compiler bugs: as soon as we started to use the Fortran compiler for our applications, an optimisation bug was detected, and a number of other compiler problems have also been reported to Intel
Current Intel compiler versions still have too many bugs
According to a recent report (*), 169 bugs were fixed in the latest release (version 11) of the Fortran compiler. In my view, some of them are very dangerous.
(*) http://software.intel.com/en-us/articles/intel-professional-edition-compilers-111-fixes-list/
Release notes for each compiler revision should include a section stating what has actually been implemented for this particular revision
Execution
At the end of execution, important performance characteristics should be readily available so that the user can tell whether the application has run efficiently or not
On NEC SX, for any Fortran application the following characteristics are printed out when an environment variable is set, without any impact on the application performance
MFLOPS
Vector Operation Ratio and Average Vector Length
Instruction/Operand Cache miss time
Bank Conflict time
The NEC ftrace tool provides similar performance characteristics for any program unit compiled with a special “-ftrace” option
With code directives, this information can be obtained for any code section, starting and ending anywhere in any program unit
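Where no such tool is available, timing an arbitrary code section can be approximated by hand. The sketch below is not the NEC ftrace interface: region_begin, region_end, region_report and the workload smooth are hypothetical names, and the standard C clock() call is assumed to be an acceptable measure (it reports CPU time only, not vector or cache statistics).

```c
#include <stdio.h>
#include <time.h>

#define MAX_REGIONS 8                  /* hypothetical fixed limit */

static clock_t region_start[MAX_REGIONS];
static double  region_seconds[MAX_REGIONS];
static long    region_calls[MAX_REGIONS];

/* Mark the start of region id (0 <= id < MAX_REGIONS). */
void region_begin(int id)
{
    region_start[id] = clock();
}

/* Mark the end of region id and accumulate its CPU time. */
void region_end(int id)
{
    region_seconds[id] += (double)(clock() - region_start[id]) / CLOCKS_PER_SEC;
    region_calls[id] += 1;
}

/* Print one summary line for each region entered at least once. */
void region_report(void)
{
    for (int id = 0; id < MAX_REGIONS; id++)
        if (region_calls[id] > 0)
            printf("region %d: calls=%ld cpu=%.6f s\n",
                   id, region_calls[id], region_seconds[id]);
}

/* Example workload: region 0 covers the update loop only, not the final sum. */
double smooth(double *x, int n, int sweeps)
{
    double sum = 0.0;
    for (int s = 0; s < sweeps; s++) {
        region_begin(0);
        for (int i = 1; i < n - 1; i++)
            x[i] = 0.5 * x[i] + 0.25 * (x[i - 1] + x[i + 1]);
        region_end(0);
    }
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

Calling smooth and then region_report prints the call count and accumulated CPU time of the marked section. The begin and end markers can sit anywhere inside a routine, which is the flexibility asked for above.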
Program Information Output
Global Data of 3 processes  :           Min [U,R]            Max [U,R]        Average
==========================
Real Time (sec)             :       544.678 [0,1]        554.766 [0,2]        549.728
User Time (sec)             :      3383.378 [0,0]       3598.739 [0,2]       3479.353
System Time (sec)           :        14.129 [0,1]         14.478 [0,2]         14.305
Vector Time (sec)           :       334.675 [0,1]        346.198 [0,0]        340.617
Instruction Count           :   38739990085 [0,1]    40868170971 [0,0]    40022002145
Vector Instruction Count    :    7456076063 [0,1]     7942498968 [0,2]     7725412162
Vector Element Count        :  997371328007 [0,1]  1069853462653 [0,2]  1031688872790
FLOP Count                  :  337475560162 [0,0]   342575608843 [0,2]   339235679815
MOPS                        :       297.648 [0,1]        313.572 [0,0]        305.847
MFLOPS                      :        95.193 [0,2]         99.745 [0,0]         97.547
Average Vector Length       :       132.153 [0,0]        134.700 [0,2]        133.540
Vector Operation Ratio (%)  :        96.881 [0,0]         97.050 [0,2]         96.963
Memory size used (MB)       :     13040.000 [0,0]      13056.000 [0,1]      13050.667
MIPS                        :        11.210 [0,1]         12.079 [0,0]         11.510
Instruction Cache miss (sec):        23.864 [0,1]         24.546 [0,2]         24.153
Operand Cache miss (sec)    :        25.692 [0,0]         26.588 [0,2]         26.193
Bank Conflict Time (sec)    :         8.762 [0,0]         11.791 [0,2]          9.997
Max. Concurrent Processes   :             8 [0,0]              8 [0,0]              8
MOPS (concurrent)           :      2090.405 [0,0]       2141.803 [0,2]       2114.935
MFLOPS (concurrent)         :       664.944 [0,0]        693.459 [0,1]        674.666
MIPS (concurrent)           :        78.606 [0,2]         80.524 [0,0]         79.564
Event Busy Count            :             0 [0,0]              0 [0,0]              0
Event Wait (sec)            :         0.000 [0,0]          0.000 [0,0]          0.000
Lock Busy Count             :         35636 [0,2]          39030 [0,0]          36770
Lock Wait (sec)             :         2.106 [0,0]          2.487 [0,1]          2.331
Barrier Busy Count          :             0 [0,0]              0 [0,0]              0
Barrier Wait (sec)          :         0.000 [0,0]          0.000 [0,0]          0.000
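Several of the rates in this output appear to follow directly from the raw counters. The sketch below shows one reading of those relationships; the formulas are an assumption inferred from the numbers above, not taken from NEC documentation, and the type and function names are hypothetical. The sample values are the Average column.

```c
#include <stdio.h>

/* Raw counters for one run (values from the Average column above). */
typedef struct {
    double user_time_sec;        /* User Time (sec)          */
    double instructions;         /* Instruction Count        */
    double vector_instructions;  /* Vector Instruction Count */
    double vector_elements;      /* Vector Element Count     */
    double flops;                /* FLOP Count               */
} counters_t;

static const counters_t sample = {
    3479.353, 40022002145.0, 7725412162.0, 1031688872790.0, 339235679815.0
};

/* Assumed: each vector element counts as one operation, so
   total operations = scalar instructions + vector elements. */
double total_ops(const counters_t *c)
{
    return c->instructions - c->vector_instructions + c->vector_elements;
}

/* Elements processed per vector instruction. */
double avg_vector_length(const counters_t *c)
{
    return c->vector_elements / c->vector_instructions;
}

/* Share of all operations that were vector element operations, in percent. */
double vector_op_ratio_pct(const counters_t *c)
{
    return 100.0 * c->vector_elements / total_ops(c);
}

/* Rates per second of user time, in millions. */
double mops(const counters_t *c)   { return total_ops(c) / c->user_time_sec / 1.0e6; }
double mflops(const counters_t *c) { return c->flops / c->user_time_sec / 1.0e6; }
double mips(const counters_t *c)   { return c->instructions / c->user_time_sec / 1.0e6; }

/* Print the derived values for comparison with the listing. */
void print_derived(const counters_t *c)
{
    printf("Average Vector Length      : %.3f\n", avg_vector_length(c));
    printf("Vector Operation Ratio (%%) : %.3f\n", vector_op_ratio_pct(c));
    printf("MOPS / MFLOPS / MIPS       : %.3f / %.3f / %.3f\n",
           mops(c), mflops(c), mips(c));
}
```

Applied to sample, each derived value comes out within about 0.1 of the printed average. An exact match is not expected, because the listing presumably averages the per-process values rather than recomputing them from averaged counters.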
I/O Information
****** File Information ******
Unit No.                         : 20
File Name                        : BX2005092518
Named                            : YES
Current Directory                : /bm/flush3/iliab/gasp/test/2005092612

I/O Exec. Count                  :    READ    WRITE     OPEN    CLOSE  INQUIRE
                                       178        0        1        0        0
                                      FIND   DEFINE FILE
                                         0             0

Format                           : UNFORMATTED
Blank                            : ----
Access                           : DIRECT
Recl (Byte)                      : 45056
Max Record No.                   : 3911
File Size (Byte)                 : 179818496
File Descriptor                  : 3
File System Type                 : NFS
Open Mode                        : READWRITE
Terminal Assignment              : NO

I/O Buffer Size (KByte,F_SETBUF) : 1024

                                   Total(In/Out)        Input       Output
Total Data Size (Byte)           :       8010552,     8010552,           0
Max Data Size (Byte)             :                      45056,           0
Min Data Size (Byte)             :                      35640,           0
Ave Data Size (Byte)             :         45003,       45003,           0
Transfer Rate (KByte/sec)        :      5746.793,    5746.793,       0.000

                                   Total(In/Out/Aux)    Input       Output
RTP-call Count                   :           535,         534,           0
System-call Count (read/write)
  Exec. Count                    :                         68,           0
  Ave Data Size (Byte)           :                    1048576,           0

Real Time (sec)                  :      1.367772,    1.361247,    0.000000
User Time (sec)                  :      0.007263,    0.007138,    0.000000

F_INPUT Option                   : NO
F_OUTPUT Option                  : NO
F_NORCW Option                   : NO
F_PARTRCW Option                 : NO
F_EXPRCW Option                  : NO
F_UFMTFLOAT1 Option              : NO
F_UFMTFLOAT2 Option              : NO
F_UFMTIEEE Option                : NO
F_UFMTENDIAN Option              : NO
F_UFMTADJUST Option              : NO
F_HSDIR Option                   : NO
F_VSPACING Option                : NO
F_PROMOTE Option                 : NO
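The transfer rate and average data size in this I/O report can be checked against the raw figures. The sketch below assumes that 1 KByte is 1024 bytes and that the rate is based on the Input column's real time; both are inferences from the numbers shown, and the function names are hypothetical.

```c
#include <stdio.h>

/* Transfer rate in KByte/sec, assuming 1 KByte = 1024 bytes. */
double transfer_rate_kb(double bytes, double real_time_sec)
{
    return bytes / real_time_sec / 1024.0;
}

/* Mean number of bytes moved per READ (or WRITE) statement. */
double avg_data_bytes(double bytes, long exec_count)
{
    return bytes / (double)exec_count;
}

/* Print both derived values for one file. */
void print_io_summary(double bytes, double real_time_sec, long exec_count)
{
    printf("Transfer Rate (KByte/sec) : %.3f\n",
           transfer_rate_kb(bytes, real_time_sec));
    printf("Ave Data Size (Byte)      : %.0f\n",
           avg_data_bytes(bytes, exec_count));
}
```

With the values reported for unit 20, print_io_summary(8010552.0, 1.361247, 178) gives a rate close to the printed 5746.793 KByte/sec and an average close to the printed 45003 bytes.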
Optimisation (1)
Need to know
What are the primary performance characteristics to look at when improving application performance?
How should these primary performance characteristics be measured?
How should these primary performance characteristics be addressed?
Optimisation (2)
Optimisation manuals and documentation:
Manuals need to include a description of each technique, illustrated by simple examples.
Significant improvement is required in the existing manuals, for example “Intel(R) Fortran Compiler Optimizing Applications”, Document Number: 307781-003US
The document "Consistency of Floating-Point Results using the Intel Compiler" was very useful for understanding how to get reproducible results, but its content should be included in the main manuals
Are there Intel websites available where further related information can be found?
Manuals and release notes should be in one place, with good indexing and searching.
Summary
We need a user-friendly software environment at each stage of the performance tuning procedure