High Performance Programming with IBM XL Compilers and Libraries
SPXXL/ScicomP-17 2011 Summer Workshop
Yaoqing Gao (ygao@ca.ibm.com)
Raúl E. Silvera (rauls@ca.ibm.com)
IBM Toronto Lab
© 2011 IBM Corporation | High Performance Programming with IBM XL Compilers and Libraries | SPXXL/ScicomP-17 2011 Summer Workshop | May 2011
IBM | Software Group | Rational
IBM Rational Disclaimer
© Copyright IBM Corporation 2011. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
Agenda
Overview of XL Compiler Family
Major Features of XL C/C++ V11.1 and XL Fortran V13.1
Migration from GNU to XL compilers
XML Compiler Transformation Reports
Compiler Optimizations for Performance
– Profile Directed Feedback Optimization
– SIMDization and Vectorization
– Loop Transformations
– Data Prefetch
– Data Reorganization
– Inlining
– Parallelization
Overview of XL Compiler Family
Similar compilation technology used to implement C, C++ and Fortran compilers
Supports AIX, Linux on Power, z/OS (C/C++ only), Blue Gene, Cell
Advanced optimization capabilities
– Exploitation and tuning for latest hardware implementations
– Aggressive loop analysis and transformations (unimodular and polyhedral framework)
– Whole program optimization
– SIMD code generation and vectorization exploitation
– Parallelization (automatic and user-driven through OpenMP)
– Profile-driven optimization
Major Features of XLC 11.1 and XLF 13.1
POWER7 support and exploitation
Language standard conformance
– Full Fortran 2003
– Full OpenMP 3.0
– Additional C++0x features
Optimization enhancements
– Automatic parallelization
– Inlining
– Loop analysis and transformations
– Delinquent load driven optimizations
– Profile-directed feedback
Productivity enhancement
– XML compiler transformation reports
Other features
– Fine-grained strict control
– ProPolice
– func_trace
– Other options and directives
HPC Performance Tuning with XL Compilers
-O3 -qarch=pwr7 or -O3 -qhot -qarch=pwr7, with -qnostrict or -qstrict
Profiling for hot spot detection
  Compiler instrumentation: -qpdf1=level=1 or 2, -qpdf2
  -pg for gprof/xprofiler, -qlist for tprof
  User-provided profile functions: -qfunctrace
SIMDization
  Automatic SIMDization: -O3 or above with -qsimd
  User explicit SIMD program: -qaltivec
Loop transformations
  Loop transformations: -O3 or above
Whole program optimizations
  -O4 or -O5 for inter-procedural optimization: inlining, code partition, data reorganization
Parallelization
  User explicit parallelization only: -qsmp=omp
  Auto parallelization: -qsmp (-qsmp=auto)
  Polyhedral framework: -qsmp with -qhot=level=2
XML Transformation Reports
  -qlistfmt=xml=all
Migration from GCC to IBM XL Compilers
Compatibility
– Source level
  • Supports many gcc/g++ language extensions and annotations
– Binary level
  • Link objects from gcc/g++ and XL C/C++
gxlc and gxlc++ utilities for compile option mapping
– Controlled by the gxlc.cfg configuration file for option mappings from GCC to XL C/C++
– Modify the contents of gxlc.cfg to meet your specific compilation requirements
gxlc, gxlc++, gxlC
Syntax:
  gxlc | gxlc++ | gxlC  -v  -Wx,<XL compiler options>  <gcc/g++ options>  filename
Creating a customized configuration file (use the XLC_USR_CONFIG environment variable to specify the location of your own configuration file)
gxlc.cfg format:
  abcd gcc_or_g++_option "xlc_or_xlc++_option"
e.g.
  nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__
  nnn -B -B
  nnn -C -C
  nnn -c -c
  nnn -dM -qshowmacros
  nnn -D -D
  E E
XML Compiler Transformation Reports
Generate compilation reports consumable by other tools
– Enable better visualization and analysis of compiler information
– Help users do manual performance tuning
– Help automatic performance tuning through performance tool integration
Unified report from all compiler subcomponents and analysis
– Compiler options
– Pseudo-sources
– Compiler transformations, including missed opportunities
Consistent support among Fortran, C/C++
Controlled under option:
  -qlistfmt=xml=inlines    generates inlining information
  -qlistfmt=xml=transform  generates loop transformation information
  -qlistfmt=xml=data       generates data reorganization information
  -qlistfmt=xml=pdf        generates dynamic profiling information
  -qlistfmt=xml=all        turns on all optimization content
  -qlistfmt=xml=none       turns off all optimization content
Compiler Transformation Report Contents
Program characteristics
– Compiler version
– Date of compilation
– Source file table
– Function table
– Loop table (line number, nest level, iteration count, loop attributes)
– Transformation table (line number, transformation description)
– Pseudocode
Transformations at both high and low-level optimizations
– Intra-procedural transformations
  • Loop transformations
  • Data prefetch
  • Vectorization and SIMDization
  • Parallelization
  • Instruction scheduling
– Inter-procedural transformations
  • Inlining
  • Data reorganization
Profiling information
– Basic block counters
– Call counters
– Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
Compiler Optimizations for Performance
– Compiler static analysis and optimization
– User guided optimizations through compiler options and directives
– Automatic compiler optimizations through profile directed feedback
XL Compiler and Tooling Integration
– Compiler feedback view in PTP
– Compiler transformation reports and HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Diagram: source code is compiled and linked with -qpdf1 (static analysis, profile based refinement) into an instrumented application; running it on sample inputs produces profile data; compiling and linking with -qpdf2 (profile directed optimizations) produces the optimized application.]
Hardware and software constraints
Multiple sample runs for different hardware performance events
Profile based instrumentation refinement
Basic Block and Call Counter Information
Step 1: compile the application with -qpdf1 to generate an instrumented executable
Step 2: run the executable with a typical input data set to gather profiling information
Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters
  Call Name    Call Execution Count   Line
  smtctl_      0                      79
  is_smt_on_   1                      55
  jbind_       1                      109
  jbind_       0                      108
  aff_         1                      38

Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters
  Block Index   Block Execution Count   Start Line   End Line
  3             5                       1            33
  4             5                       33           34
  ……
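As a rough illustration of where the three steps above can pay off, the sketch below contains a branch whose bias depends entirely on the input data, which is exactly what a training run records. The function name `count_above`, the file name, and the input file in the comment are hypothetical; only the -qpdf1, -qpdf2 and -qlistfmt=xml=all options come from the slides.

```c
#include <stddef.h>

/* Hypothetical hot function. Whether the branch is mostly taken or
 * mostly not taken depends on the data, so the compiler cannot know
 * it statically; the profile gathered in Step 2 supplies it.
 *
 * Build sketch following Steps 1 to 3 (file names are assumptions):
 *   xlc -O3 -qpdf1 app.c -o app
 *   ./app < typical_input
 *   xlc -O3 -qpdf2 -qlistfmt=xml=all app.c -o app
 */
long count_above(const int *v, size_t n, int threshold)
{
    long hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)   /* bias only visible in profile data */
            hits++;
    }
    return hits;
}
```

The block and call counters shown in the report correspond to how often code such as the loop body and the `if` arm actually ran during the training run.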
Cache Miss Information
Columns: Memory Reference, Region, Line, Cache Level, Miss Count, Miss Rate
((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 408 2 17446 24
((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 412 2 9999 14
((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 413 2 9974 14
((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]
3 525 2 1033429 6
((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]
3 557 2 11553 16
Callouts on the slide: delinquent load, source code location, cache miss
Loop information
Loop Table
Columns: Loop Index, Start Line, End Line, Parent Loop Index, Nest Level, Minimum Cost, Maximum Cost, Iteration Count, Attributes
1  203  1  19630  19630  75 (array)
  • well behaved
  • bump normalized
  • lower bound normalized
2  188  1  20413  20413  149 (array)
  • perfect nest
  • well behaved
  • bump normalized
  • guarded
  • lower bound normalized
3  141  1  13300  13300  100 (default)
  • perfect nest
  • well behaved
  • bump normalized
  • guarded
  • lower bound normalized
4  203  1  19630  19630  5 (PDF)
  • residual
  • well behaved
  • bump normalized
  • guarded
  • lower bound normalized
Callouts on the slide: source location; loop iteration count based on static analysis or dynamic profiling
Loop Transformation Reports
Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number 108; Loop Line Number 206
4 | LoopVectorVersion (success) | High Level Optimizer | 4 | 108 | 3 | Vector versioning was performed | not available
20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval 12
Performance Tuning with Compiler Transformation Reports
-qlistfmt=xml=all

Original source file (file.c):
  void foo (float *p, float *q, float *r, int n) {
    for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i];
  }
Report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.

Tuning

Modified source file (file.c):
  void foo (float * restrict p, float * restrict q, float * restrict r, int n) {
    for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i];
  }
Report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
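The tuning step above can be sketched as a pair of functions that differ only in the `restrict` qualifier. The names `fma_may_alias` and `fma_no_alias` are made up for this illustration; the loop body is the one from the slide. With `restrict`, the caller promises that the three arrays do not overlap, which removes the aliasing dependency the report complained about.

```c
/* Without restrict: the compiler must assume that a store to p[i]
 * could change a later q[j] or r[j], so iterations look dependent. */
void fma_may_alias(float *p, const float *q, const float *r, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}

/* With restrict (C99): p, q and r are promised not to overlap, so
 * iterations are independent. Passing overlapping arrays here would
 * be undefined behavior. */
void fma_no_alias(float *restrict p, const float *restrict q,
                  const float *restrict r, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}
```

Both functions compute the same result for non-overlapping inputs; the difference is only in what the optimizer is allowed to assume.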
Explicit SIMD programming for POWER7
Enabled under -qaltivec
Successor to Altivec programming extensions on POWER6/PPC970
– Altivec data types: 16-byte vectors
  vector char (16 elements), vector short (8 elements), vector pixel (8 elements), vector int (4 elements), vector float (4 elements)
– VSX Altivec extensions: 16-byte vectors
  vector double (2 elements), vector long long (2 elements)
Altivec built-in functions extended to new data types
  vec_add(vector double, vector double), vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, …
Unaligned load and store operations
– Altivec truncating loads/stores still available: vec_ld, vec_st
– New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
Automatic SIMDization for VMX and VSX
– Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
Features
– Basic block level SIMDization
– Loop level aggregation
– Data conversion, reduction
– Loop with limited control flow
– Automatic SIMDization with -qstrict (VSX) and -qnostrict
– Support of unaligned vector memory accesses (VSX)
– Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report: memory accesses have non-vectorizable alignment
User actions:
– Use __attribute__((aligned(n))) to set data alignment
– Use __alignx(16, a) to indicate the data alignment to the compiler
– Use -qassert=refalign if all references are naturally aligned
– Use array references instead of pointers where possible
Transformation report: data dependence prevents SIMD vectorization
User actions:
– Use fewer pointers when possible
– Use #pragma independent if it has no loop carried dependency
– Use #pragma disjoint (*a, *b) if a and b are disjoint
– Use restrict keyword or compiler option -qrestrict
Transformation report: Loop was SIMD vectorized
Transformation report: It is not profitable to vectorize
User actions:
– Use #pragma simd_level(10) to force the compiler to do SIMDization
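A minimal sketch of the alignment advice above, assuming a compiler that accepts the GCC-style `__attribute__((aligned(n)))` spelling listed on the slide. The array names, the helper `is_aligned16`, and the function `scale_b_into_a` are hypothetical. The XL built-in `__alignx(16, a)` is mentioned only in a comment because it is specific to the XL compilers.

```c
#include <stdint.h>

#define N 64

/* 16-byte alignment matches the 16-byte vector registers.
 * With XL, __alignx(16, ptr) can assert the same property for a
 * pointer whose declaration the compiler cannot see. */
static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));

/* Returns 1 when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Stride-one loop that indexes the arrays directly rather than
 * through pointers, as the slide recommends. */
void scale_b_into_a(float factor)
{
    for (int i = 0; i < N; i++)
        a[i] = factor * b[i];
}
```

Declaring the alignment at the definition lets the compiler see it without any runtime check; `is_aligned16` is only there to confirm the property.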
SIMDization Tuning
Transformation report: memory accesses have non-vectorizable strides
User actions:
– Loop interchange for stride-one accesses when possible
– Data layout reshape for stride-one accesses
– Higher optimization to propagate compile known stride information
– Stride versioning
Transformation report: either operation or data type is not suitable for SIMD vectorization
User actions:
– Do statement splitting and loop splitting
Transformation report: loop structure prevents SIMD vectorization
User actions:
– Convert while-loops into do-loops when possible
– Limited use of control flow in a loop
– Use MIN, MAX instead of if-then-else
– Eliminate function calls in a loop through inlining
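The "use MIN, MAX instead of if-then-else" advice can be sketched as follows. The function names and the `MAX` macro are illustrative, not taken from the slides. Both versions compute the element-wise maximum; the second expresses it as a single select-style expression with no control flow in the loop body. Many compilers convert the first form on their own, so this is a hint rather than a guarantee.

```c
#define MAX(x, y) ((x) > (y) ? (x) : (y))

/* Control flow inside the loop body. */
void max_branch(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        if (a[i] > b[i])
            c[i] = a[i];
        else
            c[i] = b[i];
    }
}

/* Same computation written as one expression per element. */
void max_select(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = MAX(a[i], b[i]);
}
```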
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
– POWER7 vector MASS library (libmassvp7.a)
  • Internally exploit VSX instructions
  • SP: average speedup of 1.99 vs POWER5 MASSV
  • DP: average speedup of 1.27 vs POWER5 MASSV
– POWER7 SIMD MASS library (libmass_simdp7.a)
  • Tuned math routines operating on vector data types
  • Over 35 frequently used mathematical functions
  • Both single and double precision
  • To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above
-qstrict=vectorprecision to maintain precision over all loop iterations
Source loop:
  for (i = 0; i < n; i++)
    b[i] = sqrt(a[i]);
Generated call:
  __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over POWER7 prefetch engine, supporting up to 12 data streams
Fine grained software controlled data prefetch, including stream type, stream length, stream stride, and prefetch depth, at optimization level -O3 -qhot or above
– More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
  void __dcbtt(void *address)
  void __dcbtstt(void *address)
Partial cache line touch
  void __partial_dcbt(void *address)
Stride-N stream prefetch
  void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
  void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
  void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
Loop to be prefetched:
  for (i = 0; i < n; i++) a[i] = b[i] + …

Store stream prefetch for array a:
  __protected_store_stream_set(FORWARD, &a, 11);
  __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
Transient stream prefetch for array b:
  __protected_stream_set(FORWARD, &b, 0);
  __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
  __eieio();
Start stream prefetch:
  __protected_stream_go();

Callouts on the slide: stream direction (FORWARD), stream length (n*sizeof(double)/128), prefetch depth (DEEPER), stream id (last argument)
Loop Optimization
Traditional unimodular loop transformations for perfect regular loop nests
– Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
– Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
– Under the optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
– Provide abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  • Loop skewing, loop tiling for triangular loop shapes
– Perform exact dependence testing through unified dependence formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
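To make one of the listed transformations concrete, here is a hand-written sketch of loop tiling applied to a small matrix multiply. It only illustrates what tiling does to the loop structure; it is not the code the XL compiler generates. The sizes `N` and `T` and both function names are assumptions, and `T` is assumed to divide `N`.

```c
#define N 8   /* matrix dimension (assumption for the sketch) */
#define T 4   /* tile size, assumed to divide N */

/* Straightforward triple loop. */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* Tiled version: the three outer loops walk over T x T blocks, the
 * three inner loops work inside one block, so each block of a, b and
 * c is reused while it is still in cache. */
void matmul_tiled(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

The `block_loop` pragma mentioned above is the directive-level way to request this kind of blocking from the compiler instead of writing it by hand.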
Polyhedral Loop Transformation Examples
Dependence analysis
  do i = 1, N
    do j = 1, i
      do k = i+1, N
        c(k) = c(j) + b(j,k)
Enable interchange of j and k loops to improve access locality for b
– Identifies independence of memory accesses to c
Affects all optimization levels that include -qhot

Loop transformations
Tiling of triangular matrix multiplication
  do i = 1, N
    do j = i+1, N
      do k = i+1, N
        c(j,i) += a(k,i)*b(j,k)
Also allows transformation of imperfect loop nests
– Intervening code between loops
Only available at -qhot=level=2
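The dependence-analysis example above can be checked by hand in C. Within one `i` iteration the loop reads `c(j)` only for `j <= i` and writes `c(k)` only for `k > i`, so the reads and writes never touch the same element and the `j` and `k` loops can be swapped. In the sketch below the function names and the size `N` are assumptions, indices stay 1-based to match the Fortran, and `b` is stored as `b[k][j]` so that C's row-major layout mirrors Fortran's column-major `b(j,k)`.

```c
#define N 6   /* problem size (assumption for the sketch) */

/* Original order: j outer, k inner. The inner loop strides through
 * b by a whole row per iteration. */
void tri_jk(double c[N + 1], double b[N + 1][N + 1])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= N; k++)
                c[k] = c[j] + b[k][j];
}

/* Interchanged order: k outer, j inner. The inner loop now walks
 * b[k][1..i] with stride one. Legal because reads (index <= i) and
 * writes (index > i) of c are disjoint within one i iteration. */
void tri_kj(double c[N + 1], double b[N + 1][N + 1])
{
    for (int i = 1; i <= N; i++)
        for (int k = i + 1; k <= N; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + b[k][j];
}
```

Both orders leave `c` in the same final state, which is the property the compiler's dependence test has to prove before interchanging.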
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
  for j …
    for k …
      A[j] = B[k]
      for i …
        C[i][k] = …
  for j …
    for k …
      M[j] = N[k]
      for i …
        X = C[i][k]

Parallelism & Locality transformations applied:
– Loop fusion
– Loop skewing to enable tiling
– Loop tiling for cache
– Loop skewing for pipeline parallelization
– Loop tiling for registers

Output: optimized loops
  for jT …
    omp parallel for
    for kT …
      for j …
        for k …
          A[j] = B[k]
          M[j] = N[k]
          for i
            C[i][k] = …
            X = C[i][k]
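Loop fusion, the first step in the example above, can be sketched on a much smaller case. The function and array names are hypothetical. In the unfused form `tmp` is written by one loop and read back by the next; after fusion each value is consumed right after it is produced, while it is still in a register or in cache.

```c
/* Before fusion: two passes over the data. */
void two_loops(const int *in, int *tmp, int *out, int n)
{
    for (int i = 0; i < n; i++)
        tmp[i] = in[i] * 2;
    for (int i = 0; i < n; i++)
        out[i] = tmp[i] + 1;
}

/* After fusion: one pass. Legal here because iteration i of the
 * second loop only needs the value produced by iteration i of the
 * first loop. */
void fused(const int *in, int *tmp, int *out, int n)
{
    for (int i = 0; i < n; i++) {
        tmp[i] = in[i] * 2;
        out[i] = tmp[i] + 1;
    }
}
```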
Data Reorganization
Data reorganization transformations to reduce memory latency
– Context sensitive and insensitive safety analysis
– Data affinity analysis and shape analysis
– Data splitting, data transposing, data interleaving to reshape data layout
Enabled at -O5

Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated
[Diagram: data reorganization for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate arrays F0[i], F1[i], F2[i], F3[i]. Array merging: separate rows A[0], A[1], A[2], A[3] are laid out as one contiguous array A[i][j]. Array transposing: A[i][j] becomes A'[j][i].]
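The array-splitting case can be written out by hand as a structure-of-arrays conversion. This is only a sketch of the layout change, not the compiler's implementation; the type and function names and the size `N` are assumptions, with field names chosen to echo the F0 to F3 labels of the diagram.

```c
#define N 4   /* number of elements (assumption for the sketch) */

/* Before: array of structures. A loop that reads only f0 still
 * pulls f1, f2 and f3 into cache with every element. */
struct S { double f0, f1, f2, f3; };

/* After splitting: one array per field. A loop over f0 is now
 * stride-one and every cache line it loads is fully used. */
struct S_split { double f0[N], f1[N], f2[N], f3[N]; };

void split_fields(const struct S *s, struct S_split *out)
{
    for (int i = 0; i < N; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Stride is sizeof(struct S). */
double sum_f0_aos(const struct S *s)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += s[i].f0;
    return sum;
}

/* Stride is sizeof(double). */
double sum_f0_split(const struct S_split *t)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += t->f0[i];
    return sum;
}
```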
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
– Full OpenMP task parallelization
– Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS supported Thread-Local Storage (TLS) by default
– Bypass expensive pthread key mechanism
– pthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
Improved interaction between OpenMP and automatic SIMD
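A minimal sketch of user-explicit parallelization, assuming the file is built with OpenMP enabled (-qsmp=omp with the XL compilers, as listed earlier in the deck). The function name `dot` is hypothetical. The pragma is standard OpenMP; when OpenMP is not enabled it is ignored and the loop runs serially with the same result.

```c
/* Dot product with an OpenMP reduction: each thread accumulates a
 * private partial sum, and the partial sums are combined at the end
 * of the parallel loop. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the partial sums are combined in an unspecified order, floating-point results can differ in the last bits between runs unless the values are exactly representable.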
Automatic parallelization
Improvements to automatic parallelization
– More effective array data flow analysis for array privatization
– Automatic privatization of descriptor-based Fortran arrays
– Runtime-dependence testing
SMP runtime improvements
– Leverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interest
– spins and yields to define the behavior of idle threads
– Thread binding using startproc and stride suboptions
– New bind suboption on AIX:
  bind=SDL=<start resource>,<number of resources>,<stride>
  bindlist=SDL=i0,i1,…,ix
– schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  • Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
– Simplifies inlining control for programmer
-qinline=level=X, X = 0..10 (default 5)
– Consistent across all languages and optimization levels
– Previous mechanisms still available
User inline control with -qinline+<function_name> or -qinline-<function_name>
Automatic inlining before loop optimization
– Previously only available at -O5 or user inlining on C++
– Available at all levels of -qhot (default at -O3 and up)
– Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
– Precision of floating-point computation
– Handling of special cases of IEEE FP standard (INF, NaN, etc.)
– Use of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
– Suboptions allow fine-grain control over this guarantee
– Examples:
  -qstrict=precision        Strict FP precision
  -qstrict=exceptions       Strict FP exceptions
  -qstrict=ieeefp           Strict IEEE FP implementation
  -qstrict=nans             Strict generation and computation of NaNs
  -qstrict=order            Do not modify evaluation order
  -qstrict=vectorprecision  Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
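Why evaluation order matters can be seen in a few lines of C. The two hypothetical functions below add the same three numbers with different grouping; floating-point addition is not associative, so a compiler that is allowed to reassociate (for example without -qstrict=order) may change the result. The `volatile` temporaries are only there to force each intermediate value to be rounded to `double`.

```c
/* (a + b) + c */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* rounded to double here */
    return t + c;
}

/* a + (b + c) */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;   /* c is lost if |b| is much larger */
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0, `sum_left` returns 1.0 because the large terms cancel first, while `sum_right` returns 0.0 because 1.0 is absorbed when it is added to -1e20.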
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
– Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries, SPXXL/ScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XL C/C++ V11.1 and XL Fortran V13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc, gxlc++, gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop Information
- Loop Transformation Reports
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
Agenda
- Overview of XL Compiler Family
- Major Features of XL C/C++ V11.1 and XL Fortran V13.1
- Migration from GNU to XL compilers
- XML Compiler Transformation Reports
- Compiler Optimizations for Performance
  - Profile Directed Feedback Optimization
  - SIMDization and Vectorization
  - Loop Transformations
  - Data Prefetch
  - Data Reorganization
  - Inlining
  - Parallelization
Overview of XL Compiler Family
- Similar compilation technology is used to implement the C, C++ and Fortran compilers
- Supports AIX, Linux on Power, z/OS (C/C++ only), Blue Gene, Cell
- Advanced optimization capabilities
  - Exploitation and tuning for latest hardware implementations
  - Aggressive loop analysis and transformations (unimodular and polyhedral framework)
  - Whole program optimization
  - SIMD code generation and vectorization exploitation
  - Parallelization (automatic and user-driven through OpenMP)
  - Profile-driven optimization
Major Features of XL C/C++ V11.1 and XL Fortran V13.1
- POWER7 support and exploitation
- Language standard conformance
  - Full Fortran 2003
  - Full OpenMP 3.0
  - Additional C++0x features
- Optimization enhancements
  - Automatic parallelization
  - Inlining
  - Loop analysis and transformations
  - Delinquent load driven optimizations
  - Profile-directed feedback
- Productivity enhancement
  - XML compiler transformation reports
- Other features
  - Fine-grained strict control
  - ProPolice
  - func_trace
  - Other options and directives
HPC Performance Tuning with XL Compilers
- Use -O3 -qarch=pwr7 or -O3 -qhot -qarch=pwr7, with -qnostrict or -qstrict
- Profiling for hot spot detection
  - Compiler instrumentation: -qpdf1 (level=1 or level=2), then -qpdf2
  - -pg for gprof/Xprofiler, -qlist for tprof
  - User-provided profile functions: -qfunctrace
- SIMDization
  - Automatic SIMDization: -O3 or above with -qsimd
  - User explicit SIMD program: -qaltivec
- Loop transformations
  - Loop transformations: -O3 or above
- Whole program optimizations
  - -O4 or -O5 for inter-procedural optimization, inlining, code partition, data reorganization
- Parallelization
  - User explicit parallelization only: -qsmp=omp
  - Auto parallelization: -qsmp (-qsmp=auto)
  - Polyhedral framework: -qsmp with -qhot=level=2
- XML Transformation Reports
  - -qlistfmt=xml=all
Migration from GCC to IBM XL Compilers
- Compatibility
  - Source level: supports many gcc/g++ language extensions and annotations
  - Binary level: link objects from gcc/g++ and XL C/C++
- gxlc and gxlc++ utilities for compile option mapping
  - Controlled by the gxlc.cfg configuration file for option mappings from GCC to XL C/C++
  - Modify the contents of gxlc.cfg to meet your specific compilation requirements
gxlc, gxlc++, gxlC
Syntax:
    gxlc | gxlc++ | gxlC   -v   -Wx,<XL compiler options>   <gcc/g++ options>   filename
- Creating a customized configuration file: set the XLC_USR_CONFIG environment variable to the location of your own configuration file
- gxlc.cfg format:
    abcd   gcc_or_g++_option   "xlc_or_xlc++_option"
  e.g.
    nnnc   -ansi   -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__
    nnn*   -B      -B
    nnn*   -C      -C
    nnn*   -c      -c
    nnn*   -dM     -qshowmacros
    nnn*   -D      -D
    nnn*   -E      -E
XML Compiler Transformation Reports
- Generate compilation reports consumable by other tools
  - Enable better visualization and analysis of compiler information
  - Help users do manual performance tuning
  - Help automatic performance tuning through performance tool integration
- Unified report from all compiler subcomponents and analysis
  - Compiler options
  - Pseudo-sources
  - Compiler transformations, including missed opportunities
- Consistent support among Fortran and C/C++
- Controlled under option:
  - -qlistfmt=xml=inlines generates inlining information
  - -qlistfmt=xml=transform generates loop transformation information
  - -qlistfmt=xml=data generates data reorganization information
  - -qlistfmt=xml=pdf generates dynamic profiling information
  - -qlistfmt=xml=all turns on all optimization content
  - -qlistfmt=xml=none turns off all optimization content
Compiler Transformation Report Contents
- Program characteristics
  - Compiler version
  - Date of compilation
  - Source file table
  - Function table
  - Loop table (line number, nest level, iteration count, loop attributes)
  - Transformation table (line number, transformation description)
  - Pseudocode
- Transformations at both high-level and low-level optimizations
  - Intra-procedural transformations
    - Loop transformations
    - Data prefetch
    - Vectorization and SIMDization
    - Parallelization
    - Instruction scheduling
  - Inter-procedural transformations
    - Inlining
    - Data reorganization
- Profiling information
  - Basic block counters
  - Call counters
  - Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
- Compiler Optimizations for Performance
  - Compiler static analysis and optimization
  - User guided optimizations through compiler options and directives
  - Automatic compiler optimizations through profile directed feedback
- XL Compiler and Tooling Integration
  - Compiler feedback view in PTP
  - Compiler transformation reports and HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Diagram: SOURCE CODE -> compile and link with -qpdf1 (static analysis, profile based refinement) -> INSTRUMENTED APPLICATION, run with SAMPLE INPUTS -> PROFILE DATA -> compile and link with -qpdf2 (profile directed optimizations) -> OPTIMIZED APPLICATION]
- Hardware and software constraints
- Multiple sample runs for different hardware performance events
- Profile based instrumentation refinement
Basic Block and Call Counter Information
Step 1: compile the application with -qpdf1 to generate an instrumented executable
Step 2: run the executable with a typical input data set to gather profiling information
Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters:
    Call Name     Call Execution Count   Line
    smtctl_       0                      79
    is_smt_on_    1                      55
    jbind_        1                      109
    jbind_        0                      108
    aff_          1                      38

Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters:
    Block Index   Block Execution Count   Start Line   End Line
    3             5                       1            33
    4             5                       33           34
    ...
Cache Miss Information
Each entry lists the memory reference (the delinquent load), its source code location (region and line), and the cache miss data (cache level, miss count, miss rate).
- ((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3, Line 408, Cache Level 2, Miss Count 17446, Miss Rate 24
- ((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3, Line 412, Cache Level 2, Miss Count 9999, Miss Rate 14
- ((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3, Line 413, Cache Level 2, Miss Count 9974, Miss Rate 14
- ((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  Region 3, Line 525, Cache Level 2, Miss Count 1033429, Miss Rate 6
- ((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  Region 3, Line 557, Cache Level 2, Miss Count 11553, Miss Rate 16
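Loop information
Loop Table columns: Loop Index, Start Line, End Line, Parent Loop Index, Nest Level, Minimum Cost, Maximum Cost, Iteration Count, Attributes
- Loop 1: start line 203, nest level 1, minimum cost 19630, maximum cost 19630, iteration count 75 (array)
  Attributes: well behaved, bump normalized, lower bound normalized
- Loop 2: start line 188, nest level 1, minimum cost 20413, maximum cost 20413, iteration count 149 (array)
  Attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
- Loop 3: start line 141, nest level 1, minimum cost 13300, maximum cost 13300, iteration count 100 (default)
  Attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
- Loop 4: start line 203, nest level 1, minimum cost 19630, maximum cost 19630, iteration count 5 (PDF)
  Attributes: residual, well behaved, bump normalized, guarded, lower bound normalized
The start line gives the source location. The loop iteration count is based on static analysis or dynamic profiling.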
Loop Transformation Reports
Columns: Seq, Type, Phase, Region, Line, Loop Index, Description, Attributes
- Seq 2: LoopVector (success), High Level Optimizer, region 4, line 217, loop index 1
  Description: Loop vectorization was performed. Attributes: not available
- Seq 3: LoopFusion (success), High Level Optimizer, region 4, line 108, loop index 3
  Description: Loops were fused. Attributes: Loop Line Number 108, Loop Line Number 206
- Seq 4: LoopVectorVersion (success), High Level Optimizer, region 4, line 108, loop index 3
  Description: Vector versioning was performed. Attributes: not available
- Seq 20: ModuloSchedule (success), Low Level Optimizer, region 12, line 3499, loop index 26
  Description: Loop was modulo scheduled. Attributes: Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
    void foo(float *p, float *q, float *r, int n)
    {
        for (int i = 0; i < n; i++)
            p[i] = p[i] + q[i] * r[i];
    }
Report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Modified source file after tuning (file.c):
    void foo(float *restrict p, float *restrict q, float *restrict r, int n)
    {
        for (int i = 0; i < n; i++)
            p[i] = p[i] + q[i] * r[i];
    }
Report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
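The tuning step on this slide can be tried as a small self-contained program. The following is only a sketch: the function names fma_update and fma_update_restrict are hypothetical, and the claim that the second version is easier to parallelize follows the report messages quoted on the slide, not any measurement made here.

```c
#include <assert.h>

/* Without restrict, the compiler must assume p may overlap q or r,
   so a store to p[i] could change a value read in a later iteration. */
void fma_update(float *p, const float *q, const float *r, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}

/* restrict is the caller's promise that the three arrays do not overlap,
   which removes the aliasing dependency the report complained about. */
void fma_update_restrict(float *restrict p, const float *restrict q,
                         const float *restrict r, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}
```

Both functions compute the same values when the arrays really are disjoint; passing overlapping arrays to the restrict version is undefined behavior, so the qualifier should only be added when the call sites are known.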
Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Successor to Altivec programming extensions on POWER6/PPC970
  - Altivec data types, 16-byte vectors: vector char (16 elements), vector short (8 elements), vector pixel (8 elements), vector int (4 elements), vector float (4 elements)
  - VSX Altivec extensions, 16-byte vectors: vector double (2 elements), vector long long (2 elements)
- Altivec built-in functions extended to new data types: vec_add(vector double, vector double), vec_sub(vector long long, vector long long)
- New vector operations: vec_mul, vec_div, ...
- Unaligned load and store operations
  - Altivec truncating loads/stores still available: vec_ld, vec_st
  - New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
- Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
- Features
  - Basic block level SIMDization
  - Loop level aggregation
  - Data conversion, reduction
  - Loops with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support of unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report message, followed by suggested user actions:
- "Loop was SIMD vectorized"
- "It is not profitable to vectorize"
  - Use pragma simd_level(10) to force the compiler to do SIMDization
- "memory accesses have non-vectorizable alignment"
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
- "data dependence prevents SIMD vectorization"
  - Use fewer pointers when possible
  - Use pragma independent if the loop has no loop-carried dependency
  - Use pragma disjoint(a, b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
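The alignment and aliasing hints above can be combined in one loop. This is a sketch under assumptions: the names scale_add, run_scale_add and the buffers are hypothetical, the __alignx call is guarded so it is only compiled by XL C (detected here through the __IBMC__ macro), and other compilers see only the portable attribute and restrict.

```c
#include <assert.h>
#include <stdint.h>

#define VEC_N 1024

/* 16-byte alignment matches the VMX/VSX vector width. */
static float buf_a[VEC_N] __attribute__((aligned(16)));
static float buf_b[VEC_N] __attribute__((aligned(16)));

/* x[i] += s * y[i]; restrict states that x and y do not overlap. */
void scale_add(float *restrict x, const float *restrict y, float s, int n)
{
    assert(n >= 0);
#ifdef __IBMC__
    /* XL C only: assert to the compiler that both pointers are 16-byte aligned. */
    __alignx(16, x);
    __alignx(16, y);
#endif
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s * y[i];
}

/* Returns 1 if both static buffers start on a 16-byte boundary. */
int buffers_are_aligned(void)
{
    return ((uintptr_t)buf_a % 16 == 0) && ((uintptr_t)buf_b % 16 == 0);
}

/* Fills the buffers, runs the loop, and returns the last element. */
float run_scale_add(void)
{
    for (int i = 0; i < VEC_N; i++) {
        buf_a[i] = 1.0f;
        buf_b[i] = (float)i;
    }
    scale_add(buf_a, buf_b, 2.0f, VEC_N);
    return buf_a[VEC_N - 1];
}
```

Whether the loop is actually SIMDized still has to be checked in the transformation report; the hints only remove two of the reasons the compiler may give for not doing so.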
SIMDization Tuning (continued)
Transformation report message, followed by suggested user actions:
- "memory accesses have non-vectorizable strides"
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time known stride information
  - Stride versioning
- "either operation or data type is not suitable for SIMD vectorization"
  - Do statement splitting and loop splitting
- "loop structure prevents SIMD vectorization"
  - Convert while-loops into do-loops when possible
  - Limit the use of control flow in a loop
  - Use MIN, MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
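The "use MIN, MAX instead of if-then-else" advice can be illustrated with a loop that clamps negative values to zero. This is a minimal sketch; the macro FMAX2 and both function names are hypothetical, and whether a given compiler vectorizes either form is not claimed here.

```c
#include <assert.h>

#define FMAX2(a, b) ((a) > (b) ? (a) : (b))

/* Branchy form: an if-then-else in the loop body. */
void clamp_low_branch(float *restrict out, const float *restrict in, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (in[i] > 0.0f)
            out[i] = in[i];
        else
            out[i] = 0.0f;
    }
}

/* Same result written as a single MAX expression, which maps onto a
   select or vector-max operation without control flow in the loop. */
void clamp_low_max(float *restrict out, const float *restrict in, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = FMAX2(in[i], 0.0f);
}
```

The two functions are interchangeable for ordinary inputs, so the rewrite can be applied first and the transformation report consulted afterwards to see whether the "loop structure" message disappears.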
MASS enhancements and Auto-vectorization
- MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
    - Internally exploits VSX instructions
    - SP: average speedup of 1.99 vs POWER5 MASSV
    - DP: average speedup of 1.27 vs POWER5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
    - Tuned math routines operating on vector data types
    - Over 35 frequently used mathematical functions
    - Both single and double precision
    - To be used in conjunction with explicit SIMD programming
- Auto-vectorization at optimization level -O3 or above; -qstrict=vectorprecision to maintain precision over all loop iterations
  Source loop:
      for (i = 0; i < n; i++)
          b[i] = sqrt(a[i]);
  Call generated by the compiler:
      __vsqrt_P7(b, a, n);
  Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
- Software control over the POWER7 prefetch engine, supporting up to 12 data streams
- Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
- Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
- Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
- Partial cache line touch
    void __partial_dcbt(void *address)
- Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
- Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
    /* Store stream prefetch for array a (stream ID 11) */
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b (stream ID 0) */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    __protected_stream_go();    /* start stream prefetch */
    for (i = 0; i < n; i++) a[i] = b[i] + ...
Argument roles: stream direction (FORWARD), stream length in 128-byte cache lines (n*sizeof(double)/128), prefetch depth (DEEPER), and stream ID (11 for a, 0 for b).
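For code that must also build outside XL C, a single-line touch can stand in for the stream built-ins. The sketch below is an assumption-heavy illustration: touch_transient and sum_stream are hypothetical names, PREFETCH_AHEAD is an untuned guess, __dcbtt is used only under XL C (taken from the built-in list in this deck and guarded by __IBMC__), and GCC-compatible compilers fall back to __builtin_prefetch.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* elements ahead of the current index; a guess to tune */

/* Hint that the cache line holding addr will be needed soon and only once. */
static void touch_transient(const void *addr)
{
#if defined(__IBMC__)
    __dcbtt((void *)addr);            /* XL built-in: transient cache line touch */
#elif defined(__GNUC__)
    __builtin_prefetch(addr, 0, 0);   /* read access, low temporal locality */
#else
    (void)addr;                       /* no prefetch hint available */
#endif
}

/* Streams once through b, touching data a fixed distance ahead. */
double sum_stream(const double *b, size_t n)
{
    double sum = 0.0;
    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            touch_transient(&b[i + PREFETCH_AHEAD]);
        sum += b[i];
    }
    return sum;
}
```

The hint never changes the computed result, so it can be added or removed freely while measuring; for long regular streams the stream built-ins shown on the slide are the more direct tool.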
Loop Optimization
- Traditional unimodular loop transformations for perfect, regular loop nests
  - Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
  - Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
  - Under optimization levels -O3, -O3 -qhot or above
- Polyhedral framework for any loop nests
  - Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
    - Loop skewing, loop tiling for triangular loop shapes
  - Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks, at all -qhot levels (-O3 -qhot or above)
Polyhedral Loop Transformation Examples
- Dependence analysis
      do i = 1, N
        do j = 1, i
          do k = i+1, N
            c(k) = c(j) + b(j,k)
  - Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
  - Affects all optimization levels that include -qhot
- Loop transformations
  - Tiling of triangular matrix multiplication
      do i = 1, N
        do j = i+1, N
          do k = i+1, N
            c(j,i) += a(k,i) * b(j,k)
  - Also allows transformation of imperfect loop nests
    - Intervening code between loops
  - Only available at -qhot=level=2
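The dependence-analysis example can be checked by hand in C. In this sketch the names nest_original and nest_interchanged are hypothetical, indices are kept 1-based to match the Fortran, and b is stored as b[k][j] so that the C row-major layout mirrors the Fortran column-major b(j,k). The interchange is legal because, for a fixed i, writes go to c[k] with k > i while reads come from c[j] with j <= i, so they never touch the same element.

```c
#include <assert.h>

#define NEST_N 6   /* loop bound; arrays are sized NEST_N + 1 for 1-based indexing */

/* Original order: k is innermost, so b[k][j] is accessed with a large stride. */
void nest_original(double c[], double b[][NEST_N + 1])
{
    for (int i = 1; i <= NEST_N; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= NEST_N; k++)
                c[k] = c[j] + b[k][j];
}

/* j and k interchanged: j is innermost, so b[k][j] is accessed with stride one. */
void nest_interchanged(double c[], double b[][NEST_N + 1])
{
    for (int i = 1; i <= NEST_N; i++)
        for (int k = i + 1; k <= NEST_N; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + b[k][j];
}
```

Running both versions on the same input and comparing c is a quick way to convince yourself that the reordering preserves the result before relying on the compiler to apply it.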
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Transformations applied for parallelism and locality:
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Output: optimized loops
    for jT ...
      omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Data Reorganization
- Data reorganization transformations to reduce memory latency
  - Context-sensitive and context-insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing, data interleaving to reshape data layout
- Enabled at -O5
Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
- Seq 1: ArraySplitting, High Level Optimizer, data name iplus, 9
  Description: An array of a large aggregated data type was split into multiple arrays of smaller data types
- Seq 27: ArrayCoalescing, High Level Optimizer, data name net
  Description: Global variables were aggregated
[Figure: data reorganization for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0, F1, F2, F3 becomes separate arrays F0[i], F1[i], F2[i], F3[i]. Array merging: panels show arrays A[0] to A[3] and the two-dimensional layout A[i][j]. Array transposing: A[i][j] becomes A'[j][i].]
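Array splitting can be written out by hand to see why it helps. This is a sketch with hypothetical names (struct record, struct split_records, split_array); it performs manually what the figure describes the compiler doing, and makes no claim about how the compiler implements it.

```c
#include <assert.h>

#define MAX_RECORDS 64

/* Array-of-structures layout: S[i].F0 .. S[i].F3 in the figure. */
struct record { double f0, f1, f2, f3; };

/* Split layout: one contiguous array per field. */
struct split_records {
    double f0[MAX_RECORDS];
    double f1[MAX_RECORDS];
    double f2[MAX_RECORDS];
    double f3[MAX_RECORDS];
};

/* Copies n records from the structure layout into the split layout. */
void split_array(const struct record *s, struct split_records *out, int n)
{
    assert(n >= 0 && n <= MAX_RECORDS);
    for (int i = 0; i < n; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Touches one field per record: each access skips over f1..f3. */
double sum_f0_aos(const struct record *s, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += s[i].f0;
    return sum;
}

/* Same sum over the split layout: stride-one through f0 only. */
double sum_f0_split(const struct split_records *p, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += p->f0[i];
    return sum;
}
```

A loop that reads only f0 uses every byte of each cache line in the split layout, while the structure layout brings in three unused fields alongside each value.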
User Explicit Parallelization with OpenMP
- Full OpenMP 3.0 implementation on C, C++ and Fortran
  - Full OpenMP task parallelization
  - Privatization of Fortran descriptor-based arrays
- Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
  - Bypasses the expensive pthread key mechanism
  - pthread-based implementation available under option control -qsmp=noostls for backward compatibility
- Improved interaction between OpenMP and automatic SIMD
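A small OpenMP loop shows the two features named above, a parallel loop and a threadprivate variable. This is a sketch only: dot, updates_on_calling_thread and chunk_updates are hypothetical names, and the code is written so that it still compiles and gives the same sum when OpenMP is not enabled (with XL this would be built under -qsmp=omp, as listed earlier in the deck).

```c
#include <assert.h>

/* Per-thread update counter; with OpenMP enabled each thread has its own copy. */
static long chunk_updates = 0;
#pragma omp threadprivate(chunk_updates)

/* Dot product with the partial sums combined by an OpenMP reduction. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    assert(n >= 0);
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        chunk_updates++;          /* no data race: the counter is threadprivate */
        sum += x[i] * y[i];
    }
    return sum;
}

/* Number of loop iterations executed so far by the calling thread. */
long updates_on_calling_thread(void)
{
    return chunk_updates;
}
```

Because the counter is threadprivate, no locking is needed around the increment; the value seen by the calling thread depends on how iterations were scheduled.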
Automatic parallelization
- Improvements to automatic parallelization
  - More effective array data flow analysis for array privatization
  - Automatic privatization of descriptor-based Fortran arrays
  - Runtime dependence testing
- SMP runtime improvements
  - Leverage of TLS in the SMP runtime implementation
- Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
- The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and auto-parallel programs
- Some suboptions of interest
  - spins and yields to define the behavior of idle threads
  - Thread binding using the startproc and stride suboptions
  - New bind suboptions on AIX:
      bind=SDL=<start resource>,<number of resources>,<stride>
      bindlist=SDL=i0,i1,...,ix
  - schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
    - Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
- Single control knob to enable inlining
  - Simplifies inlining control for the programmer
- -qinline=level=X, X = 0 to 10 (default 5)
  - Consistent across all languages and optimization levels
  - Previous mechanisms still available
- User inline control with -qinline+|-<function_name>
- Automatic inlining before loop optimization
  - Previously only available at -O5, or user inlining on C++
  - Available at all levels of -qhot (default at -O3 and up)
  - Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
- Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
  - Use of alternate math libraries
- -qstrict guarantees results identical to noopt, at the expense of optimization
  - Suboptions allow fine-grained control over this guarantee
  - Examples:
    - -qstrict=precision: strict FP precision
    - -qstrict=exceptions: strict FP exceptions
    - -qstrict=ieeefp: strict IEEE FP implementation
    - -qstrict=nans: strict generation and computation of NaNs
    - -qstrict=order: do not modify evaluation order
    - -qstrict=vectorprecision: maintain precision over all loop iterations
- Suboptions can be combined: -qstrict=precision:nonans
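Why evaluation order matters for floating-point results can be seen with three additions. This sketch uses hypothetical function names and assumes IEEE double arithmetic with default compiler settings (no fast-math style reassociation); it shows the kind of change that -qstrict=order is meant to rule out.

```c
#include <assert.h>

/* Evaluates (a + b) + c, the order written in the source. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Evaluates a + (b + c), an order a reassociating optimizer might choose. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* Self-check: with a = 1e20, b = -1e20, c = 1 the two orders disagree,
   because 1 is far below the rounding granularity of 1e20. */
void check_grouping_example(void)
{
    assert(sum_left(1e20, -1e20, 1.0) == 1.0);
    assert(sum_right(1e20, -1e20, 1.0) == 0.0);
}
```

Both answers are correctly rounded for the order in which they were computed; the strictness options decide whether the compiler is allowed to switch between such orders.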
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
- Request for a feature to be supported by our compilers
  - C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
  - Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
  - Or send e-mail to xl_feature@ca.ibm.com
Documentation
- An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
- Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
- Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
- Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-

Agenda
- Overview of XL Compiler Family
- Major Features of XL C/C++ V11.1 and XL Fortran V13.1
- Migration from GNU to XL compilers
- XML Compiler Transformation Reports
- Compiler Optimizations for Performance
  - Profile Directed Feedback Optimization
  - SIMDization and Vectorization
  - Loop Transformations
  - Data Prefetch
  - Data Reorganization
  - Inlining
  - Parallelization

Overview of XL Compiler Family
- Similar compilation technology used to implement the C, C++ and Fortran compilers
- Supports AIX, Linux on Power, z/OS (C/C++ only), Blue Gene, Cell
- Advanced optimization capabilities
  - Exploitation and tuning for the latest hardware implementations
  - Aggressive loop analysis and transformations (unimodular and polyhedral frameworks)
  - Whole program optimization
  - SIMD code generation and vectorization exploitation
  - Parallelization (automatic and user-driven through OpenMP)
  - Profile-driven optimization

Major Features of XL C/C++ V11.1 and XL Fortran V13.1
- POWER7 support and exploitation
- Language standard conformance
  - Full Fortran 2003
  - Full OpenMP 3.0
  - Additional C++0x features
- Optimization enhancements
  - Automatic parallelization
  - Inlining
  - Loop analysis and transformations
  - Delinquent load driven optimizations
  - Profile-directed feedback
- Productivity enhancement
  - XML compiler transformation reports
- Other features
  - Fine-grained strict control
  - ProPolice
  - func_trace
  - Other options and directives

HPC Performance Tuning with XL Compilers
- Compile with -O3 -qarch=pwr7 or -O3 -qhot -qarch=pwr7, with -qnostrict or -qstrict
- Profiling for hot spot detection
  - Compiler instrumentation: -qpdf1 (level=1 or level=2), then -qpdf2
  - -pg for gprof/Xprofiler; -qlist for tprof
  - User-provided profile functions: -qfunctrace
- SIMDization
  - Automatic SIMDization: -O3 or above with -qsimd
  - User explicit SIMD programming: -qaltivec
- Loop transformations
  - Loop transformations: -O3 or above
- Whole program optimizations
  - -O4 or -O5 for inter-procedural optimization: inlining, code partitioning, data reorganization
- Parallelization
  - User explicit parallelization only: -qsmp=omp
  - Auto parallelization: -qsmp (-qsmp=auto)
  - Polyhedral framework: -qsmp with -qhot=level=2
- XML Transformation Reports
  - -qlistfmt=xml=all

Migration from GCC to IBM XL Compilers
- Compatibility
  - Source level
    - Supports many gcc/g++ language extensions and annotations
  - Binary level
    - Link objects from gcc/g++ and XL C/C++
- gxlc and gxlc++ utilities for compile option mapping
  - Controlled by the gxlc.cfg configuration file for option mappings from GCC to XL C/C++
  - Modify the contents of gxlc.cfg to meet your specific compilation requirements

gxlc, gxlc++, gxlC
- Invocation: gxlc | gxlc++ | gxlC [-v] [-Wx,<XL compiler options>] [<gcc or g++ options>] filename
- Creating a customized configuration file: set the XLC_USR_CONFIG environment variable to the location of your own configuration file
- gxlc.cfg format: abcd gcc_or_g++_option "xlc_or_xlc++_option"
- e.g.
    nnnc  -ansi  -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__
    nnn   -B     -B
    nnn   -C     -C
    nnn   -c     -c
    nnn   -dM    -qshowmacros
    nnn   -D     -D
    nnn   -E     -E

XML Compiler Transformation Reports
- Generate compilation reports consumable by other tools
  - Enable better visualization and analysis of compiler information
  - Help users do manual performance tuning
  - Help automatic performance tuning through performance tool integration
- Unified report from all compiler subcomponents and analysis
  - Compiler options
  - Pseudo-sources
  - Compiler transformations, including missed opportunities
- Consistent support among Fortran and C/C++
- Controlled under option:
  - -qlistfmt=xml=inlines generates inlining information
  - -qlistfmt=xml=transform generates loop transformation information
  - -qlistfmt=xml=data generates data reorganization information
  - -qlistfmt=xml=pdf generates dynamic profiling information
  - -qlistfmt=xml=all turns on all optimization content
  - -qlistfmt=xml=none turns off all optimization content

Compiler Transformation Report Contents
- Program characteristics
  - Compiler version
  - Date of compilation
  - Source file table
  - Function table
  - Loop table (line number, nest level, iteration count, loop attributes)
  - Transformation table (line number, transformation, description)
  - Pseudocode
- Transformations at both high-level and low-level optimizations
  - Intra-procedural transformations
    - Loop transformations
    - Data prefetch
    - Vectorization and SIMDization
    - Parallelization
    - Instruction scheduling
  - Inter-procedural transformations
    - Inlining
    - Data reorganization
- Profiling information
  - Basic block counters
  - Call counters
  - Cache miss counters

XL Compiler Assisted Performance Analysis and Tuning
- Compiler Optimizations for Performance
  - Compiler static analysis and optimization
  - User guided optimizations through compiler options and directives
  - Automatic compiler optimizations through profile directed feedback
- XL Compiler and Tooling Integration
  - Compiler feedback view in PTP
  - Compiler transformation reports and HPCS toolkit help detect bottlenecks and identify solutions

Compiler Feedback View
[Figure: compiler feedback view; image not reproduced]

Multiple-pass Dynamic Profiling Infrastructure
- Hardware and software constraints
- Multiple sample runs for different hardware performance events
- Profile based instrumentation refinement
[Figure: profile-directed feedback flow]
  1. SOURCE CODE is compiled and linked with -qpdf1 (static analysis, profile based refinement) to produce the INSTRUMENTED APPLICATION
  2. The instrumented application is run on SAMPLE INPUTS to produce PROFILE DATA
  3. The source code is compiled and linked with -qpdf2 (profile directed optimizations) to produce the OPTIMIZED APPLICATION

Basic Block and Call Counter Information
- Step 1: compile the application with -qpdf1 to generate an instrumented executable
- Step 2: run the executable with a typical input data set to gather profiling information
- Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information
  Region 1
  Region Execution Count: 1
  Call Coverage: 32 49
  Call Counters:
    Call Name     Call Execution Count   Line
    smtctl_       0                      79
    is_smt_on_    1                      55
    jbind_        1                      109
    jbind_        0                      108
    aff_          1                      38

Basic Block Counter Information
  Region 1
  Region Execution Count: 5
  Block Coverage: 81 81
  Block Counters:
    Block Index   Block Execution Count   Start Line   End Line
    3             5                       1            33
    4             5                       33           34
    ...

Cache Miss Information
Columns: Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate
- ((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 408 | 2 | 17446 | 24
- ((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 412 | 2 | 9999 | 14
- ((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 413 | 2 | 9974 | 14
- ((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 525 | 2 | 1033429 | 6
- ((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 557 | 2 | 11553 | 16
Callouts: the Memory Reference column identifies the delinquent load; Region and Line give the source code location; Cache Level, Miss Count and Miss Rate describe the cache miss.

Loop information
Loop Table columns: Loop Index | Start Line | End Line | Parent Loop Index | Nest Level | Minimum Cost | Maximum Cost | Iteration Count | Attributes
- Loop 1: start line 203; nest level 1; minimum cost 19630; maximum cost 19630; iteration count 75 (array)
  Attributes: well behaved, bump normalized, lower bound normalized
- Loop 2: start line 188; nest level 1; minimum cost 20413; maximum cost 20413; iteration count 149 (array)
  Attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
- Loop 3: start line 141; nest level 1; minimum cost 13300; maximum cost 13300; iteration count 100 (default)
  Attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
- Loop 4: start line 203; nest level 1; minimum cost 19630; maximum cost 19630; iteration count 5 (PDF)
  Attributes: residual, well behaved, bump normalized, guarded, lower bound normalized
Callouts: Start Line and End Line give the source location; the loop iteration count is based on static analysis or dynamic profiling.

Loop Transformation Reports
Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number 108; Loop Line Number 206
4 | LoopVectorVersion (success) | High Level Optimizer | 4 | 108 | 3 | Vector versioning was performed | not available
20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval 12

Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
    void foo(float *p, float *q, float *r, int n) {
        for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
    }
Transformation report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning: declare the pointer parameters restrict.
Modified source file (file.c):
    void foo(float * restrict p, float * restrict q, float * restrict r, int n) {
        for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
    }
Transformation report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
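
The aliasing message in the report can be made concrete with a small sketch. The two functions below are illustrative only (the names foo_forward and foo_reverse are not from the slides): they run the slide's loop body in opposite iteration orders, which is one way to approximate what a reordered (parallel or SIMD) execution could do. When p overlaps q the two orders disagree, which is why the compiler must assume a dependence unless restrict rules the overlap out.

```c
/* The slide's loop without restrict: p, q and r are allowed to overlap. */
void foo_forward(float *p, const float *q, const float *r, int n) {
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}

/* Same loop body, iterations visited in reverse order. A parallelized or
 * vectorized loop gives no ordering guarantee, so this stands in for
 * "some other legal order if the iterations were independent". */
void foo_reverse(float *p, const float *q, const float *r, int n) {
    for (int i = n - 1; i >= 0; i--)
        p[i] = p[i] + q[i] * r[i];
}
```

Calling both with p = a + 1 and q = a on an array of ones gives different results (the forward loop carries each updated value into the next iteration), whereas with disjoint arrays the two orders agree.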

Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Successor to AltiVec programming extensions on POWER6/PPC970
  - AltiVec data types: 16-byte vectors
    - vector char: 16 elements
    - vector short: 8 elements
    - vector pixel: 8 elements
    - vector int: 4 elements
    - vector float: 4 elements
  - VSX AltiVec extensions: 16-byte vectors
    - vector double: 2 elements
    - vector long long: 2 elements
- AltiVec built-in functions extended to new data types
  - vec_add(vector double, vector double)
  - vec_sub(vector long long, vector long long)
- New vector operations: vec_mul, vec_div, ...
- Unaligned load and store operations
  - AltiVec truncating loads/stores still available: vec_ld, vec_st
  - New non-truncating loads/stores: vec_xld2, vec_xstd2
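
As a rough sketch of how these built-ins might be combined, the function below adds two arrays of doubles. The VSX path is compiled only when the compiler defines both __IBMC__ and __VSX__; that guard, the function name add_doubles, and the argument order used for vec_xld2 (displacement, pointer) and vec_xstd2 (vector, displacement, pointer) are assumptions to check against the compiler reference. Every other compiler takes the plain scalar loop.

```c
/* Element-wise c[i] = a[i] + b[i] for doubles. */
void add_doubles(double *c, double *a, double *b, int n) {
    int i = 0;
#if defined(__IBMC__) && defined(__VSX__)
    /* Assumed XL C path (-qaltivec, VSX target): two doubles per vector. */
    for (; i + 1 < n; i += 2) {
        vector double va = vec_xld2(0, &a[i]);  /* non-truncating load  */
        vector double vb = vec_xld2(0, &b[i]);
        vec_xstd2(vec_add(va, vb), 0, &c[i]);   /* non-truncating store */
    }
#endif
    /* Scalar loop: the whole array elsewhere, the leftover element on VSX. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```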

Automatic SIMDization
- Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
- Features
  - Basic block level SIMDization
  - Loop level aggregation
  - Data conversion, reduction
  - Loops with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support of unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd

SIMDization Tuning
Transformation report message, followed by suggested user actions:
- "Loop was SIMD vectorized"
- "It is not profitable to vectorize"
  - Use #pragma simd_level(10) to force the compiler to do SIMDization
- "data dependence prevents SIMD vectorization"
  - Use fewer pointers when possible
  - Use pragma independent if the loop has no loop-carried dependency
  - Use #pragma disjoint (a, b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
- "memory accesses have non-vectorizable alignment"
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
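
A minimal sketch of the alignment and disjointness hints, assuming a C99 compiler that accepts the GCC-style aligned attribute (XL C does, per the migration slide's note on gcc extensions). The names x, y, scale_add and is_aligned16 are hypothetical, and the __alignx calls are compiled only when __IBMC__ is defined, since that built-in is XL-specific.

```c
#include <stdint.h>

#define N 1024

/* 16-byte aligned arrays, so vector loads and stores can be aligned. */
static float x[N] __attribute__((aligned(16)));
static float y[N] __attribute__((aligned(16)));

/* Helper for checking the alignment request took effect. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16u) == 0;
}

/* restrict states that dst and src never overlap, removing the
 * aliasing dependence that would otherwise block SIMDization. */
void scale_add(float * restrict dst, const float * restrict src,
               float k, int n) {
#ifdef __IBMC__
    /* XL-only alignment hints named on the slide. */
    __alignx(16, dst);
    __alignx(16, src);
#endif
    for (int i = 0; i < n; i++)
        dst[i] = dst[i] + k * src[i];
}
```

Whether the loop is actually vectorized still depends on the compiler and options; the transformation report is the place to confirm it.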

SIMDization Tuning
Transformation report message, followed by suggested user actions:
- "memory accesses have non-vectorizable strides"
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time-known stride information
  - Stride versioning
- "either operation or data type is not suitable for SIMD vectorization"
  - Do statement splitting and loop splitting
- "loop structure prevents SIMD vectorization"
  - Convert while-loops into do-loops when possible
  - Limit the use of control flow in a loop
  - Use MIN and MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
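
The "MIN and MAX instead of if-then-else" advice can be sketched in C as follows. The MIN and MAX macros and both function names are assumptions made for illustration (the slide most likely refers to the Fortran intrinsics); the point is only that the second loop body is a single straight-line expression rather than a branch.

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Branchy form: if-then-else control flow inside the loop body. */
void clamp_branchy(float *v, int n, float lo, float hi) {
    for (int i = 0; i < n; i++) {
        if (v[i] < lo)
            v[i] = lo;
        else if (v[i] > hi)
            v[i] = hi;
    }
}

/* MIN/MAX form: every iteration evaluates the same expression,
 * which maps naturally onto vector min/max operations. */
void clamp_minmax(float *v, int n, float lo, float hi) {
    for (int i = 0; i < n; i++)
        v[i] = MIN(MAX(v[i], lo), hi);
}
```

Both forms give the same result whenever lo <= hi.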

MASS enhancements and Auto-vectorization
- MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
    - Internally exploits VSX instructions
    - SP: average speedup of 1.99 vs. POWER5 MASSV
    - DP: average speedup of 1.27 vs. POWER5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
    - Tuned math routines operating on vector data types
    - Over 35 frequently used mathematical functions
    - Both single and double precision
    - To be used in conjunction with explicit SIMD programming
- Auto-vectorization at optimization level -O3 or above; use -qstrict=vectorprecision to maintain precision over all loop iterations
  - Source loop:
        for (i = 0; i < n; i++)
            b[i] = sqrt(a[i]);
  - After auto-vectorization:
        __vsqrt_P7(b, a, n);
  - Transformation report: Loop vectorization was performed

Software-controlled data prefetching for POWER7
- Software control over the POWER7 prefetch engine, supporting up to 12 data streams
- Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
- Global analysis for coarse-grained prefetch engine control at optimization level -O5

Built-in functions for POWER7 data prefetching and cache control
- Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
- Partial cache line touch
    void __partial_dcbt(void *address)
- Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
- Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)

Example of POWER7 data prefetching
    /* Store stream prefetch for array a (stream id 11) */
    __protected_store_stream_set(FORWARD, &a, 11);                          /* stream direction, address, stream id */
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);       /* stream length, prefetch depth, stream id */
    /* Transient stream prefetch for array b (stream id 0) */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    __protected_stream_go();                                                /* start stream prefetch */
    for (i = 0; i < n; i++) a[i] = b[i] + ...

Loop Optimization
- Traditional unimodular loop transformations for perfect, regular loop nests
  - Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing and loop unrolling
  - Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
  - Under the optimization levels -O3, -O3 -qhot or above
- Polyhedral framework for any loop nests
  - Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
    - Loop skewing, loop tiling for triangular loop shapes
  - Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks, at all hot levels (-O3 -qhot or above)
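
To illustrate what loop tiling does to a loop nest, here is a hand-written sketch on a matrix transpose. It is not compiler output: the sizes DIM and TILE and the function names are arbitrary choices for the example, and DIM is assumed to be a multiple of TILE so no remainder loops are needed.

```c
#define DIM 64
#define TILE 8   /* DIM must be a multiple of TILE in this sketch */

/* Untiled: dst is written column by column, so each write lands
 * DIM doubles away from the previous one. */
void transpose_naive(double dst[DIM][DIM], double src[DIM][DIM]) {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            dst[j][i] = src[i][j];
}

/* Tiled: the iteration space is cut into TILE x TILE blocks so the
 * rows of src and dst touched by one block stay in cache together. */
void transpose_tiled(double dst[DIM][DIM], double src[DIM][DIM]) {
    for (int ii = 0; ii < DIM; ii += TILE)
        for (int jj = 0; jj < DIM; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

Both versions perform exactly the same assignments; only the order of the iterations differs, which is what makes tiling a legal reordering here.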

Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
- Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot
Loop transformations
- Tiling of triangular matrix multiplication
    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) += a(k,i) * b(j,k)
- Also allows transformation of imperfect loop nests
  - Intervening code between loops
- Only available at -qhot=level=2
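
The dependence-analysis example can be checked by hand with a C transcription. This is a sketch under stated assumptions: the arrays are 1-based with an unused element 0 to mirror the Fortran indices, b(j,k) is written as b[k][j] so the memory layout matches Fortran's column-major order, and NN and the function names are made up for the example.

```c
#define NN 32

/* Original order (j outer, k inner): the inner loop walks b[k][j]
 * with a stride of one full row. */
void kernel_jk(double c[NN + 1], double b[NN + 1][NN + 1]) {
    for (int i = 1; i <= NN; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= NN; k++)
                c[k] = c[j] + b[k][j];
}

/* Interchanged order (k outer, j inner): stride-one access to b.
 * Legal because within one i iteration the reads c[j] use j <= i
 * while the writes c[k] use k > i, so they never touch the same element. */
void kernel_kj(double c[NN + 1], double b[NN + 1][NN + 1]) {
    for (int i = 1; i <= NN; i++)
        for (int k = i + 1; k <= NN; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + b[k][j];
}
```

The i loop stays outermost in both versions because it does carry a dependence: c(k) written in iteration i is read as c(j) in later iterations.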

Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Output: optimized loops (parallelism & locality)
    for jT ...
      omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Transformations applied:
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers

Data Reorganization
- Data reorganization transformations to reduce memory latency
  - Context sensitive and insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing and data interleaving to reshape data layout
- Enabled at -O5

Data Reorganization Report
Columns: Seq | Type | Phase | Data Name | Category | Region | Line | Description
Rows (values in source order):
- 1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
- 27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated

[Figure: data reorganization for data locality and cache utilization]
- Array splitting: an array of structures S[0], S[1], S[2], S[3], ... with fields F0, F1, F2, F3 is split into separate arrays F0[i], F1[i], F2[i], F3[i]
- Array merging: the rows A[0], A[1], A[2], A[3], ... are merged into one two-dimensional array A[0][0] ... A[3][3], ...
- Array transposing: A[i][j] is laid out as A'[j][i], in the order A'[0][0], A'[1][0], A'[2][0], A'[3][0], A'[0][1], ...
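
Array splitting can be pictured with a small hand-written sketch. The compiler performs this transformation automatically at -O5; the types and functions below (record, split_records, split_array and the two sum functions) are hypothetical names used only to show the before and after layouts.

```c
#define COUNT 4

/* Before: array of structures. Consecutive f0 values are four
 * doubles apart, so a loop over f0 uses a quarter of each cache line. */
struct record {
    double f0, f1, f2, f3;
};

/* After array splitting: one dense array per field. */
struct split_records {
    double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT];
};

/* Copies the array-of-structures layout into the split layout. */
void split_array(const struct record *s, struct split_records *out) {
    for (int i = 0; i < COUNT; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Strided access: touches every fourth double. */
double sum_f0_aos(const struct record *s) {
    double sum = 0.0;
    for (int i = 0; i < COUNT; i++)
        sum += s[i].f0;
    return sum;
}

/* Stride-one access over the split field array. */
double sum_f0_soa(const struct split_records *r) {
    double sum = 0.0;
    for (int i = 0; i < COUNT; i++)
        sum += r->f0[i];
    return sum;
}
```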

User Explicit Parallelization with OpenMP
- Full OpenMP 3.0 implementation for C, C++ and Fortran
  - Full OpenMP task parallelization
  - Privatization of Fortran descriptor-based arrays
- Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
  - Bypasses the expensive pthread key mechanism
  - pthread-based implementation available under option control -qsmp=noostls for backward compatibility
- Improved interaction between OpenMP and automatic SIMD
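
A minimal example of user explicit parallelization, using only standard OpenMP 3.0 syntax. The function name dot is chosen for illustration. With XL the OpenMP directives are honored under -qsmp=omp (as listed on the tuning slide); a compiler invoked without OpenMP support simply ignores the pragma and runs the loop serially.

```c
/* Dot product: a work-sharing loop with a sum reduction. Each thread
 * accumulates a private partial sum that OpenMP combines at the end. */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the reduction changes the order of the additions, floating-point results can differ slightly between thread counts for general inputs.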

Automatic parallelization
- Improvements to automatic parallelization
  - More effective array data flow analysis for array privatization
  - Automatic privatization of descriptor-based Fortran arrays
  - Runtime dependence testing
- SMP runtime improvements
  - Leverage of TLS in the SMP runtime implementation
- Enable automatic parallelization with the compiler option -qsmp

XLSMPOPTS Environment Variable for Runtime Tuning
- The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and auto-parallelized programs
- Some suboptions of interest
  - spins and yields to define the behavior of idle threads
  - Thread binding using the startproc and stride suboptions
  - New bind suboption on AIX:
      bind=SDL=<start resource>,<number of resources>,<stride>
      bindlist=SDL=i0,i1,...,ix
  - schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
    - Note that the default schedule has changed from runtime to auto in V11/V13

Inlining
- Single control knob to enable inlining
  - Simplifies inlining control for the programmer
- -qinline=level=X, where X = 0 to 10 (default 5)
  - Consistent across all languages and optimization levels
  - Previous mechanisms still available
- User inline control with -qinline+<function_name> or -qinline-<function_name>
- Automatic inlining before loop optimization
  - Previously only available at -O5 or for user inlining in C++
  - Available at all levels of -qhot (default at -O3 and up)
  - Enables early inlining of Fortran module procedures

Control over optimizations that may affect program results: -qstrict suboptions
- Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
  - Use of alternate math libraries
- -qstrict guarantees results identical to noopt, at the expense of optimization
  - Suboptions allow fine-grained control over this guarantee
  - Examples:
      -qstrict=precision         Strict FP precision
      -qstrict=exceptions        Strict FP exceptions
      -qstrict=ieeefp            Strict IEEE FP implementation
      -qstrict=nans              Strict generation and computation of NaNs
      -qstrict=order             Do not modify evaluation order
      -qstrict=vectorprecision   Maintain precision over all loop iterations
- Suboptions can be combined: -qstrict=precision:nonans
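
Why evaluation order matters can be shown with three doubles. The sketch below is plain IEEE arithmetic, not XL-specific; the function names are made up, and volatile is used only to pin each grouping so that no compiler regroups the sums in the example itself.

```c
/* (a + b) + c: the grouping written in the source. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;   /* force the first addition to happen as written */
    return t + c;
}

/* a + (b + c): a regrouping that an optimizer may choose when the
 * evaluation order is not required to be strict. */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0 the first grouping yields 1.0, while the second yields 0.0 because 1.0 is lost when added to -1e20. Suboptions such as -qstrict=order exist to rule out this kind of difference.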

The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp

Fortran Cafe on IBM developerWorks

Feature Request
- Request for a feature to be supported by our compilers
- C/C++ feature request page
  - http://www.ibm.com/support/docview.wss?uid=swg27005811
- Fortran feature request page
  - http://www.ibm.com/support/docview.wss?uid=swg27005812
- Or send e-mail to xl_feature@ca.ibm.com

Documentation
- An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
- Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
- Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
- Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Overview of XL Compiler Family
Similar compilation technology used to implement C C++ and Fortran CompilersSupports AIX Linux on Power zOS (CC++ only) BlueGene CellAdvanced optimization capabilities
ndashExploitation and tuning for latest hardware implementations
ndashAggressive loop analysis and transformations (unimodular and polyhedral framework)
ndashWhole program optimizationndashSIMD code generation and Vectorization exploitationndashParallelization (automatic and user-driven through OpenMP)
ndashProfile-driven optimization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Major Features of XLC111 XLF131POWER7 support and exploitationLanguage standard conformance
ndashFull Fortran 2003ndashFull OpenMP 30 ndashAdditional C++0X features
Optimization enhancementsndashAutomatic parallelizationndashInliningndashLoop analysis and transformationsndashDelinquent load driven optimizationsndashProfile-directed feedback
Productivity enhancementndashXML compiler transformation reports
Other featuresndashFine-grained strict controlndashProPolicendashfunc_tracendashOther options and directives
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
HPC Performance Tuning with XL Compilers
-O3 -qarch=pwr7 or -O3 -qhot -qarch=pwr7, with -qnostrict or -qstrict
- Profiling for hot spot detection
  - Compiler instrumentation: -qpdf1=level=1|2, -qpdf2
  - -pg for gprof/xprofiler, -qlist for tprof
  - User-provided profile functions: -qfunctrace
- SIMDization
  - Automatic SIMDization: -O3 or above with -qsimd
  - User explicit SIMD program: -qaltivec
- Loop transformations
  - Loop transformations: -O3 or above
- Whole program optimizations
  - -O4 or -O5 for inter-procedural optimization, inlining, code partition, data reorganization
- Parallelization
  - User explicit parallelization only: -qsmp=omp
  - Auto parallelization: -qsmp (-qsmp=auto)
  - Polyhedral framework: -qsmp with -qhot=level=2
- XML Transformation Reports
  - -qlistfmt=xml=all
Migration from GCC to IBM XL Compilers
- Compatibility
  - Source level
    - Supports many gcc/g++ language extensions and annotations
  - Binary level
    - Link objects from gcc/g++ and XL C/C++
- gxlc and gxlc++ utilities for compile option mapping
  - Controlled by the gxlc.cfg configuration file, which maps GCC options to XL C/C++ options
  - Modify the contents of gxlc.cfg to meet your specific compilation requirements
gxlc, gxlc++, gxlC
Syntax:
gxlc | gxlc++ | gxlC  -v -Wx,<XL compiler options> <gcc/g++ options> filename
Creating a customized configuration file: use the XLC_USR_CONFIG environment variable to specify the location of your own configuration file.
gxlc.cfg format:
abcd gcc_or_g++_option "xlc_or_xlc++_option"
e.g.
nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__
nnn -B -B
nnn -C -C
nnn -c -c
nnn -dM -qshowmacros
nnn -D -D
E E
XML Compiler Transformation Reports
- Generate compilation reports consumable by other tools
  - Enable better visualization and analysis of compiler information
  - Help users do manual performance tuning
  - Help automatic performance tuning through performance tool integration
- Unified report from all compiler subcomponents and analysis
  - Compiler options
  - Pseudo-sources
  - Compiler transformations, including missed opportunities
- Consistent support among Fortran and C/C++
- Controlled under option:
  - -qlistfmt=xml=inlines generates inlining information
  - -qlistfmt=xml=transform generates loop transformation information
  - -qlistfmt=xml=data generates data reorganization information
  - -qlistfmt=xml=pdf generates dynamic profiling information
  - -qlistfmt=xml=all turns on all optimization content
  - -qlistfmt=xml=none turns off all optimization content
Compiler Transformation Report Contents
- Program characteristics
  - Compiler version
  - Date of compilation
  - Source file table
  - Function table
  - Loop table (line number, nest level, iteration count, loop attributes)
  - Transformation table (line number, transformation description)
  - Pseudocode
- Transformations at both high-level and low-level optimizations
  - Intra-procedural transformations
    - Loop transformations
    - Data prefetch
    - Vectorization and SIMDization
    - Parallelization
    - Instruction scheduling
  - Inter-procedural transformations
    - Inlining
    - Data reorganization
- Profiling information
  - Basic block counters
  - Call counters
  - Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
- Compiler Optimizations for Performance
  - Compiler static analysis and optimization
  - User-guided optimizations through compiler options and directives
  - Automatic compiler optimizations through profile-directed feedback
- XL Compiler and Tooling Integration
  - Compiler feedback view in PTP
  - Compiler transformation reports and the HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Diagram: source code -> compile and link with -qpdf1 (static analysis, profile-based refinement) -> instrumented application, run with sample inputs -> profile data -> compile and link with -qpdf2 (profile-directed optimizations) -> optimized application]
- Hardware and software constraints
- Multiple sample runs for different hardware performance events
- Profile-based instrumentation refinement
Basic Block and Call Counter Information
Step 1: Compile the application with -qpdf1 to generate an instrumented executable
Step 2: Run the executable with a typical input data set to gather profiling information
Step 3: Recompile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters:
Call Name     Call Execution Count   Line
smtctl_       0                      79
is_smt_on_    1                      55
jbind_        1                      109
jbind_        0                      108
aff_          1                      38

Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters:
Block Index   Block Execution Count   Start Line   End Line
3             5                       1            33
4             5                       33           34
…
Cache Miss Information
[Figure callouts: delinquent load, source code location, cache miss]
Columns: Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
3 | 408 | 2 | 17446 | 24

((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
3 | 412 | 2 | 9999 | 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
3 | 413 | 2 | 9974 | 14

((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
3 | 525 | 2 | 1033429 | 6

((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
3 | 557 | 2 | 11553 | 16
Loop information
[Figure callouts: source location; loop iteration count based on static analysis or dynamic profiling]
Loop Table
Columns: Loop Index | Start Line | End Line | Parent Loop Index | Nest Level | Minimum Cost | Maximum Cost | Iteration Count | Attributes
(Some cells are empty in the extracted rows below, so the values are listed in their original order.)

1  203  1  19630  19630  75 (array)
   Attributes: well behaved, bump normalized, lower bound normalized
2  188  1  20413  20413  149 (array)
   Attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
3  141  1  13300  13300  100 (default)
   Attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
4  203  1  19630  19630  5 (PDF)
   Attributes: residual, well behaved, bump normalized, guarded, lower bound normalized
Loop Transformation Reports
Seq  Type                         Phase                 Region  Line  Loop Index  Description                        Attributes
2    LoopVector (success)         High Level Optimizer  4       217   1           Loop vectorization was performed   not available
3    LoopFusion (success)         High Level Optimizer  4       108   3           Loops were fused                   Loop Line Number 108; Loop Line Number 206
4    LoopVectorVersion (success)  High Level Optimizer  4       108   3           Vector versioning was performed    not available
20   ModuloSchedule (success)     Low Level Optimizer   12      3499  26          Loop was modulo scheduled          Initiation Interval 12
Performance Tuning with Compiler Transformation Reports
-qlistfmt=xml=all

Original source file (file.c):
    foo (float *p, float *q, float *r, int n) {
      for (int i=0; i<n; i++) p[i] = p[i] + q[i]*r[i];
    }
Transformation report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.

Tuning: modified source file (file.c):
    foo (float * restrict p, float * restrict q, float * restrict r, int n) {
      for (int i=0; i<n; i++) p[i] = p[i] + q[i]*r[i];
    }
Transformation report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
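The tuned version can be written as a compilable C99 function. This is a minimal sketch: the name foo and its parameters follow the slide, while the void return type is an assumption, since the slide omits it.

```c
/* restrict tells the compiler that p, q, and r never overlap, which
   removes the aliasing dependence reported for the original loop.
   The void return type is an assumption; the slide does not show one. */
void foo(float *restrict p, float *restrict q, float *restrict r, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}
```

The promise is the caller's responsibility: passing overlapping arrays to a function declared this way is undefined behavior.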
Explicit SIMD programming for POWER7
Enabled under -qaltivec
- Successor to the AltiVec programming extensions on POWER6/PPC970
  - AltiVec data types (16-byte vectors)
    - vector char: 16 elements
    - vector short: 8 elements
    - vector pixel: 8 elements
    - vector int: 4 elements
    - vector float: 4 elements
  - VSX AltiVec extensions (16-byte vectors)
    - vector double: 2 elements
    - vector long long: 2 elements
- AltiVec built-in functions extended to the new data types
  - vec_add(vector double, vector double)
  - vec_sub(vector long long, vector long long)
- New vector operations: vec_mul, vec_div, …
- Unaligned load and store operations
  - AltiVec truncating loads/stores still available: vec_ld, vec_st
  - New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
- Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL, and COMPLEX
- Features
  - Basic-block-level SIMDization
  - Loop-level aggregation
  - Data conversion, reduction
  - Loops with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support for unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report message, followed by the suggested user actions:
- Loop was SIMD vectorized
- It is not profitable to vectorize
  - Use pragma simd_level(10) to force the compiler to do SIMDization
- memory accesses have non-vectorizable alignment
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
- data dependence prevents SIMD vectorization
  - Use fewer pointers when possible
  - Use pragma independent if the loop has no loop-carried dependency
  - Use pragma disjoint (a, b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
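The alignment advice can be sketched in C as follows. The array samples and the helpers is_aligned16 and scale_samples are hypothetical names chosen for illustration; only __attribute__((aligned(n))) comes from the slide, and the sketch assumes a compiler that accepts that GCC-style attribute.

```c
#include <stdint.h>

/* 16-byte alignment matches the 16-byte VMX/VSX vector width.
   The array name is hypothetical. */
static float samples[1024] __attribute__((aligned(16)));

/* Returns 1 when ptr sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *ptr)
{
    return ((uintptr_t)ptr % 16u) == 0;
}

/* Scales the first n elements of the aligned array in place; because the
   base address is known to be aligned, a vectorizer does not need a
   peeling or unaligned-access path for it. */
void scale_samples(float factor, int n)
{
    for (int i = 0; i < n; i++)
        samples[i] *= factor;
}
```

Declaring the alignment on the array itself, rather than on a pointer to it, keeps the guarantee visible wherever the array is referenced directly.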
SIMDization Tuning
Transformation report message, followed by the suggested user actions:
- memory accesses have non-vectorizable strides
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time-known stride information
  - Stride versioning
- either operation or data type is not suitable for SIMD vectorization
  - Do statement splitting and loop splitting
- loop structure prevents SIMD vectorization
  - Convert while-loops into do-loops when possible
  - Limit the use of control flow in a loop
  - Use MIN and MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
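As an illustration of replacing if-then-else with MIN and MAX, the sketch below clamps an array both ways. The function names and the MIN/MAX macros are assumptions made for this example; the slide may equally have had the Fortran MIN and MAX intrinsics in mind.

```c
/* Hypothetical helper macros; C has no built-in MIN/MAX. */
#define MIN(x, y) ((x) < (y) ? (x) : (y))
#define MAX(x, y) ((x) > (y) ? (x) : (y))

/* Branchy form: control flow inside the loop body. */
void clamp_branchy(float *restrict out, const float *restrict in,
                   float lo, float hi, int n)
{
    for (int i = 0; i < n; i++) {
        if (in[i] < lo)
            out[i] = lo;
        else if (in[i] > hi)
            out[i] = hi;
        else
            out[i] = in[i];
    }
}

/* MIN/MAX form: a single expression per element, which compilers can
   typically map to select or min/max instructions. */
void clamp_minmax(float *restrict out, const float *restrict in,
                  float lo, float hi, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = MIN(MAX(in[i], lo), hi);
}
```

Both functions produce the same values for ordinary inputs with lo <= hi; the second simply expresses the computation without branches.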
MASS enhancements and Auto-vectorization
- MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
    - Internally exploits VSX instructions
    - SP: average speedup of 1.99 vs. POWER5 MASSV
    - DP: average speedup of 1.27 vs. POWER5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
    - Tuned math routines operating on vector data types
    - Over 35 frequently used mathematical functions
    - Both single and double precision
    - To be used in conjunction with explicit SIMD programming
- Auto-vectorization at optimization level -O3 or above
  - -qstrict=vectorprecision to maintain precision over all loop iterations
- Example: the loop
    for (i=0; i<n; i++)
      b[i]=sqrt(a[i]);
  is replaced by the call
    __vsqrt_P7(b, a, n);
  Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
- Software control over the POWER7 prefetch engine, which supports up to 12 data streams
- Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride, and prefetch depth, at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
- Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
- Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
- Partial cache line touch
    void __partial_dcbt(void *address)
- Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
- Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
Loop to be prefetched:
    for (i=0; i<n; i++) a[i] = b[i] +
Store stream prefetch for array a:
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
Transient stream prefetch for array b:
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
Start stream prefetch:
    __eieio();
    __protected_stream_go();
[Figure callouts on the arguments: stream direction (FORWARD), stream length (n*sizeof(double)/128), prefetch depth (DEEPER), stream id (11 and 0)]
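The stream length argument in the example is plain arithmetic: the byte size of the array divided by 128, which matches the 128-byte POWER7 cache line. A small sketch of that calculation follows; the function name and the CACHE_LINE_BYTES macro are hypothetical, and none of the XL prefetch built-ins are called here.

```c
#include <stddef.h>

/* 128 is the divisor used on the slide; the macro name is hypothetical. */
#define CACHE_LINE_BYTES 128

/* Number of whole 128-byte units covered by an array of n doubles,
   i.e. the value of n*sizeof(double)/128 from the example. */
size_t stream_length_in_lines(size_t n)
{
    return n * sizeof(double) / CACHE_LINE_BYTES;
}
```

For instance, with 8-byte doubles an array of 1024 elements occupies 8192 bytes, which is 64 such units. The integer division truncates, so a trailing partial line is not counted.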
Loop Optimization
- Traditional unimodular loop transformations for perfect regular loop nests
  - Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, and loop unrolling
  - Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
  - Available at optimization level -O3, -O3 -qhot, or above
- Polyhedral framework for any loop nests
  - Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
    - Loop skewing, loop tiling for triangular loop shapes
  - Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
- Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot
Loop transformations
- Tiling of triangular matrix multiplication
    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) += a(k,i)*b(j,k)
- Also allows transformation of imperfect loop nests
  - Intervening code between loops
- Only available at -qhot=level=2
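The j/k interchange in the dependence-analysis example can be checked by hand with a small C translation. This is a sketch under stated assumptions: the size N, the macro B (which mimics Fortran column-major storage of b), and both function names are hypothetical. Within one iteration of i, the loop reads c(1..i) and writes c(i+1..N), so the two ranges never overlap and the interchange is legal.

```c
#define N 6  /* hypothetical problem size; index 0 is unused to keep 1-based indexing */

/* Column-major accessor mirroring Fortran b(j,k): j is the fastest-varying index. */
#define B(b, j, k) ((b)[(k) * (N + 1) + (j)])

/* Original order (j outer, k inner): b is walked with stride N+1. */
void nest_original(double *c, const double *b)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= N; k++)
                c[k] = c[j] + B(b, j, k);
}

/* Interchanged order (k outer, j inner): b is walked with stride 1.
   For each k the last value stored still comes from j == i, so the
   result is identical to the original order. */
void nest_interchanged(double *c, const double *b)
{
    for (int i = 1; i <= N; i++)
        for (int k = i + 1; k <= N; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + B(b, j, k);
}
```

Running both versions on the same inputs and comparing c element by element is a quick way to confirm that the reordering preserves the result.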
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j …
      for k …
        A[j] = B[k]
        for i …
          C[i][k] = …
    for j …
      for k …
        M[j] = N[k]
        for i …
          X = C[i][k]
Transformations applied for parallelism and locality:
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Output: optimized loops
    for jT …
      omp parallel for
      for kT …
        for j …
          for k …
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = …
              X = C[i][k]
Data Reorganization
- Data reorganization transformations to reduce memory latency
  - Context-sensitive and context-insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing, and data interleaving to reshape data layout
- Enabled at -O5
Data Reorganization Report
Columns: Seq | Type | Phase | Data Name | Category | Region | Line | Description
1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated
[Figure: data reorganization for data locality and cache utilization]
- Array splitting: an array of structures S[i] with fields F0, F1, F2, F3 is split into separate arrays F0[i], F1[i], F2[i], F3[i]
- Array merging: separate arrays A[0], A[1], A[2], A[3] are merged into one array with elements A[i][0], A[i][1], A[i][2], A[i][3]
- Array transposing: A[i][j] is stored as A'[j][i]
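Array splitting can be illustrated by doing by hand what the slide says the compiler does automatically at -O5. In this sketch the type names S and Split, the field names f0 to f3, the size COUNT, and the three functions are all hypothetical.

```c
#define COUNT 4  /* hypothetical number of elements */

/* Array-of-structures layout: the four fields of one element are adjacent. */
struct S {
    double f0, f1, f2, f3;
};

/* Split layout: one array per field, so a loop over f0 touches only f0 data. */
struct Split {
    double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT];
};

/* Copies each field of the structure array into its own array. */
void split_array(const struct S *s, struct Split *out)
{
    for (int i = 0; i < COUNT; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Sums f0 in the original layout: stride of sizeof(struct S) bytes. */
double sum_f0_aos(const struct S *s)
{
    double sum = 0.0;
    for (int i = 0; i < COUNT; i++)
        sum += s[i].f0;
    return sum;
}

/* Sums f0 in the split layout: stride-one access. */
double sum_f0_split(const struct Split *t)
{
    double sum = 0.0;
    for (int i = 0; i < COUNT; i++)
        sum += t->f0[i];
    return sum;
}
```

When a hot loop reads only one field, the split layout brings only useful data into each cache line, which is the locality benefit the figure points at.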
User Explicit Parallelization with OpenMP
- Full OpenMP 3.0 implementation for C, C++, and Fortran
  - Full OpenMP task parallelization
  - Privatization of Fortran descriptor-based arrays
- Efficient Threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
  - Bypasses the expensive pthread key mechanism
  - pthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
- Improved interaction between OpenMP and automatic SIMD
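A minimal sketch of user-explicit parallelization with a standard OpenMP directive is shown below. The function name dot is hypothetical; the pragma is standard OpenMP, and a compiler that is not asked to enable OpenMP simply ignores it and runs the loop serially.

```c
/* Dot product with an OpenMP reduction. Each thread accumulates a private
   partial sum, and the partial sums are combined at the end of the loop. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

With the XL compilers the directive would be honored under -qsmp=omp, as listed on the tuning slide; note that a parallel reduction may add the terms in a different order than the serial loop.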
Automatic parallelization
- Improvements to automatic parallelization
  - More effective array data flow analysis for array privatization
  - Automatic privatization of descriptor-based Fortran arrays
  - Runtime dependence testing
- SMP runtime improvements
  - Leverage of TLS in the SMP runtime implementation
- Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
- The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs
- Some suboptions of interest
  - spins and yields to define the behavior of idle threads
  - Thread binding using the startproc and stride suboptions
  - New bind suboption on AIX:
      bind=SDL=<start resource>,<number of resources>,<stride>
      bindlist=SDL=i0,i1,…,ix
  - schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
    - Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
- Single control knob to enable inlining
  - Simplifies inlining control for the programmer
- -qinline=level=X, X = 0 to 10 (default 5)
  - Consistent across all languages and optimization levels
  - Previous mechanisms still available
- User inline control with -qinline+|-<function_name>
- Automatic inlining before loop optimization
  - Previously only available at -O5 or for user inlining in C++
  - Available at all levels of -qhot (default at -O3 and up)
  - Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
- Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
  - Use of alternate math libraries
- -qstrict guarantees results identical to noopt at the expense of optimization
  - Suboptions allow fine-grained control over this guarantee
  - Examples:
    -qstrict=precision        Strict FP precision
    -qstrict=exceptions       Strict FP exceptions
    -qstrict=ieeefp           Strict IEEE FP implementation
    -qstrict=nans             Strict generation and computation of NaNs
    -qstrict=order            Do not modify evaluation order
    -qstrict=vectorprecision  Maintain precision over all loop iterations
- Suboptions can be combined: -qstrict=precision:nonans
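Why evaluation order matters for floating-point results can be seen in a few lines of C. This is an illustrative sketch with hypothetical function names; it assumes IEEE double arithmetic and a compiler that does not reassociate the expressions itself.

```c
/* Adds left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Adds right to left: a + (b + c). */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e30, b = -1e30, and c = 1.0, the first form cancels the large terms before adding c, while in the second form c is absorbed by rounding when it is added to b. An optimizer that is free to reorder additions can therefore change the result, which is the kind of effect -qstrict=order is meant to rule out.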
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
- Request for a feature to be supported by our compilers
  - C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
  - Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
  - Or send e-mail to xl_feature@ca.ibm.com
Documentation
- An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
- Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
- Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
- Please send any comments or suggestions on this information center, or about the existing C, C++, or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Major Features of XLC111 XLF131POWER7 support and exploitationLanguage standard conformance
ndashFull Fortran 2003ndashFull OpenMP 30 ndashAdditional C++0X features
Optimization enhancementsndashAutomatic parallelizationndashInliningndashLoop analysis and transformationsndashDelinquent load driven optimizationsndashProfile-directed feedback
Productivity enhancementndashXML compiler transformation reports
Other featuresndashFine-grained strict controlndashProPolicendashfunc_tracendashOther options and directives
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
HPC Performance Tuning with XL Compilers
-O3 ndashqarch=pwr7 or
ndashO3 ndashqhot ndashqarch=pwr7
with ndashqnostrict or ndashqstrict
Profiling for hot spot detection
Compiler instrumentation ndashqpdf1=level=12 pdf2
-pg for gprofxprofiler -qlist for tprof
User-provided profile functions -qfunctrace
SIMDization
Automatic SIMDization ndashO3 or above with ndashqsimd
User explicit SIMD program -qaltivec
Loop transformations
Loop transformations ndashO3 or above
Whole program optimizations
-O4 or ndashO5 for inter-procedural optimization inlining code partition data reorganization
Parallelization
User explicit parallelization only -qsmp=omp
Auto parallelization -qsmp (-qsmp=auto)
Polyhedral framework-qsmp with -qhot=level=2
XML Transformation Reports
-qlistfmt=xml=all
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Migration from GCC to IBM XL Compilers
CompatibilityndashSource level
bull Supports many gccg++ language extensions and annotations
ndashBinary levelbull Link objects from gccg++ and XL CC++
gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++
ndashModify the contents of the gxlccfg to meet your specific compilation requirements
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
gxlc gxlc++ gxlC
gxlc
gxlc++
gxlC
-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename
Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)
gxlccfg format
abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo
eg
nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__
nnn -B -B
nnn -C -C
nnn -c -c
nnn -dM -qshowmacros
nnn -D -D
E E
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool
integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities
Consistent support among Fortran CC++ Controlled under option
-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content
Compiler Transformation Report Contents
Program characteristics
- Compiler version
- Date of compilation
- Source file table
- Function table
- Loop table (line number, nest level, iteration count, loop attributes)
- Transformation table (line number, transformation description)
- Pseudocode
Transformations at both high- and low-level optimizations
- Intra-procedural transformations
    - Loop transformations
    - Data prefetch
    - Vectorization and SIMDization
    - Parallelization
    - Instruction scheduling
- Inter-procedural transformations
    - Inlining
    - Data reorganization
Profiling information
- Basic block counters
- Call counters
- Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
Compiler Optimizations for Performance
- Compiler static analysis and optimization
- User guided optimizations through compiler options and directives
- Automatic compiler optimizations through profile directed feedback
XL Compiler and Tooling Integration
- Compiler feedback view in PTP
- Compiler transformation reports and HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Figure: SOURCE CODE is compiled and linked with -qpdf1 (static analysis, profile based refinement) to produce the INSTRUMENTED APPLICATION; running it on SAMPLE INPUTS produces PROFILE DATA; compiling and linking with -qpdf2 (profile directed optimizations) produces the OPTIMIZED APPLICATION.]
- Hardware and software constraints
- Multiple sample runs for different hardware performance events
- Profile based instrumentation refinement
Basic Block and Call Counter Information
Step 1: compile the application with -qpdf1 to generate an instrumented executable
Step 2: run the executable with a typical input data set to gather profiling information
Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters:
    Call Name     Call Execution Count   Line
    smtctl_       0                      79
    is_smt_on_    1                      55
    jbind_        1                      109
    jbind_        0                      108
    aff_          1                      38

Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters:
    Block Index   Block Execution Count   Start Line   End Line
    3             5                       1            33
    4             5                       33           34
    ……
Cache Miss Information
Columns: Memory Reference, Region, Line, Cache Level, Miss Count, Miss Rate

Memory reference: ((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
    Region 3, Line 408, Cache Level 2, Miss Count 17446, Miss Rate 24
Memory reference: ((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
    Region 3, Line 412, Cache Level 2, Miss Count 9999, Miss Rate 14
Memory reference: ((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
    Region 3, Line 413, Cache Level 2, Miss Count 9974, Miss Rate 14
Memory reference: ((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
    Region 3, Line 525, Cache Level 2, Miss Count 1033429, Miss Rate 6
Memory reference: ((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
    Region 3, Line 557, Cache Level 2, Miss Count 11553, Miss Rate 16

Callouts: delinquent load; source code location; cache miss
Loop information
Loop Table
Columns: Loop Index, Start Line, End Line, Parent Loop Index, Nest Level, Minimum Cost, Maximum Cost, Iteration Count, Attributes
    1  203  1  19630  19630  75 (array)     well behaved; bump normalized; lower bound normalized
    2  188  1  20413  20413  149 (array)    perfect nest; well behaved; bump normalized; guarded; lower bound normalized
    3  141  1  13300  13300  100 (default)  perfect nest; well behaved; bump normalized; guarded; lower bound normalized
    4  203  1  19630  19630  5 (PDF)        residual; well behaved; bump normalized; guarded; lower bound normalized
Callouts: source location; loop iteration count based on static analysis or dynamic profiling
Loop Transformation Reports
    Seq  Type                         Phase                 Region  Line  Loop Index  Description                       Attributes
    2    LoopVector (success)         High Level Optimizer  4       217   1           Loop vectorization was performed  not available
    3    LoopFusion (success)         High Level Optimizer  4       108   3           Loops were fused                  Loop Line Number 108; Loop Line Number 206
    4    LoopVectorVersion (success)  High Level Optimizer  4       108   3           Vector versioning was performed   not available
    20   ModuloSchedule (success)     Low Level Optimizer   12      3499  26          Loop was modulo scheduled         Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)

Original source file (file.c):
    void foo(float *p, float *q, float *r, int n) {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
    }
Transformation report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.

Tuning: qualify the pointer parameters with restrict.

Modified source file (file.c):
    void foo(float * restrict p, float * restrict q, float * restrict r, int n) {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
    }
Transformation report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
Explicit SIMD programming for POWER7
Enabled under -qaltivec
Successor to Altivec programming extensions on POWER6/PPC970
- Altivec data types: 16-byte vectors
    vector char    16 elements
    vector short   8 elements
    vector pixel   8 elements
    vector int     4 elements
    vector float   4 elements
- VSX Altivec extensions: 16-byte vectors
    vector double     2 elements
    vector long long  2 elements
Altivec built-in functions extended to new data types
    vec_add(vector double, vector double)
    vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, …
Unaligned load and store operations
- Altivec truncating loads/stores still available: vec_ld, vec_st
- New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
Automatic SIMDization for VMX and VSX
- Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
Features
- Basic block level SIMDization
- Loop level aggregation
- Data conversion, reduction
- Loop with limited control flow
- Automatic SIMDization with -qstrict (VSX) and -qnostrict
- Support of unaligned vector memory accesses (VSX)
- Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report message, followed by the user actions for it:

Loop was SIMD vectorized

It is not profitable to vectorize
- Use pragma simd_level(10) to force the compiler to do SIMDization

data dependence prevents SIMD vectorization
- Use fewer pointers when possible
- Use pragma independent if the loop has no loop-carried dependency
- Use pragma disjoint (a, b) if a and b are disjoint
- Use the restrict keyword or the compiler option -qrestrict

memory accesses have non-vectorizable alignment
- Use __attribute__((aligned(n))) to set data alignment
- Use __alignx(16, a) to indicate the data alignment to the compiler
- Use -qassert=refalign if all references are naturally aligned
- Use array references instead of pointers where possible
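
As a rough illustration of two of the user actions above (setting data alignment and ruling out aliasing), the following C sketch may help. The names saxpy, x and y are hypothetical and not from the slides; the assert is only a sanity check of the alignment assumption, and whether a given compiler actually SIMDizes the loop still has to be confirmed in the transformation report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 1024

/* 16-byte alignment matches the 16-byte vector size used by VMX/VSX. */
static float x[N] __attribute__((aligned(16)));
static float y[N] __attribute__((aligned(16)));

/* restrict promises that dst and src never overlap, which removes the
   aliasing dependence that can otherwise block SIMD vectorization. */
void saxpy(float * restrict dst, const float * restrict src, float k, size_t n)
{
    /* Hypothetical guard: checks the alignment assumption at run time. */
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + k * src[i];
}
```

A caller would fill x and y and then call saxpy(x, y, k, N); passing pointers that do overlap would break the restrict promise.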
SIMDization Tuning
Transformation report message, followed by the user actions for it:

memory accesses have non-vectorizable strides
- Loop interchange for stride-one accesses when possible
- Data layout reshape for stride-one accesses
- Higher optimization to propagate compile-time known stride information
- Stride versioning

either operation or data type is not suitable for SIMD vectorization
- Do statement splitting and loop splitting

loop structure prevents SIMD vectorization
- Convert while-loops into do-loops when possible
- Limited use of control flow in a loop
- Use MIN, MAX instead of if-then-else
- Eliminate function calls in a loop through inlining
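
A small C sketch of two of these actions, loop interchange for stride-one access and MIN/MAX in place of if-then-else, is given below. The function names, the MIN/MAX macros and the array shape are illustrative assumptions, not taken from the slides. The two versions agree only when lo <= hi, which the asserts check.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4
#define COLS 8

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Before: the inner loop walks down a column of a row-major C array,
   so consecutive iterations are COLS elements apart, and the body
   uses if-then-else control flow. */
void clamp_strided(float m[ROWS][COLS], float lo, float hi)
{
    assert(lo <= hi);
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++) {
            if (m[i][j] < lo)
                m[i][j] = lo;
            else if (m[i][j] > hi)
                m[i][j] = hi;
        }
}

/* After: loops interchanged so the inner loop is stride-one, and the
   branches are replaced by MIN/MAX. */
void clamp_unit_stride(float m[ROWS][COLS], float lo, float hi)
{
    assert(lo <= hi);
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            m[i][j] = MIN(MAX(m[i][j], lo), hi);
}
```

Both functions produce the same array contents; only the traversal order and the form of the loop body differ.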
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
- POWER7 vector MASS library (libmassvp7.a)
    - Internally exploits VSX instructions
    - SP: average speedup of 1.99 vs Power5 MASSV
    - DP: average speedup of 1.27 vs Power5 MASSV
- POWER7 SIMD MASS library (libmass_simdp7.a)
    - Tuned math routines operating on vector data types
    - Over 35 frequently used mathematical functions
    - Both single and double precision
    - To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above; -qstrict=vectorprecision to maintain precision over all loop iterations
    for (i=0; i<n; i++)
      b[i] = sqrt(a[i]);
is transformed into
    __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, supporting up to 12 data streams
Fine grained software controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
- More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
Partial cache line touch
    void __partial_dcbt(void *address)
Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching

    /* Store stream prefetch for array a */
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    /* Start stream prefetch */
    __protected_stream_go();

    for (i=0; i<n; i++) a[i] = b[i] +

Arguments: FORWARD is the stream direction; n*sizeof(double)/128 is the stream length; DEEPER is the prefetch depth; 11 and 0 are the stream IDs.
Loop Optimization
Traditional unimodular loop transformations for perfect regular loop nests
- Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
- Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
- Under the optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
- Provide abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
    - Loop skewing, loop tiling for triangular loop shapes
- Perform exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks, at all hot levels (-O3 -qhot or above)
Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1,N
      do j = 1,i
        do k = i+1,N
          c(k) = c(j) + b(j,k)
Enable interchange of j and k loops to improve access locality for b
- Identifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplication
    do i = 1,N
      do j = i+1,N
        do k = i+1,N
          c(j,i) += a(k,i)*b(j,k)
Also allows transformation of imperfect loop nests
- Intervening code between loops
Only available at -qhot=level=2
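
The dependence-analysis example can be checked by hand with a small C sketch. This is an illustration under assumptions, not compiler output: indices are shifted to 0-based, the Fortran array b(j,k) is column-major and is therefore modelled here as bt[k][j], and all function names are hypothetical. The interchange is legal because the loop writes c[k] only for k > i and reads c[j] only for j <= i.

```c
#include <assert.h>

#define N 6

/* Original order: k is innermost, so bt[k][j] is read with stride N. */
void kernel_original(double c[N], double bt[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++)
            for (int k = i + 1; k < N; k++)
                c[k] = c[j] + bt[k][j];
}

/* j and k interchanged: j is innermost, so bt[k][j] is read stride-one. */
void kernel_interchanged(double c[N], double bt[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = i + 1; k < N; k++)
            for (int j = 0; j <= i; j++)
                c[k] = c[j] + bt[k][j];
}

/* Runs both orders on identical inputs and asserts equal results. */
void check_interchange(void)
{
    double bt[N][N], c1[N], c2[N];
    for (int k = 0; k < N; k++) {
        c1[k] = c2[k] = 1.0 + k;
        for (int j = 0; j < N; j++)
            bt[k][j] = 0.5 * k + 0.25 * j;
    }
    kernel_original(c1, bt);
    kernel_interchanged(c2, bt);
    for (int k = 0; k < N; k++)
        assert(c1[k] == c2[k]);
}
```

Calling check_interchange() exercises both loop orders; for each i, the last value stored into c[k] is c[i] + bt[k][i] in either order, which is why the results match.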
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j …
      for k …
        A[j] = B[k]
        for i …
          C[i][k] = …
    for j …
      for k …
        M[j] = N[k]
        for i …
          X = C[i][k]
Transformations for parallelism and locality
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Output: optimized loops
    for jT …
      omp parallel for
      for kT …
        for j …
          for k …
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = …
              X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context sensitive and insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing, data interleaving to reshape data layout
Enabled at -O5

Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
    1   ArraySplitting   High Level Optimizer  iplus  9  An array of a large aggregated data-type was split into multiple arrays of smaller data-types
    27  ArrayCoalescing  High Level Optimizer  net       Global variables were aggregated
[Figure: data reorganization examples. Array splitting: an array of structures S[0..3] with fields F0..F3, and the separate field arrays F0[i], F1[i], F2[i], F3[i]. Array merging: arrays A[0]..A[3] and a two-dimensional array A[i][j]. Array transposing: A[i][j] and the transposed A'[i][j].]
Data locality, cache utilization
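
A hand-written C sketch of the array splitting idea may make the layout change concrete. The type and function names (S, S_split, split_array, sum_f0_aos, sum_f0_split) are hypothetical; the compiler performs this kind of transformation internally, and this only illustrates the before and after layouts.

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Before: array of structures. Reading only f0 skips over f1..f3,
   so each access is sizeof(struct S) bytes apart. */
struct S { double f0, f1, f2, f3; };

/* After splitting: one array per field, so f0 values are contiguous. */
struct S_split { double f0[N], f1[N], f2[N], f3[N]; };

void split_array(const struct S in[N], struct S_split *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < N; i++) {
        out->f0[i] = in[i].f0;
        out->f1[i] = in[i].f1;
        out->f2[i] = in[i].f2;
        out->f3[i] = in[i].f3;
    }
}

/* Strided traversal of one field in the original layout. */
double sum_f0_aos(const struct S s[N])
{
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        sum += s[i].f0;
    return sum;
}

/* Stride-one traversal of the same field after splitting. */
double sum_f0_split(const struct S_split *s)
{
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        sum += s->f0[i];
    return sum;
}
```

Both sums are equal; the split layout touches only the cache lines that hold f0, which is the data locality benefit the slide refers to.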
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS supported Thread-Local Storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
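
For readers new to OpenMP, a minimal C sketch of user explicit parallelization follows. The function name sum_squares is hypothetical. When OpenMP is not enabled at compile time the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <assert.h>

/* Sum of squares 1^2 + ... + n^2. With OpenMP enabled, iterations are
   divided among threads and the per-thread partial sums are combined
   by the reduction clause. */
long sum_squares(int n)
{
    assert(n >= 0);
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++)
        total += (long)i * i;
    return total;
}
```

With the XL compilers this would be built with -qsmp=omp, the option the deck lists for user explicit parallelization.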
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime-dependence testing
SMP runtime improvements
- Leverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs
Some suboptions of interest
- spins and yields to define the behavior of idle threads
- Thread binding using the startproc and stride suboptions
- New bind suboption on AIX:
    bind=SDL=<start resource>,<number of resources>,<stride>
    bindlist=SDL=i0,i1,…,ix
- schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
    - Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
- Simplifies inlining control for the programmer
-qinline=level=X, X=0..10 (default 5)
- Consistent across all languages and optimization levels
- Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
- Previously only available at -O5 or user inlining on C++
- Available at all levels of -qhot (default at -O3 and up)
- Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees identical results to noopt, at the expense of optimization
- Suboptions allow fine-grain control over this guarantee
- Examples:
    -qstrict=precision        Strict FP precision
    -qstrict=exceptions       Strict FP exceptions
    -qstrict=ieeefp           Strict IEEE FP implementation
    -qstrict=nans             Strict generation and computation of NaNs
    -qstrict=order            Do not modify evaluation order
    -qstrict=vectorprecision  Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
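
Why evaluation order matters can be seen in a few lines of C. The sketch below is illustrative only, with hypothetical function names; it assumes IEEE double arithmetic and a compiler that is not allowed to reassociate (the situation -qstrict=order is meant to preserve).

```c
#include <assert.h>

/* Evaluate a + b + c left to right. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Evaluate the same sum with the right-hand pair first. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* Floating-point addition is not associative: the small term is
   absorbed when it is added to the large one first. */
void check_reassociation(void)
{
    assert(sum_left(1e20, -1e20, 1.0) == 1.0);
    assert(sum_right(1e20, -1e20, 1.0) == 0.0);
}
```

An optimizer that is free to reorder the additions may turn one form into the other, which is the kind of result change the order suboption rules out.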
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family", available at http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
© Copyright IBM Corporation 2011. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, the on-demand business logo, Rational, the Rational logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
HPC Performance Tuning with XL Compilers
-O3 -qarch=pwr7 or -O3 -qhot -qarch=pwr7, with -qnostrict or -qstrict
Profiling for hot spot detection
- Compiler instrumentation: -qpdf1=level=1|2, -qpdf2
- -pg for gprof/xprofiler, -qlist for tprof
- User-provided profile functions: -qfunctrace
SIMDization
- Automatic SIMDization: -O3 or above with -qsimd
- User explicit SIMD program: -qaltivec
Loop transformations
- Loop transformations: -O3 or above
Whole program optimizations
- -O4 or -O5 for inter-procedural optimization: inlining, code partition, data reorganization
Parallelization
- User explicit parallelization only: -qsmp=omp
- Auto parallelization: -qsmp (-qsmp=auto)
- Polyhedral framework: -qsmp with -qhot=level=2
XML Transformation Reports
- -qlistfmt=xml=all
Migration from GCC to IBM XL Compilers
Compatibility
- Source level
    - Supports many gcc/g++ language extensions and annotations
- Binary level
    - Link objects from gcc/g++ and XL C/C++
gxlc and gxlc++ utilities for compile option mapping
- Controlled by the gxlc.cfg configuration file for option mappings from GCC to XL C/C++
- Modify the contents of gxlc.cfg to meet your specific compilation requirements
gxlc, gxlc++, gxlC
Syntax:
    gxlc | gxlc++ | gxlC  -v  -Wx,<XL compiler options>  <gcc or g++ options>  filename
Creating a customized configuration file: use the XLC_USR_CONFIG environment variable to specify the location of your own configuration file.
gxlc.cfg format:
    abcd  gcc_or_g++_option  "xlc_or_xlc++_option"
e.g.
    nnnc  -ansi  -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__
    nnn   -B     -B
    nnn   -C     -C
    nnn   -c     -c
    nnn   -dM    -qshowmacros
    nnn   -D     -D
    E E
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool
integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities
Consistent support among Fortran CC++ Controlled under option
-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode
Transformations at both high and low-level optimizationsndash Intra-procedural transformations
bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling
ndash Inter-procedural transformationsbull Inliningbull Data reorganization
Profiling informationndash Basic block countersndash Call countersndash Cache miss counters
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XL Compiler Assisted Performance Analysis and Tuning
Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options
and directivesndash Automatic compiler optimizations through profile
directed feedback
XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help
detect bottlenecks and identify solutions
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Compiler Feedback View
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SOURCE CODE
COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement
COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations
INSTRUMENTEDAPPLICATION
OPTIMIZEDAPPLICATION
PROFILE DATA
SAMPLE INPUTS
SAMPLE INPUTS
Multiple-pass Dynamic Profiling Infrastructure
Hardware and software constraints
Multiple sample runs for different hardware performance events
Profile based instrumentation refinement
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Basic Block and Call Counter Information
Call Counter InformationRegion 1
Region Execution Count
1
Call Coverage 32 49
Call Counters
Call Name Call Execution Count
Line
smtctl_ 0 79
is_smt_on_ 1 55
jbind_ 1 109
jbind_ 0 108
aff_ 1 38
Step 1 compile the application with ndashqpdf1 to generate an instrumented executable
Step 2 run the executable with typical input data set to gather profiling information
Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report
Region 1
Region Execution Count 5
Block Coverage 81 81
Block Counters
Block Index
Block Execution Count
Start Line
End Line
3 5 1 33
4 5 33 34
Basic Block Counter Information
helliphellip
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Cache Miss Information
Columns: Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate
Callouts: delinquent load (Memory Reference), source code location (Region, Line), cache miss (Cache Level, Miss Count, Miss Rate)
((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 408 | 2 | 17446 | 24
((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 412 | 2 | 9999 | 14
((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 413 | 2 | 9974 | 14
((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 525 | 2 | 1033429 | 6
((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 557 | 2 | 11553 | 16
Loop information
Loop Table
Columns: Loop Index | Start Line | End Line | Parent Loop Index | Nest Level | Minimum Cost | Maximum Cost | Iteration Count | Attributes
Callouts: source location; loop iteration count based on static analysis or dynamic profiling
1 203 1 19630 19630 75 (array)
  • well behaved
  • bump normalized
  • lower bound normalized
2 188 1 20413 20413 149 (array)
  • perfect nest
  • well behaved
  • bump normalized
  • guarded
  • lower bound normalized
3 141 1 13300 13300 100 (default)
  • perfect nest
  • well behaved
  • bump normalized
  • guarded
  • lower bound normalized
4 203 1 19630 19630 5 (PDF)
  • residual
  • well behaved
  • bump normalized
  • guarded
  • lower bound normalized
Loop Transformation Reports
Columns: Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number: 108; Loop Line Number: 206
4 | LoopVectorVersion (success) | High Level Optimizer | 4 | 108 | 3 | Vector versioning was performed | not available
20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval: 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
    void foo(float *p, float *q, float *r, int n) {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
    }
Report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning: declare the pointer parameters restrict
Modified source file (file.c):
    void foo(float * restrict p, float * restrict q, float * restrict r, int n) {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
    }
Report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
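The aliasing message in the report can be made concrete with a small sketch. It is an illustration only: the helper names (madd_forward, madd_reverse, order_independent) are hypothetical and not part of the slides or of any XL library. The sketch runs the slide's statement in two different iteration orders; when p and q overlap, the result depends on the order, which is the kind of dependence the compiler has to assume unless restrict rules it out.

```c
#include <assert.h>

/* The slide's loop body, iterating forward. */
static void madd_forward(float *p, const float *q, const float *r, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}

/* Same statement, iterating backward; stands in for any reordering
   that a parallel or vectorized schedule might choose. */
static void madd_reverse(float *p, const float *q, const float *r, int n) {
    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--)
        p[i] = p[i] + q[i] * r[i];
}

/* Returns 1 if both iteration orders produce the same array, 0 otherwise.
   With aliased != 0, q points one element before p in the same buffer,
   so q[i] is the value written to p[i-1]: a loop-carried dependence. */
static int order_independent(int aliased) {
    float fwd[5] = {1, 1, 1, 1, 1};
    float rev[5] = {1, 1, 1, 1, 1};
    const float q_sep[4] = {1, 1, 1, 1};
    const float r[4] = {1, 1, 1, 1};

    madd_forward(fwd + 1, aliased ? fwd : q_sep, r, 4);
    madd_reverse(rev + 1, aliased ? rev : q_sep, r, 4);

    for (int i = 0; i < 5; i++)
        if (fwd[i] != rev[i])
            return 0;
    return 1;
}
```

With separate buffers, order_independent(0) reports that both orders agree; with overlapping buffers, order_independent(1) reports a difference. Passing overlapping buffers to the restrict-qualified version of foo would break the promise that restrict makes to the compiler.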
Explicit SIMD programming for POWER7
Enabled under -qaltivec
Successor to Altivec programming extensions on POWER6/PPC970
  - Altivec data types: 16-byte vectors
    vector char (16 elements), vector short (8 elements), vector pixel (8 elements), vector int (4 elements), vector float (4 elements)
  - VSX Altivec extensions: 16-byte vectors
    vector double (2 elements), vector long long (2 elements)
Altivec built-in functions extended to new data types
    vec_add(vector double, vector double), vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, ...
Unaligned load and store operations
  - Altivec truncating loads/stores still available: vec_ld, vec_st
  - New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
Features
  - Basic block level SIMDization
  - Loop level aggregation
  - Data conversion, reduction
  - Loop with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support of unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Each transformation report message is followed by the suggested user actions.
Transformation report: Loop was SIMD vectorized
Transformation report: It is not profitable to vectorize
  - Use #pragma simd_level(10) to force the compiler to do SIMDization
Transformation report: memory accesses have non-vectorizable alignment
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
Transformation report: data dependence prevents SIMD vectorization
  - Use fewer pointers when possible
  - Use #pragma independent if the loop has no loop-carried dependency
  - Use #pragma disjoint(*a, *b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
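Two of the user actions above, setting data alignment and ruling out aliasing, can be sketched in plain C. The array and function names (out, lhs, rhs, vadd, is_aligned16) are hypothetical. The sketch uses only the aligned attribute and the restrict keyword and leaves out the XL-specific __alignx and pragmas. Whether a given compiler vectorizes this loop still depends on its options and cost model.

```c
#include <assert.h>
#include <stdint.h>

#define N 8

/* 16-byte alignment matches the 16-byte vector width described earlier. */
static float out[N] __attribute__((aligned(16)));
static float lhs[N] __attribute__((aligned(16)));
static float rhs[N] __attribute__((aligned(16)));

/* Returns nonzero if ptr sits on a 16-byte boundary. */
static int is_aligned16(const void *ptr) {
    return ((uintptr_t)ptr % 16) == 0;
}

/* restrict tells the compiler that dst, x and y never overlap, which
   removes the assumed data dependence between iterations. */
static void vadd(float *restrict dst, const float *restrict x,
                 const float *restrict y, int n) {
    /* Debug-time check of the alignment the caller is expected to provide. */
    assert(is_aligned16(dst) && is_aligned16(x) && is_aligned16(y));
    for (int i = 0; i < n; i++)
        dst[i] = x[i] + y[i];
}
```

A caller fills lhs and rhs and then calls vadd(out, lhs, rhs, N). The assert is only a development aid and disappears when NDEBUG is defined.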
SIMDization Tuning
Each transformation report message is followed by the suggested user actions.
Transformation report: memory accesses have non-vectorizable strides
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time known stride information
  - Stride versioning
Transformation report: either operation or data type is not suitable for SIMD vectorization
  - Do statement splitting and loop splitting
Transformation report: loop structure prevents SIMD vectorization
  - Convert while-loops into do-loops when possible
  - Limited use of control flow in a loop
  - Use MIN, MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
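Two of these actions lend themselves to a short C sketch: interchanging loops so the inner loop is stride-one, and replacing an if-then-else with a MAX expression. All names here (scale_strided, scale_unit_stride, clamp_low, MAX_OF) are hypothetical and chosen for illustration.

```c
#include <assert.h>

#define ROWS 4
#define COLS 6

/* Inner loop walks down a column: consecutive iterations touch
   elements that are COLS floats apart in memory. */
static void scale_strided(float m[ROWS][COLS], float s) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            m[i][j] *= s;
}

/* Loops interchanged: inner loop walks along a row, so consecutive
   iterations touch adjacent elements (stride one). */
static void scale_unit_stride(float m[ROWS][COLS], float s) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] *= s;
}

#define MAX_OF(x, y) ((x) > (y) ? (x) : (y))

/* Replaces "if (v[i] < lo) v[i] = lo;" with a MAX expression, a form
   that compilers can typically map to a select or max operation. */
static void clamp_low(float *v, int n, float lo) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        v[i] = MAX_OF(v[i], lo);
}
```

Both scale functions compute the same result. Only the memory access order differs, which is why the interchange is legal here.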
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
    • Internally exploits VSX instructions
    • SP: average speedup of 1.99 vs Power5 MASSV
    • DP: average speedup of 1.27 vs Power5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
    • Tuned math routines operating on vector data types
    • Over 35 frequently used mathematical functions
    • Both single and double precision
    • To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above; -qstrict=vectorprecision to maintain precision over all loop iterations
Source loop:
    for (i = 0; i < n; i++)
      b[i] = sqrt(a[i]);
Converted by the compiler to:
    __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, supporting up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
Partial cache line touch
    void __partial_dcbt(void *address)
Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
Loop to be prefetched:
    for (i = 0; i < n; i++) a[i] = b[i] + ...
Prefetch setup:
    /* Store stream prefetch for array a: stream direction, address, stream id */
    __protected_store_stream_set(FORWARD, &a, 11);
    /* Stream length (in 128-byte units), prefetch depth, stream id */
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    /* Start stream prefetch */
    __protected_stream_go();
Loop Optimization
Traditional unimodular loop transformations for perfect regular loop nests
  - Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
  - Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
  - Under optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
  - Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
    • Loop skewing, loop tiling for triangular loop shapes
  - Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplication
    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) += a(k,i)*b(j,k)
Also allows transformation of imperfect loop nests
  - Intervening code between loops
Only available at -qhot=level=2
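The dependence-analysis example can be checked by hand with a small C sketch. It is a 0-based transliteration of the first loop nest and not compiler output. The names kernel_jk, kernel_kj and check_orders_agree are hypothetical, and the array bt[k][j] is assumed to hold Fortran's b(j,k) so that j is the fast-moving index, as in Fortran's column-major layout. Within one i iteration the reads c[j] use j <= i and the writes c[k] use k > i, so they never touch the same element and the j and k loops can be swapped.

```c
#include <assert.h>

#define N 6

/* Original order from the slide: j loop outside, k loop inside.
   bt[k][j] stands for Fortran b(j,k). */
static void kernel_jk(double c[N], double bt[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++)
            for (int k = i + 1; k < N; k++)
                c[k] = c[j] + bt[k][j];
}

/* Interchanged order: k loop outside, j loop inside, so the inner loop
   walks bt[k][0..i] with stride one. */
static void kernel_kj(double c[N], double bt[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = i + 1; k < N; k++)
            for (int j = 0; j <= i; j++)
                c[k] = c[j] + bt[k][j];
}

/* Runs both orders on identical inputs and asserts that every element
   of c matches; returns 1 when all elements agree. */
static int check_orders_agree(void) {
    double c1[N], c2[N], bt[N][N];
    for (int k = 0; k < N; k++) {
        c1[k] = c2[k] = (double)k;
        for (int j = 0; j < N; j++)
            bt[k][j] = 10.0 * j + k;
    }
    kernel_jk(c1, bt);
    kernel_kj(c2, bt);
    for (int k = 0; k < N; k++)
        assert(c1[k] == c2[k]);
    return 1;
}
```

In both orders the last value stored to c[k] during iteration i comes from j = i, so the final contents of c are the same.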
Polyhedral Loop Transformation Example
Sequence of imperfect loop nests
Input:
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Transformations for parallelism and locality:
  - Loop fusion
  - Loop skewing to enable tiling
  - Loop tiling for cache
  - Loop skewing for pipeline parallelization
  - Loop tiling for registers
Output (optimized loops):
    for jT ...
      #pragma omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
  - Context sensitive and insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing, data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq | Type | Phase | Data Name | Category | Region | Line | Description
1 | ArraySplitting | High Level Optimizer | iplus 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated
[Figure: data reorganization for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate field arrays F0[i], F1[i], F2[i], F3[i]. Array merging: rows A[0] to A[3] with elements A[i][0] to A[i][3]. Array transposing: A[i][j] becomes A'[j][i].]
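Array splitting, as pictured, can be sketched by hand in C. This is a manual illustration of the layout change and not what the compiler emits. The type and function names (rec, rec_split, split_fields, sum_f0_aos, sum_f0_split) are hypothetical.

```c
#include <assert.h>

#define COUNT 4

/* Array-of-structures layout: the four fields of one element are
   adjacent, so a loop over f0 alone strides over f1..f3 as well. */
struct rec { double f0, f1, f2, f3; };

/* Split layout: one array per field, so a loop over f0 is stride one. */
struct rec_split { double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT]; };

/* Copies each field of s[0..n-1] into its own array. */
static void split_fields(const struct rec *s, struct rec_split *out, int n) {
    assert(n >= 0 && n <= COUNT);
    for (int i = 0; i < n; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Sum of f0 in the original layout: stride of sizeof(struct rec). */
static double sum_f0_aos(const struct rec *s, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += s[i].f0;
    return sum;
}

/* Sum of f0 in the split layout: stride of sizeof(double). */
static double sum_f0_split(const struct rec_split *p, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += p->f0[i];
    return sum;
}
```

Both sums return the same value. The split version reads only the bytes it needs, which is the cache-utilization benefit the figure points at.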
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
  - Full OpenMP task parallelization
  - Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
  - Bypasses the expensive pthread key mechanism
  - pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
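A minimal sketch of explicit OpenMP parallelization in C is shown below. The function name dot is hypothetical. The directive is standard OpenMP and is not specific to XL. When the compiler's OpenMP support is enabled (with XL this is done through -qsmp=omp), iterations are divided among threads; when it is not, the pragma is ignored and the loop runs serially.

```c
#include <assert.h>

/* Dot product with an OpenMP work-sharing loop. reduction(+:sum) gives
   each thread a private partial sum and combines them at the end. */
static double dot(const double *x, const double *y, int n) {
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because a reduction may add the partial sums in a different order than the serial loop, results for general floating-point data can differ in the last bits between serial and parallel runs.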
Automatic parallelization
Improvements to automatic parallelization
  - More effective array data flow analysis for array privatization
  - Automatic privatization of descriptor-based Fortran arrays
  - Runtime-dependence testing
SMP runtime improvements
  - Leverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs
Some suboptions of interest
  - spins and yields to define the behavior of idle threads
  - Thread binding using the startproc and stride suboptions
  - New bind suboption on AIX:
      bind=SDL=<start resource>,<number of resources>,<stride>
      bindlist=SDL=i0,i1,...,ix
  - schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
    • Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
  - Simplifies inlining control for the programmer
-qinline=level=X, X = 0 to 10 (default 5)
  - Consistent across all languages and optimization levels
  - Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
  - Previously only available at -O5 or user inlining on C++
  - Available at all levels of -qhot (default at -O3 and up)
  - Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of the IEEE FP standard (INF, NAN, etc.)
  - Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
  - Suboptions allow fine-grain control over this guarantee
  - Examples:
      -qstrict=precision         Strict FP precision
      -qstrict=exceptions        Strict FP exceptions
      -qstrict=ieeefp            Strict IEEE FP implementation
      -qstrict=nans              Strict general and computation of NANs
      -qstrict=order             Do not modify evaluation order
      -qstrict=vectorprecision   Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
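Why evaluation order matters for floating-point results can be shown with a small C sketch. The function names are hypothetical and the values are chosen for illustration. Floating-point addition is not associative, so a compiler that reorders a sum can change the result unless an option such as -qstrict=order forbids it.

```c
#include <assert.h>

/* Evaluates the sum left to right: (a + b) + c. */
static double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

/* Evaluates the reassociated form: a + (b + c). */
static double sum_right(double a, double b, double c) {
    return a + (b + c);
}

/* Shows two mathematically equal expressions giving different
   IEEE double results. */
static void demo_reassociation(void) {
    double a = 1e20, b = -1e20, c = 1.0;
    /* a + b is exactly 0, so adding c gives 1. */
    assert(sum_left(a, b, c) == 1.0);
    /* b + c rounds back to -1e20 because 1 is far below the spacing of
       doubles near 1e20, so the total is 0. */
    assert(sum_right(a, b, c) == 0.0);
}
```

The example assumes the code itself is compiled without value-changing floating-point optimizations, which is the behavior -qstrict is described as preserving.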
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
  - http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
  - http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
Migration from GCC to IBM XL Compilers
Compatibility
  - Source level
    • Supports many gcc/g++ language extensions and annotations
  - Binary level
    • Link objects from gcc/g++ and XL C/C++
gxlc and gxlc++ utilities for compile option mapping
  - Controlled by the gxlc.cfg configuration file for option mappings from GCC to XL C/C++
  - Modify the contents of gxlc.cfg to meet your specific compilation requirements
gxlc, gxlc++, gxlC
Invocation (one of gxlc, gxlc++, gxlC):
    gxlc -v -Wx,<XL compiler options> <gcc g++ options> filename
Creating a customized configuration file (use the XLC_USR_CONFIG environment variable to specify the location of your own configuration file)
gxlc.cfg format:
    abcd gcc_or_g++_option "xlc_or_xlc++_option"
e.g.
    nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__
    nnn -B -B
    nnn -C -C
    nnn -c -c
    nnn -dM -qshowmacros
    nnn -D -D
    E E
XML Compiler Transformation Reports
Generate compilation reports consumable by other tools
  - Enable better visualization and analysis of compiler information
  - Help users do manual performance tuning
  - Help automatic performance tuning through performance tool integration
Unified report from all compiler subcomponents and analysis
  - Compiler options
  - Pseudo-sources
  - Compiler transformations, including missed opportunities
Consistent support among Fortran and C/C++
Controlled under option
    -qlistfmt=xml=inlines     generates inlining information
    -qlistfmt=xml=transform   generates loop transformation information
    -qlistfmt=xml=data        generates data reorganization information
    -qlistfmt=xml=pdf         generates dynamic profiling information
    -qlistfmt=xml=all         turns on all optimization content
    -qlistfmt=xml=none        turns off all optimization content
Compiler Transformation Report Contents
Program characteristics
  - Compiler version
  - Date of compilation
  - Source file table
  - Function table
  - Loop table (line number, nest level, iteration count, loop attributes)
  - Transformation table (line number, transformation, description)
  - Pseudocode
Transformations at both high and low-level optimizations
  - Intra-procedural transformations
    • Loop transformations
    • Data prefetch
    • Vectorization and SIMDization
    • Parallelization
    • Instruction scheduling
  - Inter-procedural transformations
    • Inlining
    • Data reorganization
Profiling information
  - Basic block counters
  - Call counters
  - Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
Compiler Optimizations for Performance
  - Compiler static analysis and optimization
  - User guided optimizations through compiler options and directives
  - Automatic compiler optimizations through profile directed feedback
XL Compiler and Tooling Integration
  - Compiler feedback view in PTP
  - Compiler transformation reports and HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Diagram: SOURCE CODE is compiled and linked with -qpdf1 (static analysis, profile based refinement) to produce the INSTRUMENTED APPLICATION, which is run on SAMPLE INPUTS to produce PROFILE DATA. The source is then compiled and linked with -qpdf2 (profile directed optimizations) to produce the OPTIMIZED APPLICATION, which is run on SAMPLE INPUTS.]
Hardware and software constraints
Multiple sample runs for different hardware performance events
Profile based instrumentation refinement
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Basic Block and Call Counter Information
Call Counter InformationRegion 1
Region Execution Count
1
Call Coverage 32 49
Call Counters
Call Name Call Execution Count
Line
smtctl_ 0 79
is_smt_on_ 1 55
jbind_ 1 109
jbind_ 0 108
aff_ 1 38
Step 1 compile the application with ndashqpdf1 to generate an instrumented executable
Step 2 run the executable with typical input data set to gather profiling information
Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report
Region 1
Region Execution Count 5
Block Coverage 81 81
Block Counters
Block Index
Block Execution Count
Start Line
End Line
3 5 1 33
4 5 33 34
Basic Block Counter Information
helliphellip
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Cache Miss Information
Memory Reference Region Line Cache
Level Miss Count Miss Rate
((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 408 2 17446 24
((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 412 2 9999 14
((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 413 2 9974 14
((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]
3 525 2 1033429 6
((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]
3 557 2 11553 16Delinquent load
Source code location
Cache miss
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Source location
Loop informationLoop TableLoopIndex
StartLine
EndLine
Parent Loop Index Nest Level
Minimum Cost
Maximum Cost
Iteration Count
Attributes
1 203 1 19630 19630 75 (array)bull
well behavedbull
bump normalizedbull
lower bound normalized
2 188 1 20413 20413 149 (array)
bull
perfect nestbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
3 141 1 13300 13300 100 (default)
bull
perfect nestbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
4 203 1 19630 19630 5 (PDF)
bull
residualbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
Loop iteration count based on static analysis or dynamic profiling
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Transformation Reports
Seq Type Phase Region Line Loop Index
Descriptio n
Attributes
2LoopVector (success)
High Level Optimizer 4 217 1
Loop vectorizati on was performed
not available
3 LoopFusion (success)
High Level Optimizer
4 108 3 Loops were fused
bullLoop Line Number 108bullLoop Line Number 206
4LoopVector Version (success)
High Level Optimizer 4 108 3
Vector versioning was performed
not available
20ModuloSch edule (success)
Low Level Optimizer 12 3499 26
Loop was modulo scheduled
bullInitiation Interval 12
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
filec
foo (float p float q float r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all
filec
foo (float restrict p float restrict q float restrict r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing
Original source file modified source file
filexmlLoop was automatically parallelizedLoop was modulo scheduled
Tuning
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Explicit SIMD programming for POWER7 Enabled under -qaltivec
Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors
vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements
ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements
Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)
New vector operations vec_mul vec_div hellipUnaligned load and store operations
ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
21
Automatic SIMDizationAutomatic SIMDization for VMX and VSX
ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX
FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
Transformation report: Loop was SIMD vectorized
Transformation report: It is not profitable to vectorize
User actions:
– Use #pragma simd_level(10) to force the compiler to do SIMDization
Transformation report: data dependence prevents SIMD vectorization
User actions:
– Use fewer pointers when possible
– Use #pragma independent if the loop has no loop-carried dependency
– Use #pragma disjoint (a, b) if a and b are disjoint
– Use the restrict keyword or the compiler option -qrestrict
Transformation report: memory accesses have non-vectorizable alignment
User actions:
– Use __attribute__((aligned(n))) to set data alignment
– Use __alignx(16, a) to indicate the data alignment to the compiler
– Use -qassert=refalign if all references are naturally aligned
– Use array references instead of pointers where possible
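The alignment advice above can be sketched in portable C. This is a minimal illustration, assuming a compiler that accepts the GCC-style `__attribute__((aligned(16)))` spelling named on the slide. The array names, the size `N` and the helper `is_aligned16` are hypothetical, and the XL-specific `__alignx` hint is only mentioned in a comment, not called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 1024   /* illustrative array length */

/* 16-byte alignment matches the VMX/VSX vector width. */
static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));

/* Returns 1 when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Array references (rather than pointer arguments) keep both the
   alignment and the absence of aliasing visible to the compiler.
   With XL C, a pointer argument could instead be described with
   __alignx(16, ptr); that call is deliberately not used here. */
void add_scaled(float k, size_t n)
{
    assert(n <= N);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + k * b[i];
}
```

Whether the loop is actually SIMDized still depends on the compiler and options; the transformation report is the place to confirm it.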
SIMDization Tuning
Transformation report: memory accesses have non-vectorizable strides
User actions:
– Loop interchange for stride-one accesses when possible
– Data layout reshape for stride-one accesses
– Higher optimization to propagate compile-time-known stride information
– Stride versioning
Transformation report: either operation or data type is not suitable for SIMD vectorization
User actions:
– Do statement splitting and loop splitting
Transformation report: loop structure prevents SIMD vectorization
User actions:
– Convert while-loops into do-loops when possible
– Limited use of control flow in a loop
– Use MIN, MAX instead of if-then-else
– Eliminate function calls in a loop through inlining
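As a hand-written illustration of "loop interchange for stride-one accesses", the sketch below computes column sums of a row-major C matrix in two loop orders. The function names and the sizes `ROWS` and `COLS` are made up for this example; the compiler may perform the same interchange on its own at higher optimization levels.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 64
#define COLS 48

/* Inner loop walks down a column: consecutive accesses to m are COLS
   doubles apart, a non-unit stride in C's row-major layout. */
void col_sums_strided(double m[ROWS][COLS], double out[COLS])
{
    for (size_t j = 0; j < COLS; j++) {
        out[j] = 0.0;
        for (size_t i = 0; i < ROWS; i++)
            out[j] += m[i][j];
    }
}

/* Interchanged: the inner loop walks along a row, so both m[i][j]
   and out[j] are accessed with stride one. */
void col_sums_unit_stride(double m[ROWS][COLS], double out[COLS])
{
    assert(m != NULL && out != NULL);
    for (size_t j = 0; j < COLS; j++)
        out[j] = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            out[j] += m[i][j];
}
```

Both versions add the rows to each `out[j]` in the same order, so they produce identical results; only the memory access pattern differs.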
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
– POWER7 vector MASS library (libmassvp7.a)
  • Internally exploits VSX instructions
  • SP: average speedup of 1.99 vs POWER5 MASSV
  • DP: average speedup of 1.27 vs POWER5 MASSV
– POWER7 SIMD MASS library (libmass_simdp7.a)
  • Tuned math routines operating on vector data types
  • Over 35 frequently used mathematical functions
  • Both single and double precision
  • To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above; -qstrict=vectorprecision to maintain precision over all loop iterations
Source loop:
  for (i = 0; i < n; i++)
      b[i] = sqrt(a[i]);
Compiler-generated call: __vsqrt_P7(b, a, n)
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, supporting up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
– More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
  void __dcbtt(void *address)
  void __dcbtstt(void *address)
Partial cache line touch
  void __partial_dcbt(void *address)
Stride-N stream prefetch
  void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
  void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
  void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
for (i = 0; i < n; i++) a[i] = b[i] + …
Prefetch setup for this loop:
  __protected_store_stream_set(FORWARD, &a, 11)
  __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11)
  __protected_stream_set(FORWARD, &b, 0)
  __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0)
  __eieio()
  __protected_stream_go()
Callouts:
– Stream 11: store stream prefetch for array a
– Stream 0: transient stream prefetch for array b
– Arguments: stream direction (FORWARD), stream length (n*sizeof(double)/128), prefetch depth (DEEPER), stream id (11, 0)
– __protected_stream_go(): start stream prefetch
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
– Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
– Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
– Under the optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
– Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  • Loop skewing, loop tiling for triangular loop shapes
– Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
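To make "loop tiling" concrete, here is a hand-tiled matrix transpose next to the untiled version. It is only a sketch of the transformation the compiler (or `block_loop`) applies automatically; the names, `DIM` and the tile size `TILE` are assumptions chosen for illustration, not values taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define DIM  100
#define TILE 16   /* hypothetical tile size; tune per cache level */

/* Untiled reference: dst is written with stride DIM in the inner loop. */
void transpose_naive(double src[DIM][DIM], double dst[DIM][DIM])
{
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            dst[j][i] = src[i][j];
}

/* Tiled: the i/j space is visited in TILE x TILE blocks so that the
   rows of src and dst touched by one block stay in cache together. */
void transpose_tiled(double src[DIM][DIM], double dst[DIM][DIM])
{
    assert(src != dst);   /* out-of-place transpose only */
    for (size_t ii = 0; ii < DIM; ii += TILE)
        for (size_t jj = 0; jj < DIM; jj += TILE) {
            /* Clamp the block edge because DIM is not a multiple of TILE. */
            size_t imax = (ii + TILE < DIM) ? ii + TILE : DIM;
            size_t jmax = (jj + TILE < DIM) ? jj + TILE : DIM;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    dst[j][i] = src[i][j];
        }
}
```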
Polyhedral Loop Transformation Examples
Dependence analysis
  do i = 1, N
    do j = 1, i
      do k = i+1, N
        c(k) = c(j) + b(j,k)
– Enables interchange of the j and k loops to improve access locality for b
– Identifies independence of memory accesses to c
– Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplication
  do i = 1, N
    do j = i+1, N
      do k = i+1, N
        c(j,i) += a(k,i)*b(j,k)
– Also allows transformation of imperfect loop nests
  • Intervening code between loops
– Only available at -qhot=level=2
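The dependence-analysis example can be checked by hand with a small C port. Within one `i` iteration the nest reads `c(j)` only for `j <= i` and writes `c(k)` only for `k > i`, so the j and k loops may be swapped. The sketch below is an illustration under stated assumptions: indices are shifted to 0-based, `long` data is used so results compare exactly, and `b` is stored as `bt[k][j]` to mimic Fortran's column-major `b(j,k)`. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N 32

/* Original order i, j, k: the inner k loop jumps a whole row of bt
   (one Fortran column of b) on every iteration. */
void nest_ijk(long c[N], long bt[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j <= i; j++)
            for (size_t k = i + 1; k < N; k++)
                c[k] = c[j] + bt[k][j];
}

/* Interchanged order i, k, j: the inner j loop is stride-one in bt.
   Legal because, for a fixed i, reads touch c[0..i] and writes touch
   c[i+1..N-1], which never overlap. */
void nest_ikj(long c[N], long bt[N][N])
{
    assert(c != NULL && bt != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t k = i + 1; k < N; k++)
            for (size_t j = 0; j <= i; j++)
                c[k] = c[j] + bt[k][j];
}
```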
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
  for j …
    for k …
      A[j] = B[k]
      for i …
        C[i][k] = …
  for j …
    for k …
      M[j] = N[k]
      for i …
        X = C[i][k]
Transformations for parallelism & locality
– Loop fusion
– Loop skewing to enable tiling
– Loop tiling for cache
– Loop skewing for pipeline parallelization
– Loop tiling for registers
Output: optimized loops
  for jT …
    omp parallel for
    for kT …
      for j …
        for k …
          A[j] = B[k]
          M[j] = N[k]
          for i
            C[i][k] = …
            X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
– Context sensitive and insensitive safety analysis
– Data affinity analysis and shape analysis
– Data splitting, data transposing, data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
  1, ArraySplitting, High Level Optimizer, iplus, 9, An array of a large aggregated data-type was split into multiple arrays of smaller data-types
  27, ArrayCoalescing, High Level Optimizer, net, Global variables were aggregated
[Figure: data reorganization for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0..F3 becomes separate arrays F0[], F1[], F2[], F3[]. Array merging: rows A[0]..A[3] are combined into one array A[i][j]. Array transposing: A[i][j] becomes A'[j][i].]
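Array splitting can be illustrated by doing by hand what the compiler does automatically at -O5. In this sketch the struct `S`, its four fields and the size `COUNT` are hypothetical stand-ins for the S[i].F0..F3 layout in the figure.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 256

/* Original layout: array of structures, the four fields interleaved. */
struct S { double f0, f1, f2, f3; };

/* Split layout: one contiguous array per field. */
struct S_split { double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT]; };

/* Copy an array of structures into the split layout. */
void split_fields(const struct S *in, struct S_split *out, size_t n)
{
    assert(n <= COUNT);
    for (size_t i = 0; i < n; i++) {
        out->f0[i] = in[i].f0;
        out->f1[i] = in[i].f1;
        out->f2[i] = in[i].f2;
        out->f3[i] = in[i].f3;
    }
}

/* Reads f0 only, but drags f1..f3 through the cache as well. */
double sum_f0_aos(const struct S *s, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i].f0;
    return sum;
}

/* Reads f0 only and touches nothing else: every cache line is fully used. */
double sum_f0_split(const struct S_split *s, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s->f0[i];
    return sum;
}
```

A loop that uses a single field touches a quarter of the memory in the split layout, which is the locality gain the figure describes.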
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
– Full OpenMP task parallelization
– Privatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
– Bypasses the expensive pthread key mechanism
– pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
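A minimal sketch of OpenMP 3.0 task parallelism, using only standard directives (`parallel`, `single`, `task`, `taskwait`). The Fibonacci routine and its names are an illustrative choice, not taken from the slides. Compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result; with XL the usual way to enable them would be -qsmp=omp.

```c
#include <assert.h>

/* Each recursive call becomes a task; taskwait joins the two children
   before their results are combined. */
static long fib_task(int n)
{
    long a, b;
    if (n < 2)
        return n;
    #pragma omp task shared(a) firstprivate(n)
    a = fib_task(n - 1);
    #pragma omp task shared(b) firstprivate(n)
    b = fib_task(n - 2);
    #pragma omp taskwait
    return a + b;
}

/* One thread creates the root task; the rest of the team executes
   tasks as they are generated. */
long fib(int n)
{
    long result = 0;
    assert(n >= 0);
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

A production version would stop creating tasks below some cutoff depth, since one task per call is far too fine-grained to pay off.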
Automatic parallelization
Improvements to automatic parallelization
– More effective array data flow analysis for array privatization
– Automatic privatization of descriptor-based Fortran arrays
– Runtime dependence testing
SMP runtime improvements
– Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs
Some suboptions of interest
– spins and yields, to define the behavior of idle threads
– Thread binding using the startproc and stride suboptions
– New bind suboption on AIX:
  bind=SDL=<start resource>,<number of resources>,<stride>
  bindlist=SDL=i0,i1,…,ix
– schedule, to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  • Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
– Simplifies inlining control for the programmer
-qinline=level=X, X = 0 to 10 (default 5)
– Consistent across all languages and optimization levels
– Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
– Previously only available at -O5 or user inlining on C++
– Available at all levels of -qhot (default at -O3 and up)
– Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
– Precision of floating-point computation
– Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
– Use of alternate math libraries
-qstrict guarantees identical results to noopt, at the expense of optimization
– Suboptions allow fine-grain control over this guarantee
– Examples:
  -qstrict=precision: Strict FP precision
  -qstrict=exceptions: Strict FP exceptions
  -qstrict=ieeefp: Strict IEEE FP implementation
  -qstrict=nans: Strict generation and computation of NaNs
  -qstrict=order: Do not modify evaluation order
  -qstrict=vectorprecision: Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
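Why evaluation order matters (the concern behind -qstrict=order) can be shown with three doubles. This is a generic IEEE 754 illustration, not XL-specific code; the function names and the values 1e20 and 1.0 are chosen only because the smaller term is lost when it is added to the large one first.

```c
#include <assert.h>

/* (a + b) + c: the two large terms cancel first, so c survives. */
double sum_left(double a, double b, double c)
{
    double t = a + b;
    return t + c;
}

/* a + (b + c): c is absorbed by the large term b before the cancellation. */
double sum_right(double a, double b, double c)
{
    double t = b + c;
    return a + t;
}

/* Self-check: reassociating the same three operands changes the result. */
void check_reassociation_changes_result(void)
{
    assert(sum_left(1e20, -1e20, 1.0) != sum_right(1e20, -1e20, 1.0));
}
```

A compiler that is free to reassociate may turn one form into the other, which is exactly the kind of result change the suboptions let you rule out.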
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Café on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
– Now downloadable as a fully searchable package
Whitepaper “Code optimization with the IBM XL Compilers”: http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper “Overview of the IBM XL C/C++ and XL Fortran Compiler Family”: http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC11.1 XLF13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (Enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
gxlc gxlc++ gxlC
Usage: gxlc | gxlc++ | gxlC  -v  -Wx,<XL compiler options>  <gcc/g++ options>  filename
Creating a customized configuration file (use the XLC_USR_CONFIG environment variable to specify the location of your own configuration file)
gxlc.cfg format:
  abcd gcc_or_g++_option "xlc_or_xlc++_option"
e.g.
  nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__
  nnn -B -B
  nnn -C -C
  nnn -c -c
  nnn -dM -qshowmacros
  nnn -D -D
E E
XML Compiler Transformation Reports
Generate compilation reports consumable by other tools
– Enable better visualization and analysis of compiler information
– Help users do manual performance tuning
– Help automatic performance tuning through performance tool integration
Unified report from all compiler subcomponents and analysis
– Compiler options
– Pseudo-sources
– Compiler transformations, including missed opportunities
Consistent support among Fortran, C/C++
Controlled under option
  -qlistfmt=xml=inlines: generates inlining information
  -qlistfmt=xml=transform: generates loop transformation information
  -qlistfmt=xml=data: generates data reorganization information
  -qlistfmt=xml=pdf: generates dynamic profiling information
  -qlistfmt=xml=all: turns on all optimization content
  -qlistfmt=xml=none: turns off all optimization content
Compiler Transformation Report Contents
Program characteristics
– Compiler version
– Date of compilation
– Source file table
– Function table
– Loop table (line number, nest level, iteration count, loop attributes)
– Transformation table (line number, transformation description)
– Pseudocode
Transformations at both high and low-level optimizations
– Intra-procedural transformations
  • Loop transformations
  • Data prefetch
  • Vectorization and SIMDization
  • Parallelization
  • Instruction scheduling
– Inter-procedural transformations
  • Inlining
  • Data reorganization
Profiling information
– Basic block counters
– Call counters
– Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
Compiler Optimizations for Performance
– Compiler static analysis and optimization
– User guided optimizations through compiler options and directives
– Automatic compiler optimizations through profile directed feedback
XL Compiler and Tooling Integration
– Compiler feedback view in PTP
– Compiler transformation reports and HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Figure: source code is compiled and linked with -qpdf1 (static analysis, profile based refinement) into an instrumented application; running it on sample inputs produces profile data; compiling and linking with -qpdf2 (profile directed optimizations) produces the optimized application.]
– Hardware and software constraints
– Multiple sample runs for different hardware performance events
– Profile based instrumentation refinement
Basic Block and Call Counter Information
Step 1: compile the application with -qpdf1 to generate an instrumented executable
Step 2: run the executable with a typical input data set to gather profiling information
Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report
Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters
  Call Name | Call Execution Count | Line
  smtctl_ | 0 | 79
  is_smt_on_ | 1 | 55
  jbind_ | 1 | 109
  jbind_ | 0 | 108
  aff_ | 1 | 38
Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters
  Block Index | Block Execution Count | Start Line | End Line
  3 | 5 | 1 | 33
  4 | 5 | 33 | 34
  ……
Cache Miss Information
Columns: Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate
  ((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7] | 3 | 408 | 2 | 17446 | 24
  ((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7] | 3 | 412 | 2 | 9999 | 14
  ((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7] | 3 | 413 | 2 | 9974 | 14
  ((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC] | 3 | 525 | 2 | 1033429 | 6
  ((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC] | 3 | 557 | 2 | 11553 | 16
Callouts: Delinquent load (Memory Reference); Source code location (Region, Line); Cache miss (Cache Level, Miss Count, Miss Rate)
Loop information
Loop Table
Columns: Loop Index, Start Line, End Line, Parent Loop Index, Nest Level, Minimum Cost, Maximum Cost, Iteration Count, Attributes
  1 203 1 19630 19630 75 (array); attributes: well behaved, bump normalized, lower bound normalized
  2 188 1 20413 20413 149 (array); attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
  3 141 1 13300 13300 100 (default); attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
  4 203 1 19630 19630 5 (PDF); attributes: residual, well behaved, bump normalized, guarded, lower bound normalized
Callouts: Source location; Loop iteration count based on static analysis or dynamic profiling
Loop Transformation Reports
  Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
  2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
  3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number 108; Loop Line Number 206
  4 | LoopVectorVersion (success) | High Level Optimizer | 4 | 108 | 3 | Vector versioning was performed | not available
  20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
  foo (float *p, float *q, float *r, int n)
  for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i]
file.xml: Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning
Modified source file (file.c):
  foo (float * restrict p, float * restrict q, float * restrict r, int n)
  for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i]
file.xml: Loop was automatically parallelized. Loop was modulo scheduled.
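For reference, a complete, compilable form of the modified routine might look like the sketch below. The `void` return type, the `const` qualifiers and the `*` between `q[i]` and `r[i]` are assumptions, since the slide text omits them.

```c
#include <assert.h>

/* restrict tells the compiler that p, q and r never overlap, removing
   the aliasing dependence reported for the original version. The caller
   is responsible for making that promise true. */
void foo(float *restrict p, const float *restrict q,
         const float *restrict r, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}
```

Passing overlapping arrays to this function is undefined behavior, so `restrict` should only be added where the call sites really guarantee disjoint data.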
Explicit SIMD programming for POWER7 (enabled under -qaltivec)
Successor to AltiVec programming extensions on POWER6/PPC970
– AltiVec data types: 16-byte vectors
  vector char: 16 elements; vector short: 8 elements; vector pixel: 8 elements; vector int: 4 elements; vector float: 4 elements
– VSX AltiVec extensions: 16-byte vectors
  vector double: 2 elements; vector long long: 2 elements
AltiVec built-in functions extended to new data types
  vec_add(vector double, vector double); vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, …
Unaligned load and store operations
– AltiVec truncating loads/stores still available: vec_ld, vec_st
– New non-truncating loads/stores: vec_xld2, vec_xstd2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
21
Automatic SIMDizationAutomatic SIMDization for VMX and VSX
ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX
FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable alignment
Use __attribute__((aligned(n)) to set data alignment
Use __alignx(16 a) to indicate the data alignment to the compiler
Use -qassert=refalign if all references are naturally aligned
Use array references instead of pointers where possible
data dependence prevents SIMD vectorization
Use fewer pointers when possible
Use pragma independent if it has no loop carried dependency
Use pragma disjoint (a b) if a and b are disjoint
Use restrict keyword or compiler option ndashqrestrict
User actionsTransformation report
Loop was SIMD vectorized
Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable
to vectorize
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable strides
Loop interchange for stride-one accesses when possible
Data layout reshape for stride-one accesses
Higher optimization to propagate compile known stride information
Stride versioning
Do statement splitting and loop splitting
User actionsTransformation report
either operation or data type is not suitable for SIMD vectorization
Convert while-loops into do-loops when possible
Limited use of control flow in a loop
Use MIN MAX instead of if-then-else
Eliminate function calls in a loop through inlining
loop structure prevents SIMD vectorization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)
bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV
ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations
for (i=0iltni++)
b[i]=sqrt(a[i])
__vsqrt_P7(ban)
Loop vectorization was performed
Transformation report
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Software-controlled data prefetching for POWER7
Software control over POWER7 prefetch engine supporting up to 12 data streams
Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above
ndashMore aggressive exploitation under option ndash qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)
Partial cache line touchvoid __partial_dcbt(void address)
Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)
Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count
depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth
stream_ID)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Example of POWER7 data prefetching
for (i=0 ilt n i++) a[i] = b[i] +
__protected_store_stream_set(FORWARD ampa 11)
__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)
__protected_stream_set(FORWARD ampb 0)
__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)
__eieio()
__protected_stream_go()
Store stream prefetch for array a
transient stream prefetch for array bStream id
Stream length
Stream direction
Prefetch depth
Start stream prefetch
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Optimization
Traditional unimodular loop transformations for prefect regular loop nests
ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling
ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above
Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and
complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp
bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence
formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)
Enable interchange of j and k loops to improve access locality for b
ndashIdentifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)
Also allows transformation of imperfect loop nests
ndashIntervening code between loops
Only available at -qhot=level=2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Transformations applied for parallelism and locality
  - Loop fusion
  - Loop skewing to enable tiling
  - Loop tiling for cache
  - Loop skewing for pipeline parallelization
  - Loop tiling for registers
Output: optimized loops
    for jT ...
      omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
  - Context-sensitive and context-insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing, and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
  1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
  27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated
[Figure: data reorganization for data locality and cache utilization]
  - Array splitting: an array of structures S[i] with fields F0..F3 is split into separate arrays F0[i], F1[i], F2[i], F3[i]
  - Array merging: the rows reached through A[0], A[1], A[2], A[3], ... are merged into one array A[i][j]
  - Array transposing: A[i][j] is stored as A'[j][i]
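As a rough illustration of array splitting, the following C sketch performs by hand the layout change that the slide attributes to the compiler at -O5. The type and function names and the element count N are assumptions made for this example only.

```c
#include <stddef.h>

enum { N = 256 };   /* illustrative element count */

/* Before: array of structures, fields F0..F3 interleaved in memory. */
struct rec { double f0, f1, f2, f3; };

/* After array splitting: one contiguous array per field. */
struct split { double f0[N], f1[N], f2[N], f3[N]; };

/* Copy the interleaved layout into the split layout. */
void split_records(const struct rec *in, struct split *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->f0[i] = in[i].f0;
        out->f1[i] = in[i].f1;
        out->f2[i] = in[i].f2;
        out->f3[i] = in[i].f3;
    }
}

/* Touches only f0, but strides over f1..f3 in every cache line. */
double sum_f0_aos(const struct rec *s, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += s[i].f0;
    return total;
}

/* Same sum on the split layout: stride-one access to f0 only. */
double sum_f0_soa(const struct split *s, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += s->f0[i];
    return total;
}
```

A loop that uses only one field benefits most: in the split layout every byte brought into cache belongs to the field being read.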
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation for C, C++, and Fortran
  - Full OpenMP task parallelization
  - Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
  - Bypasses the expensive pthread key mechanism
  - pthread-based implementation still available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
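A small C sketch of the two OpenMP features mentioned above, a threadprivate variable and a parallel loop. The names are hypothetical. Built without OpenMP support the pragmas are ignored and the code runs serially; with the XL compilers, OpenMP is enabled through -qsmp=omp.

```c
/* Per-thread call counter. With OpenMP enabled, each thread gets its
 * own copy, which the slide says is backed by OS thread-local storage. */
static long calls_on_this_thread = 0;
#pragma omp threadprivate(calls_on_this_thread)

static double square(double x)
{
    calls_on_this_thread++;   /* no race: the counter is per thread */
    return x * x;
}

/* Sum of squares with the partial sums combined by a reduction. */
double sum_of_squares(const double *x, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += square(x[i]);
    return total;
}
```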
Automatic parallelization
Improvements to automatic parallelization
  - More effective array data flow analysis for array privatization
  - Automatic privatization of descriptor-based Fortran arrays
  - Runtime dependence testing
SMP runtime improvements
  - Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
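Array privatization can be illustrated with a loop whose scratch array is fully written before it is read in every iteration. The sketch below spells out with an explicit OpenMP private clause what the slide says the compiler can derive automatically; the function name and the width W are assumptions for illustration.

```c
enum { W = 4 };   /* illustrative row width */

/* For each row, square the elements into a scratch array and sum them.
 * tmp is written before it is read in every iteration, so no value
 * flows between iterations through it; giving each thread a private
 * copy removes the only obstacle to running the r loop in parallel. */
void row_sums_of_squares(const double *in, double *out, int rows)
{
    double tmp[W];
    #pragma omp parallel for private(tmp)
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < W; c++)
            tmp[c] = in[r * W + c] * in[r * W + c];
        double s = 0.0;
        for (int c = 0; c < W; c++)
            s += tmp[c];
        out[r] = s;
    }
}
```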
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and auto-parallelized programs
Some suboptions of interest
  - spins and yields, to define the behavior of idle threads
  - Thread binding using the startproc and stride suboptions
  - New bind suboptions on AIX:
      bind=SDL=<start resource>,<number of resources>,<stride>
      bindlist=SDL=i0,i1,...,ix
  - schedule, to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
    * Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
  - Simplifies inlining control for the programmer
-qinline=level=X, X = 0..10 (default 5)
  - Consistent across all languages and optimization levels
  - Previous mechanisms still available
User inline control with -qinline+<function_name> and -qinline-<function_name>
Automatic inlining before loop optimization
  - Previously only available at -O5, or for user inlining in C++
  - Available at all levels of -qhot (default at -O3 and up)
  - Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
  - Use of alternate math libraries
-qstrict guarantees results identical to unoptimized code, at the expense of optimization
  - Suboptions allow fine-grained control over this guarantee
  - Examples:
      -qstrict=precision        Strict FP precision
      -qstrict=exceptions       Strict FP exceptions
      -qstrict=ieeefp           Strict IEEE FP implementation
      -qstrict=nans             Strict generation and computation of NaNs
      -qstrict=order            Do not modify evaluation order
      -qstrict=vectorprecision  Maintain precision over all loop iterations
Suboptions can be combined: -qstrict=precision:nonans
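The reason evaluation order matters can be shown with a three-term sum in C. This is a generic IEEE double-precision illustration, not XL-specific code; the function names are hypothetical. The two functions are algebraically equal, yet they return different values for some inputs, which is why reordering is only allowed when the corresponding strictness is relaxed.

```c
/* Sum evaluated in source order. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* The same sum after reassociation, as an optimizer might rewrite it
 * when it is allowed to change the evaluation order. */
double sum_reassociated(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first form cancels the large terms before adding c, while in the second form c is absorbed by rounding when it is added to -1e20.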
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
  - http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
  - http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++, or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
© Copyright IBM Corporation 2011. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, the on-demand business logo, Rational, the Rational logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7: Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
XML Compiler Transformation Reports
Generate compilation reports consumable by other tools
  - Enable better visualization and analysis of compiler information
  - Help users do manual performance tuning
  - Help automatic performance tuning through performance tool integration
Unified report from all compiler subcomponents and analysis
  - Compiler options
  - Pseudo-sources
  - Compiler transformations, including missed opportunities
Consistent support among Fortran and C/C++
Controlled under option
  -qlistfmt=xml=inlines    generates inlining information
  -qlistfmt=xml=transform  generates loop transformation information
  -qlistfmt=xml=data       generates data reorganization information
  -qlistfmt=xml=pdf        generates dynamic profiling information
  -qlistfmt=xml=all        turns on all optimization content
  -qlistfmt=xml=none       turns off all optimization content
Compiler Transformation Report Contents
Program characteristics
  - Compiler version
  - Date of compilation
  - Source file table
  - Function table
  - Loop table (line number, nest level, iteration count, loop attributes)
  - Transformation table (line number, transformation description)
  - Pseudocode
Transformations at both high-level and low-level optimizations
  - Intra-procedural transformations
    * Loop transformations
    * Data prefetch
    * Vectorization and SIMDization
    * Parallelization
    * Instruction scheduling
  - Inter-procedural transformations
    * Inlining
    * Data reorganization
Profiling information
  - Basic block counters
  - Call counters
  - Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
Compiler Optimizations for Performance
  - Compiler static analysis and optimization
  - User-guided optimizations through compiler options and directives
  - Automatic compiler optimizations through profile-directed feedback
XL Compiler and Tooling Integration
  - Compiler feedback view in PTP
  - Compiler transformation reports and the HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Figure: profile-directed feedback flow]
  - SOURCE CODE is compiled and linked with -qpdf1 (static analysis, profile-based refinement) to produce the INSTRUMENTED APPLICATION
  - The instrumented application is run on SAMPLE INPUTS to produce PROFILE DATA
  - The source is compiled and linked again with -qpdf2 (profile-directed optimizations) to produce the OPTIMIZED APPLICATION
Hardware and software constraints
Multiple sample runs for different hardware performance events
Profile-based instrumentation refinement
Basic Block and Call Counter Information
Step 1: compile the application with -qpdf1 to generate an instrumented executable
Step 2: run the executable with a typical input data set to gather profiling information
Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters
  Call Name     Call Execution Count   Line
  smtctl_       0                      79
  is_smt_on_    1                      55
  jbind_        1                      109
  jbind_        0                      108
  aff_          1                      38

Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters
  Block Index   Block Execution Count   Start Line   End Line
  3             5                       1            33
  4             5                       33           34
  ...
Cache Miss Information
Columns: Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate
Slide callouts: delinquent load, source code location, cache miss
  ((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
    | 3 | 408 | 2 | 17446 | 24
  ((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
    | 3 | 412 | 2 | 9999 | 14
  ((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
    | 3 | 413 | 2 | 9974 | 14
  ((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
    | 3 | 525 | 2 | 1033429 | 6
  ((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
    | 3 | 557 | 2 | 11553 | 16
Loop information
Loop Table
Columns: Loop Index, Start Line, End Line, Parent Loop Index, Nest Level, Minimum Cost, Maximum Cost, Iteration Count, Attributes
Slide callout: source location
Rows (empty cells omitted):
  1 | 203 | 1 | 19630 | 19630 | 75 (array) | well behaved; bump normalized; lower bound normalized
  2 | 188 | 1 | 20413 | 20413 | 149 (array) | perfect nest; well behaved; bump normalized; guarded; lower bound normalized
  3 | 141 | 1 | 13300 | 13300 | 100 (default) | perfect nest; well behaved; bump normalized; guarded; lower bound normalized
  4 | 203 | 1 | 19630 | 19630 | 5 (PDF) | residual; well behaved; bump normalized; guarded; lower bound normalized
Loop iteration count is based on static analysis or dynamic profiling
Loop Transformation Reports
Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number 108; Loop Line Number 206
4 | LoopVectorVersion (success) | High Level Optimizer | 4 | 108 | 3 | Vector versioning was performed | not available
20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
    void foo(float *p, float *q, float *r, int n)
    {
        for (int i = 0; i < n; i++)
            p[i] = p[i] + q[i] * r[i];
    }
Report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning: modified source file (file.c):
    void foo(float * restrict p, float * restrict q, float * restrict r, int n)
    {
        for (int i = 0; i < n; i++)
            p[i] = p[i] + q[i] * r[i];
    }
Report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
Explicit SIMD programming for POWER7
Enabled under -qaltivec
Successor to the Altivec programming extensions on POWER6/PPC970
  - Altivec data types (16-byte vectors):
      vector char    16 elements
      vector short    8 elements
      vector pixel    8 elements
      vector int      4 elements
      vector float    4 elements
  - VSX Altivec extensions (16-byte vectors):
      vector double     2 elements
      vector long long  2 elements
Altivec built-in functions extended to the new data types
    vec_add(vector double, vector double)
    vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, ...
Unaligned load and store operations
  - Altivec truncating loads/stores still available: vec_ld, vec_st
  - New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL, and COMPLEX
Features
  - Basic-block-level SIMDization
  - Loop-level aggregation
  - Data conversion, reduction
  - Loops with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support of unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report message, followed by suggested user actions
Loop was SIMD vectorized
It is not profitable to vectorize
  - Use #pragma simd_level(10) to force the compiler to do SIMDization
Memory accesses have non-vectorizable alignment
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
Data dependence prevents SIMD vectorization
  - Use fewer pointers when possible
  - Use #pragma independent if the loop has no loop-carried dependency
  - Use #pragma disjoint (a, b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
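Two of the user actions above, setting data alignment and declaring pointers restrict, can be sketched in portable C as follows. The array names, the length LEN, and the helper is_aligned16 are assumptions for illustration; the aligned attribute is written in the GNU-style syntax quoted on the slide.

```c
#include <stdint.h>

enum { LEN = 8 };   /* illustrative array length */

/* 16-byte alignment matches the vector width, so vector loads and
 * stores on these arrays need no realignment code. */
static float x[LEN] __attribute__((aligned(16)));
static float y[LEN] __attribute__((aligned(16)));

/* restrict promises that dst and src never overlap, which removes the
 * aliasing dependence that would otherwise block vectorization. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += k * src[i];
}

/* Small helper to check the alignment that was requested. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

The restrict qualifier is a promise made by the programmer: calling scale_add with overlapping arrays is undefined behavior, so it should only be added where the arrays are known to be disjoint.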
SIMDization Tuning
Transformation report message, followed by suggested user actions
Memory accesses have non-vectorizable strides
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time-known stride information
  - Stride versioning
Either operation or data type is not suitable for SIMD vectorization
  - Do statement splitting and loop splitting
Loop structure prevents SIMD vectorization
  - Convert while-loops into do-loops when possible
  - Limit the use of control flow in a loop
  - Use MIN and MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
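The advice to use MIN and MAX instead of if-then-else can be sketched in C with a clamping loop. The macros and function names are hypothetical, and the two versions are equivalent only when lo <= hi.

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Branchy form: the if-then-else chain is control flow inside the
 * loop body, which can keep the loop from being vectorized. */
void clamp_branchy(float *x, int n, float lo, float hi)
{
    for (int i = 0; i < n; i++) {
        if (x[i] < lo)
            x[i] = lo;
        else if (x[i] > hi)
            x[i] = hi;
    }
}

/* Same result for lo <= hi, written as min/max expressions that map
 * naturally onto vector select or min/max instructions. */
void clamp_minmax(float *x, int n, float lo, float hi)
{
    for (int i = 0; i < n; i++)
        x[i] = MIN(MAX(x[i], lo), hi);
}
```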
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
    * Internally exploits VSX instructions
    * SP: average speedup of 1.99 vs POWER5 MASSV
    * DP: average speedup of 1.27 vs POWER5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
    * Tuned math routines operating on vector data types
    * Over 35 frequently used mathematical functions
    * Both single and double precision
    * To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above; use -qstrict=vectorprecision to maintain precision over all loop iterations
    for (i = 0; i < n; i++)
        b[i] = sqrt(a[i]);
  is transformed into
    __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, supporting up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride, and prefetch depth, at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
Partial cache line touch
    void __partial_dcbt(void *address)
Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
    for (i = 0; i < n; i++) a[i] = b[i] + ...

    /* Store stream prefetch for array a (stream ID 11) */
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b (stream ID 0) */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    /* Start stream prefetch */
    __protected_stream_go();
Callouts: stream direction (FORWARD), stream ID (11 and 0), stream length (n*sizeof(double)/128), prefetch depth (DEEPER), start stream prefetch
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Optimization
Traditional unimodular loop transformations for prefect regular loop nests
ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling
ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above
Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and
complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp
bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence
formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)
Enable interchange of j and k loops to improve access locality for b
ndashIdentifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)
Also allows transformation of imperfect loop nests
ndashIntervening code between loops
Only available at -qhot=level=2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Example
Sequence of Imperfect Loop nests
for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip
for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]
InputParallelism amp Locality
Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]
Output
Loop fusionLoop skewing to enable tiling
Loop tiling for cache
Loop skewing forPipeline parallelization
Loop tiling for registers
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Data ReorganizationData reorganization transformations to reduce memory latency
ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout
Enabled at O5
Data Reorganization Report
Seq
Type Phase Data Name
Category Region Line Description
1 ArraySplitting High Level Optimizer
iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 ArrayCoalescing High Level Optimizer
net Global variables were aggregated
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
S[0]F0 S[0]F1 S[0]F2 S[0]F3
S[1]F0 S[1]F1 S[1]F2 S[1]F3
S[2]F0 S[2]F1 S[2]F2 S[2]F3
S[3]F0 S[3]F1 S[3]F2 S[3]F3
hellip hellip hellip hellip
F0[0]
F1[0]
F2[0]
F3[0]
F0[1]
F1[1]
F2[1]
F3[1]
F0[2]
F1[2]
F2[2]
F3[2]
F0[3]
F1[3]
F2[3]
F3[3]
hellip
hellip
hellip
hellip
A[0]
A[1]
A[2]
A[3]
A[0][2] A[0][3]A[0][1]A[0][0]
A[1][2] A[1][3]A[1][1]A[1][0]
A[2][2] A[2][3]A[2][1]A[2][0]
A[3][2] A[3][3]A[3][1]A[3][0]
hellip
hellip
hellip
hellip
A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip
A[0][0] A[0][1] A[0][2] A[0][3]
A[1][0] A[1][1] A[1][2] A[1][3]
A[2][0] A[2][1] A[2][2] A[2][3]
A[3][0] A[3][1] A[3][2] A[3][3]
hellip hellip hellip hellip
hellip
hellip
hellip
hellip
hellip
Arsquo[0][0]
Arsquo[1][0]
Arsquo[2][0]
Arsquo[3][0]
Arsquo[0][1]
Arsquo[1][1]
Arsquo[2][1]
Arsquo[3][1]
Arsquo[0][2]
Arsquo[1][2]
Arsquo[2][2]
Arsquo[3][2]
Arsquo[0][3]
Arsquo[1][3]
Arsquo[2][3]
Arsquo[3][3]
hellip
hellip
hellip
hellip
hellip helliphelliphelliphellip
Array splitting
Array merging Array transposing
Data locality cache utilization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran
ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default
ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
Improved interaction between OpenMP and automatic SIMD
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Automatic parallelizationImprovements to automatic parallelization
ndashMore effective array data flow analysis for array privatization
ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing
SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results: -qstrict suboptions
- Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
  - Use of alternate math libraries
- -qstrict guarantees results identical to noopt, at the expense of optimization
  - Suboptions allow fine-grained control over this guarantee
  - Examples:
      -qstrict=precision        Strict FP precision
      -qstrict=exceptions       Strict FP exceptions
      -qstrict=ieeefp           Strict IEEE FP implementation
      -qstrict=nans             Strict generation and computation of NaNs
      -qstrict=order            Do not modify evaluation order
      -qstrict=vectorprecision  Maintain precision over all loop iterations
- Suboptions can be combined: -qstrict=precision:nonans
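To make the evaluation-order point concrete, here is a minimal sketch in portable C showing that floating-point addition is not associative, which is why reordering can change results. The function names are hypothetical, and whether a given compiler actually reassociates such an expression depends on its options and optimization level.

```c
/* Left-to-right evaluation, as written in the source. */
double sum_in_order(double a, double b, double c)
{
    return (a + b) + c;
}

/* The same three terms, reassociated the way an optimizer might
 * when it is allowed to change the evaluation order. */
double sum_reordered(double a, double b, double c)
{
    return (a + c) + b;
}
```

With a = 1e20, b = 1.0 and c = -1e20, the first form loses b when it is absorbed into the large value, while the second cancels the large values first and keeps it. This is the kind of difference that a strict evaluation-order setting is meant to rule out.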
The IBM Rational C/C++ Café on IBM developerWorks: ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers:
- C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
- Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
- Or send e-mail to xl_feature@ca.ibm.com
Documentation
- An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
- Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
- Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
- Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
Compiler Transformation Report Contents
- Program characteristics
  - Compiler version
  - Date of compilation
  - Source file table
  - Function table
  - Loop table (line number, nest level, iteration count, loop attributes)
  - Transformation table (line number, transformation description)
  - Pseudocode
- Transformations at both high-level and low-level optimizations
  - Intra-procedural transformations
    - Loop transformations
    - Data prefetch
    - Vectorization and SIMDization
    - Parallelization
    - Instruction scheduling
  - Inter-procedural transformations
    - Inlining
    - Data reorganization
- Profiling information
  - Basic block counters
  - Call counters
  - Cache miss counters
XL Compiler Assisted Performance Analysis and Tuning
- Compiler optimizations for performance
  - Compiler static analysis and optimization
  - User-guided optimizations through compiler options and directives
  - Automatic compiler optimizations through profile-directed feedback
- XL compiler and tooling integration
  - Compiler feedback view in PTP
  - Compiler transformation reports and the HPCS toolkit help detect bottlenecks and identify solutions
Compiler Feedback View
Multiple-pass Dynamic Profiling Infrastructure
[Figure: profile-directed feedback flow. Source code is compiled and linked with -qpdf1 (static analysis, profile-based refinement) to produce an instrumented application; running it on sample inputs produces profile data; compiling and linking with -qpdf2 (profile-directed optimizations) produces the optimized application.]
- Hardware and software constraints
- Multiple sample runs for different hardware performance events
- Profile-based instrumentation refinement
Basic Block and Call Counter Information
- Step 1: compile the application with -qpdf1 to generate an instrumented executable
- Step 2: run the executable with a typical input data set to gather profiling information
- Step 3: recompile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49

Call Name     Call Execution Count   Line
smtctl_       0                      79
is_smt_on_    1                      55
jbind_        1                      109
jbind_        0                      108
aff_          1                      38

Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81

Block Index   Block Execution Count   Start Line   End Line
3             5                       1            33
4             5                       33           34
…
Cache Miss Information
Columns: Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 408 | 2 | 17446 | 24
((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 412 | 2 | 9999 | 14
((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 413 | 2 | 9974 | 14
((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 525 | 2 | 1033429 | 6
((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 557 | 2 | 11553 | 16

Callouts: delinquent load (memory reference); source code location (region, line); cache miss (cache level, miss count, miss rate).
Loop information

Loop Table
Columns: Loop Index | Start Line | End Line | Parent Loop Index | Nest Level | Minimum Cost | Maximum Cost | Iteration Count | Attributes

1 203 1 19630 19630 75 (array)
  Attributes: well behaved; bump normalized; lower bound normalized
2 188 1 20413 20413 149 (array)
  Attributes: perfect nest; well behaved; bump normalized; guarded; lower bound normalized
3 141 1 13300 13300 100 (default)
  Attributes: perfect nest; well behaved; bump normalized; guarded; lower bound normalized
4 203 1 19630 19630 5 (PDF)
  Attributes: residual; well behaved; bump normalized; guarded; lower bound normalized

Callouts: source location (line columns); loop iteration count based on static analysis or dynamic profiling.
Loop Transformation Reports
Seq  Type                         Phase                 Region  Line  Loop Index  Description                       Attributes
2    LoopVector (success)         High Level Optimizer  4       217   1           Loop vectorization was performed  not available
3    LoopFusion (success)         High Level Optimizer  4       108   3           Loops were fused                  Loop Line Number 108; Loop Line Number 206
4    LoopVectorVersion (success)  High Level Optimizer  4       108   3           Vector versioning was performed   not available
20   ModuloSchedule (success)     Low Level Optimizer   12      3499  26          Loop was modulo scheduled         Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)

Original source file (file.c):

    void foo(float *p, float *q, float *r, int n)
    {
        for (int i = 0; i < n; i++)
            p[i] = p[i] + q[i] * r[i];
    }

Transformation report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.

Tuning: modified source file (file.c):

    void foo(float * restrict p, float * restrict q, float * restrict r, int n)
    {
        for (int i = 0; i < n; i++)
            p[i] = p[i] + q[i] * r[i];
    }

Transformation report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Successor to the AltiVec programming extensions on POWER6/PPC970
  - AltiVec data types (16-byte vectors):
      vector char       16 elements
      vector short      8 elements
      vector pixel      8 elements
      vector int        4 elements
      vector float      4 elements
  - VSX AltiVec extensions (16-byte vectors):
      vector double     2 elements
      vector long long  2 elements
- AltiVec built-in functions extended to the new data types:
    vec_add(vector double, vector double)
    vec_sub(vector long long, vector long long)
- New vector operations: vec_mul, vec_div, …
- Unaligned load and store operations
  - AltiVec truncating loads/stores still available: vec_ld, vec_st
  - New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
- Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
- Features
  - Basic block level SIMDization
  - Loop level aggregation
  - Data conversion, reduction
  - Loops with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support of unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report messages and the corresponding user actions:
- Loop was SIMD vectorized
- It is not profitable to vectorize
  - Use #pragma simd_level(10) to force the compiler to do SIMDization
- Memory accesses have non-vectorizable alignment
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
- Data dependence prevents SIMD vectorization
  - Use fewer pointers when possible
  - Use #pragma independent if the loop has no loop-carried dependency
  - Use #pragma disjoint (*a, *b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
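Two of the user actions above (data alignment and the restrict keyword) can be sketched in portable C. This is only an illustration: the array names, the saxpy function and the is_aligned16 helper are hypothetical, and the XL-specific __alignx built-in and #pragma directives are deliberately left out so the sketch compiles with any C99 compiler that accepts GNU-style attributes.

```c
#include <stddef.h>
#include <stdint.h>

#define N 256

/* 16-byte alignment matches the 16-byte vector width used on the slides.
 * __attribute__((aligned(n))) is the attribute named in the user actions. */
float vec_x[N] __attribute__((aligned(16)));
float vec_y[N] __attribute__((aligned(16)));

/* restrict promises that dst and src never overlap, which removes the
 * aliasing dependence that can otherwise block SIMD vectorization. */
void saxpy(float * restrict dst, const float * restrict src, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + a * src[i];
}

/* Helper: returns 1 when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

Recompiling with a transformation report after such a change is one way to check whether the alignment or dependence message has gone away.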
SIMDization Tuning
Transformation report messages and the corresponding user actions:
- Memory accesses have non-vectorizable strides
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time-known stride information
  - Stride versioning
- Either operation or data type is not suitable for SIMD vectorization
  - Do statement splitting and loop splitting
- Loop structure prevents SIMD vectorization
  - Convert while-loops into do-loops when possible
  - Limit the use of control flow in a loop
  - Use MIN/MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
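The "MIN/MAX instead of if-then-else" action can be sketched as follows. The MIN and MAX macros and both function names are hypothetical; the two loops compute the same clamp for lo <= hi and inputs without NaNs, but the second keeps the loop body as a single straight-line statement.

```c
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Branchy form: if-then-else control flow inside the loop body. */
void clamp_branchy(float *v, size_t n, float lo, float hi)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i] < lo)
            v[i] = lo;
        else if (v[i] > hi)
            v[i] = hi;
    }
}

/* Same clamp written with MIN/MAX: one assignment per iteration,
 * with no explicit branches in the loop body. */
void clamp_minmax(float *v, size_t n, float lo, float hi)
{
    for (size_t i = 0; i < n; i++)
        v[i] = MIN(MAX(v[i], lo), hi);
}
```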
MASS enhancements and Auto-vectorization
- MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
    - Internally exploits VSX instructions
    - SP: average speedup of 199 vs POWER5 MASSV
    - DP: average speedup of 127 vs POWER5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
    - Tuned math routines operating on vector data types
    - Over 35 frequently used mathematical functions
    - Both single and double precision
    - To be used in conjunction with explicit SIMD programming
- Auto-vectorization at optimization level -O3 or above; -qstrict=vectorprecision maintains precision over all loop iterations

Source loop:

    for (i=0; i<n; i++)
        b[i] = sqrt(a[i]);

Generated call:

    __vsqrt_P7(b, a, n);

Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
- Software control over the POWER7 prefetch engine, which supports up to 12 data streams
- Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
- Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
- Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
- Partial cache line touch
    void __partial_dcbt(void *address)
- Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
- Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
Loop:

    for (i=0; i<n; i++) a[i] = b[i] +

Prefetch setup:

    /* Store stream prefetch for array a */
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);

    /* Transient stream prefetch for array b */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);

    __eieio();
    __protected_stream_go();    /* start stream prefetch */

Callouts: stream direction (FORWARD); stream length (n*sizeof(double)/128); prefetch depth (DEEPER); stream id (last argument).
Loop Optimization
- Traditional unimodular loop transformations for perfect, regular loop nests
  - Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing and loop unrolling
  - Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
  - Under optimization level -O3, -O3 -qhot or above
- Polyhedral framework for any loop nests
  - Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
    - Loop skewing, loop tiling for triangular loop shapes
  - Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks, at all hot levels (-O3 -qhot or above)
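For readers unfamiliar with loop tiling, one of the transformations listed above, here is a hand-written sketch of what it does to a simple loop nest. It is not compiler output: the function names and the TILE value are hypothetical, and a compiler would choose tile sizes from the target's cache parameters.

```c
#include <stddef.h>

#define TILE 4   /* hypothetical tile size; real values depend on the cache */

/* Straightforward transpose: dst is written with stride n. */
void transpose_plain(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled (blocked) transpose: the iteration space is cut into
 * TILE x TILE blocks so that each block of src and dst is reused
 * while it is still in cache. The extra bounds checks handle
 * sizes that are not a multiple of TILE. */
void transpose_tiled(const double *src, double *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both functions perform exactly the same assignments; only the order in which the iterations are visited differs.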
Polyhedral Loop Transformation Examples
Dependence analysis

    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
        end do
      end do
    end do

- Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplication:

    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) = c(j,i) + a(k,i)*b(j,k)
        end do
      end do
    end do

- Also allows transformation of imperfect loop nests
  - Intervening code between loops
- Only available at -qhot=level=2
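The dependence-analysis example can be checked by hand with a small C transcription. This is a sketch under stated assumptions: the problem size N, the function names and the [k][j] layout (chosen so that j is the contiguous index, mirroring Fortran's column-major b(j,k)) are all illustrative. The interchange is legal because, for a fixed i, the loop reads c(1..i) and writes c(i+1..N), so reads and writes never overlap.

```c
#define N 8

/* Index 0 is unused so the 1-based bounds of the original nest are kept.
 * b[k][j] stands for Fortran b(j,k): j is the stride-one index. */

/* Original order i, j, k: the innermost k loop walks b with a large stride. */
void nest_original(double c[N + 1], double b[N + 1][N + 1])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= N; k++)
                c[k] = c[j] + b[k][j];
}

/* Interchanged order i, k, j: the innermost j loop is now stride-one on b. */
void nest_interchanged(double c[N + 1], double b[N + 1][N + 1])
{
    for (int i = 1; i <= N; i++)
        for (int k = i + 1; k <= N; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + b[k][j];
}
```

Running both versions on the same input gives identical contents of c, which is the property the compiler's exact dependence test has to establish before interchanging.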
Polyhedral Loop Transformation Example
Sequence of imperfect loop nests, optimized for parallelism & locality

Input:

    for j …
      for k …
        A[j] = B[k]
        for i …
          C[i][k] = …
    for j …
      for k …
        M[j] = N[k]
        for i …
          X = C[i][k]

Output (optimized loops):

    for jT …
      #pragma omp parallel for
      for kT …
        for j …
          for k …
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = …
              X = C[i][k]

Transformations applied:
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Data Reorganization
- Data reorganization transformations to reduce memory latency
  - Context-sensitive and context-insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing and data interleaving to reshape data layout
- Enabled at -O5

Data Reorganization Report
Columns: Seq | Type | Phase | Data Name | Category | Region | Line | Description

1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated
[Figure: data layouts before and after reorganization for data locality and cache utilization. Array splitting and array merging: an array of structures S[i] with fields F0 to F3 versus separate per-field arrays F0[i], F1[i], F2[i], F3[i]. Array transposing: A[i][j] versus A'[j][i].]
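The array splitting shown in the figure can be written out by hand as follows. This is only a sketch of the idea, not what the compiler emits: the type names, the fixed COUNT and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 4

/* Array of structures: the four fields of one element are adjacent,
 * so a loop that reads only f0 still pulls f1..f3 into the cache. */
struct record { double f0, f1, f2, f3; };

/* After splitting: one array per field, so a loop over f0 touches
 * only contiguous f0 values. */
struct split_records {
    double f0[COUNT];
    double f1[COUNT];
    double f2[COUNT];
    double f3[COUNT];
};

/* Copy n records (n <= COUNT) from the structure layout to the split layout. */
void split_fields(const struct record *s, struct split_records *out, size_t n)
{
    assert(n <= COUNT);
    for (size_t i = 0; i < n; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Strided access: consecutive f0 values are sizeof(struct record) apart. */
double sum_f0_aos(const struct record *s, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += s[i].f0;
    return total;
}

/* Stride-one access over the split f0 array. */
double sum_f0_split(const struct split_records *r, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += r->f0[i];
    return total;
}
```

Both sums return the same value; the split layout simply lets the loop read consecutive memory locations.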
User Explicit Parallelization with OpenMP
- Full OpenMP 3.0 implementation on C, C++ and Fortran
  - Full OpenMP task parallelization
  - Privatization of Fortran descriptor-based arrays
- Efficient threadprivate implementation using OS-supported thread-local storage (TLS) by default
  - Bypasses the expensive pthread key mechanism
  - pthread-based implementation available under option control -qsmp=noostls for backward compatibility
- Improved interaction between OpenMP and automatic SIMD
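As an illustration of the OpenMP 3.0 task construct mentioned above, here is a minimal divide-and-conquer sum in standard OpenMP C. The function names and the SERIAL_CUTOFF threshold are hypothetical; if OpenMP is not enabled at compile time, the pragmas are ignored and the code runs serially with the same result.

```c
#include <stddef.h>

#define SERIAL_CUTOFF 1024  /* hypothetical size below which no tasks are created */

/* Recursively sum v[lo..hi) by splitting the range into two tasks. */
static long sum_range(const int *v, size_t lo, size_t hi)
{
    if (hi - lo <= SERIAL_CUTOFF) {
        long s = 0;
        for (size_t i = lo; i < hi; i++)
            s += v[i];
        return s;
    }

    size_t mid = lo + (hi - lo) / 2;
    long left = 0, right = 0;

    /* left and right must be shared so the child tasks can write them */
    #pragma omp task shared(left)
    left = sum_range(v, lo, mid);
    #pragma omp task shared(right)
    right = sum_range(v, mid, hi);
    /* wait for both child tasks before combining their results */
    #pragma omp taskwait

    return left + right;
}

/* One thread starts the recursion; the rest of the team executes tasks. */
long parallel_sum(const int *v, size_t n)
{
    long total = 0;
    #pragma omp parallel
    {
        #pragma omp single
        total = sum_range(v, 0, n);
    }
    return total;
}
```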
Automatic parallelization
- Improvements to automatic parallelization
  - More effective array data flow analysis for array privatization
  - Automatic privatization of descriptor-based Fortran arrays
  - Runtime dependence testing
- SMP runtime improvements
  - Leverage of TLS in the SMP runtime implementation
- Enable automatic parallelization with the compiler option -qsmp
XL Compiler Assisted Performance Analysis and Tuning
Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options
and directivesndash Automatic compiler optimizations through profile
directed feedback
XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help
detect bottlenecks and identify solutions
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Compiler Feedback View
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SOURCE CODE
COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement
COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations
INSTRUMENTEDAPPLICATION
OPTIMIZEDAPPLICATION
PROFILE DATA
SAMPLE INPUTS
SAMPLE INPUTS
Multiple-pass Dynamic Profiling Infrastructure
Hardware and software constraints
Multiple sample runs for different hardware performance events
Profile based instrumentation refinement
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Basic Block and Call Counter Information
Step 1: Compile the application with -qpdf1 to generate an instrumented executable
Step 2: Run the executable with a typical input data set to gather profiling information
Step 3: Recompile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report
Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters
  Call Name     Call Execution Count   Line
  smtctl_       0                      79
  is_smt_on_    1                      55
  jbind_        1                      109
  jbind_        0                      108
  aff_          1                      38
Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters
  Block Index   Block Execution Count   Start Line   End Line
  3             5                       1            33
  4             5                       33           34
  ...
Cache Miss Information
Columns: Memory Reference, Region, Line, Cache Level, Miss Count, Miss Rate
Callouts: the Memory Reference column identifies the delinquent load, Region and Line give the source code location, and Cache Level, Miss Count and Miss Rate describe the cache miss
Memory Reference: ((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3, Line 408, Cache Level 2, Miss Count 17446, Miss Rate 24
Memory Reference: ((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3, Line 412, Cache Level 2, Miss Count 9999, Miss Rate 14
Memory Reference: ((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3, Line 413, Cache Level 2, Miss Count 9974, Miss Rate 14
Memory Reference: ((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  Region 3, Line 525, Cache Level 2, Miss Count 1033429, Miss Rate 6
Memory Reference: ((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  Region 3, Line 557, Cache Level 2, Miss Count 11553, Miss Rate 16
Loop information
Loop Table
Columns: Loop Index, Start Line, End Line, Parent Loop Index, Nest Level, Minimum Cost, Maximum Cost, Iteration Count, Attributes
Callouts: source location; loop iteration count based on static analysis or dynamic profiling
1 203 1 19630 19630 75 (array)
  - well behaved
  - bump normalized
  - lower bound normalized
2 188 1 20413 20413 149 (array)
  - perfect nest
  - well behaved
  - bump normalized
  - guarded
  - lower bound normalized
3 141 1 13300 13300 100 (default)
  - perfect nest
  - well behaved
  - bump normalized
  - guarded
  - lower bound normalized
4 203 1 19630 19630 5 (PDF)
  - residual
  - well behaved
  - bump normalized
  - guarded
  - lower bound normalized
Loop Transformation Reports
Columns: Seq, Type, Phase, Region, Line, Loop Index, Description, Attributes
Seq 2: LoopVector (success), High Level Optimizer, Region 4, Line 217, Loop Index 1
  Description: Loop vectorization was performed
  Attributes: not available
Seq 3: LoopFusion (success), High Level Optimizer, Region 4, Line 108, Loop Index 3
  Description: Loops were fused
  Attributes: Loop Line Number 108; Loop Line Number 206
Seq 4: LoopVectorVersion (success), High Level Optimizer, Region 4, Line 108, Loop Index 3
  Description: Vector versioning was performed
  Attributes: not available
Seq 20: ModuloSchedule (success), Low Level Optimizer, Region 12, Line 3499, Loop Index 26
  Description: Loop was modulo scheduled
  Attributes: Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
  void foo (float *p, float *q, float *r, int n)
  {
    /* the operator between q[i] and r[i] was lost in extraction; '*' is assumed */
    for (int i=0; i<n; i++) p[i] = p[i] + q[i]*r[i];
  }
Transformation report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning: add the restrict qualifier to the pointer parameters
Modified source file (file.c):
  void foo (float * restrict p, float * restrict q, float * restrict r, int n)
  {
    for (int i=0; i<n; i++) p[i] = p[i] + q[i]*r[i];
  }
Transformation report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
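The tuning step above can be tried with a small portable C99 sketch. The function names madd_may_alias and madd_restrict are illustrative and not from the slides, and multiplication is assumed for the operator that is missing from the slide text. Both versions compute the same values when the arrays do not overlap; the difference is only in what the compiler is allowed to assume.

```c
/* Without restrict, the compiler must assume that a store to p[i]
   may change a later load from q or r, which blocks parallelization. */
void madd_may_alias(float *p, const float *q, const float *r, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}

/* restrict is the programmer's promise that p, q and r never overlap.
   Calling this with overlapping arrays is undefined behavior. */
void madd_restrict(float *restrict p, const float *restrict q,
                   const float *restrict r, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = p[i] + q[i] * r[i];
}
```

Compiling each version with -qlistfmt=xml=all and comparing the two reports should show whether the aliasing message disappears for a given compiler level.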
Explicit SIMD programming for POWER7 (enabled under -qaltivec)
Successor to AltiVec programming extensions on POWER6/PPC970
- AltiVec data types (16-byte vectors)
    vector char: 16 elements
    vector short: 8 elements
    vector pixel: 8 elements
    vector int: 4 elements
    vector float: 4 elements
- VSX AltiVec extensions (16-byte vectors)
    vector double: 2 elements
    vector long long: 2 elements
AltiVec built-in functions extended to new data types
    vec_add(vector double, vector double)
    vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, ...
Unaligned load and store operations
- AltiVec truncating loads/stores still available: vec_ld, vec_st
- New non-truncating loads/stores: vec_xld2, vec_xstd2
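A minimal sketch of explicit SIMD with the vector double type follows. It assumes that the compiler defines the __VSX__ macro and provides altivec.h when VSX is enabled, which should be checked for the compiler in use. The function name add_doubles is illustrative. The sketch uses memcpy for loads and stores that are safe at any alignment instead of the vec_xld2 and vec_xstd2 built-ins named above, and on targets without VSX it falls back to a plain scalar loop.

```c
#include <string.h>

#if defined(__VSX__)
#include <altivec.h>
#endif

/* c[i] = a[i] + b[i] for n doubles, two elements per 16-byte vector. */
void add_doubles(double *c, const double *a, const double *b, int n)
{
    int i = 0;
#if defined(__VSX__)
    for (; i + 2 <= n; i += 2) {
        __vector double va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* load, safe at any alignment */
        memcpy(&vb, b + i, sizeof vb);
        vc = vec_add(va, vb);            /* vec_add extended to vector double */
        memcpy(c + i, &vc, sizeof vc);   /* store, safe at any alignment */
    }
#endif
    /* Scalar remainder, or the whole loop when VSX is not available. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```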
Automatic SIMDization
Automatic SIMDization for VMX and VSX
- Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
Features
- Basic block level SIMDization
- Loop level aggregation
- Data conversion, reduction
- Loop with limited control flow
- Automatic SIMDization with -qstrict (VSX) and -qnostrict
- Support of unaligned vector memory accesses (VSX)
- Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report: Loop was SIMD vectorized
Transformation report: It is not profitable to vectorize
User actions:
- Use pragma simd_level(10) to force the compiler to do SIMDization
Transformation report: Memory accesses have non-vectorizable alignment
User actions:
- Use __attribute__((aligned(n))) to set data alignment
- Use __alignx(16, a) to indicate the data alignment to the compiler
- Use -qassert=refalign if all references are naturally aligned
- Use array references instead of pointers where possible
Transformation report: Data dependence prevents SIMD vectorization
User actions:
- Use fewer pointers when possible
- Use pragma independent if the loop has no loop-carried dependency
- Use pragma disjoint (a, b) if a and b are disjoint
- Use the restrict keyword or the compiler option -qrestrict
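The alignment actions above can be sketched as follows. The array names src16 and dst16, the size VLEN and the function scale are illustrative. The __alignx call is guarded so that it is only compiled when the __IBMC__ or __IBMCPP__ macro is defined; other compilers see only the aligned attribute, which is also understood by GCC-compatible compilers.

```c
#include <stdint.h>

#define VLEN 64

/* 16-byte alignment matches the VMX/VSX vector width. */
float src16[VLEN] __attribute__((aligned(16)));
float dst16[VLEN] __attribute__((aligned(16)));

/* x[i] = s * y[i]; the caller promises 16-byte aligned pointers. */
void scale(float *x, const float *y, int n, float s)
{
#if defined(__IBMC__) || defined(__IBMCPP__)
    __alignx(16, x);   /* tell the compiler x is 16-byte aligned */
    __alignx(16, y);   /* tell the compiler y is 16-byte aligned */
#endif
    for (int i = 0; i < n; i++)
        x[i] = s * y[i];
}

/* Helper to check the alignment of an address at run time. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

Passing a pointer that is not actually aligned after asserting alignment gives the compiler wrong information, so the promise has to hold at every call site.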
SIMDization Tuning
Transformation report: Memory accesses have non-vectorizable strides
User actions:
- Loop interchange for stride-one accesses when possible
- Data layout reshape for stride-one accesses
- Higher optimization to propagate compile-time-known stride information
- Stride versioning
Transformation report: Either operation or data type is not suitable for SIMD vectorization
User actions:
- Do statement splitting and loop splitting
Transformation report: Loop structure prevents SIMD vectorization
User actions:
- Convert while-loops into do-loops when possible
- Limited use of control flow in a loop
- Use MIN, MAX instead of if-then-else
- Eliminate function calls in a loop through inlining
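As an illustration of the "use MIN, MAX instead of if-then-else" action, the sketch below writes the same clamp operation twice. The function names and the MIN and MAX macros are illustrative; the two forms agree when lo <= hi and the data contains no NaN values.

```c
#define MAX(x, y) ((x) > (y) ? (x) : (y))
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Branchy form: if-then-else control flow inside the loop body. */
void clamp_branchy(float *v, int n, float lo, float hi)
{
    for (int i = 0; i < n; i++) {
        if (v[i] < lo)
            v[i] = lo;
        else if (v[i] > hi)
            v[i] = hi;
    }
}

/* Straight-line form: one MIN/MAX expression per element, which maps
   naturally onto vector minimum and maximum operations. */
void clamp_minmax(float *v, int n, float lo, float hi)
{
    for (int i = 0; i < n; i++)
        v[i] = MIN(MAX(v[i], lo), hi);
}
```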
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
- POWER7 vector MASS library (libmassvp7.a)
  - Internally exploits VSX instructions
  - SP: average speedup of 199 vs POWER5 MASSV
  - DP: average speedup of 127 vs POWER5 MASSV
- POWER7 SIMD MASS library (libmass_simdp7.a)
  - Tuned math routines operating on vector data types
  - Over 35 frequently used mathematical functions
  - Both single and double precision
  - To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above
- -qstrict=vectorprecision to maintain precision over all loop iterations
Source loop:
  for (i=0; i<n; i++)
    b[i] = sqrt(a[i]);
Generated call:
  __vsqrt_P7(b, a, n)
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, which supports up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
- More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
  void __dcbtt(void *address)
  void __dcbtstt(void *address)
Partial cache line touch
  void __partial_dcbt(void *address)
Stride-N stream prefetch
  void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
  void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
  void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
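Example of POWER7 data prefetching
Loop to be prefetched:
  for (i=0; i<n; i++) a[i] = b[i] + ...
Prefetch setup issued before the loop:
  /* Store stream prefetch for array a: stream direction FORWARD, stream id 11 */
  __protected_store_stream_set(FORWARD, &a, 11);
  /* Stream length in 128-byte cache lines, prefetch depth DEEPER, stream id 11 */
  __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
  /* Transient stream prefetch for array b: stream direction FORWARD, stream id 0 */
  __protected_stream_set(FORWARD, &b, 0);
  /* Stream length in 128-byte cache lines, prefetch depth DEEPER, stream id 0 */
  __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
  __eieio();
  /* Start stream prefetch */
  __protected_stream_go();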
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
- Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing and loop unrolling
- Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
- Under optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
- Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  - Loop skewing, loop tiling for triangular loop shapes
- Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
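Loop tiling, one of the transformations listed above, can be illustrated by hand on a matrix transpose. This is a sketch of what the transformation does to the iteration order, not the compiler's actual output; the names DIM, TILE, transpose_naive and transpose_tiled are illustrative, and DIM is assumed to be a multiple of TILE.

```c
#define DIM 8
#define TILE 4   /* illustrative tile size; DIM must be a multiple of TILE */

/* Untiled: dst is written with stride DIM, so for large DIM each
   store touches a different cache line. */
void transpose_naive(double dst[DIM][DIM], double src[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            dst[j][i] = src[i][j];
}

/* Tiled: the two outer loops walk TILE x TILE blocks, the two inner
   loops stay inside one block so its cache lines are reused. */
void transpose_tiled(double dst[DIM][DIM], double src[DIM][DIM])
{
    for (int ii = 0; ii < DIM; ii += TILE)
        for (int jj = 0; jj < DIM; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

Both functions visit exactly the same (i, j) pairs, only in a different order, which is why the transformation is legal for this loop nest.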
Polyhedral Loop Transformation Examples
Dependence analysis
  do i = 1,N
    do j = 1,i
      do k = i+1,N
        c(k) = c(j) + b(j,k)
- Enables interchange of the j and k loops to improve access locality for b
- Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot
Loop transformations
- Tiling of triangular matrix multiplication
  do i = 1,N
    do j = i+1,N
      do k = i+1,N
        c(j,i) += a(k,i)*b(j,k)
- Also allows transformation of imperfect loop nests
  - Intervening code between loops
- Only available at -qhot=level=2
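The dependence-analysis example above can be checked by hand with a C sketch. Within one iteration of i the loop writes only c(k) with k > i and reads only c(j) with j <= i, so the j and k loops can be interchanged. The names NN, nest_jk and nest_kj are illustrative. Indices are kept 1-based to mirror the Fortran, and b is stored as b[k][j] so that j is the contiguous index, as it is for b(j,k) in column-major Fortran.

```c
#define NN 6

/* Original order: j outer, k inner. The inner loop walks b with a
   stride of one row (NN + 1 elements). */
void nest_jk(double c[NN + 1], double b[NN + 1][NN + 1])
{
    for (int i = 1; i <= NN; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= NN; k++)
                c[k] = c[j] + b[k][j];   /* b(j,k) in the Fortran original */
}

/* Interchanged order: k outer, j inner. The inner loop now walks
   b[k][1..i] with stride one. */
void nest_kj(double c[NN + 1], double b[NN + 1][NN + 1])
{
    for (int i = 1; i <= NN; i++)
        for (int k = i + 1; k <= NN; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + b[k][j];
}
```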
Polyhedral Loop Transformation Example
Goal: Parallelism & Locality
Input: sequence of imperfect loop nests
  for j ...
    for k ...
      A[j] = B[k]
      for i ...
        C[i][k] = ...
  for j ...
    for k ...
      M[j] = N[k]
      for i ...
        X = C[i][k]
Output: optimized loops
  for jT ...
    omp parallel for
    for kT ...
      for j ...
        for k ...
          A[j] = B[k]
          M[j] = N[k]
          for i
            C[i][k] = ...
            X = C[i][k]
Transformations applied:
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context-sensitive and context-insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
Seq 1: ArraySplitting, High Level Optimizer, Data Name iplus, 9
  Description: An array of a large aggregated data-type was split into multiple arrays of smaller data-types
Seq 27: ArrayCoalescing, High Level Optimizer, Data Name net
  Description: Global variables were aggregated
[Figure: data layout transformations for data locality and cache utilization]
- Array splitting: an array of structures S[0], S[1], S[2], S[3], ... with fields F0, F1, F2, F3 is split into separate per-field arrays F0[], F1[], F2[], F3[]
- Array merging: separate arrays A[0], A[1], A[2], A[3], ... are merged into one two-dimensional array A[0][0] ... A[3][3] ...
- Array transposing: the elements A[i][j] are stored as A'[j][i]
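Array splitting can be pictured with a hand-written C sketch. The type and function names below are illustrative, and the compiler performs this reshaping automatically at -O5 only when its safety analysis allows it; the code shows the before and after layouts, not the compiler's implementation.

```c
#define COUNT 4

/* Before: array of structures. The four fields of one element are
   adjacent, so a loop over f0 alone strides by sizeof(struct particle). */
struct particle {
    double f0, f1, f2, f3;
};

/* After splitting: one array per field, so each field is contiguous. */
struct particle_split {
    double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT];
};

void split_array(struct particle_split *out, const struct particle in[COUNT])
{
    for (int i = 0; i < COUNT; i++) {
        out->f0[i] = in[i].f0;
        out->f1[i] = in[i].f1;
        out->f2[i] = in[i].f2;
        out->f3[i] = in[i].f3;
    }
}

/* Strided access: only 8 of every 32 bytes fetched are used. */
double sum_f0_aos(const struct particle in[COUNT])
{
    double s = 0.0;
    for (int i = 0; i < COUNT; i++)
        s += in[i].f0;
    return s;
}

/* Stride-one access: every byte of each fetched cache line is used. */
double sum_f0_split(const struct particle_split *p)
{
    double s = 0.0;
    for (int i = 0; i < COUNT; i++)
        s += p->f0[i];
    return s;
}
```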
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
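A minimal OpenMP sketch in C shows the two features mentioned above: a work-sharing loop and a threadprivate variable. The names dot, tp_counter and bump_counter are illustrative. A compiler without OpenMP enabled ignores the pragmas and runs the code serially with the same result; with the XL compilers, OpenMP is typically enabled through -qsmp=omp.

```c
/* Per-thread counter. With OpenMP enabled, each thread gets its own
   copy, which the runtime can place in thread-local storage. */
static int tp_counter = 0;
#pragma omp threadprivate(tp_counter)

/* Increments and returns the calling thread's private counter. */
int bump_counter(void)
{
    return ++tp_counter;
}

/* Dot product with a parallel loop; the reduction clause gives each
   thread a private partial sum that is combined at the end. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```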
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime dependence testing
SMP runtime improvements
- Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and auto-parallelized programs
Some suboptions of interest
- spins and yields to define the behavior of idle threads
- Thread binding using the startproc and stride suboptions
- New bind suboption on AIX
    bind=SDL=<start resource>,<number of resources>,<stride>
    bindlist=SDL=i0,i1,...,ix
- schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  - Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
- Simplifies inlining control for the programmer
-qinline=level=X, X=0..10 (default 5)
- Consistent across all languages and optimization levels
- Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
- Previously only available at -O5 or for user inlining in C++
- Available at all levels of -qhot (default at -O3 and up)
- Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
- Suboptions allow fine-grained control over this guarantee
- Examples
    -qstrict=precision          Strict FP precision
    -qstrict=exceptions         Strict FP exceptions
    -qstrict=ieeefp             Strict IEEE FP implementation
    -qstrict=nans               Strict general and computation of NaNs
    -qstrict=order              Do not modify evaluation order
    -qstrict=vectorprecision    Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
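Why evaluation order matters for floating-point results can be shown with a small C sketch. The function names are illustrative. Floating-point addition is not associative, so an optimizer that is allowed to reorder a sum can change the answer; this is the kind of change that a strict ordering setting is meant to rule out.

```c
/* Evaluation order as written in the source: (a + b) + c. */
double sum_as_written(double a, double b, double c)
{
    return (a + b) + c;
}

/* A reassociated order that is mathematically equal but can round
   differently: a + (b + c). */
double sum_reassociated(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e30, b = -1e30 and c = 1.0, the first form cancels the large terms before adding c, while in the second form c is absorbed by rounding when it is added to b, so the two functions return different values.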
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family", available at http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc, gxlc++, gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Compiler Feedback View
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SOURCE CODE
COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement
COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations
INSTRUMENTEDAPPLICATION
OPTIMIZEDAPPLICATION
PROFILE DATA
SAMPLE INPUTS
SAMPLE INPUTS
Multiple-pass Dynamic Profiling Infrastructure
Hardware and software constraints
Multiple sample runs for different hardware performance events
Profile based instrumentation refinement
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Basic Block and Call Counter Information
Call Counter InformationRegion 1
Region Execution Count
1
Call Coverage 32 49
Call Counters
Call Name Call Execution Count
Line
smtctl_ 0 79
is_smt_on_ 1 55
jbind_ 1 109
jbind_ 0 108
aff_ 1 38
Step 1 compile the application with ndashqpdf1 to generate an instrumented executable
Step 2 run the executable with typical input data set to gather profiling information
Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report
Region 1
Region Execution Count 5
Block Coverage 81 81
Block Counters
Block Index
Block Execution Count
Start Line
End Line
3 5 1 33
4 5 33 34
Basic Block Counter Information
helliphellip
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Cache Miss Information
Memory Reference Region Line Cache
Level Miss Count Miss Rate
((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 408 2 17446 24
((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 412 2 9999 14
((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 413 2 9974 14
((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]
3 525 2 1033429 6
((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]
3 557 2 11553 16Delinquent load
Source code location
Cache miss
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Source location
Loop informationLoop TableLoopIndex
StartLine
EndLine
Parent Loop Index Nest Level
Minimum Cost
Maximum Cost
Iteration Count
Attributes
1 203 1 19630 19630 75 (array)bull
well behavedbull
bump normalizedbull
lower bound normalized
2 188 1 20413 20413 149 (array)
bull
perfect nestbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
3 141 1 13300 13300 100 (default)
bull
perfect nestbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
4 203 1 19630 19630 5 (PDF)
bull
residualbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
Loop iteration count based on static analysis or dynamic profiling
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Transformation Reports
Seq Type Phase Region Line Loop Index
Descriptio n
Attributes
2LoopVector (success)
High Level Optimizer 4 217 1
Loop vectorizati on was performed
not available
3 LoopFusion (success)
High Level Optimizer
4 108 3 Loops were fused
bullLoop Line Number 108bullLoop Line Number 206
4LoopVector Version (success)
High Level Optimizer 4 108 3
Vector versioning was performed
not available
20ModuloSch edule (success)
Low Level Optimizer 12 3499 26
Loop was modulo scheduled
bullInitiation Interval 12
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
filec
foo (float p float q float r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all
filec
foo (float restrict p float restrict q float restrict r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing
Original source file modified source file
filexmlLoop was automatically parallelizedLoop was modulo scheduled
Tuning
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Explicit SIMD programming for POWER7 Enabled under -qaltivec
Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors
vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements
ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements
Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)
New vector operations vec_mul vec_div hellipUnaligned load and store operations
ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
21
Automatic SIMDizationAutomatic SIMDization for VMX and VSX
ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX
FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable alignment
Use __attribute__((aligned(n)) to set data alignment
Use __alignx(16 a) to indicate the data alignment to the compiler
Use -qassert=refalign if all references are naturally aligned
Use array references instead of pointers where possible
data dependence prevents SIMD vectorization
Use fewer pointers when possible
Use pragma independent if it has no loop carried dependency
Use pragma disjoint (a b) if a and b are disjoint
Use restrict keyword or compiler option ndashqrestrict
User actionsTransformation report
Loop was SIMD vectorized
Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable
to vectorize
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable strides
Loop interchange for stride-one accesses when possible
Data layout reshape for stride-one accesses
Higher optimization to propagate compile known stride information
Stride versioning
Do statement splitting and loop splitting
User actionsTransformation report
either operation or data type is not suitable for SIMD vectorization
Convert while-loops into do-loops when possible
Limited use of control flow in a loop
Use MIN MAX instead of if-then-else
Eliminate function calls in a loop through inlining
loop structure prevents SIMD vectorization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)
bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV
ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations
for (i=0iltni++)
b[i]=sqrt(a[i])
__vsqrt_P7(ban)
Loop vectorization was performed
Transformation report
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Software-controlled data prefetching for POWER7
Software control over POWER7 prefetch engine supporting up to 12 data streams
Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above
ndashMore aggressive exploitation under option ndash qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
Partial cache line touch
    void __partial_dcbt(void *address)
Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
    for (i = 0; i < n; i++) a[i] = b[i] +
Prefetch setup:
    /* Store stream prefetch for array a */
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    /* Start stream prefetch */
    __protected_stream_go();
Callouts: FORWARD is the stream direction, n*sizeof(double)/128 is the stream length, DEEPER is the prefetch depth, and 11 and 0 are the stream IDs.
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
- Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, and loop unrolling
- Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
- Under optimization levels -O3, -O3 -qhot, or above
Polyhedral framework for any loop nest
- Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  - Loop skewing and loop tiling for triangular loop shapes
- Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
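To make loop tiling concrete, here is a minimal hand-written sketch of the transformation on a matrix transpose. It only illustrates what tiling does to a loop nest; it is not the compiler's output, and the function names and the tile size TILE are assumptions chosen for the example.

```c
#include <stddef.h>

enum { TILE = 4 };  /* hypothetical tile size, chosen only for illustration */

/* Untiled nest: dst is written with stride n, so each write touches a
   different row of dst. */
void transpose_naive(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled nest: the iteration space is cut into TILE x TILE blocks so that
   the rows of src and dst touched by one block can stay in cache. */
void transpose_tiled(double *dst, const double *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clip the block at the matrix edge when n is not a
               multiple of TILE. */
            size_t i_end = ii + TILE < n ? ii + TILE : n;
            size_t j_end = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both versions perform exactly the same assignments; tiling changes only the order in which they happen.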
Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
- Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot
Loop transformations
- Tiling of triangular matrix multiplication
    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) += a(k,i)*b(j,k)
- Also allows transformation of imperfect loop nests
  - Intervening code between loops
- Only available at -qhot=level=2
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Output (parallelism and locality): optimized loops
    for jT ...
      omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Transformations applied:
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
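Of the transformations listed for this example, loop fusion is the simplest to show in isolation. The sketch below is a hand-written illustration with hypothetical function names, not output from the polyhedral framework: a producer loop and a consumer loop over the same array are merged so each element is used right after it is written.

```c
#include <stddef.h>

/* Unfused: tmp[] is written completely by the first loop and then read
   back completely by the second loop. */
double sum_unfused(const double *b, double *tmp, size_t n)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        tmp[i] = 2.0 * b[i];
    for (size_t i = 0; i < n; i++)
        x += tmp[i];
    return x;
}

/* Fused: one loop does both statements, so tmp[i] is consumed while it is
   still in a register or in cache. */
double sum_fused(const double *b, double *tmp, size_t n)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++) {
        tmp[i] = 2.0 * b[i];
        x += tmp[i];
    }
    return x;
}
```

Fusion is legal here because iteration i of the second loop depends only on iteration i of the first.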
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context-sensitive and context-insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing, and data interleaving to reshape the data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq | Type | Phase | Data Name | Category | Region | Line | Description
1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated
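The ArraySplitting entry in the report can be pictured with a small hand-written sketch. The struct and function names below are hypothetical and the code only mimics the layout change by hand; at -O5 the compiler performs the reshaping itself after its safety analysis.

```c
#include <stddef.h>

/* Array-of-structures layout: the two fields of each element sit next to
   each other in memory. */
struct cell {
    double hot;   /* field read in the inner loop */
    double cold;  /* field rarely touched */
};

/* Split layout: each field has its own contiguous array. */
struct cells_split {
    double *hot;
    double *cold;
};

/* Copy an array of structures into the split layout. */
void split_cells(struct cells_split out, const struct cell *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out.hot[i] = in[i].hot;
        out.cold[i] = in[i].cold;
    }
}

/* Before splitting: every access skips over the unused cold field. */
double sum_hot_aos(const struct cell *c, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += c[i].hot;
    return s;
}

/* After splitting: stride-one access over the hot array only. */
double sum_hot_split(const double *hot, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += hot[i];
    return s;
}
```

The split version brings only useful data into each cache line, which is the memory-latency benefit the slide refers to.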
Figure: data reorganization for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate arrays F0[], F1[], F2[], F3[]. Array merging: rows A[0] to A[3] with elements A[i][0] to A[i][3]. Array transposing: A[i][j] is stored as A'[j][i].
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation for C, C++, and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported thread-local storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
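For readers new to explicit parallelization, a minimal OpenMP sketch in C follows. The function name is hypothetical. With the XL compilers OpenMP directives are typically enabled with -qsmp=omp; a compiler that does not have OpenMP enabled ignores the pragma and runs the loop serially with the same result.

```c
#include <stddef.h>

/* Dot product: the loop iterations are shared among the threads, and the
   reduction clause gives each thread a private partial sum that is
   combined when the loop ends. */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    long count = (long)n;  /* signed loop bound for the canonical loop form */

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < count; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because floating-point addition is not associative, a parallel reduction may round differently from the serial loop for general inputs.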
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime dependence testing
SMP runtime improvements
- Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and automatically parallelized programs
Some suboptions of interest
- spins and yields to define the behavior of idle threads
- Thread binding using the startproc and stride suboptions
- New bind suboptions on AIX:
    bind=SDL=<start resource>,<number of resources>,<stride>
    bindlist=SDL=i0,i1,...,ix
- schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  - Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
- Simplifies inlining control for the programmer
-qinline=level=X, where X = 0 to 10 (default 5)
- Consistent across all languages and optimization levels
- Previous mechanisms still available
User inline control with -qinline+<function_name> or -qinline-<function_name>
Automatic inlining before loop optimization
- Previously only available at -O5 or for user inlining in C++
- Available at all levels of -qhot (default at -O3 and up)
- Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees results identical to noopt at the expense of optimization
- Suboptions allow fine-grained control over this guarantee
- Examples:
    -qstrict=precision        Strict FP precision
    -qstrict=exceptions       Strict FP exceptions
    -qstrict=ieeefp           Strict IEEE FP implementation
    -qstrict=nans             Strict general and computation of NaNs
    -qstrict=order            Do not modify evaluation order
    -qstrict=vectorprecision  Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
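Why evaluation order matters can be seen in a few lines of C. The sketch below, with hypothetical function names, evaluates the same three-term sum in two orders; an optimizer that is allowed to reassociate floating-point operations may turn one form into the other, which is the kind of change -qstrict=order is described as preventing.

```c
/* Evaluates (a + b) + c, left to right as written. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Evaluates a + (b + c): equal in exact arithmetic, but in IEEE double
   precision a small c can be absorbed by a large b before a cancels it. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, and c = 1.0 under IEEE double arithmetic, sum_left returns 1.0 while sum_right returns 0.0, because -1e20 + 1.0 rounds back to -1e20.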
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
- C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
- Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family", available at http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center or about the existing C, C++, or Fortran documentation shipped with the products to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
Multiple-pass Dynamic Profiling Infrastructure
Figure: profile-directed feedback flow. Source code is compiled and linked with -qpdf1 (static analysis, profile-based refinement) to produce an instrumented application. Running the instrumented application on sample inputs produces profile data. Compiling and linking with -qpdf2 (profile-directed optimizations) produces the optimized application.
- Hardware and software constraints
- Multiple sample runs for different hardware performance events
- Profile-based instrumentation refinement
Basic Block and Call Counter Information
Step 1: compile the application with -qpdf1 to generate an instrumented executable
Step 2: run the executable with a typical input data set to gather profiling information
Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report
Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Name | Call Execution Count | Line
smtctl_ | 0 | 79
is_smt_on_ | 1 | 55
jbind_ | 1 | 109
jbind_ | 0 | 108
aff_ | 1 | 38
Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Index | Block Execution Count | Start Line | End Line
3 | 5 | 1 | 33
4 | 5 | 33 | 34
...
Cache Miss Information
Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate
((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 408 2 17446 24
((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 412 2 9999 14
((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 413 2 9974 14
((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]
3 525 2 1033429 6
((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]
3 557 2 11553 16
Callouts on the slide: Delinquent load; Source code location; Cache miss
Loop information
Loop Table
Columns: Loop Index | Start Line | End Line | Parent Loop Index | Nest Level | Minimum Cost | Maximum Cost | Iteration Count | Attributes
1 203 1 19630 19630 75 (array); attributes: well behaved, bump normalized, lower bound normalized
2 188 1 20413 20413 149 (array); attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
3 141 1 13300 13300 100 (default); attributes: perfect nest, well behaved, bump normalized, guarded, lower bound normalized
4 203 1 19630 19630 5 (PDF); attributes: residual, well behaved, bump normalized, guarded, lower bound normalized
Callouts on the slide: Source location; Loop iteration count based on static analysis or dynamic profiling
Loop Transformation Reports
Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number 108; Loop Line Number 206
4 | LoopVectorVersion (success) | High Level Optimizer | 4 | 108 | 3 | Vector versioning was performed | not available
20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
    void foo (float *p, float *q, float *r, int n)
    {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i];
    }
Report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning: modified source file (file.c):
    void foo (float * restrict p, float * restrict q, float * restrict r, int n)
    {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i];
    }
Report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
Explicit SIMD programming for POWER7
Enabled under -qaltivec
Successor to the AltiVec programming extensions on POWER6/PPC970
- AltiVec data types: 16-byte vectors
    vector char: 16 elements; vector short: 8 elements; vector pixel: 8 elements; vector int: 4 elements; vector float: 4 elements
- VSX AltiVec extensions: 16-byte vectors
    vector double: 2 elements; vector long long: 2 elements
AltiVec built-in functions extended to the new data types
    vec_add(vector double, vector double)
    vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, ...
Unaligned load and store operations
- AltiVec truncating loads/stores still available: vec_ld, vec_st
- New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
Automatic SIMDization for VMX and VSX
- Supports data types of INTEGER, UNSIGNED, REAL, and COMPLEX
Features
- Basic-block-level SIMDization
- Loop-level aggregation
- Data conversion, reduction
- Loops with limited control flow
- Automatic SIMDization with -qstrict (VSX) and -qnostrict
- Support for unaligned vector memory accesses (VSX)
- Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report: Loop was SIMD vectorized
Transformation report: It is not profitable to vectorize
User actions:
- Use pragma simd_level(10) to force the compiler to do SIMDization
Transformation report: memory accesses have non-vectorizable alignment
User actions:
- Use __attribute__((aligned(n))) to set data alignment
- Use __alignx(16, a) to indicate the data alignment to the compiler
- Use -qassert=refalign if all references are naturally aligned
- Use array references instead of pointers where possible
Transformation report: data dependence prevents SIMD vectorization
User actions:
- Use fewer pointers when possible
- Use pragma independent if the loop has no loop-carried dependency
- Use pragma disjoint (a, b) if a and b are disjoint
- Use the restrict keyword or the compiler option -qrestrict
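Two of the user actions above, declaring alignment and using restrict, can be combined in a short C sketch. The array and function names are hypothetical; the aligned attribute is the GNU-style spelling that the slide mentions, and whether the loop is actually SIMDized still depends on the compiler and its options.

```c
#include <stddef.h>
#include <stdint.h>

/* A 16-byte-aligned array, declared with the attribute named on the slide. */
float aligned_buf[8] __attribute__((aligned(16)));

/* Returns 1 when p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* y[i] += s * x[i]. The restrict qualifiers promise the compiler that x
   and y do not overlap, which removes the aliasing dependence that can
   block vectorization. */
void scale_add(float * restrict y, const float * restrict x, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += s * x[i];
}
```

The restrict promise is the caller's responsibility: passing overlapping arrays to scale_add would be undefined behavior.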
SIMDization Tuning
Transformation report: memory accesses have non-vectorizable strides
User actions:
- Loop interchange for stride-one accesses when possible
- Data layout reshape for stride-one accesses
- Higher optimization to propagate compile-known stride information
- Stride versioning
Transformation report: either operation or data type is not suitable for SIMD vectorization
User actions:
- Do statement splitting and loop splitting
Transformation report: loop structure prevents SIMD vectorization
User actions:
- Convert while-loops into do-loops when possible
- Limit the use of control flow in a loop
- Use MIN and MAX instead of if-then-else
- Eliminate function calls in a loop through inlining
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)
bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV
ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations
for (i=0iltni++)
b[i]=sqrt(a[i])
__vsqrt_P7(ban)
Loop vectorization was performed
Transformation report
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Software-controlled data prefetching for POWER7
Software control over POWER7 prefetch engine supporting up to 12 data streams
Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above
ndashMore aggressive exploitation under option ndash qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)
Partial cache line touchvoid __partial_dcbt(void address)
Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)
Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count
depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth
stream_ID)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Example of POWER7 data prefetching
for (i=0 ilt n i++) a[i] = b[i] +
__protected_store_stream_set(FORWARD ampa 11)
__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)
__protected_stream_set(FORWARD ampb 0)
__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)
__eieio()
__protected_stream_go()
Store stream prefetch for array a
transient stream prefetch for array bStream id
Stream length
Stream direction
Prefetch depth
Start stream prefetch
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Optimization
Traditional unimodular loop transformations for prefect regular loop nests
ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling
ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above
Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and
complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp
bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence
formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)
Enable interchange of j and k loops to improve access locality for b
ndashIdentifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)
Also allows transformation of imperfect loop nests
ndashIntervening code between loops
Only available at -qhot=level=2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Example
Sequence of Imperfect Loop nests
for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip
for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]
InputParallelism amp Locality
Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]
Output
Loop fusionLoop skewing to enable tiling
Loop tiling for cache
Loop skewing forPipeline parallelization
Loop tiling for registers
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Data ReorganizationData reorganization transformations to reduce memory latency
ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout
Enabled at O5
Data Reorganization Report
Seq
Type Phase Data Name
Category Region Line Description
1 ArraySplitting High Level Optimizer
iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 ArrayCoalescing High Level Optimizer
net Global variables were aggregated
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
S[0]F0 S[0]F1 S[0]F2 S[0]F3
S[1]F0 S[1]F1 S[1]F2 S[1]F3
S[2]F0 S[2]F1 S[2]F2 S[2]F3
S[3]F0 S[3]F1 S[3]F2 S[3]F3
hellip hellip hellip hellip
F0[0]
F1[0]
F2[0]
F3[0]
F0[1]
F1[1]
F2[1]
F3[1]
F0[2]
F1[2]
F2[2]
F3[2]
F0[3]
F1[3]
F2[3]
F3[3]
hellip
hellip
hellip
hellip
A[0]
A[1]
A[2]
A[3]
A[0][2] A[0][3]A[0][1]A[0][0]
A[1][2] A[1][3]A[1][1]A[1][0]
A[2][2] A[2][3]A[2][1]A[2][0]
A[3][2] A[3][3]A[3][1]A[3][0]
hellip
hellip
hellip
hellip
A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip
A[0][0] A[0][1] A[0][2] A[0][3]
A[1][0] A[1][1] A[1][2] A[1][3]
A[2][0] A[2][1] A[2][2] A[2][3]
A[3][0] A[3][1] A[3][2] A[3][3]
hellip hellip hellip hellip
hellip
hellip
hellip
hellip
hellip
Arsquo[0][0]
Arsquo[1][0]
Arsquo[2][0]
Arsquo[3][0]
Arsquo[0][1]
Arsquo[1][1]
Arsquo[2][1]
Arsquo[3][1]
Arsquo[0][2]
Arsquo[1][2]
Arsquo[2][2]
Arsquo[3][2]
Arsquo[0][3]
Arsquo[1][3]
Arsquo[2][3]
Arsquo[3][3]
hellip
hellip
hellip
hellip
hellip helliphelliphelliphellip
Array splitting
Array merging Array transposing
Data locality cache utilization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran
ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default
ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
Improved interaction between OpenMP and automatic SIMD
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Automatic parallelizationImprovements to automatic parallelization
ndashMore effective array data flow analysis for array privatization
ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing
SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
- Suboptions allow fine-grained control over this guarantee
- Examples:
  -qstrict=precision        Strict FP precision
  -qstrict=exceptions       Strict FP exceptions
  -qstrict=ieeefp           Strict IEEE FP implementation
  -qstrict=nans             Strict generation and computation of NaNs
  -qstrict=order            Do not modify evaluation order
  -qstrict=vectorprecision  Maintain precision over all loop iterations
- Suboptions can be combined: -qstrict=precision:nonans
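Why evaluation order can change a floating-point result is easy to see in a small sketch. This is plain C and not specific to the XL compilers; the function names are made up for illustration. Both functions add the same values, but in opposite order, which is the kind of reassociation that -qstrict=order is meant to rule out:

```c
#include <assert.h>
#include <stddef.h>

/* Adds the elements left to right, in the order written in the source. */
double sum_forward(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Adds the same elements right to left: a reassociation of the same sum. */
double sum_backward(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = n; i > 0; i--) {
        s += x[i - 1];
    }
    return s;
}
```

For x = {1e20, -1e20, 1.0}, sum_forward returns 1.0 while sum_backward returns 0.0, because the 1.0 is absorbed when it is added to -1e20 first.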
The IBM Rational C/C++ Café on IBM developerWorks: ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
- C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
- Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
- Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7: Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
Basic Block and Call Counter Information
Step 1: Compile the application with -qpdf1 to generate an instrumented executable
Step 2: Run the executable with a typical input data set to gather profiling information
Step 3: Recompile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report
Call Counter Information
Region 1
Region Execution Count: 1
Call Coverage: 32 49
Call Counters:
Call Name     Call Execution Count   Line
smtctl_       0                      79
is_smt_on_    1                      55
jbind_        1                      109
jbind_        0                      108
aff_          1                      38
Basic Block Counter Information
Region 1
Region Execution Count: 5
Block Coverage: 81 81
Block Counters:
Block Index   Block Execution Count   Start Line   End Line
3             5                       1            33
4             5                       33           34
...
Cache Miss Information
Each entry identifies a delinquent load (memory reference), its source code location (region, line) and its cache miss data (cache level, miss count, miss rate).
Memory Reference: ((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3 | Line 408 | Cache Level 2 | Miss Count 17446 | Miss Rate 24
Memory Reference: ((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3 | Line 412 | Cache Level 2 | Miss Count 9999 | Miss Rate 14
Memory Reference: ((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  Region 3 | Line 413 | Cache Level 2 | Miss Count 9974 | Miss Rate 14
Memory Reference: ((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  Region 3 | Line 525 | Cache Level 2 | Miss Count 1033429 | Miss Rate 6
Memory Reference: ((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  Region 3 | Line 557 | Cache Level 2 | Miss Count 11553 | Miss Rate 16
Loop information
Loop Table
Columns: Loop Index, Start Line, End Line, Parent Loop Index, Nest Level, Minimum Cost, Maximum Cost, Iteration Count, Attributes
- 1 203 1 19630 19630 75 (array): well behaved, bump normalized, lower bound normalized
- 2 188 1 20413 20413 149 (array): perfect nest, well behaved, bump normalized, guarded, lower bound normalized
- 3 141 1 13300 13300 100 (default): perfect nest, well behaved, bump normalized, guarded, lower bound normalized
- 4 203 1 19630 19630 5 (PDF): residual, well behaved, bump normalized, guarded, lower bound normalized
The start and end lines give the source location of each loop. The loop iteration count is based on static analysis or dynamic profiling.
Loop Transformation Reports
Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number 108; Loop Line Number 206
4 | LoopVectorVersion (success) | High Level Optimizer | 4 | 108 | 3 | Vector versioning was performed | not available
20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval 12
Performance Tuning with Compiler Transformation Reports: -qlistfmt=xml=all
Original source file (file.c):
  void foo(float *p, float *q, float *r, int n) {
    for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
  }
Transformation report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning: modified source file (file.c):
  void foo(float * restrict p, float * restrict q, float * restrict r, int n) {
    for (int i = 0; i < n; i++) p[i] = p[i] + q[i] * r[i];
  }
Transformation report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
Explicit SIMD programming for POWER7 (enabled under -qaltivec)
Successor to the AltiVec programming extensions on POWER6/PPC970
- AltiVec data types (16-byte vectors): vector char (16 elements), vector short (8 elements), vector pixel (8 elements), vector int (4 elements), vector float (4 elements)
- VSX AltiVec extensions (16-byte vectors): vector double (2 elements), vector long long (2 elements)
AltiVec built-in functions extended to the new data types: vec_add(vector double, vector double), vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, ...
Unaligned load and store operations
- AltiVec truncating loads/stores still available: vec_ld, vec_st
- New non-truncating loads/stores: vec_xld2, vec_xstd2
Automatic SIMDization
Automatic SIMDization for VMX and VSX
- Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
Features
- Basic block level SIMDization
- Loop level aggregation
- Data conversion, reduction
- Loop with limited control flow
- Automatic SIMDization with -qstrict (VSX) and -qnostrict
- Support of unaligned vector memory accesses (VSX)
- Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Each transformation report message is followed by the suggested user actions.
Transformation report: Loop was SIMD vectorized
Transformation report: It is not profitable to vectorize
- Use pragma simd_level(10) to force the compiler to do SIMDization
Transformation report: data dependence prevents SIMD vectorization
- Use fewer pointers when possible
- Use pragma independent if the loop has no loop-carried dependency
- Use pragma disjoint (a, b) if a and b are disjoint
- Use the restrict keyword or the compiler option -qrestrict
Transformation report: memory accesses have non-vectorizable alignment
- Use __attribute__((aligned(n))) to set data alignment
- Use __alignx(16, a) to indicate the data alignment to the compiler
- Use -qassert=refalign if all references are naturally aligned
- Use array references instead of pointers where possible
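Two of the user actions above, declaring the alignment and ruling out aliasing with restrict, can be sketched in portable C. This is only an illustration: the array names, LEN and the helper functions are hypothetical, and the XL-specific __alignx hint and pragmas are left out so that the sketch builds with any compiler that accepts the aligned attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEN 256  /* hypothetical array length */

/* 16-byte alignment matches the 16-byte vector size. */
static float p_data[LEN] __attribute__((aligned(16)));
static float q_data[LEN] __attribute__((aligned(16)));
static float r_data[LEN] __attribute__((aligned(16)));

/* Returns 1 if ptr sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *ptr) {
    return ((uintptr_t)ptr % 16u) == 0;
}

/* restrict tells the compiler that p, q and r never overlap,
   so no loop-carried dependence through aliasing is possible. */
void multiply_add(float *restrict p, const float *restrict q,
                  const float *restrict r, size_t n) {
    for (size_t i = 0; i < n; i++) {
        p[i] = p[i] + q[i] * r[i];
    }
}

/* Fills the aligned arrays, runs the loop and returns the last element of p_data. */
float run_multiply_add(void) {
    assert(is_aligned16(p_data) && is_aligned16(q_data) && is_aligned16(r_data));
    for (size_t i = 0; i < LEN; i++) {
        p_data[i] = (float)i;
        q_data[i] = 2.0f;
        r_data[i] = 0.5f;
    }
    multiply_add(p_data, q_data, r_data, LEN);
    return p_data[LEN - 1];
}
```

Whether a given compiler then vectorizes the loop still has to be checked in its transformation report; the sketch only removes two common obstacles.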
SIMDization Tuning
Each transformation report message is followed by the suggested user actions.
Transformation report: memory accesses have non-vectorizable strides
- Loop interchange for stride-one accesses when possible
- Data layout reshape for stride-one accesses
- Higher optimization to propagate compile-time known stride information
- Stride versioning
- Do statement splitting and loop splitting
Transformation reports: either operation or data type is not suitable for SIMD vectorization; loop structure prevents SIMD vectorization
- Convert while-loops into do-loops when possible
- Limit the use of control flow in a loop
- Use MIN, MAX instead of if-then-else
- Eliminate function calls in a loop through inlining
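The advice to use MIN and MAX instead of if-then-else can be illustrated with a clamp loop. The sketch below is plain C with made-up function names; it shows only the source-level rewrite, not what any particular compiler does with it.

```c
#include <assert.h>
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Branchy version: if-then-else control flow inside the loop body. */
void clamp_with_branches(float *x, size_t n, float lo, float hi) {
    assert(lo <= hi);
    for (size_t i = 0; i < n; i++) {
        if (x[i] < lo) {
            x[i] = lo;
        } else if (x[i] > hi) {
            x[i] = hi;
        }
    }
}

/* Same result expressed with MIN and MAX: one straight-line statement. */
void clamp_with_minmax(float *x, size_t n, float lo, float hi) {
    assert(lo <= hi);
    for (size_t i = 0; i < n; i++) {
        x[i] = MIN(MAX(x[i], lo), hi);
    }
}
```

Both functions produce the same array for ordinary (non-NaN) inputs; the second form gives the vectorizer a loop body without branches.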
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
- POWER7 vector MASS library (libmassvp7.a)
  - Internally exploits VSX instructions
  - SP: average speedup of 1.99 vs. POWER5 MASSV
  - DP: average speedup of 1.27 vs. POWER5 MASSV
- POWER7 SIMD MASS library (libmass_simdp7.a)
  - Tuned math routines operating on vector data types
  - Over 35 frequently used mathematical functions
  - Both single and double precision
  - To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above; use -qstrict=vectorprecision to maintain precision over all loop iterations
Source loop:
  for (i = 0; i < n; i++)
    b[i] = sqrt(a[i]);
Generated call:
  __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, supporting up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
- More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch:
  void __dcbtt(void *address)
  void __dcbtstt(void *address)
Partial cache line touch:
  void __partial_dcbt(void *address)
Stride-N stream prefetch:
  void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch:
  void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
  void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
Loop:
  for (i = 0; i < n; i++) a[i] = b[i] +
Prefetch setup:
  /* Store stream prefetch for array a: stream direction FORWARD, stream id 11 */
  __protected_store_stream_set(FORWARD, &a, 11);
  /* Stream length in 128-byte units, prefetch depth DEEPER, stream id 11 */
  __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
  /* Transient stream prefetch for array b: stream direction FORWARD, stream id 0 */
  __protected_stream_set(FORWARD, &b, 0);
  __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
  __eieio();
  /* Start stream prefetch */
  __protected_stream_go();
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
- Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing and loop unrolling
- Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
- Under the optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
- Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  - Loop skewing, loop tiling for triangular loop shapes
- Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks, at all hot levels (-O3 -qhot or above)
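As a point of reference for what loop tiling does, here is a hand-written sketch in C. It is not the compiler's implementation: the function names and the TILE value are assumptions chosen for illustration, and in practice the compiler (or the block_loop pragma listed above) would perform the transformation.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile size; a real value depends on the cache */

/* Straightforward transpose: out is written with stride n. */
void transpose_plain(size_t n, const double *in, double *out) {
    assert(in != out);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* Tiled transpose: the i and j loops are split into TILE x TILE blocks,
   so the data touched by one block can stay in cache. */
void transpose_tiled(size_t n, const double *in, double *out) {
    assert(in != out);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    out[j * n + i] = in[i * n + j];
}
```

The two functions compute the same result for any n, including sizes that are not a multiple of TILE; only the order in which elements are visited changes.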
Polyhedral Loop Transformation Examples
Dependence analysis
  do i = 1, N
    do j = 1, i
      do k = i+1, N
        c(k) = c(j) + b(j,k)
- Enables interchange of the j and k loops to improve access locality for b
- Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot
Loop transformations
- Tiling of triangular matrix multiplication
  do i = 1, N
    do j = i+1, N
      do k = i+1, N
        c(j,i) += a(k,i) * b(j,k)
- Also allows transformation of imperfect loop nests (intervening code between loops)
- Only available at -qhot=level=2
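The dependence-analysis example can be checked by hand with a small C sketch. It is a manual rewrite under stated assumptions, not compiler output: indices are 0-based, b is stored as a flat array with the Fortran column-major layout (element (j,k) at b[j + k*n]), and the function names are hypothetical. Because every write goes to c[k] with k > i and every read comes from c[j] with j <= i, the j and k loops can be swapped without changing the result.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: k is innermost, so b is read with stride n. */
void tri_original(size_t n, double *c, const double *b) {
    assert(c != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j <= i; j++)
            for (size_t k = i + 1; k < n; k++)
                c[k] = c[j] + b[j + k * n];
}

/* Interchanged order: j is innermost, so b is read with stride one. */
void tri_interchanged(size_t n, double *c, const double *b) {
    assert(c != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = i + 1; k < n; k++)
            for (size_t j = 0; j <= i; j++)
                c[k] = c[j] + b[j + k * n];
}
```

Running both versions on identical inputs yields identical contents of c, which is the property the compiler's exact dependence test has to prove before it interchanges the loops.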
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
  for j ...
    for k ...
      A[j] = B[k]
      for i ...
        C[i][k] = ...
  for j ...
    for k ...
      M[j] = N[k]
      for i ...
        X = C[i][k]
Transformations applied for parallelism and locality:
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Output: optimized loops
  for jT ...
    omp parallel for
    for kT ...
      for j ...
        for k ...
          A[j] = B[k]
          M[j] = N[k]
          for i
            C[i][k] = ...
            X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context-sensitive and context-insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing and data interleaving to reshape the data layout
Enabled at -O5
Data Reorganization Report
Seq | Type | Phase | Data Name | Category, Region, Line | Description
1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | | Global variables were aggregated
[Figure: data layout transformations for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate arrays F0[i], F1[i], F2[i], F3[i]. Array merging: the rows A[0] to A[3], with elements A[i][0] to A[i][3], are combined into one array. Array transposing: A[i][j] becomes A'[j][i].]
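Array splitting, which the compiler applies automatically at -O5, corresponds to the following manual rewrite. The sketch is in plain C; the type names, field names and COUNT are hypothetical and stand in for the S[i] and F0 to F3 of the data layout diagram.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 8  /* hypothetical number of records */

/* Original layout: array of structures, the four fields of one record are adjacent. */
struct record {
    double f0, f1, f2, f3;
};

/* Split layout: one array per field, so f0[0], f0[1], ... are adjacent. */
struct split_records {
    double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT];
};

/* Copies each field of s[i] into its own array. */
void split_array(const struct record *s, struct split_records *out) {
    assert(s != NULL && out != NULL);
    for (size_t i = 0; i < COUNT; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Reads only f0, but every access skips over the unused fields f1 to f3. */
double sum_f0_original(const struct record *s) {
    double sum = 0.0;
    for (size_t i = 0; i < COUNT; i++) sum += s[i].f0;
    return sum;
}

/* Reads only f0 with stride-one accesses, so each cache line holds useful data only. */
double sum_f0_split(const struct split_records *r) {
    double sum = 0.0;
    for (size_t i = 0; i < COUNT; i++) sum += r->f0[i];
    return sum;
}
```

Both sums return the same value; the split layout pays off when a loop uses only some of the fields.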
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation for C, C++ and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
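A minimal example of user-explicit parallelization is an OpenMP worksharing loop with a reduction. The sketch below uses only standard OpenMP syntax; the function name is made up. If OpenMP is not enabled at compile time, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>

/* Dot product parallelized with an OpenMP worksharing loop.
   Each thread accumulates a private partial sum that is combined at the end. */
double dot_product(const double *x, const double *y, int n) {
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The loop body is also a simple candidate for SIMDization, which is where the interaction between OpenMP and automatic SIMD comes into play. Note that a parallel reduction may add the partial sums in a different order than the serial loop.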
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime dependence testing
SMP runtime improvements
- Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Cache Miss Information
Memory Reference Region Line Cache
Level Miss Count Miss Rate
((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 408 2 17446 24
((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 412 2 9999 14
((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]
3 413 2 9974 14
((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]
3 525 2 1033429 6
((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]
3 557 2 11553 16Delinquent load
Source code location
Cache miss
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Source location
Loop informationLoop TableLoopIndex
StartLine
EndLine
Parent Loop Index Nest Level
Minimum Cost
Maximum Cost
Iteration Count
Attributes
1 203 1 19630 19630 75 (array)bull
well behavedbull
bump normalizedbull
lower bound normalized
2 188 1 20413 20413 149 (array)
bull
perfect nestbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
3 141 1 13300 13300 100 (default)
bull
perfect nestbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
4 203 1 19630 19630 5 (PDF)
bull
residualbull
well behavedbull
bump normalizedbull
guardedbull
lower bound normalized
Loop iteration count based on static analysis or dynamic profiling
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Transformation Reports
Seq Type Phase Region Line Loop Index
Descriptio n
Attributes
2LoopVector (success)
High Level Optimizer 4 217 1
Loop vectorizati on was performed
not available
3 LoopFusion (success)
High Level Optimizer
4 108 3 Loops were fused
bullLoop Line Number 108bullLoop Line Number 206
4LoopVector Version (success)
High Level Optimizer 4 108 3
Vector versioning was performed
not available
20ModuloSch edule (success)
Low Level Optimizer 12 3499 26
Loop was modulo scheduled
bullInitiation Interval 12
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
filec
foo (float p float q float r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all
filec
foo (float restrict p float restrict q float restrict r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing
Original source file modified source file
filexmlLoop was automatically parallelizedLoop was modulo scheduled
Tuning
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Explicit SIMD programming for POWER7 Enabled under -qaltivec
Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors
vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements
ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements
Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)
New vector operations vec_mul vec_div hellipUnaligned load and store operations
ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

Automatic SIMDization
- Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
- Features
  - Basic block level SIMDization
  - Loop level aggregation
  - Data conversion, reduction
  - Loop with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support of unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd

SIMDization Tuning

Transformation report: Loop was SIMD vectorized
User actions: (none)

Transformation report: It is not profitable to vectorize
User actions:
- Use pragma simd_level(10) to force the compiler to do SIMDization

Transformation report: data dependence prevents SIMD vectorization
User actions:
- Use fewer pointers when possible
- Use pragma independent if the loop has no loop-carried dependency
- Use pragma disjoint (a, b) if a and b are disjoint
- Use restrict keyword or compiler option -qrestrict

Transformation report: memory accesses have non-vectorizable alignment
User actions:
- Use __attribute__((aligned(n))) to set data alignment
- Use __alignx(16, a) to indicate the data alignment to the compiler
- Use -qassert=refalign if all references are naturally aligned
- Use array references instead of pointers where possible
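A small sketch of the alignment and dependence advice above, under stated assumptions: it uses only `__attribute__((aligned(16)))` and `restrict` from the list, leaves out the XL-specific `__alignx` and pragmas, and the helper names `is_aligned16`, `scale` and `aligned_scale_demo` are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if ptr sits on a 16-byte boundary (the vector width). */
int is_aligned16(const void *ptr)
{
    return ((uintptr_t)ptr % 16) == 0;
}

/* Unit-stride array references on non-overlapping (restrict) data. */
void scale(float *restrict dst, const float *restrict src,
           float factor, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = factor * src[i];
}

/* Declares 16-byte aligned arrays, checks the alignment, runs the loop.
   Returns 1 when everything behaves as expected. */
int aligned_scale_demo(void)
{
    float src[8] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
    float dst[8] __attribute__((aligned(16)));

    if (!is_aligned16(src) || !is_aligned16(dst))
        return 0;
    scale(dst, src, 2.0f, 8);
    return dst[0] == 2.0f && dst[7] == 16.0f;
}
```

Whether a given compiler actually vectorizes `scale` still has to be confirmed from its transformation report; the sketch only removes the two obstacles named on the slide.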

SIMDization Tuning

Transformation report: memory accesses have non-vectorizable strides
User actions:
- Loop interchange for stride-one accesses when possible
- Data layout reshape for stride-one accesses
- Higher optimization to propagate compile-time-known stride information
- Stride versioning

Transformation report: either operation or data type is not suitable for SIMD vectorization
User actions:
- Do statement splitting and loop splitting

Transformation report: loop structure prevents SIMD vectorization
User actions:
- Convert while-loops into do-loops when possible
- Limited use of control flow in a loop
- Use MIN, MAX instead of if-then-else
- Eliminate function calls in a loop through inlining
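The loop-structure advice can be illustrated with a before/after pair. This is a sketch with hypothetical names (`clamp_branchy`, `clamp_max`) and a locally defined `MAX` macro; it shows a while-loop with an if-then-else rewritten as a counted loop with a MAX expression, computing the same values.

```c
#define MAX(x, y) ((x) > (y) ? (x) : (y))

/* Before: while-loop with an if-then-else in the body. */
void clamp_branchy(float *restrict out, const float *restrict in, int n)
{
    int i = 0;
    while (i < n) {
        if (in[i] > 0.0f)
            out[i] = in[i];
        else
            out[i] = 0.0f;
        i++;
    }
}

/* After: counted loop, control flow expressed as a MAX operation. */
void clamp_max(float *restrict out, const float *restrict in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = MAX(in[i], 0.0f);
}
```

Both versions produce identical results; the second form simply presents the loop in the shape the slide recommends.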

MASS enhancements and Auto-vectorization
- MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
    - Internally exploits VSX instructions
    - SP: average speedup of 1.99 vs Power5 MASSV
    - DP: average speedup of 1.27 vs Power5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
    - Tuned math routines operating on vector data types
    - Over 35 frequently used mathematical functions
    - Both single and double precision
    - To be used in conjunction with explicit SIMD programming
- Auto-vectorization at optimization level -O3 or above; -qstrict=vectorprecision to maintain precision over all loop iterations

Source loop:
    for (i=0; i<n; i++)
      b[i] = sqrt(a[i]);
Replaced by the compiler with:
    __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed

Software-controlled data prefetching for POWER7
- Software control over the POWER7 prefetch engine, supporting up to 12 data streams
- Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
- Global analysis for coarse-grained prefetch engine control at optimization level -O5

Built-in functions for POWER7 data prefetching and cache control
- Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
- Partial cache line touch
    void __partial_dcbt(void *address)
- Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
- Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)

Example of POWER7 data prefetching

Loop:
    for (i=0; i<n; i++) a[i] = b[i] + …

Store stream prefetch for array a:
    __protected_store_stream_set(FORWARD, &a, 11)
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11)
Transient stream prefetch for array b:
    __protected_stream_set(FORWARD, &b, 0)
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0)
Start stream prefetch:
    __eieio()
    __protected_stream_go()

Argument callouts: FORWARD is the stream direction, n*sizeof(double)/128 is the stream length, DEEPER is the prefetch depth, and the last argument (11 or 0) is the stream id.
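For readers without a POWER7 toolchain, the idea of touching data ahead of use can be sketched portably. This is an assumption-laden analogue, not the XL interface: it uses the GCC/Clang hint `__builtin_prefetch` instead of the stream built-ins above, the lookahead distance is an arbitrary value, and because the slide's loop body is truncated the sketch assumes `a[i] = b[i] + 1.0`.

```c
#include <stddef.h>

/* Arbitrary lookahead in elements; a real value needs measurement. */
#define PREFETCH_DISTANCE 16

/* Assumed loop body: a[i] = b[i] + 1.0 (the slide's expression is cut off).
   __builtin_prefetch is a GCC/Clang hint used here as a stand-in for the
   POWER7 stream built-ins; it never changes the computed values. */
void add_one_prefetched(double *restrict a, const double *restrict b,
                        size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* second argument: 0 = read, 1 = write; third: temporal locality */
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], 0, 0);
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 1, 0);
        }
        a[i] = b[i] + 1.0;
    }
}
```

Unlike the stream built-ins, which describe a whole stream once before the loop, this form issues a hint per iteration, so it mainly serves to show where the prefetch sits relative to the computation.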

Loop Optimization
- Traditional unimodular loop transformations for perfect regular loop nests
  - Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
  - Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
  - Under optimization level -O3, -O3 -qhot or above
- Polyhedral framework for any loop nests
  - Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
    - Loop skewing, loop tiling for triangular loop shapes
  - Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks, at all hot levels (-O3 -qhot or above)
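As an illustration of one transformation from the list, loop tiling, here is a hand-written sketch of what the compiler does automatically. The names `transpose_plain` and `transpose_tiled` and the sizes `DIM` and `TILE` are hypothetical, and `DIM` is assumed to be a multiple of `TILE`.

```c
#define DIM  8   /* matrix dimension (assumed multiple of TILE) */
#define TILE 4   /* tile edge length, an arbitrary illustrative value */

/* Straightforward transpose: dst is written with a stride of DIM. */
void transpose_plain(double dst[DIM][DIM], double src[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            dst[j][i] = src[i][j];
}

/* Tiled transpose: the same iterations, regrouped into TILE x TILE
   blocks so that each block of src and dst stays cache resident. */
void transpose_tiled(double dst[DIM][DIM], double src[DIM][DIM])
{
    for (int ii = 0; ii < DIM; ii += TILE)
        for (int jj = 0; jj < DIM; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

Tiling only reorders independent iterations, so both functions fill `dst` with the same values; the benefit, if any, shows up in cache behavior on large matrices.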

Polyhedral Loop Transformation Examples

Dependence analysis
    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
- Enables interchange of j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot

Loop transformations
- Tiling of triangular matrix multiplication
    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) += a(k,i) * b(j,k)
- Also allows transformation of imperfect loop nests
  - Intervening code between loops
- Only available at -qhot=level=2
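The dependence-analysis example can be checked by hand with a C rendering of the first loop nest. This is a sketch under assumptions: the size `NP` and the function names are hypothetical, indices are kept 1-based to mirror the Fortran, and `b` is stored as `b[k][j]` so that consecutive `j` values are adjacent in memory, as they are for `b(j,k)` in column-major Fortran.

```c
#define NP 6   /* problem size N, an arbitrary small value */

/* Original order: j outside k. The inner k loop walks b with a large
   stride. Index 0 of each array is unused (1-based indexing). */
void tri_original(double c[NP + 1], double b[NP + 1][NP + 1])
{
    for (int i = 1; i <= NP; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= NP; k++)
                c[k] = c[j] + b[k][j];
}

/* Interchanged order: k outside j. Legal because, for a fixed i, the
   loop reads only c[1..i] and writes only c[i+1..NP], so reads and
   writes of c never touch the same element. The inner j loop now
   walks b with stride one. */
void tri_interchanged(double c[NP + 1], double b[NP + 1][NP + 1])
{
    for (int i = 1; i <= NP; i++)
        for (int k = i + 1; k <= NP; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + b[k][j];
}
```

Running both versions on the same input gives identical `c`, which is exactly the independence property the compiler has to prove before it may interchange the loops.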

Polyhedral Loop Transformation Example

Input: sequence of imperfect loop nests
    for j …
      for k …
        A[j] = B[k]
        for i …
          C[i][k] = …
    for j …
      for k …
        M[j] = N[k]
        for i …
          X = C[i][k]

Transformations for parallelism & locality
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers

Output: optimized loops
    for jT …
      omp parallel for
      for kT …
        for j …
          for k …
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = …
              X = C[i][k]

Data Reorganization
- Data reorganization transformations to reduce memory latency
  - Context sensitive and insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing, data interleaving to reshape data layout
- Enabled at -O5

Data Reorganization Report
Seq | Type | Phase | Data Name | Category / Region / Line | Description
1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | | Global variables were aggregated

[Figure: data layout transformations for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate arrays F0[], F1[], F2[], F3[]. Array merging: rows A[0], A[1], A[2], A[3], … of a two-dimensional array A. Array transposing: array A is transposed into A'.]
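Array splitting, the first transformation in the figure, can be mimicked by hand. The sketch below is illustrative only: the type and function names are hypothetical, the element count is fixed at 4, and the compiler performs the equivalent change automatically when its safety analysis allows it.

```c
#define COUNT 4

/* Before: array of structures, the four fields of one element are adjacent. */
struct record {
    float f0, f1, f2, f3;
};

/* After splitting: one array per field, so a loop that reads only f0
   streams through contiguous memory. */
struct split_records {
    float f0[COUNT];
    float f1[COUNT];
    float f2[COUNT];
    float f3[COUNT];
};

/* Copy an array of COUNT records into the split layout. */
void split_array(struct split_records *dst, const struct record *src)
{
    for (int i = 0; i < COUNT; i++) {
        dst->f0[i] = src[i].f0;
        dst->f1[i] = src[i].f1;
        dst->f2[i] = src[i].f2;
        dst->f3[i] = src[i].f3;
    }
}

/* Sum of field f0 in the original layout: stride of one whole record. */
float sum_f0_records(const struct record *s)
{
    float total = 0.0f;
    for (int i = 0; i < COUNT; i++)
        total += s[i].f0;
    return total;
}

/* Sum of field f0 in the split layout: stride-one accesses. */
float sum_f0_split(const struct split_records *s)
{
    float total = 0.0f;
    for (int i = 0; i < COUNT; i++)
        total += s->f0[i];
    return total;
}
```

The two sum functions return the same value; only the memory access pattern differs, which is where the cache utilization gain comes from.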

User Explicit Parallelization with OpenMP
- Full OpenMP 3.0 implementation on C, C++ and Fortran
  - Full OpenMP task parallelization
  - Privatization of Fortran descriptor-based arrays
- Efficient Threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
  - Bypasses the expensive pthread key mechanism
  - pthread-based implementation available under option control -qsmp=noostls for backward compatibility
- Improved interaction between OpenMP and automatic SIMD
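A short sketch of OpenMP 3.0 task parallelism, the feature highlighted above. The function names are hypothetical and the recursive Fibonacci is chosen only because it is compact; it is not taken from the slides. If the file is compiled without OpenMP enabled, the pragmas are ignored and the code runs serially with the same result.

```c
/* Each recursive call becomes an OpenMP task; taskwait joins the two
   child tasks before their results are combined. */
long fib_task(int n)
{
    long x, y;

    if (n < 2)
        return n;

    #pragma omp task shared(x)
    x = fib_task(n - 1);

    #pragma omp task shared(y)
    y = fib_task(n - 2);

    #pragma omp taskwait
    return x + y;
}

/* One thread of the team creates the root task; the others pick up
   the tasks it spawns. */
long fib(int n)
{
    long result = 0;

    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

Creating a task per call is far too fine-grained for real use; a practical version would fall back to a serial routine below some cutoff value of n.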

Automatic parallelization
- Improvements to automatic parallelization
  - More effective array data flow analysis for array privatization
  - Automatic privatization of descriptor-based Fortran arrays
  - Runtime-dependence testing
- SMP runtime improvements
  - Leverage of TLS on SMP runtime implementation
- Enable automatic parallelization with the compiler option -qsmp

XLSMPOPTS Environment Variable for Runtime Tuning
- The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs
- Some suboptions of interest
  - spins and yields to define the behavior of idle threads
  - Thread binding using the startproc and stride suboptions
  - New bind suboption on AIX
      bind=SDL=<start resource>,<number of resources>,<stride>
      bindlist=SDL=i0,i1,…,ix
  - schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
    - Note that the default schedule has changed from runtime to auto in V11/V13

Inlining
- Single control knob to enable inlining
  - Simplifies inlining control for the programmer
- -qinline=level=X, X = 0 to 10 (default 5)
  - Consistent across all languages and optimization levels
  - Previous mechanisms still available
- User inline control with -qinline+|-<function_name>
- Automatic inlining before loop optimization
  - Previously only available at -O5 or user inlining on C++
  - Available at all levels of -qhot (default at -O3 and up)
  - Enables early inlining of Fortran module procedures

Control over optimizations that may affect program results: -qstrict suboptions
- Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of IEEE FP standard (INF, NAN, etc.)
  - Use of alternate math libraries
- -qstrict guarantees identical result to noopt, at the expense of optimization
  - Suboptions allow fine-grain control over this guarantee
  - Examples
      -qstrict=precision         Strict FP precision
      -qstrict=exceptions        Strict FP exceptions
      -qstrict=ieeefp            Strict IEEE FP implementation
      -qstrict=nans              Strict general and computation of NANs
      -qstrict=order             Do not modify evaluation order
      -qstrict=vectorprecision   Maintain precision over all loop iterations
- Can be combined: -qstrict=precision:nonans
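Why evaluation order matters for floating-point results can be shown with three numbers. The sketch below uses hypothetical function names and plain C; it demonstrates that the two groupings of the same sum differ, which is the kind of change a reordering optimization could introduce when strict ordering is not requested.

```c
/* Left-to-right grouping: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Reassociated grouping: a + (b + c). Mathematically the same sum,
   but each addition rounds to the nearest double, so a small c can
   be absorbed by a large b. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, `sum_left` gives 1.0 because the large values cancel first, while `sum_right` gives 0.0 because 1.0 is lost when it is added to -1e20.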

The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp

Fortran Cafe on IBM developerWorks

Feature Request
- Request for a feature to be supported by our compilers
- C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
- Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
- Or send e-mail to xl_feature@ca.ibm.com

Documentation
- An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
- Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
- Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family", available at http://www.ibm.com/support/docview.wss?uid=swg27005175
- Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1/XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc, gxlc++, gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
filec
foo (float p float q float r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all
filec
foo (float restrict p float restrict q float restrict r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing
Original source file modified source file
filexmlLoop was automatically parallelizedLoop was modulo scheduled
Tuning
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Explicit SIMD programming for POWER7 Enabled under -qaltivec
Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors
vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements
ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements
Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)
New vector operations vec_mul vec_div hellipUnaligned load and store operations
ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
21
Automatic SIMDizationAutomatic SIMDization for VMX and VSX
ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX
FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable alignment
Use __attribute__((aligned(n)) to set data alignment
Use __alignx(16 a) to indicate the data alignment to the compiler
Use -qassert=refalign if all references are naturally aligned
Use array references instead of pointers where possible
data dependence prevents SIMD vectorization
Use fewer pointers when possible
Use pragma independent if it has no loop carried dependency
Use pragma disjoint (a b) if a and b are disjoint
Use restrict keyword or compiler option ndashqrestrict
User actionsTransformation report
Loop was SIMD vectorized
Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable
to vectorize
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable strides
Loop interchange for stride-one accesses when possible
Data layout reshape for stride-one accesses
Higher optimization to propagate compile known stride information
Stride versioning
Do statement splitting and loop splitting
User actionsTransformation report
either operation or data type is not suitable for SIMD vectorization
Convert while-loops into do-loops when possible
Limited use of control flow in a loop
Use MIN MAX instead of if-then-else
Eliminate function calls in a loop through inlining
loop structure prevents SIMD vectorization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)
bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV
ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations
for (i=0iltni++)
b[i]=sqrt(a[i])
__vsqrt_P7(ban)
Loop vectorization was performed
Transformation report
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Software-controlled data prefetching for POWER7
Software control over POWER7 prefetch engine supporting up to 12 data streams
Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above
ndashMore aggressive exploitation under option ndash qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)
Partial cache line touchvoid __partial_dcbt(void address)
Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)
Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count
depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth
stream_ID)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Example of POWER7 data prefetching
for (i=0 ilt n i++) a[i] = b[i] +
__protected_store_stream_set(FORWARD ampa 11)
__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)
__protected_stream_set(FORWARD ampb 0)
__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)
__eieio()
__protected_stream_go()
Store stream prefetch for array a
transient stream prefetch for array bStream id
Stream length
Stream direction
Prefetch depth
Start stream prefetch
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Optimization
Traditional unimodular loop transformations for prefect regular loop nests
ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling
ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above
Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and
complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp
bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence
formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)
Enable interchange of j and k loops to improve access locality for b
ndashIdentifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)
Also allows transformation of imperfect loop nests
ndashIntervening code between loops
Only available at -qhot=level=2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Example
Sequence of Imperfect Loop nests
for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip
for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]
InputParallelism amp Locality
Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]
Output
Loop fusionLoop skewing to enable tiling
Loop tiling for cache
Loop skewing forPipeline parallelization
Loop tiling for registers
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Data ReorganizationData reorganization transformations to reduce memory latency
ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout
Enabled at O5
Data Reorganization Report
Seq
Type Phase Data Name
Category Region Line Description
1 ArraySplitting High Level Optimizer
iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 ArrayCoalescing High Level Optimizer
net Global variables were aggregated
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
S[0]F0 S[0]F1 S[0]F2 S[0]F3
S[1]F0 S[1]F1 S[1]F2 S[1]F3
S[2]F0 S[2]F1 S[2]F2 S[2]F3
S[3]F0 S[3]F1 S[3]F2 S[3]F3
hellip hellip hellip hellip
F0[0]
F1[0]
F2[0]
F3[0]
F0[1]
F1[1]
F2[1]
F3[1]
F0[2]
F1[2]
F2[2]
F3[2]
F0[3]
F1[3]
F2[3]
F3[3]
hellip
hellip
hellip
hellip
A[0]
A[1]
A[2]
A[3]
A[0][2] A[0][3]A[0][1]A[0][0]
A[1][2] A[1][3]A[1][1]A[1][0]
A[2][2] A[2][3]A[2][1]A[2][0]
A[3][2] A[3][3]A[3][1]A[3][0]
hellip
hellip
hellip
hellip
A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip
A[0][0] A[0][1] A[0][2] A[0][3]
A[1][0] A[1][1] A[1][2] A[1][3]
A[2][0] A[2][1] A[2][2] A[2][3]
A[3][0] A[3][1] A[3][2] A[3][3]
hellip hellip hellip hellip
hellip
hellip
hellip
hellip
hellip
Arsquo[0][0]
Arsquo[1][0]
Arsquo[2][0]
Arsquo[3][0]
Arsquo[0][1]
Arsquo[1][1]
Arsquo[2][1]
Arsquo[3][1]
Arsquo[0][2]
Arsquo[1][2]
Arsquo[2][2]
Arsquo[3][2]
Arsquo[0][3]
Arsquo[1][3]
Arsquo[2][3]
Arsquo[3][3]
hellip
hellip
hellip
hellip
hellip helliphelliphelliphellip
Array splitting
Array merging Array transposing
Data locality cache utilization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran
ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default
ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
Improved interaction between OpenMP and automatic SIMD
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Automatic parallelizationImprovements to automatic parallelization
ndashMore effective array data flow analysis for array privatization
ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing
SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inlining
  - Simplifies inlining control for the programmer
-qinline=level=X, where X is 0 to 10 (default 5)
  - Consistent across all languages and optimization levels
  - Previous mechanisms still available
User inline control with -qinline+<function_name> or -qinline-<function_name>
Automatic inlining before loop optimization
  - Previously only available at -O5, or for user inlining in C++
  - Available at all levels of -qhot (default at -O3 and up)
  - Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
  - Precision of floating-point computation
  - Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
  - Use of alternate math libraries
-qstrict guarantees results identical to unoptimized (noopt) code, at the expense of optimization
  - Suboptions allow fine-grained control over this guarantee
  - Examples:
      -qstrict=precision        Strict FP precision
      -qstrict=exceptions       Strict FP exceptions
      -qstrict=ieeefp           Strict IEEE FP implementation
      -qstrict=nans             Strict handling of NaNs (general and computation)
      -qstrict=order            Do not modify evaluation order
      -qstrict=vectorprecision  Maintain precision over all loop iterations
Suboptions can be combined: -qstrict=precision:nonans
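To illustrate why evaluation order matters, here is a sketch (function names are hypothetical) comparing a source-order sum with a two-accumulator sum. The second is written by hand, but it is the kind of reassociation an optimizer may apply when evaluation order is not held strict.

```c
/* Sums the elements strictly left to right, in source order. */
double sum_in_order(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sums with two partial accumulators (even and odd elements), then
   combines them. Mathematically the same sum, but evaluated in a
   different order. */
double sum_two_accumulators(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n)          /* leftover element when n is odd */
        s0 += x[i];
    return s0 + s1;
}
```

For the input {1e30, 1.0, -1e30, 1.0} in IEEE double arithmetic, the first function returns 1.0 and the second returns 2.0, because the small terms are absorbed at different points.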
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
  - C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
  - Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
  - Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
  - Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++, or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1 / XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc, gxlc++, gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
Loop Transformation Reports
Seq  Type                         Phase                 Region  Line  Loop Index  Description                       Attributes
2    LoopVector (success)         High Level Optimizer  4       217   1           Loop vectorization was performed  not available
3    LoopFusion (success)         High Level Optimizer  4       108   3           Loops were fused                  Loop Line Number 108; Loop Line Number 206
4    LoopVectorVersion (success)  High Level Optimizer  4       108   3           Vector versioning was performed   not available
20   ModuloSchedule (success)     Low Level Optimizer   12      3499  26          Loop was modulo scheduled         Initiation Interval 12
Performance Tuning with Compiler Transformation Reports (-qlistfmt=xml=all)
Original source file (file.c):
    void foo (float *p, float *q, float *r, int n) {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i];
    }
Transformation report (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.
Tuning: modified source file (file.c):
    void foo (float * restrict p, float * restrict q, float * restrict r, int n) {
      for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i];
    }
Transformation report (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
Explicit SIMD programming for POWER7 (enabled under -qaltivec)
Successor to the AltiVec programming extensions on POWER6/PPC970
  - AltiVec data types (16-byte vectors):
      vector char       16 elements
      vector short      8 elements
      vector pixel      8 elements
      vector int        4 elements
      vector float      4 elements
  - VSX AltiVec extensions (16-byte vectors):
      vector double     2 elements
      vector long long  2 elements
AltiVec built-in functions extended to the new data types
      vec_add(vector double, vector double)
      vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, ...
Unaligned load and store operations
  - AltiVec truncating loads/stores still available: vec_ld, vec_st
  - New non-truncating loads/stores: vec_xld2, vec_xstd2
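A hedged sketch of how the VSX types and built-ins named above might be combined. The function name is hypothetical, and the argument order assumed for vec_xld2 (offset, address) and vec_xstd2 (value, offset, address) should be checked against the compiler reference. The vector path is guarded so that it is compiled only by an XL compiler targeting VSX; elsewhere the scalar loop does all the work.

```c
/* Adds two arrays of doubles element by element: c[i] = a[i] + b[i]. */
void add_doubles(double *a, double *b, double *c, int n)
{
    int i = 0;
#if defined(__IBMC__) && defined(__VSX__)
    /* VSX path (assumes XL C with -qaltivec on POWER7):
       each 16-byte vector holds two doubles. */
    for (; i + 2 <= n; i += 2) {
        vector double va = vec_xld2(0, &a[i]);  /* unaligned load  */
        vector double vb = vec_xld2(0, &b[i]);
        vec_xstd2(vec_add(va, vb), 0, &c[i]);   /* unaligned store */
    }
#endif
    /* Scalar path: handles the remainder, or the whole array when the
       vector path is not compiled in. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The non-truncating loads and stores are used here because the arrays are not assumed to be 16-byte aligned.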
Automatic SIMDization
Automatic SIMDization for VMX and VSX
  - Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
Features
  - Basic-block-level SIMDization
  - Loop-level aggregation
  - Data conversion, reduction
  - Loops with limited control flow
  - Automatic SIMDization with -qstrict (VSX) and -qnostrict
  - Support for unaligned vector memory accesses (VSX)
  - Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report messages and the corresponding user actions
Report: Loop was SIMD vectorized
  - (no user action listed)
Report: It is not profitable to vectorize
  - Use pragma simd_level(10) to force the compiler to do SIMDization
Report: Memory accesses have non-vectorizable alignment
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
Report: Data dependence prevents SIMD vectorization
  - Use fewer pointers when possible
  - Use pragma independent if the loop has no loop-carried dependency
  - Use pragma disjoint (a, b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
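A sketch of the alignment and aliasing hints applied to a simple loop. The function and array names are hypothetical. The __alignx call is an XL built-in, so it is guarded here and compiled only by XL C; restrict and the aligned attribute are accepted more widely.

```c
/* Example buffers with 16-byte alignment requested explicitly. */
static float xs[4] __attribute__((aligned(16))) = {1.0f, 2.0f, 3.0f, 4.0f};
static float ys[4] __attribute__((aligned(16))) = {1.0f, 1.0f, 1.0f, 1.0f};

/* Computes y[i] += a * x[i]. The restrict qualifiers promise the compiler
   that x and y never overlap, which removes the aliasing dependence. */
void saxpy(int n, float a, const float * restrict x, float * restrict y)
{
#if defined(__IBMC__)
    /* Tells XL C that both pointers are 16-byte aligned; the caller is
       responsible for making that true. */
    __alignx(16, x);
    __alignx(16, y);
#endif
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Calling saxpy(4, 2.0f, xs, ys) updates ys in place. Both hints are promises: passing overlapping or misaligned buffers makes the behavior undefined.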
SIMDization Tuning
Transformation report messages and the corresponding user actions
Report: Memory accesses have non-vectorizable strides
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time-known stride information
  - Stride versioning
Report: Either operation or data type is not suitable for SIMD vectorization
  - Do statement splitting and loop splitting
Report: Loop structure prevents SIMD vectorization
  - Convert while-loops into do-loops when possible
  - Limit the use of control flow in a loop
  - Use MIN, MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
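A sketch of loop interchange for stride-one access, written out by hand. The array sizes and function names are assumptions for illustration. C stores arrays row by row, so the innermost loop should run over the last index.

```c
#define ROWS 4
#define COLS 3

/* Before: the inner loop runs over i, so consecutive accesses are
   COLS elements apart in memory. */
void scale_strided(double in[ROWS][COLS], double out[ROWS][COLS], double f)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            out[i][j] = f * in[i][j];
}

/* After interchange: the inner loop runs over j, so consecutive accesses
   are adjacent in memory (stride one). */
void scale_stride_one(double in[ROWS][COLS], double out[ROWS][COLS], double f)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            out[i][j] = f * in[i][j];
}
```

Both versions compute the same result; only the traversal order differs, which is what makes the interchange legal here.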
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
  - POWER7 vector MASS library (libmassvp7.a)
      - Internally exploits VSX instructions
      - SP: average speedup of 1.99 vs. POWER5 MASSV
      - DP: average speedup of 1.27 vs. POWER5 MASSV
  - POWER7 SIMD MASS library (libmass_simdp7.a)
      - Tuned math routines operating on vector data types
      - Over 35 frequently used mathematical functions
      - Both single and double precision
      - To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above; use -qstrict=vectorprecision to maintain precision over all loop iterations
    for (i = 0; i < n; i++)
      b[i] = sqrt(a[i]);
  is replaced by
    __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, which supports up to 12 data streams
Fine-grained software-controlled data prefetch (stream type, stream length, stream stride, prefetch depth) at optimization level -O3 -qhot or above
  - More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
    void __dcbtt(void *address)
    void __dcbtstt(void *address)
Partial cache line touch
    void __partial_dcbt(void *address)
Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
Loop:
    for (i = 0; i < n; i++) a[i] = b[i] + ...
Prefetch control for the loop:
    /* Store stream prefetch for array a (stream ID 11) */
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b (stream ID 0) */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    /* Start stream prefetch */
    __protected_stream_go();
Arguments: stream direction (FORWARD), stream length (n*sizeof(double)/128), prefetch depth (DEEPER), stream ID (11 or 0)
Loop Optimization
Traditional unimodular loop transformations for perfect regular loop nests
  - Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing and loop unrolling
  - Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
  - Under the optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
  - Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
      - Loop skewing, loop tiling for triangular loop shapes
  - Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
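As an illustration of loop tiling, one of the transformations listed above, here is a hand-tiled matrix transpose. The sizes, tile width and function name are assumptions; the compiler chooses its own tile sizes when it applies the transformation itself.

```c
#define DIM  6   /* matrix dimension (assumed for the example) */
#define TILE 4   /* tile width; deliberately does not divide DIM evenly */

/* Transposes src into dst one TILE x TILE block at a time, so that the
   rows of src and the columns of dst touched by a block stay in cache. */
void transpose_tiled(double src[DIM][DIM], double dst[DIM][DIM])
{
    for (int ii = 0; ii < DIM; ii += TILE)          /* tile row    */
        for (int jj = 0; jj < DIM; jj += TILE)      /* tile column */
            for (int i = ii; i < ii + TILE && i < DIM; i++)
                for (int j = jj; j < jj + TILE && j < DIM; j++)
                    dst[j][i] = src[i][j];
}
```

The extra bounds checks (i < DIM, j < DIM) handle the partial tiles at the edges.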
Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
  - Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
  - Affects all optimization levels that include -qhot
Loop transformations
  Tiling of triangular matrix multiplication
    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) += a(k,i)*b(j,k)
  - Also allows transformation of imperfect loop nests (intervening code between loops)
  - Only available at -qhot=level=2
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Transformations applied (parallelism and locality)
  - Loop fusion
  - Loop skewing to enable tiling
  - Loop tiling for cache
  - Loop skewing for pipeline parallelization
  - Loop tiling for registers
Output: optimized loops
    for jT ...
      omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
  - Context-sensitive and context-insensitive safety analysis
  - Data affinity analysis and shape analysis
  - Data splitting, data transposing and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
  - 1, ArraySplitting, High Level Optimizer, iplus, 9: An array of a large aggregated data type was split into multiple arrays of smaller data types
  - 27, ArrayCoalescing, High Level Optimizer, net: Global variables were aggregated
[Figure: data reorganization for data locality and cache utilization. Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate arrays F0[], F1[], F2[], F3[]. Array merging: arrays A[0] to A[3] and their elements A[i][j]. Array transposing: A[i][j] is laid out as A'[j][i].]
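A sketch of array splitting done by hand. The structure and field names are hypothetical. An array of structures is split into one array per field, so that a loop reading a single field walks contiguous memory.

```c
#define COUNT 4   /* number of elements (assumed for the example) */

/* Original layout: array of structures, fields interleaved in memory. */
struct particle { double x, y, z, mass; };

/* Split layout: one contiguous array per field. */
struct particles_split {
    double x[COUNT], y[COUNT], z[COUNT], mass[COUNT];
};

/* Copies the first n elements (n <= COUNT) into the split layout. */
void split_particles(const struct particle *in, struct particles_split *out, int n)
{
    for (int i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->mass[i] = in[i].mass;
    }
}

/* Reads one field per element: each access skips over x, y and z. */
double total_mass_aos(const struct particle *p, int n)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        m += p[i].mass;
    return m;
}

/* Same sum on the split layout: accesses are stride one. */
double total_mass_split(const struct particles_split *p, int n)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        m += p->mass[i];
    return m;
}
```

The compiler can apply this kind of reshaping only after its safety analysis shows that no other code depends on the original layout, which is why the slide ties it to whole-program optimization at -O5.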
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Automatic parallelizationImprovements to automatic parallelization
ndashMore effective array data flow analysis for array privatization
ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing
SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results -qstrict suboptions
Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
ndashSuboptions allow fine-grain control over this guaranteendashExamples
-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations
Can be combined -qstrict=precisionnonans
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp
copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011
IBM | Software Group | Rational
Fortran Cafe on IBM developerWorks
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Feature RequestRequest for a feature to be supported by our compilers
CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811
Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812
Or send e-mail to xl_featurecaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Documentation
An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package
Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174
Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175
Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
filec
foo (float p float q float r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all
filec
foo (float restrict p float restrict q float restrict r int n)
for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]
filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing
Original source file modified source file
filexmlLoop was automatically parallelizedLoop was modulo scheduled
Tuning
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Explicit SIMD programming for POWER7 Enabled under -qaltivec
Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors
vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements
ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements
Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)
New vector operations vec_mul vec_div hellipUnaligned load and store operations
ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
21
Automatic SIMDizationAutomatic SIMDization for VMX and VSX
ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX
FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable alignment
Use __attribute__((aligned(n)) to set data alignment
Use __alignx(16 a) to indicate the data alignment to the compiler
Use -qassert=refalign if all references are naturally aligned
Use array references instead of pointers where possible
data dependence prevents SIMD vectorization
Use fewer pointers when possible
Use pragma independent if it has no loop carried dependency
Use pragma disjoint (a b) if a and b are disjoint
Use restrict keyword or compiler option ndashqrestrict
User actionsTransformation report
Loop was SIMD vectorized
Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable
to vectorize
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable strides
Loop interchange for stride-one accesses when possible
Data layout reshape for stride-one accesses
Higher optimization to propagate compile known stride information
Stride versioning
Do statement splitting and loop splitting
User actionsTransformation report
either operation or data type is not suitable for SIMD vectorization
Convert while-loops into do-loops when possible
Limited use of control flow in a loop
Use MIN MAX instead of if-then-else
Eliminate function calls in a loop through inlining
loop structure prevents SIMD vectorization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)
bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV
ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations
for (i=0iltni++)
b[i]=sqrt(a[i])
__vsqrt_P7(ban)
Loop vectorization was performed
Transformation report
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Software-controlled data prefetching for POWER7
Software control over POWER7 prefetch engine supporting up to 12 data streams
Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above
ndashMore aggressive exploitation under option ndash qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)
Partial cache line touchvoid __partial_dcbt(void address)
Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)
Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count
depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth
stream_ID)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Example of POWER7 data prefetching
  for (i = 0; i < n; i++) a[i] = b[i] +
  /* Store stream prefetch for array a */
  __protected_store_stream_set(FORWARD, &a, 11);
  __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
  /* Transient stream prefetch for array b */
  __protected_stream_set(FORWARD, &b, 0);
  __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
  __eieio();
  /* Start stream prefetch */
  __protected_stream_go();
Argument callouts: stream direction (FORWARD), stream length (n*sizeof(double)/128), prefetch depth (DEEPER), stream ID (11 for a, 0 for b)
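The stream length in the example is written as n*sizeof(double)/128, that is, a count of 128-byte units. The sketch below only illustrates that arithmetic in portable C. It assumes a 128-byte cache line, as the divisor on the slide suggests; `stream_units` and `CACHE_LINE_BYTES` are hypothetical names, not XL built-ins, and unlike the slide's truncating division this helper rounds up so that a partially filled last line is counted.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes; matches the divisor used on the slide. */
#define CACHE_LINE_BYTES 128u

/* Hypothetical helper: number of cache lines spanned by n elements of
 * elem_size bytes each, rounded up to a whole line. */
size_t stream_units(size_t n, size_t elem_size)
{
    assert(elem_size > 0);
    size_t bytes = n * elem_size;
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

For 1000 eight-byte elements this gives 63 units, where the truncating form on the slide would give 62.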
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
- Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, and loop unrolling
- Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
- Under the optimization level -O3, -O3 -qhot, or above
Polyhedral framework for any loop nests
- Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  • Loop skewing, loop tiling for triangular loop shapes
- Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
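To make one of the listed transformations concrete, here is a hand-written sketch of loop tiling on a matrix transpose. It is only an illustration of what tiling does to the iteration order, not the compiler's actual output; the function names and the tile-size parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: straightforward transpose of an n x n row-major matrix. */
void transpose_naive(size_t n, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* Tiled version: the same iterations, visited tile by tile so that each
 * tile of `in` and `out` can stay in cache while it is being used. */
void transpose_tiled(size_t n, size_t tile, const double *in, double *out)
{
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clip the tile at the matrix edge when n is not a multiple of tile. */
            size_t i_end = ii + tile < n ? ii + tile : n;
            size_t j_end = jj + tile < n ? jj + tile : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    out[j * n + i] = in[i * n + j];
        }
}
```

Both functions write exactly the same values; only the order in which elements are touched differs, which is what makes tiling a legal reordering here.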
Polyhedral Loop Transformation Examples
Dependence analysis
  do i = 1, N
    do j = 1, i
      do k = i+1, N
        c(k) = c(j) + b(j,k)
Enables interchange of the j and k loops to improve access locality for b
- Identifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplication
  do i = 1, N
    do j = i+1, N
      do k = i+1, N
        c(j,i) += a(k,i) * b(j,k)
Also allows transformation of imperfect loop nests
- Intervening code between loops
Only available at -qhot=level=2
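The dependence-analysis example can be checked by hand: for a fixed i, the loop reads c(j) only for j <= i and writes c(k) only for k > i, so the reads and writes never touch the same element and the j and k loops may be swapped. The sketch below is a 0-based C translation written for illustration; the function names are hypothetical, and b is stored column-major to mirror the Fortran layout.

```c
#include <assert.h>
#include <stddef.h>

/* b is stored column-major, as in Fortran: element (j,k) is b[k*n + j]. */

/* Original order: i, j, k. The inner k loop strides through b by n. */
void triangular_original(size_t n, double *c, const double *b)
{
    assert(c != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j <= i; j++)
            for (size_t k = i + 1; k < n; k++)
                c[k] = c[j] + b[k * n + j];
}

/* Interchanged order: i, k, j. The inner j loop is stride-one in b. */
void triangular_interchanged(size_t n, double *c, const double *b)
{
    assert(c != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = i + 1; k < n; k++)
            for (size_t j = 0; j <= i; j++)
                c[k] = c[j] + b[k * n + j];
}
```

Both versions leave c in the same final state, because in either order the last value written to c[k] for a given i comes from j = i.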
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
  for j …
    for k …
      A[j] = B[k]
      for i …
        C[i][k] = …
  for j …
    for k …
      M[j] = N[k]
      for i …
        X = C[i][k]
Output: optimized loops (parallelism & locality)
  for jT …
    omp parallel for
    for kT …
      for j …
        for k …
          A[j] = B[k]
          M[j] = N[k]
          for i
            C[i][k] = …
            X = C[i][k]
Transformations applied
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context-sensitive and context-insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing, and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
  1   ArraySplitting   High Level Optimizer   iplus   9   An array of a large aggregated data-type was split into multiple arrays of smaller data-types
  27  ArrayCoalescing  High Level Optimizer   net         Global variables were aggregated
[Figure: data layout transformations for data locality and cache utilization]
- Array splitting: an array of structures S[0], S[1], S[2], S[3], … with fields F0, F1, F2, F3 is split into separate arrays F0[], F1[], F2[], F3[]
- Array merging: separate arrays A[0], A[1], A[2], A[3], each holding elements A[i][0] to A[i][3], are merged into a single array
- Array transposing: the array A[i][j] is laid out in transposed order as A'[0][0], A'[1][0], A'[2][0], A'[3][0], A'[0][1], …
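The array-splitting case in the figure corresponds to the familiar array-of-structures to structure-of-arrays change. The sketch below performs that split by hand to show the effect on layout. It is an illustration under assumed names (`record`, `split_records`, and the helper functions are hypothetical), not the transformation the compiler applies internally.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures layout: the four fields of S[i] are adjacent in memory. */
struct record { double f0, f1, f2, f3; };

/* Layout after splitting: one contiguous array per field. */
struct split_records { double *f0, *f1, *f2, *f3; };

/* Copy n records into caller-provided per-field arrays. */
void split_array(size_t n, const struct record *s, struct split_records *out)
{
    assert(s != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Summing one field in the original layout also drags f1..f3 through the cache. */
double sum_f0_aos(size_t n, const struct record *s)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += s[i].f0;
    return total;
}

/* After splitting, the same sum reads only the f0 array, stride-one. */
double sum_f0_soa(size_t n, const struct split_records *r)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += r->f0[i];
    return total;
}
```

A loop that uses a single field benefits most: in the split layout every byte brought into cache is one the loop actually needs.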
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
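As a small illustration of the features named above, the sketch below combines a threadprivate variable with a parallel loop reduction, using only standard OpenMP directives. The function and variable names are hypothetical. When the code is compiled without OpenMP support the pragmas are ignored and it runs serially with the same result.

```c
#include <assert.h>

/* Per-thread iteration counter. With OpenMP enabled, each thread gets its
 * own copy of this variable through the threadprivate directive. */
static long iterations_on_this_thread = 0;
#pragma omp threadprivate(iterations_on_this_thread)

/* Sum of i*i for i = 1..n, with the loop iterations shared among threads. */
long sum_of_squares(int n)
{
    assert(n >= 0);
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++) {
        iterations_on_this_thread++;   /* no race: the variable is per-thread */
        total += (long)i * i;
    }
    return total;
}
```

The reduction clause gives each thread a private partial sum and combines them at the end, so the result does not depend on the number of threads.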
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime dependence testing
SMP runtime improvements
- Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and auto-parallelized programs
Some suboptions of interest
- spins and yields to define the behavior of idle threads
- Thread binding using the startproc and stride suboptions
- New bind suboption on AIX:
  bind=SDL=<start resource>,<number of resources>,<stride>
  bindlist=SDL=i0,i1,…,ix
- schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  • Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
- Simplifies inlining control for the programmer
-qinline=level=X, X=0..10 (default 5)
- Consistent across all languages and optimization levels
- Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
- Previously only available at -O5 or with user inlining on C++
- Available at all levels of -qhot (default at -O3 and up)
- Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
- Suboptions allow fine-grained control over this guarantee
- Examples:
  -qstrict=precision         Strict FP precision
  -qstrict=exceptions        Strict FP exceptions
  -qstrict=ieeefp            Strict IEEE FP implementation
  -qstrict=nans              Strict generation and computation of NaNs
  -qstrict=order             Do not modify evaluation order
  -qstrict=vectorprecision   Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
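Why evaluation order matters can be seen with three numbers. The sketch below adds the same operands in two different orders; it is a generic floating-point illustration with hypothetical function names, not XL-specific code. The volatile temporaries only keep a compiler from folding the sums differently from how they are written.

```c
#include <assert.h>

/* (a + b) + c, evaluated exactly as written. */
double sum_left_to_right(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c): the kind of reordering an optimizer might choose. */
double sum_reassociated(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}

/* The two orders disagree for these inputs: 1.0 is far below the
 * rounding granularity of 1e20, so it is lost when added to -1e20 first. */
void demo_order_sensitivity(void)
{
    assert(sum_left_to_right(1e20, -1e20, 1.0) == 1.0);
    assert(sum_reassociated(1e20, -1e20, 1.0) == 0.0);
}
```

A suboption such as -qstrict=order exists for cases like this, where a program's results depend on the written order of operations being preserved.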
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family", available at http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center or about the existing C, C++ or Fortran documentation shipped with the products to compinfo@ca.ibm.com
© Copyright IBM Corporation 2011. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, the on-demand business logo, Rational, the Rational logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC11.1, XLF13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
Explicit SIMD programming for POWER7
Enabled under -qaltivec
Successor to the AltiVec programming extensions on POWER6/PPC970
- AltiVec data types: 16-byte vectors
  vector char: 16 elements
  vector short: 8 elements
  vector pixel: 8 elements
  vector int: 4 elements
  vector float: 4 elements
- VSX AltiVec extensions: 16-byte vectors
  vector double: 2 elements
  vector long long: 2 elements
AltiVec built-in functions extended to the new data types
  vec_add(vector double, vector double)
  vec_sub(vector long long, vector long long)
New vector operations: vec_mul, vec_div, …
Unaligned load and store operations
- AltiVec truncating loads/stores still available: vec_ld, vec_st
- New non-truncating loads/stores: vec_xld2, vec_xstd2
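The programming style described here, 16-byte vector types with element-wise built-ins, can be sketched in portable form with the GCC/Clang vector extension. This is a stand-in, not the AltiVec/VSX interface: `v2d` plays the role of `vector double`, `add_v2d` the role of `vec_add`, and the `memcpy`-based load stands in for a non-truncating unaligned load. All names in the block are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Two doubles in 16 bytes (GCC/Clang vector extension, used here as a
 * portable stand-in for the slide's "vector double"). */
typedef double v2d __attribute__((vector_size(16)));

/* Load two doubles from an address with no alignment requirement. */
v2d load_unaligned(const double *p)
{
    v2d v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Element-wise addition of two 2-element vectors. */
v2d add_v2d(v2d x, v2d y)
{
    return x + y;
}

/* out[i] = a[i] + b[i], two elements at a time plus a scalar remainder. */
void add_arrays(int n, const double *a, const double *b, double *out)
{
    assert(n >= 0);
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        v2d s = add_v2d(load_unaligned(a + i), load_unaligned(b + i));
        memcpy(out + i, &s, sizeof s);
    }
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The remainder loop handles an odd element count, which any hand-written SIMD loop over 2-element vectors has to account for.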
Automatic SIMDization
Automatic SIMDization for VMX and VSX
- Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
Features
- Basic-block-level SIMDization
- Loop-level aggregation
- Data conversion, reduction
- Loop with limited control flow
- Automatic SIMDization with -qstrict (VSX) and -qnostrict
- Support of unaligned vector memory accesses (VSX)
- Automatic SIMDization enabled at -O3 -qsimd
SIMDization Tuning
Transformation report message, followed by the suggested user actions
Loop was SIMD vectorized
It is not profitable to vectorize
- Use #pragma simd_level(10) to force the compiler to do SIMDization
Data dependence prevents SIMD vectorization
- Use fewer pointers when possible
- Use #pragma independent if the loop has no loop-carried dependency
- Use #pragma disjoint (a, b) if a and b are disjoint
- Use the restrict keyword or the compiler option -qrestrict
Memory accesses have non-vectorizable alignment
- Use __attribute__((aligned(n))) to set data alignment
- Use __alignx(16, a) to indicate the data alignment to the compiler
- Use -qassert=refalign if all references are naturally aligned
- Use array references instead of pointers where possible
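Two of the actions listed above, the restrict keyword and an explicit alignment attribute, are shown together in the sketch below. It uses only standard C99 `restrict` and the `aligned` attribute named on the slide; the function names are hypothetical, and whether a particular compiler then vectorizes the loop is not something this sketch can promise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True when p sits on a 16-byte boundary, the vector width discussed here. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* y[i] += alpha * x[i]. The restrict qualifiers promise that x and y do not
 * overlap, which removes the aliasing dependence a compiler would otherwise
 * have to assume between the loads of x and the stores to y. */
void scaled_add(size_t n, double alpha,
                const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

A caller would declare its buffers with the attribute, for example `double x[4] __attribute__((aligned(16)));`, so that every vector-sized access starts on a 16-byte boundary.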
SIMDization Tuning
Transformation report message, followed by the suggested user actions
Memory accesses have non-vectorizable strides
- Loop interchange for stride-one accesses when possible
- Data layout reshape for stride-one accesses
- Higher optimization to propagate compile-time-known stride information
- Stride versioning
Either operation or data type is not suitable for SIMD vectorization
- Do statement splitting and loop splitting
Loop structure prevents SIMD vectorization
- Convert while-loops into do-loops when possible
- Limited use of control flow in a loop
- Use MIN, MAX instead of if-then-else
- Eliminate function calls in a loop through inlining
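The advice to use MIN and MAX instead of if-then-else can be illustrated with a clamping loop. The sketch below gives both forms; the macro and function names are hypothetical, and the two versions agree whenever lo <= hi and the data contain no NaNs.

```c
#include <assert.h>
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Branching form: control flow inside the loop body. */
void clamp_branchy(size_t n, double lo, double hi, double *v)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i] < lo)
            v[i] = lo;
        else if (v[i] > hi)
            v[i] = hi;
    }
}

/* MIN/MAX form: the body is a single expression with no branches, which
 * maps naturally onto vector select or min/max instructions. */
void clamp_minmax(size_t n, double lo, double hi, double *v)
{
    assert(lo <= hi);
    for (size_t i = 0; i < n; i++)
        v[i] = MIN(MAX(v[i], lo), hi);
}
```

The second form stores to every element unconditionally, so each iteration does the same work regardless of the data, which is the property a vectorizer looks for.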
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
- POWER7 vector MASS library (libmassvp7.a)
  • Internally exploits VSX instructions
    SP: average speedup of 1.99 vs POWER5 MASSV
    DP: average speedup of 1.27 vs POWER5 MASSV
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
21
Automatic SIMDizationAutomatic SIMDization for VMX and VSX
ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX
FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
Transformation report messages, each followed by the suggested user actions
- Loop was SIMD vectorized
- It is not profitable to vectorize
  - Use pragma simd_level(10) to force the compiler to do SIMDization
- Data dependence prevents SIMD vectorization
  - Use fewer pointers when possible
  - Use pragma independent if the loop has no loop-carried dependency
  - Use pragma disjoint (a, b) if a and b are disjoint
  - Use the restrict keyword or the compiler option -qrestrict
- Memory accesses have non-vectorizable alignment
  - Use __attribute__((aligned(n))) to set data alignment
  - Use __alignx(16, a) to indicate the data alignment to the compiler
  - Use -qassert=refalign if all references are naturally aligned
  - Use array references instead of pointers where possible
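The alignment and aliasing advice above can be sketched in portable C. This is a minimal illustration, not code from the slides: the names `NELEMS`, `x`, `y`, `scale_add` and `run_scale_add` are hypothetical, and whether a given compiler actually SIMDizes the loop depends on its options and report output.

```c
#include <stddef.h>

#define NELEMS 1024   /* hypothetical array length */

/* 16-byte alignment matches the width of a VMX/VSX vector register. */
static float x[NELEMS] __attribute__((aligned(16)));
static float y[NELEMS] __attribute__((aligned(16)));

/* restrict promises that dst and src never overlap, which removes the
   pointer-aliasing dependence that could otherwise block SIMDization. */
void scale_add(float *restrict dst, const float *restrict src,
               float factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + factor * src[i];
}

/* Hypothetical driver: fills the arrays, runs the kernel once and
   returns one element so the result can be checked. */
float run_scale_add(void)
{
    for (size_t i = 0; i < NELEMS; i++) {
        x[i] = 1.0f;
        y[i] = (float)i;
    }
    scale_add(x, y, 2.0f, NELEMS);
    return x[10];
}
```

Array references with a stride-one index, as used here, follow the "array references instead of pointers" advice; the restrict qualifiers play the role that pragma disjoint plays for existing pointer code.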
SIMDization Tuning
Transformation report messages, each followed by the suggested user actions
- Either operation or data type is not suitable for SIMD vectorization
  - Do statement splitting and loop splitting
- Loop structure prevents SIMD vectorization
  - Convert while-loops into do-loops when possible
  - Limit the use of control flow in a loop
  - Use MIN, MAX instead of if-then-else
  - Eliminate function calls in a loop through inlining
- Memory accesses have non-vectorizable strides
  - Loop interchange for stride-one accesses when possible
  - Data layout reshape for stride-one accesses
  - Higher optimization to propagate compile-time known stride information
  - Stride versioning
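As an illustration of the "MIN, MAX instead of if-then-else" advice, the sketch below writes the same clamp operation twice. The macro and function names are hypothetical, and the two forms are only equivalent under the stated assumptions (lo <= hi, no NaN inputs).

```c
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Branchy form: if-then-else control flow inside the loop body. */
void clamp_branchy(float *v, size_t n, float lo, float hi)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i] < lo)
            v[i] = lo;
        else if (v[i] > hi)
            v[i] = hi;
    }
}

/* MIN/MAX form: one straight-line statement per element, which maps
   naturally onto vector min/max operations. Assumes lo <= hi and
   no NaN values in v. */
void clamp_minmax(float *v, size_t n, float lo, float hi)
{
    for (size_t i = 0; i < n; i++)
        v[i] = MIN(MAX(v[i], lo), hi);
}
```

Both loops are counted for-loops with a known trip count, which also matches the advice to prefer do-loops over while-loops.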
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
- POWER7 vector MASS library (libmassvp7.a)
  - Internally exploits VSX instructions
  - SP: average speedup of 1.99 vs Power5 MASSV
  - DP: average speedup of 1.27 vs Power5 MASSV
- POWER7 SIMD MASS library (libmass_simdp7.a)
  - Tuned math routines operating on vector data types
  - Over 35 frequently used mathematical functions
  - Both single and double precision
  - To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above
- -qstrict=vectorprecision to maintain precision over all loop iterations
Source loop
for (i=0; i<n; i++)
  b[i] = sqrt(a[i]);
Call generated by the compiler
__vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, which supports up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
- More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
  void __dcbtt(void *address)
  void __dcbtstt(void *address)
Partial cache line touch
  void __partial_dcbt(void *address)
Stride-N stream prefetch
  void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
  void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
  void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
Example of POWER7 data prefetching
for (i=0; i<n; i++) a[i] = b[i] +
/* Store stream prefetch for array a */
__protected_store_stream_set(FORWARD, &a, 11);
__protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
/* Transient stream prefetch for array b */
__protected_stream_set(FORWARD, &b, 0);
__transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
__eieio();
/* Start stream prefetch */
__protected_stream_go();
Argument callouts
- Stream direction: FORWARD
- Stream id: 11 (array a), 0 (array b)
- Stream length: n*sizeof(double)/128
- Prefetch depth: DEEPER
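The POWER7 stream built-ins shown on this slide are specific to the XL compilers. As a rough portable analogue only, the sketch below uses the GCC/Clang `__builtin_prefetch` intrinsic, which is a per-address touch rather than a hardware stream setup. The function name `add_one`, the constant `PREFETCH_AHEAD` and the `+ 1.0` right-hand side are assumptions made for illustration; the slide's own loop body is truncated in the source.

```c
#include <stddef.h>

/* Prefetch distance in elements: an untuned guess for illustration. */
#define PREFETCH_AHEAD 64

void add_one(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, rw (0 = read, 1 = write),
               temporal locality hint (0 = none ... 3 = high). */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 0); /* read-once input */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3); /* store target */
        }
        a[i] = b[i] + 1.0;   /* placeholder computation */
    }
}
```

The locality hint of 0 for b loosely mirrors the "transient" stream on the slide (data that will not be reused), while the write hint for a mirrors the store stream.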
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
- Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing and loop unrolling
- Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
- Under the optimization levels -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
- Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  - Loop skewing, loop tiling for triangular loop shapes
- Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1,N
  do j = 1,i
    do k = i+1,N
      c(k) = c(j) + b(j,k)
- Enables interchange of the j and k loops to improve access locality for b
  - Identifies independence of memory accesses to c
- Affects all optimization levels that include -qhot
Loop transformations
- Tiling of triangular matrix multiplication
do i = 1,N
  do j = i+1,N
    do k = i+1,N
      c(j,i) = c(j,i) + a(k,i)*b(j,k)
- Also allows transformation of imperfect loop nests
  - Intervening code between loops
- Only available at -qhot=level=2
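To make the triangular tiling idea concrete, here is a hand-written C sketch of the triangular multiplication nest, first in its original order and then with the j and k loops tiled. It is not the compiler's output: the names `DIM`, `TILE`, `tri_naive` and `tri_tiled` are hypothetical, indices are zero-based, and integer data is used so the two versions can be compared exactly.

```c
#define DIM  8   /* hypothetical matrix size */
#define TILE 3   /* hypothetical tile size; need not divide DIM */

static int imin(int x, int y) { return x < y ? x : y; }
static int imax(int x, int y) { return x > y ? x : y; }

/* Original triangular nest: j and k both start at i + 1. */
void tri_naive(int c[DIM][DIM], int a[DIM][DIM], int b[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = i + 1; j < DIM; j++)
            for (int k = i + 1; k < DIM; k++)
                c[j][i] += a[k][i] * b[j][k];
}

/* Tiled nest: jT and kT walk over TILE x TILE blocks of the (j, k)
   space; the max() bounds clip each block to the triangular region
   j > i, k > i, so every (i, j, k) is visited exactly once. */
void tri_tiled(int c[DIM][DIM], int a[DIM][DIM], int b[DIM][DIM])
{
    for (int jT = 0; jT < DIM; jT += TILE)
        for (int kT = 0; kT < DIM; kT += TILE)
            for (int i = 0; i < DIM; i++)
                for (int j = imax(jT, i + 1); j < imin(jT + TILE, DIM); j++)
                    for (int k = imax(kT, i + 1); k < imin(kT + TILE, DIM); k++)
                        c[j][i] += a[k][i] * b[j][k];
}
```

The clipped bounds are what make non-rectangular shapes awkward for traditional tiling; the sketch assumes a, b and c do not overlap.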
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
for j …
  for k …
    A[j] = B[k]
    for i …
      C[i][k] = …
for j …
  for k …
    M[j] = N[k]
    for i …
      X = C[i][k]
Transformations applied for parallelism and locality
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Output: optimized loops
for jT …
  omp parallel for
  for kT …
    for j …
      for k …
        A[j] = B[k]
        M[j] = N[k]
        for i
          C[i][k] = …
          X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context sensitive and insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing, data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq | Type | Phase | Data Name | Category | Region | Line | Description
1 | ArraySplitting | High Level Optimizer | iplus | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | Global variables were aggregated
[Figure: data layout before and after reorganization. Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate arrays F0[i], F1[i], F2[i], F3[i]. Array merging: rows A[0] to A[3] and their elements A[i][j]. Array transposing: A[i][j] becomes A'[j][i]. Goal: data locality and cache utilization.]
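The array splitting transformation can be written out by hand to show what changes in memory. The sketch below is an illustration under assumptions, not the compiler's generated code: the types `record` and `split_records`, the constant `COUNT` and both functions are hypothetical.

```c
#include <stddef.h>

#define COUNT 4   /* hypothetical number of records */

/* Original layout: one array of structures, S[i].f0 ... S[i].f3. */
struct record {
    double f0, f1, f2, f3;
};

/* Split layout: one array per field, f0[i] ... f3[i]. */
struct split_records {
    double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT];
};

/* Copies an array of structures into the split layout. */
void split(const struct record *s, struct split_records *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* A loop that touches only f0 now reads consecutive memory, so each
   cache line it loads carries only useful data. */
double sum_f0(const struct split_records *r, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += r->f0[i];
    return total;
}
```

In the original layout the same loop would stride over f1 to f3 of every record, loading four times as much data as it uses.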
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
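A minimal sketch of explicit OpenMP parallelization in C is given below; the function name `sum_to` is hypothetical. The pragma takes effect only when OpenMP is enabled at compile time (for example -qsmp=omp with the XL compilers or -fopenmp with GCC); otherwise it is ignored and the loop runs serially with the same result.

```c
/* Sums the integers 1..n. Iterations are independent apart from the
   accumulation into sum, which the reduction clause handles by giving
   each thread a private partial sum and combining them at the end. */
long sum_to(long n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++)
        sum += i;
    return sum;
}
```

Because the loop body is a simple stride-one accumulation, it is also the kind of loop where thread-level parallelism and SIMDization can be applied together.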
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime dependence testing
SMP runtime improvements
- Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs
Some suboptions of interest
- spins and yields to define the behavior of idle threads
- Thread binding using the startproc and stride suboptions
- New bind suboption on AIX
  - bind=SDL=<start resource>,<number of resources>,<stride>
  - bindlist=SDL=i0,i1,…,ix
- schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  - Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
- Simplifies inlining control for the programmer
-qinline=level=X, where X = 0 to 10 (default 5)
- Consistent across all languages and optimization levels
- Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
- Previously only available at -O5 or user inlining on C++
- Available at all levels of -qhot (default at -O3 and up)
- Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
- Suboptions allow fine-grain control over this guarantee
- Examples
  - -qstrict=precision: strict FP precision
  - -qstrict=exceptions: strict FP exceptions
  - -qstrict=ieeefp: strict IEEE FP implementation
  - -qstrict=nans: strict generation and computation of NaNs
  - -qstrict=order: do not modify evaluation order
  - -qstrict=vectorprecision: maintain precision over all loop iterations
Suboptions can be combined: -qstrict=precision:nonans
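Why evaluation order matters for floating-point results can be shown with a small C sketch. The two hypothetical functions below compute the same mathematical sum with different grouping; a compiler that is allowed to reassociate could turn one into the other.

```c
/* Left-to-right grouping, as written in the source. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Regrouped form, mathematically identical but rounded differently. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0 under IEEE double arithmetic, the left grouping cancels the large terms first and returns 1.0, while the right grouping absorbs c into b and returns 0.0. Keeping the source order, as -qstrict=order does, avoids this kind of difference.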
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable alignment
Use __attribute__((aligned(n)) to set data alignment
Use __alignx(16 a) to indicate the data alignment to the compiler
Use -qassert=refalign if all references are naturally aligned
Use array references instead of pointers where possible
data dependence prevents SIMD vectorization
Use fewer pointers when possible
Use pragma independent if it has no loop carried dependency
Use pragma disjoint (a b) if a and b are disjoint
Use restrict keyword or compiler option ndashqrestrict
User actionsTransformation report
Loop was SIMD vectorized
Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable
to vectorize
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
SIMDization Tuning
memory accesses have
non-vectorizable strides
Loop interchange for stride-one accesses when possible
Data layout reshape for stride-one accesses
Higher optimization to propagate compile known stride information
Stride versioning
Do statement splitting and loop splitting
User actionsTransformation report
either operation or data type is not suitable for SIMD vectorization
Convert while-loops into do-loops when possible
Limited use of control flow in a loop
Use MIN MAX instead of if-then-else
Eliminate function calls in a loop through inlining
loop structure prevents SIMD vectorization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)
bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV
ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations
for (i=0iltni++)
b[i]=sqrt(a[i])
__vsqrt_P7(ban)
Loop vectorization was performed
Transformation report
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Software-controlled data prefetching for POWER7
Software control over POWER7 prefetch engine supporting up to 12 data streams
Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above
ndashMore aggressive exploitation under option ndash qprefetch=aggressive
Global analysis for coarse grained prefetch engine control at optimization level -O5
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)
Partial cache line touchvoid __partial_dcbt(void address)
Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)
Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count
depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth
stream_ID)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Example of POWER7 data prefetching
for (i=0 ilt n i++) a[i] = b[i] +
__protected_store_stream_set(FORWARD ampa 11)
__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)
__protected_stream_set(FORWARD ampb 0)
__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)
__eieio()
__protected_stream_go()
Store stream prefetch for array a
transient stream prefetch for array bStream id
Stream length
Stream direction
Prefetch depth
Start stream prefetch
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Optimization
Traditional unimodular loop transformations for prefect regular loop nests
ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling
ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above
Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and
complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp
bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence
formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)
Enable interchange of j and k loops to improve access locality for b
ndashIdentifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)
Also allows transformation of imperfect loop nests
ndashIntervening code between loops
Only available at -qhot=level=2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Example
Sequence of Imperfect Loop nests
for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip
for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]
InputParallelism amp Locality
Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]
Output
Loop fusionLoop skewing to enable tiling
Loop tiling for cache
Loop skewing forPipeline parallelization
Loop tiling for registers
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Data ReorganizationData reorganization transformations to reduce memory latency
ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout
Enabled at O5
Data Reorganization Report
Seq
Type Phase Data Name
Category Region Line Description
1 ArraySplitting High Level Optimizer
iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 ArrayCoalescing High Level Optimizer
net Global variables were aggregated
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
S[0]F0 S[0]F1 S[0]F2 S[0]F3
S[1]F0 S[1]F1 S[1]F2 S[1]F3
S[2]F0 S[2]F1 S[2]F2 S[2]F3
S[3]F0 S[3]F1 S[3]F2 S[3]F3
hellip hellip hellip hellip
F0[0]
F1[0]
F2[0]
F3[0]
F0[1]
F1[1]
F2[1]
F3[1]
F0[2]
F1[2]
F2[2]
F3[2]
F0[3]
F1[3]
F2[3]
F3[3]
hellip
hellip
hellip
hellip
A[0]
A[1]
A[2]
A[3]
A[0][2] A[0][3]A[0][1]A[0][0]
A[1][2] A[1][3]A[1][1]A[1][0]
A[2][2] A[2][3]A[2][1]A[2][0]
A[3][2] A[3][3]A[3][1]A[3][0]
hellip
hellip
hellip
hellip
A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip
A[0][0] A[0][1] A[0][2] A[0][3]
A[1][0] A[1][1] A[1][2] A[1][3]
A[2][0] A[2][1] A[2][2] A[2][3]
A[3][0] A[3][1] A[3][2] A[3][3]
hellip hellip hellip hellip
hellip
hellip
hellip
hellip
hellip
Arsquo[0][0]
Arsquo[1][0]
Arsquo[2][0]
Arsquo[3][0]
Arsquo[0][1]
Arsquo[1][1]
Arsquo[2][1]
Arsquo[3][1]
Arsquo[0][2]
Arsquo[1][2]
Arsquo[2][2]
Arsquo[3][2]
Arsquo[0][3]
Arsquo[1][3]
Arsquo[2][3]
Arsquo[3][3]
hellip
hellip
hellip
hellip
hellip helliphelliphelliphellip
Array splitting
Array merging Array transposing
Data locality cache utilization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran
ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default
ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
Improved interaction between OpenMP and automatic SIMD
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Automatic parallelizationImprovements to automatic parallelization
ndashMore effective array data flow analysis for array privatization
ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing
SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results -qstrict suboptions
Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
ndashSuboptions allow fine-grain control over this guaranteendashExamples
-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations
Can be combined -qstrict=precisionnonans
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp
copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011
IBM | Software Group | Rational
Fortran Cafe on IBM developerWorks
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Feature RequestRequest for a feature to be supported by our compilers
CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811
Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812
Or send e-mail to xl_featurecaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Documentation
An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package
Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174
Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175
Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc, gxlc++, gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7: enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
SIMDization Tuning
Transformation report: loop structure prevents SIMD vectorization
User actions:
– Convert while-loops into do-loops when possible
– Limit the use of control flow in a loop
– Use MIN and MAX instead of if-then-else
– Eliminate function calls in a loop through inlining
Transformation report: memory accesses have non-vectorizable strides
User actions:
– Loop interchange for stride-one accesses when possible
– Data layout reshape for stride-one accesses
– Higher optimization to propagate compile-time known stride information
– Stride versioning
Transformation report: either operation or data type is not suitable for SIMD vectorization
User actions:
– Do statement splitting and loop splitting
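The "use MIN and MAX instead of if-then-else" action can be sketched in C as follows. This is a minimal illustration, not code from the slides: the function names and the MIN/MAX macros are assumptions, and whether a particular compiler vectorizes either form depends on the options and target.

```c
#include <stddef.h>

/* Illustrative helper macros (not from the slides). */
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Branchy form: the if-then-else chain is control flow inside the loop
   body, which is the kind of structure that can block SIMD vectorization. */
void clamp_branchy(double *x, size_t n, double lo, double hi)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] < lo)
            x[i] = lo;
        else if (x[i] > hi)
            x[i] = hi;
    }
}

/* MIN/MAX form: the loop body is a single straight-line expression.
   For lo <= hi it computes the same values as clamp_branchy. */
void clamp_minmax(double *x, size_t n, double lo, double hi)
{
    for (size_t i = 0; i < n; i++)
        x[i] = MIN(MAX(x[i], lo), hi);
}
```

Both functions give the same result for lo <= hi; the second simply expresses the computation in a shape that maps more directly onto vector min/max operations where the hardware provides them.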
MASS enhancements and Auto-vectorization
MASS enhancements for POWER7
– POWER7 vector MASS library (libmassvp7.a)
  • Internally exploits VSX instructions
  • SP: average speedup of 1.99 vs. POWER5 MASSV
  • DP: average speedup of 1.27 vs. POWER5 MASSV
– POWER7 SIMD MASS library (libmass_simdp7.a)
  • Tuned math routines operating on vector data types
  • Over 35 frequently used mathematical functions
  • Both single and double precision
  • To be used in conjunction with explicit SIMD programming
Auto-vectorization at optimization level -O3 or above
– -qstrict=vectorprecision to maintain precision over all loop iterations
Example: the loop
    for (i=0; i<n; i++)
        b[i] = sqrt(a[i]);
is replaced by the call
    __vsqrt_P7(b, a, n);
Transformation report: Loop vectorization was performed
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, supporting up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride and prefetch depth, at optimization level -O3 -qhot or above
– More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
    void __dcbtt(void *address);
    void __dcbtstt(void *address);
Partial cache line touch
    void __partial_dcbt(void *address);
Stride-N stream prefetch
    void __protected_stream_stride(offset, stride, stream_ID);
Transient stream prefetch
    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID);
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID);
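A minimal sketch of how the transient cache line touch might be used for data that is read only once. Everything except the __dcbtt name is an assumption made for illustration: the wrapper, the function names and the prefetch distance are hypothetical, and the compiler options needed to enable the built-in are not shown. On compilers other than IBM XL the wrapper compiles to nothing, so the code stays portable.

```c
#include <stddef.h>

/* Hypothetical wrapper: issues the transient touch only when built with
   IBM XL C/C++ (detected through the __IBMC__ / __IBMCPP__ macros). */
static void touch_transient(const void *p)
{
#if defined(__IBMC__) || defined(__IBMCPP__)
    __dcbtt((void *)p);
#else
    (void)p;   /* no-op elsewhere */
#endif
}

enum { PREFETCH_AHEAD = 16 };  /* assumed distance in elements, not tuned */

/* Sums an array that is streamed through exactly once, hinting that the
   touched cache lines will not be reused. */
double sum_once(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            touch_transient(&x[i + PREFETCH_AHEAD]);
        s += x[i];
    }
    return s;
}
```

The hint never changes the result of the loop; any benefit is purely a matter of memory latency and would need to be measured on the target machine.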
Example of POWER7 data prefetching
    /* Store stream prefetch for array a: stream direction FORWARD, stream id 11 */
    __protected_store_stream_set(FORWARD, &a, 11);
    /* Stream length, prefetch depth DEEPER, stream id 11 */
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b: stream id 0 */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    /* Start stream prefetch */
    __protected_stream_go();
    for (i=0; i<n; i++) a[i] = b[i] +
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
– Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing and loop unrolling
– Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
– Under the optimization level -O3, -O3 -qhot or above
Polyhedral framework for any loop nests
– Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  • Loop skewing, loop tiling for triangular loop shapes
– Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)
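To make "loop tiling" concrete, here is a hand-written C sketch of the transformation applied to a matrix transpose. It only illustrates the shape of a tiled loop nest; it is not the compiler's output, and the function names and the tile size are assumptions chosen for the example.

```c
#include <stddef.h>

enum { TILE = 4 };  /* assumed tile size; a real choice depends on the cache */

/* Untiled version: out is the transpose of the n-by-n row-major matrix in.
   One of the two arrays is always walked with stride n. */
void transpose_naive(size_t n, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* Tiled version: the iteration space is cut into TILE-by-TILE blocks so
   that the strided accesses stay within a small working set. */
void transpose_tiled(size_t n, const double *in, double *out)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clip the last tile when n is not a multiple of TILE. */
            size_t i_end = ii + TILE < n ? ii + TILE : n;
            size_t j_end = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    out[j * n + i] = in[i * n + j];
        }
}
```

Both versions perform exactly the same assignments, only in a different order, which is why tiling is legal here without any dependence concerns.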
Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1,N
      do j = 1,i
        do k = i+1,N
          c(k) = c(j) + b(j,k)
Enables interchange of the j and k loops to improve access locality for b
– Identifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplication
    do i = 1,N
      do j = i+1,N
        do k = i+1,N
          c(j,i) += a(k,i)*b(j,k)
Also allows transformation of imperfect loop nests
– Intervening code between loops
Only available at -qhot=level=2
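The first example above can be checked by hand: the loop bounds guarantee j <= i < k, so within one i iteration the elements of c that are written (index k) never overlap the ones that are read (index j), and the j and k loops may be swapped. The C sketch below mirrors the Fortran nest with 1-based indices; the function names, the fixed size N and the b[k][j] storage (chosen so that j is the contiguous index, as it is for Fortran b(j,k)) are assumptions made for the illustration.

```c
enum { N = 8 };  /* assumed problem size for the illustration */

/* Original loop order from the slide: i, j, k.
   The innermost loop walks b with a stride of one row. */
void nest_ijk(double c[N + 1], double b[N + 1][N + 1])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= N; k++)
                c[k] = c[j] + b[k][j];
}

/* Interchanged order: i, k, j.
   The innermost loop now walks b with stride one. */
void nest_ikj(double c[N + 1], double b[N + 1][N + 1])
{
    for (int i = 1; i <= N; i++)
        for (int k = i + 1; k <= N; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + b[k][j];
}
```

Because reads of c use indices up to i and writes use indices above i, both orders leave c in the same state after every iteration of the outer loop.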
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Transformations for parallelism and locality
– Loop fusion
– Loop skewing to enable tiling
– Loop tiling for cache
– Loop skewing for pipeline parallelization
– Loop tiling for registers
Output: optimized loops
    for jT ...
      omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Data Reorganization
Data reorganization transformations to reduce memory latency
– Context-sensitive and context-insensitive safety analysis
– Data affinity analysis and shape analysis
– Data splitting, data transposing and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report (columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description)
Seq 1: ArraySplitting, High Level Optimizer, iplus, 9. An array of a large aggregated data-type was split into multiple arrays of smaller data-types
Seq 27: ArrayCoalescing, High Level Optimizer, net. Global variables were aggregated
[Figure: data reorganization for data locality and cache utilization, in three panels]
– Array splitting: an array of structures S[0], S[1], S[2], S[3], ... with fields F0 to F3 is split into separate arrays F0[], F1[], F2[], F3[]
– Array merging: rows A[0], A[1], A[2], A[3], ... are merged into one array A[i][j]
– Array transposing: array A[i][j] is stored in transposed layout as A'
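The array splitting panel corresponds to turning an array of structures into one array per field. The compiler performs this automatically when it can prove it safe; the C sketch below does it by hand only to show the two layouts. The type and function names are assumptions, and the field names follow the F0 to F3 labels in the figure.

```c
#include <stddef.h>

/* Original layout: one array of records, the four fields interleaved. */
struct record { double F0, F1, F2, F3; };

/* Split layout: one array per field, storage supplied by the caller. */
struct split_fields { double *F0, *F1, *F2, *F3; };

/* Copy n records into the split layout. */
void split_records(const struct record *s, size_t n, struct split_fields out)
{
    for (size_t i = 0; i < n; i++) {
        out.F0[i] = s[i].F0;
        out.F1[i] = s[i].F1;
        out.F2[i] = s[i].F2;
        out.F3[i] = s[i].F3;
    }
}

/* Interleaved layout: only one double in four is used, so three quarters
   of every cache line fetched is wasted. */
double sum_F0_records(const struct record *s, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i].F0;
    return sum;
}

/* Split layout: the same sum is a stride-one walk over a dense array. */
double sum_F0_split(struct split_fields f, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += f.F0[i];
    return sum;
}
```

A loop that uses all four fields of each record together would not benefit from the split, which is why the slides pair the transformation with data affinity analysis.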
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation on C, C++ and Fortran
– Full OpenMP task parallelization
– Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
– Bypasses the expensive pthread key mechanism
– pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
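For readers unfamiliar with threadprivate, the following C sketch shows the kind of code whose cost depends on how the runtime implements per-thread storage. It uses only standard OpenMP directives; the variable and function names are assumptions. Built without OpenMP support the pragmas are ignored and the function runs serially with the same result.

```c
/* Each thread gets its own copy of this file-scope variable. */
static int counter = 0;
#pragma omp threadprivate(counter)

/* Counts loop iterations per thread in the threadprivate variable and
   combines the per-thread counts with a reduction. Returns n. */
int count_iterations(int n)
{
    int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        counter = 0;                 /* reset this thread's copy */
        #pragma omp for
        for (int i = 0; i < n; i++)
            counter++;               /* access to thread-local storage */
        total += counter;
    }
    return total;
}
```

Every access to counter inside the loop goes through the thread-local storage mechanism, so a cheaper mechanism directly reduces the overhead of loops like this one.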
Automatic parallelization
Improvements to automatic parallelization
– More effective array data flow analysis for array privatization
– Automatic privatization of descriptor-based Fortran arrays
– Runtime dependence testing
SMP runtime improvements
– Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs
Some suboptions of interest
– spins and yields, to define the behavior of idle threads
– Thread binding using the startproc and stride suboptions
– New bind suboption on AIX:
    bind=SDL=<start resource>,<number of resources>,<stride>
    bindlist=SDL=i0,i1,...,ix
– schedule, to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  • Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
– Simplifies inlining control for the programmer
-qinline=level=X, X = 0 to 10 (default 5)
– Consistent across all languages and optimization levels
– Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
– Previously only available at -O5 or for user inlining in C++
– Available at all levels of -qhot (default at -O3 and up)
– Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
– Precision of floating-point computation
– Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
– Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
– Suboptions allow fine-grained control over this guarantee
– Examples:
    -qstrict=precision         Strict FP precision
    -qstrict=exceptions        Strict FP exceptions
    -qstrict=ieeefp            Strict IEEE FP implementation
    -qstrict=nans              Strict handling and computation of NaNs
    -qstrict=order             Do not modify evaluation order
    -qstrict=vectorprecision   Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
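Why evaluation order matters can be seen with three doubles. The C sketch below is an illustration only (the function names are assumptions); it forces the two groupings that a reordering optimization could choose between and shows that they are not interchangeable in IEEE double precision.

```c
/* (a + b) + c: the grouping written in the source.
   volatile keeps the intermediate in a 64-bit double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c): the grouping a reassociating optimizer might pick. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e16, b = -1e16 and c = 1.0, sum_left returns 1.0, while sum_right returns 0.0 because -1e16 + 1.0 rounds back to -1e16. Suboptions such as -qstrict=order exist to rule out this kind of change.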
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
Software-controlled data prefetching for POWER7
Software control over the POWER7 prefetch engine, which supports up to 12 data streams
Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride, and prefetch depth, at optimization level -O3 -qhot or above
- More aggressive exploitation under option -qprefetch=aggressive
Global analysis for coarse-grained prefetch engine control at optimization level -O5
Built-in functions for POWER7 data prefetching and cache control
Transient cache line touch
- void __dcbtt(void *address)
- void __dcbtstt(void *address)
Partial cache line touch
- void __partial_dcbt(void *address)
Stride-N stream prefetch
- void __protected_stream_stride(offset, stride, stream_ID)
Transient stream prefetch
- void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
- void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
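Example of POWER7 data prefetching
    /* Store stream prefetch for array a (stream ID 11) */
    __protected_store_stream_set(FORWARD, &a, 11);
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
    /* Transient stream prefetch for array b (stream ID 0) */
    __protected_stream_set(FORWARD, &b, 0);
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
    __eieio();
    __protected_stream_go();    /* Start stream prefetch */
    for (i = 0; i < n; i++) a[i] = b[i] + [...]
Arguments shown in the example
- Stream direction: FORWARD
- Stream length: n*sizeof(double)/128, the array size in 128-byte cache lines
- Prefetch depth: DEEPER
- Stream ID: the last argument of each call (11 for a, 0 for b)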
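The slide's fragment can be wrapped into a complete function, sketched here under several assumptions. The function name add_const, the added constant k, the numeric encodings chosen for FORWARD and DEEPER, and the guard macros __IBMC__ and _ARCH_PWR7 are not taken from the slide and should be checked against the XL built-in reference for the compiler release in use. On any other compiler the prefetch calls are compiled out and only the plain loop remains.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encodings for the slide's symbolic names (not from the slide). */
#define FORWARD 1u    /* assumed value for "forward" stream direction */
#define DEEPER  6u    /* assumed value for "deeper" prefetch depth */
#define LINE    128u  /* POWER7 cache line size in bytes */

/* Assumed guard: use the built-ins only with XL C targeting POWER7. */
#if defined(__IBMC__) && defined(_ARCH_PWR7)
#define HAVE_P7_STREAMS 1
#else
#define HAVE_P7_STREAMS 0
#endif

/* Stores b[i] + k into a[i]; describes both streams to the prefetch
   engine first when the built-ins are available. */
void add_const(double *a, const double *b, size_t n, double k)
{
    size_t i;
    assert(a != NULL && b != NULL);
#if HAVE_P7_STREAMS
    {
        /* Stream length in cache lines, as on the slide. */
        unsigned int lines = (unsigned int)(n * sizeof(double) / LINE);
        __protected_store_stream_set(FORWARD, a, 11);       /* store stream for a */
        __protected_stream_count_depth(lines, DEEPER, 11);
        __protected_stream_set(FORWARD, b, 0);              /* load stream for b */
        __transient_protected_stream_count_depth(lines, DEEPER, 0);
        __eieio();
        __protected_stream_go();                            /* start prefetching */
    }
#endif
    for (i = 0; i < n; i++)
        a[i] = b[i] + k;
}
```

The result of the loop does not depend on whether the prefetch calls are present; they are hints that only affect memory latency.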
Loop Optimization
Traditional unimodular loop transformations for perfect, regular loop nests
- Compiler loop transformations, including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, and loop unrolling
- Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
- At optimization level -O3, -O3 -qhot, or above
Polyhedral framework for any loop nests
- Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  - Loop skewing and loop tiling for triangular loop shapes
- Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both the traditional and polyhedral frameworks, at all hot levels (-O3 -qhot or above)
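To make loop tiling concrete, the following hand-written sketch shows the kind of rewrite that tiling performs on a simple loop nest. It is an illustration only, not the compiler's output: the function names, the matrix transpose kernel, and the tile size TILE are hypothetical choices.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 4 };  /* hypothetical tile size, chosen for illustration */

/* Reference loop nest: dst[j][i] = src[i][j] for n x n row-major arrays. */
void transpose_naive(double *dst, const double *src, size_t n)
{
    size_t i, j;
    assert(dst != src);
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled loop nest: the same iterations, visited one TILE x TILE block at
   a time so that the lines touched in src and dst stay cache resident. */
void transpose_tiled(double *dst, const double *src, size_t n)
{
    size_t ii, jj, i, j;
    assert(dst != src);
    for (ii = 0; ii < n; ii += TILE)
        for (jj = 0; jj < n; jj += TILE)
            for (i = ii; i < n && i < ii + TILE; i++)
                for (j = jj; j < n && j < jj + TILE; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both functions write exactly the same values; only the order in which iterations are visited changes, which is why the transformation is legal for this loop nest.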
Polyhedral Loop Transformation Examples
Dependence analysis
    do i = 1, N
      do j = 1, i
        do k = i+1, N
          c(k) = c(j) + b(j,k)
Enables interchange of the j and k loops to improve access locality for b
- Identifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplication
    do i = 1, N
      do j = i+1, N
        do k = i+1, N
          c(j,i) += a(k,i)*b(j,k)
Also allows transformation of imperfect loop nests
- Intervening code between loops
Only available at -qhot=level=2
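The dependence-analysis example above can be checked by hand with a small C sketch. The function names, the 0-based indices, and the macro B that mimics Fortran's column-major layout are assumptions made for this illustration. Within one iteration of i, the loop reads c(j) only for j <= i and writes c(k) only for k > i, so the j and k loops can be swapped without changing the result.

```c
#include <assert.h>
#include <stddef.h>

/* Fortran-style column-major access to element (j,k) of an n x n array. */
#define B(j, k) b[(k) * n + (j)]

/* Loop order as written on the slide: k innermost, so consecutive
   accesses to B are n elements apart. */
void kernel_original(double *c, const double *b, size_t n)
{
    size_t i, j, k;
    assert(c != NULL && b != NULL);
    for (i = 0; i < n; i++)
        for (j = 0; j <= i; j++)
            for (k = i + 1; k < n; k++)
                c[k] = c[j] + B(j, k);
}

/* j and k interchanged: j innermost, so B is read with unit stride. */
void kernel_interchanged(double *c, const double *b, size_t n)
{
    size_t i, j, k;
    assert(c != NULL && b != NULL);
    for (i = 0; i < n; i++)
        for (k = i + 1; k < n; k++)
            for (j = 0; j <= i; j++)
                c[k] = c[j] + B(j, k);
}
```

For each k, the j iterations still run in increasing order in both versions, so the last value stored to c[k] is the same.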
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
    for j ...
      for k ...
        A[j] = B[k]
        for i ...
          C[i][k] = ...
    for j ...
      for k ...
        M[j] = N[k]
        for i ...
          X = C[i][k]
Output: optimized loops (parallelism and locality)
    for jT ...
      omp parallel for
      for kT ...
        for j ...
          for k ...
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = ...
              X = C[i][k]
Transformations applied
- Loop fusion
- Loop skewing to enable tiling
- Loop tiling for cache
- Loop skewing for pipeline parallelization
- Loop tiling for registers
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context-sensitive and context-insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing, and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description
- 1, ArraySplitting, High Level Optimizer, iplus, 9: An array of a large aggregated data-type was split into multiple arrays of smaller data-types
- 27, ArrayCoalescing, High Level Optimizer, net: Global variables were aggregated
[Figure: data layouts before and after reorganization, aimed at data locality and cache utilization. Array splitting: an array of structures S[0], S[1], ... with fields F0 to F3 becomes separate arrays F0[], F1[], F2[], F3[]. Array merging: rows A[0] to A[3], each holding A[i][0] to A[i][3]. Array transposing: A[i][j] becomes A'[j][i].]
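Array splitting can be illustrated with a small hand-written sketch. The type names, the element count N, and the helper functions below are hypothetical; the compiler performs the equivalent change of layout internally when its safety analysis allows it.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 8 };  /* hypothetical element count */

/* Original layout: array of structures, S[i].F0 .. S[i].F3. */
struct rec { double f0, f1, f2, f3; };

/* Split layout: one contiguous array per field, F0[i] .. F3[i]. */
struct rec_split { double f0[N], f1[N], f2[N], f3[N]; };

/* Copies array-of-structures data into the split layout. */
void split(struct rec_split *out, const struct rec *in)
{
    size_t i;
    assert(out != NULL && in != NULL);
    for (i = 0; i < N; i++) {
        out->f0[i] = in[i].f0;
        out->f1[i] = in[i].f1;
        out->f2[i] = in[i].f2;
        out->f3[i] = in[i].f3;
    }
}

/* Sums field F0 in the original layout: only one of every four doubles
   brought into the cache is used. */
double sum_f0_aos(const struct rec *s)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < N; i++)
        sum += s[i].f0;
    return sum;
}

/* Sums field F0 in the split layout: every double fetched is used. */
double sum_f0_split(const struct rec_split *s)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < N; i++)
        sum += s->f0[i];
    return sum;
}
```

A loop that touches a single field reads contiguous memory in the split layout, which is the locality benefit the figure refers to.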
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation for C, C++, and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
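As a minimal sketch of OpenMP 3.0 task parallelization, the following C code spawns one task per recursive call. The Fibonacci kernel and the function names are illustrative choices, not taken from the slides. If the file is compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

/* Recursive Fibonacci: one branch runs as a child task, the other in
   the current task; taskwait joins them before the sum is formed. */
static long fib_task(int n)
{
    long x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x) firstprivate(n)
    x = fib_task(n - 1);
    y = fib_task(n - 2);
    #pragma omp taskwait
    return x + y;
}

/* Entry point: one thread of the team creates the root of the task
   tree, and the other threads pick up the generated tasks. */
long fib(int n)
{
    long result = 0;
    assert(n >= 0);
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

A production version would stop creating tasks below some problem size, since a task per call is far too fine-grained to pay for its scheduling cost.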
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime dependence testing
SMP runtime improvements
- Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable lets you tune the runtime behavior of OpenMP and automatically parallelized programs
Some suboptions of interest
- spins and yields define the behavior of idle threads
- Thread binding using the startproc and stride suboptions
- New bind suboptions on AIX: bind=SDL=<start resource>,<number of resources>,<stride> and bindlist=SDL=i0,i1,...,ix
- schedule defines the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  - Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
- Simplifies inlining control for the programmer
-qinline=level=X, X=0..10 (default 5)
- Consistent across all languages and optimization levels
- Previous mechanisms still available
User inline control with -qinline+<function_name> or -qinline-<function_name>
Automatic inlining before loop optimization
- Previously only available at -O5 or for user inlining in C++
- Available at all levels of -qhot (default at -O3 and up)
- Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
- Suboptions allow fine-grained control over this guarantee
- Examples
    -qstrict=precision          Strict FP precision
    -qstrict=exceptions         Strict FP exceptions
    -qstrict=ieeefp             Strict IEEE FP implementation
    -qstrict=nans               Strict generation and computation of NaNs
    -qstrict=order              Do not modify evaluation order
    -qstrict=vectorprecision    Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
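The reason evaluation order matters can be shown with a small C sketch. The function names are hypothetical, and the volatile temporaries are only there to keep a compiler from simplifying the demonstration away. The two functions add the same three values in a different association, which is the kind of change an optimizer may make when strict ordering is not requested.

```c
#include <assert.h>

/* Evaluates (a + b) + c, the order written in the source. */
double sum_in_order(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Evaluates a + (b + c), the same terms reassociated. */
double sum_reassociated(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With IEEE double precision, sum_in_order(1e20, -1e20, 1.0) returns 1.0, while sum_reassociated(1e20, -1e20, 1.0) returns 0.0, because 1.0 is lost when it is first added to -1e20.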
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family": http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++, or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Example of POWER7 data prefetching
for (i=0 ilt n i++) a[i] = b[i] +
__protected_store_stream_set(FORWARD ampa 11)
__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)
__protected_stream_set(FORWARD ampb 0)
__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)
__eieio()
__protected_stream_go()
Store stream prefetch for array a
transient stream prefetch for array bStream id
Stream length
Stream direction
Prefetch depth
Start stream prefetch
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Loop Optimization
Traditional unimodular loop transformations for prefect regular loop nests
ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling
ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above
Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and
complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp
bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence
formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)
Enable interchange of j and k loops to improve access locality for b
ndashIdentifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)
Also allows transformation of imperfect loop nests
ndashIntervening code between loops
Only available at -qhot=level=2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Example
Sequence of Imperfect Loop nests
for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip
for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]
InputParallelism amp Locality
Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]
Output
Loop fusionLoop skewing to enable tiling
Loop tiling for cache
Loop skewing forPipeline parallelization
Loop tiling for registers
Data Reorganization
Data reorganization transformations to reduce memory latency
- Context-sensitive and context-insensitive safety analysis
- Data affinity analysis and shape analysis
- Data splitting, data transposing, and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Seq | Type | Phase | Data Name | Category | Region | Line | Description
1 | ArraySplitting | High Level Optimizer | iplus | | | 9 | An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 | ArrayCoalescing | High Level Optimizer | net | | | | Global variables were aggregated
[Figure: data layouts before and after reorganization, for data locality and cache utilization]
- Array splitting: an array of structures S[i] with fields F0 to F3 becomes separate arrays F0[i], F1[i], F2[i], F3[i]
- Array merging: rows A[0] to A[3] with elements A[i][0] to A[i][3]
- Array transposing: A[i][j] becomes A'[j][i]
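The array splitting case can be written out by hand in C to show why it helps locality. The field names follow the F0 to F3 labels of the figure; the type and function names are hypothetical, and the compiler would perform this reshaping internally at -O5 rather than through source changes.

```c
#include <stddef.h>

/* Original layout: array of structures, fields interleaved in memory. */
typedef struct { double f0, f1, f2, f3; } Record;

/* Reading only f0 touches one double out of every four. */
double sum_f0_interleaved(const Record *s, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i].f0;
    return sum;
}

/* Split layout: one contiguous array per field. */
typedef struct { double *f0, *f1, *f2, *f3; } RecordArrays;

/* Copy an array of structures into the split layout. */
void split_records(const Record *s, size_t n, RecordArrays out) {
    for (size_t i = 0; i < n; i++) {
        out.f0[i] = s[i].f0;
        out.f1[i] = s[i].f1;
        out.f2[i] = s[i].f2;
        out.f3[i] = s[i].f3;
    }
}

/* The same reduction now reads consecutive doubles, so every cache
   line that is fetched is fully used. */
double sum_f0_split(const double *f0, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += f0[i];
    return sum;
}
```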
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation for C, C++, and Fortran
- Full OpenMP task parallelization
- Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported thread-local storage (TLS) by default
- Bypasses the expensive pthread key mechanism
- pthread-based implementation available under option control -qsmp=noostls for backward compatibility
Improved interaction between OpenMP and automatic SIMD
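Task parallelism, the main addition in OpenMP 3.0, can be illustrated with the usual recursive Fibonacci sketch in C. The function names are illustrative assumptions; the directives are standard OpenMP 3.0. With XL the file would be built with OpenMP enabled (for example with -qsmp=omp); built without OpenMP support, the directives are ignored and the code runs serially with the same result.

```c
/* Each recursive call becomes a task; taskwait joins the two children
   before their results are combined. */
static long fib_task(int n) {
    long x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x) firstprivate(n)
    x = fib_task(n - 1);
    #pragma omp task shared(y) firstprivate(n)
    y = fib_task(n - 2);
    #pragma omp taskwait
    return x + y;
}

/* One thread of the team creates the root of the task tree; the other
   threads pick up the generated tasks. */
long fib(int n) {
    long result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

A real code would stop creating tasks below some problem size, since a task per call costs far more than the addition it performs.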
Automatic parallelization
Improvements to automatic parallelization
- More effective array data flow analysis for array privatization
- Automatic privatization of descriptor-based Fortran arrays
- Runtime dependence testing
SMP runtime improvements
- Leverages TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
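Array privatization can be illustrated with a hand-written C sketch. In the first function every iteration overwrites the scratch array tmp completely before reading it, so the only thing tying iterations together is the shared storage; giving each iteration its own copy removes that tie. The names, the row width, and the explicit OpenMP directive (which stands in for the decision an auto-parallelizer would make) are assumptions made for illustration.

```c
#include <stddef.h>

enum { WIDTH = 4 };   /* hypothetical row width */

/* Shared scratch array: iterations are independent in value, but they
   all reuse the same tmp storage. */
void mirror_rows_shared_tmp(size_t rows, const double *in, double *out) {
    double tmp[WIDTH];
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < WIDTH; c++)
            tmp[c] = 2.0 * in[r * WIDTH + c];
        for (size_t c = 0; c < WIDTH; c++)
            out[r * WIDTH + c] = tmp[c] + tmp[WIDTH - 1 - c];
    }
}

/* Privatized by hand: each iteration owns its tmp, so rows can be
   processed concurrently. */
void mirror_rows_private_tmp(size_t rows, const double *in, double *out) {
    #pragma omp parallel for
    for (size_t r = 0; r < rows; r++) {
        double tmp[WIDTH];
        for (size_t c = 0; c < WIDTH; c++)
            tmp[c] = 2.0 * in[r * WIDTH + c];
        for (size_t c = 0; c < WIDTH; c++)
            out[r * WIDTH + c] = tmp[c] + tmp[WIDTH - 1 - c];
    }
}
```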
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and auto-parallelized programs
Some suboptions of interest
- spins and yields, to define the behavior of idle threads
- Thread binding using the startproc and stride suboptions
- New bind suboption on AIX:
  bind=SDL=<start resource>,<number of resources>,<stride>
  bindlist=SDL=i0,i1,…ix
- schedule, to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  - Note that the default schedule has changed from runtime to auto in V11/V13
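On the source side, a loop only picks up a schedule chosen at run time if it defers that choice. The C sketch below uses the standard OpenMP schedule(runtime) clause for this; the function name is an illustrative assumption. Whether the XL runtime then takes the algorithm from the XLSMPOPTS schedule suboption or from OMP_SCHEDULE for such a loop is not stated on the slide and should be checked in the compiler documentation.

```c
#include <assert.h>

/* Sum of squares 0^2 + 1^2 + ... + (n-1)^2. The schedule clause defers
   the choice of static, dynamic, or guided scheduling to the runtime
   environment instead of fixing it in the source. */
long sum_squares(long n) {
    long total = 0;
    assert(n >= 0);
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += i * i;
    return total;
}
```

The reduction clause keeps the result independent of the schedule chosen, so the function can be used to compare scheduling settings without changing the answer.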
Inlining
Single control knob to enable inlining
- Simplifies inlining control for the programmer
-qinline=level=X, X = 0 to 10 (default 5)
- Consistent across all languages and optimization levels
- Previous mechanisms still available
User inline control with -qinline+<function_name> or -qinline-<function_name>
Automatic inlining before loop optimization
- Previously only available at -O5 or for user inlining in C++
- Available at all levels of -qhot (default at -O3 and up)
- Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
- Precision of floating-point computation
- Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
- Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
- Suboptions allow fine-grained control over this guarantee
- Examples:
  -qstrict=precision        Strict FP precision
  -qstrict=exceptions       Strict FP exceptions
  -qstrict=ieeefp           Strict IEEE FP implementation
  -qstrict=nans             Strict generation and computation of NaNs
  -qstrict=order            Do not modify evaluation order
  -qstrict=vectorprecision  Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
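Why evaluation order matters can be seen in a two-line C sketch. The function names and the input values are illustrative assumptions; the point is only that floating-point addition is not associative, so an optimizer that reorders a sum can change the result. This is the kind of change that the slide describes -qstrict=order as ruling out.

```c
/* Two mathematically equal ways to add three numbers. */
double sum_left_to_right(double a, double b, double c) {
    return (a + b) + c;
}

double sum_right_to_left(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e16, b = -1e16, and c = 0.5 in IEEE double precision, the first form cancels the large terms first and keeps the 0.5, while in the second form the 0.5 is absorbed by rounding when it is added to -1e16, and the final result is 0.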
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
- http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
- Now downloadable as a fully searchable package
Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family", available at http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center, or about the existing C, C++, or Fortran documentation shipped with the products, to compinfo@ca.ibm.com
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Examples
Dependence analysis
do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)
Enable interchange of j and k loops to improve access locality for b
ndashIdentifies independence of memory accesses to c
Affects all optimization levels that include -qhot
Loop transformations
Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)
Also allows transformation of imperfect loop nests
ndashIntervening code between loops
Only available at -qhot=level=2
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Polyhedral Loop Transformation Example
Sequence of Imperfect Loop nests
for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip
for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]
InputParallelism amp Locality
Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]
Output
Loop fusionLoop skewing to enable tiling
Loop tiling for cache
Loop skewing forPipeline parallelization
Loop tiling for registers
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Data ReorganizationData reorganization transformations to reduce memory latency
ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout
Enabled at O5
Data Reorganization Report
Seq
Type Phase Data Name
Category Region Line Description
1 ArraySplitting High Level Optimizer
iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27 ArrayCoalescing High Level Optimizer
net Global variables were aggregated
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
S[0]F0 S[0]F1 S[0]F2 S[0]F3
S[1]F0 S[1]F1 S[1]F2 S[1]F3
S[2]F0 S[2]F1 S[2]F2 S[2]F3
S[3]F0 S[3]F1 S[3]F2 S[3]F3
hellip hellip hellip hellip
F0[0]
F1[0]
F2[0]
F3[0]
F0[1]
F1[1]
F2[1]
F3[1]
F0[2]
F1[2]
F2[2]
F3[2]
F0[3]
F1[3]
F2[3]
F3[3]
hellip
hellip
hellip
hellip
A[0]
A[1]
A[2]
A[3]
A[0][2] A[0][3]A[0][1]A[0][0]
A[1][2] A[1][3]A[1][1]A[1][0]
A[2][2] A[2][3]A[2][1]A[2][0]
A[3][2] A[3][3]A[3][1]A[3][0]
hellip
hellip
hellip
hellip
A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip
A[0][0] A[0][1] A[0][2] A[0][3]
A[1][0] A[1][1] A[1][2] A[1][3]
A[2][0] A[2][1] A[2][2] A[2][3]
A[3][0] A[3][1] A[3][2] A[3][3]
hellip hellip hellip hellip
hellip
hellip
hellip
hellip
hellip
Arsquo[0][0]
Arsquo[1][0]
Arsquo[2][0]
Arsquo[3][0]
Arsquo[0][1]
Arsquo[1][1]
Arsquo[2][1]
Arsquo[3][1]
Arsquo[0][2]
Arsquo[1][2]
Arsquo[2][2]
Arsquo[3][2]
Arsquo[0][3]
Arsquo[1][3]
Arsquo[2][3]
Arsquo[3][3]
hellip
hellip
hellip
hellip
hellip helliphelliphelliphellip
Array splitting
Array merging Array transposing
Data locality cache utilization
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran
ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays
Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default
ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
Improved interaction between OpenMP and automatic SIMD
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Automatic parallelizationImprovements to automatic parallelization
ndashMore effective array data flow analysis for array privatization
ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing
SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results -qstrict suboptions
Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
ndashSuboptions allow fine-grain control over this guaranteendashExamples
-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations
Can be combined -qstrict=precisionnonans
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp
copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011
IBM | Software Group | Rational
Fortran Cafe on IBM developerWorks
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Feature RequestRequest for a feature to be supported by our compilers
CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811
Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812
Or send e-mail to xl_featurecaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Documentation
An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package
Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174
Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175
Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries, SPXXL/ScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1 and XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc, gxlc++, gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7 (enabled under -qaltivec)
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
Polyhedral Loop Transformation Example
Input: sequence of imperfect loop nests
for j …
  for k …
    A[j] = B[k]
    for i …
      C[i][k] = …
for j …
  for k …
    M[j] = N[k]
    for i …
      X = C[i][k]
Goal: parallelism and locality
Output: optimized loops
for jT …
  omp parallel for
  for kT …
    for j …
      for k …
        A[j] = B[k]
        M[j] = N[k]
        for i
          C[i][k] = …
          X = C[i][k]
Loop fusion
Loop skewing to enable tiling
Loop tiling for cache
Loop skewing for pipeline parallelization
Loop tiling for registers
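Two of the transformations named on this slide, loop fusion and loop tiling, can be sketched by hand in C. This is only an illustration of the idea, not the compiler's actual output: the function names, the array sizes N and TILE, and the loop bodies are hypothetical, and loop skewing is not shown. The first function keeps a producer nest and a consumer nest separate; the second fuses them and blocks the k and i loops so that each tile of c is produced and consumed while it is still in cache.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 64, TILE = 16 };  /* hypothetical sizes; TILE must divide N */

/* Original form: two separate loop nests. The second nest re-reads
   every element of c after the first nest has finished writing it. */
static void unfused(double c[N][N], const double b[N], double x[N])
{
    for (size_t k = 0; k < N; k++)
        for (size_t i = 0; i < N; i++)
            c[i][k] = b[k] * (double)i;

    for (size_t k = 0; k < N; k++) {
        x[k] = 0.0;
        for (size_t i = 0; i < N; i++)
            x[k] += c[i][k];
    }
}

/* Fused and tiled form: producer and consumer share one loop body,
   and the k and i loops are blocked into TILE x TILE tiles. */
static void fused_tiled(double c[N][N], const double b[N], double x[N])
{
    assert(N % TILE == 0);

    for (size_t k = 0; k < N; k++)
        x[k] = 0.0;

    for (size_t kT = 0; kT < N; kT += TILE)
        for (size_t iT = 0; iT < N; iT += TILE)
            for (size_t k = kT; k < kT + TILE; k++)
                for (size_t i = iT; i < iT + TILE; i++) {
                    c[i][k] = b[k] * (double)i;  /* producer */
                    x[k] += c[i][k];             /* consumer */
                }
}
```

For each k the fused version still accumulates x[k] in increasing i order, so both functions perform the same floating-point additions in the same sequence and should produce identical results.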
Data Reorganization
Data reorganization transformations to reduce memory latency
–Context-sensitive and context-insensitive safety analysis
–Data affinity analysis and shape analysis
–Data splitting, data transposing, and data interleaving to reshape data layout
Enabled at -O5
Data Reorganization Report
Seq  Type             Phase                 Data Name  Category  Region  Line  Description
1    ArraySplitting   High Level Optimizer  iplus                        9     An array of a large aggregated data-type was split into multiple arrays of smaller data-types
27   ArrayCoalescing  High Level Optimizer  net                                Global variables were aggregated
[Figure: data layout transformations for data locality and cache utilization. Panels: array splitting (fields S[i].F0 to S[i].F3 laid out as separate arrays F0[i], F1[i], F2[i], F3[i]), array merging (rows A[0] to A[3] laid out as one array A[i][j]), array transposing (A[i][j] laid out as A'[j][i]).]
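Array splitting, the first transformation in the figure, can be written out by hand as a rough C sketch. The struct names, field names, and the element count COUNT are hypothetical and chosen only for illustration; the compiler performs the equivalent change internally when its safety analysis allows it.

```c
#include <assert.h>
#include <stddef.h>

enum { COUNT = 1024 };  /* hypothetical element count */

/* Array of structures: the four fields of one element are adjacent,
   so a loop that reads only f0 still pulls f1..f3 through the cache. */
struct elem { double f0, f1, f2, f3; };

/* After splitting: one array per field, so all f0 values are contiguous. */
struct split {
    double f0[COUNT], f1[COUNT], f2[COUNT], f3[COUNT];
};

/* Copy an array of structures into the split layout. */
static void split_array(const struct elem *s, struct split *out, size_t n)
{
    assert(n <= COUNT);
    for (size_t i = 0; i < n; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Sum of f0 over the original layout: stride of sizeof(struct elem). */
static double sum_f0_aos(const struct elem *s, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i].f0;
    return sum;
}

/* Sum of f0 over the split layout: unit stride, every cache line fully used. */
static double sum_f0_soa(const struct split *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p->f0[i];
    return sum;
}
```

Both sum functions compute the same value; the split layout simply touches a quarter of the memory when only one field is needed, which is the locality benefit the slide refers to.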
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation for C, C++ and Fortran
–Full OpenMP task parallelization
–Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported thread-local storage (TLS) by default
–Bypasses the expensive pthread key mechanism
–pthread-based implementation available under option control (-qsmp=noostls) for backward compatibility
Improved interaction between OpenMP and automatic SIMD
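The two OpenMP features highlighted above, threadprivate data and tasks, look roughly like this in C. The function and variable names are hypothetical. The sketch assumes OpenMP is enabled at compile time (for example -qsmp=omp with the XL compilers or -fopenmp with GCC); without it the pragmas are ignored and the code runs serially with the same results.

```c
#include <assert.h>

/* One copy of counter per thread. With TLS-based threadprivate, each
   access is an ordinary thread-local load or store. */
static int counter = 0;
#pragma omp threadprivate(counter)

/* Each thread counts the iterations it executed in its own copy of
   counter; the reduction adds the per-thread counts together. */
static long count_iterations(int n)
{
    long total = 0;
    #pragma omp parallel reduction(+:total)
    {
        counter = 0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            counter++;      /* no synchronization: per-thread copy */
        total += counter;
    }
    return total;
}

/* OpenMP 3.0 tasks: each recursive call becomes a task that any
   thread of the enclosing team may execute. */
static long fib(int n)
{
    long a, b;
    if (n < 2)
        return n;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}
```

In real code fib would normally be called from inside a parallel region under a single construct so that several threads can pick up the tasks, and a serial cutoff for small n would be added to limit task overhead.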
Automatic parallelization
Improvements to automatic parallelization
–More effective array data flow analysis for array privatization
–Automatic privatization of descriptor-based Fortran arrays
–Runtime dependence testing
SMP runtime improvements
–Leverage of TLS in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
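Array privatization is easier to see in a small example. In the hypothetical C function below, the scratch array tmp is fully written before it is read in every iteration of the outer loop, so no value is carried from one iteration to the next through it. An analysis that proves this can give each thread its own copy of tmp and run the outer loop in parallel. The OpenMP pragma here only spells out by hand the decision the automatic parallelizer would have to reach on its own; the names and sizes are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 128, COLS = 8 };  /* hypothetical sizes */

/* Computes the sum of squares of each row of in. */
static void row_sums_of_squares(double in[ROWS][COLS], double out[ROWS])
{
    double tmp[COLS];  /* scratch array reused by every outer iteration */

    /* private(tmp) states explicitly what array data flow analysis
       would need to prove: tmp is written before it is read in each
       iteration, so every thread can safely use its own copy. */
    #pragma omp parallel for private(tmp)
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++)
            tmp[c] = in[r][c] * in[r][c];

        double s = 0.0;
        for (int c = 0; c < COLS; c++)
            s += tmp[c];
        out[r] = s;
    }
}
```

If tmp were shared between threads, iterations running concurrently would overwrite each other's scratch values; privatizing it removes that conflict without changing the serial result.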
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and automatically parallelized programs
Some suboptions of interest
–spins and yields to define the behavior of idle threads
–Thread binding using the startproc and stride suboptions
–New bind suboption on AIX: bind=SDL=<start resource>,<number of resources>,<stride> and bindlist=SDL=i0,i1,…,ix
–schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
• Note that the default schedule has changed from runtime to auto in V11/V13
Inlining
Single control knob to enable inlining
–Simplifies inlining control for the programmer
-qinline=level=X, X = 0..10 (default 5)
–Consistent across all languages and optimization levels
–Previous mechanisms still available
User inline control with -qinline+|-<function_name>
Automatic inlining before loop optimization
–Previously only available at -O5 or for user inlining in C++
–Available at all levels of -qhot (default at -O3 and up)
–Enables early inlining of Fortran module procedures
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
–Precision of floating-point computation
–Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
–Use of alternate math libraries
-qstrict guarantees results identical to noopt, at the expense of optimization
–Suboptions allow fine-grained control over this guarantee
–Examples
-qstrict=precision  Strict FP precision
-qstrict=exceptions  Strict FP exceptions
-qstrict=ieeefp  Strict IEEE FP implementation
-qstrict=nans  Strict generation and computation of NaNs
-qstrict=order  Do not modify evaluation order
-qstrict=vectorprecision  Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
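A small C sketch shows why evaluation order can matter for floating-point results, the kind of effect the order suboption is described as guarding against. The function names and input values are hypothetical. Both functions compute a + b + c; they differ only in which addition is done first, which is exactly the freedom an optimizer uses when it reassociates an expression. The volatile temporaries are there only to stop the C compiler from rearranging the sketch itself.

```c
#include <assert.h>

/* Left-to-right evaluation, as the expression a + b + c is written. */
static double sum_as_written(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Reassociated evaluation: a + (b + c). Equal in exact arithmetic,
   but not always equal in IEEE floating point. */
static double sum_reassociated(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first function cancels the large terms and returns 1.0, while the second loses c when it is added to -1e20 (1.0 is far below the rounding granularity at that magnitude) and returns 0.0.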
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
–http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
–http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
User Explicit Parallelization with OpenMP
Full OpenMP 3.0 implementation for C, C++ and Fortran
– Full OpenMP task parallelization
– Privatization of Fortran descriptor-based arrays
Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default
– Bypasses the expensive pthread key mechanism
– pthread-based implementation available under option control (-qsmp=noostls) for backward compatibility
Improved interaction between OpenMP and automatic SIMD
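
As an illustration of the two OpenMP features named on this slide, the following is a minimal sketch (not taken from the original deck) that combines OpenMP 3.0 tasks with a threadprivate variable. The function names task_sum, parallel_sum and calls_seen_by_this_thread, and the cutoff of 1024 elements, are hypothetical choices made for the example. The code is written so that it also compiles and gives the same result when OpenMP is disabled and the pragmas are ignored.

```c
#include <stddef.h>

/* Per-thread call counter. With OpenMP enabled, threadprivate gives each
 * thread its own copy; on XL this is presumably backed by TLS by default,
 * as the slide describes. */
static long calls_on_this_thread = 0;
#pragma omp threadprivate(calls_on_this_thread)

/* Hypothetical helper: how many times the calling thread ran task_sum. */
long calls_seen_by_this_thread(void)
{
    return calls_on_this_thread;
}

/* Recursively sum a[lo..hi), spawning one task per half above a cutoff. */
static double task_sum(const double *a, size_t lo, size_t hi)
{
    calls_on_this_thread++;

    if (hi - lo <= 1024) {              /* small range: sum serially */
        double s = 0.0;
        for (size_t i = lo; i < hi; i++)
            s += a[i];
        return s;
    }

    size_t mid = lo + (hi - lo) / 2;
    double left = 0.0, right = 0.0;

    #pragma omp task shared(left)
    left = task_sum(a, lo, mid);

    #pragma omp task shared(right)
    right = task_sum(a, mid, hi);

    #pragma omp taskwait                /* wait for both child tasks */
    return left + right;
}

/* Entry point: one thread creates the root of the task tree,
 * the rest of the team executes the generated tasks. */
double parallel_sum(const double *a, size_t n)
{
    double total = 0.0;

    #pragma omp parallel
    {
        #pragma omp single
        total = task_sum(a, 0, n);
    }
    return total;
}
```

With an XL compiler this would typically be built with a thread-safe invocation and OpenMP enabled (for example something like xlc_r -qsmp=omp; treat the exact command line as an assumption and check the compiler reference for your release).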
Automatic parallelization
Improvements to automatic parallelization
– More effective array data flow analysis for array privatization
– Automatic privatization of descriptor-based Fortran arrays
– Runtime dependence testing
SMP runtime improvements
– TLS leveraged in the SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
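
To make "array privatization" concrete, here is a small sketch of the kind of loop it targets. The function smooth_rows and the constant NCOLS are hypothetical names for this example. Every iteration of the outer loop fully writes the scratch array tmp before reading it, so no value in tmp flows from one row to the next; a compiler that proves this can give each thread a private copy of tmp and run the rows in parallel. Whether a given XL release actually parallelizes this loop under -qsmp depends on its analysis and profitability heuristics, so treat this as a candidate, not a guarantee.

```c
#include <stddef.h>

#define NCOLS 8

/* Smooth each row of an nrows x NCOLS matrix through a scratch buffer.
 * tmp is written in full before it is read in every outer iteration,
 * which is what makes it privatizable. */
void smooth_rows(double *restrict out, const double *restrict in, size_t nrows)
{
    double tmp[NCOLS];

    for (size_t r = 0; r < nrows; r++) {
        /* Phase 1: fill the scratch buffer for this row. */
        for (size_t c = 0; c < NCOLS; c++)
            tmp[c] = 0.5 * in[r * NCOLS + c];

        /* Phase 2: combine each element with its right-hand neighbour
         * (wrapping around at the end of the row). */
        for (size_t c = 0; c < NCOLS; c++) {
            size_t next = (c + 1) % NCOLS;
            out[r * NCOLS + c] = tmp[c] + tmp[next];
        }
    }
}
```

The restrict qualifiers tell the compiler that out and in do not overlap, which removes one obstacle to proving the rows independent.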
XLSMPOPTS Environment Variable for Runtime Tuning
The XLSMPOPTS environment variable lets you tune the runtime behavior of OpenMP and automatically parallelized programs
Some suboptions of interest
– spins and yields define the behavior of idle threads
– Thread binding using the startproc and stride suboptions
– New bind suboption on AIX:
  bind=SDL=<start resource>,<number of resources>,<stride>
  bindlist=SDL=i0,i1,…,ix
– schedule defines the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  • Note that the default schedule has changed from runtime to auto in V11/V13
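
A short sketch of how the schedule suboption can be exercised from C. Because the default schedule is now auto, a loop presumably has to request schedule(runtime) explicitly for a run-time setting to take effect; that reading, the function name scale, and the example setting shown in the comment are assumptions for illustration, not values from the deck.

```c
#include <stddef.h>

/* Multiply every element of x by factor.
 *
 * schedule(runtime) defers the choice of scheduling algorithm to run time,
 * so it can be changed without recompiling, for example (assumed setting):
 *
 *     export XLSMPOPTS="schedule=dynamic=4:spins=0:yields=0"
 *
 * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
void scale(double *x, size_t n, double factor)
{
    #pragma omp parallel for schedule(runtime)
    for (long i = 0; i < (long)n; i++)
        x[i] *= factor;
}
```

Keeping the schedule out of the source makes it cheap to compare static, dynamic and guided scheduling on the same binary.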
Inlining
Single control knob to enable inlining
– Simplifies inlining control for the programmer
-qinline=level=X, X = 0 to 10 (default 5)
– Consistent across all languages and optimization levels
– Previous mechanisms still available
User inline control with -qinline+<function_name> and -qinline-<function_name>
Automatic inlining before loop optimization
– Previously only available at -O5, or for user inlining in C++
– Available at all levels of -qhot (default at -O3 and up)
– Enables early inlining of Fortran module procedures
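
The value of inlining before loop optimization is easiest to see on a loop whose body is a call. In the sketch below (function names axpy_elem and axpy are hypothetical), the loop can only be unrolled or SIMDized effectively once the call to axpy_elem has been replaced by its body. The command lines in the comment follow the option forms given on the slide and are assumptions about a typical invocation.

```c
#include <stddef.h>

/* Small helper: a natural inlining candidate. */
static double axpy_elem(double a, double x, double y)
{
    return a * x + y;
}

/* y[i] = a * x[i] + y[i] for i in [0, n).
 *
 * Possible builds (assumed invocations, using the slide's option forms):
 *     xlc -O3 -qinline=level=8 axpy.c      raise the inlining level
 *     xlc -O3 -qinline+axpy_elem axpy.c    request inlining of one function
 *     xlc -O3 -qinline-axpy_elem axpy.c    forbid inlining of one function
 */
void axpy(double a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = axpy_elem(a, x[i], y[i]);
}
```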
Control over optimizations that may affect program results: -qstrict suboptions
Aggressive optimization may affect the results of the program
– Precision of floating-point computation
– Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
– Use of alternate math libraries
-qstrict guarantees results identical to unoptimized (noopt) code, at the expense of optimization
– Suboptions allow fine-grained control over this guarantee
– Examples:
  -qstrict=precision        Strict FP precision
  -qstrict=exceptions       Strict FP exceptions
  -qstrict=ieeefp           Strict IEEE FP implementation
  -qstrict=nans             Strict generation and computation of NaNs
  -qstrict=order            Do not modify evaluation order
  -qstrict=vectorprecision  Maintain precision over all loop iterations
Can be combined: -qstrict=precision:nonans
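
Why evaluation order matters can be shown with three numbers. Floating-point addition is not associative, so an optimizer that reassociates a sum may change the result; the sketch below (hypothetical function names) spells out both orders by hand. It is an illustration of the effect that -qstrict=order is meant to prevent, not a demonstration of what any particular compiler does.

```c
/* Sum in source order: (a + b) + c. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* The same sum after reassociation: a + (b + c). */
double sum_reassociated(double a, double b, double c)
{
    return a + (b + c);
}

/* With a = 1e30, b = -1e30, c = 1.0 under IEEE double arithmetic:
 *   sum_left_to_right gives 1.0  (the large terms cancel first)
 *   sum_reassociated  gives 0.0  (c is absorbed into b before cancelling)
 */
```

Reductions over long loops are the common real-world case: vectorizing or parallelizing a sum changes the order in which partial results are combined, which is the situation -qstrict=vectorprecision addresses.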
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
C/C++ feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005811
Fortran feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005812
Or send e-mail to xl_feature@ca.ibm.com
Documentation
An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
– Now downloadable as a fully searchable package
Whitepaper “Code optimization with the IBM XL Compilers”: http://www-01.ibm.com/support/docview.wss?uid=swg27005174
Whitepaper “Overview of the IBM XL C/C++ and XL Fortran Compiler Family”, available at http://www.ibm.com/support/docview.wss?uid=swg27005175
Please send any comments or suggestions on this information center or about the existing C, C++ or Fortran documentation shipped with the products to compinfo@ca.ibm.com
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Automatic parallelizationImprovements to automatic parallelization
ndashMore effective array data flow analysis for array privatization
ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing
SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation
Enable automatic parallelization with the compiler option -qsmp
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results -qstrict suboptions
Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
ndashSuboptions allow fine-grain control over this guaranteendashExamples
-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations
Can be combined -qstrict=precisionnonans
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp
copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011
IBM | Software Group | Rational
Fortran Cafe on IBM developerWorks
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Feature RequestRequest for a feature to be supported by our compilers
CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811
Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812
Or send e-mail to xl_featurecaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Documentation
An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package
Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174
Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175
Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
XLSMPOPTS Environment Variable for Runtime Tuning
XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix
ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)
bull Note that the default schedule has changed from runtime to auto in V11V13
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results -qstrict suboptions
Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
ndashSuboptions allow fine-grain control over this guaranteendashExamples
-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations
Can be combined -qstrict=precisionnonans
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp
copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011
IBM | Software Group | Rational
Fortran Cafe on IBM developerWorks
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Feature RequestRequest for a feature to be supported by our compilers
CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811
Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812
Or send e-mail to xl_featurecaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Documentation
An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package
Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174
Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175
Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Inlining
Single control knob to enable inliningndashSimplifies inlining control for programmer
-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available
User inline control with ndashqinline+|-ltfunction_namegt
Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results -qstrict suboptions
Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
ndashSuboptions allow fine-grain control over this guaranteendashExamples
-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations
Can be combined -qstrict=precisionnonans
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp
copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011
IBM | Software Group | Rational
Fortran Cafe on IBM developerWorks
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Feature RequestRequest for a feature to be supported by our compilers
CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811
Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812
Or send e-mail to xl_featurecaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Documentation
An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package
Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174
Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175
Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC111 XLF131
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results-qstrict suboptions
- The IBM Rational CC++ Cafeacute on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
-
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Control over optimizations that may affect program results -qstrict suboptions
Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries
-qstrict guarantees identical result to noopt at the expense of optimization
ndashSuboptions allow fine-grain control over this guaranteendashExamples
-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations
Can be combined -qstrict=precisionnonans
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp
copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011
IBM | Software Group | Rational
Fortran Cafe on IBM developerWorks
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Feature RequestRequest for a feature to be supported by our compilers
CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811
Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812
Or send e-mail to xl_featurecaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
Documentation
An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package
Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174
Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175
Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom
copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011
IBM | Software Group | Rational
43
copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others
- High Performance Programming with IBM XL Compilers and Libraries: SPXXL/ScicomP-17 2011 Summer Workshop
- IBM Rational Disclaimer
- Agenda
- Overview of XL Compiler Family
- Major Features of XLC 11.1, XLF 13.1
- HPC Performance Tuning with XL Compilers
- Migration from GCC to IBM XL Compilers
- gxlc gxlc++ gxlC
- XML Compiler Transformation Reports
- Compiler Transformation Report Contents
- XL Compiler Assisted Performance Analysis and Tuning
- Compiler Feedback View
- Slide Number 13
- Basic Block and Call Counter Information
- Cache Miss Information
- Loop information
- Loop Transformation Reports
- Slide Number 18
- Explicit SIMD programming for POWER7: Enabled under -qaltivec
- Automatic SIMDization
- SIMDization Tuning
- SIMDization Tuning
- MASS enhancements and Auto-vectorization
- Software-controlled data prefetching for POWER7
- Built-in functions for POWER7 data prefetching and cache control
- Example of POWER7 data prefetching
- Loop Optimization
- Polyhedral Loop Transformation Examples
- Polyhedral Loop Transformation Example
- Data Reorganization
- Slide Number 33
- User Explicit Parallelization with OpenMP
- Automatic parallelization
- XLSMPOPTS Environment Variable for Runtime Tuning
- Inlining
- Control over optimizations that may affect program results: -qstrict suboptions
- The IBM Rational C/C++ Café on IBM developerWorks
- Fortran Cafe on IBM developerWorks
- Feature Request
- Documentation
- Slide Number 43
The IBM Rational C/C++ Café on IBM developerWorks
ibm.com/rational/cafe/community/ccpp
Fortran Cafe on IBM developerWorks
Feature Request
Request for a feature to be supported by our compilers
- C/C++ feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005811
- Fortran feature request page: http://www.ibm.com/support/docview.wss?uid=swg27005812
- Or send e-mail to xl_feature@ca.ibm.com