
© 2011 IBM Corporation

High Performance Programming with IBM XL Compilers and Libraries

SPXXL/ScicomP-17 2011 Summer Workshop

Yaoqing Gao (ygao@ca.ibm.com)

Raúl E. Silvera (rauls@ca.ibm.com)

IBM Toronto Lab

May 2011

IBM | Software Group | Rational

IBM Rational Disclaimer

© Copyright IBM Corporation 2011. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.



Agenda

Overview of XL Compiler Family

Major Features of XL C/C++ V11.1 and XL Fortran V13.1

Migration from GNU to XL compilers

XML Compiler Transformation Reports

Compiler Optimizations for Performance

– Profile Directed Feedback Optimization
– SIMDization and Vectorization
– Loop Transformations
– Data Prefetch
– Data Reorganization
– Inlining
– Parallelization



Overview of XL Compiler Family

Similar compilation technology used to implement C, C++ and Fortran compilers

Supports AIX, Linux on Power, z/OS (C/C++ only), BlueGene, Cell

Advanced optimization capabilities

– Exploitation and tuning for latest hardware implementations
– Aggressive loop analysis and transformations (unimodular and polyhedral framework)
– Whole program optimization
– SIMD code generation and Vectorization exploitation
– Parallelization (automatic and user-driven through OpenMP)
– Profile-driven optimization



Major Features of XLC 11.1 / XLF 13.1

POWER7 support and exploitation

Language standard conformance

– Full Fortran 2003
– Full OpenMP 3.0
– Additional C++0X features

Optimization enhancements

– Automatic parallelization
– Inlining
– Loop analysis and transformations
– Delinquent load driven optimizations
– Profile-directed feedback

Productivity enhancement

– XML compiler transformation reports

Other features

– Fine-grained strict control
– ProPolice
– func_trace
– Other options and directives



HPC Performance Tuning with XL Compilers

-O3 -qarch=pwr7, or -O3 -qhot -qarch=pwr7, with -qnostrict or -qstrict

Profiling for hot spot detection

– Compiler instrumentation: -qpdf1=level=1|2, -qpdf2
– -pg for gprof/xprofiler; -qlist for tprof
– User-provided profile functions: -qfunctrace

SIMDization

– Automatic SIMDization: -O3 or above with -qsimd
– User explicit SIMD program: -qaltivec

Loop transformations

– Loop transformations: -O3 or above

Whole program optimizations

– -O4 or -O5 for inter-procedural optimization: inlining, code partition, data reorganization

Parallelization

– User explicit parallelization only: -qsmp=omp
– Auto parallelization: -qsmp (-qsmp=auto)
– Polyhedral framework: -qsmp with -qhot=level=2

XML Transformation Reports

– -qlistfmt=xml=all



Migration from GCC to IBM XL Compilers

Compatibility

– Source level
  • Supports many gcc/g++ language extensions and annotations
– Binary level
  • Link objects from gcc/g++ and XL C/C++

gxlc and gxlc++ utilities for compile option mapping

– Controlled by the gxlc.cfg configuration file for option mappings from GCC to XL C/C++
– Modify the contents of gxlc.cfg to meet your specific compilation requirements



gxlc gxlc++ gxlC

{gxlc | gxlc++ | gxlC} -v -Wx,<XL compiler options> <gcc/g++ options> filename

Creating a customized configuration file (use the XLC_USR_CONFIG environment variable to specify the location of your own configuration file)

gxlc.cfg format

abcd gcc_or_g++_option "xlc_or_xlc++_option"

e.g.

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm -qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E



XML Compiler Transformation Reports

Generate compilation reports consumable by other tools

– Enable better visualization and analysis of compiler information
– Help users do manual performance tuning
– Help automatic performance tuning through performance tool integration

Unified report from all compiler subcomponents and analysis

– Compiler options
– Pseudo-sources
– Compiler transformations, including missed opportunities

Consistent support among Fortran and C/C++

Controlled under option

-qlistfmt=xml=inlines     generates inlining information
-qlistfmt=xml=transform   generates loop transformation information
-qlistfmt=xml=data        generates data reorganization information
-qlistfmt=xml=pdf         generates dynamic profiling information
-qlistfmt=xml=all         turns on all optimization content
-qlistfmt=xml=none        turns off all optimization content



Compiler Transformation Report Contents

Program characteristics

– Compiler version
– Date of compilation
– Source file table
– Function table
– Loop table (line number, nest level, iteration count, loop attributes)
– Transformation table (line number, transformation description)
– Pseudocode

Transformations at both high and low-level optimizations

– Intra-procedural transformations
  • Loop transformations
  • Data prefetch
  • Vectorization and SIMDization
  • Parallelization
  • Instruction scheduling
– Inter-procedural transformations
  • Inlining
  • Data reorganization

Profiling information

– Basic block counters
– Call counters
– Cache miss counters



XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performance

– Compiler static analysis and optimization
– User guided optimizations through compiler options and directives
– Automatic compiler optimizations through profile directed feedback

XL Compiler and Tooling Integration

– Compiler feedback view in PTP
– Compiler transformation reports and HPCS toolkit help detect bottlenecks and identify solutions



Compiler Feedback View



Multiple-pass Dynamic Profiling Infrastructure

[Figure: profile-directed feedback flow. SOURCE CODE is compiled and linked with -qpdf1 (static analysis, profile based refinement) to produce the INSTRUMENTED APPLICATION. Running it on SAMPLE INPUTS produces PROFILE DATA. The source is then compiled and linked with -qpdf2 (profile directed optimizations) to produce the OPTIMIZED APPLICATION.]

– Hardware and software constraints
– Multiple sample runs for different hardware performance events
– Profile based instrumentation refinement



Basic Block and Call Counter Information

Step 1: compile the application with -qpdf1 to generate an instrumented executable

Step 2: run the executable with a typical input data set to gather profiling information

Step 3: re-compile the application with -qpdf2 -qlistfmt=xml=all to generate the optimized executable and the XML compiler transformation report

Call Counter Information

Region 1
Region Execution Count: 1
Call Coverage: 32 49

Call Counters

Call Name | Call Execution Count | Line
smtctl_ | 0 | 79
is_smt_on_ | 1 | 55
jbind_ | 1 | 109
jbind_ | 0 | 108
aff_ | 1 | 38

Basic Block Counter Information

Region 1
Region Execution Count: 5
Block Coverage: 81 81

Block Counters

Block Index | Block Execution Count | Start Line | End Line
3 | 5 | 1 | 33
4 | 5 | 33 | 34
……



Cache Miss Information

Columns: Memory Reference | Region | Line | Cache Level | Miss Count | Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zfaci5[]rns50[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 408 | 2 | 17446 | 24

((double )((char )d-ztp25addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->ztp25[]rns26[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 412 | 2 | 9999 | 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdqsdtemp5[]rns53[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIV7]
  3 | 413 | 2 | 9974 | 14

((double )((char )d-zcld5addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zcld5[]rns72[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 525 | 2 | 1033429 | 6

((double )((char )d-zdr15addr + max((long long) klon->klonrns00ll) -8ll - 8ll))->zdr15[]rns41[(long long) ktdia->ktdiarns6 + CIV11][(long long) kidia->kidiarns4 + CIVC]
  3 | 557 | 2 | 11553 | 16

Figure callouts: Delinquent load; Source code location; Cache miss



Loop information

Loop Table

Loop Index | Start Line | End Line | Parent Loop Index | Nest Level | Minimum Cost | Maximum Cost | Iteration Count | Attributes
1 | 203 | | | 1 | 19630 | 19630 | 75 (array) | well behaved; bump normalized; lower bound normalized
2 | 188 | | | 1 | 20413 | 20413 | 149 (array) | perfect nest; well behaved; bump normalized; guarded
3 | 141 | | | 1 | 13300 | 13300 | 100 (default) | perfect nest; well behaved; bump normalized; guarded
4 | 203 | | | 1 | 19630 | 19630 | 5 (PDF) | residual; well behaved; bump normalized; guarded

Figure callouts: Source location; Loop iteration count based on static analysis or dynamic profiling



Loop Transformation Reports

Seq | Type | Phase | Region | Line | Loop Index | Description | Attributes
2 | LoopVector (success) | High Level Optimizer | 4 | 217 | 1 | Loop vectorization was performed | not available
3 | LoopFusion (success) | High Level Optimizer | 4 | 108 | 3 | Loops were fused | Loop Line Number 108; Loop Line Number 206
4 | LoopVectorVersion (success) | | | | | Vector versioning was performed | not available
20 | ModuloSchedule (success) | Low Level Optimizer | 12 | 3499 | 26 | Loop was modulo scheduled | Initiation Interval 12



Performance Tuning with Compiler Transformation Reports

Original source file (file.c):

    void foo (float *p, float *q, float *r, int n)
    {
        for (int i = 0; i < n; i++) p[i] = p[i] + q[i]*r[i];
    }

Report from -qlistfmt=xml=all (file.xml): Loop cannot be automatically parallelized. A dependency is carried by variable aliasing.

Tuning: add restrict to the pointer parameters.

Modified source file (file.c):

    void foo (float * restrict p, float * restrict q, float * restrict r, int n)

Report after tuning (file.xml): Loop was automatically parallelized. Loop was modulo scheduled.
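The restrict-based tuning on this slide can be written out as a complete function. This is a minimal sketch in portable C99, assuming the multiply between q[i] and r[i] that the slide text appears to have lost. The restrict qualifier is a promise made by the caller, so the function is only correct when the three arrays really do not overlap.

```c
#include <assert.h>

/* p[i] = p[i] + q[i] * r[i] for i in [0, n).
 * restrict tells the compiler that p, q and r never alias each other,
 * which removes the aliasing dependency reported for the original loop.
 * The compiler does not verify this promise. */
void foo(float *restrict p, float *restrict q, float *restrict r, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        p[i] = p[i] + q[i] * r[i];
    }
}
```

Calling foo with overlapping arrays (for example foo(a, a + 1, b, n)) would break the promise and is undefined behavior, so restrict should be added only after checking every call site.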



Explicit SIMD programming for POWER7

Enabled under -qaltivec

Successor to Altivec programming extensions on POWER6/PPC970

– Altivec data types: 16-byte vectors
  • vector char: 16 elements
  • vector short: 8 elements
  • vector pixel: 8 elements
  • vector int: 4 elements
  • vector float: 4 elements
– VSX Altivec extensions: 16-byte vectors
  • vector double: 2 elements
  • vector long long: 2 elements

Altivec built-in functions extended to new data types

– vec_add(vector double, vector double)
– vec_sub(vector long long, vector long long)

New vector operations: vec_mul, vec_div, …

Unaligned load and store operations

– Altivec truncating loads/stores still available: vec_ld, vec_st
– New non-truncating loads/stores: vec_xld2, vec_xstd2




Automatic SIMDization

Automatic SIMDization for VMX and VSX

– Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX

Features

– Basic block level SIMDization
– Loop level aggregation
– Data conversion, reduction
– Loop with limited control flow
– Automatic SIMDization with -qstrict (VSX) and -qnostrict
– Support of unaligned vector memory accesses (VSX)
– Automatic SIMDization enabled at -O3 -qsimd



SIMDization Tuning

Transformation report: Loop was SIMD vectorized

Transformation report: It is not profitable to vectorize
User actions:
– Use #pragma simd_level(10) to force the compiler to do SIMDization

Transformation report: memory accesses have non-vectorizable alignment
User actions:
– Use __attribute__((aligned(n))) to set data alignment
– Use __alignx(16, a) to indicate the data alignment to the compiler
– Use -qassert=refalign if all references are naturally aligned
– Use array references instead of pointers where possible

Transformation report: data dependence prevents SIMD vectorization
User actions:
– Use fewer pointers when possible
– Use #pragma independent if the loop has no loop carried dependency
– Use #pragma disjoint (a, b) if a and b are disjoint
– Use restrict keyword or compiler option -qrestrict
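The alignment advice above can be sketched in C as follows. The names a, b, scale_add and run_aligned_demo are hypothetical. The aligned attribute is the GCC-style spelling named on the slide. The __alignx call is kept behind an __IBMC__ guard, on the assumption that it is only available when compiling with XL C, so the sketch still builds with other compilers.

```c
#include <assert.h>
#include <stdint.h>

#define N 64

/* 16 bytes matches the 16-byte vector width described for VMX/VSX. */
static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));

/* Returns 1 when ptr is a multiple of 16 bytes, 0 otherwise. */
int is_aligned16(const void *ptr)
{
    return ((uintptr_t)ptr % 16u) == 0;
}

/* dst[i] += k * src[i]; both arrays must be 16-byte aligned and disjoint. */
void scale_add(float *restrict dst, const float *restrict src, float k, int n)
{
    assert(is_aligned16(dst) && is_aligned16(src));
#if defined(__IBMC__)
    /* XL C only: state the alignment so the optimizer can rely on it. */
    __alignx(16, dst);
    __alignx(16, src);
#endif
    for (int i = 0; i < n; i++) {
        dst[i] += k * src[i];
    }
}

/* Fills the static arrays, runs the kernel once, and returns a[N - 1]. */
float run_aligned_demo(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = (float)i;
    }
    scale_add(a, b, 2.0f, N);
    return a[N - 1];
}
```

Declaring the data as arrays with a known alignment, rather than passing arbitrary pointers, gives the compiler both facts it needs: where the data starts and that the two operands do not overlap.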



SIMDization Tuning


Transformation report: non-vectorizable strides
User actions:
– Loop interchange for stride-one accesses when possible
– Data layout reshape for stride-one accesses
– Higher optimization to propagate compile-time known stride information
– Stride versioning

Transformation report: either operation or data type is not suitable for SIMD vectorization
User actions:
– Do statement splitting and loop splitting

Transformation report: loop structure prevents SIMD vectorization
User actions:
– Convert while-loops into do-loops when possible
– Limit the use of control flow in a loop
– Use MIN, MAX instead of if-then-else
– Eliminate function calls in a loop through inlining
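Two of the source-level actions listed above, loop interchange for stride-one access and MAX in place of if-then-else, can be illustrated with a small hand-written sketch. The function names and the 4x8 array shape are hypothetical, and the example only shows the rewrite a programmer might apply; it makes no claim about what the XL compiler does internally.

```c
#include <assert.h>

#define ROWS 4
#define COLS 8

/* MAX as a conditional expression, so the loop body has no branch statement. */
#define MAX(x, y) ((x) > (y) ? (x) : (y))

/* Before: the inner loop walks down a column of a row-major array,
 * so consecutive iterations touch elements COLS floats apart. */
void clamp_column_order(float m[ROWS][COLS], float floor_value)
{
    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            if (m[i][j] < floor_value) {
                m[i][j] = floor_value;
            }
        }
    }
}

/* After: loops interchanged so the inner loop is stride-one,
 * and the if-then-else replaced by a single MAX expression. */
void clamp_row_order(float m[ROWS][COLS], float floor_value)
{
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            m[i][j] = MAX(m[i][j], floor_value);
        }
    }
}

/* Runs both versions on identical data; returns 1 when the results agree. */
int interchange_demo_matches(void)
{
    float x[ROWS][COLS];
    float y[ROWS][COLS];

    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            float v = (float)(i * COLS + j) - 10.0f;
            x[i][j] = v;
            y[i][j] = v;
        }
    }
    clamp_column_order(x, 0.0f);
    clamp_row_order(y, 0.0f);

    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            if (x[i][j] != y[i][j]) {
                return 0;
            }
        }
    }
    return 1;
}
```

The interchange is legal here because every element is updated independently of all the others, so the visiting order does not affect the result.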



MASS enhancements and Auto-vectorization

MASS enhancements for POWER7

– POWER7 vector MASS library (libmassvp7.a)
  • Internally exploits VSX instructions
  • SP: average speedup of 1.99 vs Power5 MASSV
  • DP: average speedup of 1.27 vs Power5 MASSV
– POWER7 SIMD MASS library (libmass_simdp7.a)
  • Tuned math routines operating on vector data types
  • Over 35 frequently used mathematical functions
  • Both single and double precision
  • To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level -O3 or above

-qstrict=vectorprecision to maintain precision over all loop iterations

Source loop:

    for (i=0; i<n; i++)
        b[i] = sqrt(a[i]);

After auto-vectorization:

    __vsqrt_P7(b, a, n);

Transformation report: Loop vectorization was performed



Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine-grained software-controlled data prefetch, including stream type, stream length, stream stride, and prefetch depth, at optimization level -O3 -qhot or above

– More aggressive exploitation under option -qprefetch=aggressive

Global analysis for coarse-grained prefetch engine control at optimization level -O5



Built-in functions for POWER7 data prefetching and cache control

Transient cache line touch

    void __dcbtt(void *address)
    void __dcbtstt(void *address)

Partial cache line touch

    void __partial_dcbt(void *address)

Stride-N stream prefetch

    void __protected_stream_stride(offset, stride, stream_ID)

Transient stream prefetch

    void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
    void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)



Example of POWER7 data prefetching

Loop being prefetched for:

    for (i=0; i<n; i++) a[i] = b[i] +

Store stream prefetch for array a:

    __protected_store_stream_set(FORWARD, &a, 11)
    __protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11)

Transient stream prefetch for array b:

    __protected_stream_set(FORWARD, &b, 0)
    __transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0)

Start stream prefetch:

    __eieio()
    __protected_stream_go()

Argument callouts: stream direction (FORWARD), stream length (n*sizeof(double)/128), prefetch depth (DEEPER), stream id (last argument)
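For readers without access to a POWER7 toolchain, the idea behind the example can be approximated with per-iteration prefetch hints. The sketch below does not use the XL stream built-ins from the slide. It swaps in __builtin_prefetch, a GCC and Clang built-in, behind a compiler guard. The function name, the addend parameter (the slide's loop body is truncated after the plus sign), and the prefetch distance are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* How far ahead of the current index to prefetch, in elements.
 * 16 doubles = 128 bytes, the unit the slide divides the stream length by.
 * This value is a tuning assumption, not a measured optimum. */
#define PREFETCH_AHEAD 16

/* a[i] = b[i] + addend, with software prefetch hints where supported. */
void add_with_prefetch(double *restrict a, const double *restrict b,
                       double addend, size_t n)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_AHEAD < n) {
            /* b is read once: read hint, no temporal locality. */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 0);
            /* a is written: write hint. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3);
        }
#endif
        a[i] = b[i] + addend;
    }
}
```

Unlike the stream built-ins, which describe a whole stream once before the loop starts, this form issues a hint on every iteration, so it adds instructions to the loop body. The hints never change the computed values.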



Loop Optimization

Traditional unimodular loop transformations for perfect regular loop nests

– Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
– Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse
– Under the optimization level -O3, -O3 -qhot or above

Polyhedral framework for any loop nests

– Provides an abstract representation for aggressive analysis and complicated transformations of arbitrary loop nests and shapes, under option control -qhot=level=2 with -qsmp
  • Loop skewing, loop tiling for triangular loop shapes
– Performs exact dependence testing through a unified dependence formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks at all hot levels (-O3 -qhot or above)



Polyhedral Loop Transformation Examples

Dependence analysis

    do i = 1,N
      do j = 1,i
        do k = i+1,N
          c(k) = c(j) + b(j,k)

Enables interchange of j and k loops to improve access locality for b

– Identifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Tiling of triangular matrix multiplication

    do i = 1,N
      do j = i+1,N
        do k = i+1,N
          c(j,i) += a(k,i)*b(j,k)

Also allows transformation of imperfect loop nests

– Intervening code between loops

Only available at -qhot=level=2
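The first example on this slide, the j/k interchange, can be checked by hand with a small C transcription. This is a sketch under stated assumptions: the Fortran array b(j,k) is modeled as a flat column-major array through a hypothetical B macro, c uses 1-based indexing with element 0 unused, and N is fixed at 6. The interchange is legal because, for a fixed i, the loop writes only c(k) with k > i and reads only c(j) with j <= i, so the two sets of elements never overlap.

```c
#include <assert.h>

#define N 6

/* Column-major access that mimics Fortran b(j,k) with 1-based indices. */
#define B(j, k) b[((k) - 1) * N + ((j) - 1)]

/* Loop order as written on the slide: i, j, k.
 * The inner k loop jumps N elements through b on every iteration. */
void update_ijk(double c[N + 1], const double b[N * N])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= i; j++)
            for (int k = i + 1; k <= N; k++)
                c[k] = c[j] + B(j, k);
}

/* j and k interchanged: the inner j loop reads b(1..i, k) contiguously. */
void update_ikj(double c[N + 1], const double b[N * N])
{
    for (int i = 1; i <= N; i++)
        for (int k = i + 1; k <= N; k++)
            for (int j = 1; j <= i; j++)
                c[k] = c[j] + B(j, k);
}

/* Runs both loop orders on identical inputs; returns 1 when results agree. */
int interchange_preserves_result(void)
{
    double b[N * N];
    double c1[N + 1];
    double c2[N + 1];

    for (int idx = 0; idx < N * N; idx++)
        b[idx] = 0.5 * idx;
    for (int i = 0; i <= N; i++) {
        c1[i] = (double)i;
        c2[i] = (double)i;
    }

    update_ijk(c1, b);
    update_ikj(c2, b);

    for (int i = 1; i <= N; i++)
        if (c1[i] != c2[i])
            return 0;
    return 1;
}
```

The i loop stays outermost in both versions, since values written to c in one i iteration are read in later ones.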



Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

Input:

    for j …
      for k …
        A[j] = B[k]
        for i …
          C[i][k] = …

    for j …
      for k …
        M[j] = N[k]
        for i …
          X = C[i][k]

Transformations applied (Parallelism & Locality):

– Loop fusion
– Loop skewing to enable tiling
– Loop tiling for cache
– Loop skewing for pipeline parallelization
– Loop tiling for registers

Output (Optimized Loops):

    for jT …
      omp parallel for
      for kT …
        for j …
          for k …
            A[j] = B[k]
            M[j] = N[k]
            for i
              C[i][k] = …
              X = C[i][k]



Data Reorganization

Data reorganization transformations to reduce memory latency

– Context sensitive and insensitive safety analysis
– Data affinity analysis and shape analysis
– Data splitting, data transposing, data interleaving to reshape data layout

Enabled at -O5

Data Reorganization Report

Report columns: Seq, Type, Phase, Data Name, Category, Region, Line, Description

Seq 1: ArraySplitting (High Level Optimizer), data name iplus, 9. An array of a large aggregated data-type was split into multiple arrays of smaller data-types.

Seq 27: ArrayCoalescing (High Level Optimizer), data name net. Global variables were aggregated.



[Figure: memory layouts before and after three data reorganizations, aimed at data locality and cache utilization.

Array splitting: an array of structures laid out as S[0].F0 S[0].F1 S[0].F2 S[0].F3, S[1].F0 … is split into separate field arrays F0[0] F0[1] F0[2] F0[3] …, F1[…], F2[…], F3[…].

Array merging: separate rows A[0], A[1], A[2], A[3], each holding A[i][0] A[i][1] A[i][2] A[i][3], are shown combined into a single layout.

Array transposing: A[0][0] A[0][1] A[0][2] A[0][3], A[1][0] … is stored as A'[0][0] A'[1][0] A'[2][0] A'[3][0], A'[0][1] …]
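Array splitting, as pictured above, can be imitated by hand to see why it helps locality. The sketch below uses hypothetical type and function names and a fixed COUNT of 8. It converts an array of four-field structures into one array per field, so that a loop that reads only f0 touches contiguous memory. It illustrates the layout change only and does not reproduce the compiler's safety analysis.

```c
#include <assert.h>

#define COUNT 8

/* Original layout: fields of one element are adjacent (S[i].F0 .. S[i].F3). */
struct record {
    double f0;
    double f1;
    double f2;
    double f3;
};

/* Split layout: all F0 values are adjacent, then all F1 values, and so on. */
struct record_split {
    double f0[COUNT];
    double f1[COUNT];
    double f2[COUNT];
    double f3[COUNT];
};

/* Copies an array of structures into the split layout. */
void split_records(const struct record s[COUNT], struct record_split *out)
{
    for (int i = 0; i < COUNT; i++) {
        out->f0[i] = s[i].f0;
        out->f1[i] = s[i].f1;
        out->f2[i] = s[i].f2;
        out->f3[i] = s[i].f3;
    }
}

/* Reads f0 with a stride of sizeof(struct record): 3 of every 4 doubles
 * brought into cache are unused. */
double sum_f0_interleaved(const struct record s[COUNT])
{
    double sum = 0.0;
    for (int i = 0; i < COUNT; i++) {
        sum += s[i].f0;
    }
    return sum;
}

/* Reads f0 with stride one: every double brought into cache is used. */
double sum_f0_split(const struct record_split *p)
{
    double sum = 0.0;
    for (int i = 0; i < COUNT; i++) {
        sum += p->f0[i];
    }
    return sum;
}
```

A compiler can only do this automatically when it can prove that no code depends on the original layout, which is presumably why the slide ties the transformation to whole-program analysis at -O5.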



User Explicit Parallelization with OpenMP

Full OpenMP 3.0 implementation on C, C++ and Fortran

– Full OpenMP task parallelization
– Privatization of Fortran descriptor-based arrays

Efficient threadprivate implementation using OS-supported Thread-Local Storage (TLS) by default

– Bypasses the expensive pthread key mechanism
– pthread-based implementation available under option control -qsmp=noostls for backward compatibility

Improved interaction between OpenMP and automatic SIMD
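As a minimal illustration of user-explicit parallelization, the sketch below parallelizes a dot product with a standard OpenMP reduction clause. The function name dot is hypothetical. With XL compilers the pragma would be honored under -qsmp=omp, the option named earlier in this deck. A compiler without OpenMP enabled ignores the pragma and runs the loop serially.

```c
#include <assert.h>

/* Dot product of x and y over n elements.
 * Each thread accumulates a private partial sum; the reduction clause
 * combines the partial sums when the loop ends. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;

    assert(n >= 0);
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the reduction changes the order in which the products are added, results for general floating-point data can differ in the last bits between serial and parallel runs.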



Automatic parallelization

Improvements to automatic parallelization

– More effective array data flow analysis for array privatization
– Automatic privatization of descriptor-based Fortran arrays
– Runtime-dependence testing

SMP runtime improvements

– Leverage of TLS in the SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp



XLSMPOPTS Environment Variable for Runtime Tuning

The XLSMPOPTS environment variable allows you to tune the runtime behavior of OpenMP and autoparallel programs

Some suboptions of interest

– spins and yields to define the behavior of idle threads
– Thread binding using the startproc and stride suboptions
– New bind suboption on AIX
  • bind=SDL=<start resource>,<number of resources>,<stride>
  • bindlist=SDL=i0,i1,…,ix
– schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided)
  • Note that the default schedule has changed from runtime to auto in V11/V13



Inlining

Single control knob to enable inlining

– Simplifies inlining control for the programmer
– -qinline=level=X, X = 0 to 10 (default 5)
– Consistent across all languages and optimization levels
– Previous mechanisms still available

User inline control with -qinline+|-<function_name>

Automatic inlining before loop optimization

– Previously only available at -O5 or user inlining on C++
– Available at all levels of -qhot (default at -O3 and up)
– Enables early inlining of Fortran module procedures



Control over optimizations that may affect program results: -qstrict suboptions

Aggressive optimization may affect the results of the program

– Precision of floating-point computation
– Handling of special cases of the IEEE FP standard (INF, NaN, etc.)
– Use of alternate math libraries

-qstrict guarantees results identical to noopt, at the expense of optimization

– Suboptions allow fine-grain control over this guarantee
– Examples:

-qstrict=precision         Strict FP precision
-qstrict=exceptions        Strict FP exceptions
-qstrict=ieeefp            Strict IEEE FP implementation
-qstrict=nans              Strict generation and computation of NaNs
-qstrict=order             Do not modify evaluation order
-qstrict=vectorprecision   Maintain precision over all loop iterations

Suboptions can be combined: -qstrict=precision:nonans
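Why evaluation order matters for floating-point results can be seen with three numbers. The sketch below uses hypothetical function names and adds the same values in two different orders. The volatile temporaries are there only to force each intermediate sum to be rounded to double. An optimizer that reassociates additions effectively switches between these two functions, which is the kind of change a setting such as -qstrict=order is described as preventing.

```c
#include <assert.h>

/* Computes (a + b) + c, rounding the intermediate sum to double. */
double sum_left_to_right(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Computes a + (b + c), rounding the intermediate sum to double. */
double sum_right_to_left(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first function returns 1.0 and the second returns 0.0, because 1.0 is smaller than the spacing between adjacent doubles near 1e20 and is lost when it is added to -1e20 first.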



The IBM Rational C/C++ Café on IBM developerWorks

ibm.com/rational/cafe/community/ccpp


Fortran Cafe on IBM developerWorks



Feature Request

Request for a feature to be supported by our compilers

C/C++ feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005811

Fortran feature request page
– http://www.ibm.com/support/docview.wss?uid=swg27005812

Or send e-mail to xl_feature@ca.ibm.com



Documentation

An information center containing the documentation for the XL Fortran V13.1 and XL C/C++ V11.1 versions of the AIX compilers is available at http://publib.boulder.ibm.com/infocenter/comphelp/v111v131/index.jsp
– Now downloadable as a fully searchable package

Whitepaper "Code optimization with the IBM XL Compilers": http://www-01.ibm.com/support/docview.wss?uid=swg27005174

Whitepaper "Overview of the IBM XL C/C++ and XL Fortran Compiler Family", available at http://www.ibm.com/support/docview.wss?uid=swg27005175

Please send any comments or suggestions on this information center, or about the existing C, C++ or Fortran documentation shipped with the products, to compinfo@ca.ibm.com




High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop


Agenda


Major Features of XLC111 XLF131



gxlc gxlc++ gxlC

XML Compiler Transformation Reports

Compiler Transformation Report Contents



Slide Number 13



Loop information


Slide Number 18

Explicit SIMD programming for POWER7Enabled under -qaltivec

Automatic SIMDization

SIMDization Tuning

SIMDization Tuning





Loop Optimization



Data Reorganization

Slide Number 33

User Explicit Parallelization with OpenMP

Automatic parallelization


Inlining

Control over optimizations that may affect program results-qstrict suboptions

The IBM Rational CC++ Cafeacute on IBM developerWorks


Feature Request

Documentation

Slide Number 43




copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any way IBM the IBM logo Rational the Rational logo Telelogic the Telelogic logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others



Agenda




























SIMDization







Parallelization





-qlistfmt=xml=all











gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC



gxlccfg format


eg


nnn -B -B

nnn -C -C

nnn -c -c


nnn -D -D

E E



















directed feedback








SOURCE CODE





PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS










1

Call Coverage 32 49

Call Counters


Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38




Region 1



Block Counters

Block Index


Start Line

End Line

3 5 1 33

4 5 33 34


helliphellip







3 408 2 17446 24


3 412 2 9999 14


3 413 2 9974 14


3 525 2 1033429 6




Cache miss



Source location


StartLine

EndLine


Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull


2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull


3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull


4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull







Descriptio n

Attributes




not available








not available







filec




filec






Tuning












21






SIMDization Tuning















to vectorize



SIMDization Tuning






Stride versioning
















for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)


















stream_ID)




for (i=0 ilt n i++) a[i] = b[i] +





__eieio()




Stream length

Stream direction

Prefetch depth




Loop Optimization











Dependence analysis


















Output









Enabled at O5


Seq









S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3


F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]


hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip


Array splitting


























Inlining



























Documentation







43




Agenda





gxlc gxlc++ gxlC





Slide Number 13



Loop information


Slide Number 18



SIMDization Tuning

SIMDization Tuning





Loop Optimization



Data Reorganization

Slide Number 33




Inlining




Feature Request

Documentation

Slide Number 43



Agenda




























SIMDization







Parallelization





-qlistfmt=xml=all











gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC



gxlccfg format


eg


nnn -B -B

nnn -C -C

nnn -c -c


nnn -D -D

E E



















directed feedback








SOURCE CODE





PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS










1

Call Coverage 32 49

Call Counters


Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38




Region 1



Block Counters

Block Index


Start Line

End Line

3 5 1 33

4 5 33 34


helliphellip







3 408 2 17446 24


3 412 2 9999 14


3 413 2 9974 14


3 525 2 1033429 6




Cache miss



Source location


StartLine

EndLine


Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull


2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull


3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull


4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull







Descriptio n

Attributes




not available








not available







filec




filec






Tuning












21






SIMDization Tuning















to vectorize



SIMDization Tuning






Stride versioning
















for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)


















stream_ID)




for (i=0 ilt n i++) a[i] = b[i] +





__eieio()




Stream length

Stream direction

Prefetch depth




Loop Optimization











Dependence analysis


















Output









Enabled at O5


Seq









S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3


F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]


hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip


Array splitting


























Inlining



























Documentation











Agenda
gxlc gxlc++ gxlC
Loop information
SIMDization Tuning
SIMDization Tuning
Loop Optimization
Data Reorganization
Inlining
Feature Request
Documentation


























SIMDization







Parallelization





-qlistfmt=xml=all











gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC



gxlc.cfg format

e.g.


nnn -B -B

nnn -C -C

nnn -c -c


nnn -D -D

E E



















directed feedback








SOURCE CODE





PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS










1

Call Coverage 32 49

Call Counters


Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38




Region 1



Block Counters

Block Index


Start Line

End Line

3 5 1 33

4 5 33 34


... ...







3 408 2 17446 24


3 412 2 9999 14


3 413 2 9974 14


3 525 2 1033429 6




Cache miss


