pSeries Program Optimization
© 2005 IBM
Louisiana State University, Baton Rouge, Louisiana
Charles Grassl, IBM
June, 2005
Agenda
• Programming Concerns
• Tactics
Understanding Performance
• Fused floating multiply add functional units
  • Latency
  • Balance
• Memory Access
  • Stride
  • Latency
  • Bandwidth
FMA Functional Unit
• Multiply and Add
• Fused
• 6 clock period latency
D = A+B*C
Performance Expectations
• 100s of Mflop/s to 2-4 Gflop/s
  • Limitations:
    • System bandwidth
    • Program model
• Floating point operations
  • Adds and Multiplies
  • Divides
• Memory access
  • Copying
• 100 Mbyte/s - 10 Gbyte/s
  • Strided access is slow
  • Multiple streams
Performance Expectations Example: Program POP
• Structured Fortran 90
• Data moves
• Low Computational Intensity
• Low FMA Percentage

Instructions per load/store         2.9 M
Instructions per cycle              0.9
HW float points Inst. per Cycle     0.3
Float. Point Inst. + FMAs           10141 M
Float. Point Instr. + FMA rate      372 Mflip/s
FMA percentage                      53 %
Comp. Intensity                     0.869
Program POP: Computation Time Distribution
[Figure: bar chart of % time (0 to 12) by routine: __state_mod_MOD_st, __vmix_rich_MOD_vm, tracer_update, clinic, __hmix_del2_MOD_hd, __vertical_mix_MOD_i, __vertical_mix_MOD_i, __diagnostics_MOD_d]
Performance Expectations Example: Program SPPM
• Loose Fortran 77
• Optimized
• High Computational Intensity
• Vector intrinsics
• Low FMA Percentage

Instr. Per load/store               2.9
Instr. Per cycle                    1.2
HW Float Inst. Per cycle            0.7
Float. Point + FMAs                 191752
Float point Inst. + FMA rate        979 Mflip/s
FMA percentage                      55 %
Computation Intensity               1.8
Program SPPM: Computation Time Distribution
[Figure: bar chart of % time (0 to 60) by routine: sppm, difuze, __vsrec_GP, interf, dintrf, __vsrsqrt_630, hydyz, hydzy]
Programming Concerns
• Superscalar design
  • Concurrency:
    • Branches
    • Loop control
    • Procedure calls
    • Nested “if” statements
  • Program “control” is very efficient
• Cache based microprocessor
  • Critical resource: memory bandwidth
  • Tactics:
    • Increase Computational Intensity
    • Exploit "prefetch"
Strategies for Optimization
• Exploit multiple functional units
• Expose load/store "streams"
• Optimized libraries
• Pipelined operations
Tactics for Optimization
• Cache reuse
  • Blocking
• Unit stride
  • Use entire loaded cache lines
• Limit range of indirect addressing
  • Sort indirect addresses
• Increase computational intensity of the loops
  • Ratio of flops to bytes
• Load streams
Computational Intensity
• Ratio of floating point operations to memory references (loads and stores)
• Higher is better
• Example:

  for (i=0;i<n;i++)
    A[i] = A[i] + B[i]*s + C[i];

  • Loads and stores: 3+1
  • Floating point operations: 3
  • Computational intensity: 3/4
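The count on this slide can be written down as a small C sketch. The function names `triad_update` and `computational_intensity` are made up for illustration and are not part of the original material; the sketch only restates the slide's arithmetic.

```c
#include <stddef.h>

/* The slide's example loop. Per iteration: 3 loads (A[i], B[i], C[i]),
   1 store (A[i]) and 3 floating point operations (1 multiply, 2 adds). */
void triad_update(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i] * s + c[i];
}

/* Computational intensity = flops / (loads + stores).
   Hypothetical helper; the caller supplies the per-iteration counts. */
double computational_intensity(int flops, int loads, int stores)
{
    return (double)flops / (double)(loads + stores);
}
```

For the loop above, `computational_intensity(3, 3, 1)` gives 3/4, matching the slide.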
Loop Unrolling Strategy: Increase Computational Intensity
• Find variable which is constant with respect to outer loop
  • Unroll such that this variable is loaded once but used multiple times
• Unroll outer loop
  • Minimizes load/stores
Outer Loop Unroll
Original loop:
• 2 flops / 2 loads
• Comp. Int.: 1

  DO I = 1, N
    DO J = 1, N
      s = s + X(J)*A(J,I)
    END DO
  END DO

Unroll

Unrolled loop:
• 8 flops / 5 loads
• Comp. Int.: 1.6

  DO I = 1, N, 4
    DO J = 1, N
      s = s + X(J)*A(J,I+0) + X(J)*A(J,I+1) + X(J)*A(J,I+2) + X(J)*A(J,I+3)
    END DO
  END DO
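A C rendering of the same transformation might look like the sketch below. It is an illustration under assumptions, not the code that was measured: the function names are invented, the matrix is stored as a flat array so that `a[i*n + j]` plays the role of `A(J,I)` (the inner index is contiguous), and a cleanup loop is added for the case where `n` is not a multiple of 4, which the slide's version does not handle.

```c
#include <stddef.h>

/* Reference version: 2 loads (x[j], a[i*n+j]) per 2 flops. */
double dot_all_rolled(const double *a, const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += x[j] * a[i * n + j];
    return s;
}

/* Outer loop unrolled by 4: x[j] is loaded once and reused four times,
   giving 5 loads per 8 flops. */
double dot_all_unrolled4(const double *a, const double *x, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const double *r0 = a + (i + 0) * n;
        const double *r1 = a + (i + 1) * n;
        const double *r2 = a + (i + 2) * n;
        const double *r3 = a + (i + 3) * n;
        for (size_t j = 0; j < n; j++) {
            double xj = x[j];
            s += xj * r0[j] + xj * r1[j] + xj * r2[j] + xj * r3[j];
        }
    }
    for (; i < n; i++)                  /* cleanup for leftover rows */
        for (size_t j = 0; j < n; j++)
            s += x[j] * a[i * n + j];
    return s;
}
```

The unrolled version adds the terms in a different order, so with general floating point data the two results can differ in the last bits.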
Outer Loop Unroll Test
[Figure: bar chart of Mflop/s (0 to 500) for outer loop unroll depths 1, 2, 4, 8 and 12, compiled with -O3 and with -O3 -qhot; 1.3 GHz POWER4]
Loop Unroll Analysis
• Strategy:
  • Unroll up to 8 times
    • Exploit 8 prefetch streams
• Compiler:
  • Unrolls up to 4 times
    • “Near” optimal performance
  • Combines inner and outer loop unrolling
Loop Unrolling Strategies
• Inner loop strategy:
  • Reduce data dependency
  • Eliminate intermediate loads and stores
  • Expose functional units
  • Expose registers
  • Examples:
    • Linear recurrences
    • Simple loops
      • Single operation
Inner Loop Unroll: Dependencies
• Eliminate data dependence (half)
• Eliminate intermediate loads and stores
  • Compiler will do some of this at -O3 and higher

  do i=2,n-1
    a(i+1) = a(i)*s1 + a(i-1)*s2
  end do

  do i=2,n-2,2
    a(i+1) = a(i  )*s1 + a(i-1)*s2
    a(i+2) = a(i+1)*s1 + a(i  )*s2
  end do
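The following C sketch shows one way the recurrence could be unrolled by two while keeping the running values in local variables, so that `a(i)` and `a(i+1)` are not reloaded after being stored. The function names are hypothetical, indices are 0-based, and a single leftover step is added for trip counts that are not even, which the Fortran version above leaves out.

```c
#include <stddef.h>

/* Straightforward recurrence: a[i+1] = a[i]*s1 + a[i-1]*s2, 0-based,
   for an array of n elements. */
void recur_rolled(double *a, size_t n, double s1, double s2)
{
    for (size_t i = 1; i + 1 < n; i++)
        a[i + 1] = a[i] * s1 + a[i - 1] * s2;
}

/* Unrolled by 2. prev and cur carry a[i-1] and a[i] in registers, so the
   values just stored are not loaded again on the next step. */
void recur_unrolled2(double *a, size_t n, double s1, double s2)
{
    if (n < 3)
        return;                         /* nothing to compute */
    double prev = a[0];
    double cur = a[1];
    size_t i = 1;
    for (; i + 2 < n; i += 2) {
        double next1 = cur * s1 + prev * s2;     /* new a[i+1] */
        double next2 = next1 * s1 + cur * s2;    /* new a[i+2] */
        a[i + 1] = next1;
        a[i + 2] = next2;
        prev = next1;
        cur = next2;
    }
    if (i + 1 < n)                      /* one leftover iteration */
        a[i + 1] = cur * s1 + prev * s2;
}
```

Both versions perform the same operations in the same order, so they should produce identical results.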
Inner Loop Unroll: Dependencies
[Figure: bar chart of Mflop/s (0 to 350) for manual Unroll 1, Unroll 2 and Unroll 4, compiled with -qnounroll and with -qunroll]
Inner Loop Unroll: Dependencies
• Compiler unrolling helps
  • Does not help manually unrolled loops
    • Conflicts
• Turn off compiler unrolling if manually unrolled
Inner Loop Unroll: Registers and Functional Units
• Expose functional units
• Expose registers
  • Compiler will do some of this at -O3 and higher

  do j=1,n
    do i=1,m
      sum = sum + X(i)*A(i,j)
    end do
  end do

  do j=1,n
    do i=1,m,2
      sum = sum + X(i)  *A(i,j)   &
                + X(i+1)*A(i+1,j)
    end do
  end do
Inner Loop Unroll: Registers and Functional Units
[Figure: bar chart of Mflop/s (0 to 500) for manual Unroll 1, Unroll 2, Unroll 4 and Unroll 8, compiled with -qnounroll and with -qunroll; 1.45 GHz POWER4]
Inner Loop Unroll: Registers and Functional Units
• Compiler does an adequate job of unrolling
• Do not manually unroll inner loop
Outer Loop Unrolling Strategies
• Expose prefetch streams
  • Up to 8 streams

  do j=1,n
    do i=1,m
      sum = sum + X(i)*A(i,j)
    end do
  end do

  do j=1,n,2
    do i=1,m
      sum = sum + X(i)*A(i,j)   &
                + X(i)*A(i,j+1)
    end do
  end do
Outer Loop Unroll: Streams
[Figure: bar chart of Mflop/s (0 to 800) for outer loop Unroll 1, Unroll 2, Unroll 4 and Unroll 8, compiled with -qnounroll and with -qunroll]
Outer Loop Unroll: Streams
• Compiler does a “fair” job of outer loop unrolling
• Manual unrolling can outperform compiler unrolling
Strides: Cache Lines
• Strided memory accesses use partial cache lines
  • Reduced efficiency
Strided Memory Access
[Figure: diagram labeled "Memory"]

  for (i=0;i<n;i+=2)
    sum+=a[i];
Strides
• Cache line size is 128 bytes
• Double precision: 16 words
• Single precision: 32 words

Bandwidth Reduction

Stride   Single   Double
1        1x       1x
2        1/2      1/2
4        1/4      1/4
8        1/8      1/8
16       1/16     1/16
32       1/32     1/16
64       1/32     1/16
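The entries in the bandwidth reduction table follow from how many elements of each 128-byte line are actually touched. A minimal C sketch of that arithmetic is given below; `stride_bandwidth_fraction` is a hypothetical helper, and it models only the cache line effect (not TLB misses or prefetch behavior).

```c
/* Cache line size quoted on the slide for POWER4. */
#define CACHE_LINE_BYTES 128

/* Fraction of each loaded cache line that is used when walking an array
   of elem_bytes-sized elements with the given stride (in elements, >= 1).
   Once the stride exceeds the number of elements per line, every access
   touches a new line and the fraction stops shrinking. */
double stride_bandwidth_fraction(int stride, int elem_bytes)
{
    int elems_per_line = CACHE_LINE_BYTES / elem_bytes;
    int divisor = stride < elems_per_line ? stride : elems_per_line;
    return 1.0 / (double)divisor;
}
```

For example, a stride of 64 gives 1/32 for 4-byte elements and 1/16 for 8-byte elements, as in the last row of the table.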
Stride Test
[Figure: bar chart of Mbyte/s (0 to 3000) for strides -16, -10, -6, -2, 1, 4, 8 and 12; POWER4 1.3 GHz]
Correcting Strides
• Interleave code
• Example:
  • Real and Imaginary part arithmetic

  do i=1,n
    … AIMAG(Z(i))
  end do
  do i=1,n
    … REAL(Z(i))
  end do

  do i=1,n
    … AIMAG(Z(i))
    … REAL(Z(i))
  end do
Correcting Strides
• Interchange loops
  do i=1,n
    do j=1,m
      sum=sum+A(i,j)
    end do
  end do

Interchange

  do j=1,m
    do i=1,n
      sum=sum+A(i,j)
    end do
  end do
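The same idea carries over to C, with one caveat: C arrays are row-major, so the contiguous index is the last one rather than the first as in Fortran. The sketch below uses invented function names and a flat `n` by `m` array to show the strided and the unit-stride loop orders side by side.

```c
#include <stddef.h>

/* Strided traversal of a row-major n x m array: the inner loop steps
   through memory with a stride of m elements. */
double sum_strided(const double *a, size_t n, size_t m)
{
    double sum = 0.0;
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * m + j];
    return sum;
}

/* Interchanged loops: the inner loop now has unit stride and uses every
   element of each loaded cache line. */
double sum_unit_stride(const double *a, size_t n, size_t m)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            sum += a[i * m + j];
    return sum;
}
```

Because the order of additions changes, the two sums can differ by rounding for general floating point data.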
Correcting Strides: Loop Interchange
[Figure: bar chart of Mflop/s (0 to 160) for -O3, -O3 -qhot, and manually interchanged loops; 1.45 GHz POWER4]
Effect of Large Strides (TLB Misses)
[Figure: bar chart of Mbyte/s (0 to 450) for large strides 8, 16, 32, 68, 132, 260, 516, 1028 and 2000]
Working Set Size
• Reduce spanning size of random memory accesses
• More MPI tasks usually help
Random Memory Access
[Figure: chart of Gupdate/s (0 to 70) versus work set size of 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512 Mbyte]
Blocking
• Common technique in Linear Algebra (LA)
  • Similar to unrolling
  • Utilize cache lines
  • Linear Algebra NB:
    • Typically 96-256
Blocking
[Figure: diagram labeled "Blocking"]
Blocking Example: Transpose
• Especially useful for bad strides
  do i = 1,n
    do j = 1,m
      B(j,i) = A(i,j)
    end do
  end do

Blocking

  do j1 = 1,m,nb
    j2 = min(j1+nb-1,m)
    do i1 = 1,n,nb
      i2 = min(i1+nb-1,n)
      do i = i1, i2
        do j = j1, j2
          B(j,i) = A(i,j)
        end do
      end do
    end do
  end do
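A blocked transpose can be sketched in C as follows. This is an illustration under assumptions rather than the benchmarked code: `transpose_blocked` and `min_size` are invented names, both matrices are flat row-major arrays, and the tile loops run over the full index range so that partial tiles at the edges are also copied.

```c
#include <stddef.h>

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* b (m x n) receives the transpose of a (n x m); both are row-major.
   The work is done in tiles of at most nb x nb elements (nb >= 1), so the
   strided side of the copy stays within a small set of cache lines. */
void transpose_blocked(const double *a, double *b,
                       size_t n, size_t m, size_t nb)
{
    for (size_t i1 = 0; i1 < n; i1 += nb) {
        size_t i2 = min_size(i1 + nb, n);       /* end of this row tile */
        for (size_t j1 = 0; j1 < m; j1 += nb) {
            size_t j2 = min_size(j1 + nb, m);   /* end of this column tile */
            for (size_t i = i1; i < i2; i++)
                for (size_t j = j1; j < j2; j++)
                    b[j * n + i] = a[i * m + j];
        }
    }
}
```

The tile size `nb` is a tuning parameter; a suitable value depends on the cache sizes of the machine and would need to be measured.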
Blocking Example: Transpose
[Figure: bar chart of Mbyte/s (0 to 1800) for blocking sizes 1, 32, 64, 128 and 192, and for dgetmi]
Hardware Prefetch
• Detects adjacent cache line references
• Forward and backward
• Up to eight concurrent streams
• Prefetches up to two lines ahead per stream
• Twelve prefetch filter queues prevent rolling
• No prefetch on store misses
  • (when a store instruction causes a cache line miss)
Prefetch: Stride Pattern Recognition
• Upon a cache miss:
  • Biased guess is made as to the direction of that stream
  • Guess is based upon where in the cache line the address associated with that miss occurred
    • If it is in the first 3/4, then the direction is guessed as ascending
    • If in the last 1/4, the direction is guessed descending
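The direction rule described above can be modeled in a few lines of C. This is only a software model of the stated heuristic, with an invented function name and the 128-byte line size taken from the earlier slide; it says nothing about how the hardware implements it.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 128

/* Returns +1 for a guessed ascending stream and -1 for descending.
   A miss in the first 3/4 of the line (byte offsets 0..95) is guessed
   ascending; a miss in the last 1/4 (offsets 96..127) is guessed
   descending. */
int guess_stream_direction(uintptr_t miss_address)
{
    uintptr_t offset = miss_address % CACHE_LINE_BYTES;
    uintptr_t threshold = (CACHE_LINE_BYTES * 3) / 4;
    return offset < threshold ? +1 : -1;
}
```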
Memory Bandwidth
[Figure: chart of memory bandwidth in Mbyte/s (0 to 5000) versus number of right hand sides (1, 2, 4, 6, 8, 10, 12), for small pages and large pages; 1.3 GHz POWER4]
Memory Bandwidth
• Bandwidth is proportional to number of streams
  • Streams are roughly the number of right hand side arrays
  • Up to eight streams
Exploiting Prefetch
• Merge Loops
• Strategy
  • Combine loops to get up to 8 right hand sides

  for (j=1; j<=n; j++)
    A[j] = A[j-1]+B[j];

  for (j=1; j<=n; j++)
    D[j] = D[j+1]+C[j]*s;

  for (j=1; j<=n; j++) {
    A[j] = A[j-1]+B[j];
    D[j] = D[j+1]+C[j]*s;
  }
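A compilable version of the two forms is sketched below, with invented function names. It assumes the arrays do not overlap, that `a`, `b` and `c` hold at least `n+1` elements, and that `d` holds at least `n+2` because `d[j+1]` is read at `j = n`.

```c
#include <stddef.h>

/* Two separate loops: each loop on its own exposes only a couple of
   load streams to the prefetch hardware. */
void update_separate(double *a, const double *b,
                     double *d, const double *c, double s, size_t n)
{
    for (size_t j = 1; j <= n; j++)
        a[j] = a[j - 1] + b[j];
    for (size_t j = 1; j <= n; j++)
        d[j] = d[j + 1] + c[j] * s;
}

/* Merged loop: the streams from both statements are active at once.
   Merging is valid here because the two statements touch disjoint
   arrays (assumed not to alias). */
void update_merged(double *a, const double *b,
                   double *d, const double *c, double s, size_t n)
{
    for (size_t j = 1; j <= n; j++) {
        a[j] = a[j - 1] + b[j];
        d[j] = d[j + 1] + c[j] * s;
    }
}
```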
Loop Merge Example
[Figure: bar chart of Mflop/s (0 to 400) for the original loops, -O3 -qhot, and manually merged loops; 1.3 GHz POWER4]
Folding
• Fold loop to increase number of streams
• Strategy
  • Fold loops up to 4 times

  do i = 1,n
    sum = sum + A(i)
  end do

  do i = 1,n/4
    sum = sum + A(i)         &
              + A(i+1*n/4)   &
              + A(i+2*n/4)   &
              + A(i+3*n/4)
  end do
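In C, a four-way fold might be sketched as below. The function names are hypothetical, and a cleanup loop is added for the elements left over when `n` is not a multiple of 4, which the Fortran version above does not cover.

```c
#include <stddef.h>

/* Plain reduction: a single load stream. */
double sum_plain(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Folded reduction: the array is split into four segments of length q
   that are walked simultaneously, giving four independent load streams. */
double sum_folded4(const double *a, size_t n)
{
    size_t q = n / 4;
    double sum = 0.0;
    for (size_t i = 0; i < q; i++)
        sum += a[i] + a[i + q] + a[i + 2 * q] + a[i + 3 * q];
    for (size_t i = 4 * q; i < n; i++)  /* leftover elements */
        sum += a[i];
    return sum;
}
```

Folding changes the order of the additions, so the result can differ from the plain sum by rounding for general floating point data.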
Folding Example
[Figure: bar chart of Mflop/s (0 to 4000) versus number of folds: 0, 2, 4, 6, 8 and 12; 1.3 GHz POWER4]
Effect of Precision
• Available floating point formats:
  • real (kind=4)
  • real (kind=8)
  • real (kind=16)
• Advantage of smaller data types:
  • Require less bandwidth
  • More effective cache use

  REAL*8 A,pi,e
  …
  do i=1,n
    A(i) = pi*A(i) + e
  end do

  REAL*4 A,pi,e
  …
  do i=1,n
    A(i) = pi*A(i) + e
  end do
Effect of Precision
[Figure: bar chart of Mflop/s (0 to 1600) for REAL*4, REAL*8 and REAL*16, with small (10000) and large (90M) problem sizes; 1.3 GHz POWER4]
Divide and Sqrt
• POWER4 special functions:
  • Divide
  • Sqrt
  • Use FMA functional unit
• 2 simultaneous divide or sqrt (or rsqrt)
• NOT pipelined

Instruction   Single (Cycles)   Double (Cycles)
Fma           6                 6
Fdiv          32                32
Fsqrt         38                38
Hardware DIV, SQRT, RSQRT
[Figure: bar chart of Mflop/s (0 to 120) for divide, SQRT and RSQRT instructions, hardware versus software pipelined; 1.3 GHz POWER4]
Intrinsic Function Vectorization
  do i = -nbdy+3,n+nbdy-1
    prl = qrprl(i)
    pll = qrmrl(i)
    pavg = vtmp1(i)
    wllfac(i) = 5*gammp1*pavg + gamma*pll
    wrlfac(i) = 5*gammp1*pavg + gamma*prl
    hrholl = rho(1,i-1)
    hrhorl = rho(1,i)
    wll(i) = 1/sqrt(hrholl*wllfac(i))
    wrl(i) = 1/sqrt(hrhorl*wrlfac(i))
  end do
Intrinsic Function Vectorization
  allocate(t1,n+2*nbdy-3)
  allocate(t2,n+2*nbdy-3)
  do i = -nbdy+3,n+nbdy-1
    prl = qrprl(i)
    ...
    t1(i) = hrholl*wllfac(i)
    t2(i) = hrhorl*wrlfac(i)
  end do
  call __vrsqrt(t1,wll,n+2*nbdy-3)
  call __vrsqrt(t2,wrl,n+2*nbdy-3)
Vectorization Analysis
• Dependencies
• Compiler overhead:
  • Generate (malloc) local temporary arrays
  • Extra memory traffic
• Moderate vector lengths required

                   REC   SQRT   RSQRT
Crossover Length   20    80     25
N½                 45    25     30