Copyright © 2007 Intel Corporation.
SP 3D Running Average Implementation: SSE + OpenMP
Benchmarking on different platforms
Dr. Zvi Danovich, Senior Application Engineer
January 2008
Agenda
– What is 3D Running Average (RA)?
– From 1D to 3D RA implementation
– Basic SSE technique: AoS <=> SoA transforms
– 1D RA 4-lines SSE implementation
– 2nd dimension completion
– 3rd dimension completion
– Adding OpenMP, benchmarking, conclusions
3D Running Average (RA) – what is it?

3D RA is computed for each voxel V as the normalized sum inside a k×k×k cube (k is odd) located “around” the given voxel:

V(l,m,n) = (1/k³) · ∑ v(l+i, m+j, n+h),   i, j, h = −(k−1)/2 … (k−1)/2

where v are the source voxels.

In other words, 3D RA can be considered a 3D convolution with a kernel having all components equal to 1/(k×k×k).

[Figure: k×k×k cube of source voxels v surrounding the central voxel V]
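As a baseline for the SSE work that follows, the definition above can be written directly as a triple-nested sum. This is an illustrative scalar sketch, not the deck's code: the function name, the flat row-major layout, and the choice to skip border voxels are all assumptions.

```c
#include <stddef.h>

/* Naive 3D running average per the definition above: for each voxel,
 * average the k*k*k cube around it (k odd). Border voxels whose cube
 * would leave the volume are left untouched in this sketch. */
static void ra3d_naive(const float *v, float *V,
                       int nx, int ny, int nz, int k)
{
    int r = (k - 1) / 2;                      /* cube "radius" */
    float norm = 1.0f / ((float)k * k * k);
    for (int z = r; z < nz - r; z++)
        for (int y = r; y < ny - r; y++)
            for (int x = r; x < nx - r; x++) {
                float s = 0.0f;
                for (int i = -r; i <= r; i++)
                    for (int j = -r; j <= r; j++)
                        for (int h = -r; h <= r; h++)
                            s += v[(size_t)(z + i) * ny * nx
                                 + (size_t)(y + j) * nx + (x + h)];
                V[(size_t)z * ny * nx + (size_t)y * nx + x] = s * norm;
            }
}
```

Note the cost: k³ additions per voxel. The running-average formulation on the next foils removes this factor entirely.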
1D Running Average (RA)

Unlike 1D convolution, 1D RA can be computed with O(1) complexity per output element using the following approach:
– Prolog: compute the sum S of the first k voxels
– Main step: to compute the next sum S+1, the first member of the previous sum (v0) is subtracted and the next component (vk) is added
S = ∑(v)0,k-1
S+1 = ∑(v)1,k = S – v0 + vk
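The prolog/main-step recurrence can be sketched in a few lines of C. The function name, the output layout (n−k+1 results, no border handling), and the normalization inside the loop are assumptions for illustration.

```c
/* O(1)-per-element 1D running average: keep a sliding sum S and,
 * for each step, subtract the element leaving the window (v0) and
 * add the one entering it (vk). */
static void ra1d(const float *v, float *out, int n, int k)
{
    float s = 0.0f;
    for (int i = 0; i < k; i++)          /* prolog: sum of the first k */
        s += v[i];
    out[0] = s / (float)k;
    for (int i = 1; i + k <= n; i++) {   /* main step: S+1 = S - v0 + vk */
        s += v[i + k - 1] - v[i - 1];
        out[i] = s / (float)k;
    }
}
```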
Extending 1D Running Average toward 2D

Given a slice (plane) with all lines (Li) 1D-averaged, we can extend the averaging to 2D by the same approach:
– Prolog: compute the sum S of the first k lines
– Main step: to compute the next sum S+1, the first line of the previous sum (L0) is subtracted and the next line (Lk) is added

S = ∑(L)0,k-1
S+1 = ∑(L)1,k = S – L0 + Lk
Extending 2D Running Average toward 3D

Given a stack of planes with all planes (Pi) 2D-averaged, we can extend the averaging to 3D by the same approach:
– Prolog: compute the sum S of the first k planes
– Main step: to compute the next sum S+1, the first plane of the previous sum (P0) is subtracted and the next plane (Pk) is added

S = ∑(P)0,k-1
S+1 = ∑(P)1,k = S – P0 + Pk

[Figure: stack of planes indexed 0 … k]
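Treating whole lines (or planes) as the “elements” of the recurrence gives a direct sketch; here the add and subtract of a line are plain per-column loops, whereas the deck does them with SSE. Function name and layout are hypothetical, and borders are ignored.

```c
#include <stdlib.h>

/* 2D extension of the recurrence: lines play the role of voxels.
 * Each 1D-averaged line Li is added to / subtracted from a running
 * per-column sum s[], so every output line costs one whole-line add
 * and one whole-line subtract, independent of k. */
static void ra2d_lines(const float *lines, float *out,
                       int nlines, int len, int k)
{
    float *s = (float *)calloc((size_t)len, sizeof(float));
    float norm = 1.0f / (float)k;
    for (int l = 0; l < k; l++)                 /* prolog: sum the first k lines */
        for (int x = 0; x < len; x++)
            s[x] += lines[(size_t)l * len + x];
    for (int x = 0; x < len; x++)
        out[x] = s[x] * norm;
    for (int l = 1; l + k <= nlines; l++) {     /* main step: S+1 = S - L0 + Lk */
        for (int x = 0; x < len; x++)
            s[x] += lines[(size_t)(l + k - 1) * len + x]
                  - lines[(size_t)(l - 1) * len + x];
        for (int x = 0; x < len; x++)
            out[(size_t)l * len + x] = s[x] * norm;
    }
    free(s);
}
```

The 3D step is the same loop again with planes in place of lines.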
Array of Structures (AoS) => Structure of Arrays (SoA)

Why should we transform it to vectorize 1D Running Average? How can it be transformed?

The original “natural” serial data structure, AoS, is NOT enabled for SSE:

Ms = ∑(m)0,k-1
Ms+1 = ∑(m)1,k = Ms – m0 + mk

The “transposed” data structure, SoA, is ENABLED for SSE:

S = ∑(v)0,k-1
S+1 = ∑(v)1,k = S – v0 + vk

[Figure: four lines L0–L3 stored element-interleaved (AoS) versus the transposed layout in which each SSE register holds the same-index elements v0–v3 of all four lines (SoA)]
Array of Structures (AoS) => Structure of Arrays (SoA)

Presented below: transposition of 4 quads from 4 original lines into 4 SSE registers of x, y, z, w. Takes 12 SSE operations per 16 components.

[Figure: loadlo/loadhi half-loads fill four intermediate registers xy10 = (x0,y0,x1,y1), xy32 = (x2,y2,x3,y3), zw10 = (z0,w0,z1,w1), zw32 = (z2,w2,z3,w3); the FINAL SSE registers are then
shuffle(xy10, xy32, (2,0,2,0)) -> (x0,x1,x2,x3)
shuffle(xy10, xy32, (3,1,3,1)) -> (y0,y1,y2,y3)
shuffle(zw10, zw32, (2,0,2,0)) -> (z0,z1,z2,z3)
shuffle(zw10, zw32, (3,1,3,1)) -> (w0,w1,w2,w3)]
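The 12-operation transpose described above maps naturally onto SSE1 intrinsics: `_mm_loadl_pi`/`_mm_loadh_pi` for the 2-float half-loads and `_mm_shuffle_ps` for the final selects. A sketch, not the deck's exact code; the function name and pointer conventions are assumptions.

```c
#include <xmmintrin.h>

/* AoS=>SoA: transpose 4 consecutive 4-float quads from 4 source lines
 * into 4 SSE registers holding the x, y, z and w components.
 * 8 half-loads + 4 shuffles = 12 SSE operations per 16 components. */
static void aos_to_soa(const float *r0, const float *r1,
                       const float *r2, const float *r3,
                       __m128 *x, __m128 *y, __m128 *z, __m128 *w)
{
    /* intermediates: xy10 = (x0,y0,x1,y1), xy32 = (x2,y2,x3,y3), ... */
    __m128 xy10 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)r0), (const __m64 *)r1);
    __m128 xy32 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)r2), (const __m64 *)r3);
    __m128 zw10 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)(r0 + 2)), (const __m64 *)(r1 + 2));
    __m128 zw32 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)(r2 + 2)), (const __m64 *)(r3 + 2));
    /* final selects, masks as on the foil */
    *x = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(2, 0, 2, 0));
    *y = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(3, 1, 3, 1));
    *z = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(2, 0, 2, 0));
    *w = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(3, 1, 3, 1));
}
```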
Array of Structures (AoS) <= Structure of Arrays (SoA)

Presented below: the (inverse) transposition of the 4 SSE registers x, y, z, w into 4 memory locations. Takes 12 SSE operations per 16 components.

[Figure: unpack_lo/unpack_hi build intermediates xy10, xy32, zw10, zw32 from the original registers; shuffle(xy10, zw10, …)+store and shuffle(xy32, zw32, …)+store write the four quads back through the line pointers L0 ptr … L3 ptr]
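The inverse transform, with unpacks plus shuffle+store as on the foil, might look like the following sketch (hypothetical name; the store targets are assumed to be writable 4-float quads).

```c
#include <xmmintrin.h>

/* SoA=>AoS: scatter 4 SSE registers x, y, z, w back into 4 xyzw quads
 * in memory. 4 unpacks + 4 shuffles + 4 stores = 12 SSE operations. */
static void soa_to_aos(__m128 x, __m128 y, __m128 z, __m128 w,
                       float *r0, float *r1, float *r2, float *r3)
{
    __m128 xy10 = _mm_unpacklo_ps(x, y);   /* x0 y0 x1 y1 */
    __m128 zw10 = _mm_unpacklo_ps(z, w);   /* z0 w0 z1 w1 */
    __m128 xy32 = _mm_unpackhi_ps(x, y);   /* x2 y2 x3 y3 */
    __m128 zw32 = _mm_unpackhi_ps(z, w);   /* z2 w2 z3 w3 */
    _mm_storeu_ps(r0, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(r1, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(3, 2, 3, 2)));
    _mm_storeu_ps(r2, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(r3, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(3, 2, 3, 2)));
}
```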
1D Running Average 4-lines SSE implementation (width = 11)
Cyclic SSE array buffer

The AoS=>SoA transform loads 4 SSE registers at a time. RA with width 11 needs to maintain 12 registers together; they can fit in 3 QUADs of registers, but can crawl into 4 QUADs as well.
[Figure: a stream of quads 0–15; a window of 12 registers either fits in 3 QUADs or crawls into 4 QUADs; the freed QUAD can be refilled by AoS=>SoA as the “next” QUAD]

So, 16 registers (4 QUADs) must be allocated and used in a cyclic way – when the last QUAD is freed, it is reloaded by AoS=>SoA with the next QUAD values.
1D Running Average 4-lines SSE implementation (width = 11)
Prolog

1. Load 12 SSE registers by AoS=>SoA
2. Sum up (accumulate) the 5 first
3. 4 times: sum up the next, keep the result in SSE registers (SoA form)
   – Save the QUAD of results to memory by AoS<=SoA
4. 2 times: sum up the next, keep the result in SSE registers (SoA form)
5. 1 time: sum up the next, subtract the first, keep the result in an SSE register

Here all 12 loaded QUADs are used: 5+4+2+1, and 3 result registers are NOT saved.
[Figure: quads v0–v11; the first 5 are accumulated, then results r are produced by further accumulation; one QUAD of results is saved to memory by AoS<=SoA; the very first quad is subtracted at the end of the prolog; 3 result registers are NOT saved in the prolog]
1D Running Average 4-lines SSE implementation (width = 11)
Main step & epilog

Main step
1. Load 4 SSE registers by AoS=>SoA, reusing the 4 “last” registers of the cyclic buffer
2. Sum up the next, subtract (next−11), keep the result in an SSE register – it will be the 4th
   – Save the QUAD of results to memory by AoS<=SoA
3. 3 times: sum up the next, subtract the first, keep the result in an SSE register

During the step: 4 new SSE registers are loaded, 4 results (3 old and 1 new) are saved to memory, and 3 result registers are NOT saved.
[Figure: quads v0–v15 in the cyclic buffer; the 4 quads added and the 4 subtracted in the current step are marked; results r (3 from the previous step plus 1 new) are saved to memory by AoS<=SoA, 3 are NOT saved in the current step; the subtracted quads are freed after the current step]
Epilog
For the 5 last results, ONLY subtraction is done.
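Once the data is in SoA form, averaging 4 lines at once is just the scalar sliding window lifted to `__m128`: each register holds the same-index element of all four lines. This sketch keeps all transposed quads in a plain array rather than the 16-register cyclic buffer of the foils, so it shows the arithmetic, not the register management; names and the lack of border handling are assumptions.

```c
#include <xmmintrin.h>

/* 1D RA of width k over 4 lines simultaneously: q[i] holds element i of
 * all four lines (SoA), so one vector add/subtract per step advances the
 * running sum of all four lines at once. */
static void ra1d_4lines(const __m128 *q, __m128 *out, int n, int k)
{
    __m128 norm = _mm_set1_ps(1.0f / (float)k);
    __m128 s = _mm_setzero_ps();
    for (int i = 0; i < k; i++)            /* prolog: accumulate first k quads */
        s = _mm_add_ps(s, q[i]);
    out[0] = _mm_mul_ps(s, norm);
    for (int i = 1; i + k <= n; i++) {     /* main step: add next, subtract first */
        s = _mm_add_ps(s, _mm_sub_ps(q[i + k - 1], q[i - 1]));
        out[i] = _mm_mul_ps(s, norm);
    }
}
```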
2nd dimension completion
2D RA: based on the 4-lines 1D SSE implementation – prolog

The logical flow of 2D RA (an in-place routine) is very similar to the 1D RA 4-lines implementation. To hold the intermediate 1D RA lines we use 16 working lines – the analog of the 16 SSE registers.

Prolog
1. Compute 12 1D RA lines by 3 calls to the 1D RA 4-lines routine
2. Sum up (accumulate) the 5 first in working memory
3. 6 times: sum up the next line, save the result in its final place
4. 1 time: sum up the next line, subtract the first line, save the result in its final place

Here all 12 1D RA lines are used: 5+6+1.
[Figure: 1D RA lines L0–L11 are accumulated; the resulting 2D Running Average lines L0–L6 are saved; the last addition also subtracts the very first line, which is subtracted at the end of the prolog]
2nd dimension completion
2D RA: based on the 4-lines 1D SSE implementation – main step & epilog

Main step
– Compute 4 1D RA lines by calling the 1D RA 4-lines routine, outputting into the 4 “last” lines of the working-lines cyclic buffer
– 4 times: sum up the next, subtract (next−11), save the result in its final place

[Figure: 1D RA lines L0–L11 plus the newly computed L12–L15 in the cyclic buffer; the lines added and subtracted in the current step produce the resulting 2D Running Average lines +0 … +3; the subtracted lines are freed after the current step]
Epilog
– For the 5 last results, ONLY subtraction is done.
Important cache-related note: a typical line length is ~400 floats => 1.6 KB, therefore the cyclic buffer of 16 lines is ~26 KB => less than the 32 KB L1 cache.
Most of the data manipulation is done in the L1 cache!
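The 2D pass can be sketched as a 1D RA per line feeding a cyclic pool of working lines, with the line-wise add/subtract recurrence on top. This scalar sketch uses a pool of k lines instead of the 16 working lines of the foils, computes out-of-place, and ignores borders; all names are hypothetical.

```c
#include <stdlib.h>

/* Scalar 1D RA of one line (stand-in for the 4-lines SSE routine). */
static void ra1d_line(const float *in, float *out, int len, int k)
{
    float s = 0.0f;
    for (int i = 0; i < k; i++) s += in[i];
    out[0] = s / (float)k;
    for (int i = 1; i + k <= len; i++) {
        s += in[i + k - 1] - in[i - 1];
        out[i] = s / (float)k;
    }
}

/* 2D RA: 1D-average each line into a cyclic pool, keep a per-column
 * running sum acc[], and reuse each freed pool slot for the next line. */
static void ra2d(const float *img, float *out, int nlines, int len, int k)
{
    int olen = len - k + 1;
    float *pool = (float *)malloc((size_t)k * olen * sizeof(float));
    float *acc  = (float *)calloc((size_t)olen, sizeof(float));
    for (int l = 0; l < k; l++) {               /* prolog: first k 1D RA lines */
        ra1d_line(img + (size_t)l * len, pool + (size_t)l * olen, len, k);
        for (int x = 0; x < olen; x++) acc[x] += pool[(size_t)l * olen + x];
    }
    for (int x = 0; x < olen; x++) out[x] = acc[x] / (float)k;
    for (int l = 1; l + k <= nlines; l++) {     /* main step over lines */
        int slot = (l - 1) % k;                 /* cyclic reuse of the freed line */
        for (int x = 0; x < olen; x++) acc[x] -= pool[(size_t)slot * olen + x];
        ra1d_line(img + (size_t)(l + k - 1) * len,
                  pool + (size_t)slot * olen, len, k);
        for (int x = 0; x < olen; x++) {
            acc[x] += pool[(size_t)slot * olen + x];
            out[(size_t)l * olen + x] = acc[x] / (float)k;
        }
    }
    free(pool); free(acc);
}
```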
3rd dimension completion

The 3rd dimension (in-place) computation is done after completion of the 2D computations for the whole stack of images (planes).

It is straightforward, as it is fully independent of the previously computed 2D results – in contrast to the 2D computation, which includes the 1D computation as an internal part.

In general, its logical flow is very similar to the 2D one. The important difference is that (because of the in-place operation) the results are first saved in the cyclic buffer, and are copied to their final place only after the corresponding line has been used for subtraction.
[Figure: source 2D RA lines L0–L11 are added into a pool of 12 working lines (a cyclic buffer) holding 3D RA lines; first the oldest line is subtracted, second the finished result is copied to its final place]
Parallelizing by OpenMP and benchmarking

To parallelize the above algorithm with OpenMP, 16 working lines are allocated for each thread. Using OpenMP is straightforward for 2 loops: (1) calling the 2D RA routine for each plane in the stack, and (2) calling the routine that computes a “stack” of 3D RA lines – the loop in the “y” direction (explained on the appropriate foil).
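Loop (1) is the easiest to parallelize: planes are independent, so an `omp parallel for` over the plane index suffices, with per-thread scratch allocated inside the per-plane routine (mirroring the per-thread working lines of the text). The per-plane body here is a reduced stand-in (a 1D RA along each row only), just to make the pattern runnable; names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the full 2D RA of the previous foils: only a 1D RA along
 * each row, enough to show the parallelization pattern. The tmp buffer
 * is allocated inside, so each thread gets its own scratch. */
static void ra2d_plane(float *plane, int w, int h, int k)
{
    float *tmp = (float *)malloc((size_t)w * sizeof(float));
    for (int y = 0; y < h; y++) {
        float *row = plane + (size_t)y * w;
        float s = 0.0f;
        for (int i = 0; i < k; i++) s += row[i];
        tmp[0] = s / (float)k;
        for (int i = 1; i + k <= w; i++) {
            s += row[i + k - 1] - row[i - 1];
            tmp[i] = s / (float)k;
        }
        memcpy(row, tmp, (size_t)(w - k + 1) * sizeof(float));
    }
    free(tmp);
}

/* Loop (1) of the text: planes are fully independent, so the plane loop
 * is simply shared across OpenMP threads. Without -fopenmp the pragma is
 * ignored and the code runs serially with the same result. */
static void ra3d_stack_2d_pass(float *stack, int nplanes, int w, int h, int k)
{
    #pragma omp parallel for schedule(static)
    for (int p = 0; p < nplanes; p++)
        ra2d_plane(stack + (size_t)p * (size_t)w * h, w, h, k);
}
```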
Results of benchmarking on several platforms:

                           Pentium-M    Merom        Conroe WS   WoodCrest WS   HPTN Bensley
                           T43 laptop   T61 laptop   2.4 GHz     2.66 GHz       2.8 GHz
                           1.86 GHz     2.0 GHz
SSE run time, msec         32           15           14          12.5           9.4
Speed-up Serial/SSE        2.5x         4x           3.2x        3.6x           4.2x
SSE+OpenMP run time, msec  NA           13           ?           ?              5.7
Speed-up SSE/SSE+OpenMP    NA           1.15x        ?           ?              1.6x
Conclusions:
– The SSE/serial speed-up for Penryn/Merom is ~4x, 30% better than for the “old” Pentium-M (2.5x)
– The absolute SSE run time for Merom (12–15 msec) is 2–2.5x better than for Pentium-M (32 msec), and >3x better for Penryn (9.4 msec)
– OpenMP scalability is very low; it seems that performance is restricted by FSB speed.