Copyright © 2007 Intel Corporation.
SP 3D Running Average Implementation: SSE + OpenMP
Benchmarking on different platforms
Dr. Zvi Danovich, Senior Application Engineer
January 2008
Agenda
– What is 3D Running Average (RA)?
– From 1D to 3D RA implementation
– Basic SSE technique: AoS <=> SoA transforms
– 1D RA 4-lines SSE implementation
– 2nd dimension completion
– 3rd dimension completion
– Adding OpenMP, benchmarking, conclusions
3D Running Average (RA) – what is it?

3D RA is computed for each voxel V as the normalized sum inside a k×k×k cube (k is odd) located “around” the given voxel:

V(l,m,n) = (1/k³) · ∑ v(l+i, m+j, n+h),   i, j, h = −(k−1)/2 … (k−1)/2

where v are the source voxels.

In other words, 3D RA can be considered a 3D convolution with a kernel having all components equal to 1/(k×k×k).

[Figure: k×k×k cube of source voxels v surrounding the central voxel V]
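As a baseline for the SSE work that follows, the definition above can be written directly as a triple-nested sum. This is an illustrative scalar sketch, not the deck's code: the function name, the flat row-major layout, and the choice to skip border voxels are all assumptions.

```c
#include <stddef.h>

/* Naive 3D running average per the definition above: for each voxel,
 * average the k*k*k cube around it (k odd). Border voxels whose cube
 * would leave the volume are left untouched in this sketch. */
static void ra3d_naive(const float *v, float *V,
                       int nx, int ny, int nz, int k)
{
    int r = (k - 1) / 2;                      /* cube "radius" */
    float norm = 1.0f / ((float)k * k * k);
    for (int z = r; z < nz - r; z++)
        for (int y = r; y < ny - r; y++)
            for (int x = r; x < nx - r; x++) {
                float s = 0.0f;
                for (int i = -r; i <= r; i++)
                    for (int j = -r; j <= r; j++)
                        for (int h = -r; h <= r; h++)
                            s += v[(size_t)(z + i) * ny * nx
                                 + (size_t)(y + j) * nx + (x + h)];
                V[(size_t)z * ny * nx + (size_t)y * nx + x] = s * norm;
            }
}
```

Note the cost: k³ additions per voxel. The running-average formulation on the next foils removes this factor entirely.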
1D Running Average (RA)

Unlike 1D convolution, 1D RA can be computed with O(1) complexity per output element using the following approach:
– Prolog: compute the sum S of the first k voxels
– Main step: to compute the next sum S+1, the first member of the previous sum (v0) is subtracted and the next component (vk) is added
S = ∑(v)0,k-1
S+1 = ∑(v)1,k = S – v0 + vk
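The prolog/main-step recurrence can be sketched in a few lines of C. The function name, the output layout (n−k+1 results, no border handling), and the normalization inside the loop are assumptions for illustration.

```c
/* O(1)-per-element 1D running average: keep a sliding sum S and,
 * for each step, subtract the element leaving the window (v0) and
 * add the one entering it (vk). */
static void ra1d(const float *v, float *out, int n, int k)
{
    float s = 0.0f;
    for (int i = 0; i < k; i++)          /* prolog: sum of the first k */
        s += v[i];
    out[0] = s / (float)k;
    for (int i = 1; i + k <= n; i++) {   /* main step: S+1 = S - v0 + vk */
        s += v[i + k - 1] - v[i - 1];
        out[i] = s / (float)k;
    }
}
```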
Extending 1D Running Average toward 2D

Given a slice (plane) with all lines (Li) 1D-averaged, we can extend the averaging to 2D by the same approach:
– Prolog: compute the sum S of the first k lines
– Main step: to compute the next sum S+1, the first line of the previous sum (L0) is subtracted and the next line (Lk) is added

S = ∑(L)0,k-1
S+1 = ∑(L)1,k = S – L0 + Lk
Extending 2D Running Average toward 3D

Given a stack of planes with all planes (Pi) 2D-averaged, we can extend the averaging to 3D by the same approach:
– Prolog: compute the sum S of the first k planes
– Main step: to compute the next sum S+1, the first plane of the previous sum (P0) is subtracted and the next plane (Pk) is added

S = ∑(P)0,k-1
S+1 = ∑(P)1,k = S – P0 + Pk

[Figure: stack of planes indexed 0 … k]
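Treating whole lines (or planes) as the “elements” of the recurrence gives a direct sketch; here the add and subtract of a line are plain per-column loops, whereas the deck does them with SSE. Function name and layout are hypothetical, and borders are ignored.

```c
#include <stdlib.h>

/* 2D extension of the recurrence: lines play the role of voxels.
 * Each 1D-averaged line Li is added to / subtracted from a running
 * per-column sum s[], so every output line costs one whole-line add
 * and one whole-line subtract, independent of k. */
static void ra2d_lines(const float *lines, float *out,
                       int nlines, int len, int k)
{
    float *s = (float *)calloc((size_t)len, sizeof(float));
    float norm = 1.0f / (float)k;
    for (int l = 0; l < k; l++)                 /* prolog: sum the first k lines */
        for (int x = 0; x < len; x++)
            s[x] += lines[(size_t)l * len + x];
    for (int x = 0; x < len; x++)
        out[x] = s[x] * norm;
    for (int l = 1; l + k <= nlines; l++) {     /* main step: S+1 = S - L0 + Lk */
        for (int x = 0; x < len; x++)
            s[x] += lines[(size_t)(l + k - 1) * len + x]
                  - lines[(size_t)(l - 1) * len + x];
        for (int x = 0; x < len; x++)
            out[(size_t)l * len + x] = s[x] * norm;
    }
    free(s);
}
```

The 3D step is the same loop again with planes in place of lines.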
Array of Structures (AoS) => Structure of Arrays (SoA)

Why should we transform it to vectorize 1D Running Average? How can it be transformed?

The original “natural” serial data structure, AoS, is NOT enabled for SSE:

Ms = ∑(m)0,k-1
Ms+1 = ∑(m)1,k = Ms – m0 + mk

The “transposed” data structure, SoA, is ENABLED for SSE:

S = ∑(v)0,k-1
S+1 = ∑(v)1,k = S – v0 + vk

[Figure: four lines L0–L3 stored element-interleaved (AoS) versus the transposed layout in which each SSE register holds the same-index elements v0–v3 of all four lines (SoA)]
Array of Structures (AoS) => Structure of Arrays (SoA)

Presented below: transposition of 4 quads from 4 original lines into 4 SSE registers of x, y, z, w. Takes 12 SSE operations per 16 components.

[Figure: loadlo/loadhi half-loads fill four intermediate registers xy10 = (x0,y0,x1,y1), xy32 = (x2,y2,x3,y3), zw10 = (z0,w0,z1,w1), zw32 = (z2,w2,z3,w3); the FINAL SSE registers are then
shuffle(xy10, xy32, (2,0,2,0)) -> (x0,x1,x2,x3)
shuffle(xy10, xy32, (3,1,3,1)) -> (y0,y1,y2,y3)
shuffle(zw10, zw32, (2,0,2,0)) -> (z0,z1,z2,z3)
shuffle(zw10, zw32, (3,1,3,1)) -> (w0,w1,w2,w3)]
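The 12-operation transpose described above maps naturally onto SSE1 intrinsics: `_mm_loadl_pi`/`_mm_loadh_pi` for the 2-float half-loads and `_mm_shuffle_ps` for the final selects. A sketch, not the deck's exact code; the function name and pointer conventions are assumptions.

```c
#include <xmmintrin.h>

/* AoS=>SoA: transpose 4 consecutive 4-float quads from 4 source lines
 * into 4 SSE registers holding the x, y, z and w components.
 * 8 half-loads + 4 shuffles = 12 SSE operations per 16 components. */
static void aos_to_soa(const float *r0, const float *r1,
                       const float *r2, const float *r3,
                       __m128 *x, __m128 *y, __m128 *z, __m128 *w)
{
    /* intermediates: xy10 = (x0,y0,x1,y1), xy32 = (x2,y2,x3,y3), ... */
    __m128 xy10 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)r0), (const __m64 *)r1);
    __m128 xy32 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)r2), (const __m64 *)r3);
    __m128 zw10 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)(r0 + 2)), (const __m64 *)(r1 + 2));
    __m128 zw32 = _mm_loadh_pi(_mm_loadl_pi(_mm_setzero_ps(),
                    (const __m64 *)(r2 + 2)), (const __m64 *)(r3 + 2));
    /* final selects, masks as on the foil */
    *x = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(2, 0, 2, 0));
    *y = _mm_shuffle_ps(xy10, xy32, _MM_SHUFFLE(3, 1, 3, 1));
    *z = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(2, 0, 2, 0));
    *w = _mm_shuffle_ps(zw10, zw32, _MM_SHUFFLE(3, 1, 3, 1));
}
```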
Array of Structures (AoS) <= Structure of Arrays (SoA)

Presented below: the (inverse) transposition of the 4 SSE registers x, y, z, w into 4 memory locations. Takes 12 SSE operations per 16 components.

[Figure: unpack_lo/unpack_hi build intermediates xy10, xy32, zw10, zw32 from the original registers; shuffle(xy10, zw10, …)+store and shuffle(xy32, zw32, …)+store write the four quads back through the line pointers L0 ptr … L3 ptr]
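The inverse transform, with unpacks plus shuffle+store as on the foil, might look like the following sketch (hypothetical name; the store targets are assumed to be writable 4-float quads).

```c
#include <xmmintrin.h>

/* SoA=>AoS: scatter 4 SSE registers x, y, z, w back into 4 xyzw quads
 * in memory. 4 unpacks + 4 shuffles + 4 stores = 12 SSE operations. */
static void soa_to_aos(__m128 x, __m128 y, __m128 z, __m128 w,
                       float *r0, float *r1, float *r2, float *r3)
{
    __m128 xy10 = _mm_unpacklo_ps(x, y);   /* x0 y0 x1 y1 */
    __m128 zw10 = _mm_unpacklo_ps(z, w);   /* z0 w0 z1 w1 */
    __m128 xy32 = _mm_unpackhi_ps(x, y);   /* x2 y2 x3 y3 */
    __m128 zw32 = _mm_unpackhi_ps(z, w);   /* z2 w2 z3 w3 */
    _mm_storeu_ps(r0, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(r1, _mm_shuffle_ps(xy10, zw10, _MM_SHUFFLE(3, 2, 3, 2)));
    _mm_storeu_ps(r2, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(1, 0, 1, 0)));
    _mm_storeu_ps(r3, _mm_shuffle_ps(xy32, zw32, _MM_SHUFFLE(3, 2, 3, 2)));
}
```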
1D Running Average 4-lines SSE implementation (width = 11)
Cyclic SSE array buffer

The AoS=>SoA transform loads 4 SSE registers at a time. RA with width 11 needs to maintain 12 registers together; they can fit in 3 QUADs of registers, but can crawl into 4 QUADs as well.
[Figure: a stream of quads 0–15; a window of 12 registers either fits in 3 QUADs or crawls into 4 QUADs; the freed QUAD can be refilled by AoS=>SoA as the “next” QUAD]

So, 16 registers (4 QUADs) must be allocated and used in a cyclic way – when the last QUAD is freed, it is reloaded by AoS=>SoA with the next QUAD values.
1D Running Average 4-lines SSE implementation (width = 11)
Prolog

1. Load 12 SSE registers by AoS=>SoA
2. Sum up (accumulate) the 5 first
3. 4 times: sum up the next, keep the result in SSE registers (SoA form)
   – Save the QUAD of results to memory by AoS<=SoA
4. 2 times: sum up the next, keep the result in SSE registers (SoA form)
5. 1 time: sum up the next, subtract the first, keep the result in an SSE register

Here all 12 loaded QUADs are used: 5+4+2+1, and 3 result registers are NOT saved.
[Figure: quads v0–v11; the first 5 are accumulated, then results r are produced by further accumulation; one QUAD of results is saved to memory by AoS<=SoA; the very first quad is subtracted at the end of the prolog; 3 result registers are NOT saved in the prolog]
1D Running Average 4-lines SSE implementation (width = 11)
Main step & epilog

Main step
1. Load 4 SSE registers by AoS=>SoA, reusing the 4 “last” registers of the cyclic buffer
2. Sum up the next, subtract (next−11), keep the result in an SSE register – it will be the 4th
   – Save the QUAD of results to memory by AoS<=SoA
3. 3 times: sum up the next, subtract the first, keep the result in an SSE register

During the step: 4 new SSE registers are loaded, 4 results (3 old and 1 new) are saved to memory, and 3 result registers are NOT saved.
[Figure: quads v0–v15 in the cyclic buffer; the 4 quads added and the 4 subtracted in the current step are marked; results r (3 from the previous step plus 1 new) are saved to memory by AoS<=SoA, 3 are NOT saved in the current step; the subtracted quads are freed after the current step]
Epilog
For the 5 last results, ONLY subtraction is done.
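Once the data is in SoA form, averaging 4 lines at once is just the scalar sliding window lifted to `__m128`: each register holds the same-index element of all four lines. This sketch keeps all transposed quads in a plain array rather than the 16-register cyclic buffer of the foils, so it shows the arithmetic, not the register management; names and the lack of border handling are assumptions.

```c
#include <xmmintrin.h>

/* 1D RA of width k over 4 lines simultaneously: q[i] holds element i of
 * all four lines (SoA), so one vector add/subtract per step advances the
 * running sum of all four lines at once. */
static void ra1d_4lines(const __m128 *q, __m128 *out, int n, int k)
{
    __m128 norm = _mm_set1_ps(1.0f / (float)k);
    __m128 s = _mm_setzero_ps();
    for (int i = 0; i < k; i++)            /* prolog: accumulate first k quads */
        s = _mm_add_ps(s, q[i]);
    out[0] = _mm_mul_ps(s, norm);
    for (int i = 1; i + k <= n; i++) {     /* main step: add next, subtract first */
        s = _mm_add_ps(s, _mm_sub_ps(q[i + k - 1], q[i - 1]));
        out[i] = _mm_mul_ps(s, norm);
    }
}
```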
2nd dimension completion
2D RA: based on the 4-lines 1D SSE implementation – prolog

The logical flow of 2D RA (an in-place routine) is very similar to the 1D RA 4-lines implementation. To hold the intermediate 1D RA lines we use 16 working lines – the analog of the 16 SSE registers.

Prolog
1. Compute 12 1D RA lines by 3 calls to the 1D RA 4-lines routine
2. Sum up (accumulate) the 5 first in working memory
3. 6 times: sum up the next line, save the result in its final place
4. 1 time: sum up the next line, subtract the first line, save the result in its final place

Here all 12 1D RA lines are used: 5+6+1.
[Figure: 1D RA lines L0–L11 are accumulated; the resulting 2D Running Average lines L0–L6 are saved; the last addition also subtracts the very first line, which is subtracted at the end of the prolog]
2nd dimension completion
2D RA: based on the 4-lines 1D SSE implementation – main step & epilog

Main step
– Compute 4 1D RA lines by calling the 1D RA 4-lines routine, outputting into the 4 “last” lines of the working-lines cyclic buffer
– 4 times: sum up the next, subtract (next−11), save the result in its final place

[Figure: 1D RA lines L0–L11 plus the newly computed L12–L15 in the cyclic buffer; the lines added and subtracted in the current step produce the resulting 2D Running Average lines +0 … +3; the subtracted lines are freed after the current step]
Epilog
– For the 5 last results, ONLY subtraction is done.
Important cache-related note: a typical line length is ~400 floats => 1.6 KB, therefore the cyclic buffer of 16 lines is ~26 KB => less than the 32 KB L1 cache.
Most of the data manipulation is done in the L1 cache!
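The 2D pass can be sketched as a 1D RA per line feeding a cyclic pool of working lines, with the line-wise add/subtract recurrence on top. This scalar sketch uses a pool of k lines instead of the 16 working lines of the foils, computes out-of-place, and ignores borders; all names are hypothetical.

```c
#include <stdlib.h>

/* Scalar 1D RA of one line (stand-in for the 4-lines SSE routine). */
static void ra1d_line(const float *in, float *out, int len, int k)
{
    float s = 0.0f;
    for (int i = 0; i < k; i++) s += in[i];
    out[0] = s / (float)k;
    for (int i = 1; i + k <= len; i++) {
        s += in[i + k - 1] - in[i - 1];
        out[i] = s / (float)k;
    }
}

/* 2D RA: 1D-average each line into a cyclic pool, keep a per-column
 * running sum acc[], and reuse each freed pool slot for the next line. */
static void ra2d(const float *img, float *out, int nlines, int len, int k)
{
    int olen = len - k + 1;
    float *pool = (float *)malloc((size_t)k * olen * sizeof(float));
    float *acc  = (float *)calloc((size_t)olen, sizeof(float));
    for (int l = 0; l < k; l++) {               /* prolog: first k 1D RA lines */
        ra1d_line(img + (size_t)l * len, pool + (size_t)l * olen, len, k);
        for (int x = 0; x < olen; x++) acc[x] += pool[(size_t)l * olen + x];
    }
    for (int x = 0; x < olen; x++) out[x] = acc[x] / (float)k;
    for (int l = 1; l + k <= nlines; l++) {     /* main step over lines */
        int slot = (l - 1) % k;                 /* cyclic reuse of the freed line */
        for (int x = 0; x < olen; x++) acc[x] -= pool[(size_t)slot * olen + x];
        ra1d_line(img + (size_t)(l + k - 1) * len,
                  pool + (size_t)slot * olen, len, k);
        for (int x = 0; x < olen; x++) {
            acc[x] += pool[(size_t)slot * olen + x];
            out[(size_t)l * olen + x] = acc[x] / (float)k;
        }
    }
    free(pool); free(acc);
}
```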
3rd dimension completion

The 3rd dimension (in-place) computation is done after completion of the 2D computations for the whole stack of images (planes).

It is straightforward, as it is fully independent of the previously computed 2D results – in contrast to the 2D computation, which includes the 1D computation as an internal part.

In general, its logical flow is very similar to the 2D one. The important difference is that (because of the in-place operation) the results are first saved in the cyclic buffer, and are copied to their final place only after the corresponding line has been used for subtraction.
[Figure: source 2D RA lines L0–L11 are added into a pool of 12 working lines (a cyclic buffer) holding 3D RA lines; first the oldest line is subtracted, second the finished result is copied to its final place]
Parallelizing by OpenMP and benchmarking

To parallelize the above algorithm with OpenMP, 16 working lines are allocated for each thread. Using OpenMP is straightforward for 2 loops: (1) calling the 2D RA routine for each plane in the stack, and (2) calling the routine that computes a “stack” of 3D RA lines – the loop in the “y” direction (explained on the appropriate foil).
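Loop (1) is the easiest to parallelize: planes are independent, so an `omp parallel for` over the plane index suffices, with per-thread scratch allocated inside the per-plane routine (mirroring the per-thread working lines of the text). The per-plane body here is a reduced stand-in (a 1D RA along each row only), just to make the pattern runnable; names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the full 2D RA of the previous foils: only a 1D RA along
 * each row, enough to show the parallelization pattern. The tmp buffer
 * is allocated inside, so each thread gets its own scratch. */
static void ra2d_plane(float *plane, int w, int h, int k)
{
    float *tmp = (float *)malloc((size_t)w * sizeof(float));
    for (int y = 0; y < h; y++) {
        float *row = plane + (size_t)y * w;
        float s = 0.0f;
        for (int i = 0; i < k; i++) s += row[i];
        tmp[0] = s / (float)k;
        for (int i = 1; i + k <= w; i++) {
            s += row[i + k - 1] - row[i - 1];
            tmp[i] = s / (float)k;
        }
        memcpy(row, tmp, (size_t)(w - k + 1) * sizeof(float));
    }
    free(tmp);
}

/* Loop (1) of the text: planes are fully independent, so the plane loop
 * is simply shared across OpenMP threads. Without -fopenmp the pragma is
 * ignored and the code runs serially with the same result. */
static void ra3d_stack_2d_pass(float *stack, int nplanes, int w, int h, int k)
{
    #pragma omp parallel for schedule(static)
    for (int p = 0; p < nplanes; p++)
        ra2d_plane(stack + (size_t)p * (size_t)w * h, w, h, k);
}
```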
Results of benchmarking on several platforms:

                           Pentium-M    Merom        Conroe WS   WoodCrest WS   HPTN Bensley
                           T43 laptop   T61 laptop   2.4 GHz     2.66 GHz       2.8 GHz
                           1.86 GHz     2.0 GHz
SSE run time, msec         32           15           14          12.5           9.4
Speed-up Serial/SSE        2.5x         4x           3.2x        3.6x           4.2x
SSE+OpenMP run time, msec  NA           13           ?           ?              5.7
Speed-up SSE/SSE+OpenMP    NA           1.15x        ?           ?              1.6x
Conclusions:
– The SSE/serial speed-up for Penryn/Merom is ~4x, 30% better than for the “old” Pentium-M (2.5x)
– The absolute SSE run time for Merom (12–15 msec) is 2–2.5x better than for Pentium-M (32 msec), and >3x better for Penryn (9.4 msec)
– OpenMP scalability is very low; it seems that performance is restricted by FSB speed.