convol3d16bit


Copyright © 2007 Intel Corporation.


16bit 3D Convolution Implementation

SSE + OpenMP, Benchmarking on Penryn

Dr. Zvi Danovich,

Senior Application Engineer

January 2008




Agenda

Mathematics of 3D convolution

Main idea of SSE implementation of 1D convolution

Basic routine of algorithm: 2D convolution – 1 line

Main routine of algorithm: 3D convolution – line by line

Adding OpenMP, benchmarking, conclusions




3D convolution – what is it?

3D convolution (with a 3x3x3 kernel K) is computed for each pixel P as

$$P^{3D}_{n,m,l} = \sum_{l_0=l-1}^{l+1} \sum_{m_0=m-1}^{m+1} \sum_{n_0=n-1}^{n+1} p_{n_0,m_0,l_0} \, K_{n_0-n+1,\, m_0-m+1,\, l_0-l+1}$$

where p are the source pixels and K are the convolution kernel values.

In other words, each new pixel is the sum of 27 products of source pixel values with the corresponding kernel values inside the kernel cube.

[Figure: the kernel cube; each cell holds one product K·p, and P is the sum of all cells.]
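The definition above can be sketched as a plain scalar reference routine. This is only an illustration, not the SSE implementation from the slides: the function name `conv3d_pixel`, the memory layout (slice by slice, line by line) and the signed 32bit kernel type are assumptions, and the caller is assumed to pass an interior pixel so that all 27 neighbours exist.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: direct sum of 27 products for one pixel.
   Assumed layout: index = (l * height + m) * width + n
   (n = column, m = line, l = slice). K[dl][dm][dn] corresponds to
   K_{n0-n+1, m0-m+1, l0-l+1} with dn = n0-n+1, dm = m0-m+1, dl = l0-l+1.
   Kernel values are assumed small enough that the sum fits in 32 bits. */
int32_t conv3d_pixel(const uint16_t *p, size_t width, size_t height,
                     const int32_t K[3][3][3],
                     size_t n, size_t m, size_t l)
{
    int32_t sum = 0;
    for (size_t dl = 0; dl < 3; ++dl) {
        for (size_t dm = 0; dm < 3; ++dm) {
            for (size_t dn = 0; dn < 3; ++dn) {
                size_t l0 = l + dl - 1;   /* neighbour slice  */
                size_t m0 = m + dm - 1;   /* neighbour line   */
                size_t n0 = n + dn - 1;   /* neighbour column */
                sum += (int32_t)p[(l0 * height + m0) * width + n0]
                     * K[dl][dm][dn];
            }
        }
    }
    return sum;
}
```

A routine like this is mainly useful as a correctness check for an optimized version: both should produce the same value for every interior pixel.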




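Recombination from 1D convolutions

If 1D convolution is defined as

$$P^{1D}_{n} = \sum_{n_0=n-1}^{n+1} p_{n_0} K_{n_0-n+1} = p_{n-1} K_0 + p_n K_1 + p_{n+1} K_2$$

then the final line of 3D convolution is

$$P^{3D}_{m,l} = \sum_{l_0=l-1}^{l+1} \sum_{m_0=m-1}^{m+1} P^{1D}_{m_0,l_0}$$

where $P^{1D}_{m_0,l_0}$ is the 1D convolution of line $m_0$ of plane $l_0$ with the kernel row $K_{\cdot,\, m_0-m+1,\, l_0-l+1}$,

i.e. 3D convolution can be presented as a double sum of 9 1D convolutions: 3 planes with 3 lines in each plane.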
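The recombination can be illustrated with a short scalar sketch: one helper computes a 3-tap 1D convolution at one position, and the 3D value is the sum of nine such results (3 planes with 3 lines each). The names `conv1d_pixel` and `conv3d_from_1d`, the memory layout and the 32bit kernel type are assumptions made for this illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 1D convolution of one line with one kernel row,
   K[0]*p[n-1] + K[1]*p[n] + K[2]*p[n+1]; n must be an interior column. */
int32_t conv1d_pixel(const uint16_t *line, const int32_t K[3], size_t n)
{
    return K[0] * (int32_t)line[n - 1]
         + K[1] * (int32_t)line[n]
         + K[2] * (int32_t)line[n + 1];
}

/* 3D result as a double sum of nine 1D convolutions.
   Assumed layout: index = (l * height + m) * width + n. */
int32_t conv3d_from_1d(const uint16_t *p, size_t width, size_t height,
                       const int32_t K[3][3][3],
                       size_t n, size_t m, size_t l)
{
    int32_t sum = 0;
    for (size_t dl = 0; dl < 3; ++dl) {          /* 3 planes          */
        for (size_t dm = 0; dm < 3; ++dm) {      /* 3 lines per plane */
            const uint16_t *line =
                p + ((l + dl - 1) * height + (m + dm - 1)) * width;
            sum += conv1d_pixel(line, K[dl][dm], n);
        }
    }
    return sum;
}
```

The value of this form is that the inner operation works on whole lines, which is what makes it natural to vectorize along n.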











The main loop treats sequential EIGHTs of 16bit pixels for 3 adjacent lines (unrolled inside 1 step). 1D convolution (in 32bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, forming the 2D convolution results.

To avoid "if"s in the main loop, the very first step is separated into a prolog part, which is simpler than the general step.

Below is the description of the computations for 1 line (of the 3 lines) in a general main loop step.

It starts by loading an EIGHT of 16bit source pixels and unpacking them into 2 32bit QUADs:


[Diagram: Load EIGHT of 16bit source pixels (p0 p1 p2 p3 p4 p5 p6 p7). One shuffle produces a register equivalent to the first unpacked 32bit QUAD (p0 p1 p2 p3); a second shuffle produces a register equivalent to the second unpacked 32bit QUAD (p4 p5 p6 p7).]
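The load-and-unpack step can be sketched with SSE2 intrinsics as follows. This is not the shuffle sequence from the slide: it swaps in `_mm_unpacklo_epi16` / `_mm_unpackhi_epi16` against a zero register, which gives the same two QUADs under the assumption that the pixels are unsigned 16bit values. The function name `unpack_eight` is hypothetical. On SSE4.1 hardware such as Penryn, `_mm_cvtepu16_epi32` would be another way to widen the low four pixels.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: load an EIGHT of 16bit pixels and widen it to
   two QUADs of 32bit values. Assumes unsigned pixels (zero extension). */
void unpack_eight(const uint16_t *src, __m128i *quad_lo, __m128i *quad_hi)
{
    const __m128i zero  = _mm_setzero_si128();
    /* Unaligned load of p0..p7 (8 x 16 bit = 128 bit). */
    const __m128i eight = _mm_loadu_si128((const __m128i *)src);
    /* Interleave with zeros: p0,0,p1,0,... reads as 32bit p0,p1,p2,p3. */
    *quad_lo = _mm_unpacklo_epi16(eight, zero);
    /* Same for the upper half: 32bit p4,p5,p6,p7. */
    *quad_hi = _mm_unpackhi_epi16(eight, zero);
}
```

Widening to 32bit before multiplying avoids overflow of the 16bit products, at the price of processing only 4 values per 128bit register.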



Basic routine of algorithm: 2D convolution – 1 line, finalizing

As already mentioned, each step treats and sums up data from 3 adjacent lines: it performs the computations from the previous slides for the 2 other lines as well, with the corresponding sets of kernel components.

The prolog step does not include the PREVIOUS sum computation and, naturally, does not save it.

The epilog step includes the computation and store of the very last 2D convolution QUAD, which is fully similar to the PREVIOUS computation in a regular step.

Finally, the above routine builds ONE 32bit line of 2D convolution resulting points.
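What the routine produces can be shown with a scalar sketch: one 32bit output line obtained by summing the 1D convolutions of 3 adjacent 16bit source lines. The sketch deliberately leaves out the SSE main loop, prolog and epilog described above. The name `conv2d_line`, the parameter order and the choice to write zeros to the two border columns are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for the 2D line routine.
   prev/curr/next are 3 adjacent source lines of `width` pixels;
   K2d[0], K2d[1], K2d[2] are the kernel rows applied to them.
   Border columns are set to 0 (an assumption, not from the slides). */
void conv2d_line(const uint16_t *prev, const uint16_t *curr,
                 const uint16_t *next, const int32_t K2d[3][3],
                 int32_t *out, size_t width)
{
    if (width == 0)
        return;
    out[0] = 0;
    out[width - 1] = 0;
    for (size_t n = 1; n + 1 < width; ++n) {
        int32_t sum = 0;
        /* 1D convolution of each of the 3 lines, summed in 32bit. */
        sum += K2d[0][0] * prev[n - 1] + K2d[0][1] * prev[n] + K2d[0][2] * prev[n + 1];
        sum += K2d[1][0] * curr[n - 1] + K2d[1][1] * curr[n] + K2d[1][2] * curr[n + 1];
        sum += K2d[2][0] * next[n - 1] + K2d[2][1] * next[n] + K2d[2][2] * next[n + 1];
        out[n] = sum;
    }
}
```

Keeping the output line in 32bit form lets three such lines (one per plane) be added later without intermediate rounding or saturation.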




Parallelizing by OpenMP and benchmarking

To parallelize the above algorithm with OpenMP over the external (slices) loop, 3 32bit working lines are allocated for each thread.

Benchmarks with and without OpenMP (SSE only vs. SSE+OpenMP) on a 2-way HPTN machine (8 cores):

Runs                                        Serial/SSE   SSE/(SSE+OpenMP)   Serial/(SSE+OpenMP)
3 (equivalent of 3D gradient computation)   ~3           ~5.5               ~16.3
10                                          ~3           ~6.3               ~18.6

The speed-up of SSE (3x) is close to the theoretical limit of 4x for operations on vectors of 4 32bit elements!

The additional OpenMP speed-up (5.5x-6.3x) brings the overall speed-up to 16.3x-18.6x!
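The parallelization scheme can be sketched as below, under several assumptions: a scalar 2D line routine stands in for the SSE one, the output is kept in 32bit form, border pixels of the destination are left untouched, and all names (`conv3d_volume`, `conv2d_line`) and the memory layout are hypothetical. Each thread allocates its own 3 working lines of 32bit values, one per adjacent slice, and the slices loop is shared among threads. Without OpenMP enabled at compile time, the pragmas are ignored and the code runs serially.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar stand-in: one 32bit line of 2D convolution for line m of a slice. */
static void conv2d_line(const uint16_t *slice, size_t width, size_t m,
                        const int32_t K2d[3][3], int32_t *out)
{
    for (size_t n = 1; n + 1 < width; ++n) {
        int32_t sum = 0;
        for (size_t dm = 0; dm < 3; ++dm) {
            const uint16_t *line = slice + (m + dm - 1) * width;
            sum += K2d[dm][0] * line[n - 1]
                 + K2d[dm][1] * line[n]
                 + K2d[dm][2] * line[n + 1];
        }
        out[n] = sum;
    }
}

/* Returns 0 on success, -1 if a working-line allocation failed.
   Only interior pixels of dst are written. */
int conv3d_volume(const uint16_t *src, int32_t *dst,
                  size_t width, size_t height, size_t depth,
                  const int32_t K[3][3][3])
{
    int failed = 0;
    if (width < 3 || height < 3 || depth < 3)
        return 0;                      /* no interior pixels */

    #pragma omp parallel
    {
        /* 3 private 32bit working lines per thread. */
        int32_t *work = malloc(3 * width * sizeof *work);
        if (work == NULL) {
            #pragma omp atomic write
            failed = 1;
        }

        #pragma omp for
        for (long l = 1; l < (long)depth - 1; ++l) {
            if (work == NULL)
                continue;
            for (size_t m = 1; m + 1 < height; ++m) {
                /* One 2D line per adjacent slice, each with its kernel plane. */
                for (size_t dl = 0; dl < 3; ++dl)
                    conv2d_line(src + ((size_t)l + dl - 1) * height * width,
                                width, m, K[dl], work + dl * width);
                /* Sum of the 3 working lines gives the 3D line. */
                int32_t *out = dst + ((size_t)l * height + m) * width;
                for (size_t n = 1; n + 1 < width; ++n)
                    out[n] = work[n] + work[width + n] + work[2 * width + n];
            }
        }
        free(work);
    }
    return failed ? -1 : 0;
}
```

Because every thread writes only to its own working lines and to distinct slices of the destination, no locking is needed inside the loop.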
