convol3d16bit

Copyright © 2007 Intel Corporation.
16bit 3D Convolution
Implementation SSE + OpenMP, Benchmarking on Penryn
Dr. Zvi Danovich, Senior Application Engineer
January 2008




Agenda

Mathematics of 3D convolution

Main idea of SSE implementation of 1D convolution

Basic routine of algorithm: 2D convolution – 1 line

Main routine of algorithm: 3D convolution – line by line

Adding OpenMP, benchmarking, conclusions


3D convolution – what is it?

3D convolution (with a 3x3x3 kernel K) is computed for each pixel P as

$$P^{3D}_{l_0,m_0,n_0} = \sum_{l=l_0-1}^{l_0+1} \; \sum_{m=m_0-1}^{m_0+1} \; \sum_{n=n_0-1}^{n_0+1} K_{l-l_0+1,\,m-m_0+1,\,n-n_0+1} \; p_{l,m,n}$$

where p are the source pixels and K are the convolution kernel values.

In other words, each new pixel is the sum of 27 products of source pixel values with the corresponding kernel values inside the kernel cube.

[Figure: the products Kp inside the kernel cube are summed up: P = sum]
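The 27-term sum above can be written directly as a triple loop. The following is only a scalar reference sketch, not the SSE routine from the slides. The function names `vox` and `conv3d_point`, the flat 27-element kernel array, the unsigned 16-bit pixel type and the slice-by-slice memory layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (not from the slides): pixel (l, m, n) is stored at
   index (l * ny + m) * nx + n, i.e. slice by slice, line by line. */
static size_t vox(size_t l, size_t m, size_t n, size_t ny, size_t nx)
{
    return (l * ny + m) * nx + n;
}

/* Direct 27-term sum for one interior pixel (l0, m0, n0).
   K is a flat 3x3x3 kernel: K[(a * 3 + b) * 3 + c]. */
int32_t conv3d_point(const uint16_t *p, size_t nz, size_t ny, size_t nx,
                     const int32_t K[27],
                     size_t l0, size_t m0, size_t n0)
{
    /* The kernel cube must fit inside the volume. */
    assert(l0 >= 1 && l0 + 1 < nz);
    assert(m0 >= 1 && m0 + 1 < ny);
    assert(n0 >= 1 && n0 + 1 < nx);

    int32_t sum = 0;
    for (size_t l = l0 - 1; l <= l0 + 1; ++l)
        for (size_t m = m0 - 1; m <= m0 + 1; ++m)
            for (size_t n = n0 - 1; n <= n0 + 1; ++n) {
                /* Kernel offsets run 0..2 along each axis. */
                size_t k = ((l - l0 + 1) * 3 + (m - m0 + 1)) * 3
                         + (n - n0 + 1);
                sum += K[k] * (int32_t)p[vox(l, m, n, ny, nx)];
            }
    return sum;
}
```

A routine like this is mainly useful as a correctness check for an optimized version: both should give the same value for every interior pixel.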


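Recombination from 1D convolutions

If the 1D convolution is defined as

$$P^{1D}_{n_0} = \sum_{n=n_0-1}^{n_0+1} K_{n-n_0+1} \, p_n = K_0 \, p_{n_0-1} + K_1 \, p_{n_0} + K_2 \, p_{n_0+1}$$

then the final line of the 3D convolution is

$$P^{3D}_{l_0,m_0} = \sum_{l=l_0-1}^{l_0+1} \; \sum_{m=m_0-1}^{m_0+1} P^{1D}_{l,m}$$

where $P^{1D}_{l,m}$ is the 1D convolution of line m in plane l, taken with the kernel row $K_{l-l_0+1,\,m-m_0+1}$. That is, the 3D convolution can be presented as a double sum of 9 1D convolutions: 3 planes with 3 lines in each plane.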
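The recombination can be checked with a small scalar sketch: one 3-tap 1D convolution per line, summed over 3 planes and 3 lines. The names `conv1d_point` and `conv3d_from_1d`, the flat kernel array and the memory layout (pixel (l, m, n) at index (l * ny + m) * nx + n) are assumptions for illustration, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1D convolution at position n0 of one line with one 3-tap kernel row:
   K[0] * p[n0-1] + K[1] * p[n0] + K[2] * p[n0+1]. */
int32_t conv1d_point(const uint16_t *line, const int32_t K[3], size_t n0)
{
    assert(n0 >= 1);
    return K[0] * (int32_t)line[n0 - 1]
         + K[1] * (int32_t)line[n0]
         + K[2] * (int32_t)line[n0 + 1];
}

/* 3D result as a double sum of nine 1D convolutions
   (3 planes x 3 lines). K is flat: K[(a * 3 + b) * 3 + c]. */
int32_t conv3d_from_1d(const uint16_t *p, size_t ny, size_t nx,
                       const int32_t K[27],
                       size_t l0, size_t m0, size_t n0)
{
    assert(l0 >= 1 && m0 >= 1 && m0 + 1 < ny && n0 + 1 < nx);

    int32_t sum = 0;
    for (size_t l = l0 - 1; l <= l0 + 1; ++l)
        for (size_t m = m0 - 1; m <= m0 + 1; ++m) {
            /* Source line m of plane l and the matching kernel row. */
            const uint16_t *line = p + (l * ny + m) * nx;
            const int32_t *row = K + ((l - l0 + 1) * 3 + (m - m0 + 1)) * 3;
            sum += conv1d_point(line, row, n0);
        }
    return sum;
}
```

Splitting the work this way lets one optimized 1D routine be reused for all nine lines, each with its own kernel row.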


Basic routine of algorithm: 2D convolution – 1 line

The main loop processes sequential EIGHTs of 16-bit pixels for 3 adjacent lines (unrolled inside 1 step). The 1D convolution (in 32-bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, forming the 2D convolution results.

To avoid "if"s in the main loop, the very first step is separated into a prolog part, which is simpler than the general step.

Below is the description of the computations for 1 line (of the 3 lines) in a general main loop step. It starts by loading an EIGHT of 16-bit source pixels and unpacking them into two 32-bit QUADs:

[Figure: Load EIGHT of 16-bit source pixels (p0 p1 p2 p3 p4 p5 p6 p7). One shuffle gives the equivalent of the first unpacked 32-bit QUAD (p0 p1 p2 p3); a second shuffle gives the equivalent of the second unpacked 32-bit QUAD (p4 p5 p6 p7).]
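The load-and-unpack step can be sketched with SSE2 intrinsics. The slide only says "Shuffle" and does not name the instruction, so this sketch swaps in `_mm_unpacklo_epi16` / `_mm_unpackhi_epi16` against a zero register, which assumes unsigned pixels (zero extension). The function name `unpack_eight` is hypothetical, and the scalar branch exists only so the sketch also builds on targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Load an EIGHT of 16-bit pixels and widen it into two 32-bit QUADs.
   Assumption: pixels are unsigned, so widening is zero extension. */
void unpack_eight(const uint16_t *src, uint32_t quad0[4], uint32_t quad1[4])
{
    assert(src && quad0 && quad1);
#if defined(__SSE2__)
    const __m128i zero  = _mm_setzero_si128();
    const __m128i eight = _mm_loadu_si128((const __m128i *)src);
    /* Interleaving with zero words turns each 16-bit pixel into a
       32-bit lane (little-endian: low word = pixel, high word = 0). */
    _mm_storeu_si128((__m128i *)quad0, _mm_unpacklo_epi16(eight, zero));
    _mm_storeu_si128((__m128i *)quad1, _mm_unpackhi_epi16(eight, zero));
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 4; ++i) {
        quad0[i] = src[i];
        quad1[i] = src[i + 4];
    }
#endif
}
```

In a real inner loop the two QUADs would stay in registers rather than being stored; the stores here just make the result easy to inspect.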


Basic routine of algorithm: 2D convolution – 1 line, finalizing

As already mentioned, each step processes and sums up data from 3 adjacent lines: it performs the computations from the previous slides for the 2 other lines as well, with the corresponding sets of kernel components.

The prolog step doesn't include the PREVIOUS sum computation and, naturally, doesn't save it.

The epilog step includes the computation and store of the very last 2D convolution QUAD, which is fully analogous to the PREVIOUS computation in a regular step.

Finally, the above routine builds ONE 32-bit line of 2D convolution result points.
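What the routine produces, one 32-bit output line built from 3 adjacent 16-bit source lines, can be sketched in scalar form. This sketch does not reproduce the prolog / main loop / epilog structure or the SSE code; it only shows the result such a routine is expected to deliver. The name `conv2d_line`, the flat 3x3 kernel plane and the choice to write zeros to the two border pixels are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build one 32-bit line of 2D convolution results from 3 adjacent
   source lines. K is one 3x3 kernel plane, flat: K[3 * j + k].
   Assumption: the two border pixels of the line are set to 0. */
void conv2d_line(const uint16_t *lines[3], const int32_t K[9],
                 int32_t *out, size_t nx)
{
    assert(nx >= 3);
    out[0] = 0;
    out[nx - 1] = 0;
    for (size_t n = 1; n + 1 < nx; ++n) {
        int32_t sum = 0;
        /* Sum of three 1D convolutions, one per source line. */
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                sum += K[3 * j + k] * (int32_t)lines[j][n + k - 1];
        out[n] = sum;
    }
}
```

Comparing the output of an SSE version against this loop, line by line, is one simple way to validate the prolog and epilog handling.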


Parallelizing by OpenMP and benchmarking

To parallelize the above algorithm with OpenMP over the external (slices) loop, 3 32-bit working lines are allocated for each thread.

Benchmarks with and without OpenMP on a 2-way HPTN (Harpertown) machine (8 cores):

3 runs (equivalent of a 3D gradient computation), SSE only vs. SSE+OpenMP:

Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~5.5, Serial/(SSE+OpenMP) = ~16.3

10 runs, SSE only vs. SSE+OpenMP:

Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~6.3, Serial/(SSE+OpenMP) = ~18.6

The SSE speed-up (~3x) is close to the theoretical limit of 4x for operations on vectors of four 32-bit values!

The additional OpenMP speed-up (5.5x-6.3x) brings the overall speed-up to 16.3x-18.6x!
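The parallelization over slices with per-thread working lines can be sketched as follows. This is a minimal sketch under stated assumptions, not the benchmarked code: `conv2d_line` is a scalar stand-in for the SSE routine, `conv3d_volume` is a hypothetical name, the layout is slice by slice, and border pixels of `dst` are left for the caller to initialize. Without OpenMP enabled at compile time the pragmas are ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar stand-in for the SSE 2D routine: convolves the 3 lines
   centred on `mid` (lines are nx pixels apart) with one 3x3 plane. */
static void conv2d_line(const uint16_t *mid, size_t nx,
                        const int32_t K[9], int32_t *out)
{
    out[0] = out[nx - 1] = 0;
    for (size_t n = 1; n + 1 < nx; ++n) {
        int32_t s = 0;
        for (int j = 0; j < 3; ++j) {
            const uint16_t *row = mid + ((ptrdiff_t)j - 1) * (ptrdiff_t)nx;
            for (int k = 0; k < 3; ++k)
                s += K[3 * j + k] * (int32_t)row[n + k - 1];
        }
        out[n] = s;
    }
}

/* 3D convolution of all interior lines; the slice loop is shared
   between threads, each with its own 3 working lines. */
void conv3d_volume(const uint16_t *p, size_t nz, size_t ny, size_t nx,
                   const int32_t K[27], int32_t *dst)
{
    assert(nz >= 3 && ny >= 3 && nx >= 3);
    #pragma omp parallel
    {
        /* 3 32-bit working lines, private to this thread. */
        int32_t *work = malloc(3 * nx * sizeof *work);
        assert(work != NULL);

        #pragma omp for schedule(static)
        for (long l = 1; l < (long)nz - 1; ++l) {
            for (size_t m = 1; m + 1 < ny; ++m) {
                /* One 2D line per plane l-1, l, l+1. */
                for (int a = 0; a < 3; ++a) {
                    const uint16_t *mid =
                        p + (((size_t)l + a - 1) * ny + m) * nx;
                    conv2d_line(mid, nx, K + 9 * a, work + a * nx);
                }
                /* Sum the 3 working lines into the 3D result line. */
                int32_t *out = dst + ((size_t)l * ny + m) * nx;
                for (size_t n = 0; n < nx; ++n)
                    out[n] = work[n] + work[nx + n] + work[2 * nx + n];
            }
        }
        free(work);
    }
}
```

Giving each thread its own working lines avoids any sharing of scratch memory between slices, so the slice loop needs no synchronization beyond the implicit barrier at the end of the loop.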