convol3d16bit
Copyright © 2007 Intel Corporation.
16bit 3D Convolution
Implementation SSE + OpenMP Benchmarking on Penryn
Dr. Zvi Danovich,
Senior Application Engineer
January 2008
Agenda
Mathematics of 3D convolution
Main idea of SSE implementation of 1D convolution
Basic routine of algorithm: 2D convolution – 1 line
Main routine of algorithm: 3D convolution – line by line
Adding OpenMP, benchmarking, conclusions
3D convolution – what is it?
3D convolution (with a 3x3x3 kernel K) is computed for each pixel P as

$$P^{3D}_{l_0,m_0,n_0} = \sum_{l=l_0-1}^{l_0+1}\ \sum_{m=m_0-1}^{m_0+1}\ \sum_{n=n_0-1}^{n_0+1} K_{l-l_0+1,\,m-m_0+1,\,n-n_0+1}\; p_{l,m,n}$$

where p are the source pixels and K are the convolution kernel values.
In other words, each new pixel is the sum of 27 products of source pixel values with the corresponding kernel values inside the kernel cube.
[Figure: kernel cube of K·p products summed into P]
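The definition above can be written as a plain scalar reference, which is useful for checking an optimized version. This is a minimal sketch, not the author's code: the function name `conv3d_at`, the row-major volume layout, the unsigned 16-bit pixel type and the 32-bit integer kernel are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct 3x3x3 convolution at one interior voxel (l0, m0, n0).
 * Assumed layout: src[(l * ny + m) * nx + n], where l is the slice index,
 * m the line index and n the pixel index inside a line.
 * Note: with large kernel values the 32-bit sum can overflow. */
int32_t conv3d_at(const uint16_t *src, size_t ny, size_t nx,
                  const int32_t K[3][3][3],
                  size_t l0, size_t m0, size_t n0)
{
    assert(l0 >= 1 && m0 >= 1 && n0 >= 1);  /* interior voxels only */
    int32_t sum = 0;
    for (size_t dl = 0; dl < 3; ++dl)
        for (size_t dm = 0; dm < 3; ++dm)
            for (size_t dn = 0; dn < 3; ++dn) {
                size_t l = l0 + dl - 1;
                size_t m = m0 + dm - 1;
                size_t n = n0 + dn - 1;
                /* K index is offset by +1, as in the formula above */
                sum += K[dl][dm][dn] * (int32_t)src[(l * ny + m) * nx + n];
            }
    return sum;
}
```

The caller is expected to pass only interior coordinates; border handling is not specified in the slides and is left out here.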
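Recombination from 1D convolutions
If 1D convolution is defined as

$$P^{1D}_{n_0} = \sum_{n=n_0-1}^{n_0+1} K_{n-n_0+1}\, p_n = K_0\, p_{n_0-1} + K_1\, p_{n_0} + K_2\, p_{n_0+1}$$

then the final line of 3D convolution is

$$P^{3D}_{l_0,m_0} = \sum_{l=l_0-1}^{l_0+1}\ \sum_{m=m_0-1}^{m_0+1} P^{1D}_{l,m}$$

i.e. 3D convolution can be presented as a double sum of 9 1D convolutions: 3 planes with 3 lines in each plane. Each $P^{1D}_{l,m}$ is computed along line m of plane l with the kernel row $K_{l-l_0+1,\,m-m_0+1,\,\cdot}$.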
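The recombination can be illustrated with a short scalar sketch: one 3-tap 1D convolution, applied to the 9 neighbouring lines, reproduces the 3D result. The names `conv1d_at` and `conv3d_from_1d`, the volume layout and the integer types are assumptions for illustration, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1D convolution at position n0 of one line with one 3-tap kernel row:
 * K[0]*p[n0-1] + K[1]*p[n0] + K[2]*p[n0+1]. */
int32_t conv1d_at(const uint16_t *line, const int32_t K[3], size_t n0)
{
    assert(n0 >= 1);
    return K[0] * (int32_t)line[n0 - 1]
         + K[1] * (int32_t)line[n0]
         + K[2] * (int32_t)line[n0 + 1];
}

/* 3D result at (l0, m0, n0) as the double sum of 9 1D convolutions:
 * 3 planes (dl) with 3 lines in each plane (dm).
 * Assumed layout: src[(l * ny + m) * nx + n]. */
int32_t conv3d_from_1d(const uint16_t *src, size_t ny, size_t nx,
                       const int32_t K[3][3][3],
                       size_t l0, size_t m0, size_t n0)
{
    assert(l0 >= 1 && m0 >= 1);
    int32_t sum = 0;
    for (size_t dl = 0; dl < 3; ++dl)
        for (size_t dm = 0; dm < 3; ++dm) {
            const uint16_t *line =
                src + ((l0 + dl - 1) * ny + (m0 + dm - 1)) * nx;
            /* each line uses its own kernel row K[dl][dm] */
            sum += conv1d_at(line, K[dl][dm], n0);
        }
    return sum;
}
```

Splitting the work this way is what makes a line-oriented vector implementation natural: the innermost operation always runs along contiguous pixels of a single line.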
Basic routine of algorithm: 2D convolution – 1 line
The main loop treats sequential EIGHTs of 16bit pixels for 3 adjacent lines (unrolled inside 1 step). 1D convolution (in 32bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, thereby forming the 2D convolution results.
To avoid using “if”s in the main loop, the very first step is separated into a prolog part, which is simpler than the general step.
Below is the description of the computations for 1 line (of the 3 lines) in a general main loop step.
It starts by loading an EIGHT of 16bit source pixels and unpacking them into 2 32bit QUADs:
[Figure: an EIGHT of 16bit source pixels (p0 p1 p2 p3 p4 p5 p6 p7) is loaded, then shuffled into the first unpacked 32bit QUAD (p0 p1 p2 p3) and the second unpacked 32bit QUAD (p4 p5 p6 p7)]
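The load-and-unpack step might look like the sketch below. The slide labels the operation "Shuffle" without naming the instruction; this sketch uses the SSE2 unpack intrinsics with a zero register, which is one common way to widen unsigned 16-bit values and not necessarily the author's choice. The function name `unpack_eight`, the unsigned pixel type and the scalar fallback for non-SSE2 builds are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Load an EIGHT of 16-bit pixels and widen it into two 32-bit QUADs:
 * quad_lo = p0..p3, quad_hi = p4..p7 (zero-extended). */
void unpack_eight(const uint16_t *src, uint32_t quad_lo[4], uint32_t quad_hi[4])
{
    assert(src && quad_lo && quad_hi);
#if defined(__SSE2__)
    const __m128i zero  = _mm_setzero_si128();
    const __m128i eight = _mm_loadu_si128((const __m128i *)src);
    /* Interleaving with zero places each 16-bit pixel in the low half
     * of a 32-bit lane, i.e. zero-extends it. */
    _mm_storeu_si128((__m128i *)quad_lo, _mm_unpacklo_epi16(eight, zero));
    _mm_storeu_si128((__m128i *)quad_hi, _mm_unpackhi_epi16(eight, zero));
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 4; ++i) {
        quad_lo[i] = src[i];
        quad_hi[i] = src[i + 4];
    }
#endif
}
```

In a real inner loop the two QUADs would stay in registers rather than being stored; the stores here only make the result easy to inspect.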
Basic routine of algorithm: 2D convolution – 1 line, finalizing
As already mentioned, each step treats and sums up data from 3 adjacent lines: it performs the computations from the previous slides for the 2 other lines, with the corresponding sets of kernel components.
The prolog step doesn’t include the PREVIOUS sum computation and therefore doesn’t save it.
The epilog step includes the very last 2D convolution QUAD computation and store, which is fully similar to the PREVIOUS computation in a regular step.
Finally, the above routine builds ONE 32bit line of 2D convolution resulting points.
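As a scalar stand-in for what this routine produces, the sketch below builds one 32-bit line of 2D convolution results from 3 adjacent 16-bit lines. It does not reproduce the SSE pipeline with its prolog and epilog steps; it only shows the expected output. The name `conv2d_line`, the zero-filled border pixels and the integer types are assumptions, since the slides do not state how the line ends are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds ONE 32-bit line of 2D convolution from 3 adjacent 16-bit lines.
 * K2d[r] is the 3-tap kernel row applied to lines[r].
 * Assumption: the first and last output pixels are set to 0. */
void conv2d_line(const uint16_t *const lines[3], const int32_t K2d[3][3],
                 size_t nx, int32_t *out)
{
    assert(nx >= 3);
    out[0] = 0;
    out[nx - 1] = 0;
    for (size_t n = 1; n + 1 < nx; ++n) {
        int32_t sum = 0;
        for (int r = 0; r < 3; ++r) {
            const uint16_t *p = lines[r];
            /* 1D convolution of line r at position n */
            sum += K2d[r][0] * (int32_t)p[n - 1]
                 + K2d[r][1] * (int32_t)p[n]
                 + K2d[r][2] * (int32_t)p[n + 1];
        }
        out[n] = sum;  /* sum over the 3 lines = 2D convolution */
    }
}
```

A vectorized version can be validated by comparing its output line against this one for interior pixels.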
Parallelizing by OpenMP and benchmarking
To parallelize the above algorithm with OpenMP over the external (slices) loop, 3 32bit working lines are allocated for each thread.
Benchmarks with and without OpenMP on a 2-way HPTN machine (8 cores), comparing "SSE only" against "SSE+OpenMP":
3 runs (equivalent of a 3D gradient computation): Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~5.5, Serial/(SSE+OpenMP) = ~16.3
10 runs: Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~6.3, Serial/(SSE+OpenMP) = ~18.6
The SSE speed-up (~3x) is close to the theoretical limit for vector operations on 4 32bit values.
The additional OpenMP speed-up (5.5x-6.3x) brings the overall speed-up to 16.3x-18.6x.
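The parallelization over the slices loop with 3 private 32-bit working lines per thread could be structured as in the sketch below. This is an assumption-laden illustration: the names `conv2d_line` and `conv3d_omp`, the volume layout, the interior-only processing and the scalar 2D line routine (standing in for the SSE one) are not from the slides, and the author's main routine may reuse 2D lines between slices differently. Without an OpenMP-enabled compiler the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One 32-bit line of 2D convolution from lines m-1, m, m+1 of one slice
 * (scalar stand-in for the SSE routine; interior pixels only). */
static void conv2d_line(const uint16_t *slice, size_t nx, size_t m,
                        const int32_t K2d[3][3], int32_t *out)
{
    for (size_t n = 1; n + 1 < nx; ++n) {
        int32_t sum = 0;
        for (size_t dm = 0; dm < 3; ++dm) {
            const uint16_t *p = slice + (m + dm - 1) * nx + n;
            sum += K2d[dm][0] * (int32_t)p[-1]
                 + K2d[dm][1] * (int32_t)p[0]
                 + K2d[dm][2] * (int32_t)p[1];
        }
        out[n] = sum;
    }
}

/* 3D convolution of the interior of an nz x ny x nx volume.
 * Returns 0 on success, -1 if a thread could not allocate working lines. */
int conv3d_omp(const uint16_t *src, int32_t *dst,
               size_t nz, size_t ny, size_t nx, const int32_t K[3][3][3])
{
    assert(nz >= 3 && ny >= 3 && nx >= 3);
    int failed = 0;
    #pragma omp parallel
    {
        /* 3 private 32-bit working lines per thread, one per kernel plane */
        int32_t *work = malloc(3 * nx * sizeof *work);
        if (!work) {
            #pragma omp atomic write
            failed = 1;
        }
        #pragma omp for schedule(static)
        for (long l = 1; l < (long)nz - 1; ++l) {   /* external (slices) loop */
            if (!work) continue;
            for (size_t m = 1; m + 1 < ny; ++m) {
                for (size_t dl = 0; dl < 3; ++dl)
                    conv2d_line(src + ((size_t)l + dl - 1) * ny * nx,
                                nx, m, K[dl], work + dl * nx);
                int32_t *out = dst + ((size_t)l * ny + m) * nx;
                for (size_t n = 1; n + 1 < nx; ++n)
                    out[n] = work[n] + work[nx + n] + work[2 * nx + n];
            }
        }
        free(work);
    }
    return failed ? -1 : 0;
}
```

Because each thread writes only to its own working lines and to distinct output slices, no synchronization is needed inside the loop.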