Evaluation of FPGAs resurgence for hardware acceleration applied to computed tomography
3D Tomography back-projection parallelization on FPGAs using OpenCL
Presented by: Maxime MARTELLI, 1st-year PhD student
L2S, SATIE, TSA
2017 GPU Winter School, Grenoble, FR
CONTEXT
End of Moore’s law announced for 2021
Architecture-Algorithm Adequacy
- Granular hardware specialization
- Processors will offload specific processing to a suited architecture
Proliferation of software-based FPGA design tools
HYPOTHESIS
The idea
Does the progress of HLS tools mean a resurgence of FPGAs for computed tomography?
With the rise of Accelerator-as-a-Service (AaaS), what is the future landscape for FPGAs?
Summary
I. What is OpenCL?
II. Why use HLS on FPGAs?
III. Use case highlight
IV. OpenCL memory model
V. Custom implementations
VI. Conclusion and perspectives
I. WHAT IS OPENCL?
OpenCL basics
• Open, royalty-free standard for parallel, compute-intensive application development
• Initiated by Apple, specification maintained by the Khronos Group
• Supports multiple device classes: CPUs, GPUs, DSPs, Cell, etc.
• First released in December 2008
• Specification currently at version 2.0
• SDKs and tools are provided by compliant device vendors
CUDA basics
• Proprietary technology for GPGPU programming from Nvidia
• Not just an API and tools, but the name for the whole architecture
• Targets Nvidia hardware and GPUs only
• First SDK released in February 2007
• SDK and tools available for 32- and 64-bit Windows, Linux and Mac OS
• Tools and SDK are available for free from Nvidia
Basics compared
| | CUDA | OpenCL |
|---|---|---|
| What it is | HW architecture, programming language, API, SDK and tools | Open API and language specification |
| Proprietary or open technology | Proprietary | Open and royalty-free |
| When introduced | Q4 2006 | Q4 2008 |
| SDK vendor | Nvidia | Implementation vendors |
| Free SDK | Yes | Depends on vendor |
| Heterogeneous device support | No, just Nvidia GPUs | Yes (Apple, Nvidia, AMD, IBM, Intel, …) |
OpenCL Memory Architecture
CUDA Memory Architecture
OpenCL Execution model
II. WHY USE HLS ON FPGAS?
Field Programmable Gate Array (FPGA)
Programmable Switch Fabric
Source: Intel
CPU instruction mapping
Source: Intel
CPU execution path (1)
Source: Intel
CPU execution path (2)
Source: Intel
CPU vs FPGA execution
Source: Intel
Advantages of FPGA HLS
• Custom data-path that matches your algorithms
• Uses exactly what you need (operation, data width, memory configuration, …)
• Timing closure and reduced power consumption
• Much easier to program than VHDL
III. USE CASE HIGHLIGHT
Brief history
In 2004, FPGAs were widely used in tomography
For 10 years now, GPUs have dominated the field
With the evolution of HLS tools, new interest in FPGAs is emerging
3D Computed Tomography Projection
Back-projection algorithm
Memory bound algorithm
[unrecoverable ratio label] = 3.5
Density calculation:
$d(c) = \int_{\varphi=0}^{2\pi} \mathrm{sino}\big(u(\varphi, c),\, v(\varphi, c),\, \varphi\big)\, w(\varphi, c)^{2}\, d\varphi$
Input: α[dimϕ], β[dimϕ], sinogram[dimU*dimV*dimϕ]
Output: volume[dimX, dimY, dimZ]

For z = 0 to dimZ - 1
    For y = 0 to dimY - 1
        For x = 0 to dimX - 1
            voxelsum = 0
            For ϕ = 0 to dimϕ - 1
                Calculate (U, V) from α[ϕ] and β[ϕ]
                voxelsum += sinogram[U, V, ϕ]
            volume[x,y,z] = voxelsum
Massively parallel
256³ voxels
256 projection angles
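The loop nest on this slide can be sketched in plain Python. This is a minimal illustration, not the code from the talk: the slide does not say how (U, V) is derived from α[ϕ] and β[ϕ], so the sketch assumes a hypothetical parallel-beam geometry in which α and β hold the cosine and sine of each angle, V equals z, and the sinogram is a flat list. The function name `backproject` and its parameters are likewise assumptions.

```python
import math

def backproject(sinogram, alpha, beta, dim_x, dim_y, dim_z, dim_u, dim_v):
    """Naive voxel-driven back-projection following the slide's loop nest.

    sinogram is a flat list indexed as [(phi * dim_v + v) * dim_u + u].
    The (U, V) mapping below is an assumed parallel-beam geometry.
    """
    volume = [0.0] * (dim_x * dim_y * dim_z)
    cx, cy = (dim_x - 1) / 2.0, (dim_y - 1) / 2.0
    cu = (dim_u - 1) / 2.0
    for z in range(dim_z):
        for y in range(dim_y):
            for x in range(dim_x):
                voxel_sum = 0.0
                for phi in range(len(alpha)):
                    # Assumption: alpha[phi] = cos(angle), beta[phi] = sin(angle).
                    u = int(round(alpha[phi] * (x - cx) + beta[phi] * (y - cy) + cu))
                    v = z
                    if 0 <= u < dim_u and 0 <= v < dim_v:
                        voxel_sum += sinogram[(phi * dim_v + v) * dim_u + u]
                volume[(z * dim_y + y) * dim_x + x] = voxel_sum
    return volume

# Tiny example: 4x4x4 volume, 4 angles, constant projections.
dim, n_phi = 4, 4
angles = [2.0 * math.pi * k / n_phi for k in range(n_phi)]
alpha = [math.cos(a) for a in angles]
beta = [math.sin(a) for a in angles]
sinogram = [1.0] * (dim * dim * n_phi)
volume = backproject(sinogram, alpha, beta, dim, dim, dim, dim, dim)
```

Each voxel performs one sinogram read per angle and very little arithmetic, which is why the slide labels the algorithm memory bound.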
Back-projection results on FPGA
Main contributions
01 Benchmark the different memory structures
02 Implement algorithm-focused optimizations
03 Assess OpenCL code optimization for FPGA
IV. OPENCL MEMORY MODEL
Memory structure latency on an Altera Cyclone V

| Memory | Mean latency (cycles) |
|---|---|
| Global | 240 |
| Constant | 10 |
| Local | 15 |
| Private | 3 |

Custom benchmark (random reads)
Tricky situations for calculation (LSU embedded cache)
V. CUSTOM IMPLEMENTATIONS
OpenCL work-group enqueueing mechanism
Data parallelism: NDRange
Task parallelism: Single Work Item (SWI)
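The difference between the two enqueueing styles can be emulated sequentially in Python. This is only a sketch of the programming model: the function names are hypothetical, `gid` stands in for OpenCL's `get_global_id(0)`, and nothing here runs in parallel.

```python
def kernel_ndrange(gid, src, dst):
    # Body executed once per work-item; gid plays the role of get_global_id(0).
    dst[gid] = 2 * src[gid]

def launch_ndrange(global_size, src, dst):
    # In the NDRange style, the runtime (not the kernel) iterates over the index space.
    for gid in range(global_size):
        kernel_ndrange(gid, src, dst)

def kernel_single_work_item(n, src, dst):
    # In the SWI style, one work-item owns the whole loop,
    # which an HLS compiler can then pipeline.
    for i in range(n):
        dst[i] = 2 * src[i]

src = list(range(8))
out_nd = [0] * 8
out_swi = [0] * 8
launch_ndrange(8, src, out_nd)
kernel_single_work_item(8, src, out_swi)
```

Both styles compute the same result; what changes is whether the loop lives in the runtime's index space or inside the kernel body.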
Experiment setup: DE1-SoC
- Dual-core ARM Cortex-A9 processor and FPGA fabric within an Altera Cyclone V
- 1 GB of DDR3 memory
- Max FPGA frequency: 205 MHz
- Intel FPGA SDK for OpenCL 16.0
Implementation 1: Shift-Register Pattern (TP)
Input: α[dimϕ], β[dimϕ], sinogram[dimU*dimV*dimϕ]
Output: volume[dimX, dimY, dimZ]

For ϕ = 0 to dimϕ - 1
    SRP[ϕ] = (α[ϕ], β[ϕ])                      ◁ SRP initialization
For z = 0 to dimZ - 1
    For y = 0 to dimY - 1
        For x = 0 to dimX - 1
            voxelsum = 0
            #pragma unroll                      ◁ Task parallelism
            For ϕ = 0 to dimϕ - 1
                SRP[dimϕ - 1] = SRP[0]          |
                For i = 0 to dimϕ - 2           |-- SRP implementation
                    SRP[i] = SRP[i+1]           |
                Calculate (U, V) from α[ϕ] and β[ϕ]
                voxelsum += sinogram[U, V, ϕ]
            volume[x,y,z] = voxelsum
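The shift-register rotation can be sketched in Python as follows. This is an illustration under stated assumptions, not the kernel from the talk: the helper `rotate_shift_register` is hypothetical, the coefficient pairs are stand-in values, and the register is given one spare slot (dimϕ + 1 entries) so that the rotation loses no entry, whereas the slide's pseudocode uses dimϕ entries.

```python
def rotate_shift_register(srp, new_value):
    """Shift every entry one slot toward index 0 and feed new_value at the end."""
    n = len(srp) - 1
    srp[n] = new_value          # write into the spare last slot
    for i in range(n):          # fixed trip count, so an HLS tool can unroll it
        srp[i] = srp[i + 1]

dim_phi = 4
coeffs = [(10 + k, 20 + k) for k in range(dim_phi)]  # stand-ins for (alpha, beta)
srp = coeffs + [None]           # dim_phi entries plus one spare slot (assumption)

seen = []
for _ in range(dim_phi):
    seen.append(srp[0])         # the head is the only entry the datapath reads
    rotate_shift_register(srp, srp[0])
```

After dimϕ rotations every coefficient pair has passed through index 0 exactly once and the register is back in its initial order, so the datapath only ever needs a fixed read port on the head entry.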
Implementation 2: Memory pre-fetching (DP)
Input: const α[dimϕ], const β[dimϕ], sinogram[dimU*dimV*dimϕ]
Output: volume[dimX, dimY, dimZ]

Local int local_sinogram[Xoff * Yoff]
/* Recovery of work-item characteristics (x, y, z) */
voxelsum = 0
For ϕ = 0 to dimϕ - 1
    /* Calculate Un, Vn coordinates */
    /* Dispatch min, max coordinates computation between local work-items */
    barrier(CLK_LOCAL_MEM_FENCE)
    /* Global sinogram fetching by local work-items */
    barrier(CLK_LOCAL_MEM_FENCE)
    voxelsum += local_sinogram[localU, localV]
volume[x,y,z] = voxelsum
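The pre-fetching steps for a single angle and a single work-group can be sketched in Python. This is a simplified, sequential illustration: the function `prefetch_accumulate`, the 1D detector row, and the linear mapping `a * x + offset` are all assumptions introduced here, and the two barriers of the pseudocode appear only as comments.

```python
import math

def prefetch_accumulate(group_xs, sino_row, a, offset):
    """One angle, one work-group: stage the needed window, then read locally."""
    # Step 1: each work-item computes its detector coordinate (hypothetical 1D mapping).
    us = [int(math.floor(a * x + offset)) for x in group_xs]
    # Step 2: the group agrees on the window [u_min, u_max] (first barrier).
    u_min, u_max = min(us), max(us)
    # Step 3: cooperative copy from the global row into the local buffer (second barrier).
    local_sino = sino_row[u_min:u_max + 1]
    # Step 4: each work-item reads its sample from the local buffer only.
    return [local_sino[u - u_min] for u in us]

sino_row = [float(10 * u) for u in range(16)]
group_xs = [4, 5, 6, 7]
contrib = prefetch_accumulate(group_xs, sino_row, 0.5, 3.0)
direct = [sino_row[int(math.floor(0.5 * x + 3.0))] for x in group_xs]
```

Neighbouring voxels project onto neighbouring detector cells, so the staged window is small and each global element is fetched once per group rather than once per work-item.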
Kernel implementations on Cyclone V SoC

| Kernel | Raw execution time (s) | Logic utilization (%) |
|---|---|---|
| SWI+Naive | 222.9 | 49 |
| SWI+SRP | 67.5 | 36 |
| ND+Naive | 32.26 | 55 |
| ND+2CU | 16.9 | 96 |
| ND+MF | 31.3 | 40 |
| ND+Backbone | 30.8 | 21 |

Key Points
- ND+2CU: linear extrapolation model verification
- ND+Backbone: irreducible logic utilization
- ND+MF uses less logic than naive NDRange
- SWI+SRP uses less logic and is faster than naive SWI
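Kernel implementations on Cyclone V SoC

| Kernel | Normalized execution time (s) | Frequency (MHz) |
|---|---|---|
| SWI+Naive | 109.2 | 68 |
| SWI+SRP | 24.3 | 112 |
| ND+Naive | 17.7 | 140 |
| ND+2CU | 16.2 | 140 |
| ND+MF | 12.5 | 140 |
| ND+Backbone | 6.47 | 140 |

Speedup from SWI+Naive to ND+MF: 8.74
Matching VHDL FPGA implementations for ND+MF: computation rate of 137 M“voxel”/s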
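The figures on the two kernel slides can be cross-checked with a few lines of Python. The slides do not state how the normalized time is obtained; the reported values are consistent with raw time multiplied by logic utilization, which this sketch assumes. It also assumes that the computation rate counts one "voxel" update per voxel and per angle (256³ × 256 updates) over the raw ND+MF time.

```python
# Values read from the slides, in seconds and percent.
raw_s = {"SWI+Naive": 222.9, "SWI+SRP": 67.5, "ND+Naive": 32.26,
         "ND+2CU": 16.9, "ND+MF": 31.3, "ND+Backbone": 30.8}
logic_pct = {"SWI+Naive": 49, "SWI+SRP": 36, "ND+Naive": 55,
             "ND+2CU": 96, "ND+MF": 40, "ND+Backbone": 21}

# Assumed normalization: raw time weighted by the fraction of logic used.
normalized_s = {k: raw_s[k] * logic_pct[k] / 100.0 for k in raw_s}

# Speedup between the slowest and the memory-prefetching kernel.
speedup = normalized_s["SWI+Naive"] / normalized_s["ND+MF"]

# Assumed definition of the rate: voxel updates per second, in millions.
updates = 256 ** 3 * 256
rate_m_per_s = updates / raw_s["ND+MF"] / 1e6
```

Under these assumptions the sketch lands close to the slide's 8.74 speedup and 137 M "voxel"/s, with small differences that come from the rounding of the published figures.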
GPU vs FPGA with OpenCL

| Device | Power (W) | Execution time (ms) | Energy (mWh) |
|---|---|---|---|
| Titan X Pascal (GPU) | 250 | 12 | 0.83 |
| Jetson TK2 (GPU) | 15 | 94 | 0.39 |
| Arria 10 FPGA | 2.27 | 991 | 0.63 |

- An embedded GPU is more energy efficient
- Algorithm inadequacy implies longer FPGA execution time
- Low FPGA power consumption
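The energy figures follow from power multiplied by execution time. A short Python check, using the power and time values from the slide (the helper name `energy_mwh` is an assumption for illustration):

```python
# (power in W, execution time in ms) as reported on the slide.
devices = {
    "Titan X Pascal (GPU)": (250.0, 12.0),
    "Jetson TK2 (GPU)": (15.0, 94.0),
    "Arria 10 FPGA": (2.27, 991.0),
}

def energy_mwh(power_w, time_ms):
    """Energy in milliwatt-hours for a run at constant power."""
    joules = power_w * time_ms / 1000.0
    return joules / 3.6  # 1 mWh = 3.6 J

energy = {name: energy_mwh(p, t) for name, (p, t) in devices.items()}
lowest = min(energy, key=energy.get)
```

The computed values agree with the slide to within rounding, and the embedded GPU comes out lowest: the FPGA draws far less power, but its much longer execution time outweighs that advantage.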
VI. CONCLUSION AND PERSPECTIVES
CONCLUSION
FPGA (2009) = FPGA OpenCL
- Intel SDK guarantees one “voxel” computation per clock
Efficient tool for software developers
- Achieved speedup of 8.74 with little hardware knowledge
FPGA < Embedded GPU
- FPGAs still fall short compared to embedded GPUs (performance and power) for this family of CT algorithms
Room for improvement
- By reducing kernel footprint or increasing kernel frequency
PERSPECTIVES
Adapted Use-Case
- Many algorithms, like radar clutter computation, are well adapted to FPGAs’ strengths
- Old algorithms not fit for GPUs can re-emerge
Computed Tomography with FPGA?
Next?
- Bigger card
- Xilinx SDx evaluation
- New adapted algorithm
THANK YOU
Any questions or comments are welcome!
FPGA key numbers
- Average annual growth: 6.9 %
- 2015 global market: 6.36 billion
- Market shares: Intel, Xilinx, Others
In 2016, FPGAs outgrew the overall semiconductor market (6.9 % vs 1.5 %, respectively)
The market is expected to reach $10 billion by 2024
Xilinx remains the leading FPGA manufacturer