DATE FRIDAY WORKSHOP: W6 Designing for Embedded Parallel Computing Platforms: Architectures, Design Tools, and Applications
TRACK: APPLICATIONS
TITLE OF POSTER: “UWB Microwave Imaging for Breast Cancer Detection: Many-core, GPU, or FPGA?”
AUTHORS: Mario R. Casu (a), Francesco Colonna (a), Marco Crepaldi (b), Danilo Demarchi (a)(b), Mariagrazia Graziano (a), and Maurizio Zamboni (a)
(a) Politecnico di Torino, (b) Italian Institute of Technology
ABSTRACT: A UWB microwave imaging system for breast cancer detection consists of antennas, transceivers, and a high-performance embedded system that processes the received signals and reconstructs breast images. In this paper we focus on this embedded system. To accelerate the image reconstruction, the beamforming phase has to be implemented in a parallel fashion. We assess its implementation on three currently available high-end platforms based on a multi-core CPU, a GPU, and an FPGA, respectively. We then project the results, by applying technology scaling rules, to future many-core CPUs, many-thread GPUs, and advanced FPGAs. We consider an optimistic case in which available resources increase according to Moore’s law only, and a pessimistic case in which only a fraction of those resources is available due to a limited power budget. In both scenarios, an implementation that includes a high-end FPGA outperforms the other alternatives. Since the number of effectively usable cores in future many-cores will be power-limited, and there is a trend toward the integration of power-efficient accelerators, we conjecture that a chip consisting of a many-core section and a reconfigurable logic section will be the perfect platform for this application.
THREE ARCHITECTURE ALTERNATIVES
UWB Microwave Imaging for Breast Cancer Detection:
Many-core, GPU, or FPGA?
Mario R. Casu(a), Francesco Colonna(a), Marco Crepaldi(b),
Danilo Demarchi(a)(b), Mariagrazia Graziano(a), Maurizio Zamboni(a)
(a) Politecnico di Torino, VLSI Lab, (b) Italian Institute of Technology@PoliTo
[Figure: system overview. Labels: UWB transmitted pulse; tumor; signal backscattered by tumor; signal backscattered by safe tissue; ratio 1:10; front-end FPGA; received and digitized signal; high-performance processing; map of backscattered energy; radiologist.]
Conclusions
• A many-core implementation will not help make the BEAF time low enough.
• A scaled GPU could make it, but only if it is not constrained by a limited power budget.
• A scaled FPGA makes it in both the area-limited and the power-limited scaling scenarios, and with about one order of magnitude less power.
• With power-limited scaling, chips with specialized and power-efficient accelerators will emerge. Solutions that embed an FPGA with a few cores are already gaining momentum (e.g. Xilinx Zynq). Should a many-core architecture with an embedded FPGA appear in the future, it would be the perfect platform for this application.
Contacts Mario R. Casu, [email protected]
Mariagrazia Graziano, [email protected]
Motivation: In Ultra-Wideband (UWB) microwave imaging, sub-nanosecond pulses irradiate the breast. A large reflected signal may indicate the presence of a tumor. Heavy processing of the backscattered signals produces a map of reflected energy. A high-performance embedded system keeps the processing time low enough for a rapid diagnosis.
Question: Which architecture is best suited, in current and in scaled technologies?
[Figure caption: BEAF execution time with up to 16 active threads, and extrapolation to a greater number. An unrealistic number of threads is needed to make the BEAF execution time equal to the Non-BEAF time.]
[Figure: three alternatives for the high-performance embedded system fed by the antennas: (1) small front-end FPGA + big multi-core (many-core) CPU; (2) small front-end FPGA + CPU + GPU; (3) CPU + large FPGA for both front-end and processing.]
[Figure caption: Example of a 2D slice from a 3D breast map. The tumor is in the middle, enhanced by the imaging algorithms.]
Two main algorithms:
A) Calibration removes skin artifacts (SKAR).
B) Beamforming (BEAF) focuses the reflected signals on a 3D volume pixel, a voxel. BEAF is repeated Nvox times and takes the largest part of the computation time.
Example with 43 antennas and Nvox = 4.25 × 10^6 on a single core/single thread of a Xeon E7-4870:
Phase    Time (s)  Percent
W-SKAR   4.155     0.35%
SKAR     0.049     ~0%
BEAF     1173      99.7%
Total    1177      100%
BEAF NEEDS ACCELERATION
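The timing table above implies how large a speed-up BEAF needs before it stops dominating. The following minimal sketch computes it, assuming the target is a BEAF time equal to the Non-BEAF time (W-SKAR + SKAR); the dictionary and function names are illustrative and do not come from the poster.

```python
# Single-thread phase times in seconds, copied from the table above.
PHASE_TIMES_S = {"W-SKAR": 4.155, "SKAR": 0.049, "BEAF": 1173.0}


def required_beaf_speedup(times=PHASE_TIMES_S):
    """Speed-up that brings the BEAF time down to the Non-BEAF time."""
    non_beaf = times["W-SKAR"] + times["SKAR"]  # about 4.2 s
    return times["BEAF"] / non_beaf


if __name__ == "__main__":
    print(f"required BEAF speed-up: {required_beaf_speedup():.0f}x")
```

Under this assumption the required speed-up is roughly 280×, which gives a sense of the scale that each platform has to reach.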
BEAMFORMING ALGORITHM MAIN STEPS
For each voxel v located at rv = (xv, yv, zv) do:
1) Time-align the signals received by the N antennas: shift all samples in time by ni(rv), i = 1..N
2) Apply K-tap FIR filters with W-BEAF weights wi to the time-aligned signals xi, i = 1..N, and sum over all N antennas
3) Determine the time window h[n,rv] around the point of interest, where the tumor response is expected
4) Compute the energy scattered by location rv by squaring and summing all the windowed and filtered signal samples
$$z[n, r_v] = \sum_{i=1}^{N} \sum_{k=1}^{K} w_i[k]\, x_i[n - k - n_i(r_v)]$$
$$p[r_v] = \sum_{n} \left| z[n, r_v]\, h[n, r_v] \right|^2$$
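The four steps above can be sketched for a single voxel in plain Python. This is a minimal, unoptimized illustration and not the authors' code: the function name, the argument layout, the treatment of samples outside the record as zero, and the reading of the time shift as a delay (index n − k − n_i) are all assumptions.

```python
def beamform_voxel(signals, delays, weights, window):
    """Backscattered energy p[r_v] for one voxel.

    signals[i][n]: samples x_i[n] received by antenna i
    delays[i]:     integer time shift n_i(r_v) for antenna i
    weights[i]:    K FIR taps for antenna i; weights[i][k-1] holds w_i[k]
    window[n]:     time window h[n, r_v]
    """
    energy = 0.0
    for n in range(len(window)):
        z = 0.0
        # Steps 1 and 2: time-align, FIR-filter, and sum over antennas.
        for x_i, n_i, w_i in zip(signals, delays, weights):
            for k, w in enumerate(w_i, start=1):
                idx = n - k - n_i
                if 0 <= idx < len(x_i):  # assumption: zero outside the record
                    z += w * x_i[idx]
        # Steps 3 and 4: window the filtered sample, square, accumulate.
        energy += (z * window[n]) ** 2
    return energy


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0, 4.0]]
    print(beamform_voxel(x, delays=[0], weights=[[1.0]], window=[1.0] * 4))
```

Each voxel is independent of the others, so the outer loop over the Nvox voxels is the natural place to parallelize, whichever platform is used.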
PROBLEM SIZE
WEIGHTS STRESS MEMORY BANDWIDTH!
Data      Type         Size               Bytes
Samples   double (8B)  Nant × Nsamples    66.9k
Delays    int (4B)     Nvox × Nant        0.68G
W-BEAF    double (8B)  Nvox × Nant × K    74.9G
Energy    double (8B)  Nvox               32.4M
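The table entries can be checked with a few multiplications. The sketch below uses Nant = 43 and Nvox = 4.25 × 10^6 from the single-thread example; reading "M" and "G" as binary prefixes (2^20 and 2^30 bytes) is an assumption that happens to reproduce the Delays and Energy rows. K and Nsamples are not stated on the poster, so the number of taps is left as a parameter; the value 55 in the usage lines is purely hypothetical and is shown only because it matches the 74.9G figure under the same reading.

```python
N_ANT = 43           # antennas, from the poster's example
N_VOX = 4_250_000    # voxels, from the poster's example


def delays_bytes(n_vox=N_VOX, n_ant=N_ANT, bytes_per_int=4):
    """One integer delay per (voxel, antenna) pair."""
    return n_vox * n_ant * bytes_per_int


def energy_bytes(n_vox=N_VOX, bytes_per_double=8):
    """One double-precision energy value per voxel."""
    return n_vox * bytes_per_double


def weights_bytes(k_taps, n_vox=N_VOX, n_ant=N_ANT, bytes_per_double=8):
    """K double-precision FIR taps per (voxel, antenna) pair."""
    return n_vox * n_ant * k_taps * bytes_per_double


if __name__ == "__main__":
    print(f"Delays: {delays_bytes() / 2**30:.2f} GiB")
    print(f"Energy: {energy_bytes() / 2**20:.1f} MiB")
    # k_taps=55 is a hypothetical value, not taken from the poster.
    print(f"W-BEAF: {weights_bytes(k_taps=55) / 2**30:.1f} GiB")
```

The weights dominate because they scale with Nvox × Nant × K, which is why they are the item that stresses memory bandwidth.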
BEAF ON MULTI-CORES AND MANY-CORES
Objective: Reduce the BEAF execution time to no more than the Non-BEAF time (i.e. W-SKAR + SKAR) by leveraging multiple threads.
Methodology: 1) BEAF code compiled with gcc and the POSIX Threads library and executed on a multi-core, multi-thread Xeon E7-4870 (2.4 GHz, 30 MB L3 cache, 32 nm technology). 2) Experimental results projected onto a scaled many-core architecture using ITRS predictions.
Two scaling approaches: Area-limited vs Power-limited
TECHNOLOGY SCALING (ITRS PREDICTIONS)
year                             2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022
tech node (nm)                   32    28    25    23    20    17.9  15.9  14.2  12.6  11.3  10
area scaling                     1.00  0.71  0.57  0.46  0.37  0.29  0.23  0.18  0.14  0.11  0.09
cores number scaling             10    14    18    22    27    35    44    56    70    90    113
threads scaling (area limited)   16    22    28    35    43    56    71    89    112   144   181
frequency scaling                1.00  1.04  1.08  1.12  1.17  1.22  1.27  1.32  1.37  1.42  1.48
power scaling (1 core)           1.00  0.95  0.89  0.84  0.78  0.74  0.70  0.66  0.60  0.57  0.53
power scaling (N cores)          1.00  1.33  1.55  1.84  2.10  2.58  3.09  3.66  4.22  5.09  6.01
turned-on cores                  10    10    11    12    13    14    14    15    17    18    19
dark cores                       0     4     6     10    14    21    30    40    53    72    94
threads scaling (power limited)  16    17    18    19    21    22    23    24    27    28    30
Number of cores per die follows Moore’s law, but the number of cores that
are simultaneously turned-on is limited by power budget (dark cores)
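The dark-core idea can be illustrated with the table's own numbers. The sketch below assumes the power budget stays fixed at its 2012 value, so the number of cores that can be on at once is the core count divided by the power the chip would draw with all N cores on. The poster does not state this rule or its rounding, so the code returns an unrounded estimate and only compares it with the "turned-on cores" row.

```python
# Rows copied from the ITRS scaling table (2012 to 2022).
CORES = [10, 14, 18, 22, 27, 35, 44, 56, 70, 90, 113]
POWER_N_CORES = [1.00, 1.33, 1.55, 1.84, 2.10, 2.58, 3.09, 3.66, 4.22, 5.09, 6.01]
TURNED_ON_TABLE = [10, 10, 11, 12, 13, 14, 14, 15, 17, 18, 19]


def powered_cores(cores, power_all_on):
    """Unrounded number of cores that fit in the 2012 power budget.

    Assumption: the budget is 1.00 in the table's normalized units, and
    power_all_on is the normalized power with every core switched on.
    """
    return cores / power_all_on


ESTIMATES = [powered_cores(c, p) for c, p in zip(CORES, POWER_N_CORES)]

if __name__ == "__main__":
    for cores, est, table in zip(CORES, ESTIMATES, TURNED_ON_TABLE):
        print(f"{cores:4d} cores: about {est:5.2f} on (table: {table}), "
              f"about {cores - est:5.2f} dark")
```

With this assumed rule every estimate lands within one core of the tabulated value, which is consistent with the turned-on count growing far more slowly than the core count.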
[Figure caption: The memory bandwidth requested by BEAF is lower than the maximum bandwidth of the Xeon architecture (80 GB/s), also in ITRS scaled technologies. Pressure on memory is low; execution time is NOT limited by memory bandwidth.]
[Figure caption: BEAF execution time in scaled technologies, extrapolated considering frequency scaling and speed-up scaling with an increasing number of available threads. Under both the area-limited and the power-limited scenarios it is NOT possible to make the BEAF and Non-BEAF execution times equal.]
BEAF ON GPU
Non-BEAF code runs on a CPU core. BEAF is accelerated by an NVIDIA Tesla C2070 GPU (448 threads at 1.15 GHz, 40 nm) connected via PCI Express.
Methodology: 1) BEAF kernel coded in OpenCL. 2) Experimental results projected onto a future GPU using ITRS predictions.
Assumptions: PCIe communication perfectly overlaps with computation, and memory latency is fully hidden in the GPU.
Scaling approaches: Area-limited vs Power-limited (same rules as for the CPU)
[Figure caption: Area-limited scenario: the BEAF and Non-BEAF execution times are equal at the 15.9 nm technology node in 2018. NOT POSSIBLE in the more realistic power-limited scenario.]
year                             2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022
tech node (nm)                   32    28    25    23    20    17.9  15.9  14.2  12.6  11.3  10
threads scaling (area limited)   700   980   1225  1531  1885  2450  3101  3889  4900  6282  7903
threads scaling (power limited)  476   499   538   566   612   646   683   723   789   839   894
[Figure caption: In both scaling scenarios, performance will be limited by the PCIe communication bandwidth unless a newer and faster protocol is adopted.]
BEAF ON FPGA
BEAF is accelerated by the largest Xilinx Virtex-6 (180 MHz, 40 nm).
Methodology: 1) BEAF kernel coded in VHDL. 2) Experimental results projected onto a future FPGA using ITRS predictions.
Assumptions: PCI Express communication overlaps with computation.
Scaling approaches: Area-limited vs Power-limited (same rules as for the CPU)
Two alternatives: for a given number of FPGA DSP resources, more Voxel Accelerators (VAs) in parallel, or more FIR filters in each VA?
[Figure caption: BEAF execution time for varying numbers of FIR filters and VAs. Solutions with 1 FIR and multiple VAs use the DSP resources more efficiently. The lowest execution time, at 100% of the DSP resources, is obtained with 1 FIR and 34 VAs.]
[Figure caption: BEAF execution time for a varying number of VAs with 1 FIR. Linear speed-up would make the BEAF and Non-BEAF execution times equal if 80 VAs were available.]
[Figure caption: In both scaling scenarios, the BEAF execution time becomes comparable with the Non-BEAF time.]
[Figure caption: In both scaling scenarios, performance is limited by the current PCIe bandwidth unless a newer and faster protocol is adopted.]
VA SCALING (ITRS PREDICTIONS)
tech node (nm)              32    28    25    23    20    17.9  15.9  14.2  12.6  11.3  10
VA scaling (area limited)   53    74    93    116   143   186   235   294   371   476   598
VA scaling (power limited)  36    38    41    43    46    49    52    55    60    63    68
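The VA counts can be compared against the 80-VA parity point quoted on the poster. The sketch below assumes ideal linear speed-up in the number of VAs and ignores the PCIe bandwidth limit noted on the poster, so it is a best case; the function names are illustrative only.

```python
TECH_NODES_NM = [32, 28, 25, 23, 20, 17.9, 15.9, 14.2, 12.6, 11.3, 10]
VA_AREA_LIMITED = [53, 74, 93, 116, 143, 186, 235, 294, 371, 476, 598]
VA_POWER_LIMITED = [36, 38, 41, 43, 46, 49, 52, 55, 60, 63, 68]
VAS_FOR_PARITY = 80  # poster figure: BEAF time equals Non-BEAF time


def first_node_reaching(va_counts, target=VAS_FOR_PARITY):
    """First technology node (nm) with at least `target` VAs, or None."""
    for node, count in zip(TECH_NODES_NM, va_counts):
        if count >= target:
            return node
    return None


def beaf_to_non_beaf_ratio(va_count, target=VAS_FOR_PARITY):
    """BEAF time divided by Non-BEAF time, assuming linear speed-up."""
    return target / va_count


if __name__ == "__main__":
    print("area-limited parity node:", first_node_reaching(VA_AREA_LIMITED))
    print("power-limited parity node:", first_node_reaching(VA_POWER_LIMITED))
    print(f"power-limited ratio at 10 nm: {beaf_to_non_beaf_ratio(68):.2f}")
```

Under these assumptions the area-limited projection passes 80 VAs at the 25 nm node, while the power-limited projection stops at 68 VAs, which leaves the BEAF time within about 20% of the Non-BEAF time and fits the poster's wording that the two become comparable.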