
DATE FRIDAY WORKSHOP: W6 Designing for Embedded Parallel Computing Platforms: Architectures, Design Tools, and Applications

TRACK: APPLICATIONS TITLE OF POSTER: “UWB Microwave Imaging for Breast Cancer Detection: Many-core, GPU, or FPGA?”

AUTHORS: Mario R. Casu (a), Francesco Colonna (a), Marco Crepaldi (b), Danilo Demarchi (a)(b), Mariagrazia Graziano (a), and Maurizio Zamboni (a)

(a) Politecnico di Torino, (b) Italian Institute of Technology

ABSTRACT: A UWB microwave imaging system for breast cancer detection consists of antennas, transceivers, and a high-performance embedded system for processing the received signals and reconstructing breast images. In this paper we focus on this embedded system. To accelerate the image reconstruction, the Beamforming phase has to be implemented in a parallel fashion. We assess its implementation on three currently available high-end platforms based on a multi-core CPU, a GPU, and an FPGA, respectively. We then project the results by applying technology-scaling rules to future many-core CPUs, many-thread GPUs, and advanced FPGAs. We consider an optimistic case in which the available resources increase according to Moore's law alone, and a pessimistic case in which only a fraction of those resources is available due to a limited power budget. In both scenarios, an implementation that includes a high-end FPGA outperforms the other alternatives. Since the number of effectively usable cores in future many-cores will be power-limited, and there is a trend toward the integration of power-efficient accelerators, we conjecture that a chip consisting of a many-core section and a reconfigurable-logic section will be the perfect platform for this application.


UWB Microwave Imaging for Breast Cancer Detection: Many-core, GPU, or FPGA?

Mario R. Casu (a), Francesco Colonna (a), Marco Crepaldi (b), Danilo Demarchi (a)(b), Mariagrazia Graziano (a), Maurizio Zamboni (a)

(a) Politecnico di Torino, VLSI Lab, (b) Italian Institute of Technology@PoliTo

[System diagram: the UWB transmitted pulse irradiates the breast; the signal backscattered by the tumor and the signal backscattered by safe tissue (ratio 1:10) are received and digitized by a front-end FPGA, processed by the high-performance processing stage, and rendered as a map of backscattered energy for the radiologist.]

Conclusions

• A many-core implementation will not help make the BEAF time low enough.

• A scaled GPU could make it, if not constrained by a limited power budget.

• A scaled FPGA makes it in both the area-limited and the power-limited scaling scenarios, and with about one order of magnitude less power.

• With power-limited scaling, chips with specialized and power-efficient accelerators will emerge. Solutions that embed an FPGA with a few cores are already gaining momentum (e.g. Xilinx Zynq). Should a many-core architecture with an embedded FPGA appear in the future, it would be the perfect platform for this application.

Contacts: Mario R. Casu, [email protected]; Mariagrazia Graziano, [email protected]

Motivation: In Ultra-Wideband (UWB) Microwave Imaging, sub-nanosecond pulses irradiate the breast. A large reflected signal may indicate the presence of a tumor. Heavy processing of the back-scattered signals produces a map of reflected energy. A high-performance embedded system keeps the processing time low enough for a rapid diagnosis.

Question: Which architecture is the best suited, in current and in scaled technologies?


THREE ARCHITECTURE ALTERNATIVES

[Diagram: the antennas feed a high-performance embedded system, shown in three alternative configurations: (1) a small front-end FPGA plus a big multi-core (many-core) CPU; (2) a small front-end FPGA plus a CPU and a GPU; (3) a single large FPGA implementing both the front-end and the processing.]

[Image: example of a 2D slice from a 3D breast map. The tumor is in the middle, enhanced by the imaging algorithms.]

Two main algorithms:

A) Calibration, which removes skin artifacts (SKAR)

B) Beamforming (BEAF), which focuses the reflected signals on a 3D volume pixel, a voxel. BEAF is repeated Nvox times and takes the largest part of the computation time.

Example with 43 antennas and Nvox = 4.25×10^6 on a single core/single thread of a Xeon E7-4870:

Phase  | Time (s) | Percent
W-SKAR | 4.155    | 0.35%
SKAR   | 0.049    | ~0%
BEAF   | 1173     | 99.7%
Total  | 1177     | 100%

BEAF NEEDS ACCELERATION

BEAMFORMING ALGORITHM MAIN STEPS

For each voxel v located at r_v = (x_v, y_v, z_v) do:

1) Time-align the signals received by the N antennas: shift all samples in time by n_i(r_v), i = 1..N

2) Apply K-tap FIR filters with W-BEAF weights w_i to the time-aligned signals x_i, i = 1..N, and sum over the N antennas

3) Determine the time window h[n, r_v] around the point of interest where the tumor response is expected

4) Compute the energy scattered by location r_v by squaring and summing all the windowed and filtered signal samples

$$z[n, \mathbf{r}_v] = \sum_{i=1}^{N} \sum_{k=1}^{K} w_i[k]\, x_i\!\left[n - k - n_i(\mathbf{r}_v)\right]$$

$$p[\mathbf{r}_v] = \sum_{n} \left| z[n, \mathbf{r}_v]\, h[n, \mathbf{r}_v] \right|^2$$
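To make the four steps concrete, here is a minimal single-threaded C sketch of the per-voxel computation. It is not the authors' code: the function and variable names (beaf_voxel, delay, h_lo/h_hi) are hypothetical, and a binary time window and the delay-sign convention of the equation above are assumptions.

```c
/* Hypothetical sketch of the per-voxel BEAF computation (steps 1-4).
 * x[i][m]    : digitized signal of antenna i (Nsamples samples)
 * w[i][k]    : K-tap W-BEAF FIR weights of antenna i for this voxel
 * delay[i]   : time-alignment shift n_i(r_v) for this voxel
 * h_lo..h_hi : time window h[n, r_v] (assumed binary) around the
 *              expected tumor response
 * Returns p[r_v], the backscattered energy of the voxel. */
double beaf_voxel(const double *const *x, const double *const *w,
                  const int *delay, int N, int K, int Nsamples,
                  int h_lo, int h_hi)
{
    double energy = 0.0;
    for (int n = h_lo; n <= h_hi; n++) {     /* step 3: windowing      */
        double z = 0.0;
        for (int i = 0; i < N; i++)          /* sum over the antennas  */
            for (int k = 0; k < K; k++) {    /* step 2: FIR filtering  */
                int m = n - k - delay[i];    /* step 1: time alignment */
                if (m >= 0 && m < Nsamples)
                    z += w[i][k] * x[i][m];
            }
        energy += z * z;                     /* step 4: energy         */
    }
    return energy;
}
```

Note that the W-BEAF weights are per-voxel, which is exactly why the weight table below dominates the memory traffic.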

PROBLEM SIZE

Data    | Type        | Size            | Bytes
Samples | double (8B) | Nant × Nsamples | 66.9 k
Delays  | int (4B)    | Nvox × Nant     | 0.68 G
W-BEAF  | double (8B) | Nvox × Nant × K | 74.9 G
Energy  | double (8B) | Nvox            | 32.4 M

WEIGHTS STRESS MEMORY BANDWIDTH!
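These figures can be cross-checked against the example above (Nant = 43, Nvox = 4.25×10^6); the sketch below is just that arithmetic, with the tap count K left symbolic since the poster does not state it.

```c
#include <stdio.h>

/* Consistency check of the problem-size table (Nant and Nvox taken
 * from the 43-antenna example above; K is not stated on the poster). */
int main(void)
{
    const double Nant = 43, Nvox = 4.25e6;
    printf("Delays: %.2f GiB\n", Nvox * Nant * 4 / (1L << 30)); /* ~0.68 */
    printf("Energy: %.1f MiB\n", Nvox * 8 / (1 << 20));         /* ~32.4 */
    /* W-BEAF = Nvox * Nant * K * 8 bytes; matching the quoted 74.9 G
     * implies K on the order of 55 taps. */
    return 0;
}
```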

BEAF ON MULTI-CORES AND MANY-CORES

Objective: reduce the BEAF execution time down to at least the Non-BEAF time (i.e. W-SKAR + SKAR) by leveraging multiple threads.

Methodology: 1) BEAF code compiled with gcc and the POSIX Threads library and executed on a multi-core, multi-thread Xeon E7-4870 (2.4 GHz, 30 MB L3 cache, 32nm technology). 2) Experimental results projected onto a scaled many-core architecture using ITRS predictions.

Two scaling approaches: area-limited vs power-limited.

[Plot: BEAF execution time with up to 16 active threads, extrapolated to larger thread counts. An unrealistic number of threads would be needed to make the BEAF execution time equal to the Non-BEAF time.]
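Since the voxels are independent, the natural pthreads decomposition is a static partition of the Nvox voxels across threads; a minimal sketch follows (the partitioning scheme and the helper names, e.g. beaf_one_voxel wrapping the per-voxel routine above, are assumptions, since the poster does not show the code).

```c
#include <pthread.h>

/* Static block partition of the Nvox voxels over nthreads workers
 * (up to 256 threads in this sketch). */
typedef struct { int first, last; /* voxel range [first, last) */ } range_t;

extern double energy_map[];          /* p[r_v], one entry per voxel       */
extern double beaf_one_voxel(int v); /* per-voxel BEAF, as sketched above */

static void *worker(void *arg)
{
    const range_t *r = (const range_t *)arg;
    for (int v = r->first; v < r->last; v++)  /* voxels are independent */
        energy_map[v] = beaf_one_voxel(v);
    return NULL;
}

void beaf_parallel(int nvox, int nthreads)
{
    pthread_t tid[256];
    range_t rng[256];
    int chunk = (nvox + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; t++) {
        rng[t].first = t * chunk;
        rng[t].last  = (t + 1) * chunk < nvox ? (t + 1) * chunk : nvox;
        pthread_create(&tid[t], NULL, worker, &rng[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```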

TECHNOLOGY SCALING (ITRS PREDICTIONS)

year                            | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022
tech node (nm)                  | 32   | 28   | 25   | 23   | 20   | 17.9 | 15.9 | 14.2 | 12.6 | 11.3 | 10
area scaling                    | 1.00 | 0.71 | 0.57 | 0.46 | 0.37 | 0.29 | 0.23 | 0.18 | 0.14 | 0.11 | 0.09
cores number scaling            | 10   | 14   | 18   | 22   | 27   | 35   | 44   | 56   | 70   | 90   | 113
threads scaling (area limited)  | 16   | 22   | 28   | 35   | 43   | 56   | 71   | 89   | 112  | 144  | 181
frequency scaling               | 1.00 | 1.04 | 1.08 | 1.12 | 1.17 | 1.22 | 1.27 | 1.32 | 1.37 | 1.42 | 1.48
power scaling (1 core)          | 1.00 | 0.95 | 0.89 | 0.84 | 0.78 | 0.74 | 0.70 | 0.66 | 0.60 | 0.57 | 0.53
power scaling (N cores)         | 1.00 | 1.33 | 1.55 | 1.84 | 2.10 | 2.58 | 3.09 | 3.66 | 4.22 | 5.09 | 6.01
turned-on cores                 | 10   | 10   | 11   | 12   | 13   | 14   | 14   | 15   | 17   | 18   | 19
dark cores                      | 0    | 4    | 6    | 10   | 14   | 21   | 30   | 40   | 53   | 72   | 94
threads scaling (power limited) | 16   | 17   | 18   | 19   | 21   | 22   | 23   | 24   | 27   | 28   | 30

The number of cores per die follows Moore's law, but the number of cores that can be simultaneously turned on is limited by the power budget (dark cores).
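The turned-on-cores row appears to follow from dividing the scaled core count by the scaled N-core power; this relation is an inference from the table, not a formula stated on the poster, and the small sketch below reproduces the row to within one core.

```c
#include <math.h>
#include <stdio.h>

/* Inferred relation: turned-on cores ~= round(cores / power_N), where
 * power_N is the N-core power-scaling factor from the table above.
 * Matches the poster's row exactly except 2013-2014 (off by one core,
 * presumably because the printed factors are rounded). */
int main(void)
{
    const int    cores[] = {10, 14, 18, 22, 27, 35, 44, 56, 70, 90, 113};
    const double power[] = {1.00, 1.33, 1.55, 1.84, 2.10, 2.58,
                            3.09, 3.66, 4.22, 5.09, 6.01};
    for (int y = 0; y < 11; y++) {
        int on = (int)lround(cores[y] / power[y]);
        printf("%d: turned-on=%d dark=%d\n", 2012 + y, on, cores[y] - on);
    }
    return 0;
}
```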

The memory bandwidth requested by BEAF is lower than the maximum bandwidth of the Xeon architecture (80 GB/s), even in ITRS-scaled technologies. The pressure on memory is low, so the execution time is NOT limited by the memory bandwidth.

[Plot: BEAF execution time in scaled technologies, extrapolated considering frequency scaling and speed-up scaling with the increasingly available threads. Under both the area-limited and the power-limited scenarios it is NOT possible to make the BEAF and Non-BEAF execution times equal.]

BEAF ON GPU

The Non-BEAF code runs on a CPU core; BEAF is accelerated by an NVIDIA Tesla C2070 GPU (448 threads at 1.15 GHz, 40nm) connected via PCI Express.

Methodology: 1) BEAF kernel coded in OpenCL. 2) Experimental results projected onto a future GPU using ITRS predictions.

Assumptions: PCIe communication perfectly overlaps with computation, and memory latency is fully hidden in the GPU.

Scaling approaches: area-limited vs power-limited (same rules as for the CPU).
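The poster does not show the kernel; the sketch below illustrates the natural OpenCL mapping, one work-item per voxel, mirroring the per-voxel C routine above. The buffer layouts, argument names, and the binary-window assumption are all illustrative, not the authors' implementation.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* Hypothetical OpenCL C sketch: each work-item computes p[r_v] for one
 * voxel. Layouts are assumptions: x and w flattened row-major, per-voxel
 * delays and window bounds precomputed on the host. */
__kernel void beaf(__global const double *x,     /* [N][Nsamples]   */
                   __global const double *w,     /* [Nvox][N][K]    */
                   __global const int    *delay, /* [Nvox][N]       */
                   __global const int2   *win,   /* [Nvox] h window */
                   __global double       *p,     /* [Nvox] energies */
                   int N, int K, int Nsamples)
{
    int v = get_global_id(0);
    double energy = 0.0;
    for (int n = win[v].x; n <= win[v].y; n++) {
        double z = 0.0;
        for (int i = 0; i < N; i++)
            for (int k = 0; k < K; k++) {
                int m = n - k - delay[v * N + i];
                if (m >= 0 && m < Nsamples)
                    z += w[(v * N + i) * K + k] * x[i * Nsamples + m];
            }
        energy += z * z;
    }
    p[v] = energy;
}
```

With Nvox = 4.25×10^6 independent work-items, occupancy is not the issue; as the problem-size table suggests, the real constraint is moving the per-voxel weights to the device.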

[Plot: in the area-limited scenario, the BEAF and Non-BEAF execution times become equal at the 15.9nm technology node in 2018. This is NOT POSSIBLE in the more realistic power-limited scenario.]

GPU THREAD SCALING (ITRS PREDICTIONS)

year                            | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022
tech node (nm)                  | 32   | 28   | 25   | 23   | 20   | 17.9 | 15.9 | 14.2 | 12.6 | 11.3 | 10
threads scaling (area limited)  | 700  | 980  | 1225 | 1531 | 1885 | 2450 | 3101 | 3889 | 4900 | 6282 | 7903
threads scaling (power limited) | 476  | 499  | 538  | 566  | 612  | 646  | 683  | 723  | 789  | 839  | 894
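As a sanity check (an inference, not stated on the poster), the 2012 entry is the C2070's 448 threads rescaled from its 40nm node to 32nm by the area factor,

$$448 \times \left(\tfrac{40}{32}\right)^2 = 448 \times 1.5625 = 700,$$

and the area-limited row then tracks the area-scaling row of the CPU table (e.g. 700 / 0.71 ≈ 986 ≈ 980 at 28nm).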

In both scaling scenarios, performance will be limited by the PCIe communication bandwidth, unless a newer and faster protocol is adopted.

BEAF ON FPGA

BEAF is accelerated by the largest Xilinx Virtex-6 (180 MHz, 40nm).

Methodology: 1) BEAF kernel coded in VHDL. 2) Experimental results projected onto a future FPGA using ITRS predictions.

Assumptions: PCI Express communication overlaps with computation.

Scaling approaches: area-limited vs power-limited (same rules as for the CPU).

Two alternatives: for a given number of FPGA DSP resources, more Voxel Accelerators (VAs) in parallel, or more FIR filters in each VA?

[Plot: BEAF execution time varying the FIR and VA numbers. Solutions with 1 FIR and multiple VAs use the DSP resources more efficiently; the lowest execution time at 100% DSP utilization is obtained with 1 FIR and 34 VAs.]

[Plot: BEAF execution time varying the VA number, with 1 FIR per VA. Linear speed-up would make the BEAF and Non-BEAF execution times equal if 80 VAs were available.]
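That 80-VA figure is consistent with simple arithmetic (an inference from the numbers above, not a statement on the poster): with linear speed-up across VAs, equality requires

$$V = \frac{T_{\mathrm{BEAF}}(1\,\mathrm{VA})}{T_{\mathrm{non\text{-}BEAF}}} \approx 80 \;\Rightarrow\; T_{\mathrm{BEAF}}(1\,\mathrm{VA}) \approx 80 \times 4.2\,\mathrm{s} \approx 336\,\mathrm{s},$$

i.e. a single VA on the Virtex-6 would already be roughly 3.5× faster than the 1173 s single-thread Xeon baseline.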

In both scaling scenarios, the BEAF execution time becomes comparable with the Non-BEAF time.

In both scaling scenarios, performance is limited by the current PCIe bandwidth, unless a newer and faster protocol is adopted.

VA SCALING (ITRS PREDICTIONS)

tech node (nm)             | 32 | 28 | 25 | 23  | 20  | 17.9 | 15.9 | 14.2 | 12.6 | 11.3 | 10
VA scaling (area limited)  | 53 | 74 | 93 | 116 | 143 | 186  | 235  | 294  | 371  | 476  | 598
VA scaling (power limited) | 36 | 38 | 41 | 43  | 46  | 49   | 52   | 55   | 60   | 63   | 68