Advanced Computer Architecture, CSE 520
Generating FPGA-Accelerated DFT Libraries
Chi-Li Yu
Nov. 13, 2007
Overview
Application: 1D/2D Discrete Fourier Transform
Problem:
  Hardware-Software Partitioning
  Acceleration Based on FPGA
Results (compared to software-only solution):
  Up to 7.5 times higher performance
  Up to 2.5 times better energy efficiency
Why DFT?
Discrete Fourier Transform (DFT) is an important primitive underlying many DSP applications.
  Imaging/speech processing
  Communication systems
Computation-intensive
Data/memory-intensive
Review of DFT
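DFT:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi nk/N}$$

Requires N^2 complex multiplies and N(N-1) complex additions.

When N is a power of two, N = 2^p, and with W_N = e^{-j 2\pi/N}:

$$
\begin{aligned}
X[k] &= \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \\
     &= \sum_{n\ \mathrm{even}} x[n]\, W_N^{nk} + \sum_{n\ \mathrm{odd}} x[n]\, W_N^{nk} \\
     &= \sum_{r=0}^{N/2-1} x[2r]\, (W_N^{2})^{rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, (W_N^{2})^{rk} \\
     &= \sum_{r=0}^{N/2-1} x[2r]\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, W_{N/2}^{rk}
\end{aligned}
$$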
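The even/odd split above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function names are made up, and the sign convention W_N = e^{-j 2\pi/N} is an assumption (the usual forward-DFT convention). The direct version does the N^2 multiplies counted above; the recursive version applies the split until a single point remains.

```python
import cmath

def dft_direct(x):
    """Direct DFT: X[k] = sum_n x[n] * W_N^(n*k), assuming W_N = exp(-2j*pi/N)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return [complex(x[0])]
    if n_points == 0 or n_points % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # N/2-point DFT of x[2r]
    odd = fft_radix2(x[1::2])   # N/2-point DFT of x[2r+1]
    half = n_points // 2
    out = [0j] * n_points
    for k in range(half):
        # Twiddle factor W_N^k applied to the odd-index half.
        t = cmath.exp(-2j * cmath.pi * k / n_points) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t  # uses W_N^(k + N/2) = -W_N^k
    return out
```

Both functions should agree up to floating-point rounding on any power-of-two input.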
Pipelined streaming architecture of FFT
Data flow diagram of Fast Fourier Transform (FFT)
Pipelined streaming architecture (throughput: 1 sample/clock)
Problem
Pure hardware implementation
  N should be a power of two
  N is usually fixed
An arbitrary-sized DFT is hard to implement
Flexible programmability / fast execution time
Hardware-software heterogeneous architecture
HW-SW partitioning
Principles of HW-SW partitioning
Hardware: The most computation-intensive kernels that are conducive to hardware acceleration are extracted from an algorithm and realized as hardware.
Software: The remaining computations, including the control-intensive part, are carried out in software.
Xilinx Virtex-II Pro Platform FPGA
Xilinx Virtex-II Pro Platform FPGA
Field Programmable Gate Array (FPGA)
Process: 0.13 µm, 1.5 V
Flexible logic resources
  Up to 1M gate-count capacity
  Up to 8 Mb of true dual-port RAM
Embedded IBM PowerPC 405 RISC processor blocks
  Provide performance up to 400 MHz
The way to achieve hardware acceleration for DFT
When considering power-of-two problem sizes (i.e., DFTs on 2^p points), we only need to consider power-of-two sized DFT kernels (i.e., DFT_{2^q}).
By off-loading the appropriate kernels into hardware, the software receives the benefit of hardware acceleration and yet can still compute arbitrary sized DFTs on top of the available kernels.
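As a rough illustration of software computing larger DFTs on top of a fixed-size kernel, the sketch below performs radix-2 steps in software until the sub-problems reach the kernel's native size. Everything here is an assumption made for illustration: `KERNEL_SIZE`, `hw_kernel`, and `dft_on_kernel` are hypothetical names, the "hardware" core is simulated by a plain software DFT, and the decomposition is not necessarily the one the generated library uses.

```python
import cmath

KERNEL_SIZE = 8  # hypothetical native size 2^q of the hardware core

def hw_kernel(x):
    """Stand-in for the hardware DFT core: accepts only KERNEL_SIZE points.

    Simulated here by a direct software DFT with W_N = exp(-2j*pi/N).
    """
    if len(x) != KERNEL_SIZE:
        raise ValueError("kernel only supports its native size")
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
        for k in range(n)
    ]

def dft_on_kernel(x):
    """Radix-2 software steps layered on the fixed-size kernel.

    len(x) must equal KERNEL_SIZE * 2^m for some m >= 0.
    """
    n = len(x)
    if n == KERNEL_SIZE:
        return hw_kernel(x)  # off-load the native-size sub-problem
    if n < KERNEL_SIZE or n % 2:
        raise ValueError("size must be KERNEL_SIZE times a power of two")
    even = dft_on_kernel(x[0::2])
    odd = dft_on_kernel(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Software butterfly: combine the two half-size results.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

For a 32-point input with an 8-point kernel, this makes four kernel calls and two levels of software butterflies.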
Research problem
Different kernels in hardware
  Yield different performance (e.g., operations per second)
  Use different amounts of resources (e.g., logic, number of BRAMs, or power consumption)
DFT partitioning problem: select the set of throughput-optimized, power-of-two sized DFT cores that satisfies a given resource constraint (logic, power, energy) while maximizing a scalar metric, such as performance.
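One way to picture the partitioning problem is as a small constrained search. The sketch below enumerates subsets of candidate cores by brute force, keeps those that fit an area budget, and picks the one with the lowest total library runtime. It is not the paper's algorithm, and every number and name in it (`SOFTWARE_RUNTIME`, `CORES`, the costs and runtimes) is made up for illustration.

```python
from itertools import combinations

# All numbers below are invented for illustration; none come from the paper.
# Software-only runtime (arbitrary units) for each problem size in the library.
SOFTWARE_RUNTIME = {64: 10.0, 256: 50.0, 1024: 240.0}

# Hypothetical candidate cores: name -> (area cost, runtime per problem size).
CORES = {
    "DFT64": (1.0, {64: 3.0, 256: 30.0, 1024: 180.0}),
    "DFT256": (2.0, {64: 9.0, 256: 12.0, 1024: 110.0}),
    "DFT1024": (4.0, {64: 9.5, 256: 40.0, 1024: 40.0}),
}

def library_runtime(selected):
    """Total runtime when each problem size uses its fastest available option."""
    total = 0.0
    for size, sw_time in SOFTWARE_RUNTIME.items():
        options = [sw_time] + [CORES[name][1][size] for name in selected]
        total += min(options)
    return total

def best_partition(area_budget):
    """Brute-force search over core subsets that fit within area_budget."""
    best = ((), library_runtime(()))  # software-only baseline
    names = sorted(CORES)
    for count in range(1, len(names) + 1):
        for subset in combinations(names, count):
            area = sum(CORES[name][0] for name in subset)
            if area > area_budget:
                continue  # violates the resource constraint
            runtime = library_runtime(subset)
            if runtime < best[1]:
                best = (subset, runtime)
    return best
```

Exhaustive search is only practical for a handful of candidate cores; the point is the shape of the problem (a resource constraint plus a scalar objective), not the search strategy.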
Test platform based on the FPGA
Note that the data cache of the PowerPC is 16 kB.
Architecture of the generated hardware DFT IP cores
DFT Performance (N is a power-of-two)
The highest performance is reached at the core’s native size.
Data does not fit into the data cache at N = 8192.
Memory bandwidth becomes the main bottleneck and practically reduces all possible speedups.
DFT Performance (N is not a power-of-two)
N = 3·2^k and N = 5·2^k
Radix-3 and Radix-5 operations are done in software.
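A minimal sketch of how a radix-3 step in software could sit on top of a power-of-two kernel for N = 3·2^k. The names `pow2_kernel` and `dft_3_times_pow2` are hypothetical, the kernel is simulated by a plain software DFT, and the decimation-in-time split used here is a standard choice rather than something the slides confirm. The N = 5·2^k case would follow the same pattern with five sub-sequences.

```python
import cmath

def pow2_kernel(x):
    """Stand-in for a power-of-two DFT core (simulated by a direct DFT)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("kernel expects a power-of-two size")
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
        for k in range(n)
    ]

def dft_3_times_pow2(x):
    """DFT of N = 3 * 2^k points: radix-3 step in software, rest in the kernel."""
    n = len(x)
    if n == 0 or n % 3:
        raise ValueError("length must be 3 times a power of two")
    m = n // 3
    # Three interleaved sub-sequences x[3r + s], each of power-of-two length m.
    subs = [pow2_kernel(x[s::3]) for s in range(3)]
    out = []
    for k in range(n):
        # X[k] = sum_s W_N^(s*k) * Y_s[k mod m], with W_N = exp(-2j*pi/N).
        out.append(sum(
            cmath.exp(-2j * cmath.pi * s * k / n) * subs[s][k % m]
            for s in range(3)
        ))
    return out
```

The kernel is called three times per transform, and the twiddle multiplications and three-way sums are the part left to software.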
DFT Precision
1D DFT with different core sizes
Up to 7.5 times speedup.
The best choice depends on the targeted applications.
For small problem sizes, software is the most energy-efficient choice.
2D DFT with different core sizes
Up to 4 times speedup.
Again, for small problem sizes, software is the most energy-efficient choice.
All sizes larger than or equal to 64x128 do not fit into the data cache of the PPC, which leads to a performance degradation.
Area/performance
There is also a 3 times variation in the power consumed by the DFT calculations.
In other words, by allowing up to 3 times more power (or 4 times more area) to be consumed, one can speed up a whole library up to 4 times (averaged across the library).
Power/performance
There is a 4 times variation in both area consumption and normalized runtime across all possible.
Conclusions
In the experiments on a Xilinx Virtex-II Pro, the automatically partitioned and generated FPGA-accelerated library has between 2 and 7.5 times higher performance and up to 2.5 times better energy efficiency than the software-only version.
We have integrated this approach in the “Spiral linear-transform code-generation framework” to support push-button automatic implementation.
Conclusions
Architectures with tightly integrated FPGAs and general-purpose processors are starting to play an important role in both embedded and high-performance computing settings.
The tight integration makes it possible to offload fine- and coarse-grain functionalities from processors to the FPGA fabric, combining the strengths of both components.
My critique of this paper
Strengths:
  Detailed analysis of the HW-SW partitioning.
  Comparisons of performance and energy efficiency are very valuable.
Weaknesses:
  2D DFT on this platform is not efficient.
  Communication between the PPC and the FPGA slows down the whole operation.
How does this relate to our class?
A heterogeneous architecture combining two different cores: one RISC CPU and one programmable hardware fabric (FPGA).
The discussion of the power consumption of this kind of platform is interesting.
How does this relate to our project?
The same application: Discrete Fourier Transform.
The same platform: Xilinx FPGA.
Reduce the workload of the PPC.
Introduce the concept of multi-core architectures to our hardware design.
Paper
Paolo D’Alberto et al., “Generating FPGA-Accelerated DFT Libraries,” in Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'07), pp. 173-184, Napa Valley, CA, USA, April 23-25, 2007.