Reconfigurable Cell Array for DSP Applications
Chenxin Zhang
Department of Electrical and Information Technology, Lund University, Sweden
ETI180 DSP-design, Dec. 6th, 2011
Department of Electrical and Information Technology, Lund University
Outline
• Reconfigurable computing
• Coarse-grained reconfigurable cell array
– Processing cell
– Memory cell
– Network router cell
– System reconfiguration
• Case studies
– Reconfigurable FIR
– Reconfigurable FFT processor
– Multi-standard OFDM coarse time synchronization
Reconfigurable computing
• Updates on the data path in addition to the control flow.
• Combines flexibility with high performance at a feasible hardware cost.
• Software-centric programming approach.
• Coarse-grained granularity – trade-off between efficiency, flexibility, and programmability.
• Dynamic reconfigurability.
High performance real-time DSP computing
[Figure: application, media, and cellular processors supporting multiple standards: GPS, BT, WLAN, DVB-H, WiMAX, LTE-A, WCDMA, …, 5G?]
Software-defined hardware
• Hardware sharing
– Accelerators: poor hardware reusability
– Reconfigurable architecture
+ Multi-task
+ Multi-standard
+ Multi-algorithm
− Control overhead, e.g. area, power.
[Figure: processing chain with blocks A, B, C, D]
Performance vs. Flexibility
• Specialized hardware (ASIC)
+ High performance, small size, low power
− Less flexible, sensitive to manufacturing defects
− High NRE cost
• Standard processor (GPP, DSP…)
+ Flexible, short design time
− Limited computation capacity
• Fine-grained reconfigurable architecture (FPGA)
+ High calculation capacity, flexible
− Routing overhead, high power consumption
− Hardware-oriented design approach
Application-specific DSP: Tensilica ConnX Baseband Engine
Tabula Spacetime
• Ultra-rapid reconfiguration: multi-GHz rates
• 2.5x logic density
• 3.7x DSP performance
Coarse-grained reconfigurable architecture (CGRA)
• High calculation capacity and flexibility
• Software-oriented: relatively fast development
• Tolerance to manufacturing defects
• Sacrificed area & energy efficiency compared to ASICs
• Sacrificed mapping flexibility compared to FPGAs
Related work
• ALU clusters: MathStar FPOA, RICA…
– Instruction level, data level parallelism
– SIMD or VLIW
• Processor array: RAW, WPPA, REMARC…
– Instruction level, data level, and task level parallelism
– MIMD
• Hybrid structure: ADRES, PACT XPP…
– Instruction level, data level, and task level parallelism
– SIMD or VLIW and MIMD
– Combined complexity?
Courtesy: MathStar, “FPOA architecture guide”.
Courtesy: D. Kissler et al., “A Highly Parameterizable Parallel Processor Array Architecture”.
Courtesy: PACT, “XPP-III Processor Overview”.
System infrastructure
• An array of resource cells.
• Heterogeneous cell array:
– Processing cell
– Memory cell
– Accelerator (e.g. no configuration)
• Hierarchical cell array.
[Figure labels: R, Addr. gen, Coeff. gen]
Resource cell
• Dedicated local interconnections:
– High data throughput
• Hierarchical global routing network:
– Flexible global data transmission
– External data access
– Global cell (re)configuration
• Data driven synchronization
• Single-Cycle-Per-Hop latency
• AMBA 4 AXI4-stream protocol
• GALS network data transmission
[Figure labels: local ports L0 to L3, global port G0, R, RC]
Processing cell
• Processing core
– ALU, DSP, SIMD, VLIW, CORDIC...
– Implicit load-store operations in all instructions.
– Run-time control and conditional reconfiguration.
– In-cell NoC supervision and reconfiguration.
• Processing shell
– Network adapter
[Figure: P3 = f(P1, P2)]
Example 1: Generic signal processing cell
• 4 pipeline stages.
• Hybrid Load-Store & Memory-Memory architecture.
• Compact program size (memory references).
• With external memory cells:
– Complex addressing modes, e.g. memory indirect, auto-increment.
– Flexible usage: program/data memory, processor stack, (cache).
• Single-cycle delayed branch.
• Zero-delay conditional inner loop control.
[Figure: P3 = f(P1, P2)]
Example 2: Dataflow processing cell (I)
[Figure: pipeline with IF/ID, ID/EXE, and EXE/WB stages; PC, branch logic, and operation controller; registers; local IO ports L0, L1, ..., Lx and global IO port G; data path made of input arrangement MUX, arith/logic selection, output arrangement MUX, and output MUX]
Example 2: Dataflow processing cell (II)
• SIMD/VLIW-like operation:
– 2/4-way 16/8-bit independent data processing
– Multi-level data processing (implicit prolog & epilog processing)
• Dual-operand instruction set:
– Dual-OpCode & Dual-Operand: e.g. ADDSUB R[d1], R[d2], R[s1], R[s2]
– Vector operation option: e.g. complex number arithmetic
• Dynamic data path reconfiguration
• Conditional instruction executions
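The dual-operand instruction and the multi-way data processing listed above can be sketched in a few lines. This is only an illustration under assumptions: the semantics of `ADDSUB` (sum to the first destination, difference to the second), the 32-bit register width, and the helper names `addsub` and `simd_add` are hypothetical and not taken from the slides.

```python
def addsub(regs, d1, d2, s1, s2):
    """Assumed ADDSUB semantics: R[d1] = R[s1] + R[s2], R[d2] = R[s1] - R[s2]."""
    a, b = regs[s1], regs[s2]          # read both sources before writing
    regs[d1], regs[d2] = a + b, a - b  # two opcodes, two results, one instruction


def simd_add(x, y, lanes=2, width=32):
    """Lane-wise add: 2 lanes of 16 bits or 4 lanes of 8 bits in one word.

    Each lane wraps around on overflow and never carries into its neighbour.
    """
    lane_bits = width // lanes
    mask = (1 << lane_bits) - 1
    result = 0
    for lane in range(lanes):
        shift = lane * lane_bits
        lane_sum = ((x >> shift) & mask) + ((y >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result


regs = [0, 0, 7, 3]
addsub(regs, 0, 1, 2, 3)               # regs[0] = 7 + 3, regs[1] = 7 - 3
```

With these assumed semantics, one `ADDSUB` produces both outputs of an FFT-style butterfly, which is one reason a dual-opcode form is attractive for DSP kernels.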
Dataflow processing cell: Dynamic data path reconfiguration
[Figure: data path stages: input arrangement MUX, arith/logic selection, output arrangement MUX, output MUX]
Dataflow processing cell: Run-time data arrangement (II)
• Complex number multiplication vs. real number multiplication
– MUL R3, R1, R2 ; R3 = R1 * R2 where {ab} is stored in R1 and {cd} is stored in R2.
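A minimal sketch of the two interpretations of the same `MUL`, assuming that {ab} and {cd} denote two packed sub-words per register, that real mode multiplies lane by lane, and that complex mode treats each register as (real, imaginary). These assumptions and the function names are illustrative, not stated on the slide.

```python
def mul_real(r1, r2):
    """Real mode (assumed): two independent lane-wise products."""
    a, b = r1
    c, d = r2
    return (a * c, b * d)


def mul_complex(r1, r2):
    """Complex mode (assumed): (a + jb) * (c + jd) = (ac - bd) + j(ad + bc)."""
    a, b = r1
    c, d = r2
    return (a * c - b * d, a * d + b * c)


R1, R2 = (1, 2), (3, 4)
real_result = mul_real(R1, R2)
complex_result = mul_complex(R1, R2)
```

The complex product needs the cross terms ad and bc, so the operands must be rearranged before the multipliers and the partial products recombined after them. That is the kind of job the input and output arrangement MUXes could perform at run time.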
Memory cell (I)
Memory cell (II): Memory descriptor
Memory cell (III)
[Figure, panels (a) to (d): I/Q data packing, PC0 -> MC0. Each input word carries an in-phase (I) and a quadrature (Q) sample with sign bits; marked bit positions are 0, 11, 16, 27, and 31. Each sample is reduced from 12 bits to 4 bits. The words are shifted (by 0, 20, 16, and 4) and masked, then combined by logic "or" at address 'X'. After 4 iterations the stored word reads 3(Q) 1(Q) 4(Q) 2(Q) 3(I) 1(I) 4(I) 2(I).]
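The shift, mask, and OR packing in the figure can be illustrated as follows. The bit layout here is an assumption: I in bits 11:0 and Q in bits 27:16 of a 32-bit word, with the 4 most significant bits of each sample kept. The shift amounts are chosen only so that the result matches the nibble order in the figure; they are not the shift values printed on the slide, and `pack_iq` is a hypothetical name.

```python
MASK = 0x0F000F00  # keeps bits 27:24 (Q) and 11:8 (I): top 4 bits of each 12-bit sample

# Assumed per-iteration shifts (positive = left, negative = right).
SHIFTS = [0, -8, 4, -4]


def pack_iq(words):
    """Pack four 32-bit I/Q words into one word of eight 4-bit fields."""
    packed = 0
    for word, shift in zip(words, SHIFTS):
        kept = word & MASK                        # 12 bits -> 4 bits per sample
        kept = kept << shift if shift >= 0 else kept >> -shift
        packed |= kept                            # logic OR into the stored word
    return packed & 0xFFFFFFFF


def make_word(i12, q12):
    """Build an input word from 12-bit I and Q values (assumed layout)."""
    return (q12 << 16) | i12


words = [make_word(0x1AB, 0x9CD), make_word(0x2AB, 0xACD),
         make_word(0x3AB, 0xBCD), make_word(0x4AB, 0xCCD)]
packed = pack_iq(words)
```

Reading `packed` nibble by nibble from the most significant end gives 3(Q) 1(Q) 4(Q) 2(Q) 3(I) 1(I) 4(I) 2(I), the order shown after 4 iterations.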
Memory cell (IV): Reconfiguration
• Individual memory descriptor (DSC) loading & tracing
• Memory DSC execution program
• Memory DSC execution mode: restart, resume
• Memory data dump (debug)
Network router cell (I)
• Cell structure:
– Decision unit
– Routing structure:
• Parallel network
• MUX-DEMUX switch
– Output packet queue (FIFOs)
Network router cell (II): Decision unit
• Static routing table
• Managing data transactions:
– Check in
– Packet arbitration (MUX-DEMUX switch)
• Fixed
• Round-robin
• Data broadcast
– Configure routing path
Action list with candidate transactions (parallel network)
O(0) O(1) O(2) O(3) O(4)
In(0) o
In(1) o o o
In(2) x
In(3) o
In(def) x
Action list with candidate transactions (MUX-DEMUX switch)
O(0) O(1) O(2) O(3) O(4)
In(0) x
In(1) o x x
In(2) x
In(3) x
In(def) x
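The two arbitration policies named for the MUX-DEMUX switch, fixed and round-robin, can be sketched as below. This is a generic illustration of the two schemes, not the decision unit's actual logic; the function names and the request-vector representation are assumptions.

```python
def fixed_priority_grant(requests):
    """Fixed arbitration: the lowest-numbered requesting input always wins."""
    for index, wants in enumerate(requests):
        if wants:
            return index
    return None


def round_robin_grant(requests, last_granted):
    """Round-robin arbitration: search starts just after the last winner."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


requests = [True, True, False, True]   # inputs 0, 1, and 3 have packets queued
first = round_robin_grant(requests, last_granted=1)
second = round_robin_grant(requests, last_granted=first)
```

Fixed priority can starve high-numbered inputs under load, while round-robin gives each requesting input a turn at the cost of keeping the `last_granted` state.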
Static & Dynamic configuration (I)
[Figure: master processor with icache and dcache, MPMC, memory, StreamCtrl, Conf.Ctrl, and the cell array with R blocks]
Static & Dynamic configuration (II)
[Figure: cell array with R blocks and labels M1 to M4]
Case study: Reconfigurable FIR
• FIR filter
– Processing cell: MAC
– Memory cell: Input data FIFO, coefficient ROM
• Time-multiplexed structure for area-driven applications.
• Unfolding (parallelization) to improve processing throughput.
• High-precision computations.
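The trade-off between the time-multiplexed and the unfolded FIR can be modelled with a simple cycle count. This is a behavioural sketch under assumptions: one MAC operation per cycle, and unfolding modelled as several MAC units that each compute one of several consecutive outputs in parallel. The function names and the cycle model are illustrative only.

```python
def fir_time_multiplexed(samples, coeffs):
    """One MAC unit reused for every tap: len(coeffs) cycles per output."""
    fifo = [0] * len(coeffs)             # input data FIFO (memory cell)
    outputs, cycles = [], 0
    for x in samples:
        fifo = [x] + fifo[:-1]           # newest sample enters the FIFO
        acc = 0
        for c, d in zip(coeffs, fifo):   # coefficients read from the ROM
            acc += c * d                 # one multiply-accumulate per cycle
            cycles += 1
        outputs.append(acc)
    return outputs, cycles


def fir_unfolded(samples, coeffs, factor):
    """'factor' MAC units, each producing one of 'factor' consecutive outputs."""
    taps = len(coeffs)
    padded = [0] * (taps - 1) + list(samples)
    outputs, cycles = [], 0
    for start in range(0, len(samples), factor):
        block = range(start, min(start + factor, len(samples)))
        accs = [0] * len(block)
        for t in range(taps):            # one cycle: every MAC handles tap t
            for m, n in enumerate(block):
                accs[m] += coeffs[t] * padded[n + taps - 1 - t]
            cycles += 1
        outputs.extend(accs)
    return outputs, cycles


y_tm, cycles_tm = fir_time_multiplexed([1, 2, 3, 4, 5], [1, 2, 3])
y_uf, cycles_uf = fir_unfolded([1, 2, 3, 4, 5], [1, 2, 3], factor=2)
```

Both versions return the same outputs. The unfolded one finishes in fewer cycles at the cost of more processing cells, which is the area versus throughput choice the slide describes.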
Case study: Reconfigurable FFT processor
• Radix-2² structure
• Folding
Radix-2² FFT building block
• Basic radix-2² FFT building block
• A 2,048-point radix-2² pipeline FFT
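As a reference for what a radix-2² building block computes, here is a recursive decimation-in-frequency sketch: two radix-2 butterfly steps joined only by a trivial multiplication by -j, followed by the full twiddle factors. It is a software model of the arithmetic, not the pipeline hardware or its mapping onto cells, and `fft_radix22` is a hypothetical name. The input length is assumed to be a power of two.

```python
import cmath


def fft_radix22(x):
    """Radix-2^2 DIF FFT of a power-of-two-length sequence, natural-order output."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:                                   # leftover radix-2 stage
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n // 2, n // 4
    # First butterfly step: distance n/2.
    a = [x[i] + x[i + half] for i in range(half)]
    b = [x[i] - x[i + half] for i in range(half)]
    seqs = [[], [], [], []]
    for i in range(quarter):
        w = cmath.exp(-2j * cmath.pi * i / n)    # twiddle factor W_n^i
        # Second butterfly step: distance n/4, with the trivial -j rotation.
        seqs[0].append(a[i] + a[i + quarter])
        seqs[2].append((a[i] - a[i + quarter]) * w ** 2)
        seqs[1].append((b[i] - 1j * b[i + quarter]) * w)
        seqs[3].append((b[i] + 1j * b[i + quarter]) * w ** 3)
    out = [0j] * n
    for r in range(4):
        sub = fft_radix22(seqs[r])               # X[4k + r] from sub-sequence r
        for k in range(quarter):
            out[4 * k + r] = sub[k]
    return out


def dft(x):
    """Direct O(n^2) DFT used only as a cross-check."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]
```

Only the output of every second butterfly step needs a general complex multiplier, which is the usual argument for radix-2² over plain radix-2 in pipeline FFTs.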
Radix-2² pipeline FFT
• Simple mapping
– Simple to scale up.
– Local communication only.
– High storage capacity demand in each single memory cell.
• Simple mapping with concatenated memory cells
– Low storage capacity demand in each single memory cell.
– Global data communications.
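The storage trade-off between the two mappings can be estimated with a small calculation. It assumes a single-path delay feedback pipeline, where butterfly stage k of an N-point FFT needs a delay buffer of N/2^k samples, and one memory cell per stage in the simple mapping. The slides do not state this organization, so treat the numbers as illustrative; the function names and the 256-sample cell capacity are hypothetical.

```python
def sdf_delay_depths(n_points):
    """Delay-buffer depth per butterfly stage: N/2, N/4, ..., 1 (assumed SDF pipeline)."""
    depths = []
    depth = n_points // 2
    while depth >= 1:
        depths.append(depth)
        depth //= 2
    return depths


def cells_needed(depth, cell_capacity):
    """Memory cells to concatenate so that one buffer of 'depth' samples fits."""
    return -(-depth // cell_capacity)            # ceiling division


depths = sdf_delay_depths(2048)
largest = max(depths)                            # worst-case demand on a single cell
concatenated = [cells_needed(d, 256) for d in depths]
```

Under these assumptions the first stage alone needs 1,024 samples in one cell, while concatenating 256-sample cells caps the per-cell demand but forces data to travel between cells that are not neighbours.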
Time-multiplexed FFT (I)
Time-multiplexed FFT (II)
• FFT benchmark comparison
– Rapid system reconfiguration: 40 ns @ 300 MHz
– High performance: 2.5x vs. DSPs, 6.5x vs. GPPs

Architecture          fmax [MHz]  FFT size [point]  Execution time [cc]  Code size [byte]  Reconfiguration code size [byte]
CGRA                  534         256               2,242                1,032             30
                                  1024              9,943
Texas TMS-320VC5502   300         256               5,389                462               462 (code reload)
                                  1024              25,921
ARM926EJ-S            276         256               13,194               -                 -
                                  1024              66,196
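The quoted speed-ups can be checked against the table with a few lines of arithmetic. The cycle counts and clock rates below are copied from the table; the dictionary layout and function names are illustrative.

```python
# Clock rate in MHz and execution time in clock cycles, copied from the table.
BENCH = {
    "CGRA":         {"fmax_mhz": 534, "cycles": {256: 2242, 1024: 9943}},
    "TMS320VC5502": {"fmax_mhz": 300, "cycles": {256: 5389, 1024: 25921}},
    "ARM926EJ-S":   {"fmax_mhz": 276, "cycles": {256: 13194, 1024: 66196}},
}


def cycle_speedup(other, size, ref="CGRA"):
    """How many times more cycles 'other' needs than 'ref' for the same FFT size."""
    return BENCH[other]["cycles"][size] / BENCH[ref]["cycles"][size]


def exec_time_us(arch, size):
    """Execution time in microseconds at fmax (cycles divided by MHz)."""
    return BENCH[arch]["cycles"][size] / BENCH[arch]["fmax_mhz"]


for size in (256, 1024):
    print(size,
          round(cycle_speedup("TMS320VC5502", size), 2),
          round(cycle_speedup("ARM926EJ-S", size), 2))
```

In clock cycles the ratios come out at about 2.4 to 2.6 against the DSP and about 5.9 to 6.7 against the GPP, in line with the 2.5x and 6.5x figures. Comparing wall-clock time at each device's fmax instead would widen the gap, since the CGRA also runs at the highest clock rate.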
Case study: Multi-standard OFDM synchronization
• Multiple wireless radio standards
• Concurrent data stream processing
• Coarse Time Synchronization
• Carrier Frequency Offset (CFO) estimation
[Figure labels: γ[θ], θ̂, arg{γ[θ̂]}]
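Coarse time synchronization and CFO estimation are commonly done with a delay-and-correlate metric over a repeated preamble. The sketch below assumes that scheme: γ[θ] is the correlation between two consecutive windows of `length` samples, the timing estimate is the θ that maximizes |γ[θ]|, and the CFO follows from the phase of γ at that point. The slides do not give the exact algorithm, so the formulation, the test preamble, and all names are assumptions.

```python
import cmath


def correlate(r, theta, length):
    """gamma[theta] = sum over k of conj(r[theta + k]) * r[theta + k + length]."""
    return sum(r[theta + k].conjugate() * r[theta + k + length]
               for k in range(length))


def coarse_sync(r, length):
    """Return (timing estimate, CFO estimate in cycles per sample)."""
    best_theta, best_gamma = 0, 0j
    for theta in range(len(r) - 2 * length + 1):
        gamma = correlate(r, theta, length)
        if abs(gamma) > abs(best_gamma):
            best_theta, best_gamma = theta, gamma
    cfo = cmath.phase(best_gamma) / (2 * cmath.pi * length)
    return best_theta, cfo


# Noise-free test signal: a unit-magnitude preamble half sent twice, with a CFO.
L, PAD, CFO = 16, 10, 0.01
half = [cmath.exp(1j * cmath.pi * k * k / L) for k in range(L)]
tx = [0j] * PAD + half + half + [0j] * PAD
rx = [s * cmath.exp(2j * cmath.pi * CFO * n) for n, s in enumerate(tx)]
theta, cfo = coarse_sync(rx, L)
```

The phase estimate is unambiguous only while |CFO × length| stays below 0.5, which is why schemes of this kind give a coarse estimate that later stages refine.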
Implementation results (I)
• 65 nm low-power regular-VT CMOS:
– Area: 0.48 mm²
– Clock frequency: 534 MHz
• Adaptive word length scheduling.
• Adoption of different algorithms, e.g. novel sign-bit OFDM acquisition.
Summary
• Reconfigurable cell array enables hardware sharing at different levels, i.e., task-, function-, and algorithm-level.
• Coarse-grained reconfigurable cell array comprises distributed processing and memory cells, and a hierarchical NoC structure.
• In-cell dynamic reconfiguration enables fast context switching.