Architecture-Specific Packing for Virtex-5 FPGAs
Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal
February 25th, 2008
2
Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Results
• Summary
3
Simplified FPGA Logic Element
[Figure: a 4-LUT with inputs A1 to A4 and output O4, paired with a flip-flop (FF)]
4
Simplified FPGA Logic Block
[Figure: a logic block of four 4-LUT/FF pairs, each connected to the general interconnect]
5
Virtex-5 Logic Block
[Figure: a CLB containing two SLICEs; each SLICE holds four 6-LUT/FF pairs connected to the general interconnect]
6
Dual-Output 6-LUT
[Figure: a 6-LUT with inputs A1 to A6 and two outputs, O6 and O5]
7
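Dual-Output 6-LUT Usage
[Figure: the 6-LUT drawn as two 5-LUTs sharing inputs A1 to A5; A6 selects between them to produce O6, and one 5-LUT also drives O5]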
8
Dual-Output Packing
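[Figure: two functions, X = LogicX(x, y) and Y = LogicY(a, b). Left: each function occupies its own 6-LUT (number of 6-LUTs used: 2). Right: both functions are packed into the two 5-LUTs of one dual-output 6-LUT, with VCC on A6 and the results on O6 and O5 (number of 6-LUTs used: 1)]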
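The packing shown on this slide depends on a feasibility check, sketched here in Python. The sketch assumes the usual Virtex-5 constraint that the two 5-LUTs inside a 6-LUT share the A1 to A5 pins, so two functions fit together only if they use at most five distinct input signals between them. The function name and signal names are hypothetical.

```python
def can_share_lut6(inputs_x, inputs_y, max_shared_inputs=5):
    """Return True if two LUT functions fit in one dual-output 6-LUT.

    Assumption: both functions must fit on the shared A1..A5 pins,
    so the union of their input signals may hold at most five names.
    """
    return len(set(inputs_x) | set(inputs_y)) <= max_shared_inputs


# The slide's example: X = LogicX(x, y) and Y = LogicY(a, b).
print(can_share_lut6({"x", "y"}, {"a", "b"}))            # True: 4 distinct inputs
print(can_share_lut6({"a", "b", "c"}, {"d", "e", "f"}))  # False: 6 distinct inputs
```

A check like this only says whether a merge is legal; whether it is worthwhile is a separate timing question.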
9
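Virtex-5 LUT/FF Pair
[Figure: one LUT/FF pair of a SLICE: a 6-LUT with outputs O6 and O5, the carry chain (CIN, CY, XOR), the F7 mux, the AX input, the flip-flop (FF) with output AQ, and the SLICE outputs A and AMUX]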
10
Dual-Output Packing Tradeoff
[Figure: LUT/FF pair with two 6-LUTs; labels: AX, F7, O5, O6, FF]
11
Dual-Output Packing in Placer
• Goal: To reduce area without performance hit
– Can be done pre-placement
• Will be sub-optimal without delay estimates
– Use delay estimates available during placement to make good decisions on when to merge two LUTs
• Approach:
– Allow second 5-LUT to be used, when performance impact is small
– Incorporate LUT packing in placer’s cost function
12
Placer Cost Function
• Previous cost function:
– Cost = a * W + b * T
– W: wirelength cost
– T: timing performance cost
• Extend cost function with two new terms
– One based on 6-LUT utilization (L)
– One based on SLICE utilization (S)
– Cost = a * W + b * T + c * L + d * S
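The extended cost function can be written down directly as a weighted sum. The following is a minimal sketch in Python; the class and function names are hypothetical, and the weight values are illustrative rather than taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class CostWeights:
    a: float        # weight on wirelength cost W
    b: float        # weight on timing performance cost T
    c: float = 0.0  # weight on 6-LUT utilization term L
    d: float = 0.0  # weight on SLICE utilization term S


def placer_cost(weights, W, T, L=0.0, S=0.0):
    """Cost = a*W + b*T + c*L + d*S (c = d = 0 gives the previous cost)."""
    return weights.a * W + weights.b * T + weights.c * L + weights.d * S


# Illustrative numbers only.
print(placer_cost(CostWeights(a=1, b=2), W=10, T=5))                      # 20
print(placer_cost(CostWeights(a=1, b=2, c=3, d=4), W=10, T=5, L=2, S=1))  # 30
```

With c and d set to zero the function reduces to the previous cost, which makes it easy to compare placements with and without the packing terms.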
13
6-LUT Utilization Term
• L is computed based on all the used 6-LUT slots
• Where
14
SLICE Utilization Term
• S is computed based on all the available SLICEs
• Let:
– Ni = Number of used 5-LUTs in SLICE i (at most 8)
$S = \sum_{i=0}^{m} S_i$
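The structure of this term, a sum of per-SLICE contributions S_i over all SLICEs, can be sketched in Python. The slide text does not preserve how S_i is derived from Ni, so the sketch takes that function as a caller-supplied argument; the example `occupied` function is only a placeholder and not the authors' formula.

```python
MAX_5LUTS_PER_SLICE = 8  # four dual-output 6-LUTs, two 5-LUTs each


def slice_utilization_term(used_5luts_per_slice, per_slice_term):
    """Sum a per-SLICE term S_i = per_slice_term(N_i) over all SLICEs."""
    total = 0.0
    for n_i in used_5luts_per_slice:
        if not 0 <= n_i <= MAX_5LUTS_PER_SLICE:
            raise ValueError(f"N_i must be between 0 and 8, got {n_i}")
        total += per_slice_term(n_i)
    return total


def occupied(n_i):
    """Placeholder S_i: charge 1 for every SLICE that is used at all."""
    return 1.0 if n_i > 0 else 0.0


print(slice_utilization_term([0, 3, 8, 1], occupied))  # 3.0
```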
15
Performance Recovery
• Helpful to prohibit packing in certain cases for performance reasons
• Other used elements in a SLICE may block the “good” path from the O5 output to the external interconnect.
16
Performance Recovery: XOR
[Figure: Virtex-5 LUT/FF pair (LUT6, carry chain with CIN, CY and XOR, F7 mux, AX, FF with AQ, outputs A and AMUX) illustrating the XOR case]
17
Performance Recovery: F7
[Figure: Virtex-5 LUT/FF pair (LUT6, carry chain with CIN, CY and XOR, F7 mux, AX, FF with AQ, outputs A and AMUX) illustrating the F7 case]
18
6-LUT Reduction
[Chart: % 6-LUT reduction (y-axis, 0 to 16) per benchmark design (x-axis). Annotation: 5.5% 6-LUT reduction]
19
SLICE Reduction
[Chart: % SLICE reduction (y-axis, 0 to 25) per benchmark design (x-axis). Annotation: 10.23% SLICE reduction]
20
Performance Results
[Chart: performance loss (%) (y-axis, -15 to 25) versus SLICE reduction (%) (x-axis, 0 to 25). Annotation: 3.3% performance degradation]
21
Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Summary
22
New Type of Packing Problem
• Traditionally, packing is considered to be a problem of just LUTs and flops
• However, Virtex-5 contains large IP blocks that present their own packing problem
23
Virtex-5 Block RAMs
[Figure: a block RAM tile shown as two 18 Kb RAMs or as one 36 Kb RAM]
• A 36 Kbit block RAM tile can store:
a) a single 36 Kb RAM
b) two independent 18 Kb RAMs
• Block RAM has configurable “aspect ratio”
• 18 Kb RAM can be configured as: 16K x 1, 8K x 2, 2K x 9, or 1K x 18
• Tools decide which independent 18 Kb block RAMs to locate in which tile
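The aspect ratios listed above trade depth against width at roughly constant capacity. A small Python check of the arithmetic follows; the constant and function names are hypothetical. The x1 and x2 shapes come to 16 Kb, while the x9 and x18 shapes reach the full 18 Kb because, as far as the Virtex-5 documentation describes it, those widths also use the parity bits.

```python
# (depth, width) pairs for the 18 Kb configurations named on the slide.
CONFIGS_18KB = [(16 * 1024, 1), (8 * 1024, 2), (2 * 1024, 9), (1024, 18)]


def capacity_bits(depth, width):
    """Total storage of a depth x width memory, in bits."""
    return depth * width


for depth, width in CONFIGS_18KB:
    print(f"{depth} x {width} = {capacity_bits(depth, width)} bits")
# 16384 x 1 = 16384 bits
# 8192 x 2 = 16384 bits
# 2048 x 9 = 18432 bits
# 1024 x 18 = 18432 bits
```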
24
Virtex-5 DSP48E Block
• A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E.
• Multiple DSP48Es can be chained together to form more complex functions through the PCIN and PCOUT ports
[Figure: DSP48E block diagram: inputs A (25-bit), B (18-bit), C (48-bit) and PCIN; optional pipeline registers and routing logic; a 25x18 multiplier; a 48-bit ALU; pattern detect; outputs P and PCOUT]
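The multiply-accumulate and chaining behaviour described on this slide can be approximated with a small behavioural model in Python. This is a simplified sketch, not a model of the real block: it ignores the C input, the ALU operating modes, the pipeline registers and pattern detect, and the function names are hypothetical.

```python
MASK48 = (1 << 48) - 1


def wrap48(value):
    """Wrap an integer to 48-bit two's complement, like the 48-bit result."""
    value &= MASK48
    return value - (1 << 48) if value & (1 << 47) else value


def dsp48e_mac(a, b, pcin=0):
    """Simplified P = A * B + PCIN; the result also serves as PCOUT."""
    if not -(1 << 24) <= a < (1 << 24):
        raise ValueError("A must fit in 25 signed bits")
    if not -(1 << 17) <= b < (1 << 17):
        raise ValueError("B must fit in 18 signed bits")
    return wrap48(a * b + pcin)


# Two chained blocks: the first block's PCOUT feeds the second's PCIN.
p0 = dsp48e_mac(3, 4)
p1 = dsp48e_mac(5, 6, pcin=p0)
print(p1)  # 42
```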
25
Block RAM and DSP Floorplan
• Block RAM and DSP48E tiles are organized in columns
[Figure: floorplan with columns of block RAM tiles alongside columns of Virtex-5 DSP tiles; each DSP tile holds two DSP48Es and lines up with a block RAM tile]
26
Block RAM/DSP Packing
• Problem: Placer algorithms are heuristic and sometimes do not find an optimal block RAM packing
• Goal: Leverage preferred block RAM packing patterns to achieve high performance
• Target area: DSP designs
– DSP designs make heavy use of block RAMs and DSP blocks
27
DSP Block RAM Designs
• The most common DSP application is the Finite Impulse Response (FIR) filter
– FIR filters have multiple instances of a “tap”, each of which involves DSP blocks and block RAMs
28
FIR Filter
• A Finite Impulse Response or FIR filter is a digital filter that takes a weighted average of the signals in a delay line
• An N-tap filter can be expressed as: $y[n] = c_0 x[n] + c_1 x[n-1] + \dots + c_{N-1} x[n-N+1]$
– Where:
• y[n] is the output of the filter at time n
• x[n] is the data input “signal” at time n
• c_i is the coefficient of tap i
• Each coefficient/data product in the sum is referred to as a “tap”
– DSP units used for the multiply and accumulate
– Block RAMs used to store the data and coefficients
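The N-tap sum can be checked with a short software reference in Python. This is a plain model of the filter equation, assuming samples before time 0 are zero; it says nothing about how the taps map to hardware, and the function name is hypothetical.

```python
def fir_filter(coeffs, samples):
    """Compute y[n] = sum_i coeffs[i] * x[n - i], with x[k] = 0 for k < 0."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, c in enumerate(coeffs):  # one multiply-accumulate per tap
            if n - i >= 0:
                acc += c * samples[n - i]
        outputs.append(acc)
    return outputs


# A 2-tap filter with c0 = 1 and c1 = 2.
print(fir_filter([1, 2], [1, 0, 0, 3]))  # [1, 2, 0, 3]
```

Each pass of the inner loop corresponds to one tap, which is the unit the slides map onto a DSP block plus its data and coefficient storage.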
29
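FIR Designs – Use Case 1
• 2-tap FIR filter involving small block RAMs
[Figure: 2-tap FIR filter. Data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1 are 18 Kb block RAMs held in 36 Kb block RAM tiles; they drive the A and B inputs of DSP0 (tap 0) and DSP1 (tap 1), which are chained through PCOUT and PCIN. Labels: data input, data output]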
30
Packing for Use Case 1
• Packing both 18 Kb block RAMs into a block RAM tile permits a natural alignment between the DSP and block RAMs
• High performance!
[Figure: floorplan with block RAM tiles lined up against Virtex-5 DSP tiles (two DSP48Es each); the highlighted block RAM tile operates as two independent 18 Kb block RAMs]
31
FIR Designs – Use Case 2
• 2-tap FIR filter involving larger block RAMs
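[Figure: 2-tap FIR filter. Data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1 drive the A and B inputs of DSP0 (tap 0) and DSP1 (tap 1), which are chained through PCOUT and PCIN. Labels: 18 Kb block RAM, 36 Kb block RAM, Data RAM, Coefficient RAM]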
32
Packing for Use Case 2
• Two block RAM columns feed one DSP column
• Again provides a natural alignment between the DSP and block RAMs
[Figure: floorplan with one column of Virtex-5 DSP tiles (two DSP48Es each) fed by two columns of block RAM tiles]
33
Block RAM Chains
• Use Case: 18 Kb block RAMs chained through their data pins, with one RAM’s data output connected to the next RAM’s data input (e.g. FIFO)
• Algorithm: Look for such chains and pack them together into a single block RAM tile
• Special Case: 18 Kb block RAMs separated by registers
[Figure: "in" drives RAM0 (dia, doa, addra); RAM0's doa drives RAM1 (dib, dob, addrb); RAM1's dob drives "out". Both are 18 Kb block RAMs]
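The slide only says to look for such chains and pack them together, so the following Python sketch is one possible greedy reading and not the authors' algorithm. It assumes the netlist has already been reduced to a list of (driver, sink) pairs of 18 Kb RAMs, which is a hypothetical input format, and it does not handle the special case of RAMs separated by registers.

```python
def pack_ram_chains(connections):
    """Pair chained 18 Kb RAMs so each pair can share one 36 Kb tile.

    connections: list of (driver, sink) names, meaning the driver RAM's
    data output feeds the sink RAM's data input.
    Greedy: each RAM ends up in at most one pair.
    """
    packed = set()
    tiles = []
    for driver, sink in connections:
        if driver != sink and driver not in packed and sink not in packed:
            tiles.append((driver, sink))
            packed.update((driver, sink))
    return tiles


chain = [("RAM0", "RAM1"), ("RAM1", "RAM2"), ("RAM2", "RAM3")]
print(pack_ram_chains(chain))  # [('RAM0', 'RAM1'), ('RAM2', 'RAM3')]
```

A longer chain is split into consecutive pairs because a tile holds at most two 18 Kb RAMs.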
34
Block RAM/DSP Packing Results
Circuit      Perf. RAM Packing (MHz)   Perf. Baseline (MHz)   Percent Improvement
Circuit 1    500                       400                    25%
Circuit 2    450                       365                    23%
Circuit 3    500                       470                    6%
Circuit 4    425                       435                    -2%
Circuit 5    215                       200                    8%
Geomean      400                       359                    11%
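The geomean row can be reproduced from the five circuit rows. A short Python check follows; the variable and function names are hypothetical, and the frequencies are the ones listed in the table.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


packed = [500, 450, 500, 425, 215]    # MHz, with RAM packing
baseline = [400, 365, 470, 435, 200]  # MHz, baseline

g_packed = geomean(packed)
g_base = geomean(baseline)
improvement = (g_packed / g_base - 1) * 100

print(round(g_packed), round(g_base), round(improvement))  # 400 359 11
```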
35
Summary
• Described two architecture-specific packing approaches for a 65 nm commercial FPGA: Xilinx Virtex-5
– Dual-output LUT packing in placement:
• Achieves 10.2% SLICE reduction and 5.5% LUT reduction
– Packing for DSPs and block RAMs:
• Achieves 11% performance improvement
36
Questions