DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs
Deming Chen and Jason Cong
Computer Science Department
University of California, Los Angeles
This work is partially supported by the California MICRO program and the NSF Grant CCR-0306682
Outline
  Introduction
  Related Works
  Definitions and Problem Formulation
  Algorithm Description
    Cut Enumeration
    Delay and Area Propagation
    Cost Function for a Cut
    Global and Local Cost Adjustments
    Iterative Cut Selection
  Experimental Results
  Conclusions and Future Work
Introduction
  Field-programmable gate arrays (FPGAs) have become increasingly popular
    Fast to market
    No or very low NRE (non-recurring expenses)
  The LUT-based FPGA architecture dominates the existing programmable chip industry
  FPGA technology mapping converts a given Boolean circuit into a functionally equivalent network comprised only of LUTs
  FPGA technology mapping is a crucial optimization step in the FPGA design flow
Related Works on FPGA Mapping
  Area Minimization
    Chortle-crf, [Francis, et al, DAC’91]
    MIS-pga, [Murgai, et al, ICCAD’91]
    Praetor, [Cong, et al, FPGA’99]
    Anti-fuse FPGA Mapper, [Kang, et al, ASPDAC’04]
  Delay Minimization
    DAG-Map, [Chen, et al, DTC’92]
    FlowMap, [Cong, et al, ICCAD’92]
    Edge-map, [Yang, et al, ICCAD’94]
  Power Minimization
    PowerMinMap, [Li, et al, ASPDAC’03]
    Emap, [Lamoureux, et al, ICCAD’03]
    DVmap, [Chen, et al, FPGA’04]
  Simultaneous Delay and Area Minimization
    FlowMap-r, [Cong, et al, TVLSI’94]
    CutMap, [Cong, et al, FPGA’95]
    BoolMap-D, [Legl, et al, DAC’96]
Definitions
  DAG: a Boolean network
  Cone Cv: a sub-network rooted on a node v
  K-feasible cone: |input(Cv)| ≤ K
  Fanin cone Fv: the largest Cv
  K-feasible cut: a K-feasible Cv
    Occupies a K-LUT
  Unit delay model
    One LUT contributes one unit delay
    No edge delay
[Figure: example network above the PIs with nodes a, b, c, d, e and root v, showing the fanin cone Fv, a 3-feasible cone Cv, and a delay of 2]
Problem Formulation
  Delay-optimal area optimization problem
    Given: a Boolean network; an integer K
    Goal: cover the network with K-feasible cones (K-LUTs), such that
      The mapping depth is optimal
      The area (number of LUTs) is minimized
  Area minimization is an NP-hard problem
Highlights of Our Algorithm
  Consider potential node duplications and make mapping-area estimation close to reality
  Search solution space considering both global and local optimality information
  Carry out an iterative cut selection procedure on top of cost adjustment to further improve solution quality
  Each technique used is simple and intuitive
    The key is the right combination of them
Cut Enumeration
  Combine sub-cuts on the inputs of the gate
  Process each gate in topological order from PIs to POs
[Figure: two views of a network with nodes a, b, c, d, w, x, y, z, showing sub-cuts on the inputs of a gate being combined into a new cut]
Complexity Analysis
  The number of cuts on a node is O(n^K) in the worst case
  In practice, it is a small constant for small K
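The enumeration step above can be sketched as follows. This is a minimal illustration, not the DAOmap implementation: the function name `enumerate_cuts`, the `fanins` dictionary, and the choice to keep the trivial cut {v} on every node are assumptions made for the example.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Enumerate all K-feasible cuts of every node.

    fanins: dict mapping each node to its list of fanin nodes (empty for PIs).
    order:  nodes in topological order from PIs to POs.
    k:      maximum number of inputs of a LUT.
    """
    cuts = {}
    for v in order:
        if not fanins[v]:
            # A primary input only has its trivial cut.
            cuts[v] = {frozenset([v])}
            continue
        merged = {frozenset([v])}  # trivial cut, used when v feeds another gate
        # Combine one sub-cut from each fanin of the gate.
        for combo in product(*(cuts[u] for u in fanins[v])):
            cut = frozenset().union(*combo)
            if len(cut) <= k:  # keep only K-feasible cuts
                merged.add(cut)
        cuts[v] = merged
    return cuts
```

For a gate v with fanins d and c, where d has fanins a and b, calling `enumerate_cuts` with K = 3 yields the cuts {a, b, c}, {c, d} and the trivial cut {v} on v.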
[Figure: Average and Maximum Number of Cuts per Node; x-axis: K (3 to 6); y-axis: Number of Cuts (0 to 250); series: Average, Maximum; averaged over the 20 largest MCNC benchmarks]
Delay and Area Propagation
[Figure: network with nodes a, b, c, d, e, f, g, w, x, y, z, annotated with the propagated delay and area of each cut and node, from Delay 1, Area 1 up to Delay 2, Area 3]
  The propagation process visits cuts and nodes iteratively
  The largest best delay among the POs is the optimal mapping delay
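The delay part of this propagation can be sketched as below, assuming the unit delay model from the Definitions slide (one LUT costs one unit, edges cost nothing, PIs arrive at time 0). The function name `propagate_delay` and the data layout (a dictionary of cuts per node, each cut a frozenset of input nodes) are assumptions for illustration only.

```python
def propagate_delay(cuts, order, primary_inputs):
    """Return the best achievable delay of every node under the unit delay model.

    cuts:  dict mapping each non-PI node to its set of cuts (frozensets of inputs).
    order: nodes in topological order from PIs to POs.
    """
    best = {}
    for v in order:
        if v in primary_inputs:
            best[v] = 0
            continue
        # Delay of a cut = one LUT level on top of its slowest input.
        # The trivial cut {v} cannot implement v, so it is skipped.
        best[v] = min(
            1 + max(best[i] for i in cut)
            for cut in cuts[v]
            if cut != frozenset([v])
        )
    return best
```

The optimal mapping delay is then `max(best[po] for po in primary_outputs)`.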
Area Estimation
  $A_C = \sum_{i \in input(C)} [A_i / f(i)] + U_C$
    A_i: estimated area of the fanin cone on signal i
    f(i): fanout number of i
    U_C: area of the cut itself
  Try to estimate area considering fanout effect
    Praetor, [Cong, et al, FPGA’99]
  Can under-estimate the area because of node duplications
[Figure: example network with nodes m, n, o, p, q, r, s, t, u and cuts Ct, Cu and C over fanin1 and fanin2; annotations: f(p) = 2, A_p, A_s / 2, C3]
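A small worked instance of the area estimate A_C = Σ [A_i / f(i)] + U_C is sketched below. The function name `cut_area` and the sample numbers are assumptions; in particular U_C = 1 is used only as a placeholder value for the area of the cut itself.

```python
def cut_area(cut_inputs, area, fanout, u_c):
    """Estimated area of a cut: shared fanin-cone areas plus the cut's own area.

    area[i]:   estimated area of the fanin cone on signal i (A_i).
    fanout[i]: fanout number of signal i (f(i)).
    u_c:       area of the cut itself (U_C).
    """
    # Each input contributes its cone area divided among its fanouts.
    return sum(area[i] / fanout[i] for i in cut_inputs) + u_c


# Signal p has two fanouts, so only half of its cone area is charged to this cut.
estimate = cut_area(["p", "q"], {"p": 3.0, "q": 0.0}, {"p": 2, "q": 1}, u_c=1.0)
```

Dividing by the fanout number assumes the fanin cone is shared by all fanouts; when a node is duplicated instead, the shared fraction under-counts the real area, which is the under-estimation noted above.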
Cost (Area) Function of a Cut
  Some key parameters
    I_C: cutsize of C
    N_C: number of nodes covered by C
    f(v): fanout number of the root node v
    P_f: duplication cost
[Figure: network with nodes a, b, c, d, e and root v, with cuts C1 and C2]
[Equation: U_C in terms of I_C, N_C, f(v), P_f1 and P_f2]
Duplication Cost Adjustment
  Consider potential node duplications
  Check the sub-cuts for multiple fanouts
  Propagate adjusted cost globally
[Figure: network with nodes m, n, o, p, q, r, s; a new cut C with I_C = 4 is formed from sub-cuts Cf1 and Cf2, where N_Cf2 = 1 and multiple fanouts are marked]
Duplication Cost
  [Equation: P_f in terms of N_Cf and I_C when f(i) exceeds 1 (multiple fanouts); 0 otherwise]
  N_Cf: number of nodes the sub-cut Cf contains
  I_C: cutsize of C
Cut Selection – Mapping Generation
  From POs to PIs
  Critical paths: optimal delay + best area available
  Non-critical paths: relaxed delay + better area
[Figure: network with nodes a, b, c, d, e, f, g, w, x, y, z, marking critical and non-critical LUTs]
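The PO-to-PI selection pass can be sketched as follows. This is an illustrative simplification rather than DAOmap's procedure: the function name `select_cuts`, the `cost` dictionary keyed by (node, cut), and the rule that every input of a chosen cut must arrive one level before its root are assumptions. On critical paths only delay-optimal cuts pass the timing check; on non-critical paths the relaxed required time admits more cuts, so a cheaper one can win.

```python
def select_cuts(order, cuts, arrival, cost, primary_inputs, primary_outputs):
    """Pick one cut per needed node, walking from POs back to PIs.

    order:   nodes in topological order from PIs to POs.
    cuts:    dict node -> iterable of candidate cuts (frozensets of inputs).
    arrival: best delay of every node from delay propagation.
    cost:    dict (node, cut) -> estimated area cost (hypothetical layout).
    """
    depth = max(arrival[po] for po in primary_outputs)  # optimal mapping depth
    required = {po: depth for po in primary_outputs}
    mapping = {}
    for v in reversed(order):
        if v not in required or v in primary_inputs:
            continue  # node is not a LUT root in this mapping
        # Keep only cuts that still meet the required time of v.
        feasible = [
            c for c in cuts[v]
            if v not in c and 1 + max(arrival[i] for i in c) <= required[v]
        ]
        chosen = min(feasible, key=lambda c: cost[(v, c)])
        mapping[v] = chosen
        for i in chosen:
            # Inputs of the chosen cut become LUT roots (or PIs) one level earlier.
            required[i] = min(required.get(i, depth), required[v] - 1)
    return mapping
```

Visiting nodes in reverse topological order guarantees that every fanout of a node has already set its required time before the node itself is processed.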
Techniques for Better Cut Selection
  Cut selection is equivalent to the min-cover problem
    Greedy approach will not work well
    Use heuristics to guide the selection
  Iterative Cut Selection Procedure
  Local Cost Adjustment
    Input Sharing
    Slack Distribution
    Cut Probing
Iterative Cut Selection (ICS)
  Some valuable information on area is unknown until after mapping
    Mapped LUT root nodes
    Duplicated nodes
  ICS carries out multiple mapping iterations
[Flowchart: Start -> mapping iteration i, i++ -> profiling data -> adjust cut cost -> repeat while i < threshold; exit if i = threshold]
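The control flow of the ICS loop can be sketched as a small driver. The callables `run_mapping` and `adjust_costs` are hypothetical stand-ins for a mapping pass and a cost-adjustment step; their signatures are assumptions, not DAOmap interfaces.

```python
def iterative_cut_selection(run_mapping, adjust_costs, initial_costs, threshold):
    """Repeat mapping and cost adjustment for a fixed number of iterations.

    run_mapping(costs)           -> (mapping, profile), hypothetical mapping pass.
    adjust_costs(costs, profile) -> new costs, hypothetical adjustment step.
    """
    costs = initial_costs
    mapping = None
    for _ in range(threshold):
        # One mapping iteration produces a mapping plus profiling data
        # (for example, which nodes became LUT roots or were duplicated).
        mapping, profile = run_mapping(costs)
        # Profiling data from this iteration adjusts the cut costs of the next one.
        costs = adjust_costs(costs, profile)
    return mapping
```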
Local Cost Adjustment – Input Sharing
  Takes advantage of existing resources
  Considers roots from previous iterations
  The more a cut shares inputs with others, the better for the cut
[Figure: nodes d, e, f, g become LUT roots; a cut shares inputs with existing LUTs; a duplicated node is marked]
[Equation: C_i in terms of C_original and num_share]
Local Cost Adjustment – Slack Distribution
  $Slack_C = Req_v - 1 - \max_{i \in input(C)} Arr_i$
  If Slack_C < 0, C is not a timing-feasible cut
  The larger the Slack_C, the better for C in terms of slack distribution effect
[Figure: network with nodes a, b, c, d, w, x, y, z; Req_d is the required time of the root d, and the largest arrival time among the inputs is marked]
[Equation: C_si in terms of C_i and Slack_C]
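The slack formula Slack_C = Req_v - 1 - max(Arr_i) can be checked with a few lines of code. The function name `cut_slack` and the sample values are assumptions for illustration.

```python
def cut_slack(required_root, arrival, cut_inputs):
    """Slack of a cut: required time of the root, minus one LUT level,
    minus the largest arrival time among the cut's inputs."""
    return required_root - 1 - max(arrival[i] for i in cut_inputs)


arrival = {"a": 1, "b": 0}
slack_ok = cut_slack(3, arrival, ["a", "b"])   # 3 - 1 - 1 = 1, timing-feasible
slack_bad = cut_slack(1, arrival, ["a", "b"])  # 1 - 1 - 1 = -1, not timing-feasible
```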
Local Cost Adjustment – Cut Probing
  Probe the amount of area gain locally before making decisions about a cut
    Reduce connections between LUTs
    Reduce potential node duplications based on previous duplication profiling
    Reconvergent paths handling
  [Equation: C_final in terms of C_si and C_probing]
  Use C_final to guide cut selection
Experimental Results – Settings
  DAOmap is implemented in C within the UCLA RASP system
  Compare LUT counts and runtime to CutMap [Cong et al, FPGA’95]
  Use a 750 MHz SunBlade-1000 Solaris machine
  Test on LUT input numbers from 4 to 6
  Benchmarks
    20 largest MCNC benchmarks
    A set of large industrial benchmarks
Experimental Results of DAOmap over CutMap on MCNC Benchmarks
After mapping
  LUT size  Average Area Reduction  Average Run Time Improvement
  4-LUT     -13.98%                 13.2X
  5-LUT     -16.02%                 24.2X
  6-LUT     -12.44%                 4.7X
After mapping + packing (daomap + mpack) vs. (“cutmap –x” + mpack)
  LUT size  Average Area Reduction  Average Run Time Improvement
  4-LUT     -7.50%                  57.7X
  5-LUT     -11.31%                 38.7X
  6-LUT     -7.90%                  10.1X
Detailed Experimental Results on Industrial Benchmarks
After mapping into 5-LUTs
             CutMap                   DAOmap                   Comparison
  Benchmark  LUT No.  Run Time (s)    LUT No.  Run Time (s)    LUT (Reduce)  Run Time (Improve)
  big1       9928     301             9169     93              -7.6%         3.2
  big2       -        >10 h           14625    708             -             -
  big3       10005    28926           9031     106             -9.7%         272.9
  big4       11800    583             9364     156             -20.6%        3.7
  big5       -        >10 h           32230    3377            -             -
  big6       39000    14437           32028    402             -17.9%        35.9
  Ave.                                                         -13.98%       78.9X
Individual Technique Analysis
  Techniques                        % dropped
  Cut Enumeration
    Min-cost propagation            4.35%
    Global cost adjustment          2.68%
  Cut Selection
    Input sharing                   4.55%
    Iterative cut selection (ICS)   2.04%
  Others                            <1%
Mapping Iteration Analysis
[Figure: Improvement % (0.0% to 2.5%) versus Mapping Iterations (1 to 6)]
  For single iteration only (the base case), use manual profiling [Chen et al, FPGA’04]
  When the iteration number is more than 3, it is no longer helpful
Conclusions and Future Work
  We presented a new mapping algorithm, DAOmap, to minimize FPGA delay and area
  We built several cost-adjustment heuristics and used an iterative mapping procedure
  DAOmap gained a significant amount of area and runtime reduction over a state-of-the-art algorithm, CutMap
  Future work includes adding cut-pruning techniques for mapping with larger K values