DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs

Deming Chen and Jason Cong

Computer Science Department

University of California, Los Angeles

This work is partially supported by the California MICRO program and the NSF Grant CCR-0306682

Outline
  Introduction
  Related Works
  Definitions and Problem Formulation
  Algorithm Description
    Cut Enumeration
    Delay and Area Propagation
    Cost Function for a Cut
    Global and Local Cost Adjustments
    Iterative Cut Selection
  Experimental Results
  Conclusions and Future Work

Introduction
  The Field Programmable Gate Array (FPGA) has become increasingly popular
    Fast to market
    No or very low NRE (non-recurring expenses)
  The LUT-based FPGA architecture dominates the existing programmable chip industry
  FPGA technology mapping converts a given Boolean circuit into a functionally equivalent network comprised only of LUTs
  FPGA technology mapping is a crucial optimization step in the FPGA design flow

Related Works on FPGA Mapping
  Area Minimization
    Chortle-crf, [Francis, et al, DAC’91]
    MIS-pga, [Murgai, et al, ICCAD’91]
    Praetor, [Cong, et al, FPGA’99]
    Anti-fuse FPGA Mapper, [Kang, et al, ASPDAC’04]
  Delay Minimization
    DAG-Map, [Chen, et al, DTC’92]
    FlowMap, [Cong, et al, ICCAD’92]
    Edge-map, [Yang, et al, ICCAD’94]
  Power Minimization
    PowerMinMap, [Li, et al, ASPDAC’03]
    Emap, [Lamoureux, et al, ICCAD’03]
    DVmap, [Chen, et al, FPGA’04]
  Simultaneous Delay and Area Minimization
    FlowMap-r, [Cong, et al, TVLSI’94]
    CutMap, [Cong, et al, FPGA’95]
    BoolMap-D, [Legl, et al, DAC’96]

Definitions
  DAG: a Boolean network
  Cone Cv: a sub-network rooted on a node v
  K-feasible cone: |input(Cv)| ≤ K
  Fanin cone Fv: the largest Cv
  K-feasible cut: a K-feasible Cv
    Occupies a K-LUT
  Unit delay model
    One LUT contributes one unit delay
    No edge delay
  [Figure: example network with nodes a, b, c, d, e and root v, showing the fanin cone Fv, a 3-feasible cone Cv, the PIs, and a delay of 2]

Problem Formulation
  Delay-optimal Area Optimization problem
    Given: a Boolean network; an integer K
    Goal: cover the network with K-feasible cones (K-LUTs), such that
      Mapping depth is optimal
      Area (number of LUTs) is minimized
  Area minimization is an NP-hard problem

Highlights of Our Algorithm
  Consider potential node duplications and make mapping-area estimation close to reality
  Search solution space considering both global and local optimality information
  Carry out an iterative cut selection procedure on top of cost adjustment to further improve solution quality
  Each technique used is simple and intuitive
    The key is the right combination of them

Cut Enumeration
  Combine sub-cuts on the inputs of the gate
  Process each gate in topological order from PIs to POs
  [Figure: two views of a network with nodes a, b, c, d, w, x, y, z, showing sub-cuts on the inputs of a gate combined into a new cut]
  Complexity Analysis
    Number of cuts on a node in the worst case is O(n^K)
    In practice, it is a small constant for small K
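The enumeration step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DAOmap's implementation: the function name `enumerate_cuts`, the `fanins` dictionary layout, and the two-gate example network are all hypothetical.

```python
from itertools import product

def enumerate_cuts(fanins, order, k):
    """Return {node: set of cuts}; each cut is a frozenset of leaf nodes.

    fanins: maps each gate to its list of fanin nodes; primary inputs
            (PIs) are the nodes that do not appear as keys.
    order:  all nodes in topological order, PIs first.
    k:      LUT input count; cuts with more than k leaves are dropped.
    """
    cuts = {}
    for node in order:
        trivial = frozenset([node])          # the node itself as a cut
        if node not in fanins:               # primary input
            cuts[node] = {trivial}
            continue
        merged = {trivial}
        # Pick one sub-cut per fanin and take the union of their leaves.
        for combo in product(*(cuts[f] for f in fanins[node])):
            leaves = frozenset().union(*combo)
            if len(leaves) <= k:             # keep only K-feasible cuts
                merged.add(leaves)
        cuts[node] = merged
    return cuts

# Hypothetical network: d = g1(a, b), w = g2(d, c)
fanins = {"d": ["a", "b"], "w": ["d", "c"]}
order = ["a", "b", "c", "d", "w"]
cuts = enumerate_cuts(fanins, order, k=3)
```

Because every fanin contributes all of its sub-cuts to the product, the number of candidate cuts can grow quickly with K, which is why the worst-case bound is polynomial in n with exponent K.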

[Figure: Average and maximum number of cuts per node for K = 3 to 6 (y-axis: number of cuts, 0 to 250), averaged over the 20 largest MCNC benchmarks]

Delay and Area Propagation
  [Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z; each cut and node is annotated with a delay and area pair, ranging from Delay 1, Area 1 to Delay 2, Area 3]
  Propagation process visits cuts and nodes iteratively
  The longest best delay on the POs is the optimal mapping delay
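Under the unit delay model, the delay half of this propagation can be sketched as below. The sketch assumes that PIs have delay 0 and that each gate's non-trivial cuts are already enumerated; the function name, the data layout, and the example network are hypothetical, and area propagation is left out.

```python
def propagate_delay(cuts, order, pis):
    """Best achievable LUT depth per node (one LUT = one unit of delay)."""
    best = {}
    for node in order:                       # PIs first, then gates
        if node in pis:
            best[node] = 0                   # assumption: PIs arrive at 0
        else:
            # A cut costs one LUT level on top of its slowest input.
            best[node] = min(
                1 + max(best[leaf] for leaf in cut)
                for cut in cuts[node]
            )
    return best

pis = {"a", "b", "c"}
order = ["a", "b", "c", "d", "w"]
cuts = {
    "d": [{"a", "b"}],
    "w": [{"c", "d"}, {"a", "b", "c"}],      # non-trivial cuts only
}
best = propagate_delay(cuts, order, pis)
mapping_delay = max(best[po] for po in ["w"])  # longest best delay on the POs
```

In this example the 3-input cut on w reaches the PIs directly, so the best delay of w is one LUT level rather than two.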

Area Estimation
  $A_C = \sum_{i \in input(C)} [A_i / f(i)] + U_C$
    A_i: estimated area of the fanin cone on signal i
    f(i): fanout number of i
    U_C: area of the cut itself
  Tries to estimate area considering the fanout effect
    Praetor, [Cong, et al, FPGA’99]
  Can under-estimate the area because of node duplications
  [Figure: network with nodes m, n, o, p, q, r, s, t, u and cuts Ct, Cu and C; labels: f(p) = 2, Ap, As / 2]
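The area estimate is a sum of fanout-shared input areas plus the cut's own area, which a few lines of Python can make concrete. The function name and all numeric values are hypothetical, chosen only to show the arithmetic.

```python
def estimate_cut_area(inputs, area, fanout, cut_area):
    """A_C = sum over cut inputs of A_i / f(i), plus U_C."""
    return sum(area[i] / fanout[i] for i in inputs) + cut_area

area = {"p": 2.0, "s": 1.0}      # hypothetical A_i values
fanout = {"p": 2, "s": 1}        # p drives two fanouts, s drives one
a_c = estimate_cut_area(["p", "s"], area, fanout, cut_area=1.0)
```

Here the area of p's fanin cone is split across its two fanouts, so only half of it is charged to this cut. If the other fanout ends up duplicating that cone, the real area is higher than the estimate.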

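Cost (Area) Function of a Cut
  Some key parameters
    I_C: cutsize of C
    N_C: number of nodes covered by C
    f(v): fanout number of the root node v
    P_f: duplication cost
  [Equation: U_C expressed in terms of I_C, N_C, f(v), P_f1 and P_f2]
  [Figure: root node v over nodes a, b, c, d, e, with cuts C1, C2 and C3 and the labels fanin1 and fanin2]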

Duplication Cost Adjustment
  Consider potential node duplications
  Check the sub-cuts for multiple fanouts
  Propagate adjusted cost globally
  Duplication cost:
    $P_f = \begin{cases} N_{C_f} / I_C & \text{if } f(i) > 1 \\ 0 & \text{otherwise} \end{cases}$
    N_{C_f}: number of nodes the sub-cut C_f contains
    I_C: cutsize of C
  [Figure: network with nodes m, n, o, p, q, r, s; labels: sub-cut Cf1, sub-cut Cf2 with N_Cf2 = 1, multiple fanouts, new cut C with I_C = 4]
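As a small worked example, the sketch below reads the duplication cost as the sub-cut's node count divided by the cutsize of C when the sub-cut's root has more than one fanout, and 0 otherwise. That reading, the function name, and the fanout value of 2 are assumptions. The values N_Cf2 = 1 and I_C = 4 are the ones labeled in the slide's figure.

```python
def duplication_cost(subcut_nodes, cut_size, root_fanout):
    """P_f = N_Cf / I_C if the sub-cut's root has multiple fanouts, else 0."""
    if root_fanout > 1:
        return subcut_nodes / cut_size
    return 0.0

# Sub-cut Cf2 from the figure: N_Cf2 = 1, new cut C with I_C = 4.
# The fanout count of 2 is a stand-in for "multiple fanouts".
p_f2 = duplication_cost(subcut_nodes=1, cut_size=4, root_fanout=2)
```

A sub-cut whose root feeds only the new cut contributes no duplication cost, since absorbing it into the LUT cannot force a copy elsewhere.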

Cut Selection – Mapping Generation
  From POs to PIs
  Critical paths: optimal delay + best area available
  Non-critical paths: relaxed delay + better area
  [Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z, marking critical and non-critical LUTs]
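The PO-to-PI traversal can be sketched as follows, assuming one cut has already been chosen for every gate. How that cut is chosen on critical versus non-critical paths is not modeled here, and the names and example network are hypothetical.

```python
def generate_mapping(chosen_cut, pos, pis):
    """Walk from POs toward PIs; each visited gate becomes a LUT root."""
    luts = {}
    stack = list(pos)
    while stack:
        node = stack.pop()
        if node in pis or node in luts:
            continue                         # PIs need no LUT; skip repeats
        luts[node] = chosen_cut[node]        # LUT rooted at node, fed by the cut
        stack.extend(chosen_cut[node])       # the cut inputs must be produced too
    return luts

pis = {"a", "b", "c"}
small_cuts = {"w": frozenset("cd"), "d": frozenset("ab")}
wide_cuts = {"w": frozenset("abc"), "d": frozenset("ab")}
two_luts = generate_mapping(small_cuts, ["w"], pis)
one_lut = generate_mapping(wide_cuts, ["w"], pis)
```

Only nodes reached through selected cuts become LUTs, so picking the wider cut on w makes node d unnecessary and saves one LUT.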

Techniques for Better Cut Selection
  Cut selection is equivalent to a min-cover problem
  A greedy approach will not work well
  Use heuristics to guide the selection
    Iterative Cut Selection Procedure
    Local Cost Adjustment
      Input Sharing
      Slack Distribution
      Cut Probing

Iterative Cut Selection (ICS)
  Some valuable information on area is unknown until after mapping
    Mapped LUT root nodes
    Duplicated nodes
  ICS carries out multiple mapping iterations
  [Flowchart: Start → mapping iteration i, i++ → profiling data → adjust cut cost → repeat while i < threshold; exit if i = threshold]

Local Cost Adjustment – Input Sharing
  Takes advantage of existing resources
  Considers roots from previous iterations
  The more a cut shares inputs with others, the better for the cut
  [Equation: input-sharing cost adjustment, relating the adjusted cut cost to C_original and the share count]
  [Figure: nodes d, e, f, g; labels: become LUT roots, share inputs with existing LUTs, duplicated node]

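Local Cost Adjustment – Slack Distribution
  $Slack_C = Req_v - 1 - \max_{i \in input(C)} (Arr_i)$
  If Slack_C < 0, C is not a timing-feasible cut
  The larger the Slack_C, the better for C in terms of the slack distribution effect
  [Equation: slack-distribution cost adjustment, relating the adjusted cut cost to the input-sharing cost and Slack_C]
  [Figure: example network with nodes a, b, c, d, w, x, y, z; labels: "Reqd: required time of the root" and "largest arrival time among inputs"]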
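The slack of a cut follows directly from the required time of its root and the arrival times of its inputs. A minimal sketch with hypothetical function names and timing values:

```python
def cut_slack(req_root, input_arrivals):
    """Slack_C = Req_v - 1 - max(Arr_i) over the cut inputs (unit LUT delay)."""
    return req_root - 1 - max(input_arrivals)

def is_timing_feasible(req_root, input_arrivals):
    """A cut with negative slack cannot meet the root's required time."""
    return cut_slack(req_root, input_arrivals) >= 0

tight = cut_slack(req_root=3, input_arrivals=[1, 2])    # zero slack
late = cut_slack(req_root=2, input_arrivals=[2, 1])     # negative slack
```

The subtracted 1 accounts for the delay of the LUT that implements the cut itself.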

Local Cost Adjustment – Cut Probing
  Probe the amount of area gain locally before making decisions about a cut
    Reduce connections between LUTs
    Reduce potential node duplications based on previous duplication profiling
    Reconvergent paths handling
  [Equation: C_final obtained from the preceding adjusted cost and C_probing]
  Use C_final to guide cut selection

Experimental Results – Settings
  DAOmap is implemented in C within the UCLA RASP system
  Compare LUT counts and runtime to CutMap [Cong et al, FPGA’95]
  Use a 750 MHz SunBlade-1000 Solaris machine
  Test on LUT input numbers from 4 to 6
  Benchmarks
    20 largest MCNC benchmarks
    A set of large industrial benchmarks

Experimental Results of DAOmap over CutMap on MCNC Benchmarks

After mapping

         Average Area Reduction  Average Run Time Improvement
4-LUT    -13.98%                 13.2X
5-LUT    -16.02%                 24.2X
6-LUT    -12.44%                 4.7X

After mapping + packing (daomap + mpack) vs. (“cutmap –x” + mpack)

         Average Area Reduction  Average Run Time Improvement
4-LUT    -7.50%                  57.7X
5-LUT    -11.31%                 38.7X
6-LUT    -7.90%                  10.1X

Detailed Experimental Results on Industrial Benchmarks

After mapping into 5-LUTs

Benchmark  CutMap                 DAOmap                 Comparison
           LUT No.  Run Time (s)  LUT No.  Run Time (s)  LUT (Reduce)  Run Time (Improve)
big1       9928     301           9169     93            -7.6%         3.2X
big2       -        >10 h         14625    708           -             -
big3       10005    28926         9031     106           -9.7%         272.9X
big4       11800    583           9364     156           -20.6%        3.7X
big5       -        >10 h         32230    3377          -             -
big6       39000    14437         32028    402           -17.9%        35.9X
Ave.                                                     -13.98%       78.9X
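The comparison columns can be recomputed from the raw LUT counts and run times. The sketch below does so for the four benchmarks that CutMap completed. Treating the "Ave." row as a plain arithmetic mean over those four rows is an assumption, though it is consistent with the reported figures.

```python
# (CutMap LUTs, CutMap seconds, DAOmap LUTs, DAOmap seconds), from the table
rows = {
    "big1": (9928, 301, 9169, 93),
    "big3": (10005, 28926, 9031, 106),
    "big4": (11800, 583, 9364, 156),
    "big6": (39000, 14437, 32028, 402),
}

# LUT reduction in percent: negative means DAOmap uses fewer LUTs.
reduction = {
    name: 100.0 * (d_lut - c_lut) / c_lut
    for name, (c_lut, c_time, d_lut, d_time) in rows.items()
}
# Run time improvement as a ratio: CutMap time over DAOmap time.
speedup = {
    name: c_time / d_time
    for name, (c_lut, c_time, d_lut, d_time) in rows.items()
}

avg_reduction = sum(reduction.values()) / len(reduction)
avg_speedup = sum(speedup.values()) / len(speedup)
```

big2 and big5 are excluded because CutMap did not finish within 10 hours, so no ratio can be formed for them.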

Individual Technique Analysis

Techniques                         % dropped
Cut Enumeration
  Min-cost propagation             4.35%
  Global cost adjustment           2.68%
Cut Selection
  Input sharing                    4.55%
  Iterative cut selection (ICS)    2.04%
Others                             <1%

Mapping Iteration Analysis
  [Figure: improvement (%) versus number of mapping iterations (1 to 6); y-axis from 0.0% to 2.5%]
  For a single iteration only (the base case), use manual profiling [Chen et al, FPGA’04]
  Running more than 3 iterations is no longer helpful

Conclusions and Future Work
  We presented a new mapping algorithm, DAOmap, to minimize FPGA delay and area
  We built several cost-adjustment heuristics and used an iterative mapping procedure
  DAOmap achieved significant area and runtime reductions over CutMap, a state-of-the-art algorithm
  Future work includes adding cut-pruning techniques for mapping with larger K values