Recent Advances in Cut-based
FPGA Technology Mapping
Kevin Chung
April 3, 2009
Preamble
• Logic synthesis and verification research is alive and vibrant
• FPGAs are growing fast: scalability in runtime and memory is paramount
Outline
1. Review of Cut-based Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Priority Cuts
5. Area Recovery
6. WireMap
Cut-based Mapping Algorithm
Input: And-Inverter Graph
1. Compute all K-feasible cuts
2. Compute best arrival time at each node
• In topological order (from PI to PO)
• Assuming that each cut maps to a K-LUT
• Assuming that each K-LUT has unit delay
3. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
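The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in ABC or any other tool: the dictionary-based AIG encoding (`fanins`), the function names and the small example graph (inputs a, b, c with p = AND(a, b), q = AND(b, c), r = AND(p, q)) are all assumptions made for the example, and edge inversions are ignored because they do not affect cuts.

```python
from itertools import product

def enumerate_cuts(fanins, order, k):
    """Step 1: all K-feasible cuts. fanins maps AND node -> (f0, f1); PIs are absent."""
    cuts = {}
    for n in order:                                   # topological order
        if n not in fanins:                           # primary input: trivial cut only
            cuts[n] = [frozenset([n])]
            continue
        f0, f1 = fanins[n]
        merged = {c0 | c1 for c0, c1 in product(cuts[f0], cuts[f1])
                  if len(c0 | c1) <= k}
        cuts[n] = [frozenset([n])] + sorted(merged, key=sorted)
    return cuts

def map_for_depth(fanins, order, outputs, k):
    cuts = enumerate_cuts(fanins, order, k)
    arrival, best = {}, {}
    for n in order:                                   # Step 2: PI to PO
        if n not in fanins:
            arrival[n] = 0
            continue
        # Each non-trivial cut is one K-LUT with unit delay.
        options = [(1 + max(arrival[l] for l in c), sorted(c)) for c in cuts[n][1:]]
        arrival[n], best[n] = min(options)
    cover, stack = {}, list(outputs)                  # Step 3: PO to PI
    while stack:
        n = stack.pop()
        if n in fanins and n not in cover:
            cover[n] = best[n]
            stack.extend(best[n])
    return arrival, cover

# Hypothetical example graph.
fanins = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
order = ["a", "b", "c", "p", "q", "r"]
cuts = enumerate_cuts(fanins, order, 3)
arrival, cover = map_for_depth(fanins, order, ["r"], 3)
```

With K = 3 the whole example fits into one LUT rooted at r, so the cover contains a single entry.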
Cut-based Mapping Advantages
• Advantages
– Cuts have direct correspondence to LUTs
• Easy to create LUT-based cost functions (different LUT input delays, output switching activity)
– Cut computation is fast and simple
– Dynamic programming mapping solution
• guarantees optimal delay
• efficient search of LUT design space
Cut-based Mapping Challenges
• Feasible cuts grow quickly with respect to LUT size
• Results depend upon AIG netlist structure
– many possible equivalent AIG structures
– logic restructuring optimizations that work well for one part of the design may not give good mapping for another

K    Avg # of cuts per node
4    8
5    16
6    38
7    95
8    240
Outline
1. Review of Cut-based Mapping
2. More Efficient Cut Computation
• Cut Dropping
• Cut Dominance
3. Lossless Synthesis
4. Priority Cuts
5. Area Recovery
6. WireMap
Cut Dropping
During bottom-up computation of cuts, the set of cuts of a node can be freed once all its fan-outs have been processed.
[Figure: AIG with inputs a, b, c; node p has fanins a and b, node q has fanins b and c, node r has fanins p and q. Bottom-up computation gives Cuts(p) = { {p}, {a, b} }, Cuts(q) = { {q}, {b, c} }, Cuts(r) = { {r}, {p, q}, {p, b, c}, {a, b, q}, {a, b, c} }. Annotation: "Can delete these cuts once node r is done".]
• Once the cuts of node r are computed, the cuts of q are no longer needed
• But the cuts of node p cannot be discarded, since not all fan-outs of p have been processed
• Dramatically reduces peak memory consumption on large designs
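A minimal sketch of cut dropping, assuming a dictionary-based AIG (`fanins`) and a per-node counter of unprocessed fan-outs. The names and the extra node s = AND(p, c), which gives p a second fan-out as in the slide, are hypothetical. A real mapper would record the best cut of a node before freeing its cut set.

```python
from itertools import product

def cuts_with_dropping(fanins, order, k):
    """Enumerate cuts bottom-up, freeing a node's cuts after its last fan-out."""
    fanout_left = {}
    for f0, f1 in fanins.values():
        fanout_left[f0] = fanout_left.get(f0, 0) + 1
        fanout_left[f1] = fanout_left.get(f1, 0) + 1
    live, dropped_at = {}, {}
    for n in order:                                   # topological order
        if n not in fanins:
            live[n] = [frozenset([n])]
            continue
        f0, f1 = fanins[n]
        merged = {c0 | c1 for c0, c1 in product(live[f0], live[f1])
                  if len(c0 | c1) <= k}
        live[n] = [frozenset([n])] + list(merged)
        for f in (f0, f1):
            fanout_left[f] -= 1
            if fanout_left[f] == 0:                   # all fan-outs processed
                del live[f]                           # free the cut set
                dropped_at[f] = n
    return live, dropped_at

# Hypothetical example: p feeds both r and s, q feeds only r.
fanins = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q"), "s": ("p", "c")}
order = ["a", "b", "c", "p", "q", "r", "s"]
live, dropped_at = cuts_with_dropping(fanins, order, 3)
```

Here the cuts of q are freed as soon as r is done, while the cuts of p survive until s has been processed.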
Cuts Behaving Badly
Bottom-up cut computation in the presence of re-convergence might produce dominated cuts.
[Figure: AIG over inputs a, b, c with nodes d, e, f and x, where x = ~a + a.b + ~b.c. Cut sets shown: { .. {d, b, c} .. {a, b, c} .. } at f and { .. {a, d, b, c} .. {a, b, c} .. }.]
Cut {a, b, c} dominates cut {a, d, b, c}.
• The “good” cut {a, b, c} is there: so not a quality issue
• But the “bad” cut {a, d, b, c} may be propagated further: so a run-time issue
• Want to discard dominated cuts quickly
Signature-based Dominance
Problem: Given two cuts, how to quickly determine whether one is a subset of another?
Define the signature of a cut c:
$$\mathrm{sig}(c) = \sum_{n \in c} 2^{\,\mathrm{ID}(n) \bmod 32}$$
where ID(n) is the integer id of node n (Σ means bit-wise OR).
Observation: If cut c1 dominates cut c2 then
sig(c1) OR sig(c2) = sig(c2)
Cheap test for the common case that a cut does not dominate another. Only if the signatures do not rule out dominance is an actual comparison made.
Example
• Let the node ids be a = 1, b = 2, c = 3, d = 4
• Let c1 = {a, b, c} and c2 = {a, d, b, c}
• sig(c1) = 2^1 OR 2^2 OR 2^3
= 00010 OR 00100 OR 01000
= 01110
• sig(c2) = 2^1 OR 2^4 OR 2^2 OR 2^3
= 00010 OR 10000 OR 00100 OR 01000
= 11110
• As sig(c1) OR sig(c2) ≠ sig(c1), c2 does not dominate c1
• But sig(c1) OR sig(c2) = sig(c2), so c1 may dominate c2
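The signature test can be written down directly. The sketch below is an illustration with hypothetical names (`signature`, `dominates`, `node_id`); a practical implementation would cache the signature with each cut instead of recomputing it.

```python
def signature(cut, node_id):
    """Bit-wise OR of 2^(ID(n) mod 32) over the nodes n of the cut."""
    sig = 0
    for n in cut:
        sig |= 1 << (node_id[n] % 32)
    return sig

def dominates(c1, c2, node_id):
    """True if c1 is a subset of c2. The signature check is only a cheap filter."""
    s1, s2 = signature(c1, node_id), signature(c2, node_id)
    if (s1 | s2) != s2:
        return False                  # c1 has a bit c2 lacks: cannot be a subset
    return set(c1) <= set(c2)         # signatures agree: do the real comparison

# Node ids from the example; z is an extra, hypothetical node whose id collides with a.
node_id = {"a": 1, "b": 2, "c": 3, "d": 4, "z": 33}
c1 = {"a", "b", "c"}
c2 = {"a", "d", "b", "c"}
```

Because ids are reduced mod 32, two different nodes can share a bit (a and z above), which is why a passing signature test still needs the exact subset check.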
Run-time of K-feasible Cut Computation
(C/N = average number of cuts per node; T, s = run time in seconds)
Columns: Name, N, then C/N and T, s for each of K = 4, 5, 6, 7, 8, then L/N, %
alu4 2642 6.7 0.00 12.3 0.01 23.1 0.04 45.5 0.18 94.7 1.02 0.00
apex2 2940 7.2 0.01 14.2 0.02 29.2 0.07 62.6 0.32 139.7 1.90 0.00
apex4 2017 8.5 0.00 19.5 0.03 47.0 0.10 116.3 0.62 293.5 4.49 0.10
bigkey 3080 6.6 0.01 12.1 0.02 24.2 0.05 50.1 0.20 99.7 0.84 0.00
clma 11869 8.1 0.04 18.2 0.11 44.4 0.51 114.9 3.01 306.3 20.99 1.64
des 3020 8.0 0.01 17.0 0.03 38.7 0.12 92.0 0.69 218.0 4.80 4.37
diffeq 2566 6.5 0.01 12.3 0.01 26.6 0.07 65.0 0.50 155.9 2.80 3.66
dsip 2521 6.2 0.01 10.7 0.01 20.7 0.03 42.0 0.10 86.7 0.44 0.00
elliptic 5502 6.4 0.01 10.6 0.03 18.5 0.07 36.9 0.33 83.4 2.12 0.20
ex1010 7652 9.2 0.02 23.3 0.11 61.8 0.61 165.8 4.01 438.2 30.43 1.99
ex5p 1719 9.4 0.01 24.1 0.02 66.2 0.17 188.2 1.30 514.8 10.50 14.14
frisc 5905 7.1 0.01 14.4 0.04 32.3 0.16 79.8 0.88 209.0 6.30 1.24
misex3 2441 7.7 0.01 15.7 0.02 33.3 0.08 73.7 0.38 170.7 2.48 0.00
pdc 7527 9.4 0.03 24.8 0.12 67.4 0.68 183.7 4.41 489.4 31.71 4.40
s298 2514 7.9 0.00 17.5 0.02 44.0 0.13 121.9 0.94 346.5 7.10 7.56
s38417 12867 6.6 0.03 13.5 0.10 32.0 0.46 83.1 3.24 225.9 23.72 3.38
s38584 11074 6.1 0.03 11.4 0.06 22.4 0.20 46.7 0.98 101.5 5.81 0.86
seq 2761 7.5 0.00 15.2 0.02 31.7 0.08 68.6 0.37 153.3 2.25 0.04
spla 6556 9.6 0.03 25.8 0.11 73.9 0.69 215.5 4.98 561.4 31.14 13.83
tseng 1920 6.5 0.01 11.8 0.01 23.5 0.04 50.6 0.21 112.7 1.32 1.35
Average 4954.6 7.56 0.01 16.22 0.05 38.05 0.22 95.15 1.38 240.0 9.61 2.94
Peak Memory in Mb with Cut Dropping
Columns: Name, then Total and Drop for each of K = 4, 5, 6, 7, 8
clma 2.56 0.10 6.60 0.22 18.09 0.54 52.03 1.47 152.55 4.07
ex1010 1.87 0.37 5.45 0.97 16.25 2.27 48.40 4.68 140.70 8.38
pdc 1.90 0.27 5.69 0.75 17.42 2.00 52.75 4.98 154.56 11.83
s38417 2.28 0.15 5.28 0.37 14.12 1.10 40.80 3.55 121.98 10.25
s38584.1 1.80 0.11 3.86 0.20 8.52 0.40 19.72 0.86 47.15 1.94
spla 1.68 0.21 5.15 0.59 16.63 1.65 53.88 4.34 154.44 10.04
Ratio 1.00 0.11 1.00 0.10 1.00 0.08 1.00 0.07 1.00 0.06
Outline
1. Review of Cut-based Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Priority Cuts
5. Area Recovery
6. WireMap
Structural Bias
The mapped netlist very closely resembles the subject graph
[Figure: subject graph (inputs a, b, c, d, e; nodes m, p; output f) and the mapped netlist obtained from it by technology mapping.]
Every input of every LUT in the mapped netlist must be present in the subject graph; otherwise technology mapping will not find the match.
The Problem of Structural Bias
Root problem: Best matches for mapping may not be found
[Figure: three implementations of f over inputs a, b, c, d, e, using internal nodes m, p and, on the extreme right, q; annotation "This match is not found".]
Since the point q is not present in the subject graph, the match on the extreme right will not be found
The Problem of Structural Bias
The match would be found with a different subject graph
[Figure: a different subject graph for f over inputs a, b, c, d, e that contains the node q.]
Traditional Synthesis Flow
[Figure: Boolean Network → technology-independent synthesis (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify) → technology-specific mapping → Mapped Netlist]
No guarantee of optimality, since each synthesis step is heuristic.
But structural bias means the mapped netlist depends heavily on the final network.
Since only the network at the end of technology-independent synthesis is used for mapping, good intermediate netlists are not used.
Lossless Synthesis Flow
Idea: Merge intermediate networks into a single network with choices, which can be explored during mapping
[Figure: Boolean Network → technology-independent synthesis (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify), with intermediate networks combined by choice operators → Technology Mapping → Mapped Netlist]
Technology mapping is not any harder with choices (Lehman-Watanabe '95, Chen and Cong '01)
Lossless Synthesis Flow
Can combine results of different technology-independent optimization scripts
[Figure: the same flow, where the results of a script that optimizes area and a script that optimizes delay (speed up, reduce depth) are combined by a choice operator before Technology Mapping → Mapped Netlist]
Mapping with Choices
[Figure: Boolean Network → synthesis steps (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify) → Technology Mapping → Mapped Netlist]
Question 1: How to implement an efficient choice operator?
Question 2: How to map quickly with choices?
Mapping with Choices
Question 1: How to implement an efficient choice operator?
Detecting Choices
Task: Given two Boolean networks, create a network with choices
Network 1: x = (a + b).c, y = b.c.d
Network 2: x = a.c + b.c, y = b.c.d
Step 1: Make And-Inverter decomposition of networks
[Figure: AIGs of the two networks over inputs a, b, c, d with outputs x, y; dotted edges mean inversion]
Detecting Choices
Network 1: x = (a + b).c, y = b.c.d
Network 2: x = a.c + b.c, y = b.c.d
Step 2: Use combinational equivalence checking to detect functionally equivalent nodes up to complementation (Kuehlmann '04, …)
– Random simulation to detect possibly equivalent nodes
– SAT-based decision procedure to prove equivalence
[Figure: AIGs of the two networks over inputs a, b, c, d with outputs x, y]
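The random-simulation part of Step 2 can be sketched as follows. This is an illustration under assumptions, not the procedure of any particular tool: the AIG encoding (each AND node maps to two `(fanin, inverted)` pairs), the function names and the node names t, u, v, w, x1 are hypothetical. It only produces candidate classes; a SAT-based check is still needed to prove each equivalence.

```python
import random

def simulate(fanins, order, nbits=64, seed=0):
    """Bit-parallel simulation of an AIG on nbits random input patterns."""
    rng = random.Random(seed)
    mask = (1 << nbits) - 1
    val = {}
    for n in order:                                   # topological order
        if n not in fanins:
            val[n] = rng.getrandbits(nbits)           # primary input
        else:
            (f0, inv0), (f1, inv1) = fanins[n]
            v0 = val[f0] ^ (mask if inv0 else 0)
            v1 = val[f1] ^ (mask if inv1 else 0)
            val[n] = v0 & v1
    return val, mask

def candidate_classes(fanins, order, nbits=64, seed=0):
    """Group AND nodes whose simulation vectors match up to complementation."""
    val, mask = simulate(fanins, order, nbits, seed)
    buckets = {}
    for n in order:
        if n in fanins:
            key = min(val[n], val[n] ^ mask)          # same key for f and NOT f
            buckets.setdefault(key, []).append(n)
    return [nodes for nodes in buckets.values() if len(nodes) > 1]

# x = (a + b).c  as  t = AND(~a, ~b), x1 = AND(~t, c)
# x = a.c + b.c  as  u = AND(a, c), v = AND(b, c), w = AND(~u, ~v), x = ~w
fanins = {
    "t": (("a", True), ("b", True)),
    "x1": (("t", True), ("c", False)),
    "u": (("a", False), ("c", False)),
    "v": (("b", False), ("c", False)),
    "w": (("u", True), ("v", True)),
}
order = ["a", "b", "c", "t", "x1", "u", "v", "w"]
classes = candidate_classes(fanins, order)
```

Since w is the complement of x1 for every input pattern, the two nodes always land in the same candidate class.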
Detecting Choices
Step 3: Merge equivalent nodes with choice edges
[Figure: the two AIGs over inputs a, b, c, d merged into a single AIG with outputs x, y]
x now represents a class of nodes that are functionally equivalent up to complementation
Mapping with Choices
Question 2: How to map quickly with choices?
Mapping without Choices
Input: And-Inverter Graph
1. Compute all K-feasible cuts
2. Compute best arrival time at each node
• In topological order (from PI to PO)
• Assuming that each cut maps to a K-LUT
• Assuming that each K-LUT has unit delay
3. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
Mapping with Choices
Input: And-Inverter Graph with Choices
1. Compute all K-feasible cuts with choices
2. Compute best arrival time at each node
• In topological order (from PI to PO)
• Assuming that each cut maps to a K-LUT
• Assuming that each K-LUT has unit delay
3. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
Only Step 1 requires modification
Cut Computation with Choices
Cuts are now computed for equivalence classes of nodes
[Figure: AIG over inputs a, b, c, d with internal nodes p, q, r; x1 and x2 are the equivalent nodes forming the class x; y is a second output]
Cuts(x1) = { {x1}, {p, r}, {p, b, c}, {a, c, r}, {a, b, c} }
Cuts(x2) = { {x2}, {q, c}, {a, b, c} }
Cuts(x) = Cuts(x1) ∪ Cuts(x2)
= { {x1}, {p, r}, {p, b, c}, {a, c, r}, {a, b, c}, {x2}, {q, c} }
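The union over an equivalence class is a one-liner in most languages. A small sketch, with a hypothetical helper `class_cuts` and the cut sets of the example entered by hand:

```python
def class_cuts(members, cuts):
    """Union of the cut sets of all members of an equivalence class, duplicates removed."""
    seen, out = set(), []
    for m in members:
        for c in cuts[m]:
            if c not in seen:
                seen.add(c)
                out.append(c)
    return out

def fs(*nodes):
    return frozenset(nodes)

# Cut sets of x1 and x2 from the example above.
cuts = {
    "x1": [fs("x1"), fs("p", "r"), fs("p", "b", "c"), fs("a", "c", "r"), fs("a", "b", "c")],
    "x2": [fs("x2"), fs("q", "c"), fs("a", "b", "c")],
}
cuts_x = class_cuts(["x1", "x2"], cuts)
```

The shared cut {a, b, c} appears once in the result, giving the seven cuts listed for x.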
Lossless Synthesis Summary
Also called Mapping with Structure Choices
Advantages
• Equivalent netlist variations are recorded
– mapping algorithm selects the best among alternative structures to optimize a cost function
• Simple extension of mapping algorithm
Disadvantages
• Even more cuts to explore!
Outline
1. Review of Technology Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Priority Cuts
5. Area Recovery
1. Area-flow
2. Exact Area
6. WireMap
Exhaustive Cut Enumeration Mapping
• Large designs have many K-feasible cuts
– 1M node AIG has ~40M 6-cuts
– Needs ~2GB and ~30 sec for computation
• Past ways of tackling the problem
– Detect and remove dominated cuts
• Does not help much
– Perform cut pruning (store N cuts/node)
• Throws away useful cuts even if N = 1000
– Store only cuts on the frontier
• Reduces memory but increases runtime
Priority Cuts: A Bag of Tricks
• Compute and prioritize cuts (select subset of all cuts)
• Fast and memory efficient – affordable for multiple passes
• Potentially lower quality is overcome via multiple passes
• Use different sorting criteria in each mapping pass to explore additional cost criteria
• Include the best cut from the previous pass into the set of candidate cuts of the current pass
• Efficient memory management
• Only maintain complete set of priority cuts for nodes on the
mapping frontier
• Precompute frontier to create efficiently managed memory pool
• Only save best cut for each node
Computing Priority Cuts
• Consider nodes in a topological order
– At each node, merge two sets of fanin cuts (each containing up to C cuts), getting up to (C+1) * (C+1) + 1 cuts
– Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts
– Select one best cut, and use it to map the node
• Sorting criteria

Mapping pass    Primary metric    Tie-breaker 1    Tie-breaker 2
depth           depth             cut size         area flow
area flow       area flow         fanin refs       depth
exact area      exact area        fanin refs       depth
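The merge, sort and select step for a single node might look like the sketch below. The function name, the cost-function interface and the example graph are assumptions for illustration; only the first two sorting keys of the depth pass (depth, then cut size) are modelled.

```python
from itertools import product

def priority_cuts(node, cuts0, cuts1, k, c, cost):
    """Merge the fanin cut sets (trivial cuts included), keep the C best cuts.

    Returns the cuts passed on to the fanouts (trivial cut of `node` plus the
    C best merged cuts) and the single best cut used to map `node`.
    """
    merged = {a | b for a, b in product(cuts0, cuts1) if len(a | b) <= k}
    ranked = sorted(merged, key=lambda cut: (cost(cut), sorted(cut)))
    kept = ranked[:c]
    return [frozenset([node])] + kept, kept[0]

# Hypothetical example: r = AND(p, q), p = AND(a, b), q = AND(b, c).
arrival = {"a": 0, "b": 0, "c": 0, "p": 1, "q": 1}

def depth_cost(cut):
    # Depth pass: primary metric depth, first tie-breaker cut size.
    return (1 + max(arrival[leaf] for leaf in cut), len(cut))

cuts_p = [frozenset(["p"]), frozenset(["a", "b"])]
cuts_q = [frozenset(["q"]), frozenset(["b", "c"])]
cuts_r, best_r = priority_cuts("r", cuts_p, cuts_q, k=3, c=2, cost=depth_cost)
```

With C = 2 only {a, b, c} (depth 1) and {p, q} (depth 2, two leaves) are kept for r; the two remaining depth-2 cuts with three leaves are discarded.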
Priority-Cut-Based Mapping
Input: And-Inverter Graph
1. Compute all K-feasible cuts for each node
2. Compute arrival time at each node
• In topological order (from PI to PO)
• Compute the depth of all cuts and choose the best one
• Compute at most C good cuts and choose the best one
3. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
Complexity Analysis
• Traditional mapping algorithm
– FlowMap O(Kmn) (J. Cong et al, TCAD '94)
– CutMap O(2KmnK) (J. Cong et al, FPGA '95)
– DAOmap O(KnK) (J. Cong et al, ICCAD '04)
• Proposed mapping algorithm
– O(K C^2 n)
• 6-LUT mapping has about 5X speedup
• 8-LUT mapping has up to 100X speedup
K is max cut size
C is max number of cuts
n is number of nodes
m is number of edges
C between 8 and 16 achieves optimal delay with good runtime
Outline
1. Review of Technology Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Priority Cuts
5. Area Recovery
1. Area-flow
2. Exact Area
6. WireMap
Overview of Area Recovery
• Initial mapping is delay oriented
– Gets best delay for all paths
– Area-based tie-breaking
• Not all paths are critical
– Area recovery tries to slow down non-critical paths to reduce area
– Each node with positive slack: choose a different cut that reduces area
– Done as subsequent passes after delay-oriented mapping
• Question: how to measure area?
How to Measure Area?
Naïve definition: Area (cut) = 1 + [ Σ area (fan-in) ]
[Figure: AIG over inputs a, b, c, d, e, f with internal nodes p, q, r and outputs x, y, shown twice with a different cut of x highlighted]
Area of cut {p, c, d} = 1 + [1 + 0 + 0] = 2
Area of cut {a, b, q} = 1 + [0 + 0 + 1] = 2
Naïve definition says both cuts are equally good in area
Naïve definition ignores sharing due to multiple fan-outs
Area-flow
$$AF(n) = 1 + \sum_i \frac{AF(\mathrm{Leaf}_i(n))}{\mathrm{NumFanout}(\mathrm{Leaf}_i(n))}$$
[Figure: AIG over inputs a, b, c, d, e, f with internal nodes p, q, r and outputs x, y, shown twice with a different cut of x highlighted]
Area-flow of cut {p, c, d} = 1 + [1 + 0 + 0] = 2
Area-flow of cut {a, b, q} = 1 + [0/1 + 0/1 + 1/2] = 1.5
Area-flow "correctly" accounts for sharing and penalizes replication
It is a floating point value!
Area-flow recognizes that cut {a, b, q} is better
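The two area-flow values can be checked with a few lines of Python. The function and dictionary names are hypothetical; the area-flow and fan-out values are the ones used in the example (primary inputs have area-flow 0, p and q have area-flow 1, q has two fan-outs).

```python
def area_flow(leaves, af, num_fanout):
    """AF(n) = 1 + sum over the cut leaves of AF(leaf) / NumFanout(leaf)."""
    return 1 + sum(af[leaf] / num_fanout[leaf] for leaf in leaves)

# Values taken from the example above.
af = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 1, "q": 1}
num_fanout = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 2}

af_pcd = area_flow(["p", "c", "d"], af, num_fanout)   # 1 + [1/1 + 0 + 0]
af_abq = area_flow(["a", "b", "q"], af, num_fanout)   # 1 + [0 + 0 + 1/2]
```

Dividing by the fan-out count is what lets the shared node q contribute only half of its area to this cut.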
Area Recovery with Area-flow
1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area-flow
– Done in topological order from PI to PO
– Among all the cuts which do not exceed slack budget, choose cut with smallest area-flow
– Fan-out of a node is estimated from delay-optimal mapping
– We only do one pass
• Saw only marginal improvement on subsequent passes
Exact Area
Exact-area (cut) = 1 + [ Σ exact-area (fan-in with no other fan-out) ]
– Gives minimum area solution within an MFFC
[Figure: AIG over inputs a, b, c, d, e, f with internal nodes s, t, q, p and output X, shown twice with a different cut of X highlighted]
Cut {s, t, q}
Area flow = 1 + [.25 + .25 + 1] = 2.5
Exact area = 1 + 1 = 2 (due to q)
Cut {p, e, f}
Area flow = 1 + [(.25 + .25 + 3)/2] = 2.75
Exact area = 1 + 0 = 1 (p is used elsewhere)
Area flow will choose cut {s, t, q}; exact area will choose cut {p, e, f}.
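One common way to evaluate exact area is to reference a cut in the current mapping, count the LUTs that become used, and then dereference it again. The sketch below follows that idea; it is an assumption about how the computation could be organized, not a transcription of the slides. The names `ref_cut`, `deref_cut`, `best_cut` and `refs`, and the concrete reference counts, are hypothetical values chosen to reproduce the two exact-area numbers above.

```python
def ref_cut(cut, best_cut, refs):
    """Reference a cut; return the number of LUTs that become used because of it."""
    area = 1
    for leaf in cut:
        if leaf in best_cut:                      # internal node, not a primary input
            if refs[leaf] == 0:                   # leaf LUT is not used by anyone else
                area += ref_cut(best_cut[leaf], best_cut, refs)
            refs[leaf] += 1
    return area

def deref_cut(cut, best_cut, refs):
    """Undo ref_cut; return the number of LUTs that become unused."""
    area = 1
    for leaf in cut:
        if leaf in best_cut:
            refs[leaf] -= 1
            if refs[leaf] == 0:
                area += deref_cut(best_cut[leaf], best_cut, refs)
    return area

def exact_area(cut, best_cut, refs):
    area = ref_cut(cut, best_cut, refs)
    deref_cut(cut, best_cut, refs)                # leave the reference counts unchanged
    return area

# Hypothetical state: s, t and p are already used elsewhere in the mapping, q is not.
best_cut = {"s": ("a", "b"), "t": ("c", "d"), "q": ("e", "f"), "p": ("s", "t", "q")}
refs = {"s": 1, "t": 1, "q": 0, "p": 1}
ea_stq = exact_area(("s", "t", "q"), best_cut, refs)
ea_pef = exact_area(("p", "e", "f"), best_cut, refs)
```

Unlike area-flow, the result is an integer LUT count that depends on the reference counts of the current mapping.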
Area Recovery with Exact-area
1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area-flow
4. Do area recovery with exact-area
– Done in topological order from PI to PO
– Among all the cuts which do not exceed slack budget, choose cut with smallest exact-area
– Note: Unlike area-flow, no estimation involved
– We only do one pass
• Saw only marginal improvement on subsequent passes
Priority-Cut Mapping with Area Recovery
Input: And-Inverter Graph
1. Compute all K-feasible cuts for each node
2. Compute arrival time at each node
• In topological order (from PI to PO)
• Compute the depth of all cuts and choose the best one
• Compute at most C good cuts and choose the best one
3. Perform area recovery
• Using area flow
• Using exact local area
• In each iteration, re-compute at most C good cuts and choose the best one
4. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
Area Recovery Summary
• Two-step area recovery
• Area-flow has global view
• Exact area has local view
– Ensures local minimum is reached
• Order in which nodes are processed for both steps is important
• Order of the two passes is important
Experimental Comparison
• Compare area recovery with state-of-the-art academic mapper DAOmap
– DAOmap uses many (~10) different area recovery heuristics
– Some more effective than others
• Just the two heuristics of area-flow and exact-area give better results on their benchmarks
• Also separate comparison with choices obtained from lossless synthesis flow
– Six snapshots of MVSIS script.rugged
– Not the best FPGA optimization script ☺
– Improves both area and delay
Comparison with DAOmap
Columns: Example, then Depth, LUTs and T, s for each of DAOmap, MVSIS-baseline, MVSIS-choices, MVSIS-choices 2x
alu4 6 1065 0.5 6 992 0.34 6 972 0.64 6 949 +0.84
apex2 7 1352 0.6 7 1200 0.36 7 1249 0.95 7 1191 +1.34
apex4 6 931 0.7 6 891 0.24 6 895 0.74 6 894 +1.47
bigkey 3 1245 0.6 3 797 0.34 3 797 0.75 3 684 +1.07
clma 13 5425 5.9 13 4426 1.50 11 3883 4.30 11 3453 +5.20
des 5 965 0.8 5 1024 0.36 5 947 0.93 5 1104 +1.87
diffeq 10 817 0.6 10 844 0.30 9 745 0.46 9 736 +0.43
dsip 3 686 0.5 3 686 0.23 3 685 0.19 3 684 +0.36
elliptic 12 1965 2.0 12 2017 0.61 12 2005 0.72 12 2022 +1.25
ex1010 7 3564 4.0 7 3258 1.15 7 3305 3.39 7 3302 +5.80
ex5p 6 778 1.0 6 744 0.36 5 724 1.17 5 675 +1.40
frisc 16 1999 1.9 15 2009 0.76 14 1875 1.54 13 1867 +1.58
misex3 6 980 0.8 6 957 0.26 6 926 0.73 6 861 +0.94
pdc 7 3222 4.6 8 2920 1.13 7 2738 4.73 7 2692 +5.59
s298 13 1258 2.4 13 826 0.30 12 863 4.07 11 826 +1.49
s38417 9 3815 3.8 9 3864 1.46 8 2989 4.04 7 2729 +2.76
s38584 7 2987 27.0 7 2844 1.11 7 2497 2.58 6 2470 +1.69
seq 6 1188 0.8 6 1109 0.30 5 1136 0.79 6 1016 +1.38
spla 7 2734 4.0 7 2535 1.03 7 2319 4.68 7 2224 +4.79
tseng 10 706 0.6 10 752 0.25 8 719 0.39 8 705 +0.31
Ratio 1.00 1.00 1.00 1.00 0.93 0.37 0.95 0.89 0.95 0.93 0.86 1.46
Outline
1. Review of Cut-based Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Priority Cuts
5. Area Recovery
6. WireMap
Motivation
• Cut-based mapping algorithms do well in minimizing LUT levels and area (LUT count)
– Performance of circuit correlates to LUT levels
– Logic block utilization correlates well to LUT count
• Could we change cut-based mapping to improve netlist for packing, placement, routing?
• Area calculation gives each LUT equal weight, but should this be the case?
Virtex-5 LUT6
[Figure: Virtex-5 LUT6 with inputs A1 to A6 and outputs O6, O5]
V5 LUT6 Details and Packing
[Figure: Virtex-5 LUT6 details: inputs A1 to A6, internal 5-LUTs, outputs O6 and O5]
Can we produce smaller LUTs without increasing LUT levels?
Placement and Routing
• Routing is done for connections between inputs and outputs of a LUT (and other design elements)
• Fewer connections to route should make the design easier to place and route
• Can we come up with a mapping algorithm to minimize the total # of connections in a design?
Motivation Revisited
• Could we use cut-based mapping to improve netlist for clustering, placement, routing?
– Can we come up with a mapping algorithm to minimize the total # of connections in a design?
– Can we produce smaller LUTs without increasing LUT levels?
• Area calculation gives equal weight to all LUTs; should that be the case?
Edge Recovery Overview
Key: Find a simple-to-compute cut metric that minimizes edge counts and creates more small LUTs
$$EF(n) = \mathrm{NumFanin}(n) + \sum_i \frac{EF(\mathrm{Leaf}_i(n))}{\mathrm{NumFanout}(\mathrm{Leaf}_i(n))}$$
• Contrast with the Area Flow equation:
$$AF(n) = 1 + \sum_i \frac{AF(\mathrm{Leaf}_i(n))}{\mathrm{NumFanout}(\mathrm{Leaf}_i(n))}$$
1. Edge flow phase: Use edge flow cost function to minimize global edge counts
2. Exact edge phase: Use optimal algorithm to minimize edge counts within MFFCs
Edge Flow Phase
1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area-flow with one change in how cuts are selected
– Done in topological order from PI to PO
– Among all the cuts which do not exceed slack budget, choose cut with smallest area-flow
– If 2 cuts have the same area-flow then choose the cut with the lower edge-flow
• Edge flow is a tie breaker when area is within epsilon
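A small sketch of the edge-flow metric and of the modified selection rule. The record layout (`arrival`, `area_flow`, `edge_flow`), the function names, the epsilon value and all numbers are assumptions for illustration; the sketch also assumes at least one cut meets the required time.

```python
def edge_flow(leaves, ef, num_fanout):
    """EF(n) = NumFanin(n) + sum over the cut leaves of EF(leaf) / NumFanout(leaf)."""
    return len(leaves) + sum(ef[leaf] / num_fanout[leaf] for leaf in leaves)

def select_cut(cuts, required, eps=1e-3):
    """Smallest area-flow among cuts meeting the required time, edge-flow as tie-breaker."""
    feasible = [c for c in cuts if c["arrival"] <= required]
    best_af = min(c["area_flow"] for c in feasible)
    tied = [c for c in feasible if c["area_flow"] - best_af <= eps]
    return min(tied, key=lambda c: c["edge_flow"])

# Hypothetical values: p and q are 2-input LUTs over primary inputs, q has two fan-outs.
ef = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 2, "q": 2}
num_fanout = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 2}
ef_pcd = edge_flow(["p", "c", "d"], ef, num_fanout)   # 3 + 2/1
ef_abq = edge_flow(["a", "b", "q"], ef, num_fanout)   # 3 + 2/2

cuts = [
    {"name": "A", "arrival": 2, "area_flow": 2.0, "edge_flow": 5.0},
    {"name": "B", "arrival": 2, "area_flow": 2.0, "edge_flow": 4.0},
    {"name": "C", "arrival": 3, "area_flow": 1.0, "edge_flow": 3.0},  # violates slack
]
chosen = select_cut(cuts, required=2)
```

Cut C has the smallest area-flow but exceeds the required time, so the choice falls to A and B, where the lower edge-flow decides.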
Exact Edge Phase
1. Do delay-optimal mapping
2. Compute slack at each node
3. Do edge recovery with edge-flow
4. Do edge recovery with exact edge with one change
– Done in topological order from PI to PO
– Among all the cuts which do not exceed slack budget, choose cut with smallest area, and to break ties choose cuts with lower number of edges
– Note: Unlike edge-flow, no estimation involved
Modified Cut Prioritization Heuristics
• Consider nodes in a topological order
– At each node, merge two sets of fanin cuts (each containing up to C cuts), getting up to (C+1) * (C+1) + 1 cuts
– Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts
– Select one best cut, and use it to map the node
• Sorting criteria

Mapping pass       Primary metric    Tie-breaker 1    Tie-breaker 2
depth              depth             cut size         area flow
area/edge flow     area flow         edge flow        depth
exact area/edge    exact area        exact edge       depth
Experimental Method
• Implemented WireMap using ABC
• Compared against two ABC mapping algorithms
• Baseline – mapping with area recovery
• Mapping with Structure Choices (MSC) – area-recovery mapping with alternative netlists produced by synthesis
• WireMap was built on top of MSC
• Performed packing of single-output LUTs to dual-output LUTs using maximum cardinality matching
• Used VPR to place/route design for wirelength and critical path delays
WireMap Results
• MSC is superior to baseline mapping
– Single-output LUT count reduced by 9.1%
– Edge count reduced by 8.1% and dual-output LUT count reduced by 7.7%, a similar level of reduction as single-output LUT count
• WireMap leads to further reduction in edges by 9.3% and dual-output LUT count by 9.4% versus MSC
– Single-output LUT count only reduced by 1.3% wrt. MSC
• WireMap improvements to edges and dual-output LUTs not directly related to single-output LUT count reduction
WireMap Results - Packing
The histogram data below shows how the single-output LUT size distribution was modified, leading to a 9.4% reduction in dual-output LUT6s
LUT Distribution: MSC vs. WireMap (% of LUTs)
           LT2       LT3       LT4       LT5       LT6
MSC        4.71%     8.00%     15.87%    23.49%    47.93%
WireMap    10.12%    12.66%    17.89%    20.19%    39.14%
WireMap Results – Place and Route
• Wirelength was reduced by 8.5% vs. MSC
• Minimum channel width reduced by 6%.
• Critical path delay reduced by 2.3%.
twl = total wire length, mcw = minimum channel width required to route in VPR
*cpd = critical path delay using the smallest possible channel width across the three implementations
WireMap Summary
• Edge recovery cut-based mapping algorithm that extends area recovery heuristic with an edge cost function
– area flow -> edge flow
– exact area -> exact edge
• Minimizes total # of connections in the design
• Improves packing by increasing frequency of smaller LUTs
Overall Summary
• Cut-based mapping is efficient and flexible
• Lossless Synthesis
– Map over multiple synthesis snapshots
• Priority Cuts
– Limit # of cuts explored
• Runtime and memory scalability
• Without compromising QoR
• Improved area recovery
– Global area-flow and local exact area
– Order of application is important
• WireMap
– Pack/place/route friendly cut-based mapping algorithm
Key Takeaways
• Pay attention to runtime and memory scalability
• Defer choices between alternative implementations to later phases that make better decisions
• Global optimization followed by exact local optimization is effective
• Overcome suboptimal solutions via multiple passes that explore different corners of the optimization space
• Best solutions consider what is done in synthesis, mapping, placement and routing
References
• S. Jang, B. Chan, K. Chung, and A. Mishchenko, "WireMap: FPGA technology mapping for improved routability". Proc. FPGA '08.
• S. Cho, S. Chatterjee, A. Mishchenko, and R. Brayton, "Efficient FPGA mapping using priority cuts". (Poster.) Proc. FPGA '07.
• A. Mishchenko, S. Chatterjee, and R. Brayton, "Improvements to technology mapping for LUT-based FPGAs". IEEE TCAD, Vol. 26(2), Feb 2007, pp. 240-253.
• All publications for ABC: http://www.eecs.berkeley.edu/~alanmi/publications/