
Recent Advances in Cut-based FPGA Technology Mapping

Kevin Chung, April 3, 2009

Preamble

• Logic synthesis and verification research is alive and vibrant
• FPGAs are growing fast – scalability in runtime and memory is paramount

Outline

1. Review of Cut-based Mapping

2. More Efficient Cut Computation

3. Lossless Synthesis

4. Priority Cuts

5. Area Recovery

6. WireMap

Cut-based Mapping Algorithm

Input: And-Inverter Graph

1. Compute all K-feasible cuts

2. Compute best arrival time at each node

• In topological order (from PI to PO)

• Assuming that each cut maps to a K-LUT


• Assuming that each K-LUT has unit delay

3. Choose the best cover

• In reverse topological order (from PO to PI)

Output: Mapped Netlist
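The three steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the ABC implementation: the AIG is a hypothetical dict that maps each AND node to its two fan-ins (listed in topological order), inverters are ignored, and the names `enumerate_cuts` and `map_for_depth` are made up for this example.

```python
from itertools import product

def enumerate_cuts(aig, inputs, k):
    """Step 1: all K-feasible cuts of every node, computed bottom-up."""
    cuts = {pi: {frozenset([pi])} for pi in inputs}
    for node, (f0, f1) in aig.items():          # fan-ins come before fan-outs
        merged = {c0 | c1 for c0, c1 in product(cuts[f0], cuts[f1])}
        cuts[node] = {c for c in merged if len(c) <= k}
        cuts[node].add(frozenset([node]))       # trivial cut
    return cuts

def map_for_depth(aig, inputs, outputs, k):
    cuts = enumerate_cuts(aig, inputs, k)
    arrival = {pi: 0 for pi in inputs}
    best = {}
    # Step 2: best arrival time, PI to PO, one unit of delay per K-LUT.
    for node in aig:
        trivial = frozenset([node])
        candidates = [c for c in cuts[node] if c != trivial]
        best[node] = min(
            candidates,
            key=lambda c: (max(arrival[l] for l in c), len(c), sorted(c)),
        )
        arrival[node] = 1 + max(arrival[l] for l in best[node])
    # Step 3: choose the cover, PO to PI, keeping only the LUTs actually used.
    cover, stack = {}, [o for o in outputs if o in aig]
    while stack:
        node = stack.pop()
        if node not in cover:
            cover[node] = best[node]
            stack.extend(l for l in best[node] if l in aig)
    return arrival, cover

AIG = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
arrival, cover = map_for_depth(AIG, ["a", "b", "c"], ["r"], k=3)
```

With K = 3 the whole example collapses into a single LUT rooted at r with leaves a, b, c, so p and q never appear in the cover.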

Cut-based Mapping Advantages

• Cuts have direct correspondence to LUTs
– Easy to create LUT-based cost functions, for example different LUT input delays or output switching activity
• Cut computation is fast and simple
• Dynamic programming mapping solution
– guarantees optimal delay
– efficient search of LUT design space

Cut-based Mapping Challenges

• Feasible cuts grow quickly with respect to LUT size

K    Avg # of cuts per node
4    8
5    16
6    38
7    95
8    240

• Results depend upon AIG netlist structure
– many possible equivalent AIG structures
– logic restructuring optimizations that work well for one part of the design may not give a good mapping for another

Outline

1. Review of Cut-based Mapping

2. More Efficient Cut Computation

• Cut Dropping

• Cut Dominance

3. Lossless Synthesis

4. Priority Cuts

5. Area Recovery

6. WireMap

Cut Dropping

During bottom-up computation of cuts, the set of cuts of a node can be freed once all its fan-outs have been processed

[Figure: AIG over inputs a, b, c in which p has fan-ins a, b; q has fan-ins b, c; r has fan-ins p, q; cuts are computed bottom-up]

Cuts(p) = { {p}, {a, b} }
Cuts(q) = { {q}, {b, c} } (can delete these cuts once node r is done)
Cuts(r) = { {r}, {p, q}, {p, b, c}, {a, b, q}, {a, b, c} }

• Once the cuts of node r are computed, the cuts of q are no longer needed
• But can’t discard the cuts of node p since not all fan-outs of p have been processed
• Dramatically reduces peak memory consumption on large designs
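Cut dropping amounts to reference counting on fan-outs. The sketch below is an illustration only: the dict-based AIG, the extra node `s` (added so that p has a second fan-out, as the slide implies) and the function name `cuts_with_dropping` are assumptions. A real mapper would save the best cut of a node before freeing its cut set.

```python
from itertools import product

def cuts_with_dropping(aig, inputs, k):
    """Bottom-up cut enumeration that frees a node's cuts after its last fan-out.

    Returns the cut sets still alive at the end and, for every node, the list
    of fan-ins whose cuts were freed right after that node was processed.
    """
    pending = {}                                  # fan-outs not yet processed
    for f0, f1 in aig.values():
        for f in (f0, f1):
            pending[f] = pending.get(f, 0) + 1

    cuts = {pi: {frozenset([pi])} for pi in inputs}
    freed_after = {}
    for node, (f0, f1) in aig.items():            # topological order
        merged = {c0 | c1 for c0, c1 in product(cuts[f0], cuts[f1])}
        cuts[node] = {c for c in merged if len(c) <= k} | {frozenset([node])}
        freed_after[node] = []
        for f in (f0, f1):
            pending[f] -= 1
            if pending[f] == 0:                   # last fan-out just processed
                del cuts[f]
                freed_after[node].append(f)
    return cuts, freed_after

AIG = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q"), "s": ("p", "c")}
alive, freed_after = cuts_with_dropping(AIG, ["a", "b", "c"], k=3)
```

In this run the cuts of q are freed as soon as r is done, while the cuts of p survive until its second fan-out s has been processed.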

Cuts Behaving Badly

Bottom-up cut computation in the presence of re-convergence might produce dominated cuts

[Figure: AIG for x = ~a + a.b + ~b.c over inputs a, b, c with internal nodes d, e and f]

Cuts(f) = { .. {d, b, c} .. {a, b, c} .. }
Cuts(x) = { .. {a, d, b, c} .. {a, b, c} .. }

Cut {a, b, c} dominates cut {a, d, b, c}

• The “good” cut {a, b, c} is there: so not a quality issue
• But the “bad” cut {a, d, b, c} may be propagated further: so a run-time issue
• Want to discard dominated cuts quickly

Signature-based Dominance

Problem: Given two cuts, how to quickly determine whether one is a subset of another

Define signature of a cut:

$$\mathrm{sig}(c) = \sum_{n \in c} 2^{\,\mathrm{ID}(n) \bmod 32}$$

where ID(n) is the integer id of node n (Σ means bit-wise OR)

Observation: If cut c1 dominates cut c2 then

sig(c1) OR sig(c2) = sig(c2)

Cheap test for the common case that a cut does not dominate another. Only if this test passes is an actual set comparison made.

Example

• Let the node ids be a = 1, b = 2, c = 3, d = 4
• Let c1 = {a, b, c} and c2 = {a, d, b, c}
• sig(c1) = 2^1 OR 2^2 OR 2^3
  = 00010 OR 00100 OR 01000
  = 01110
• sig(c2) = 2^1 OR 2^4 OR 2^2 OR 2^3
  = 00010 OR 10000 OR 00100 OR 01000
  = 11110
• As sig(c1) OR sig(c2) ≠ sig(c1), c2 does not dominate c1
• But sig(c1) OR sig(c2) = sig(c2), so c1 may dominate c2
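The signature test can be written directly from the definition. This is a small sketch, assuming cuts are Python sets of node names and `node_id` is a hypothetical dict of integer ids; the function names are illustrative.

```python
def signature(cut, node_id):
    """Bit-wise OR of 2**(ID(n) mod 32) over the nodes of the cut."""
    sig = 0
    for n in cut:
        sig |= 1 << (node_id[n] % 32)
    return sig

def dominates(c1, c2, node_id):
    """True if c1 dominates c2, i.e. every leaf of c1 is also a leaf of c2."""
    s1, s2 = signature(c1, node_id), signature(c2, node_id)
    if s1 | s2 != s2:
        return False                 # cheap reject: c1 has a bit that c2 lacks
    return set(c1) <= set(c2)        # signatures can alias, so confirm exactly

ids = {"a": 1, "b": 2, "c": 3, "d": 4}
c1, c2 = {"a", "b", "c"}, {"a", "d", "b", "c"}
c1_dominates_c2 = dominates(c1, c2, ids)
c2_dominates_c1 = dominates(c2, c1, ids)
```

The exact subset check is still needed after the filter passes because ids that differ by a multiple of 32 share a signature bit.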

Run-time of K-feasible Cut Computation

                  K = 4         K = 5         K = 6         K = 7         K = 8
Name      N       C/N    T, s   C/N    T, s   C/N    T, s   C/N    T, s   C/N    T, s   L/N, %
alu4      2642    6.7    0.00   12.3   0.01   23.1   0.04   45.5   0.18   94.7   1.02   0.00
apex2     2940    7.2    0.01   14.2   0.02   29.2   0.07   62.6   0.32   139.7  1.90   0.00
apex4     2017    8.5    0.00   19.5   0.03   47.0   0.10   116.3  0.62   293.5  4.49   0.10
bigkey    3080    6.6    0.01   12.1   0.02   24.2   0.05   50.1   0.20   99.7   0.84   0.00
clma      11869   8.1    0.04   18.2   0.11   44.4   0.51   114.9  3.01   306.3  20.99  1.64
des       3020    8.0    0.01   17.0   0.03   38.7   0.12   92.0   0.69   218.0  4.80   4.37
diffeq    2566    6.5    0.01   12.3   0.01   26.6   0.07   65.0   0.50   155.9  2.80   3.66
dsip      2521    6.2    0.01   10.7   0.01   20.7   0.03   42.0   0.10   86.7   0.44   0.00
elliptic  5502    6.4    0.01   10.6   0.03   18.5   0.07   36.9   0.33   83.4   2.12   0.20
ex1010    7652    9.2    0.02   23.3   0.11   61.8   0.61   165.8  4.01   438.2  30.43  1.99
ex5p      1719    9.4    0.01   24.1   0.02   66.2   0.17   188.2  1.30   514.8  10.50  14.14
frisc     5905    7.1    0.01   14.4   0.04   32.3   0.16   79.8   0.88   209.0  6.30   1.24
misex3    2441    7.7    0.01   15.7   0.02   33.3   0.08   73.7   0.38   170.7  2.48   0.00
pdc       7527    9.4    0.03   24.8   0.12   67.4   0.68   183.7  4.41   489.4  31.71  4.40
s298      2514    7.9    0.00   17.5   0.02   44.0   0.13   121.9  0.94   346.5  7.10   7.56
s38417    12867   6.6    0.03   13.5   0.10   32.0   0.46   83.1   3.24   225.9  23.72  3.38
s38584    11074   6.1    0.03   11.4   0.06   22.4   0.20   46.7   0.98   101.5  5.81   0.86
seq       2761    7.5    0.00   15.2   0.02   31.7   0.08   68.6   0.37   153.3  2.25   0.04
spla      6556    9.6    0.03   25.8   0.11   73.9   0.69   215.5  4.98   561.4  31.14  13.83
tseng     1920    6.5    0.01   11.8   0.01   23.5   0.04   50.6   0.21   112.7  1.32   1.35
Average   4954.6  7.56   0.01   16.22  0.05   38.05  0.22   95.15  1.38   240.0  9.61   2.94

Peak Memory in MB with Cut Dropping

          K = 4           K = 5           K = 6           K = 7           K = 8
Name      Total   Drop    Total   Drop    Total   Drop    Total   Drop    Total   Drop
clma      2.56    0.10    6.60    0.22    18.09   0.54    52.03   1.47    152.55  4.07
ex1010    1.87    0.37    5.45    0.97    16.25   2.27    48.40   4.68    140.70  8.38
pdc       1.90    0.27    5.69    0.75    17.42   2.00    52.75   4.98    154.56  11.83
s38417    2.28    0.15    5.28    0.37    14.12   1.10    40.80   3.55    121.98  10.25
s38584.1  1.80    0.11    3.86    0.20    8.52    0.40    19.72   0.86    47.15   1.94
spla      1.68    0.21    5.15    0.59    16.63   1.65    53.88   4.34    154.44  10.04
Ratio     1.00    0.11    1.00    0.10    1.00    0.08    1.00    0.07    1.00    0.06

Outline

1. Review of Cut-based Mapping

2. More Efficient Cut Computation

3. Lossless Synthesis

4. Priority Cuts

5. Area Recovery

6. WireMap

Structural Bias

The mapped netlist very closely resembles the subject graph

[Figure: technology mapping of a subject graph over inputs a, b, c, d, e with internal nodes m and p and output f; the mapped netlist keeps the same internal points]

Every input of every LUT in the mapped netlist must be present in the subject graph; otherwise technology mapping will not find the match

The Problem of Structural Bias

Root problem: Best matches for mapping may not be found

[Figure: three views of the function f over inputs a, b, c, d, e; the subject graph contains internal nodes m and p, while the rightmost match needs an internal point q; this match is not found]

Since the point q is not present in the subject graph, the match on the extreme right will not be found

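The Problem of Structural Bias

The match would be found with a different subject graph

[Figure: an equivalent subject graph for f over inputs a, b, c, d, e that contains the internal point q, shown equal (=) to the original graph with internal nodes m and p]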

Traditional Synthesis Flow

[Figure: Boolean network → technology-independent synthesis (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify) → technology-specific mapping → mapped netlist]

• No guarantee of optimality since each synthesis step is heuristic
• Since only the network at the end of technology-independent synthesis is used for mapping, good intermediate netlists are not used
• But structural bias means the mapped netlist depends heavily on the final network

Lossless Synthesis Flow

Idea: Merge intermediate networks into a single network with choices which can be explored during mapping

[Figure: Boolean network → technology-independent synthesis (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify), with intermediate networks combined by a choice operator → technology mapping → mapped netlist]

Technology mapping is not any harder with choices (Lehman-Watanabe ’95, Chen and Cong ’01)

Lossless Synthesis Flow

Can combine results of different technology-independent optimization scripts

[Figure: the Boolean network is processed by two scripts, one that optimizes area (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify) and one that optimizes delay (speed up, reduce depth); their results are combined before technology mapping produces the mapped netlist]

Mapping with Choices

[Figure: lossless synthesis flow from Boolean network through technology-independent synthesis (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify) to technology mapping and the mapped netlist]

Question 1: How to implement an efficient choice operator?

Question 2: How to map quickly with choices?

Detecting Choices

Task: Given two Boolean networks, we need to create a network with choices

Network 1: x = (a + b).c, y = b.c.d
Network 2: x = a.c + b.c, y = b.c.d

Step 1: Make And-Inverter decomposition of networks

[Figure: AIGs of the two networks over inputs a, b, c, d with outputs x and y; dotted edges mean inversion]

Detecting Choices

Step 2: Use combinational equivalence checking to detect functionally equivalent nodes up to complementation (Kuehlmann ’04, …)
– Random simulation to detect possibly equivalent nodes
– SAT-based decision procedure to prove equivalence

Detecting Choices

Step 3: Merge equivalent nodes with choice edges

[Figure: the AIGs of the two networks merged into a single AIG over inputs a, b, c, d with outputs x and y]

x now represents a class of nodes that are functionally equivalent up to complementation
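Grouping nodes that are equivalent up to complementation can be illustrated on the two networks above. The slides use random simulation followed by a SAT check; in this sketch, exhaustive truth-table simulation over the four inputs stands in for both steps, which is only practical for tiny examples. The AIG encoding (node name mapped to two `(fanin, inverted)` pairs) and all node names other than the inputs are assumptions made for the illustration.

```python
def truth_tables(nodes, inputs):
    """Simulate every input pattern at once; bit p of a table is the value under pattern p."""
    n = len(inputs)
    mask = (1 << (1 << n)) - 1
    tt = {}
    for i, name in enumerate(inputs):
        tt[name] = sum(1 << p for p in range(1 << n) if (p >> i) & 1)
    for name, ((f0, inv0), (f1, inv1)) in nodes.items():   # topological order
        v0 = tt[f0] ^ (mask if inv0 else 0)
        v1 = tt[f1] ^ (mask if inv1 else 0)
        tt[name] = v0 & v1
    return tt, mask

def choice_classes(nodes, inputs):
    """Group AND nodes whose functions are equal or complementary."""
    tt, mask = truth_tables(nodes, inputs)
    classes = {}
    for name in nodes:
        key = min(tt[name], tt[name] ^ mask)     # canonical up to complementation
        classes.setdefault(key, []).append(name)
    return [members for members in classes.values() if len(members) > 1]

NODES = {
    "n1": (("a", True), ("b", True)),     # ~(a + b)
    "x1": (("n1", True), ("c", False)),   # Network 1: x = (a + b).c
    "m1": (("a", False), ("c", False)),   # a.c
    "m2": (("b", False), ("c", False)),   # b.c, shared by x and y
    "m3": (("m1", True), ("m2", True)),   # Network 2: ~x = ~(a.c + b.c)
    "y":  (("m2", False), ("d", False)),  # y = b.c.d
}
classes = choice_classes(NODES, ["a", "b", "c", "d"])
```

Here `x1` and `m3` end up in the same class even though `m3` computes the complement of x, which is the kind of class a choice node represents.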



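Mapping without Choices

Input: And-Inverter Graph
1. Compute all K-feasible cuts
2. Compute best arrival time at each node
3. Choose the best cover
Output: Mapped Netlist

Mapping with Choices

Input: And-Inverter Graph with Choices
1. Compute all K-feasible cuts with choices
2. Compute best arrival time at each node
3. Choose the best cover
Output: Mapped Netlist

Only Step 1 requires modification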

Cut Computation with Choices

Cuts are now computed for equivalence classes of nodes

[Figure: AIG over inputs a, b, c, d with internal nodes p, q, r; the class x contains the equivalent nodes x1 and x2, next to output y]

Cuts(x1) = { {x1}, {p, r}, {p, b, c}, {a, c, r}, {a, b, c} }
Cuts(x2) = { {x2}, {q, c}, {a, b, c} }

Cuts(x) = Cuts(x1) ∪ Cuts(x2)
        = { {x1}, {p, r}, {p, b, c}, {a, c, r}, {a, b, c}, {x2}, {q, c} }
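The union above is a plain set union over the members of the class, with the shared cut {a, b, c} counted once. A minimal sketch, assuming cuts are stored as sets of node names and using the hypothetical helper name `class_cuts`:

```python
def class_cuts(member_cuts):
    """Union the cut sets of all members of an equivalence class, keeping first-seen order."""
    seen, result = set(), []
    for cuts in member_cuts:
        for cut in cuts:
            key = frozenset(cut)
            if key not in seen:          # a cut shared by several members is kept once
                seen.add(key)
                result.append(key)
    return result

cuts_x1 = [{"x1"}, {"p", "r"}, {"p", "b", "c"}, {"a", "c", "r"}, {"a", "b", "c"}]
cuts_x2 = [{"x2"}, {"q", "c"}, {"a", "b", "c"}]
cuts_x = class_cuts([cuts_x1, cuts_x2])
```

The fan-outs of the class then merge against `cuts_x` exactly as they would against the cuts of an ordinary node.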



Lossless Synthesis Summary

Also called Mapping with Structure Choices

Advantages
• Equivalent netlist variations are recorded
– mapping algorithm selects the best among alternative structures to optimize a cost function
• Simple extension of mapping algorithm

Disadvantages
• Even more cuts to explore!

Outline

1. Review of Technology Mapping

2. More Efficient Cut Computation

3. Lossless Synthesis

4. Priority Cuts

5. Area Recovery

1. Area-flow

2. Exact Area

6. WireMap

Exhaustive Cut Enumeration Mapping

• Large designs have many K-feasible cuts
– 1M node AIG has ~40M 6-cuts
– Needs ~2GB and ~30 sec for computation
• Past ways of tackling the problem
– Detect and remove dominated cuts: does not help much
– Perform cut pruning (store N cuts/node): throws away useful cuts even if N = 1000
– Store only cuts on the frontier: reduces memory but increases runtime

Priority Cuts: A Bag of Tricks

• Compute and prioritize cuts (select subset of all cuts)
• Fast and memory efficient – affordable for multiple passes
• Potentially lower quality overcome via multiple passes
– Use different sorting criteria in each mapping pass to explore additional cost criteria
– Include the best cut from the previous pass into the set of candidate cuts of the current pass
• Efficient memory management
– Only maintain complete set of priority cuts for nodes on the mapping frontier
– Precompute frontier to create efficiently managed memory pool
– Only save best cut for each node

Computing Priority Cuts

• Consider nodes in a topological order
– At each node, merge two sets of fanin cuts (each containing up to C cuts) getting (C+1) * (C+1) + 1 cuts
– Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts
– Select one best cut, and use it to map the node
• Sorting criteria

Mapping pass   Primary metric   Tie-breaker 1   Tie-breaker 2
depth          depth            cut size        area flow
area flow      area flow        fanin refs      depth
exact area     exact area       fanin refs      depth
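The depth-oriented pass can be sketched as follows. This is an illustration under assumptions, not the ABC code: the AIG is a hypothetical dict of two-input AND nodes in topological order, inverters are ignored, and the second tie-breaker (area flow) is replaced by a lexicographic order on leaf names purely to make the result deterministic.

```python
from itertools import product

def priority_cuts(aig, inputs, k, c_max):
    """Depth pass: keep at most c_max cuts per node, plus the trivial cut."""
    arrival = {pi: 0 for pi in inputs}
    stored = {pi: [frozenset([pi])] for pi in inputs}
    best = {}

    def depth(cut):
        return 1 + max(arrival[leaf] for leaf in cut)

    for node, (f0, f1) in aig.items():
        # Up to (C+1) * (C+1) merged cuts from the two fanin cut sets.
        merged = {c0 | c1 for c0, c1 in product(stored[f0], stored[f1])}
        feasible = [c for c in merged if len(c) <= k]
        # Primary metric: depth. Tie-breaker 1: cut size.
        feasible.sort(key=lambda c: (depth(c), len(c), sorted(c)))
        kept = feasible[:c_max]                    # the C best cuts
        best[node] = kept[0]                       # cut used to map the node
        arrival[node] = depth(kept[0])
        stored[node] = kept + [frozenset([node])]  # trivial cut for the fanouts
    return best, arrival, stored

AIG = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
best, arrival, stored = priority_cuts(AIG, ["a", "b", "c"], k=3, c_max=2)
```

With C = 2, node r stores only {a, b, c} and {p, q} (plus its trivial cut) instead of all four non-trivial cuts, yet the depth-optimal cut is still found.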

Priority-Cut-Based Mapping

1. Exhaustive mapping only: compute all K-feasible cuts for each node
2. Compute arrival time at each node
• Exhaustive mapping: compute the depth of all cuts and choose the best one
• Priority-cut mapping: compute at most C good cuts and choose the best one





Complexity Analysis

• Traditional mapping algorithms
– FlowMap O(Kmn) (J. Cong et al., TCAD ’94)
– CutMap O(2KmnK) (J. Cong et al., FPGA ’95)
– DAOmap O(Kn^K) (J. Cong et al., ICCAD ’04)
• Proposed mapping algorithm
– O(KC^2 n)
– 6-LUT mapping has about 5X speedup
– 8-LUT mapping has up to 100X speedup

K is max cut size, C is max number of cuts, n is number of nodes, m is number of edges

C between 8 and 16 achieves optimal delay with good runtime

Outline

1. Review of Technology Mapping

2. More Efficient Cut Computation

3. Lossless Synthesis

4. Priority Cuts

5. Area Recovery

1. Area-flow

2. Exact Area

6. WireMap

Overview of Area Recovery

• Initial mapping is delay oriented
– Gets best delay for all paths
– Area-based tie-breaking
• Not all paths are critical
– Area recovery tries to slow down non-critical paths to reduce area
– Each node with positive slack: choose a different cut that reduces area
– Done as subsequent passes after delay-oriented mapping
• Question: how to measure area?

How to Measure Area?

Naïve definition: Area (cut) = 1 + [ Σ area (fan-in) ]

[Figure: network over inputs a, b, c, d, e, f with internal nodes p, q, r and outputs x and y, shown once for each candidate cut]

Area of cut {p, c, d} = 1 + [1 + 0 + 0] = 2

Area of cut {a, b, q} = 1 + [0 + 0 + 1] = 2

• Naïve definition says both cuts are equally good in area
• Naïve definition ignores sharing due to multiple fan-outs

Area-flow

$$AF(n) = 1 + \sum_i \frac{AF(\mathrm{Leaf}_i(n))}{\mathrm{NumFanout}(\mathrm{Leaf}_i(n))}$$

[Figure: network over inputs a, b, c, d, e, f with internal nodes p, q, r and outputs x and y, shown once for each candidate cut]

Area-flow of cut {p, c, d} = 1 + [1 + 0 + 0] = 2

Area-flow of cut {a, b, q} = 1 + [0/1 + 0/1 + ½] = 1.5

• Area-flow recognizes that cut {a, b, q} is better
• Area-flow “correctly” accounts for sharing and penalizes replication
• It is a floating point value!
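The two area-flow values can be reproduced with a few lines of Python. The dictionaries below are assumptions that encode what the slide's arithmetic implies: primary inputs have area-flow 0, p and q each have area-flow 1, and q has two fan-outs while p has one.

```python
def area_flow(cut, af, num_fanout):
    """AF of a cut: one LUT plus each leaf's area-flow divided by its fan-out count."""
    return 1 + sum(af[leaf] / max(1, num_fanout[leaf]) for leaf in cut)

# Area-flow of the leaves: primary inputs cost nothing, p and q are one LUT each.
af = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0, "p": 1.0, "q": 1.0}
# q feeds two nodes, so only half of its area is charged to each of them.
num_fanout = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 2}

af_pcd = area_flow({"p", "c", "d"}, af, num_fanout)
af_abq = area_flow({"a", "b", "q"}, af, num_fanout)
```

Dividing by the fan-out count is what makes the cut through the shared node q look cheaper than the cut through p.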

Area Recovery with Area-flow

1. Do delay-optimal mapping

2. Compute slack at each node

3. Do area recovery with area-flow

– Done in topological order from PI to PO

– Among all the cuts which do not exceed slack budget

choose cut with smallest area-flow

– Fan-out of a node is estimated from delay optimal

mapping

– We only do one pass

• Saw only marginal improvement on subsequent passes

Exact Area

Exact-area (cut) = 1 + [ Σ exact-area (fan-in with no other fan-out) ]
– Gives minimum area solution within an MFFC

[Figure: network over inputs a, b, c, d, e, f with internal nodes s, t, q, p and output X, shown once for each candidate cut]

Cut {s, t, q}
– Area flow = 1 + [.25 + .25 + 1] = 2.5
– Exact area = 1 + 1 = 2 (due to q)
– Area flow will choose this cut

Cut {p, e, f}
– Area flow = 1 + [(.25 + .25 + 3)/2] = 2.75
– Exact area = 1 + 0 = 1 (p is used elsewhere)
– Exact area will choose this cut
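One common way to evaluate exact area is to reference the cut in the current mapping, count how many LUTs that adds, and then undo the change. The sketch below follows that idea; the reference counts and best cuts are hypothetical values chosen to be consistent with the slide's numbers (s, t and p are already used elsewhere, q is not), and the function names are illustrative.

```python
def ref_cut(cut, best_cut, refs):
    """Add the cut to the mapping; return the number of LUTs this introduces."""
    area = 1
    for leaf in cut:
        refs[leaf] = refs.get(leaf, 0) + 1
        if refs[leaf] == 1 and leaf in best_cut:     # leaf was unused until now
            area += ref_cut(best_cut[leaf], best_cut, refs)
    return area

def deref_cut(cut, best_cut, refs):
    """Remove the cut from the mapping; return the number of LUTs this frees."""
    area = 1
    for leaf in cut:
        refs[leaf] -= 1
        if refs[leaf] == 0 and leaf in best_cut:     # leaf is no longer needed
            area += deref_cut(best_cut[leaf], best_cut, refs)
    return area

def exact_area(cut, best_cut, refs):
    area = ref_cut(cut, best_cut, refs)
    deref_cut(cut, best_cut, refs)                   # restore the reference counts
    return area

# Hypothetical mapping state: best cut of each internal node and current references.
best_cut = {"s": {"a", "b"}, "t": {"b", "c"}, "q": {"c", "d"}, "p": {"s", "t", "q"}}
refs = {"s": 4, "t": 4, "p": 1, "q": 0}

area_stq = exact_area({"s", "t", "q"}, best_cut, refs)
area_pef = exact_area({"p", "e", "f"}, best_cut, refs)
```

Choosing {s, t, q} would force a new LUT for q, while {p, e, f} reuses a LUT that already exists, so no estimate of fan-out is involved.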

Area Recovery with Exact-area

1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area-flow
4. Do area recovery with exact-area
– Among all the cuts which do not exceed slack budget choose cut with smallest exact-area
– Note: Unlike area-flow, no estimation involved
– We only do one pass
• Saw only marginal improvement on subsequent passes

Priority-Cut Mapping with Area Recovery

1. Exhaustive mapping only: compute all K-feasible cuts for each node
2. Compute arrival time at each node
• Exhaustive mapping: compute the depth of all cuts and choose the best one
• Priority-cut mapping: compute at most C good cuts and choose the best one
3. Perform area recovery
• Using area flow
• Using exact local area
• In each iteration, re-compute at most C good cuts and choose the best one




Area Recovery Summary

• Two-step area recovery
• Area-flow has global view
• Exact area has local view
– Ensures local minimum is reached
• Order in which nodes are processed for both steps is important
• Order of the two passes is important

Experimental Comparison

• Compare area recovery with state-of-the-art academic mapper DAOmap
– DAOmap uses many (~10) different area recovery heuristics
– Some more effective than others
• Just the two heuristics of area-flow and exact-area give better results on their benchmarks
• Also separate comparison with choices obtained from lossless synthesis flow
– Six snapshots of MVSIS script.rugged
– Not the best FPGA optimization script ☺
– Improves both area and delay

Comparison with DAOmap

          DAOmap               MVSIS-baseline       MVSIS-choices        MVSIS-choices 2x
Example   Depth  LUTs  T, s    Depth  LUTs  T, s    Depth  LUTs  T, s    Depth  LUTs  T, s
alu4      6      1065  0.5     6      992   0.34    6      972   0.64    6      949   +0.84
apex2     7      1352  0.6     7      1200  0.36    7      1249  0.95    7      1191  +1.34
apex4     6      931   0.7     6      891   0.24    6      895   0.74    6      894   +1.47
bigkey    3      1245  0.6     3      797   0.34    3      797   0.75    3      684   +1.07
clma      13     5425  5.9     13     4426  1.50    11     3883  4.30    11     3453  +5.20
des       5      965   0.8     5      1024  0.36    5      947   0.93    5      1104  +1.87
diffeq    10     817   0.6     10     844   0.30    9      745   0.46    9      736   +0.43
dsip      3      686   0.5     3      686   0.23    3      685   0.19    3      684   +0.36
elliptic  12     1965  2.0     12     2017  0.61    12     2005  0.72    12     2022  +1.25
ex1010    7      3564  4.0     7      3258  1.15    7      3305  3.39    7      3302  +5.80
ex5p      6      778   1.0     6      744   0.36    5      724   1.17    5      675   +1.40
frisc     16     1999  1.9     15     2009  0.76    14     1875  1.54    13     1867  +1.58
misex3    6      980   0.8     6      957   0.26    6      926   0.73    6      861   +0.94
pdc       7      3222  4.6     8      2920  1.13    7      2738  4.73    7      2692  +5.59
s298      13     1258  2.4     13     826   0.30    12     863   4.07    11     826   +1.49
s38417    9      3815  3.8     9      3864  1.46    8      2989  4.04    7      2729  +2.76
s38584    7      2987  27.0    7      2844  1.11    7      2497  2.58    6      2470  +1.69
seq       6      1188  0.8     6      1109  0.30    5      1136  0.79    6      1016  +1.38
spla      7      2734  4.0     7      2535  1.03    7      2319  4.68    7      2224  +4.79
tseng     10     706   0.6     10     752   0.25    8      719   0.39    8      705   +0.31
Ratio     1.00   1.00  1.00    1.00   0.93  0.37    0.95   0.89  0.95    0.93   0.86  1.46

Outline

1. Review of Cut-based Mapping

2. More Efficient Cut Computation

3. Lossless Synthesis

4. Priority Cuts

5. Area Recovery

6. WireMap

Motivation

• Cut-based mapping algorithms do well in minimizing LUT levels and area (LUT count)
– Performance of circuit correlates to LUT levels
– Logic block utilization correlates well to LUT count
• Could we change cut-based mapping to improve the netlist for packing, placement, routing?
• Area calculation gives each LUT equal weight – but should this be the case?

Virtex-5 LUT6

[Figure: LUT6 block with inputs A1 to A6 and outputs O6 and O5]

V5 LUT6 Details and Packing

[Figure: internal structure of the LUT6 with inputs A1 to A6: two 5-LUTs, labelled O6LUT and O5LUT, driving outputs O6 and O5]

Can we produce smaller LUTs without increasing LUT levels?

Placement and Routing

• Routing is done for connections between inputs and outputs of a LUT (and other design elements)
• Fewer connections to route should make the design easier to place and route
• Can we come up with a mapping algorithm to minimize the total # of connections in a design?

Motivation Revisited

• Could we use cut-based mapping to improve the netlist for clustering, placement, routing?
– Can we come up with a mapping algorithm to minimize the total # of connections in a design?
– Can we produce smaller LUTs without increasing LUT levels?
• Area calculation gives equal weight to all LUTs – should that be the case?

Edge Recovery Overview

Key: Find a simple-to-compute cut metric that minimizes edge counts and creates more small LUTs

$$EF(n) = \mathrm{NumFanin}(n) + \sum_i \frac{EF(\mathrm{Leaf}_i(n))}{\mathrm{NumFanout}(\mathrm{Leaf}_i(n))}$$

• Contrast with Area Flow eqn:

$$AF(n) = 1 + \sum_i \frac{AF(\mathrm{Leaf}_i(n))}{\mathrm{NumFanout}(\mathrm{Leaf}_i(n))}$$

1. Edge flow phase: Use edge flow cost function to minimize global edge counts
2. Exact edge phase: Use optimal algorithm to minimize edge counts within MFFCs
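Both flow metrics have the same shape, so they can be computed side by side and used together when ranking cuts. The sketch below is illustrative only: the leaf values, the helper names `flows` and `pick_cut`, and the tolerance `eps` are assumptions, and primary inputs are taken to have area-flow and edge-flow 0.

```python
def flows(cut, af, ef, fanout):
    """Return (area flow, edge flow) of a cut from the flows of its leaves."""
    def shared(table, leaf):
        return table.get(leaf, 0.0) / max(1, fanout.get(leaf, 1))

    area_flow = 1 + sum(shared(af, leaf) for leaf in cut)
    edge_flow = len(cut) + sum(shared(ef, leaf) for leaf in cut)   # NumFanin = cut size
    return area_flow, edge_flow

def pick_cut(candidates, af, ef, fanout, eps=1e-3):
    """Smallest area flow wins; edge flow breaks ties among cuts within eps of the best."""
    scored = [(flows(cut, af, ef, fanout), cut) for cut in candidates]
    best_af = min(score[0] for score, _ in scored)
    near = [(score, cut) for score, cut in scored if score[0] - best_af <= eps]
    return min(near, key=lambda item: item[0][1])[1]

# Hypothetical leaves: p and q both cost one LUT, but q's LUT has more inputs.
af = {"p": 1.0, "q": 1.0}
ef = {"p": 2.0, "q": 3.0}
fanout = {"p": 1, "q": 1}

cut_q = frozenset({"a", "b", "q"})
cut_p = frozenset({"a", "b", "p"})
chosen = pick_cut([cut_q, cut_p], af, ef, fanout)
```

Both cuts have the same area flow, so the one with fewer total edges is selected.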

Edge Flow Phase

1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area-flow with one change in how cuts are selected
– Done in topological order from PI to PO
– Among all the cuts which do not exceed slack budget choose cut with smallest area-flow
– If 2 cuts have the same area-flow then choose the cut with the lower edge-flow
• Edge flow is a tie breaker when area is within epsilon

Exact Edge Phase

1. Do delay-optimal mapping
2. Compute slack at each node
3. Do edge recovery with edge-flow
4. Do edge recovery with exact edge with one change
– Among all the cuts which do not exceed slack budget choose cut with smallest area, and to break ties choose cuts with lower number of edges
– Note: Unlike edge-flow, no estimation involved

Modified Cut Prioritization Heuristics

• Consider nodes in a topological order
– At each node, merge two sets of fanin cuts (each containing C cuts) getting (C+1) * (C+1) + 1 cuts
– Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts
• Sorting criteria

Mapping pass      Primary metric   Tie-breaker 1   Tie-breaker 2
depth             depth            cut size        area flow
area/edge flow    area flow        edge flow       depth
exact area/edge   exact area       exact edge      depth

Experimental Method

• Implemented WireMap using ABC
• Compared against two ABC mapping algorithms
– Baseline: mapping with area recovery
– Mapping with Structure Choices (MSC): area-recovery mapping with alternative netlists produced by synthesis
• WireMap was built on top of MSC
• Performed packing of single-output LUTs to dual-output LUTs using maximum cardinality matching
• Used VPR to place/route design for wirelength and critical path delays

WireMap Results

• MSC is superior to baseline mapping
– Single-output LUT count reduced by 9.1%
– Edge count reduced by 8.1% and dual-output LUT count reduced by 7.7%, a similar level of reduction as single-output LUT count
• WireMap leads to further reduction in edges by 9.3% and dual-output LUT count by 9.4% versus MSC
– Single-output LUT count only reduced by 1.3% with respect to MSC
• WireMap improvements to edges and dual-output LUTs are not directly related to single-output LUT count reduction

WireMap Results - Packing

The histogram below shows how the single-output LUT size distribution was modified, leading to a 9.4% reduction in dual-output LUT6s

LUT Distribution: MSC vs. WireMap (% of LUTs)

LUT size   MSC      WireMap
LT2        4.71%    10.12%
LT3        8.00%    12.66%
LT4        15.87%   17.89%
LT5        23.49%   20.19%
LT6        47.93%   39.14%

WireMap Results – Place and Route

• Wirelength was reduced by 8.5% vs. MSC
• Minimum channel width reduced by 6%
• Critical path delay reduced by 2.3%

twl = total wire length, mcw = minimum channel width required to route in VPR

*cpd = critical path delay using the smallest possible channel width across the three implementations

WireMap Summary

• Edge recovery cut-based mapping algorithm that extends the area recovery heuristic with an edge cost function
– area flow -> edge flow
– exact area -> exact edge
• Minimizes total # of connections in the design
• Improves packing by increasing frequency of smaller LUTs

Overall Summary

• Cut-based mapping is efficient and flexible
• Lossless Synthesis
– Map over multiple synthesis snapshots
• Priority Cuts
– Limit # of cuts explored
– Runtime and memory scalability
– Without compromising QoR
• Improved area recovery
– Global area-flow and local exact area
– Order of application is important
• WireMap
– Pack/place/route friendly cut-based mapping algorithm

Key Takeaways

• Pay attention to runtime and memory scalability
• Defer choices between alternative implementations to later phases that make better decisions
• Global optimization followed by exact local optimization is effective
• Overcome suboptimal solutions via multiple passes that explore different corners of the optimization space
• Best solutions consider what is done in synthesis, mapping, placement and routing

References

• S. Jang, B. Chan, K. Chung, and A. Mishchenko, "WireMap: FPGA technology mapping for improved routability". Proc. FPGA '08.
• S. Cho, S. Chatterjee, A. Mishchenko, and R. Brayton, "Efficient FPGA mapping using priority cuts". (Poster.) Proc. FPGA '07.
• A. Mishchenko, S. Chatterjee, and R. Brayton, "Improvements to technology mapping for LUT-based FPGAs". IEEE TCAD, Vol. 26(2), Feb 2007, pp. 240-253.
• All publications for ABC: http://www.eecs.berkeley.edu/~alanmi/publications/