Outline - University of California, Los Angeles (cadlab.cs.ucla.edu/~cong/CS258F/chap3.pdf)
CS258F 2002F
Prof. J. Cong 1
Jason Cong 1
Jason Cong 2
Outline
• Circuit partitioning formulation
• Importance of circuit partitioning
• Partitioning algorithms
• Circuit clustering formulation
• Clustering algorithms
Jason Cong 3
Circuit Partitioning Formulation
Bi-partitioning formulation: minimize interconnections between partitions
• Minimum cut: min c(X, X')
• Minimum bisection: min c(X, X') with |X| = |X'|
• Minimum ratio-cut: min c(X, X') / (|X|·|X'|)
[Figure: partitions X and X' with cut c(X, X')]
Jason Cong 4
A Bi-Partitioning Example
[Figure: six modules a-f joined by heavy (weight-100) intra-cluster edges and light inter-cluster edges; three cuts are marked: min-cut (size 13), min-bisection (size 300), min-ratio-cut (size 19)]
Ratio-cut helps to identify natural clusters
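The three metrics can be checked by brute force on a toy graph. A minimal sketch (the six-node instance below is illustrative, with weights chosen to mimic two heavy clusters joined by light edges; it is not the slide's exact figure):

```python
from itertools import combinations

def cut_cost(edges, part):
    # total weight of edges with exactly one endpoint in `part`
    return sum(w for (u, v, w) in edges if (u in part) != (v in part))

def best_partitions(nodes, edges):
    """Enumerate all bipartitions of a small graph and report the
    minimizers of cut, bisection cut, and ratio cut."""
    best = {"cut": None, "bisection": None, "ratio": None}
    n = len(nodes)
    for k in range(1, n):                      # size of side X
        for part in combinations(nodes, k):
            X = set(part)
            c = cut_cost(edges, X)
            if best["cut"] is None or c < best["cut"][0]:
                best["cut"] = (c, X)
            if 2 * k == n and (best["bisection"] is None or c < best["bisection"][0]):
                best["bisection"] = (c, X)
            r = c / (k * (n - k))              # ratio cut: c(X,X') / (|X||X'|)
            if best["ratio"] is None or r < best["ratio"][0]:
                best["ratio"] = (r, X)
    return best

# two weight-100 triangles joined by lighter bridge edges
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a","b",100), ("a","c",100), ("b","c",100),
         ("d","e",100), ("d","f",100), ("e","f",100),
         ("c","d",9), ("b","e",10)]
res = best_partitions(nodes, edges)
```

Here all three metrics agree on the natural split {a, b, c} | {d, e, f}; with different weights (as on the slide) the three minimizers can differ.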
Jason Cong 5
Circuit Partitioning Formulation (Cont’d)
General multi-way partitioning formulation:
Partition a network N into N1, N2, ..., Nk such that
• each partition satisfies an area constraint:
    ∑_{v ∈ Ni} a(v) ≤ Ai
• each partition satisfies an I/O constraint:
    c(Ni, N - Ni) ≤ Ii
Minimize the total interconnection:
    ∑_i c(Ni, N - Ni)
Jason Cong 6
Importance of Circuit Partitioning
• Divide-and-conquer methodology
    The most effective way to solve problems of high complexity
    E.g.: min-cut based placement, partitioning-based test generation, ...
• System-level partitioning for multi-chip designs
    Inter-chip interconnection delay dominates system performance
• Circuit emulation/parallel simulation
    Partition large circuit into multiple FPGAs (e.g. Quickturn), or multiple special-purpose processors (e.g. Zycad)
• Parallel CAD development
    Task decomposition and load balancing
• In deep-submicron designs, partitioning defines local and global interconnect, and has significant impact on circuit performance
Jason Cong 7
Partitioning Algorithms
• Iterative partitioning algorithms
• Spectral based partitioning algorithms
• Net partitioning vs. module partitioning
• Multi-way partitioning
• Multi-level partitioning (to be discussed after clustering)
• Further study in partitioning techniques
Jason Cong 8
Iterative Partitioning Algorithms
• Greedy iterative improvement method
    [Kernighan-Lin 1970]
    [Fiduccia-Mattheyses 1982]
    [Krishnamurthy 1984]
• Simulated annealing
    [Kirkpatrick-Gelatt-Vecchi 1983]
    [Greene-Supowit 1984]
    (See Appendix; SA will be formally introduced in the Floorplan chapter)
Jason Cong 9
Kernighan-Lin's Algorithm
• Pair-wise exchange of nodes to reduce cut size
• Allow cut size to increase temporarily within a pass
    Compute the gain of a swap
    Repeat
        Perform a feasible swap of max gain
        Mark swapped nodes "locked"
        Update swap gains
    Until no feasible swap
    Find max prefix partial sum in the gain sequence g1, g2, ..., gm
    Make the corresponding swaps permanent
• Start another pass if the current pass reduces the cut size (usually converges after a few passes)
[Figure: u and v are swapped across the cut and become locked]
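The pass structure above can be sketched directly. A minimal, unoptimized sketch of one pass on an unweighted graph (function and variable names are mine; gains are recomputed from scratch, so this is the slow O(n^3) form, before the FM speedups on the next slide):

```python
def kl_pass(adj, A0, B0):
    """One Kernighan-Lin pass. adj: node -> set of neighbors;
    A0, B0: the two sides of the initial bipartition."""
    A, B = set(A0), set(B0)
    locked, gains, swaps = set(), [], []
    def D(v, own, other):
        # external cost minus internal cost of v
        return sum(u in other for u in adj[v]) - sum(u in own for u in adj[v])
    for _ in range(min(len(A), len(B))):
        # gain of swapping a and b: D(a) + D(b) - 2*(a,b adjacent)
        cand = [(D(a, A, B) + D(b, B, A) - 2 * (b in adj[a]), a, b)
                for a in A - locked for b in B - locked]
        g, a, b = max(cand)
        A.remove(a); B.remove(b); A.add(b); B.add(a)   # tentative swap
        locked |= {a, b}
        gains.append(g); swaps.append((a, b))
    # keep the prefix of swaps with the largest positive partial gain sum
    prefix, best, total = 0, 0, 0
    for i, g in enumerate(gains, 1):
        total += g
        if total > best:
            best, prefix = total, i
    A, B = set(A0), set(B0)
    for a, b in swaps[:prefix]:                        # make prefix permanent
        A.remove(a); B.remove(b); A.add(b); B.add(a)
    return A, B, best
```

On two triangles bridged by one edge, a single pass starting from a poor bipartition recovers the natural split with total gain 4.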
Jason Cong 10
Fiduccia-Mattheyses' Improvement
• Each pass in the KL algorithm takes O(n^3) or O(n^2 log n) time (n: #modules)
    Choosing the swap with max gain and updating swap gains take O(n^2) time
• The FM algorithm takes O(p) time per pass (p: #pins)
• Key ideas in the FM algorithm:
    - Each move affects only a few moves
      => constant time gain updating per move (amortized)
    - Maintain a list of gain buckets
      => constant time selection of the move with max gain
• Further improvement by Krishnamurthy: look-ahead in gain computation
[Figure: moving u1 from V1 to V2; gain buckets indexed from gmax down to -gmax]
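The gain-bucket idea can be sketched as a small data structure (a simplified sketch, not the exact FM implementation; names are mine). The key invariant is that between pops the max-gain pointer only moves downward, which is where the amortized O(1) selection comes from:

```python
from collections import defaultdict

class GainBuckets:
    """Bucket array keyed by gain, as in Fiduccia-Mattheyses: selecting a
    max-gain free cell and updating a cell's gain are O(1) amortized."""
    def __init__(self, pmax):
        self.pmax = pmax                  # gains lie in [-pmax, pmax]
        self.buckets = defaultdict(set)   # gain -> set of free cells
        self.gain = {}                    # cell -> current gain
        self.max_gain = -pmax
    def insert(self, cell, g):
        self.buckets[g].add(cell)
        self.gain[cell] = g
        self.max_gain = max(self.max_gain, g)
    def update(self, cell, delta):
        # O(1): move the cell between two buckets
        g = self.gain[cell]
        self.buckets[g].discard(cell)
        self.insert(cell, g + delta)
    def pop_max(self):
        # pointer only moves down between pops; raises KeyError if empty
        while not self.buckets[self.max_gain] and self.max_gain > -self.pmax:
            self.max_gain -= 1
        cell = self.buckets[self.max_gain].pop()
        del self.gain[cell]
        return cell, self.max_gain
```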
Jason Cong 11
Partitioning Algorithms
• Iterative partitioning algorithms
• Spectral based partitioning algorithms
• Net partitioning vs. module partitioning
• Multi-way partitioning
• Further study in partitioning techniques
Jason Cong 12
Spectral Based Partitioning Algorithms
[Figure: example graph on nodes a, b, c, d with weighted edges; its adjacency matrix A and diagonal degree matrix D are shown]
D: degree matrix; A: adjacency matrix; D-A: Laplacian matrix
The eigenvalues and eigenvectors of D-A form the Laplacian spectrum of G
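A minimal sketch of building D, A, and Q = D - A in pure Python, together with the quadratic form x^T Q x used on the following slides (function names are mine):

```python
def laplacian(n, weighted_edges):
    """Build degree matrix D, adjacency matrix A, and Q = D - A
    for an undirected graph on nodes 0..n-1, as nested lists."""
    A = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        A[i][j] += w
        A[j][i] += w
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = sum(A[i])               # weighted degree of node i
    Q = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return D, A, Q

def quad_form(Q, x):
    # x^T Q x = sum_{i,j} q_ij x_i x_j
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
```

For a partition vector x with entries +1/-1, x^T Q x equals 4 times the cut weight, which is exactly the identity derived two slides ahead.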
Jason Cong 13
Some Applications of Laplacian Spectrum
• Placement and floorplan [Hall 1970] [Otten 1982] [Frankle-Karp 1986] [Tsay-Kuh 1986]
• Bisection lower bound and computation [Donath-Hoffman 1973] [Barnes 1982] [Boppana 1987]
• Ratio-cut lower bound and computation [Hagen-Kahng 1991] [Cong-Hagen-Kahng 1992]
Jason Cong 14
Eigenvalues and Eigenvectors
If Ax = λx, then λ is an eigenvalue of A, and x is an eigenvector of A w.r.t. λ
(note that Kx is also an eigenvector, for any nonzero constant K)

    Ax = [ a11·x1 + a12·x2 + ... + a1n·xn ]
         [ ...                           ]
         [ an1·x1 + an2·x2 + ... + ann·xn ]
Jason Cong 15
A Basic Property
Expanding the matrix-vector product row by row:

    x^T A x = ∑_{i,j} a_ij · x_i · x_j
Jason Cong 16
Basic Idea of Laplacian Spectrum Based Graph Partitioning
• Given a bisection (X, X'), define a partitioning vector
      x = (x1, x2, ..., xn)  s.t.  xi = 1 if i ∈ X,  xi = -1 if i ∈ X'
  Clearly x ⊥ 1 and x ≠ 0, where 1 = (1, 1, ..., 1) and 0 = (0, 0, ..., 0)
• For a partition vector x:
      ∑_{(i,j)∈E} (xi - xj)^2 = 4 · C(X, X')
• Let S = {x | x ⊥ 1 and x ≠ 0}. Finding the best partition vector x minimizing the total edge cut C(X, X') is relaxed to finding x in S such that ∑_{(i,j)∈E} (xi - xj)^2 is minimum
• Linear placement interpretation: place module i at coordinate xi and minimize the total squared wirelength
[Figure: modules placed on a line at positions x1, x2, x3, ..., xn]
Jason Cong 17
Property of Laplacian Matrix
If x is a partition vector:

    x^T Q x = x^T D x - x^T A x
            = ∑_i d_i x_i^2 - ∑_{i,j} a_ij x_i x_j
            = ∑_{(i,j)∈E} a_ij (x_i - x_j)^2        ( 1 )
            = 4 · C(X, X')

(this is the squared wirelength of the linear placement x1, x2, ..., xn)
Therefore, we want to minimize x^T Q x
Jason Cong 18
Property of Laplacian Matrix (Cont'd)
( 2 ) Q is symmetric and positive semi-definite, i.e.
    (i)  x^T Q x = ∑_{(i,j)∈E} a_ij (x_i - x_j)^2 ≥ 0
    (ii) all eigenvalues of Q are ≥ 0
( 3 ) The smallest eigenvalue of Q is 0; the corresponding eigenvector is x0 = (1, 1, ..., 1)
    (not interesting: all modules overlap, and x0 ∉ S)
( 4 ) By the Courant-Fischer minimax principle, the 2nd smallest eigenvalue satisfies:
    λ = min_{x ∈ S} (x^T Q x) / |x|^2
Jason Cong 19
Results on Spectral Based Graph Partitioning
• Min bisection cost: c(X, X') ≥ n·λ/4
• Min ratio-cut cost: c(X, X') / (|X|·|X'|) ≥ λ/n  (homework)
• The eigenvector of the second smallest eigenvalue gives the best linear placement
• Compute the best bisection or ratio-cut based on the second smallest eigenvector
Jason Cong 20
Computing the Eigenvector
• Only need one eigenvector (the second smallest one)
• Q is symmetric and sparse
• Use the block Lanczos algorithm
Jason Cong 21
Interpreting the Eigenvector
• Linear ordering of modules by eigenvector value
• Try all splitting points, choose the best one using the bisection or ratio-cut metric
[Figure: modules in eigenvector order (e.g. 23, 10, 22, 34, 6, 21, 8) with candidate splitting points]
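The split-point sweep can be sketched as follows (a non-incremental O(n·m) scan for clarity; an incremental cut update as each node crosses the split line would bring it to O(n + m); names are mine):

```python
def sweep_ratio_cut(order, edges):
    """Given a linear ordering of the modules (e.g. by eigenvector value),
    try every splitting point and return the one minimizing the ratio cut
    c(X, X') / (|X| * |X'|)."""
    pos = {v: i for i, v in enumerate(order)}
    n = len(order)
    best = None
    for k in range(1, n):                  # X = first k modules of the ordering
        cut = sum(w for u, v, w in edges if (pos[u] < k) != (pos[v] < k))
        ratio = cut / (k * (n - k))
        if best is None or ratio < best[0]:
            best = (ratio, k)
    return best                            # (best ratio, split position)
```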
Jason Cong 22
Experiments on Special Graphs
• GBUI(2n, d, b) [Bui-Chaudhuri-Leighton 1987]
    2n nodes, d-regular, min-bisection ≈ b
• GGAR(m, n, Pint, Pext) [Garbe-Promel-Steger 1990]
    m clusters, n nodes in each cluster
    Pint: intra-cluster edge probability
    Pext: inter-cluster edge probability
[Figure: m clusters of n nodes each]
Jason Cong 23
GBUI Results
[Figure: eigenvector value Xi plotted against module i; modules separate cleanly into Cluster 1 and Cluster 2]
Jason Cong 24
GGAR Results
[Figure: eigenvector value Xi plotted against module i; modules separate into Clusters 1 through 4]
Jason Cong 25
Partitioning Algorithms
• Iterative partitioning algorithms
• Spectral based partitioning algorithms
• Net partitioning vs. module partitioning
• Multi-way partitioning
• Further study in partitioning techniques
Jason Cong 26
Transform Netlists to Graphs
• Transform each r-pin net into an r-clique
• Weight each edge in the r-clique by:
    1/(r-1)  (simple scaling), or
    2/r      (edge probability), or
    4/r^2    (bounds each net's contribution to the cut size)
• Transforming multi-pin nets to cliques destroys the sparsity of D-A
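A minimal sketch of the clique expansion with the three weighting options (the model-name strings are my own labels):

```python
def net_to_clique_edges(net, model="1/(r-1)"):
    """Expand one r-pin net into weighted clique edges under one of the
    three weighting schemes from the slide."""
    r = len(net)
    w = {"1/(r-1)": 1.0 / (r - 1),
         "2/r":     2.0 / r,
         "4/r^2":   4.0 / r**2}[model]
    # all r*(r-1)/2 pin pairs get the same weight
    return [(net[i], net[j], w) for i in range(r) for j in range(i + 1, r)]
```

Note the trade-off the slide alludes to: under 1/(r-1) scaling, a 4-pin net becomes 6 edges of weight 1/3, so a single hyperedge can contribute up to 4/3 (not 1) to a cut, while the 4/r^2 scheme bounds that contribution at the cost of underweighting uncut nets.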
Jason Cong 27
Net Partitioning vs. Module Partitioning
• Dual netlist representation -- the intersection graph
    Node: net
    Edge: two nets share a common gate
• The intersection graph also captures circuit connectivity, but does not use hyperedges
[Figure: netlist N with gates G1-G7 and nets a-j, and its intersection graph I(N)]
Jason Cong 28
Transform Net Partition to Module Partition
[Figure: a net bipartition (Y, Y') of the intersection graph I(N) induces module sets X ⊆ N and X'; the conflict graph B(Y, Y') is bipartite]
Every edge in B(Y, Y') indicates a conflict:
    G3 is wanted by both c and g
    G4 is wanted by both d and e
    G5 is wanted by both d and f
Jason Cong 29
Some Definitions
• Maximum Independent Set (MIS): maximum subset of vertices with no two connected
• Minimum Vertex Cover (MVC): minimum subset of vertices which "covers" all edges
[Figure: an MIS and an MVC of a small graph]
Jason Cong 30
Resolving Net Conflicts When Transforming a Net Partition to a Module Partition
• Some nets in B(Y, Y') will lose
    Each losing net corresponds to a cut net in the module partition
• Objective: maximize the # of winning nets, or equivalently, minimize the # of losing nets
• Theorem: the set of winning nets forms an independent set in B(Y, Y'), and the set of losers forms a vertex cover
    => Find a max independent set in B(Y, Y')
    => Solved optimally in polynomial time for bipartite graphs
Jason Cong 31
Bipartite Graph Properties
• |MIS| + |MVC| = |V|
• The size of any vertex cover is no smaller than that of a maximum matching
• |MVC| = |MM|  (MM: maximum matching; Konig's theorem)
    => Goal: find an MVC of size |MM|, or equivalently, an MIS of size |V| - |MM|
    MIS = winners, MVC = losers
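The |MVC| = |MM| relation is constructive: grow alternating paths from the unmatched left vertices, exactly the idea behind the alternating-path and Odd/Even-set slides that follow. A sketch (simple augmenting-path matching, adequate for small graphs; names are mine, and Hopcroft-Karp would be faster on large instances):

```python
def max_matching(L, R, adj):
    """Augmenting-path maximum matching; adj maps each left vertex to
    its list of right neighbors."""
    match_r = {}                                 # right vertex -> matched left
    def try_augment(l, seen):
        for r in adj.get(l, []):
            if r in seen:
                continue
            seen.add(r)
            if r not in match_r or try_augment(match_r[r], seen):
                match_r[r] = l
                return True
        return False
    for l in L:
        try_augment(l, set())
    return match_r

def min_vertex_cover(L, R, adj):
    """Konig's construction: alternating BFS from unmatched left vertices;
    MVC = (L - visited_L) | visited_R, and MIS = V - MVC."""
    match_r = max_matching(L, R, adj)
    match_l = {l: r for r, l in match_r.items()}
    visited_l = {l for l in L if l not in match_l}   # unmatched left vertices
    visited_r = set()
    frontier = list(visited_l)
    while frontier:
        l = frontier.pop()
        for r in adj.get(l, []):                     # unmatched edge to the right
            if r not in visited_r:
                visited_r.add(r)
                l2 = match_r.get(r)                  # matched edge back left
                if l2 is not None and l2 not in visited_l:
                    visited_l.add(l2)
                    frontier.append(l2)
    mvc = (set(L) - visited_l) | visited_r
    mis = (set(L) | set(R)) - mvc
    return mvc, mis
```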
Jason Cong 32
Finding an MIS
[Figure: bipartite graph with sides L and R; a maximum matching (MM) is shown, with unmatched vertices UL and UR]
Jason Cong 33
Finding an MIS
[Figure: alternating paths grown from the unmatched vertices UL and UR]
Jason Cong 34
Finding an MIS
[Figure: Odd and Even sets -- Odd(L), Odd(R), Even(L), Even(R) -- defined by alternating-path distance from the unmatched vertices]
Even(L)
Jason Cong 35
Improving the Module Partition
Net1 = {a, b}   Net2 = {b, c}   Net3 = {a, e}
Net4 = {b, d}   Net5 = {e, f}   Net6 = {g, e}
Even(L) = {Net1, Net2},  ModSL = {a, b, c}
Even(R) = {Net5, Net6},  ModSR = {e, f, g}
ModSL + {d} cuts 1 net
ModSR + {d} cuts 2 nets
Jason Cong 36
Algorithm IG-Match
• Get spectral linear ordering
• Construct bipartite graph B
• Find MIS of B
• Convert MIS into module partition
Theorem: IG-Match computes all |V|-1 module partitions in O(|V|(|V|+|E|)) time
Key idea: incremental updating of the bipartite graph and alternating paths
Jason Cong 37
Experimental Results

          Wei-Cheng            IG-Match
NAME      Areas / Cutsize      Areas / Cutsize     Improvement
bm1       9:873 / 1            21:861 / 1          57%
19ks      1011:1833 / 109      650:2194 / 85       -1%
Prim1     152:681 / 14         154:679 / 14        1%
Prim2     1132:1882 / 123      740:2274 / 77       21%
Test02    372:1291 / 95        211:1452 / 38       37%
Test03    147:1460 / 31        803:804 / 58        38%
Test04    401:1114 / 51        73:1442 / 6         50%
Test05    1204:1391 / 110      105:1442 / 6        53%
Test06    145:1607 / 18        141:1611 / 17       3%

• 28.8% average improvement
Jason Cong 38
Further Study in Partitioning Techniques
• Module replication in circuit partitioning [Kring-Newton 1991; Hwang-ElGamal 1992; Liu et al., TCAD'95; Enos et al., TCAD'99]
• Generating uni-directional partitioning [Iman-Pedram-Fabian-Cong 1993] or acyclic partitioning [Cong-Li-Bagrodia, DAC'94] [Cong-Lim, ASPDAC'2000]
• Logic restructuring during partitioning [Iman-Pedram-Fabian-Cong 1993]
• Communication based partitioning [Hwang-Owens-Irwin 1990; Beardslee-Lin-Sangiovanni 1992]
• Multi-level partitioning (will discuss after clustering)
Jason Cong 39
Multi-Way Partitioning
• Recursive bi-partitioning [Kernighan-Lin 1970]
• Generalization of Fiduccia-Mattheyses' and Krishnamurthy's algorithms [Sanchis 1989] [Cong-Lim, ICCAD'98]
• Generalization of ratio-cut and spectral methods to multi-way partitioning [Chan-Schlag-Zien 1993]
    - generalized ratio-cut value = sum of the flux of each partition
    - generalized ratio-cut cost of a k-way partition ≥ sum of the k smallest eigenvalues of the Laplacian matrix
Jason Cong 40
Circuit Clustering Formulation
Motivation:
• Reduce the size of flat netlists
• Identify natural circuit hierarchy
Objectives:
• Maximize the connectivity of each cluster
• Minimize the size, delay (or simply depth), and density of clustered circuits
Jason Cong 41
Measuring Cluster Connectivity
• Simple density measurements (cluster with n nodes and m edges)
    m/n -- include a new node when it connects to more than m/n edges
    2m/(n(n-1)) -- include a new node when it connects to more than 2m/(n-1) edges; biased for small clusters
• (k, l)-connectivity [Garbe-Promel-Steger 1990]
    Every two nodes in the cluster are connected by at least k edge-disjoint paths of length no more than l
    Proper choice of k and l may not be easy
• D/S-metric [Cong-Hagen-Kahng 1991]
    D/S = Degree / Separation
    Degree = average degree in the cluster
    Separation = average shortest path between a pair of nodes in the cluster
    Robust but expensive to compute
[Figure: k edge-disjoint paths P1, ..., Pk of length ≤ l between nodes u and v]
Jason Cong 42
Clustering Algorithms
• (k, l)-connectivity algorithms [Garbe-Promel-Steger 1990]
• Random-walk based algorithm [Cong-Hagen-Kahng 1991; Hagen-Kahng 1992]
• Multicommodity-flow based algorithm [Yeh-Cheng-Lin 1992]
• Clique based algorithm [Bui 1989; Cong-Smith 1993]
• Clustering for delay minimization (combinational circuits) [Lawler-Levitt-Turner 1969; Murgai-Brayton-Sangiovanni 1991; Rajaraman-Wong 1993]
• Clustering for delay minimization (with consideration of retiming for sequential circuits) [Pan et al., TCAD'99; Cong et al., DAC'99]
• Signal flow based clustering [Cong-Ding, DAC'93; Cong et al., ICCAD'97]
• Multi-level clustering [Karypis-Kumar, TR95/DAC97; Cong-Sung, ASPDAC'2000]
Jason Cong 43
Lawler's Labeling Algorithm [Lawler-Levitt-Turner 1969]
• Assumptions: cluster size ≤ K; intra-cluster delay = 0; inter-cluster delay = 1
• Objective: find a clustering of minimum delay
• Algorithm:
    Phase 1: label all nodes in topological order
        For each PI node v, L(v) = 0
        For each non-PI node v:
            p = maximum label of predecessors of v
            Xp = set of predecessors of v with label p
            if |Xp| < K then L(v) = p else L(v) = p + 1
    Phase 2: form clusters
        Start from the POs to generate the necessary clusters
        Nodes with the same label form a cluster
[Figure: labeling example around node v and its predecessor set Xp]
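Phase 1 can be sketched as follows (a sketch under my naming; it assumes nodes arrive in topological order, and "predecessors" means all transitive predecessors, as in the Lawler-Levitt-Turner labeling):

```python
def lawler_label(nodes, preds, K):
    """Phase 1 of the Lawler-Levitt-Turner labeling.
    nodes: in topological order; preds[v]: direct predecessors of v;
    K: cluster size bound. Returns the label L(v) of every node."""
    L = {}
    tpred = {}                                  # transitive predecessor sets
    for v in nodes:
        tpred[v] = set(preds[v])
        for u in preds[v]:
            tpred[v] |= tpred[u]
        if not preds[v]:                        # primary input
            L[v] = 0
            continue
        p = max(L[u] for u in tpred[v])         # max label among predecessors
        Xp = [u for u in tpred[v] if L[u] == p]
        L[v] = p if len(Xp) < K else p + 1      # bump label if Xp won't fit
    return L
```

On a 4-node chain with K = 2, the label increments exactly when the set of max-label predecessors outgrows a cluster.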
Jason Cong 44
Lawler's Labeling Algorithm (Cont'd)
• Performance of the algorithm
    - Efficient run-time
    - Minimum delay clustering solution
    - Allows node duplication
    - No attempt to minimize the number of clusters
• Extension to allow arbitrary gate delays
    - Heuristic solution [Murgai-Brayton-Sangiovanni 1991]
    - Optimal solution [Rajaraman-Wong 1993]
Jason Cong 45
Signal Flow and Logic Dependency
• Beyond connectivity
    - identifies logically dependent cells
    - useful in guiding
        • clustering
        • partitioning [CoLB94, CoLL+97]
        • placement [CoXu95]
Jason Cong 46
Maximum Fanout Free Cone (MFFC)
• Definition: for a node v in a combinational circuit,
    - cone of v (Cv): v and all of its predecessors such that any path connecting a node in Cv and v lies entirely in Cv
    - fanout free cone at v (FFCv): cone of v such that for any node u ≠ v in FFCv, output(u) ⊆ FFCv
    - maximum FFC at v (MFFCv): FFC of v such that for any non-PI node w, if output(w) ⊆ MFFCv then w ∈ MFFCv
Jason Cong 47
Properties of MFFCs
• If w ∈ MFFCv, then MFFCw ⊆ MFFCv [CoDi93]
• Two MFFCs are either disjoint or one contains the other [CoDi93]
• Area-optimal duplication-free LUT mapping can map each MFFC independently to get an optimal solution [CoDi93]
• For acyclic bipartitioning without area constraint, there exists an optimal cut which does not cut through any MFFCs [CoLB94]
Jason Cong 48
Maximum Fanout Free Subgraph (MFFS)
• Definition: for a node v in a sequential circuit,
    FFSv = {u | every path from u to some PO passes through v}
    MFFSv = the maximum such set at v
• Illustration
[Figure: MFFCs vs. the MFFS on a sequential circuit]
Jason Cong 49
MFFS Construction Algorithm
• For a single MFFS at node v
    - select root node v and cut all its fanout edges
    - mark all nodes reachable backwards from all POs
    - MFFSv = {unmarked nodes}
    - complexity: O(|N| + |E|)
[Figure: construction steps at node v]
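The single-MFFS construction above can be sketched directly (netlist representation and names are mine):

```python
def mffs(nodes, fanouts, POs, v):
    """Single MFFS at v: cut all of v's fanout edges, mark everything
    reachable backwards from the POs, and return the unmarked nodes."""
    # derive fanin lists from fanouts, with v's fanout edges removed
    fanin = {u: set() for u in nodes}
    for u, outs in fanouts.items():
        if u == v:
            continue                       # cut all fanout edges of v
        for w in outs:
            fanin[w].add(u)
    # backward reachability from the POs (v itself is never a seed)
    marked = set()
    stack = [p for p in POs if p != v]
    while stack:
        u = stack.pop()
        if u in marked:
            continue
        marked.add(u)
        stack.extend(fanin[u])
    return set(nodes) - marked             # unmarked nodes form MFFSv
```

In the test below, a and b feed only c, so every a/b-to-PO path passes through c and MFFSc = {a, b, c}.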
Jason Cong 53
MFFS Clustering Algorithm• Clusters Entire Netlist
– construct MFFS at a PO and remove it from netlist– include its inputs as new POs– repeat until all nodes are clustered– complexity : O(|N| · (|N| + |E|))
v
Jason Cong 58
Speedup of MFFS Clustering
• Approximation of MFFS
    - a cluster size limit is imposed in practice
    - perform MFFS construction on a subcircuit SC(v, h)
        • defined by search depth h for a given node v
        • internal nodes: all nodes visited by BFS of depth h from v
        • pseudo PIs/POs: all I/Os from/to nodes not in SC(v, h)
[Figure: subcircuit SC(v, h) of depth h around v, with pseudo PIs and pseudo POs at its boundary]
Jason Cong 60
LSR/MFFS Partitioning Algorithm
• Basic approach: clustering based partitioning
    - MFFS clustering: exploits logic dependency
    - LSR partitioning: focuses on loose and stable net removal
• Overview of the algorithm
    - cluster the netlist using the MFFS algorithm
    - partition the clustered netlist with the LSR algorithm
    - completely decompose the netlist
    - refine the flat netlist with the LSR algorithm
Jason Cong 61
Loose Net Removal (LR)
• Net status transition: free net -> loose net -> locked net
• Loose net removal
[Figure: blocks 0, 1, 2 with the free cells of a loose net being pulled into one block until the net is uncut]
Jason Cong 66
LR Implementation
• Basic approach
    - increase the gain of free cells of loose nets outside the anchor
    - focus on smaller nets
• Cell gain increase function
    - accumulate: until complete removal
    - set threshold: array-based bucket can be used
    - expanded gain range: no tie-breaking scheme (LIFO, LA3) needed

    incr(n) = (max_net_size / size(n)) · (∑_{c ∈ Ln} deg(c)) / (∑_{c ∈ Fn} deg(c))

    (Ln, Fn: locked and free cells of net n)
Jason Cong 67
Stable Net Transition (SNT)
• Stable net [Shibuya et al. 1995]
    - remains cut throughout an entire run
    - more than 80% of cut nets are stable
• Implementation
    - detect stable nets by comparing cutsets
    - move all unprocessed cells of a stable net into a single block
    - serve as the initial partition for the next run
        • fast convergence
        • efficiently explores more local minima
• Loose and Stable net Removal (LSR)
    - combination of two effective net removal schemes
        • LR: small loose nets
        • SNT: large stable nets
Jason Cong 68
Summary of LSR/MFFS Algorithm
[Flowchart: START -> cluster netlist with MFFS -> partition netlist with LSR -> (gain? yes: repartition) -> decluster netlist -> refine netlist with LSR -> (gain? yes: re-refine) -> END]
Jason Cong 69
Clustering Results
• MCNC benchmarks with signal flow information (AVR = max_area / min_area)

                         Optimal           Heuristic
NAME      SIZE    AVR   # CLST    TIME    # CLST   TIME
s1423      619    1.0      193     0.9       168    1.9
sioo       664    4.6      442     2.3       442    2.6
s1488      686    1.0      187     0.9       187    1.4
balu       801   12.9      287     2.0       284    3.8
prim1      833    6.7      394     3.2       379    5.9
struct    1952    2.5     1183    17.9      1183   12.1
prim2     3014    6.7     1404    34.8      1243   22.1
s9234     5866    1.0     3203    36.7      3177    9.7
biomed    6514    4.5     2142    87.8      1853   30.5
s13207    8772    1.0     2132    73.9      1994   14.1
s15850   10470    1.0     2279    84.9      2137   17.6
s35932   18148    1.0     5562   420.8      2943   32.1
s38584   20995    1.0     5139   565.1      4242   44.5
avq.sm   21918    4.5     8477  1287.3      8309  116.1
s38417   23949    1.0     5906   452.2      5295   45.1
avq.lg   25178    4.5     9103  1473.2      8658   90.2
TOTAL                    48033  4543.9     42494  449.7
Jason Cong 70
Cutsize Reduction by MFFS Clustering
• Bipartitioning under a 45-55% balance criterion
[Chart: total cutsize for FM, SNT, LR, LSR on flat vs. MFFS-clustered netlists; improvements over flat FM are 0.0%, -8.0%, -45.6%, -46.9% (flat) and -25.8%, -31.7%, -48.4%, -49.9% (MFFS)]
Jason Cong 71
Cutsize Reduction by MFFS Clustering

              No Clustering                MFFS Clustering
NAME       FM   SNT    LR   LSR         FM   SNT    LR   LSR
s1423      12    14    13    13         12    12    12    12
sioo       25    25    25    25         25    25    25    25
s1488      48    45    44    42         42    42    43    43
balu       27    27    27    27         27    27    27    27
prim1      43    42    42    42         43    42    42    43
struct     42    45    33    33         39    52    33    33
prim2     174   167   121   122        133   122   131   119
s9234      49    54    41    43         42    43    41    40
biomed     83    83    83    83        100    84    84    84
s13207     90    74    75    70         91    65    73    61
s15850    113    73    73    65         72    70    44    43
s35932    100   128    45    44         48    76    45    44
s38584     52    52    49    47         52    52    47    47
avq.sm    321   378   135   139        273   172   127   127
s38417    176   112    53    55        119   147    51    50
avq.lg    491   380   145   131        251   168   127   127
TOTAL    1846  1699  1004   981       1369  1261   952   925
% IMPR    0.0   8.0  45.6  46.9       25.8  31.7  48.4  49.9
TIME    121.4  59.4 162.3 138.9       84.2  61.3 113.6  89.4
Jason Cong 72
Cutsize Comparison with Other Methods
[Chart: total cutsize (roughly 800-1200) for CDIP, PropF, Straw, hMetis, MLc, and L/M (iterative improvement partitioners, IIPs) and for Panza, Parab, and FBB (non-IIPs)]
Jason Cong 73
Details on Cutsize Comparison

NAME     CDIP PROPf Straw hMetis MLc L/M | PANZA Parab FBB
s1423    12 15 16 13
sioo     25 25 45
s1488    43 44 50
balu     27 27 27 27 27 27 | 27 41
prim1    52 51 49 50 47 43
struct   36 33 33 33 33 33 | 33 40
prim2    152 152 143 145 139 119
s9234    44 42 42 40 40 40 | 40 74 70
biomed   83 84 83 83 83 84 | 83 135
s13207   70 71 57 55 55 61 | 66 91 74
s15850   67 56 44 42 44 43 | 44 91 67
s35932   73 42 47 42 41 44 | 43 62 49
s38584   47 51 49 47 47 47 | 47 55 47
avq.sm   148 144 131 130 128 127
s38417   79 65 53 51 49 50 | 49 49 58
avq.lg   145 143 140 127 128 127
TOTAL1   1023 961 898 872 861 845
TOTAL2   509 516 749
% IMPR   17.4 12.1 5.9 3.1 1.9 1.4 | 32.0 21.4
TIME     5817 5611 12577 3455 1338 24619

CDIP through L/M are iterative improvement partitioners (IIPs); PANZA, Parab, and FBB are non-IIPs
Jason Cong 74
Multi-level Paradigm
• Combination of bottom-up and top-down methods
    - from coarse-grain to finer-grain optimization
    - successfully used in partial differential equations, image processing, combinatorial optimization, etc., and circuit partitioning
[Figure: V-shaped flow -- coarsening down one side, initial partitioning at the coarsest level, uncoarsening back up the other side]
Jason Cong 75
General Framework
• Step 1: Coarsening
    - generate a hierarchical representation of the netlist
• Step 2: Initial solution generation
    - obtain an initial solution for the top-level clusters
    - reduced problem size: converges fast
• Step 3: Uncoarsening and refinement
    - project the solution to the next lower level (uncoarsening)
    - perturb the solution to improve quality (refinement)
• Step 4: V-cycle
    - additional improvement possible from new clustering
    - iterate Step 1 (with variation) + Step 3 until no further gain
Jason Cong 76
V-cycle Refinement
• Motivation
    - post-refinement scheme for multi-level methods
    - different clustering can give additional improvement
• Restricted coarsening
    - requires an initial partitioning
    - do not merge clusters in different partitions
    - maintains the cutline: cutsize degradation is not possible
• Two strategies: V-cycle vs. v-cycle
    - V-cycle: start from the bottom level
    - v-cycle: start from some middle level
    - tradeoff between quality and runtime
Jason Cong 77
Application in Partitioning
• Multi-level partitioning
    - Coarsening engine (bottom-up)
        • unrestricted and restricted coarsening
        • any bottom-up clustering algorithm can be used
        • cutsize oriented (MHEC, ESC) vs. delay oriented (PRIME)
    - Initial partitioning engine
        • move-based methods are commonly used
    - Refinement engine (top-down)
        • move-based methods are commonly used
        • cutsize oriented (FM, LR) vs. delay oriented (xLR)
• State-of-the-art algorithms
    - hMetis [DAC97] and hMetis-Kway [DAC99] for cutsize
    - HPM [DAC00] for delay
Jason Cong 78
hMetis Algorithm
• Best bipartitioning algorithm [DAC97]
    - contribution: 3 new coarsening schemes for hypergraphs
[Figure: original graph vs. its edge-coarsened version]
Edge coarsening = heavy-edge maximal matching
    1. Visit vertices randomly
    2. Compute edge weights (= 1/(|n|-1)) for all unmatched neighbors
    3. Match with an unmatched neighbor via max edge weight
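The edge-coarsening step can be sketched as follows (a simplified sketch of heavy-edge maximal matching over the clique view of the hyperedges, with names of my own; it is not hMetis's actual implementation):

```python
import random
from collections import defaultdict

def edge_coarsening(cells, nets, seed=0):
    """Heavy-edge maximal matching: each net of size s contributes
    1/(s-1) to the weight between every pair of its cells; vertices
    are visited in random order and matched to their heaviest
    still-unmatched neighbor."""
    wt = defaultdict(float)               # pair -> accumulated edge weight
    neigh = defaultdict(set)
    for net in nets:
        s = len(net)
        if s < 2:
            continue
        for i in range(s):
            for j in range(i + 1, s):
                a, b = net[i], net[j]
                wt[frozenset((a, b))] += 1.0 / (s - 1)
                neigh[a].add(b); neigh[b].add(a)
    rng = random.Random(seed)
    order = list(cells)
    rng.shuffle(order)                    # random visit order
    matched, pairs = set(), []
    for c in order:
        if c in matched:
            continue
        free = [u for u in neigh[c] if u not in matched]
        if not free:
            continue
        best = max(free, key=lambda u: wt[frozenset((c, u))])
        matched |= {c, best}
        pairs.append((c, best))           # (c, best) merge into one coarse cell
    return pairs
```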
Jason Cong 79
hMetis Algorithm (cont)
• Best bipartitioning algorithm [DAC97]
    - contribution: 3 new coarsening schemes for hypergraphs
[Figure: hyperedge coarsening vs. modified hyperedge coarsening]
Hyperedge coarsening = independent hyperedge merging
    1. Sort hyperedges in non-decreasing order of their size
    2. Pick a hyperedge with no merged vertices and merge it
Modified hyperedge coarsening = hyperedge coarsening + post-processing
    1. Perform hyperedge coarsening
    2. Pick a non-merged hyperedge and merge its non-merged vertices
Jason Cong 80
hMetis-Kway Algorithm
• Multiway partitioning algorithm [DAC99]
    - new coarsening: First Choice (variant of edge coarsening)
        • can match with either unmatched or matched neighbors
    - greedy refinement
        • on-the-fly gain computation
        • no bucket: not necessarily the max-gain cell moves
        • saves time and space
[Figure: original graph vs. First Choice coarsening]
Jason Cong 81
hMetis Results
• Bipartitioning on the ISPD98 benchmark suite
[Chart: scaled cutsize -- FM 1.61, LR 1.21, LR/ESC 1.03, hMetis 1.00]
Jason Cong 82
hMetis-Kway Results
• Multiway partitioning on the ISPD98 benchmark suite
[Chart: scaled cutsize for hMetis-Kway, KPM/LR, and LR/ESC-PM on 2-, 8-, 16-, and 32-way partitioning; reported values include 1.20, 1.03, 1.19, 1.02, 1.18, 1.01, 1.15, 0.97]
Jason Cong 83
Summary
• Partitioning is very important for applying the divide-and-conquer methodology (for complexity management)
• Partitioning also defines global/local interconnects and greatly impacts circuit performance
• Increasing emphasis on interconnect design has introduced many new partitioning formulations
• Clustering is effective in reducing circuit size and identifying natural circuit hierarchy
• Multi-level circuit clustering + iterative improvement based methods produce the best partitioning results (so far)
Jason Cong 84
Appendix: Simulated Annealing Based Partitioning
• Basic approach
    - Solution space: all partitions
    - Neighboring solution: move a module from one side to the other
    - Cost function: cost = c(X, X') + λ(|X| - |X'|)^2
    - Temperature schedule: standard cooling schedule
• Simulated annealing without rejection
    - Generate a move with probability proportional to its gain
    - All gains can be updated efficiently after a move (unique to partitioning!)
    - Substantial run-time reduction at low temperature
    - Use conventional SA at high temperature and rejectionless SA at low temperatures
Jason Cong 85
Simulated Annealing
• Local search
[Figure: cost function over the solution space; purely downhill local search can get trapped in a local minimum]
Jason Cong 86
Statistical Mechanics vs. Combinatorial Optimization
State {r}: configuration -- a set of atomic positions
Weight e^(-E({r})/kB·T) -- the Boltzmann distribution
    E({r}): energy of the configuration
    kB: Boltzmann constant; T: temperature
Low temperature limit??
Jason Cong 87
Analogy

Physical System          Optimization Problem
State (configuration)    Solution
Energy                   Cost function
Ground state             Optimal solution
Rapid quenching          Iterative improvement
Careful annealing        Simulated annealing
Jason Cong 88
Generic Simulated Annealing Algorithm
1. Get an initial solution S
2. Get an initial temperature T > 0
3. While not yet "frozen" do the following:
    3.1 For 1 ≤ i ≤ L, do the following:
        3.1.1 Pick a random neighbor S' of S
        3.1.2 Let ∆ = cost(S') - cost(S)
        3.1.3 If ∆ ≤ 0 (downhill move), set S = S'
        3.1.4 If ∆ > 0 (uphill move), set S = S' with probability e^(-∆/T)
    3.2 Set T = rT (reduce temperature)
4. Return S
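The generic loop above, instantiated for bipartitioning with the cost c(X, X') + λ(|X| - |X'|)^2 from the appendix slide (a minimal sketch; the schedule parameters T0, r, L and the cost recomputation are my simplifications, not tuned values):

```python
import math
import random

def sa_bipartition(nodes, edges, lam=1.0, T0=10.0, r=0.95, L=50, Tmin=0.01, seed=1):
    """Simulated annealing for bipartitioning.
    Neighbor = move one random module to the other side;
    cost = cut weight + lam * (|X| - |X'|)^2."""
    rng = random.Random(seed)
    side = {v: rng.randint(0, 1) for v in nodes}
    def cost():
        cut = sum(w for u, v, w in edges if side[u] != side[v])
        bal = sum(1 for v in nodes if side[v] == 0) - \
              sum(1 for v in nodes if side[v] == 1)
        return cut + lam * bal * bal
    cur = cost()
    T = T0
    while T > Tmin:
        for _ in range(L):
            v = rng.choice(list(nodes))
            side[v] ^= 1                       # propose: flip one module
            new = cost()
            d = new - cur
            if d <= 0 or rng.random() < math.exp(-d / T):
                cur = new                      # accept (downhill, or uphill by chance)
            else:
                side[v] ^= 1                   # reject: undo the move
        T *= r                                 # cool down
    return side, cur
```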
Jason Cong 89
Basic Ingredients for S.A.
• Solution space
• Neighborhood structure
• Cost function
• Annealing schedule
Jason Cong 90
SA Partitioning
"Optimization by Simulated Annealing" - Kirkpatrick, Gelatt, Vecchi
• Solution space = set of all partitions
• Neighborhood structure: randomly move one cell to the other side
[Figure: example solutions such as {a,b,c | d,e,f} and {a,b | c,d,e,f} connected by single-cell moves]
Jason Cong 91
SA Partitioning
• Cost function: f = C + λB
    - C is the partitioning cost as used before
    - B is a measure of how balanced the partitioning is
    - λ is a constant
Example of B:  B = (|S1| - |S2|)^2
[Figure: a partition into sides S1 = {a, b, ...} and S2 = {c, d, ...}]
Jason Cong 92
SA Partitioning
• Annealing schedule:
    - Tn = (T1/T0)^n · T0, with ratio T1/T0 = 0.9
    - At each temperature, continue until either
        1. there are 10 accepted moves on the average; or
        2. # of attempts ≥ 100 × total # of cells
    - The system is "frozen" if there are very low acceptance rates at 3 consecutive temperatures
Jason Cong 93
Graph Partitioning Using Simulated Annealing Without Rejections
• Greene and Supowit, ICCD-88, pp. 658-663
• Motivation: at low temperature, most moves are rejected!
    e.g. a 1/100 acceptance rate for 1,000 vertices
Jason Cong 94
• Key ideas
    (I) Biased selection
        If move i has probability ωi of being accepted, generate move i with probability
            ωi / ∑_{j=1..N} ωj
        (N: size of the neighborhood)
        In general, ωi = min{1, e^(-∆i/T)}
        In the conventional model, each move has probability 1/N of being generated
    (II) If a move is generated, it is always accepted

Graph Partitioning Using Simulated Annealing Without Rejections (Cont'd)
Jason Cong 95
Graph Partitioning Using Simulated Annealing Without Rejections (Cont'd)
• Main difficulties
    ( 1 ) ωi is dynamic (since ∆i is dynamic)
          It is too expensive to update the ωi's (∆i's) after every move
    ( 2 ) Weighted selection problem
          How to select move i with probability ωi / ∑_{j=1..N} ωj ?
Jason Cong 96
Solution to the Weighted Selection Problem (a general solution)
[Figure: a binary tree over the weights ω1, ..., ω7; each leaf holds one ωi and each internal node holds the sum over its subtree (e.g. ω1+ω2, ω5+ω6+ω7, ω1+...+ω4), with the root holding ω1+...+ω7]
Jason Cong 97
Solution to the Weighted Selection Problem (Cont'd)
Let W = ω1 + ω2 + ... + ωn. How do we select i with probability ωi / W ?
Equivalent to choosing x such that ω1 + ... + ω(i-1) < x ≤ ω1 + ... + ωi

    v ← root
    x ← random(0, 1) · ω(v)
    while v is not a leaf do
        if x < ω(left(v)) then v ← left(v)
        else x ← x - ω(left(v)); v ← right(v)
    end

Probability of ending up at leaf i:
    Prob( ω1 + ... + ω(i-1) < x ≤ ω1 + ... + ωi ) = ωi / W
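The tree walk can be sketched as an array-backed complete binary tree (a sketch with my own names; the update() method is what makes dynamically changing weights affordable, which is the point of the "main difficulty" slide):

```python
import random

class WeightTree:
    """Complete binary tree over the weights: leaf i holds w_i, each
    internal node holds the sum of its subtree, so drawing index i with
    probability w_i / W and updating one weight are both O(log n)."""
    def __init__(self, weights):
        n = 1
        while n < len(weights):
            n *= 2
        self.n = n
        self.t = [0.0] * (2 * n)
        for i, w in enumerate(weights):
            self.t[n + i] = w
        for i in range(n - 1, 0, -1):            # fill internal sums bottom-up
            self.t[i] = self.t[2 * i] + self.t[2 * i + 1]
    def update(self, i, w):
        # weights change after each move; fix the path to the root
        i += self.n
        self.t[i] = w
        while i > 1:
            i //= 2
            self.t[i] = self.t[2 * i] + self.t[2 * i + 1]
    def sample(self, rng):
        x = rng.random() * self.t[1]             # x uniform in [0, W)
        v = 1
        while v < self.n:                        # descend: left if x < left sum
            if x < self.t[2 * v]:
                v = 2 * v
            else:
                x -= self.t[2 * v]
                v = 2 * v + 1
        return v - self.n                        # leaf index
```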
Jason Cong 98
Application to Partitioning (a special solution to the first problem)
Given a partition (A, B):
    Cost F(A, B) = Fc(A, B) + FI(A, B)
    Fc(A, B) = net-cut between A and B
    FI(A, B) = C(|A|^2 + |B|^2)   (minimized when |A| = |B| = n/2)
For move i:  ∆i = F(A', B') - F(A, B)
After a move:
    ∆i_c = Fc(A', B') - Fc(A, B)   -- only a few of the ∆i_c change
    ∆i_I = FI(A', B') - FI(A, B)   -- all of the ∆i_I change
Jason Cong 99
Application to Partitioning: a special solution to the first problem (Cont'd)
Solution: two-step biased selection, with Pi = α(∆i_C, T) · α(∆i_I, T)
    (i)  choose side A or B, based on α(∆i_I, T)
    (ii) choose move i within A or B, based on α(∆i_C, T)
Note: the ∆i_I's are the same for every move within A (or within B),
so we keep one copy of ∆i_I for A and one copy for B,
and choose the moves within A or B using the tree algorithm