Transcript of "chap3 partitioning clustering" (cadlab.cs.ucla.edu/~cong/CS258F/chap3_06w.pdf)
CS258F 2006W, Prof. J. Cong


Outline

Circuit Partitioning Formulation

Importance of Circuit Partitioning

Partitioning Algorithms

Circuit Clustering Formulation

Clustering Algorithms


Basic Notations

• Node = cell, logic gate, functional block, …
• Pin = an input/output port of a node
• Net = set of pins to be connected = hyperedge
• Netlist = set of nodes + nets = hypergraph

Circuit Partitioning Formulation

Bi-partitioning formulation:

Minimize interconnections between partitions

Minimum cut: min c(X, X')

Minimum bisection: min c(X, X') with |X| = |X'|

Minimum ratio-cut: min c(X, X') / (|X|·|X'|)

(Figure: a cut separating X and X', with cut cost c(X, X').)
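To make the three objectives concrete, here is a minimal Python sketch (not from the slides; the edge-list representation and function names are illustrative) that evaluates the cut cost and the ratio-cut of a given bipartition of an undirected weighted graph:

def cut_cost(edges, X):
    """c(X, X'): total weight of the edges crossing the bipartition (X, V - X)."""
    X = set(X)
    return sum(w for u, v, w in edges if (u in X) != (v in X))

def ratio_cut(edges, nodes, X):
    """c(X, X') / (|X| * |X'|): penalizes unbalanced cuts."""
    X = set(X)
    Xp = set(nodes) - X
    return cut_cost(edges, X) / (len(X) * len(Xp))

# Example: a 4-node graph; {a, b} vs {c, d} is also a bisection since |X| = |X'|.
edges = [("a", "b", 2.0), ("b", "c", 1.0), ("c", "d", 2.0), ("a", "d", 1.0)]
nodes = ["a", "b", "c", "d"]
print(cut_cost(edges, {"a", "b"}), ratio_cut(edges, nodes, {"a", "b"}))

The minimum bisection restricts the search to |X| = |X'|, while the minimum ratio-cut divides the same cut cost by |X|·|X'| so that very small partitions are not favored.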


A Bi-Partitioning Example

Min-cut size = 13
Min-bisection size = 300
Min-ratio-cut size = 19

(Figure: a 6-node weighted example graph on nodes a-f with edge weights 4, 9, 10, and several edges of weight 100; the min-cut, min-bisection, and min-ratio-cut lines are marked.)

Ratio-cut helps to identify natural clusters


Circuit Partitioning Formulation (Cont’d)

General multi-way partitioning formulation:

Partitioning a network N into N1, N2, …, Nk such that

Each partition has an area constraint:  Σ_{v ∈ N_i} a(v) ≤ A_i

Each partition has an I/O constraint:  c(N_i, N − N_i) ≤ I_i

Minimize the total interconnection:  Σ_i c(N_i, N − N_i)


Importance of Circuit Partitioning

Divide-and-conquer methodology

The most effective way to solve problems of high complexity

E.g.: min-cut based placement, partitioning-based test generation,…

System-level partitioning for multi-chip designs

inter-chip interconnection delay dominates system performance.

Circuit emulation/parallel simulation

partition large circuit into multiple FPGAs (e.g. Quickturn), or multiple special-purpose processors (e.g. Zycad).

Parallel CAD development

Task decomposition and load balancing

In deep-submicron designs, partitioning defines local and global interconnect, and has significant impact on circuit performance



Partitioning Algorithms

Iterative partitioning algorithms

Spectral based partitioning algorithms

Net partitioning vs. module partitioning

Multi-way partitioning

Multi-level partitioning (to be discussed after clustering)

Further study in partitioning techniques


Iterative Partitioning Algorithms

Greedy iterative improvement method

[Kernighan-Lin 1970]

[Fiduccia-Mattheyses 1982]

[Krishnamurthy 1984]

Simulated Annealing

[Kirkpatrick-Gelatt-Vecchi 1983]

[Greene-Supowit 1984] (see the Appendix; SA will be formally introduced in the Floorplan chapter)

Kernighan-Lin's Algorithm

Pair-wise exchange of nodes to reduce the cut size
Allow the cut size to increase temporarily within a pass

Compute the gain of a swap
Repeat
    Perform a feasible swap of max gain
    Mark the swapped nodes "locked"
    Update the swap gains
Until no feasible swap
Find the max prefix partial sum in the gain sequence g1, g2, …, gm
Make the corresponding swaps permanent
Start another pass if the current pass reduced the cut size (usually converges after a few passes)

(Figure: nodes u and v exchanged across the cut and then locked.)
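The pass above can be sketched in Python as follows. This is an illustrative sketch, not the course code: the graph is a dict mapping frozenset({u, v}) to an edge weight, and the swap gains are recomputed from scratch after every tentative swap rather than updated incrementally as the complexity analysis on the next slide assumes.

def kl_pass(w, A, B):
    """One Kernighan-Lin pass; returns the best prefix of swaps and its total gain."""
    A, B = set(A), set(B)
    locked = set()
    seq = []  # (gain, a, b) for each tentative swap, in order

    def D(v, side, other):
        # external minus internal connection cost of v
        ext = sum(w.get(frozenset({v, u}), 0) for u in other)
        internal = sum(w.get(frozenset({v, u}), 0) for u in side if u != v)
        return ext - internal

    for _ in range(min(len(A), len(B))):
        best = None
        for a in A - locked:
            for b in B - locked:
                g = D(a, A, B) + D(b, B, A) - 2 * w.get(frozenset({a, b}), 0)
                if best is None or g > best[0]:
                    best = (g, a, b)
        if best is None:
            break
        g, a, b = best
        A.remove(a); B.remove(b); A.add(b); B.add(a)   # tentative swap
        locked.update({a, b})
        seq.append((g, a, b))

    # keep only the prefix of swaps with the maximum partial gain sum
    prefix = [sum(g for g, _, _ in seq[:k]) for k in range(len(seq) + 1)]
    k_best = max(range(len(prefix)), key=lambda k: prefix[k])
    return seq[:k_best], prefix[k_best]

The caller applies only the returned prefix of swaps to the real partition and starts another pass while the prefix gain stays positive.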


Fiduccia-Mattheyses’ Improvement

Each pass in the KL algorithm takes O(n³) or O(n² log n) time (n: #modules)
  Choosing the swap with max gain and updating the swap gains take O(n²) time

The FM algorithm takes O(p) time per pass (p: #pins)

Key ideas in the FM algorithm:
  Each move affects only the gains of a few other cells
    ⇒ constant-time gain updating per move (amortized)
  Maintain a list of gain buckets
    ⇒ constant-time selection of the move with max gain

Further improvement by Krishnamurthy: look-ahead in gain computation

(Figure: gain-bucket lists for partitions V1 and V2, indexed from +gmax down to -gmax, with cells u1 and u2 in their buckets.)
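A minimal sketch of the gain-bucket idea described above (illustrative only; a real FM implementation threads the cells on doubly linked lists so that removing a cell from its bucket is O(1)):

from collections import defaultdict, deque

class GainBuckets:
    """One bucket per gain value in [-gmax, +gmax], plus a max-gain pointer."""
    def __init__(self, gmax):
        self.gmax = gmax
        self.buckets = defaultdict(deque)   # gain -> cells currently at that gain
        self.gain = {}                      # cell -> its current gain
        self.max_gain = -gmax

    def insert(self, cell, gain):
        self.gain[cell] = gain
        self.buckets[gain].append(cell)
        self.max_gain = max(self.max_gain, gain)

    def update(self, cell, new_gain):
        self.buckets[self.gain[cell]].remove(cell)   # O(1) with a linked list
        self.insert(cell, new_gain)

    def pop_max(self):
        # slide the pointer down to the first non-empty bucket, then pop a cell
        while self.max_gain > -self.gmax and not self.buckets[self.max_gain]:
            self.max_gain -= 1
        cell = self.buckets[self.max_gain].popleft()
        del self.gain[cell]
        return cell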


Further Improvement on Iterative Improvement-Based Partitioning Algorithms

• Basic idea:
  – Encourage cluster-based moves during the iterative improvement process
  – Example: LSR partitioning: focus on loose and stable net removal


Loose Net Removal (LR)

• Net status transition: free net, loose net, locked net
• Loose net removal

(Figure: a 3-block example (blocks 0-2) showing a locked net, an uncut net, and the free cells of a loose net.)


LR Implementation

• Basic Approach

  – increase the gain of free cells of loose nets outside the anchor
  – focus on smaller nets, with fewer free cells (F_n) and more locked cells (L_n)

• Cell Gain Increase Function
  – accumulate: until complete removal
  – set threshold: an array-based bucket can be used
  – expanded gain range: no tie-breaking scheme (LIFO, LA3) needed

  incr(n) = ⌊ (max_net_size / size(n)) · ( Σ_{c ∈ L_n} deg(c) / Σ_{c ∈ F_n} deg(c) ) ⌋


Stable Net Transition (SNT)

• Stable net [Shibuya et al. 1995]
  – remains cut throughout the entire run
  – more than 80% of cut nets are stable
• Implementation
  – detect stable nets by comparing cutsets
  – move all unprocessed cells of a stable net into a single block
  – serve as the initial partition for the next run
    • fast convergence
    • efficiently explores more local minima
• Loose and Stable net Removal (LSR)
  – a combination of the two effective net removal schemes
    • LR: remove small loose nets during each pass
    • SNT: remove large stable nets at the end of each pass

Cutsize Reduction by LSR

• Bipartitioning under 45-55% balance criterion

(Chart: total cutsize for FM, SNT, LR, and LSR; reduction relative to FM: FM 0.0%, SNT -8.0%, LR -45.6%, LSR -46.9%.)


Partitioning Algorithms

Iterative partitioning algorithms

Spectral based partitioning algorithms

Net partitioning vs. module partitioning

Multi-way partitioning

Further study in partitioning techniques


Spectral Based Partitioning Algorithms

(Example: a small weighted graph on nodes a, b, c, d; the slide shows its adjacency matrix A and diagonal degree matrix D.)

D: degree matrix; A: adjacency matrix; D-A: Laplacian matrix

Eigenvectors of D-A form the Laplacian spectrum of G
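A small numpy sketch of these definitions; the 4-node weighted graph below is illustrative, not necessarily the exact example drawn on the slide:

import numpy as np

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 1.0), ("b", "c", 3.0), ("c", "d", 4.0), ("b", "d", 3.0)]

idx = {v: i for i, v in enumerate(nodes)}
A = np.zeros((4, 4))                      # adjacency matrix
for u, v, w in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = w
D = np.diag(A.sum(axis=1))                # degree matrix
Q = D - A                                 # Laplacian matrix

eigvals, eigvecs = np.linalg.eigh(Q)      # eigh: Q is symmetric, eigenvalues ascending
print(eigvals)                            # smallest is ~0, eigenvector ~ all-ones
fiedler = eigvecs[:, 1]                   # eigenvector of the 2nd smallest eigenvalue
print(dict(zip(nodes, fiedler)))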


Some Applications of Laplacian Spectrum

Placement and floorplan
  [Hall 1970] [Otten 1982] [Frankle-Karp 1986] [Tsay-Kuh 1986]

Bisection lower bound and computation
  [Donath-Hoffman 1973] [Barnes 1982] [Boppana 1987]

Ratio-cut lower bound and computation
  [Hagen-Kahng 1991] [Cong-Hagen-Kahng 1992]

Eigenvalues and Eigenvectors

If Ax = λx, then λ is an eigenvalue of A and x is an eigenvector of A w.r.t. λ
(note that Kx is also an eigenvector, for any constant K ≠ 0).

Ax = ( a_11 x_1 + a_12 x_2 + … + a_1n x_n,
       …,
       a_n1 x_1 + a_n2 x_2 + … + a_nn x_n )


A Basic Property

x^T A x = (x_1, …, x_n) · A · (x_1, …, x_n)^T
        = Σ_{i=1..n} x_i ( a_i1 x_1 + … + a_in x_n )
        = Σ_{i,j} a_ij x_i x_j

Basic Idea of Laplacian Spectrum Based Graph Partitioning

Given a bisection (X, X'), define a partition vector x = (x_1, x_2, …, x_n) with
  x_i = 1 if i ∈ X,  x_i = -1 if i ∈ X'.
Clearly x ⊥ 1 and x ≠ 0 (where 1 = (1, 1, …, 1) and 0 = (0, 0, …, 0)).

For a partition vector x:  Σ_{(i,j) ∈ E} (x_i − x_j)² = 4 · C(X, X')

Let S = {x | x ⊥ 1 and x ≠ 0}. Finding the best partition vector x (minimum total edge cut C(X, X')) is relaxed to finding x in S such that Σ_{(i,j) ∈ E} (x_i − x_j)² is minimum.

Linear placement interpretation: minimize the total squared wirelength of the linear placement x_1, x_2, …, x_n.


Property of Laplacian Matrix

(1)  x^T Q x = x^T (D − A) x = x^T D x − x^T A x
             = Σ_i d_i x_i² − Σ_{i,j} a_ij x_i x_j
             = Σ_{(i,j) ∈ E} a_ij (x_i − x_j)²
             = total squared wirelength of the linear placement x_1, x_2, …, x_n
             = 4 · C(X, X')   if x is a partition vector

Therefore, we want to minimize x^T Q x.

Property of Laplacian Matrix (Cont'd)

(2) Q is symmetric and positive semi-definite, i.e.
    (i)  x^T Q x = Σ_{i,j} q_ij x_i x_j ≥ 0 for all x
    (ii) all eigenvalues of Q are ≥ 0

(3) The smallest eigenvalue of Q is 0; the corresponding eigenvector is x_0 = (1, 1, …, 1)
    (not interesting: all modules overlap, and x_0 ∉ S)

(4) According to the Courant-Fischer minimax principle, the 2nd smallest eigenvalue satisfies:
    λ = min_{x ∈ S}  (x^T Q x) / |x|²


Results on Spectral Based Graph Partitioning

Min bisection cost c(X, X') ≥ n·λ/4

Min ratio-cut cost c(X, X') / (|X|·|X'|) ≥ λ/n   (homework)

The eigenvector of the second smallest eigenvalue gives the best (relaxed) linear placement

Compute the best bisection or ratio-cut based on the second smallest eigenvector

Computing the Eigenvector

Only need one eigenvector (the one for the second smallest eigenvalue)

Q is symmetric and sparse

Use the block Lanczos algorithm


Interpreting the Eigenvector

Linear ordering of modules: sort the modules by their eigenvector values

Try all splitting points and choose the best one, using the bisection or ratio-cut metric

(Figure: the modules in spectral linear order, e.g. 23 10 22 34 6 21 8, with a candidate split point.)
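A sketch of this step, reusing the numpy Laplacian from the earlier sketch (illustrative names; scoring here uses the ratio-cut metric, but the same sweep works for bisection):

import numpy as np

def best_ratio_cut_split(A):
    """Order modules by the 2nd eigenvector, sweep all split points, keep the best."""
    n = A.shape[0]
    Q = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(Q)
    order = np.argsort(vecs[:, 1])            # spectral linear ordering
    best = None
    for k in range(1, n):                     # split after the k-th module
        X = set(order[:k].tolist())
        cut = sum(A[i, j] for i in X for j in range(n) if j not in X)
        score = cut / (k * (n - k))           # ratio-cut of this split
        if best is None or score < best[0]:
            best = (score, order[:k].tolist())
    return best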

Experiments on Special Graphs

GBUI(2n, d, b) [Bui-Chaudhuri-Leighton 1987]
  2n nodes, d-regular, min-bisection ≈ b

GGAR(m, n, Pint, Pext) [Garbers-Promel-Steger 1990]

m clusters

n nodes each cluster

Pint: intra-cluster edge probability

Pext: Inter-cluster edge probability



GBUI Results

(Plot: eigenvector value x_i vs. module index i; the modules separate into two clusters.)

GGAR Results

(Plot: eigenvector value x_i vs. module index i; the modules separate into four clusters.)


Partitioning Algorithms

Iterative partitioning algorithms

Spectral based partitioning algorithms

Net partitioning vs. module partitioning

Multi-way partitioning

Further study in partitioning techniques


Transform Netlists to Graph

Transform each r-pin net into an r-clique

Weight each edge in the r-clique by:
  1/(r−1)  (simple scaling), or
  2/r      (edge probability), or
  4/r²     (bound each net's contribution to the cut size)

Transforming multi-pin nets to cliques may destroy the sparsity of D-A
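A sketch of this transformation (the function and rule names are illustrative):

from itertools import combinations
from collections import defaultdict

def netlist_to_graph(nets, rule="scaling"):
    """Expand each r-pin net into an r-clique with one of the weighting rules above."""
    weight = defaultdict(float)               # (u, v) -> accumulated edge weight
    for net in nets:
        r = len(net)
        if r < 2:
            continue
        w = {"scaling": 1.0 / (r - 1),
             "probability": 2.0 / r,
             "bounded": 4.0 / (r * r)}[rule]
        for u, v in combinations(sorted(net), 2):
            weight[(u, v)] += w               # parallel edges from different nets add up
    return dict(weight)

print(netlist_to_graph([{"a", "b", "c"}, {"b", "d"}]))   # a 3-pin and a 2-pin net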


Net Partitioning vs. Module Partitioning

Dual netlist representation: the intersection graph
  Node: a net
  Edge: two nets share a common gate

Intersection graph also captures circuit connectivity, but does not use hyperedges

(Figure: a netlist N with gates G1-G7 and nets a-j, and its intersection graph I(N).)


Transform Net Partition to Module Partition

(Figure: a bipartition (Y, Y') of the nets in the intersection graph I(N) induces a module bipartition (X, X') of the netlist N; the conflicting nets form a bipartite graph B(Y, Y').)

Every edge in B(Y, Y’) indicates a conflict:

G3 is wanted by both c and g

G4 is wanted by both d and e

G5 is wanted by both d and f



Some Definitions

Maximum Independent Set (MIS): a maximum subset of vertices with no two connected

Minimum Vertex Cover (MVC): a minimum subset of vertices which "covers" all edges

(Figure: an example MIS and MVC.)


Resolving Net Conflict When Transforming A Net Partition to A Module Partition

Some nets in B(Y, Y') will lose; each losing net corresponds to a cut net in the module partition

Objective: maximize the number of winning nets, or equivalently, minimize the number of losing nets

Theorem: the set of winning nets forms an independent set in B(Y, Y'), and the set of losers forms a vertex cover

⇒ Find a maximum independent set in B(Y, Y')
⇒ Solved optimally in polynomial time for bipartite graphs


Bipartite Graph Properties

|MIS| + |MVC| = |V|

The size of any vertex cover is no smaller than the size of a maximum matching; for bipartite graphs, |MVC| = |MM| (MM: maximum matching, by König's theorem)

⇒ Goal: find an MVC of size |MM|, or equivalently, an MIS of size |V| − |MM|

MIS = winners, MVC = losers
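A minimal sketch of this computation using networkx (an external library, not the course's own code). The node names follow the conflict example from the earlier slide, where nets c and g contend for G3, d and e for G4, and d and f for G5:

import networkx as nx
from networkx.algorithms import bipartite

def winners_and_losers(left_nets, right_nets, conflict_edges):
    """Winners = an MIS of B(Y, Y'); losers = the complementary MVC (Konig's theorem)."""
    B = nx.Graph()
    B.add_nodes_from(left_nets, bipartite=0)
    B.add_nodes_from(right_nets, bipartite=1)
    B.add_edges_from(conflict_edges)
    matching = bipartite.maximum_matching(B, top_nodes=left_nets)
    mvc = bipartite.to_vertex_cover(B, matching, top_nodes=left_nets)
    mis = set(B) - mvc
    return mis, mvc

wins, losses = winners_and_losers({"c", "d"}, {"g", "e", "f"},
                                  [("c", "g"), ("d", "e"), ("d", "f")])
print(wins, losses)   # 3 winning nets, 2 losing (cut) nets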

Finding an MIS

(Figure: a bipartite graph with sides L and R, a maximum matching MM, and the unmatched vertices UL ⊆ L and UR ⊆ R.)


Finding an MIS

(Figure: alternating paths grown from the unmatched vertices UL and UR.)

Finding an MIS

(Figure: the Odd and Even sets Odd(L), Even(L), Odd(R), Even(R) reached by alternating paths from UL and UR.)


Improving the Module Partition

Net1 = {a, b}, Net2 = {b, c}, Net3 = {a, e}, Net4 = {b, d}, Net5 = {e, f}, Net6 = {g, e}

Even(L) = {Net1, Net2}  →  ModSL = {a, b, c}
Even(R) = {Net5, Net6}  →  ModSR = {e, f, g}

ModSL + {d} cuts 1 net; ModSR + {d} cuts 2 nets, so the leftover module d goes with ModSL


Algorithm IG-Match

Get spectral linear ordering

Construct bipartite graph B

Find MIS of B

Convert MIS into module partition

Theorem: IG-Match computes all |V|-1 module partitions in O(|V|(|V|+|E|)) time

Key idea: incremental updating of the bipartite graph and alternating paths


Experimental Results

Name     Wei-Cheng           IG-Match            Improvement
         areas / cutsize     areas / cutsize
bm1      9:873 / 1           21:861 / 1           57%
19ks     1011:1833 / 109     650:2194 / 85        -1%
Prim1    152:681 / 14        154:679 / 14          1%
Prim2    1132:1882 / 123     740:2274 / 77        21%
Test02   372:1291 / 95       211:1452 / 38        37%
Test03   147:1460 / 31       803:804 / 58         38%
Test04   401:1114 / 51       73:1442 / 6          50%
Test05   1204:1391 / 110     105:1442 / 6         53%
Test06   145:1607 / 18       141:1611 / 17         3%

28.8% average improvement


Further Study in Partitioning Techniques

Module replication in circuit partitioning [Kring-Newton 1991; Hwang-ElGamal 1992; Liu et al TCAD’95; Enos, et al, TCAD’99]

Generating uni-directional partitioning [Iman-Pedram-Fabian-Cong 1993] or acyclic partitioning [Cong-Li-Bagrodia, DAC94] [Cong-Lim, ASPDAC2000]

Logic restructuring during partitioning[Iman-Pedram-Fabian-Cong 1993]

Communication based partitioning[Hwang-Owens-Irwin 1990; Beardslee-Lin-Sangiovanni 1992]

Multi-level partitioning (will discuss after clustering)


Multi-Way Partitioning

Recursive bi-partitioning [Kernighan-Lin 1970]

Generalization of Fiduccia-Mattheyses' and Krishnamurthy's algorithms [Sanchis 1989] [Cong-Lim, ICCAD'98]

Generalization of ratio-cut and spectral methods to multi-way partitioning [Chan-Schlag-Zien 1993]
  generalized ratio-cut value = sum of the flux of each partition
  generalized ratio-cut cost of a k-way partition ≥ sum of the k smallest eigenvalues of the Laplacian matrix


Circuit Clustering Formulation

Motivation:
  Reduce the size of flat netlists
  Identify natural circuit hierarchy

Objectives:
  Maximize the connectivity of each cluster, or minimize the connectivity between clusters
  Minimize the delay (or simply the depth) of the clustered circuit


Measuring Cluster Connectivity

Simple density measurements (for a cluster with n nodes and m edges):
  m/n: include a new node when it connects to more than m/n edges
  m / (n choose 2) = 2m/(n(n−1)): include a new node when it connects to more than 2m/(n−1) edges (biased toward small clusters)

(k, l)-connectivity [Garbers-Promel-Steger 1990]:
  every two nodes in the cluster are connected by at least k edge-disjoint paths of length no more than l
  a proper choice of k and l may not be easy
  (Figure: k edge-disjoint paths P1, …, Pk of length ≤ l between nodes u and v.)

D/S-metric [Cong-Hagen-Kahng 1991]:
  D/S = Degree / Separation
  Degree = average degree in the cluster
  Separation = average shortest path between a pair of nodes in the cluster
  Robust but expensive to compute

Clustering Algorithms

(k, l)-connectivity algorithms [Garbers-Promel-Steger 1990]

Random-walk based algorithm [Cong-Hagen-Kahng 1991; Hagen-Kahng 1992]

Multicommodity-Flow based algorithm [Yeh-Cheng-Lin 1992]

Local connectivity based clustering

Clique based algorithm [Bui 1989; Cong-Smith 1993]

HEC, MHEC, First-choice [Karypis-Kumar, DAC97 & DAC98]

Global connectivity based clustering [Cong-Sung, ASPDAC’2000]

Clustering for delay minimization (combinational circuits) [Lawler-Levitt-Turner 1969; Murgai-Brayton-Sangiovanni 1991; Rajaraman-Wong 1993]

Clustering for delay minimization (with consideration of retiming for sequential circuits) [Pan et al, TCAD’99; Cong et al, DAC’99]

Signal flow based clustering [Cong-Ding, DAC’93; Cong et al ICCAD’97]


FirstChoice Clustering Algorithm (FC)

• Connectivity-driven clustering algorithm
• First introduced by G. Karypis in the DAC'98 k-way hMetis partitioner
• Idea:
  – Visit the nodes in a random order
  – For each node v:
    • compute the connectivity weights for its neighboring nodes: Connect(u, v) = Σ w(e)/|e| over the hyperedges e connecting u and v, where |e| is the degree of the hyperedge and w(e) is the hyperedge weight
    • choose the node u with the maximum connectivity weight to "merge" with; if u is already clustered, v is merged into u's cluster, otherwise u and v form a new cluster
  – Cluster area control can be applied when merging a node into a cluster
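A rough sketch of the FC idea (illustrative only: cluster area control and hMetis's exact tie-breaking are omitted, and a node that has already been pulled into a cluster is simply skipped):

import random
from collections import defaultdict

def first_choice(nodes, nets, net_weight=None):
    net_weight = net_weight or {i: 1.0 for i in range(len(nets))}
    cluster_of = {}                        # node -> cluster id
    next_cluster = 0
    nets_of = defaultdict(list)            # node -> indices of incident nets
    for i, net in enumerate(nets):
        for v in net:
            nets_of[v].append(i)

    order = list(nodes)
    random.shuffle(order)                  # visit nodes in a random order
    for v in order:
        if v in cluster_of:
            continue
        connect = defaultdict(float)       # Connect(u, v) = sum of w(e)/|e|
        for i in nets_of[v]:
            for u in nets[i]:
                if u != v:
                    connect[u] += net_weight[i] / len(nets[i])
        if not connect:                    # isolated node: its own cluster
            cluster_of[v] = next_cluster
            next_cluster += 1
            continue
        u = max(connect, key=connect.get)  # most strongly connected neighbor
        if u in cluster_of:
            cluster_of[v] = cluster_of[u]  # merge v into u's existing cluster
        else:
            cluster_of[v] = cluster_of[u] = next_cluster
            next_cluster += 1
    return cluster_of

print(first_choice(["a", "b", "c", "d"], [{"a", "b", "c"}, {"c", "d"}]))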

Lawler's Labeling Algorithm [Lawler-Levitt-Turner 1969]

Assumptions: cluster size ≤ K; intra-cluster delay = 0; inter-cluster delay = 1

Objective: find a clustering of minimum delay

Algorithm:
Phase 1: label all nodes in topological order
  For each PI node v, L(v) = 0
  For each non-PI node v:
    p = maximum label of the predecessors of v
    X_p = set of predecessors of v with label p
    if |X_p| < K then L(v) = p else L(v) = p + 1
Phase 2: form clusters
  Start from the POs to generate the necessary clusters
  Nodes with the same label form a cluster

(Figure: node v and its predecessors; X_p is the set of predecessors of v with the maximum label p, the rest have label p−1.)
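Phase 1 can be written directly from the statement above (a sketch: preds holds each node's immediate predecessors, the nodes must already be in topological order, and the labeling rule is implemented exactly as stated on the slide):

def lawler_labels(nodes_in_topo_order, preds, K):
    label = {}
    for v in nodes_in_topo_order:
        if not preds[v]:                   # primary input
            label[v] = 0
            continue
        p = max(label[u] for u in preds[v])
        Xp = [u for u in preds[v] if label[u] == p]
        label[v] = p if len(Xp) < K else p + 1
    return label

# Tiny example with K = 1, so the label increases at every non-PI node
preds = {"i1": [], "i2": [], "g1": ["i1", "i2"], "g2": ["g1"]}
print(lawler_labels(["i1", "i2", "g1", "g2"], preds, K=1))   # {'i1': 0, 'i2': 0, 'g1': 1, 'g2': 2}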


Lawler's Labeling Algorithm (Cont'd)

Performance of the algorithm:
  Efficient run-time
  Minimum-delay clustering solution
  Allows node duplication
  No attempt to minimize the number of clusters

Extensions to allow arbitrary gate delays:
  Heuristic solution [Murgai-Brayton-Sangiovanni 1991]
  Optimal solution [Rajaraman-Wong 1993]

Signal Flow and Logic Dependency

• Beyond connectivity:
  – identifies logically dependent cells
  – useful in guiding
    • clustering
    • partitioning [CoLB94, CoLL+97]
    • placement [CoXu95]


Maximum Fanout Free Cone (MFFC)

• Definition: for a node v in a combinational circuit,
  – cone of v (C_v): v and all of its predecessors such that any path connecting a node in C_v and v lies entirely in C_v
  – fanout-free cone at v (FFC_v): a cone of v such that for any node u ≠ v in FFC_v, output(u) ⊆ FFC_v
  – maximum FFC at v (MFFC_v): the FFC of v such that for any non-PI node w, if output(w) ⊆ MFFC_v then w ∈ MFFC_v

Properties of MFFCs

• If w ∈ MFFC_v, then MFFC_w ⊆ MFFC_v [CoDi93]
• Two MFFCs are either disjoint or one contains the other [CoDi93]
• Area-optimal duplication-free LUT mapping can map each MFFC independently to get an optimal solution [CoDi93]
• For acyclic bipartitioning without area constraint, there exists an optimal cut which does not cut through any MFFC [CoLB94]


Maximum Fanout Free Subgraph (MFFS)

• Definition: for a node v in a sequential circuit,
  FFS_v = {u | every path from u to some PO passes through v}
  MFFS_v: the maximal such fanout-free subgraph at v

• Illustration
  (Figure: the MFFCs vs. the MFFS of the same circuit.)

MFFS Construction Algorithm

• For a single MFFS at node v:
  – select the root node v and cut all its fanout edges
  – mark all nodes reachable backwards from all POs
  – MFFS_v = {unmarked nodes}
  – complexity: O(|N| + |E|)

(Figure: the construction illustrated step by step on a small circuit.)
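A sketch of this construction (illustrative names: fanins maps every node to its fanin list and pos lists the primary outputs):

from collections import deque

def mffs_at(v, fanins, pos):
    """Conceptually cut v's fanout edges, mark whatever can still reach a PO, and
    return the unmarked nodes, which form MFFS_v (v itself included)."""
    marked = set()
    queue = deque(po for po in pos if po != v)
    while queue:
        u = queue.popleft()
        if u in marked:
            continue
        marked.add(u)
        for w in fanins[u]:
            if w != v and w not in marked:   # skip v's (cut) fanout edges
                queue.append(w)
    return set(fanins) - marked

# x also drives the PO z, so x (and i) stay outside MFFS_v
fanins = {"i": [], "x": ["i"], "v": ["x"], "y": ["v"], "z": ["x"]}
print(mffs_at("v", fanins, pos=["y", "z"]))   # {'v'}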


MFFS Clustering Algorithm

• Clusters the entire netlist:
  – construct the MFFS at a PO and remove it from the netlist
  – include its inputs as new POs
  – repeat until all nodes are clustered
  – complexity: O(|N| · (|N| + |E|))

(Figure: the clustering illustrated step by step on a small circuit.)


Speedup of MFFS Clustering

• Approximation of MFFS:
  – a cluster size limit is imposed in practice
  – perform MFFS construction on a subcircuit SC(v, h)
    • defined by a search depth h for the given node v
    • internal nodes: all nodes visited by a BFS of depth h from v
    • pseudo PIs/POs: all I/Os from/to nodes not in SC(v, h)

(Figure: the subcircuit SC(v, h) of depth h around v within the full circuit, with pseudo PIs/POs on its boundary.)


MFFS Clustering Result

• MCNC benchmarks with signal flow information (AVR = max_area / min_area)

                          Optimal              Heuristic
NAME      SIZE    AVR     # CLST    TIME       # CLST    TIME
s1423      619    1.0        193     0.9          168     1.9
sioo       664    4.6        442     2.3          442     2.6
s1488      686    1.0        187     0.9          187     1.4
balu       801   12.9        287     2.0          284     3.8
prim1      833    6.7        394     3.2          379     5.9
struct    1952    2.5       1183    17.9         1183    12.1
prim2     3014    6.7       1404    34.8         1243    22.1
s9234     5866    1.0       3203    36.7         3177     9.7
biomed    6514    4.5       2142    87.8         1853    30.5
s13207    8772    1.0       2132    73.9         1994    14.1
s15850   10470    1.0       2279    84.9         2137    17.6
s35932   18148    1.0       5562   420.8         2943    32.1
s38584   20995    1.0       5139   565.1         4242    44.5
avq.sm   21918    4.5       8477  1287.3         8309   116.1
s38417   23949    1.0       5906   452.2         5295    45.1
avq.lg   25178    4.5       9103  1473.2         8658    90.2
TOTAL                      48033  4543.9        42494   449.7

Two-Level Partitioning: LSR/MFFS Partitioning Algorithm

• Overview of the algorithm:
  – cluster the netlist using the MFFS algorithm
  – partition the clustered netlist with the LSR algorithm
  – de-cluster the netlist
  – refine the flat netlist with the LSR algorithm


Flow of LSR/MFFS Algorithm

(Flowchart: START → cluster netlist with MFFS → partition netlist with LSR, repeating while there is gain → decluster netlist → refine netlist with LSR, repeating while there is gain → END.)

Cutsize Reduction by MFFS Clustering

• Bipartitioning under 45-55% balance criterion

(Chart: total cutsize for FM, SNT, LR, and LSR, flat vs. MFFS-clustered; reduction relative to flat FM: FM 0.0% / -25.8%, SNT -8.0% / -31.7%, LR -45.6% / -48.4%, LSR -46.9% / -49.9%.)


Cutsize Reduction by MFFS Clustering

                No Clustering               MFFS Clustering
NAME       FM    SNT    LR    LSR       FM    SNT    LR    LSR
s1423      12     14    13     13       12     12    12     12
sioo       25     25    25     25       25     25    25     25
s1488      48     45    44     42       42     42    43     43
balu       27     27    27     27       27     27    27     27
prim1      43     42    42     42       43     42    42     43
struct     42     45    33     33       39     52    33     33
prim2     174    167   121    122      133    122   131    119
s9234      49     54    41     43       42     43    41     40
biomed     83     83    83     83      100     84    84     84
s13207     90     74    75     70       91     65    73     61
s15850    113     73    73     65       72     70    44     43
s35932    100    128    45     44       48     76    45     44
s38584     52     52    49     47       52     52    47     47
avq.sm    321    378   135    139      273    172   127    127
s38417    176    112    53     55      119    147    51     50
avq.lg    491    380   145    131      251    168   127    127
TOTAL    1846   1699  1004    981     1369   1261   952    925
% IMPR    0.0    8.0  45.6   46.9     25.8   31.7  48.4   49.9
TIME    121.4   59.4 162.3  138.9     84.2   61.3 113.6   89.4

Cutsize Comparison with Other Methods

(Chart: total cutsize, roughly in the 800-1200 range, for CDIP, PROPf, Straw, hMetis, MLc, L/M, PANZA, Parab, and FBB, grouped into iterative improvement partitioners (IIPs) and non-IIPs; per-circuit numbers are on the next slide.)


Details on Cutsize Comparison

Partitioners: CDIP, PROPf, Straw, hMetis, MLc, L/M (iterative improvement partitioners, IIPs); PANZA, Parab, FBB (non-IIPs)

s1423    12 15 16 13
sioo     25 25 45
s1488    43 44 50
balu     27 27 27 27 27 27 27 41
prim1    52 51 49 50 47 43
struct   36 33 33 33 33 33 33 40
prim2    152 152 143 145 139 119
s9234    44 42 42 40 40 40 40 74 70
biomed   83 84 83 83 83 84 83 135
s13207   70 71 57 55 55 61 66 91 74
s15850   67 56 44 42 44 43 44 91 67
s35932   73 42 47 42 41 44 43 62 49
s38584   47 51 49 47 47 47 47 55 47
avq.sm   148 144 131 130 128 127
s38417   79 65 53 51 49 50 49 49 58
avq.lg   145 143 140 127 128 127
TOTAL1   1023 961 898 872 861 845
TOTAL2   509 516 749
% IMPR   17.4 12.1 5.9 3.1 1.9 1.4 32.0 21.4
TIME     5817 5611 12577 3455 1338 24619

Multi-level Partitioning

• Multi-level paradigm:
  – from coarse-grain to finer-grain optimization
  – successfully used in partial differential equations, image processing, combinatorial optimization, etc., and in circuit partitioning

(Figure: coarsening down to an initial partitioning, then uncoarsening.)


General Framework of Multi-Level Optimization

• Step 1: Coarsening
  – Generate a hierarchical representation of the netlist
• Step 2: Initial Solution Generation
  – Obtain an initial solution for the top-level clusters
  – Reduced problem size: converges fast
• Step 3: Uncoarsening and Refinement
  – Project the solution to the next lower level (uncoarsening)
  – Perturb the solution to improve quality (refinement)
• Step 4: V-cycle
  – Additional improvement is possible from a new clustering
  – Iterate Step 1 (with variation) + Step 3 until no further gain
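Steps 1-3 can be summarized in a schematic Python sketch. Here coarsen_once, initial_partition, refine, and the project method are placeholders for whichever engines are plugged in (e.g. the coarsening and refinement engines listed on the next slides); they are not real library calls.

def multilevel_partition(netlist, coarsen_once, initial_partition, refine,
                         coarsest_size=200):
    # Step 1: coarsening -- build the cluster hierarchy bottom-up
    levels = [netlist]
    while len(levels[-1].nodes) > coarsest_size:
        levels.append(coarsen_once(levels[-1]))

    # Step 2: initial solution on the coarsest (smallest) netlist
    part = initial_partition(levels[-1])

    # Step 3: uncoarsen and refine, from the top level back down to the flat netlist
    for coarse, fine in zip(reversed(levels[1:]), reversed(levels[:-1])):
        part = coarse.project(part, fine)    # uncoarsening: map the solution down
        part = refine(fine, part)            # refinement: e.g. an FM/LR pass
    return part

Step 4 (the V-cycle) wraps this flow again, re-coarsening under the restriction that no new cluster crosses the current cutline.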

V-cycle Refinement

• Motivation
  – Post-refinement scheme for multi-level methods
  – A different clustering can give additional improvement
• Restricted Coarsening
  – Requires an initial partitioning
  – Do not merge clusters in different partitions
  – Maintains the cutline: cutsize degradation is not possible
• Two Strategies: V-cycle vs. v-cycle
  – V-cycle: start from the bottom level
  – v-cycle: start from some middle level
  – Tradeoff between quality and runtime


Application in Partitioning

• Multi-level Partitioning
  – Coarsening engine (bottom-up)
    • Unrestricted and restricted coarsening
    • Any bottom-up clustering algorithm can be used
    • Cutsize oriented (MHEC, ESC) vs. delay oriented (PRIME)
  – Initial partitioning engine
    • Move-based methods are commonly used
  – Refinement engine (top-down)
    • Move-based methods are commonly used
    • Cutsize oriented (FM, LR) vs. delay oriented (xLR)
• State-of-the-art Algorithms
  – hMetis [DAC97] and hMetis-Kway [DAC99] for cutsize
  – HPM [DAC00] for delay

hMetis Algorithm

• Multilevel partitioning algorithm [DAC97]
  – Contributions: 3 new coarsening schemes for hypergraphs
  – Best bipartitioning results; remains very competitive today

(Figure: the original graph and its edge coarsening.)

Edge Coarsening = heavy-edge maximal matching:
  1. Visit the vertices randomly
  2. Compute edge weights (= 1/(|n| − 1)) for all unmatched neighbors
  3. Match with an unmatched neighbor via the maximum edge weight
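A sketch of the matching step described above (an illustration of heavy-edge maximal matching on a hypergraph, not the hMetis source):

import random
from collections import defaultdict

def edge_coarsening_matching(vertices, nets):
    """Each size-r net contributes 1/(r-1) to every vertex pair it connects; match
    each unmatched vertex with its heaviest unmatched neighbor."""
    w = defaultdict(float)
    neighbors = defaultdict(set)
    for net in nets:
        if len(net) < 2:
            continue
        for u in net:
            for v in net:
                if u != v:
                    w[(u, v)] += 1.0 / (len(net) - 1)
                    neighbors[u].add(v)

    matched = {}                            # vertex -> its match
    order = list(vertices)
    random.shuffle(order)                   # visit vertices randomly
    for u in order:
        if u in matched:
            continue
        candidates = [v for v in neighbors[u] if v not in matched]
        if candidates:
            v = max(candidates, key=lambda x: w[(u, x)])
            matched[u] = v
            matched[v] = u
    return matched   # each matched pair becomes one vertex of the coarser hypergraph

print(edge_coarsening_matching(["a", "b", "c", "d"], [{"a", "b", "c"}, {"c", "d"}]))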


hMetis Algorithm (cont)

Coarsening schemes for hypergraphs (cont’d)

(Figure: hyperedge coarsening and modified hyperedge coarsening.)

Hyperedge Coarsening = independent hyperedge merging:
  1. Sort the hyperedges in non-decreasing order of their size
  2. Pick a hyperedge with no merged vertices and merge it

Modified Hyperedge Coarsening = Hyperedge Coarsening + post-processing:
  1. Perform Hyperedge Coarsening
  2. Pick a non-merged hyperedge and merge its non-merged vertices

hMetis-Kway Algorithm

• Multiway partitioning algorithm [DAC99]
  – New coarsening: First Choice (a variant of Edge Coarsening)
    • Can match with either unmatched or matched neighbors
  – Greedy refinement
    • On-the-fly gain computation
    • No bucket: not necessarily the max-gain cell moves
    • Saves time and space

(Figure: the original graph and its First Choice coarsening.)


hMetis Results

• Bipartitioning on the ISPD98 benchmark suite

(Chart: scaled cutsize; FM 1.61, LR 1.21, LR/ESC 1.03, hMetis 1.)

hMetis-Kway Results

• Multiway partitioning on the ISPD98 benchmark suite

(Chart: scaled cutsize for 2-, 8-, 16-, and 32-way partitioning, comparing hMetis-Kway, KPM/LR, and LR/ESC-PM.)


Summary

Partitioning is very important for applying the divide-and-conquer methodology (for complexity management)

Partitioning also defines global/local interconnects and greatly impacts circuit performance

Increasing emphasis on interconnect design has introduced many new partitioning formulations

Clustering is effective in reducing circuit size and identifying natural circuit hierarchy

Multi-level circuit clustering + iterative improvement based methods produce the best partitioning results (so far)