Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning
![Page 1: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/1.jpg)
Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning
Andrew B. Kahng and Xu Xu, UCSD CSE and ECE Depts.
Work supported in part by MARCO GSRC
![Page 2: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/2.jpg)
Outline
• Motivation
• Performance driven bipartition problem
• New bipartitioning algorithm
• Experimental results
• Conclusion and future work
![Page 3: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/3.jpg)
Partitioning and Performance
The goal of traditional hypergraph partitioning is to minimize cutsize.
To meet the performance requirement of current designs, we need a performance-driven partitioner, which considers both cutsize and delay.
![Page 4: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/4.jpg)
Previous Work (I)
• [Cong et al. ISPD-2002]
– Global clustering based algorithm with retiming
– Reduces delay by 16% while increasing cutsize by 17%
– Requires substantial gate replication
[Figure: flow diagram with boxes "Min-delay clustering w/ retiming", "De-clustering and refinement", and "Min-cutsize clustering"]
![Page 5: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/5.jpg)
Previous Work (II)
• [Ababei et al. ICCAD-2002]
– Reweighting based method (path based or net based)
– 14% reduction of delay with 10% increase in cutsize
– 139% increase in runtime compared with hMetis
[Figure: flow diagram with stages "Input", "Global timing analysis", "Find critical paths", "Reweighting", and "Cutsize oriented partitioner, such as hMetis, MLPart"]
![Page 6: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/6.jpg)
Motivating Questions
• Can we avoid global timing analysis?
– Global timing analysis is extremely time-consuming
• Can we improve path delay without significantly degrading cutsize?
– Need smooth tradeoff between delay and cutsize
• Can we reduce implementation overheads?
– Previous methods store thousands of critical paths and continuously update them
![Page 7: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/7.jpg)
Outline
• Motivation
• Performance driven bipartition problem
• New bipartitioning algorithm
• Experimental results
• Conclusion and future work
![Page 8: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/8.jpg)
Delay Model
Delay = hop_delay + node_delay
[Figure: a path between Part 0 and Part 1 through FF nodes and combinational nodes; each crossing of the cut is a hop]
• [Cong et al. ISPD-2002]: hop_delay = 5, node_delay = 1; Delay = 3×5 + 5×1 = 20
• [Ababei et al. ICCAD-2002]: hop_delay = Elmore delay, node_delay = constant
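The arithmetic in the Cong et al. example can be checked with a short sketch. It assumes the path delay is the number of hops times hop_delay plus the number of nodes times node_delay, which is how the slide's 3×5 + 5×1 reads; the function name `path_delay` and its parameter names are illustrative only.

```python
def path_delay(num_hops, num_nodes, hop_delay, node_delay):
    """Delay of one path when every hop and every node adds a fixed cost."""
    return num_hops * hop_delay + num_nodes * node_delay


# Values from the Cong et al. example on this slide: 3 hops, 5 nodes.
print(path_delay(num_hops=3, num_nodes=5, hop_delay=5, node_delay=1))  # 20
```

Under the Ababei et al. model the per-hop cost would be an Elmore delay rather than a constant, so this constant-cost sketch covers only the first model.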
![Page 9: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/9.jpg)
Performance Driven Bipartition Problem
Given:
• Hypergraph H=(V,E)
• Area balance tolerance s (0<s<1), a parameter to control allowable slack in the area constraint
• α, a given parameter which captures the tradeoff between cutsize and path delay (hopcount)
Find: A bipartition (V0|V1) which satisfies the area balance constraint and minimizes α·(cutsize) + (1−α)·(Max_hopcount)
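As a rough illustration of the objective, the sketch below evaluates the weighted sum of cutsize and maximum hopcount for a candidate bipartition. The tradeoff parameter is called `alpha` here for illustration, and restricting it to [0, 1] is an assumption of this sketch rather than something the slide states. The numbers in the example are hypothetical.

```python
def objective(cutsize, max_hopcount, alpha):
    """Weighted cost: alpha * cutsize + (1 - alpha) * max_hopcount."""
    if not 0.0 <= alpha <= 1.0:  # assumed range, not given on the slide
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * cutsize + (1.0 - alpha) * max_hopcount


# Hypothetical candidates: (cutsize, max hopcount)
low_cut = (100, 6)
low_hop = (110, 3)
for alpha in (1.0, 0.5, 0.0):
    print(alpha, objective(*low_cut, alpha), objective(*low_hop, alpha))
```

With alpha = 1 only cutsize matters and the first candidate wins; with alpha = 0 only hopcount matters and the second wins.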
![Page 10: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/10.jpg)
Outline
• Motivation
• Performance driven bipartition problem
• New bipartitioning algorithm
• Experimental results
• Conclusion and future work
![Page 11: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/11.jpg)
Unidirectional Partition
• Path delay is minimized with hopcount = 1 if the partition is unidirectional (“acyclic”), that is, all cuts are in the same direction
• Problem:
– High cutsize
– No unidirectional solution
• Can we achieve “locally unidirectional” partition?
[Figure: two example partitions between Part 0 and Part 1, one with max hopcount = 5 and one with max hopcount = 3]
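The unidirectional property can be tested directly on a small graph. This is a minimal sketch that assumes the circuit is modeled as directed two-pin edges (driver, sink) and that `part` maps each node to 0 or 1; both the representation and the function name are illustrative assumptions.

```python
def is_unidirectional(edges, part):
    """True if every cut edge crosses the cutline in the same direction."""
    directions = {(part[u], part[w]) for u, w in edges if part[u] != part[w]}
    return len(directions) <= 1


edges = [("a", "b"), ("b", "c")]
print(is_unidirectional(edges, {"a": 0, "b": 1, "c": 1}))  # True
print(is_unidirectional(edges, {"a": 1, "b": 0, "c": 1}))  # False
```

In the second assignment the path a, b, c crosses the cut in both directions, which is the situation the "locally unidirectional" bias tries to avoid.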
![Page 12: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/12.jpg)
V-Shaped Nodes
V-shaped node: if a combinational node v satisfies:
• there exist vj, vt in the other part, and
• there is a path from vj to vt that includes only v,
then v is a V-shaped node
[Figure: node v in one part, with vj and vt in the other part]
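A V-shaped node test can be sketched as follows, assuming a directed graph given as (driver, sink) edges, a `part` map from node to 0 or 1, and a set `ff_nodes` of flip-flop nodes. These names and the edge-list representation are assumptions made for illustration.

```python
def is_v_shaped(v, edges, part, ff_nodes=frozenset()):
    """v is combinational and has both a driver and a sink in the other part."""
    if v in ff_nodes:
        return False  # only combinational nodes can be V-shaped
    has_pred = any(w == v and part[u] != part[v] for u, w in edges)
    has_succ = any(u == v and part[w] != part[v] for u, w in edges)
    return has_pred and has_succ


edges = [("vj", "v"), ("v", "vt")]
print(is_v_shaped("v", edges, {"vj": 1, "v": 0, "vt": 1}))  # True
print(is_v_shaped("v", edges, {"vj": 1, "v": 1, "vt": 1}))  # False
```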
![Page 13: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/13.jpg)
V-Shaped Nodes in Critical Paths
Empirical observations from study of partitioning solutions:
• there are V-shaped nodes in the partitioning solutions
• every V-shaped node is included in many critical paths
• every critical path contains several V-shaped nodes

For testcase 1:
• Number of nets: 16377
• Number of critical paths: 26772
• On average, one critical path contains 27.6 nodes
• On average, one critical path contains 3.4 V-nodes
• On average, one V-node belongs to 233.7 critical paths
![Page 14: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/14.jpg)
Key Idea: V-Shaped Nodes Elimination
Move V-shaped node “b” to reduce path hopcount

Before moving b:
• PATH: abc hopcount=2
• PATH: dbc hopcount=1
• PATH: ebc hopcount=1

After moving b:
• PATH: abc hopcount=0
• PATH: dbc hopcount=1
• PATH: ebc hopcount=1

[Figure: nodes a, b, c, d, e, f split between Part 0 and Part 1, before and after moving b]
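The hopcounts on this slide can be reproduced by counting part changes along each path. The part assignment below is an assumption inferred from the listed hopcounts (a and c in Part 1, b, d, e in Part 0), since the figure itself is not available; `hopcount` is an illustrative helper name.

```python
def hopcount(path, part):
    """Number of consecutive node pairs on the path that lie in different parts."""
    return sum(1 for u, w in zip(path, path[1:]) if part[u] != part[w])


# Assumed assignment, chosen to match the slide's "before" hopcounts.
part = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 0}
paths = ["abc", "dbc", "ebc"]
print([hopcount(p, part) for p in paths])  # [2, 1, 1]

part["b"] = 1  # move the V-shaped node b to the other part
print([hopcount(p, part) for p in paths])  # [0, 1, 1]
```

Only the path that had b as a V-shaped node improves; the other two paths still cross the cut once, just at a different edge.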
![Page 15: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/15.jpg)
Distance-k V-Shaped Nodes Elimination
Distance-k V-shaped Nodes (Vk Nodes): paths of k combinational nodes with neighbors in the other part.
k = 2: moving the V2 nodes “b, c” reduces path hopcount from 2 to 0
Problems with large k: cutsize may be greatly increased
[Figure: path through nodes a, b, c, d across Part 0 and Part 1, before and after moving b and c]
![Page 16: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/16.jpg)
New Gain Function
[Figure: node v before and after the move]
• g(v): traditional FM gain
• rj(v): reduction of Vj nodes after moving v
Gain(v)=δ(0)+ δ(1)
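One plausible reading of the gain, offered here only as an assumption, is a weighted sum Gain(v) = δ(0)·g(v) + Σ_j δ(j)·r_j(v). The value Gain(v) = δ(0) + δ(1) on the slide is then what this sum gives when g(v) = 1 and r1(v) = 1. The sketch below uses that assumed form; `biased_gain` and its argument layout are illustrative.

```python
def biased_gain(g, r, delta):
    """Assumed form: delta[0] * g(v) + sum over j of delta[j] * r_j(v).

    g     -- traditional FM gain of the node
    r     -- dict mapping distance j to r_j(v)
    delta -- dict mapping 0 and each distance j to its weight
    """
    return delta[0] * g + sum(delta[j] * rj for j, rj in r.items())


# Weights reported with the V1 experiments: delta(0)=1, delta(1)=10.
print(biased_gain(g=1, r={1: 1}, delta={0: 1, 1: 10}))   # 11
# A move that increases cutsize by one but removes a V1 node still scores well.
print(biased_gain(g=-1, r={1: 1}, delta={0: 1, 1: 10}))  # 9
```

Under this reading, a large δ(1) relative to δ(0) lets the removal of a V-shaped node outweigh a small cutsize penalty.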
![Page 17: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/17.jpg)
Distance-k Unidirectional Algorithm
Calculate initial gains for all nodes and store the gains
Select the node v with maximum gain
/* CLIP-like method: move the cluster that v belongs to */
Reset the gains of all nodes to zero
Move v and update the gains of v and its neighbors
While (at least one node has not been moved)
    Select one node v with the maximum updated gain
    Move v and update the related gains
Find the point in the move sequence at which the sum of gains is maximum; undo all moves after this point
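The final step, picking the best point in the move sequence and undoing the rest, can be sketched on its own. This assumes the pass has recorded the moved nodes in order together with the gain realized by each move; `best_prefix` and `rollback` are hypothetical helper names, and ties are broken in favor of the shorter prefix.

```python
from itertools import accumulate


def best_prefix(gains):
    """Return (number of moves to keep, total gain) for the best prefix."""
    best_len, best_sum = 0, 0  # keeping no moves is always allowed
    for i, total in enumerate(accumulate(gains), start=1):
        if total > best_sum:
            best_len, best_sum = i, total
    return best_len, best_sum


def rollback(part, moves, keep):
    """Undo every move after the first `keep` moves by flipping the node back."""
    for v in reversed(moves[keep:]):
        part[v] = 1 - part[v]


moves = ["a", "b", "c", "d", "e"]
gains = [3, -1, 2, -4, 1]                          # gain realized by each move
part = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 1}    # state after all five moves
keep, total = best_prefix(gains)
rollback(part, moves, keep)
print(keep, total, part)
```

Here the prefix sums are 3, 2, 4, 0, 1, so the first three moves are kept and the moves of d and e are undone.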
![Page 18: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/18.jpg)
Outline
• Motivation
• New bipartitioning algorithm
• Experimental results
• Conclusion and future work
![Page 19: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/19.jpg)
Experimental Setup
• Four industry testcases obtained as LEF/DEF
• Model of Ababei et al. (ICCAD-2002) used to calculate delay
• Partitioning solutions compared to results of MLPart
– strongest multilevel netlist partitioning code
– website: http://nexus6.cs.ucla.edu/GSRC/bookshelf/Slots/Partitioning/MLPart
• All tests on 600MHz Intel Pentium-III Xeon
![Page 20: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/20.jpg)
Biasing against V1 Nodes vs. MLPart
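δ(0)=1, δ(1)=10

| Testcase | cutsize (MLPart) | h (MLPart) | delay (MLPart) | time(s) (MLPart) | cutsize (MLPart+V-shaped nodes removal) | h (MLPart+V-shaped nodes removal) | delay (MLPart+V-shaped nodes removal) | time(s) (MLPart+V-shaped nodes removal) |
|---|---|---|---|---|---|---|---|---|
| 1 | 820.7 | 5.3 | 352.8 | 11.79 | 856.1 | 3.3 | 266.8 | 12.58 |
| 2 | 169.9 | 3.5 | 220.7 | 13.45 | 189.8 | 2.5 | 211.2 | 15.32 |
| 3 | 141.3 | 3 | 291.6 | 16.67 | 152.3 | 2.3 | 283.6 | 18.27 |
| 4 | 408.7 | 5.3 | 302.6 | 12.43 | 421.2 | 3.6 | 252.7 | 14.03 |

• Reduction of delay: 4.5%-24.4%, average: 15.1%
• Increase of cutsize: 3.0%-10.0%, average: 4.9%
• Increase of runtime: 6.3%-11.4%, average: 9.7%

Using the delay model in Cong et al. ISPD-2002:
• Reduction of delay: 4.3%-21.2%, average: 14.7%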
![Page 21: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/21.jpg)
Biasing against V2 Nodes vs. MLPart
δ(0)=1, δ(1)=30, δ(2)=3

| Testcase | cutsize (MLPart) | h (MLPart) | delay (MLPart) | time(s) (MLPart) | cutsize (MLPart+Vk=2 nodes removal) | h (MLPart+Vk=2 nodes removal) | delay (MLPart+Vk=2 nodes removal) | time(s) (MLPart+Vk=2 nodes removal) |
|---|---|---|---|---|---|---|---|---|
| 1 | 820.7 | 5.3 | 352.8 | 11.79 | 847.5 | 3 | 262.1 | 13.16 |
| 2 | 169.9 | 3.5 | 220.7 | 13.45 | 183.2 | 2 | 202.5 | 15.67 |
| 3 | 141.3 | 3 | 291.6 | 16.67 | 149.2 | 2 | 275.6 | 18.92 |
| 4 | 408.7 | 5.3 | 302.6 | 12.43 | 416.7 | 3.4 | 243.5 | 14.79 |

• Reduction of delay: 8.9%-30.0%, average: 18.7%
• Increase of cutsize: 3.1%-7.2%, average: 3.5%
• Increase of runtime: 11.9%-15.9%, average: 13.1%

Using the delay model in Cong et al. ISPD-2002:
• Reduction of delay: 8.3%-28.7%, average: 17.3%
![Page 22: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/22.jpg)
Outline
• Motivation
• Performance driven bipartition problem
• New bipartitioning algorithm
• Experimental results
• Conclusions and future work
![Page 23: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/23.jpg)
Conclusions
• Simple yet efficient timing-driven partitioning that does not require global timing analysis
• Negligible implementation and runtime overhead
• Significantly reduces path delay, with cutsize and runtime almost the same as leading-edge MLPart
• Similar improvements observed with different path delay metrics
• Future work
– Impact of new partitioner on placement
– Efficient methods for biasing δ(k), k>2
![Page 24: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/24.jpg)
Thank you!
![Page 25: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/25.jpg)
Future Work
• Impact of new partitioner on placement
• Efficient methods for biasing δ(k), k>2
![Page 26: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/26.jpg)
Why Performance Driven Partitioning?
• Achieving timing closure becomes increasingly difficult in deep-submicron technologies due to non-ideal scaling of interconnect delay
• Routing alone can no longer solve the timing problem, even with aggressive optimizations (buffer insertion, buffer/wire sizing,…)
– Timing needs to be addressed at all design stages
• Partitioning is a critical step in defining interconnect timing properties, but is traditionally driven by cutsize objective
![Page 27: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/27.jpg)
Previous Work (I)
• With Logic Replication
– Retiming
– Replication graph
• Without Logic Replication
– Net based reweighting
– Path based reweighting
![Page 28: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/28.jpg)
FM Partitioning and Gain Function
Gain(v) = Reduction of cutsize after moving v
[Figure: node v before and after the move between Part 0 and Part 1; in the example shown, Gain(v)=-1]
• Start with random partition
• Move the node with the max gain and lock it
• Keep moving until all nodes are locked
• Find the best point in the move sequence
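The FM gain definition can be made concrete with a small sketch. It assumes nets are given as tuples of node names and `part` maps each node to 0 or 1, and it recomputes cutsize from scratch for clarity; production FM implementations update gains incrementally instead. The example netlist is hypothetical and was chosen so that the gain comes out to -1.

```python
def cutsize(nets, part):
    """Number of nets with nodes in both parts."""
    return sum(1 for net in nets if len({part[u] for u in net}) > 1)


def fm_gain(v, nets, part):
    """Reduction of cutsize obtained by moving v to the other part."""
    moved = dict(part)
    moved[v] = 1 - moved[v]
    return cutsize(nets, part) - cutsize(nets, moved)


nets = [("v", "x"), ("v", "y"), ("v", "z")]
part = {"v": 0, "x": 0, "y": 0, "z": 1}
print(fm_gain("v", nets, part))  # -1: one net is uncut, two become cut
```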
![Page 29: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/29.jpg)
Procedure to Calculate rj(v)
Delete all FF nodes and their related edges
In the remaining graph, BFS from v
For each level j from 1 to k
    If v is a Vj node before moving, rj’=1
    If v is a Vj node after moving, rj’’=1
    rj=rj’-rj’’
![Page 30: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/30.jpg)
CLIP Algorithm
[Figure: CLIP move involving node v]
Reminiscent of CLIP (Deng et al. DAC 1996) in how it induces movement of clusters across the cutline.
![Page 31: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/31.jpg)
Distance-k V-Shaped Nodes
Distance-k V-shaped nodes (Vk-node): if k combinational nodes vi,1 … vi,k satisfy:
• vi,1 … vi,k are in the same part
• there exist vj, vt in the other part
• there is a path from vj to vt that passes only through vi,1 … vi,k
then vi,1 … vi,k are distance-k V-shaped nodes
[Figure: nodes vi,1 … vi,k in one part, with vj and vt in the other part]
![Page 32: Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning](https://reader036.fdocuments.in/reader036/viewer/2022070420/56815d9e550346895dcbc483/html5/thumbnails/32.jpg)
Notation
• H(V,E) = circuit hypergraph
• V = set of nodes representing components of the circuit
• E = set of signal nets
• A bipartition (V0|V1) of H(V,E) divides V into two disjoint subsets s.t. V = V0 ∪ V1, which are called Part 0 and Part 1
• A = the total area of all the nodes in V
• A0 = the area of all the nodes in V0