Algorithms for Simultaneous Consideration of Multiple Physical Synthesis Transforms for Timing Closure

Huan Ren and Shantanu Dutt
Dept. of Electrical and Computer Engineering, University of Illinois Chicago

Outline

• Problem formulation & prior work
• Network flow model
• Methodology Flow
• Discretization Requirements
• Structures for Accurate Objective Function Cost
• Simultaneous Detailed Placement: A Holistic Approach!
• Experimental Results
• Conclusions

Problem Statement

Simultaneously apply a given set T of synthesis and replacement transforms to cells and nets on critical paths of an initially placed circuit to improve circuit delay near-optimally while satisfying area constraints.

For the current expts, T = {cell resizing, replication, replacement, type-1 & type-2 buffer insertions}

Critical paths (CP) = paths with delay > (1-α) fraction of circuit delay. We choose α=0.1.

Timing objective function [Dutt et al., ICCAD’06]

\[ F_t = \sum_{n_i \in CP} \; \sum_{u_j \in CS(n_i)} \frac{D(u_j, n_i)}{S_a(n_i)} \]

• CS(n_i): critical sinks of n_i in CP
• D(u_j, n_i): delay of n_i at sink u_j
• S_a(n_i): allocated slack of n_i, which is the path slack of the most critical path through the net divided by the number of nets in the path
• Allows exponential magnification of the timing function for critical nets in order to approximate min. of the max net timing function ~ min. delays in CP
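A minimal sketch of how the timing objective described above could be evaluated, assuming hypothetical toy data (the dictionary keys and all numbers are illustrative, not from the slides). It uses the plain ratio D/S_a and leaves out any magnification exponent.

```python
def timing_objective(nets):
    """Sum D(u_j, n_i) / S_a(n_i) over the critical sinks of every net in CP."""
    total = 0.0
    for net in nets:
        # Allocated slack S_a(n_i): path slack of the most critical path
        # through the net divided by the number of nets on that path.
        s_a = net["path_slack"] / net["nets_on_path"]
        for delay in net["critical_sink_delays"]:  # D(u_j, n_i), u_j in CS(n_i)
            total += delay / s_a
    return total

# Hypothetical nets on near-critical paths (positive slack assumed).
nets_in_cp = [
    {"path_slack": 2.0, "nets_on_path": 4, "critical_sink_delays": [1.0, 1.5]},
    {"path_slack": 6.0, "nets_on_path": 3, "critical_sink_delays": [2.0]},
]
f_t = timing_objective(nets_in_cp)
```

The first net has the smaller allocated slack, so its sinks dominate the sum, which is the intended bias toward critical nets.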

Post-placement Incremental Physical Synthesis

Why necessary?
• Wire load estimation is very inaccurate prior to placement
• Leaves large room for improvement

Various transforms
• Cell sizing: effective for improving timing
  • Continuous sizing [Fishburn et al., ICCAD’85] and discrete sizing [Hu et al., DAC’07], [Ren et al., IWLS’08]
  • Options: different cell sizes available in the library (s options for s sizes)
• Incremental global placement
  • Re-place a subset of cells targeting the metric of interest for design closure [Dutt et al., ICCAD’06], [Wonjoon et al., ICCAD’03]
  • Transform options:
    • Remain in the position of the initial placement
    • Move to the new position determined in an incremental global placement process

Various Transforms (continued)
• Buffer insertion
  • Usually associated with routing tree generation
  • Can be estimated after placement using two different types of buffers [Jiang et al., TVLSI’98]
  • Transform options for each buffer type:
    • Do not insert any buffer
    • Insert a buffer with one of the different sizes available in the library (s options for s sizes)

[Figure: a driver D and its sinks S, shown with a driving buffer (type 1) and with an isolating buffer (type 2); sinks are labeled critical and non-critical.]

Various Transforms (continued)
• Cell replication
  • Can both improve drive capability and isolate sinks
  • Need to partition sinks between the two drivers [Srivastava et al., TVLSI’04], [Lillis et al., ISCAS’96]
  • Transform options:
    • Do not replicate a cell
    • Replicate a driver cell with one of several possible partitions of the sink cells among the two replicas (k options for k partitions)

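[Figure: a driver D with three sinks S before replication, and the same sinks partitioned between two replicas of D after replication.]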

Combining Multiple Synthesis Transforms: Past Work
• Usually timing-driven
• Most methods simply apply them sequentially
• Transforms are not unified
• [Donath et al., DATE’00]
  • Incorporating different synthesis transforms at different partition levels in a partition-based placement
  [Figure: from coarse partition level to detailed partition level, with TD pl adjustment, cell resizing, and replication and buffering applied along the way.]
• [Jiang et al., TVLSI’98]
  • Considers both cell resizing and buffer insertion
  • Dynamic sequencing but greedy: chooses the transform with the largest delay-improvement to area-increase ratio for a net/cell each time
  • Can be trapped in local optima
  • Hard to handle other transforms (e.g., incremental placement, which causes no area increase)
• Our method: simultaneous, unified transforms
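A minimal sketch of the greedy ratio-based selection step described above for prior work, assuming hypothetical candidate transforms and numbers. It is an illustration of the selection rule only, not the cited authors' implementation.

```python
def greedy_pick(candidates):
    """Pick the candidate with the largest delay-improvement / area-increase ratio."""
    # The ratio is undefined when a transform adds no area, so such
    # candidates are skipped here (an assumption of this sketch).
    usable = [c for c in candidates if c["area_increase"] > 0]
    return max(usable, key=lambda c: c["delay_gain"] / c["area_increase"])

# Hypothetical candidates for one greedy step.
candidates = [
    {"name": "resize u", "delay_gain": 3.0, "area_increase": 2.0},
    {"name": "buffer n1", "delay_gain": 2.0, "area_increase": 1.0},
    {"name": "re-place v", "delay_gain": 4.0, "area_increase": 0.0},
]
choice = greedy_pick(candidates)
```

In this toy case the re-placement candidate has the largest delay gain but no defined ratio, which is one way to read the remark that zero-area transforms are hard to handle with this rule.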

Network Flow Model

An example: a simple transform selection graph (TSG) for one net
• Nodes: transform options for each net (& its cells)
• Arcs: those in complete bipartite graphs between transform option sets for a net, so all combinations are available as flow paths
• Flow has a binary meaning: flow through a node means the option for that node is selected
• Flow also has a quantitative meaning: in constraint satisfaction problems, flow amount = constraint metric value = (in our case) sizes of selected options
• Flow cost is equal to the timing objective function value with the selected options
• Timing-optimal transform options = the min-cost flow

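[Figure: TSG for net n_i with cells u, v, w. Between nodes D and G, option nodes 1 and 2 of Ores(u) are joined to option nodes 1 and 2 of Ob1 by a complete bipartite graph; one arc is annotated with the TD function value for the choice of options 1 (res), 2 (b1).]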
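A minimal sketch of the idea that each flow path through a small TSG is one transform combination and the cheapest path gives the selected options. The option sets and cost table are hypothetical, and brute-force enumeration stands in for a min-cost flow solver.

```python
from itertools import product

# Hypothetical option sets for one net: resizing of u and a type-1 buffer.
options = {"res(u)": [1, 2], "b1": [1, 2]}
# Hypothetical TD function value on each arc of the complete bipartite graph.
arc_cost = {(1, 1): 9.0, (1, 2): 6.5, (2, 1): 7.0, (2, 2): 8.0}

def best_combination(options, arc_cost):
    """Enumerate every path of the complete bipartite graph and keep the cheapest."""
    names = list(options)
    combos = product(*(options[n] for n in names))
    best = min(combos, key=lambda combo: arc_cost[combo])
    return dict(zip(names, best)), arc_cost[best]

selection, cost = best_combination(options, arc_cost)
```

Enumeration is only workable for a toy graph; the point is the one-to-one match between paths and transform combinations.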

Overall Model

Mini-TSG is constructed for each net in CP (net structures)

If two nets have common cells, their net structures are connected by a spanning structure.

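[Figure: source S feeds the net structures N1, N2, N3, N4 built for nets n1, n2, n3, n4; spanning structures connect net structures that share cells, and flow continues through the DPG to sink T.]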

Flows indicating selected cell sizes and positions are sent to the DPG to perform detailed placement

Detailed placement “cost” is also considered when selecting options to reach an overall near-optimal soln

Methodology Flow
1) Determine the set CP of near-critical paths = {paths w/ delays >= (1-α) × (critical path delay)}
2) Determine transform options from trans. set T for every net in CP (from library or using known algorithms, e.g., for replication)
3) Construct the transform selection graph (TSG) and couple it with the detailed placement graph (DPG) [Dutt et al., ICCAD’06]
4) Determine F- (obj) and C- (discretization) costs for arcs in the TSG
5) Determine min-cost flow through TSG + DPG using the “concave-cost” min-cost method of [Kim & Pardalos, OR Letters, ’99]
6) Determine transforms across all cells & nets in CP and their legalized detailed placement from the above flow
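A minimal sketch of the near-critical path selection used in the flow above, assuming a hypothetical table of path delays. The function name and data are illustrative; α = 0.1 follows the value chosen earlier in the slides.

```python
def near_critical_paths(path_delays, alpha=0.1):
    """Return the paths whose delay is at least (1 - alpha) times the largest path delay."""
    threshold = (1.0 - alpha) * max(path_delays.values())
    return {path for path, delay in path_delays.items() if delay >= threshold}

# Hypothetical path delays (arbitrary time units).
path_delays = {"p1": 10.0, "p2": 9.5, "p3": 8.0, "p4": 9.2}
cp = near_critical_paths(path_delays)
```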

Discretization Requirements in the Network Flow Model
• Mutually exclusive arcs (MEAs) for the output arc and/or input arc sets of some nodes: at most one arc in an MEA set can have flow through it
• Hyperarc flow: hyperarcs may be needed in some problems to model k-way dependencies (k > 2). For example, they are needed in our physical synthesis problems to accurately reflect the obj. metric value change caused by flow through the nodes in a hyperarc.

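[Figure: TSG between S and T with option nodes 1 and 2 for Ores(u) and Ob1, showing MEA sets on the arcs of option nodes and a 4-ary hyperarc. Star graph model with only 2 states, the no-flow state and the all-flow state; example flows are marked valid and invalid.]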

Net Structure and F-cost

First attempt: a linear structure
• Product-term-based arc cost
• Order of a product term in the timing objective function is the # of transforms the term is a function of
• E.g., objective func. (linear delay model): \( d(u,v) + d(u,w) = 2cR_dL(n_i) + 2R_dC_v + 2R_dC_w \)

[Figure: linear structure for net n_i with cells u, v, w. A distribution node and a gathering node enclose the option sets Ores(u), Ob1, Ores(v), Ores(w), each with option nodes 1 and 2; insets show the replicas u’ (Orep(u)) and v’ (Orep(v)) and the delays d(u, v) and d(u, w).]
• Rd(Ores(u), Ob1) · Cv(Ores(v), Orep(u), Orep(v)): order 5
• Each flow path is a transform combination
• Set {paths} = Set {transform combos}
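A minimal sketch of how the order of a product term can be counted, assuming each factor is represented by the set of transforms it depends on. The helper name is hypothetical; the dependency sets are those of the order-5 term Rd · Cv above.

```python
def term_order(*factor_dependencies):
    """Order of a product term = number of distinct transforms its factors depend on."""
    return len(set().union(*factor_dependencies))

rd_deps = {"res(u)", "b1"}                # Rd(Ores(u), Ob1)
cv_deps = {"res(v)", "rep(u)", "rep(v)"}  # Cv(Ores(v), Orep(u), Orep(v))
order = term_order(rd_deps, cv_deps)
```

A transform that appears in more than one factor is counted once, since the term is still a function of the same option choice.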

Linear Structure—Issues in Objective Function Cost

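Drawbacks of the linear structure
• Cannot handle terms with order > 2
• Cannot handle terms that depend on two “non-adjacent” transforms
[Figure: linear structure from a supply node to a gathering node through option sets Ox, Oy, Oz with option nodes 1 and 2. A cost such as T(Ox1, Oy2) fits on an arc between adjacent sets; an order-3 term T(Ox, Oy, Oz) placed on the arc from Oy1 to Oz2 leaves the Ox option unknown, T(Ox?, Oy1, Oz2); T(Ox, Oz) has no bipartite graph.]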

Hyperarcs: Accurate Objective Function Cost
• Product-term-based arc cost
• Order of a product term in the timing objective function: the # of transforms the term is a function of
• Ex: simple linear delay model: \( d(u,v) + d(u,w) = 2cR_dL(n_i) + 2R_dC_w + 2R_dC_v \)
• Rd(Ores(u), Ob1) · Cv(Ores(v), Orep(v), Orep(u)): order 5

[Figure: meta-hyperarc H for the above order-5 term of d(u, v), connecting the option sets Ob1, Ores(u), Ores(v), Orep(v), Orep(u). H consists of “combination” hyperarcs such as (i1, j1, k1, l1, m1), (i1, j1, k1, l1, m2), ..., (i2, j2, k2, l2, m2).]
• 2^n hyperarcs, assuming 2 options per transform and order = n
• m^n hyperarcs if there are m options per transform
• Flow needs to select exactly 1 comb. hyperarc
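A minimal sketch of the hyperarc count stated above, assuming each combination hyperarc picks exactly one option per transform. The helper name and the two-option setting are illustrative.

```python
from itertools import product

def combination_hyperarcs(option_counts):
    """One hyperarc per combination of options, with one option chosen per transform."""
    return list(product(*(range(1, m + 1) for m in option_counts.values())))

# The five transforms of the order-5 term, with 2 options each (assumed).
transforms = {"res(u)": 2, "b1": 2, "res(v)": 2, "rep(v)": 2, "rep(u)": 2}
hyperarcs = combination_hyperarcs(transforms)
```

With m options per transform the same enumeration yields m^n tuples, which is why the count grows quickly with the order n.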

Hyperarcs: Star Graph Structure
• Arcs in a network flow graph can only be between two nodes
• Parallel arcs between central transform and parallel transform
• Each parallel arc & the arcs to the regular transform option nodes it represents correspond to one hyperarc

[Figure: star graph for a hyperarc representing an order-3 cost term value T(Ox, Oy, Oz), with a central transform, a parallel transform, and regular transforms. Option node i of Ox and option node j of Oy are joined by m parallel arcs, one per option of Oz (m options), with costs T(Ox_i, Oy_j, Oz_1) through T(Ox_i, Oy_j, Oz_m). Meta star graph: a meta arc stands for the parallel arc sets (multiple arcs) between the multiple option nodes of Ox and Oy. Flow f is valid; flow f’ is invalid.]

MEA Satisfaction via Arc C-costs

• Besides the objective-function-based cost (F-cost), an objective-function-independent C-cost is added
• Total arc cost = F-cost + C-cost (cost is a step function, incurred once for any flow amount)
• Theorem: A min-cost flow with C-costs on MEA arcs ensures MEA satisfaction

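[Figure: MEA sets whose arcs each carry C-cost CΔ+1, comparing a valid flow with the min-cost invalid flow.]
• F-cost diff (invalid - valid) >= -CΔ
• C-cost diff (invalid - valid) >= CΔ+1, since an invalid flow uses at least one more MEA arc
• Total (F+C-cost) diff >= 1, so a valid flow is always cheaper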

Setting the MEA arc cost:
1) Heuristically or randomly select a valid flow & determine its cost C1
2) Obtain a standard min-cost flow of cost C2 w/o discretization constraints
3) Let CΔ = C1 - C2
4) Set MEA arc cost = CΔ+1
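A minimal sketch of the C-cost rule above on hypothetical numbers: C1 is the cost of some valid flow, C2 the unconstrained min-cost, and every MEA arc is charged CΔ + 1 once if it carries flow. The function names are illustrative.

```python
def mea_arc_cost(valid_flow_cost, unconstrained_min_cost):
    """C-cost per MEA arc: C_delta + 1, with C_delta = C1 - C2."""
    return (valid_flow_cost - unconstrained_min_cost) + 1

def total_cost(f_cost, mea_arcs_used, c_cost):
    """F-cost plus one step C-cost for each MEA arc that carries flow."""
    return f_cost + mea_arcs_used * c_cost

c1, c2 = 40, 31                      # hypothetical F-costs
c_cost = mea_arc_cost(c1, c2)        # C_delta + 1
valid = total_cost(c1, 1, c_cost)    # exactly one arc used in the MEA set
invalid = total_cost(c2, 2, c_cost)  # cheapest F-cost, but two arcs used in the set
```

Even when the invalid flow has the lowest possible F-cost, its extra MEA arc makes its total cost exceed that of the valid flow by at least 1.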

Hyperarc-Consistent Flows via Arc C-costs
• Consistent hyperarc flow
  • Idea: only the total capacity of a parallel arc and the arcs to its consistent regular option nodes can be equal to the incoming flow amount f
  • How: use prime numbers
• For k total regular option nodes (across all regular transforms), select k prime numbers p_1 < p_2 < … < p_k such that \( 1/p_1 + \dots + 1/p_k > (p_k - 1)/p_k \)
• Cap of non-para arcs: f(1/p_j)
• Cap of para arcs: f - (cap of its consistent non-para arcs)
• C-cost is proportional to arc capacity: C_unit · cap(e)
  • C_unit = (CΔ+1)/Δcap_min, where Δcap_min = min{cap of invalid arc sets - f}
• Theorem: A min-cost flow with C-costs on star graph arcs ensures hyperarc-consistent flows in star graphs

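[Figure: star graph with option i of Ox, option j of Oy, and options 1 and 2 of Oz, with incoming flow f. The arcs to the Oz options have capacities f(1/3) and f(1/5); the matching parallel arcs have capacities f(1-1/3) and f(1-1/5). A consistent pair has tot cap = f; inconsistent pairs have tot cap < f or tot cap > f.]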
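A minimal sketch of the prime-based capacities described above, using exact fractions. The primes 2, 3, 5 are an illustrative choice that satisfies the stated inequality, and the function names are hypothetical.

```python
from fractions import Fraction

def star_capacities(primes, f=1):
    """Capacities as stated: f/p_j on non-parallel arcs, f - f/p_j on parallel arcs."""
    non_para = [Fraction(f, p) for p in primes]
    para = [f - cap for cap in non_para]
    return non_para, para

def satisfies_condition(primes):
    """Check 1/p_1 + ... + 1/p_k > (p_k - 1)/p_k, where p_k is the largest prime."""
    p_k = max(primes)
    return sum(Fraction(1, p) for p in primes) > Fraction(p_k - 1, p_k)

primes = [2, 3, 5]
non_para, para = star_capacities(primes)
# totals[j][l]: parallel arc j paired with the arc to regular option l.
totals = [[para[j] + non_para[l] for l in range(len(primes))]
          for j in range(len(primes))]
```

Only the diagonal entries (a parallel arc paired with its own regular option) equal f; every mismatched pair lands above or below f because the primes are distinct.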

Discrete Arc Cost
• Total arc cost = F-cost + C-cost (incurred once for any amt of flow), so arc cost is discrete
[Figure: cost c versus flow f on an arc e. Standard linear flow cost: a line of slope cost(e)/cap(e) up to cap(e). Step function cost (concave): cost(e) for any flow up to cap(e).]
• Well-studied NP-hard problem [Kim et al., ORL’99]; we use their min-cost algo.
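A minimal sketch contrasting the two arc cost models described above. The function names and numbers are illustrative assumptions.

```python
def linear_cost(flow, cost, cap):
    """Standard min-cost flow: cost grows with flow at slope cost/cap."""
    return flow * cost / cap

def step_cost(flow, cost):
    """Discrete arc cost: the full cost is incurred once for any positive flow."""
    return cost if flow > 0 else 0.0

half_linear = linear_cost(2, 10.0, 4)  # half the capacity costs half as much
half_step = step_cost(2, 10.0)         # any positive flow costs the full amount
```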

Multiple Cost Terms: Intersecting Hyperarcs & Overlapping Star Graphs
• There is one star graph structure for each term in the objective function
• Option nodes for common transforms between different terms are combined
• Example: consider three transforms: gate sizing (res), replication (rep) and isolating buffer (b2)
• Affected parameters for n_i:
  • Driver R: Rd(Ores(u))
  • WL: Li(Orep(u), Ob2)
  • Sink C: Cv(Ores(v), Orep(v), Orep(u)), Cw(Ores(w), Orep(w), Orep(u))
• Order > 2 terms: 2RdCv (order 4), 2c · RdLi (order 3), 2RdCw (order 4)
[Figure: sub-TSG for net n_i (cells u, v, w) between a distr. node and a gathering node, with option nodes Ores(u), Orep(u), Ob2, Ores(v), Orep(v), Ores(w), Orep(w); a meta arc is shown for the term 2c · RdLi.]
• MEA constraint ensures consistent option selection for common transforms in diff. star graphs
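A minimal sketch of which option nodes the star graphs of the three example terms would share, assuming each term is represented by the set of transforms it depends on (taken from the affected parameters above). The variable and function names are hypothetical.

```python
from itertools import combinations

# Transform dependencies of each order > 2 term for net n_i.
terms = {
    "2RdCv": {"res(u)", "res(v)", "rep(v)", "rep(u)"},
    "2cRdLi": {"res(u)", "rep(u)", "b2"},
    "2RdCw": {"res(u)", "res(w)", "rep(w)", "rep(u)"},
}

def shared_transforms(terms):
    """For each pair of terms, the transforms whose option nodes they have in common."""
    return {(a, b): terms[a] & terms[b] for a, b in combinations(terms, 2)}

shared = shared_transforms(terms)
option_nodes = set().union(*terms.values())  # combined option node sets in the sub-TSG
```

Every pair of terms shares res(u) and rep(u), so those are the option nodes where consistent selection across star graphs has to be enforced.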

Background: Incremental Detailed Placement [Dutt et al., ICCAD’06]

[Figure: three placement rows (Row1: C11, C12, C13, C14; Row2: C21, C22, C24; Row3: C31, C32, C33) with white spaces W1, W2, W21, W3 and a cell A1 to be legalized, next to the corresponding flow graph from Source through the cells and white spaces to Sink.]
• Flow amount: cell movement
• Arcs: possible movement directions
• Arc cost: deterioration of the objective metric caused by the corresponding movement
• Cells to be legalized are connected to the source
• White spaces are connected to the sink
• Flows from the source to the sink perform cell legalization via white spaces
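A minimal sketch of legalization as a flow from source to sink, assuming a hypothetical toy graph with invented arc costs and a single unit of flow with no capacity limits. Under those assumptions the min-cost flow reduces to a cheapest path, found here with Dijkstra's algorithm; the actual DPG of [Dutt et al., ICCAD’06] also accounts for cell sizes and capacities.

```python
import heapq

def cheapest_legalization(arcs, source, sink):
    """Dijkstra over movement arcs; for one unit of flow this is the min-cost flow."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in arcs.get(node, []):
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = d + cost, node
                heapq.heappush(heap, (d + cost, nxt))
    path, node = [sink], sink
    while node != source:  # walk predecessors back to the source
        node = prev[node]
        path.append(node)
    return dist[sink], path[::-1]

# Hypothetical arcs: (neighbor, deterioration of the objective metric).
arcs = {
    "Source": [("A1", 0.0)],
    "A1": [("C12", 1.0), ("C22", 2.5)],
    "C12": [("W1", 3.0)],
    "C22": [("W2", 0.5)],
    "W1": [("Sink", 0.0)],
    "W2": [("Sink", 0.0)],
}
cost, path = cheapest_legalization(arcs, "Source", "Sink")
```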

Simultaneous Detailed Placement & Area Constraint Satisfaction
• Directly send branch flows to the detailed placement network flow graph (DPG) to perform simultaneous detailed placement
• Flow is sent from the replacement option node of a cell to the corresponding position in the DPG
• Flow amount means the selected size of the cell
• Coupling between the flow and the size option nodes is needed: shunting structure
[Figure: shunting structure. Options i and j of Opl(u) lead to positions i and j of u in the DPG. With size option Ores(u), an amount Aj(u) goes to the DPG and a shunting arc carries Amax - Aj(u) to the sink. Arc labels: (Amax, 0), (Amax, 0), (Aj(u), 0).]

Experimental Results—Benchmarks

• Three benchmark sets: TD-Dragon [Yang et al., ICCAD’02], ISCAS’85, TD-IBM
• Available options
  • For cell sizing & type-1, type-2 buffers: 4 options for TD-Dragon and ISCAS’85, and 5 options for TD-IBM
  • For replication: 4 options: 3 replication options with different partitions of sink cells and a no-replication option
  • For replacement: 2 options: a timing-driven position of each cell is calculated using the method in [Dutt et al., ICCAD’06]. A cell can either stay at its original position or be moved to its timing-driven position.
• 3% extra white space is added to initial circuits in TD-IBM, and 10% extra white space is added to circuits in ISCAS’85 and TD-Dragon

Sequential Application of Transforms

• We compare our results to the sequential application of transforms
• Order of transform application matters in sequential application. We tested three different orders:
  1) Decreasing order of ΔT/ΔA (ΔT = 25.92%): replacement → isolating buffer → cell resizing → drive buffer → replication
  2) Decreasing order of ΔT (ΔT = 18.11%): replacement → cell resizing → isolating buffer → replication → drive buffer
  3) Increasing order of ΔA (ΔT = 22.64%): replacement → isolating buffer → drive buffer → cell resizing → replication

TD-ibm benchmarks

Experimental Results

[Figure: % timing imp., Ours vs. Seq, on the TD-IBM circuits td-ibm01, td-ibm02, td-ibm06, td-ibm9, td-ibm14, td-ibm17, td-ibm18 and their average. Avg.: Ours 34.8, Seq 25.9, a difference of 8.9 (34.4% relatively better).]
[Figure: % timing imp., Ours vs. Seq, on the ISCAS’85 circuits C499, C880, C3540, C5315, C7552 and their average. Avg.: Ours 20.4, Seq 12.5, a difference of 7.9 (63.2% relatively better).]

Experimental Results

[Figure: % timing improvement, Ours vs. Seq, on the TD-Dragon circuits Matrix, VP2, MAC32, MAC64 and their average. Avg.: Ours 15.1, Seq 8.8, a difference of 6.3 (71.6% relatively better).]
[Figure: runtime (secs) vs. # of cells in CP, with linear fit y = 4x - 924.]
[Figure: run time (k secs) vs. # of cells (k), with linear fit y = 0.026x + 710.]
• Our run time is about 1.5 times that of the seq. approach
• Linear increase w.r.t. number of cells on CP
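A minimal sketch of how the "relatively better" percentages follow from the reported average timing improvements of the three benchmark sets. The function name is hypothetical; the input averages are the ones reported in the results.

```python
def relative_gain(ours, seq):
    """Absolute and relative advantage of `ours` over `seq` (both in % timing improvement)."""
    diff = ours - seq
    return round(diff, 1), round(100.0 * diff / seq, 1)

# Reported averages: (Ours, Seq).
averages = {
    "TD-IBM": (34.8, 25.9),
    "ISCAS'85": (20.4, 12.5),
    "TD-Dragon": (15.1, 8.8),
}
gains = {name: relative_gain(*pair) for name, pair in averages.items()}
```

The relative figure divides the absolute difference by the sequential result, which is why the smallest absolute gap (TD-Dragon) gives the largest relative gain.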

Conclusions

• A general discretized n/w-flow-based approach to TD post-placement multiple physical synthesis; can handle most transforms in a unified manner
• Considers transform applications simultaneously
• Obtains high-quality solutions; is not trapped in local optima
• Performs simultaneous detailed placement (DP) so that DP cost is considered when selecting transform options
• Reasonable run time, good scalability & high-quality solutions
• Demonstrates the power of using continuous opt. w/ well-structured discretizations
• Applicable to other constrained optimization problems (e.g., power opt. w/ area and timing constraints)
• Future work: (a) application to mixed-cell designs; (b) consider global re-routing as a transform for signal integrity

Thank you