Parallelization by SimPLification: A Case Study in VLSI Placement

Post on 10-Feb-2016


PAPA2011, University of Michigan

Parallelization by SimPLification: A Case Study in VLSI Placement

Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov
Dept. of EECS, University of Michigan


Complexities of Parallel Algorithms & SW
1. Objectives of parallelization
A. Improve completion time by using multiple cores in parallel
B. Improve throughput by using stream processing (latency may increase and become less predictable)
C. Improve power consumption (by decreasing clock rate)
2. Not an objective (a pitfall)
− Come up with a slow algorithm that is easy to parallelize
■ In this talk: how to accomplish 1.A without 2
− Take a leading algorithm and speed up its bottlenecks
− Design a new algorithm that is (a) better, (b) easy to parallelize


CAD Algorithms
■ Sequence of optimizations
− Subject to Amdahl’s law
− The more stages, the harder to parallelize effectively
■ Additional complications
− Elaborate data structures may entail overhead for parallel access
− When processing is light, memory bandwidth may become a bottleneck (with 4+ threads)
■ Recommendations
− A simpler algorithm is often easier to parallelize (fewer stages, simpler data structures)
− Using standard solvers, e.g., linear algebra, helps reuse previous work on parallelization
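The Amdahl's law bullet above can be made concrete with a short calculation. This is a minimal sketch: the function name `amdahl_speedup` and the 60% parallel fraction are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# Hypothetical flow: 60% of the runtime is parallelizable, 40% stays serial.
for threads in (1, 2, 4, 8):
    print(threads, "threads ->", round(amdahl_speedup(0.6, threads), 2), "x")
```

Even with many threads, the serial 40% caps the speedup below 2.5x in this hypothetical case, which is why a flow with many serial stages is hard to parallelize effectively.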


Global Placement: Motivation
■ Interconnect lagging in performance while transistors continue scaling
− Circuit delay, power dissipation and area dominated by interconnect
− Routing quality highly controlled by placement
■ Circuit size and complexity rapidly increasing
− Scalable placement algorithm is critical
− Simplicity, integration with other optimizations

[Figure: interconnect effects, with labels: Unloaded, Coupling, IR drop, RC delay]

Goals in Placement
■ Find good relative ordering of cells
− Minimize wire length and congestion
− Maximize timing slack
■ Find good spacing of cells
− Eliminate wiring congestion problems
− Provide space for post-placement stages
– clock trees
– buffer insertion
– timing correction
■ Find good global position


[Figure sequence: three cells A, B, C shown on successive slides with the captions below]
Optimize Relative Order
To spread ...
... or not to spread
Place to the left
... or to the right
Optimize Relative Order
Without whitespace, placement is dominated by ordering

Example of Global Placement (APlace 2.04 from UCSD)

Example of Global Placement (mFar from UCSB)


Placement Formulation
■ Objective: Minimize estimated wirelength
− Half-perimeter wirelength (HPWL)
− (max X – min X) + (max Y – min Y)
■ Subject to constraints:
− Legality: Row-based placement with no overlaps
− Routability: Limiting local interconnect congestion for successful routing
− Timing: Meeting performance target of a design
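The HPWL objective above is straightforward to compute per net. The sketch below assumes a toy representation (a dictionary of pin positions and nets as lists of cell names); these names and data are illustrative only.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def net_hpwl(pins: List[Point]) -> float:
    """Half-perimeter of the bounding box: (max X - min X) + (max Y - min Y)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets: List[List[str]], positions: Dict[str, Point]) -> float:
    """Sum the half-perimeter wirelength over all nets."""
    return sum(net_hpwl([positions[cell] for cell in net]) for net in nets)


# Hypothetical three-cell design with two nets.
positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
nets = [["a", "b", "c"], ["b", "c"]]
print(total_hpwl(nets, positions))
```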


Quadratic Placement
■ Consider a graph first, not a hypergraph
■ Minimize $\sum_{e_{ij}} \left[(x_i - x_j)^2 + (y_i - y_j)^2\right]$ (the sum is over edges $e_{ij}$)
− Seems unrelated to $\sum_{e_{ij}} \left(|x_i - x_j| + |y_i - y_j|\right)$ but can still be separated into x- and y-components
■ Physical analogy: Hooke’s law
− Consider an elastic spring, stretched by x
− Force $F = -kx$ (k is the spring constant)
− Energy $E = \tfrac{1}{2}kx^2$
− Our goal: minimize the energy of the system

A system of springs will only settle in a minimum
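To illustrate the spring analogy, here is a minimal one-dimensional sketch. Setting the gradient of the quadratic objective to zero places each movable cell at the weighted average of its neighbors, so simple Gauss-Seidel sweeps converge to the minimum. The function name, the pad and cell names, and the choice of Gauss-Seidel are assumptions for illustration; they are not the solver used in the talk.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (node i, node j, spring weight)


def solve_quadratic_1d(movable: List[str], fixed: Dict[str, float],
                       edges: List[Edge], sweeps: int = 200) -> Dict[str, float]:
    """Minimize sum of w * (x_i - x_j)^2 by repeated weighted averaging."""
    x = dict(fixed)
    x.update({cell: 0.0 for cell in movable})
    neighbors: Dict[str, List[Tuple[str, float]]] = {cell: [] for cell in movable}
    for i, j, w in edges:
        if i in neighbors:
            neighbors[i].append((j, w))
        if j in neighbors:
            neighbors[j].append((i, w))
    for _ in range(sweeps):
        for cell in movable:
            # Assumes every movable cell has at least one connection.
            total_w = sum(w for _, w in neighbors[cell])
            x[cell] = sum(w * x[n] for n, w in neighbors[cell]) / total_w
    return {cell: x[cell] for cell in movable}


# Hypothetical chain: pad at 0, cells a and b, pad at 3, unit spring weights.
solution = solve_quadratic_1d(
    ["a", "b"], {"p0": 0.0, "p1": 3.0},
    [("p0", "a", 1.0), ("a", "b", 1.0), ("b", "p1", 1.0)])
print(solution)
```

The cells settle at equal spacing between the two pads, which is the minimum-energy state of three identical springs in series.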


Iterative Optimization


Prior Work
■ Ideal Placer
− Low runtime without sacrificing solution quality
− Simplicity, integration with other optimizations
[Figure: placers plotted by speed vs. solution quality. Quadratic and force-directed: mFAR, Kraftwerk2, FastPlace3. Non-convex optimization: mPL6, APlace2, NTUPlace3. The ideal placer is marked separately.]

Key features of SimPL
■ Flat quadratic placement
■ Primal-dual optimization
− Closing the gap between upper and lower bounds
[Figure: wirelength vs. iteration. After the initial WL optimization, the plot shows the lower-bound solution by linear system solver, the upper-bound solution by look-ahead legalization, the final solution, and the final legal solution.]

Common Analytical Placement Flow
[Flowchart: Placement Instance → Initial WL Optimization → Global Placement → Converge? (no: repeat Global Placement; yes: Legalization and Detailed Placement)]

SimPL Flow
[Flowchart: Placement Instance → Initial WL Optimization (B2B Graph Building → Linear System Solver, repeated until WL converges) → Global Placement (Look-ahead Legalization (Upper-Bound) → Pseudonet Insertion → B2B Graph Building → Linear System Solver (Lower-Bound), repeated until convergence) → Legalization and Detailed Placement]
■ B2B net model [P. Spindler, et al, “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008]
■ We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al, “An Efficient and Effective Detailed Placement Algorithm”, ICCAD2005]

SimPL: Look-ahead Legalization
■ Purpose: Produces almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by linear system solver (Lower-Bound)
■ Identify target region
− Find overflow bin b
− Create a minimal wide enough bin cluster B around b
■ Perform geometric top-down partitioning
− Find cell-area median (Cc) and whitespace median (CB)
− Assign cells (Cc) to corresponding partitions (CB)
■ Non-linear scaling
− Form stripe regions
− Move cells across stripe regions in-order based on whitespace
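The top-down partitioning step can be sketched in one dimension. This is a deliberately simplified illustration, not the SimPL implementation: it assumes a single obstacle-free region (so the whitespace median is just the midpoint), ignores the stripe-based non-linear scaling, and places each leaf cell at the center of its sub-region. The name `spread_1d` and the sample cells are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, float, float]  # (name, x position, width)


def spread_1d(cells: List[Cell], lo: float, hi: float) -> Dict[str, float]:
    """Recursively spread cells over [lo, hi], keeping their left-to-right order."""
    if not cells:
        return {}
    if len(cells) == 1:
        return {cells[0][0]: (lo + hi) / 2.0}
    cells = sorted(cells, key=lambda c: c[1])
    total_area = sum(width for _, _, width in cells)
    # Cell-area median: shortest prefix holding at least half of the cell area.
    area, split = 0.0, 0
    for index, (_, _, width) in enumerate(cells):
        area += width
        if area >= total_area / 2.0:
            split = index + 1
            break
    split = min(split, len(cells) - 1)  # keep both partitions non-empty
    # Whitespace median: with no obstacles, free space is uniform, so use the midpoint.
    cut = (lo + hi) / 2.0
    result = spread_1d(cells[:split], lo, cut)
    result.update(spread_1d(cells[split:], cut, hi))
    return result


# Four unit-width cells clustered near x = 1 in a region spanning [0, 8].
clustered = [("a", 1.0, 1.0), ("b", 1.1, 1.0), ("c", 1.2, 1.0), ("d", 1.3, 1.0)]
spread = spread_1d(clustered, 0.0, 8.0)
print(spread)
```

The overlapping cells end up evenly spaced while keeping the order a, b, c, d, which is the property the slide attributes to look-ahead legalization.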


SimPL: Look-ahead Legalization (1)

Performing geometric top-down partitioning
[Figure: bin cluster (B) around an overfilled bin, divided into B0 and B1, with the cell-area median (Cc) and the whitespace median (CB) marked]

SimPL: Look-ahead Legalization (2)
[Figure: sub-region B0 with its cell-area median (Cc) and whitespace median (CB)]
[Figure: non-linear scaling. Uniform cutlines and obstacle borders form stripes; numbered cells keep their ordering and are repositioned by per-stripe linear scaling around CB.]

SimPL: Look-ahead Legalization (3)
■ Example (adaptec1)

Look-ahead legalization stops when target regions become small enough

SimPL: Using legal locations as anchors
■ Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap
■ Anchors and Pseudonets
− Look-ahead locations used as fixed, zero-area anchors
− Anchors and original cells connected with 2-pin pseudonets
− Pseudonet weights grow linearly with iterations
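The effect of a growing pseudonet weight can be seen on a single cell. The sketch below assumes one cell pulled by one ordinary connection and one pseudonet to its anchor; the function name and the weight schedule (0.5 times the iteration number) are hypothetical and chosen only to show the linear growth.

```python
def place_with_anchor(neighbor_x: float, net_weight: float,
                      anchor_x: float, pseudo_weight: float) -> float:
    """Minimize net_weight*(x - neighbor_x)^2 + pseudo_weight*(x - anchor_x)^2."""
    return (net_weight * neighbor_x + pseudo_weight * anchor_x) / (net_weight + pseudo_weight)


# The cell is connected to a neighbor at x = 0; look-ahead legalization put it at x = 10.
trajectory = []
for iteration in range(1, 6):
    pseudo_weight = 0.5 * iteration  # hypothetical schedule, linear in the iteration count
    trajectory.append(place_with_anchor(0.0, 1.0, 10.0, pseudo_weight))
print([round(x, 2) for x in trajectory])
```

As the pseudonet weight grows, the lower-bound position drifts from the wirelength-optimal location toward the legalized anchor, which is how the gap between the two bounds closes.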


Next illustration: Tug-of-war between low-wirelength and legalized placements

SimPL Iterations on Adaptec1 (1)
[Figure panels: Iteration=0 (Init WL Opt.), Iteration=1 (Upper Bound), Iteration=2 (Lower Bound), Iteration=3 (Upper Bound)]

SimPL Iterations on Adaptec1 (2)
[Figure panels: Iteration=10 (Lower Bound), Iteration=11 (Upper Bound), Iteration=20 (Lower Bound), Iteration=21 (Upper Bound)]

SimPL Iterations on Adaptec1 (3)
[Figure panels: Iteration=30 (Lower Bound), Iteration=31 (Upper Bound), Iteration=40 (Lower Bound), Iteration=41 (Upper Bound)]

Convergence of SimPL
■ Legal solution is formed between two bounds

Empirical Results: ISPD05 Benchmarks
■ Experimental setup
− Single-threaded runs on a 3.2GHz Intel Core i7 Quad CPU Q660 Linux workstation
− HPWL is computed by GSRC Bookshelf Evaluator
■ < 5000 lines of code in C++, including CG solver for sparse linear systems (with Jacobi preconditioner)
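For readers unfamiliar with the solver mentioned above, here is a minimal sketch of conjugate gradient with a Jacobi (diagonal) preconditioner. It uses a dense list-of-lists matrix for brevity and is written in Python rather than the C++ of the actual placer; the function names and the 2x2 test system are assumptions for illustration.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(A: List[Vector], v: Vector) -> Vector:
    return [dot(row, v) for row in A]


def jacobi_pcg(A: List[Vector], b: Vector, tol: float = 1e-10,
               max_iter: int = 1000) -> Vector:
    """Solve A x = b for symmetric positive definite A, preconditioned by diag(A)."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                 # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]  # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


# System for a chain pad(0) - a - b - pad(3) with unit weights: the solution is a=1, b=2.
x = jacobi_pcg([[2.0, -1.0], [-1.0, 2.0]], [0.0, 3.0])
print(x)
```

The dominant cost per iteration is the matrix-vector product, which is why the later slides focus on parallelizing SpMxV.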


Speeding Up Placement Using Parallelism
■ SimPL has very few components (5KLOC)
■ Each bottleneck is amenable to some form of parallelism
− Thread-level
− Instruction-level
[Figure: runtime profile. Initial placement 8%, CG solver 31%, Sparse matrix and B2B net modeling 8%, Look-ahead legalization 14%, Pseudo-net insertion 1%, Post Global Placement 38%, IO 0%]

Parallelism in Conjugate Gradient Solver
■ Coarse-grain row partitioning
− Implemented using OpenMP 3.0 compiler directives
■ SSE2 (Streaming SIMD Extensions) instructions
− Process four data elements with a single instruction
− Marginal runtime improvement in SpMxV
■ Reducing memory bandwidth demand of SpMxV
− CSR (Compressed Sparse Row) format [Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM 2003]
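The combination of CSR storage and coarse-grain row partitioning can be sketched as follows. This is an illustration of the structure only: it uses Python threads instead of OpenMP, and because of the interpreter lock it is not expected to run faster than a serial loop. The helper names and the 3x3 matrix are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def to_csr(dense: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Store only the nonzeros: values, their column indices, and row start offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0.0:
                values.append(value)
                col_idx.append(col)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_rows(values, col_idx, row_ptr, x, row_begin, row_end) -> List[float]:
    """Multiply rows [row_begin, row_end) of the CSR matrix by x."""
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(row_begin, row_end)]


def parallel_spmv(values, col_idx, row_ptr, x, n_threads: int = 2) -> List[float]:
    """Give each thread one contiguous block of rows; blocks write disjoint outputs."""
    n_rows = len(row_ptr) - 1
    chunk = max(1, (n_rows + n_threads - 1) // n_threads)
    ranges = [(s, min(s + chunk, n_rows)) for s in range(0, n_rows, chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda r: spmv_rows(values, col_idx, row_ptr, x, *r), ranges)
        return [y for part in parts for y in part]


values, col_idx, row_ptr = to_csr([[2.0, -1.0, 0.0],
                                   [-1.0, 2.0, -1.0],
                                   [0.0, -1.0, 2.0]])
y = parallel_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)
```

Row blocks are independent, so no locking is needed on the output vector; that independence is what makes coarse-grain row partitioning attractive.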


Parallelism in CG Solver - Example


Parallelism in B2B Model Update
■ B2B net model update
− B2B model is separable
− Can process the x and y cases in parallel
− Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads
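A rough sketch of this decomposition follows. It assumes the usual reading of the bound-to-bound model, in which each net is replaced per dimension by connections from its two boundary pins to every other pin; edge weights are omitted, nets are assumed to have at least two pins, and all function names are hypothetical. Python threads are used only to show the task structure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Net = List[str]


def b2b_edges_1d(net: Net, coord: Dict[str, float]) -> List[Tuple[str, str]]:
    """Connect the two boundary pins to each other and to every inner pin."""
    pins = sorted(net, key=lambda cell: coord[cell])
    lo, hi = pins[0], pins[-1]
    edges = [(lo, hi)]
    for inner in pins[1:-1]:
        edges.append((lo, inner))
        edges.append((inner, hi))
    return edges


def split_equal(items: List[Net], n_groups: int) -> List[List[Net]]:
    """Round-robin split so group sizes differ by at most one."""
    return [items[g::n_groups] for g in range(n_groups)]


def build_b2b(nets: List[Net], xs: Dict[str, float], ys: Dict[str, float],
              n_threads: int = 2):
    groups = split_equal(nets, n_threads)

    def work(task):
        group, coord = task
        return [edge for net in group for edge in b2b_edges_1d(net, coord)]

    # The x and y models are independent, so they become separate tasks.
    tasks = [(g, xs) for g in groups] + [(g, ys) for g in groups]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(work, tasks))
    x_edges = [e for part in results[:len(groups)] for e in part]
    y_edges = [e for part in results[len(groups):] for e in part]
    return x_edges, y_edges


xs = {"a": 0.0, "b": 5.0, "c": 2.0}
ys = {"a": 3.0, "b": 1.0, "c": 0.0}
x_edges, y_edges = build_b2b([["a", "b", "c"], ["a", "c"]], xs, ys)
print(x_edges)
print(y_edges)
```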


SSE optimization affects Runtime Profile
[Figure: two runtime-profile pie charts. The second chart matches the profile shown earlier, so it is presumably the profile before SSE optimization.]

Component                            First chart   Second chart
Initial placement                    5%            8%
CG solver                            19%           31%
Sparse matrix and B2B net modeling   10%           8%
Look-ahead legalization              18%           14%
Pseudo-net insertion                 1%            1%
Post Global Placement                46%           38%
IO                                   1%            0%

Parallelism in Look-ahead Legalization (1)
■ Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime
■ Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization
− Top-down partitioning generates an increasing number of subtasks of similar sizes which can be solved in parallel
− After each level of T&N on a bin cluster, each thread generates two sub-clusters with similar numbers of cells


Parallelism in Look-ahead Legalization (2)
■ LAL keeps a global queue of bin clusters Q
■ Static partitioning
− Assign initial bin clusters to available threads such that each thread has a similar number of bin clusters to start
■ Subtask updates
− Thread ti processes one of two sub-clusters (for the next level of T&N); the remainder is added to the global cluster queue Q
■ Dynamic task scheduling
− When thread ti is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved N = max(Q.size()/N_threads, 1)
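The scheduling scheme above can be sketched with a lock-protected queue. In this illustration a bin cluster is reduced to an integer size, and "one level of T&N" is replaced by halving that size; both are stand-ins, as are the function names and the `pending` counter used to detect termination. It assumes `min_size` is at least 1.

```python
import threading
import time
from collections import deque
from typing import List


def batch_size(queue_len: int, n_threads: int) -> int:
    """Number of clusters an idle thread retrieves: max(Q.size()/N_threads, 1)."""
    return max(queue_len // n_threads, 1)


def run_lal_tasks(initial: List[int], n_threads: int, min_size: int) -> List[int]:
    queue = deque()                # global cluster queue Q
    lock = threading.Lock()
    leaves: List[int] = []
    pending = [len(initial)]       # clusters not yet reduced to leaves

    def worker(local: List[int]) -> None:
        while True:
            if not local:          # idle: retrieve a batch from Q
                with lock:
                    if pending[0] == 0:
                        return
                    n = min(batch_size(len(queue), n_threads), len(queue))
                    local = [queue.popleft() for _ in range(n)]
                if not local:
                    time.sleep(0.001)   # other threads may still enqueue work
                    continue
            cluster = local.pop()
            while cluster > min_size:
                half = cluster // 2
                with lock:         # the remainder goes to the global queue
                    queue.append(cluster - half)
                    pending[0] += 1
                cluster = half     # keep processing one sub-cluster
            with lock:
                leaves.append(cluster)
                pending[0] -= 1

    # Static partitioning: round-robin assignment of the initial clusters.
    threads = [threading.Thread(target=worker, args=(list(initial[t::n_threads]),))
               for t in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return leaves


leaves = run_lal_tasks([8, 8], n_threads=2, min_size=2)
print(sorted(leaves))
```

Retrieving a batch proportional to the queue length keeps lock traffic low when many clusters are waiting, while the lower bound of one cluster lets idle threads pick up work as soon as any appears.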


Empirical Results – Overall Speed-ups
■ Experimental setup
− Multithreaded runs on an 8-core AMD-based system with four dual-core CPUs and 16GByte RAM
− Each CPU was an Opteron 880 processor running at 2.4GHz with 1024KB cache


Empirical Results – Component Speed-ups


Extending the Routability-driven Placement
■ Ongoing work: simultaneous place-and-route

Simultaneous Place-and-Route
■ After Look-Ahead Legalization (LAL) perform Look-Ahead Routing (LAR)
− Integrate an in-house router through clean API
− Cell locations in, accurate congestion maps out
− The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work)
■ ISPD 2011 contest organized by IBM Research
− New, large benchmarks
− Placements evaluated by a common global router


SimPL SimPLR
■ Key metric is #overflows (OF)
■ Also shown – routed WL (RtWL)

Conclusions
■ New flat quadratic placement algorithm: SimPL
− Novel primal-dual based approach
− Amenable to integration with physical synthesis
■ Self-contained, compact implementation
− Fastest among available academic placers
− Highly competitive solution quality
− Amenable to parallelism
− Easy to extend to simultaneous place-and-route

Questions and Answers

Thank you! Time for Questions