Floorplanning by Annealing
on a Hypercube Architecture
Rajeev Jayaraman
Department of Electrical and Computer Engineering
Carnegie-Mellon University
Pittsburgh, PA 15213
A project report submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Engineering
March, 1987
This research has been funded by the Semiconductor Research Corporation
To my parents.
Table of Contents

Acknowledgements
Abstract
1. Introduction
2. Background and Motivation
  2.1. The Floorplanning Task
  2.2. Floorplanning Methods
    2.2.1. Mincut Techniques
    2.2.2. Rectangular Dualization Techniques
    2.2.3. Simulated Annealing Techniques
  2.3. Optimization and Parallelism in Simulated Annealing
    2.3.1. Serial Optimization
    2.3.2. Parallelism and Parallel Simulated Annealing
    2.3.3. Shared-Memory Implementations
    2.3.4. Hypercube Implementations
  2.4. Motivation for Research
3. Serial Floorplanner
  3.1. Approach to Floorplanning
  3.2. Annealing Algorithm Implementation
    3.2.1. Move Set
    3.2.2. Objective Function
    3.2.3. Annealing Schedule
  3.3. Performance Evaluation of the Serial Algorithm
4. Parallel Floorplanning Algorithms
  4.1. Hypercube Architecture
  4.2. Uncertainty in Parallel Move Evaluation
  4.3. Partitioning Strategy 1: Static Parallel Algorithm
  4.4. Partitioning Strategy 2: Simple Pipeline Algorithm
  4.5. Partitioning Strategy 3: Modified Pipeline Algorithm
  4.6. Comparison of Partitioning Strategies
5. Parallel Implementation
  5.1. Parallel Programming Environment
    5.1.1. iPSC Hardware and Software
    5.1.2. iPSC Interprocessor Communication Mechanisms
  5.2. Parallel Implementation Details
    5.2.1. Efficient Message Passing Patterns
    5.2.2. Message Composition
    5.2.3. Data Structures
  5.3. Debugging
6. Performance Evaluation of Parallel Algorithms
  6.1. Methodology
  6.2. Speedup Results
  6.3. Convergence Results
  6.4. Summary
7. Conclusions
References
List of Figures

Figure 2-1: Mincut Partitioning
Figure 2-2: Polar Graph Representation
Figure 2-3: Slicing Tree Structure
Figure 2-4: Rectangular Dualization
Figure 2-5: The Simulated Annealing Algorithm
Figure 2-6: Variation of the Cost Function During Annealing
Figure 2-7: Heuristic Spanning
Figure 2-8: Multiple-Seed Collusion
Figure 3-1: Move Set for PASHA
Figure 3-2: Center Weighting Function for Overlap Cost Evaluation
Figure 3-3: Comparison of Overlap Penalty Functions
Figure 3-4: Final Floorplans Produced by PASHA
Figure 3-5: MASON and PASHA Solutions for a Non-Slicing Structure
Figure 4-1: Topology for 2, 3 and 4-Dimensional Hypercubes
Figure 4-2: Static Parallel Algorithm on a 3-Dimensional Hypercube
Figure 4-3: Pipeline Algorithm for a 4-Dimensional Hypercube
Figure 4-4: Percolation of Update Information among Pipelines: Lazy Updating
Figure 4-5: Topology for a Modified Pipeline Algorithm on a 3-Dimensional Hypercube
Figure 4-6: Modified Pipeline Algorithm for a 4-Dimensional Hypercube
Figure 5-1: Message Communication Overhead of the Intel iPSC Hypercube
Figure 5-2: Sequence of Messages for a Broadcast Tree
Figure 5-3: Binary Reflected Gray Code and its Topology on a 3-Dimensional Hypercube
Figure 5-4: Bin Data Structure
Figure 6-1: Percentage of Time Spent in Each Move Task
Figure 6-2: Execution Times for the Static Parallel Algorithm
Figure 6-3: Effect of Lazy Updates in the Modified Pipeline Algorithm
Figure 6-4: Total Execution Times for Modified Pipeline Algorithm using Benchmark B
Figure 6-5: Execution Times for Different Pipeline Lengths
Figure 6-6: Variation of Time Taken per Temperature
Figure 6-7: Speedup for Parallel Algorithms
Figure 6-8: Quality of Parallel Solutions
Figure 6-9: Wirelength Variation in a Serial and a Parallel Algorithm
List of Tables

Table 3-1: Comparison between PASHA and MASON
Table 4-1: Comparison of Partitioning Strategies
Acknowledgements
I would like to express my sincere thanks to my research advisor Prof. Rob Rutenbar. His
intellectual insight, penchant for perfection and infectious enthusiasm have been
instrumental in the completion of this work. I wish to place on record my grateful
acknowledgement to Mr. George Dodd of the Computer Science Department at the General
Motors Technical Center, Warren, Michigan, for allowing me to use the Intel iPSC machine at
their premises. My grateful thanks are also due to Mr. Alan Baum and Mr. Don McMillan of
General Motors Technical Center for introducing me to the pleasures and pains of
programming on the iPSC system. I wish to acknowledge the support given by Intel Corp.
in giving me access to an Intel iPSC hypercube here at CMU. I would also like to express
my sincere appreciation to my committee members: Prof. Andrzej Strojwas and Prof. Zary
Segall.
I would like to acknowledge the work done by Dave Bohman in the installation of the
hypercube. I also would like to thank many members of the ECE community, especially
Saul Kravitz and Jim Quinlan for the many fruitful discussions which have subtly moulded
the nature of this work, and Dottie Setliff for suggesting an elegant acronym for this
research effort. And finally, a word of special thanks to all my officemates for the friendly,
fun-filled environment which has been so very conducive to my work.
Abstract
Simulated annealing algorithms for VLSI layout tasks produce solutions of high quality
but are computationally expensive. This thesis examines some parallel approaches to
accelerate simulated annealing using message-passing multiprocessors with a hypercube
architecture. Floorplanning is chosen as a typical application of annealing in physical
design.
Different partitioning strategies which map this annealing algorithm onto a hypercube
architecture are presented. The objective in the design of these partitioning strategies is
to exploit maximum parallelism in the algorithm within the constraints of a message-
passing multiprocessor environment. Besides utilizing the limited parallelism inherent in
individual move evaluations, we also exploit the tolerance of annealing to errors in the
value of the system cost function as seen locally in each processor. To map these
partitioning strategies onto hypercube architectures, optimized message patterns are
developed.
Two parallel algorithms based on these partitioning strategies have been implemented on
a 16 node Intel hypercube. Practical speedups roughly between 4 and 8 have been
obtained on 16 processors for different strategies. The performance and solution quality
of these algorithms is presented and critically analyzed. With respect to solutions
produced by the analogous serial annealing algorithm, it is shown experimentally that the
introduction of uncertainty in the parallel algorithms does not compromise the solution
quality.
Chapter 1
Introduction
VLSI systems are becoming increasingly complex and, consequently, their design times
are also increasing. To enable the designer to complete chips at a faster rate, design
methodologies which result in shorter design cycles are employed. To manage the
complexity of the design of such VLSI systems, most of these design methodologies try to
decompose the entire design into smaller, more easily manageable tasks. Decisions made
early in the design process can profoundly affect the final quality of the design. Hence, it
is very important to predict the implications of early design decisions on the final quality of
the design. Typically, these methodologies stress the need for a hierarchical approach,
and the necessity for high-level planning at the start of the actual design process. A
hierarchical approach enables the designer to understand the implications of early design
decisions more completely, and reduces the possibility of design flaws. This results in
fewer iterations and faster turnaround times.
Physical design is the phase of the IC design process in which the functional design of a
piece of hardware is actually mapped onto the surface of silicon. Layout tasks must try to
optimize layout parameters which directly affect system performance, for example, the
aggregate wirelength, and exact geometric shape of each module. In physical design, the
floorplanning task determines a suitable geometric arrangement for the basic functional
blocks of the system, and perhaps the rough shape of the blocks themselves. The
floorplanner must optimize critical parameters, such as the total estimated areas and
wirelength, in order to ensure the success of subsequent design steps such as placement
and routing.
For our purposes, floorplanning produces a geometric arrangement of the functional
blocks, and a set of possible shapes for each block. Floorplanning, like most physical
design problems, is an NP-hard problem. The complexity of this class of problems grows
exponentially and, therefore, large floorplanning problems may require enormous amounts
of time to determine an optimal solution. For practical reasons, heuristics which
strive to find fast, near-optimal solutions are employed to solve such problems. Such
heuristics differ in the tradeoffs they make between execution time and the optimality of
their solutions. Iterative improvement methods are a class of heuristic methods which
often give good solutions, but very often tend to be slow. In addition, they do not
guarantee convergence to near-optimal solutions. Iterative improvement methods typically
start with some initial solution and iteratively improve, or refine this solution until no further
improvement is possible. This sometimes causes these methods to get stuck in locally
optimal but globally inferior solutions. Thus, the final solution produced by a typical
iterative improvement algorithm may be extremely sensitive to the initial starting solution.
Simulated annealing methods represent an alternative to classical iterative improvement
techniques. Annealing methods, which are also iterative improvement techniques, avoid
one major disadvantage common to most iterative methods: they provide a controlled
mechanism for the system to climb out of local minima to reach global minima. In a variety
of physical design applications simulated annealing algorithms have produced excellent
solutions, but they are almost always computationally very expensive to run. Since the
results of simulated annealing have been very encouraging, there have been various
modifications proposed which try to accelerate the basic serial algorithm. Of primary
interest to us are multiprocessor implementations which try to exploit the inherent
concurrency of annealing algorithms. Different partitioning strategies have been used to
exploit this concurrency by dividing the computation involved in annealing among
cooperating processors. These partitioning strategies are usually specific to the target
machine on which they are to be implemented. Most of the work in parallel implementations
to date has been on shared-memory multiprocessors. The focus of this thesis is the study
of parallel partitioning schemes for annealing algorithms running on message-passing
multiprocessors, in particular, hypercube multiprocessors. These machines differ from
shared-memory machines in that they lack any global, transparently shareable memory; all
synchronization operations and data sharing for parallel computation are done by
messages. One of the main reasons for the attractiveness of message-passing machines
such as hypercube multiprocessors is that they can be incrementally upgraded to larger
systems more easily than many shared-memory machines.
In this project we have implemented a basic simulated annealing algorithm, and then used
this serial algorithm as a vehicle to study parallel algorithm partitioning schemes for
implementation on a hypercube. We have chosen a floorplanning task as a typical
application of simulated annealing in physical design. Although our serial floorplanner
requires some minor extensions and tuning to be of use as a practical tool, it nevertheless
exhibits all salient characteristics of a good application of simulated annealing, and
consequently suffices as a benchmark for our studies of annealing on hypercube
architectures. We examine different partitioning strategies and message passing patterns
which exploit the inherent concurrency of the basic serial algorithm. In particular, the
parallel schemes we propose exploit the tolerance of simulated annealing to errors in cost
function evaluation during individual iterative improvement steps.
This thesis is organized as follows. Chapter 2 discusses the formulation of the
floorplanning problem. We also review simulated annealing algorithms and some previous
work in the area of accelerating annealing algorithms. This is followed by a comparative
review of parallel simulated annealing algorithms. Chapter 3 discusses the serial version
of the simulated annealing floorplanner. Chapter 4 examines general issues in parallel
partitioning strategies and hypercube message passing patterns. Specific strategies are
proposed to map our serial algorithm onto hypercube machines. We also examine in detail
the error tolerance property of simulated annealing and its potential uses in parallel
annealing. Chapter 5 discusses implementation details of the serial and parallel algorithms.
In Chapter 6 we present results obtained by implementing the proposed parallel algorithms
on an Intel iPSC hypercube. Results of experiments performed with the parallel algorithms
are analyzed, and the advantages and shortcomings of these are critically reviewed.
Finally, a brief summary of the contributions of this thesis is presented and areas of future
research are identified in Chapter 7.
Chapter 2
Background and Motivation
In this chapter previous work related to this thesis is reviewed. We begin the discussion
with a specification of the floorplanning task. This is followed by a review of some basic
techniques for solving floorplanning problems. In this context a basic review of simulated
annealing is presented along with some of its applications to floorplanning. Prior efforts in
optimizing simulated annealing algorithms are reviewed, followed by a discussion of
recent parallel approaches to simulated annealing. We conclude this chapter by a
discussion of the motivation for this thesis and the goals to be accomplished in this work.
2.1. The Floorplanning Task
Floorplanning is the process of choosing geometrical attributes for hierarchically
partitioned functional modules so as to satisfy a set of electrical and topological
constraints. After the entire design is partitioned into a set of modules, the physical layout
of these modules must be determined so as to optimize the total interconnect wirelength,
total area and other layout parameters. The placement of these modules and the optimal
choice of their attributes in planning the area of the chip are the goals of the floorplanning
task.
The functional modules or cells have certain geometric constraints to be satisfied during
floorplanning. These constraints typically result in a number of possible shapes and sizes
for each module and often reflect different possible layout styles for this cell. In addition,
the area of the chip is also sometimes constrained, either by its aspect ratio or maximum
allowable size. One primary objective of most floorplanners is the minimization of the
interconnection wirelengths while retaining maximum routability. The process of
floorplanning decides the optimal shapes and arrangement of all the modules, attempts to
pack all the modules in a compact rectangular area, and attempts to minimize the total
wirelength and area occupied by the floorplan.
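The wirelength term in such an objective function is commonly estimated with the half-perimeter of each net's bounding box. The sketch below illustrates that common estimator; the module names, positions and nets are invented for illustration and do not come from this thesis:

```python
def half_perimeter_wirelength(nets, positions):
    """Estimate total wirelength as the sum, over all nets, of the
    half-perimeter of the bounding box of the net's module centers."""
    total = 0.0
    for net in nets:                      # each net is a list of module names
        xs = [positions[m][0] for m in net]
        ys = [positions[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Hypothetical example: three modules, three two-pin nets
positions = {"m1": (0.0, 0.0), "m2": (4.0, 0.0), "m3": (4.0, 3.0)}
nets = [["m1", "m2"], ["m2", "m3"], ["m1", "m3"]]
print(half_perimeter_wirelength(nets, positions))  # 4 + 3 + 7 = 14.0
```

The half-perimeter metric is cheap to update incrementally after a move, which matters when an annealer evaluates thousands of perturbations per temperature.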
Floorplanning and placement seem to be very similar tasks, but they differ in many ways.
Unlike placement, which occurs later in the layout, floorplanning is one of the earliest
tasks in layout design. Placement determines the arrangement of cells that have fixed
shapes and sizes. Floorplanning, on the other hand, not only determines the arrangement
of the cells but also decides the shapes and sizes of the cells which optimize the layout. In
addition, I/0 pin connections may have variable locations in some cells and their optimal
positions are also determined during floorplanning. Floorplanning typically deals with
fewer than 150 modules, while placement very often must handle several hundred. In
our model of floorplanning, the floorplanner determines the size and rough
arrangement of modules. Subsequently a detailed placement phase is required to
determine the precise positions of the modules and routing areas.
2.2. Floorplanning Methods
There are many different methods to solve the floorplanning problem. These methods
can be broadly classified into mincut techniques, rectangular dualization methods and
simulated annealing methods. Some of these techniques solve the floorplanning problem
in the absence of variable shapes and pin locations on modules. In such cases the module
is often abstracted as a macrocell, i.e., a cell with a definite shape. This subset of the
floorplanning problem is referred to as macrocell placement. Some floorplanning and
macrocell placement techniques are reviewed in the following sections.
2.2.1. Mincut Techniques
These techniques are based on a partitioning technique referred to as mincut partitioning
[Kernighan 70, Breuer 77a, Breuer 77b]. Assuming that we have a certain placement of
modules, a cutline is a horizontal or vertical line which divides the modules into two
distinct sets, one on each side of the cutline. There is, typically, an objective function that
assigns a cost to placing the cutline at a particular location. This cost of the cutline is
usually a function of the number of nets which cross the cutline, for example, the number
of nets connecting modules on different sides of the cutline and the relative imbalance
between the total areas of the modules on each side of the cutline. Weighted sums of
crossing net count and area imbalance are common objective functions. Starting with the
entire chip area, an optimal partition into two areas, each containing a subset of the total
number of modules, is done first. The process of partitioning continues recursively, on
each of these two areas and so forth, until the entire chip area is divided into rectangles
each enclosing a single module. Fig.2-1 illustrates the determination of cutlines in mincut
partitioning.
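The cutline objective described above, a weighted sum of crossing nets and area imbalance, can be sketched as follows; the weights, the data layout, and the example modules are illustrative assumptions:

```python
def cutline_cost(cut_x, modules, nets, w_cross=1.0, w_balance=0.5):
    """Cost of a vertical cutline at x = cut_x: the number of nets
    crossing the cut plus the area imbalance, each weighted."""
    left = {name for name, (x, area) in modules.items() if x < cut_x}
    crossing = sum(1 for net in nets
                   if any(m in left for m in net)
                   and any(m not in left for m in net))
    area_left = sum(area for name, (x, area) in modules.items() if name in left)
    area_right = sum(area for name, (x, area) in modules.items() if name not in left)
    return w_cross * crossing + w_balance * abs(area_left - area_right)

# Hypothetical input: modules as {name: (x_position, area)}
modules = {"a": (1.0, 10.0), "b": (2.0, 12.0), "c": (5.0, 8.0), "d": (6.0, 14.0)}
nets = [["a", "b"], ["b", "c"], ["c", "d"]]
print(cutline_cost(3.0, modules, nets))  # one crossing net, zero imbalance -> 1.0
```

A mincut partitioner would evaluate this cost for candidate cut positions (or candidate module partitions) and keep the cheapest, then recurse on each side.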
Lauther [Lauther 79] employs the mincut technique with a unique graph representation
to solve the macrocell placement problem. The layout is represented by two mutually dual,
acyclic and planar graphs each representing one of the two dimensions: vertical and
horizontal. Each macrocell is a rectangle, and is represented by a pair of edges, one in
each graph, where each edge represents one of the two dimensions of the rectangle. The
basic idea is to start with a rectangular area, with a size equal to the total aggregate area
of each cell to be placed, and proceed by recursively dissecting this area to obtain a final
topological placement for the modules. Each dissection partitions a region of the area into
two subregions; mincut techniques decide which modules go in each dissected subregion.
Each dissection contributes nodes or edges to the two graphs: the two graphs are
constructed in parallel with the dissections, and represent the topological placement of
Figure 2-1: Mincut Partitioning (first, second, and third cutlines)
the modules. Fig.2-2 illustrates a polar graph representation of a simple topology. The
process of finding cutlines and partitioning the modules into subregions continues
recursively until every region consists exactly of a single module. The final graphs can
then be converted to a detailed layout which shows the true cell dimensions while
maintaining the neighbour relations obtained from the graph.
Another mincut-based approach is the slicing technique [Brooks 40]. Slicing is a
technique in which a rectangular area is divided by a set of parallel line segments into
smaller rectangles. Each smaller rectangle so obtained is called a slice. Slicing methods
partition the modules into subsets, usually optimizing some function of net connectivity
across the slices, such that every subset can be placed within its corresponding slice. A
slicing tree is a graph used to represent a slicing structure. Each node of the slicing tree
represents a rectangular region which entirely encloses all the modules in each of the
nodes’ subtrees. In a complete slicing tree the leaves correspond to the individual
Figure 2-2: Polar Graph Representation (slicing representation with its horizontal and vertical graphs)
modules. The levels of a slicing tree represent either horizontal or vertical cuts, and the
slices at each level alternate between horizontal and vertical cuts. An optimal slicing tree
that determines the topological configuration is first found. A final rectangular dissection
is then derived from the topological configuration of the slicing tree and the shape
constraints of the modules. Fig.2-3 illustrates a binary slicing tree which is a specific
case of the general slicing tree.
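To make the slicing representation concrete, here is one possible sketch of a binary slicing tree and the bottom-up computation of each region's enclosing rectangle; the node layout and the example modules are assumptions for illustration, not any particular tool's data structure:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SliceNode:
    """Node of a binary slicing tree. Internal nodes carry a cut
    direction ('H' or 'V'); leaves carry a module's (width, height)."""
    cut: Optional[str] = None                     # 'H', 'V', or None for a leaf
    shape: Optional[Tuple[float, float]] = None   # (width, height) for a leaf
    left: Optional["SliceNode"] = None
    right: Optional["SliceNode"] = None

def region_shape(node):
    """Bottom-up: the rectangle enclosing all modules under this node.
    A horizontal cut stacks the children; a vertical cut abuts them."""
    if node.cut is None:
        return node.shape
    lw, lh = region_shape(node.left)
    rw, rh = region_shape(node.right)
    if node.cut == 'H':                           # children stacked vertically
        return (max(lw, rw), lh + rh)
    return (lw + rw, max(lh, rh))                 # 'V': children side by side

# Hypothetical tree: (m1 beside m2) stacked above m3
m1 = SliceNode(shape=(2.0, 3.0))
m2 = SliceNode(shape=(1.0, 3.0))
m3 = SliceNode(shape=(3.0, 1.0))
root = SliceNode(cut='H', left=SliceNode(cut='V', left=m1, right=m2), right=m3)
print(region_shape(root))  # (3.0, 4.0)
```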
A floorplanning tool, MASON [Lapotin 85], uses similar partitioning heuristics for mincut
placement of arbitrarily shaped cells. The problem specification consists of a standard
graph in which the nodes correspond to the modules and the edges to the interconnections
between the modules. This graph is partitioned repeatedly until each partition contains
exactly one node. The partitioning optimizes a weighted sum of the nets crossing the cut
and the relative area imbalance on each side of the cut. Partitioning of small graphs is done
Figure 2-3: Slicing Tree Structure (slicing tree with horizontal and vertical cuts, and its floorplan equivalent)
using exhaustive search, but heuristics are employed to partition large graphs. This
partitioning is followed by the construction of a binary slicing tree. The final phase of the
algorithm converts this binary slicing tree to detailed layout. This is performed by two
slicing tree traversals. The first tree traversal is a Depth-First traversal that walks up the
slicing tree and evaluates the effect of alternate module dimensions on the quality of the
floorplan. At the completion of this traversal, optimal module dimensions are determined.
The second traversal is a pre-order traversal to determine the actual module positions that
satisfy the topological constraints laid down by the slicing tree.
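One common way to realize the first, bottom-up traversal is shape-list combination: each node keeps the non-dominated (width, height) alternatives obtainable from its children's alternatives. The sketch below handles a vertical cut; the pruning rule and example shapes are standard illustrative choices, not necessarily MASON's exact procedure:

```python
def combine_vertical(shapes_left, shapes_right):
    """Combine the alternate (width, height) shapes of two subregions
    placed side by side, keeping only non-dominated combinations."""
    candidates = [(wl + wr, max(hl, hr))
                  for (wl, hl) in shapes_left
                  for (wr, hr) in shapes_right]
    # Prune dominated shapes: sort by width, keep strictly decreasing heights.
    candidates.sort()
    pareto = []
    for w, h in candidates:
        if not pareto or h < pareto[-1][1]:
            pareto.append((w, h))
    return pareto

# Hypothetical input: each child region offers two alternate shapes
print(combine_vertical([(2, 4), (4, 2)], [(1, 3), (3, 1)]))
# [(3, 4), (5, 3), (7, 2)]
```

Walking this combination up the tree yields, at the root, the set of achievable chip outlines; the second traversal can then fix positions consistent with the chosen shapes.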
The main advantage of mincut and slicing approaches lies in their inherent routability.
Cutlines in the slicing tree correspond to routing channels and the slicing tree always
yields cycle-free routing [Supowit 83, Szepieniec 80]. The routing process can be
completed using global routing followed by detailed channel routing. Due to the
minimization of the cutline net crossings at each step of the algorithm, channel congestion
is minimized. Mincut methods are also very popular because of their clarity of
representation and their speed. A disadvantage of a strict mincut approach is its relative
inflexibility. For example, user defined constraints such as alternate shapes for modules,
or a priori fixed module locations are difficult to handle. MASON [Lapotin 85] proposes
some extensions to the mincut approach which are able to handle some of these
problems. Like any iterative improvement technique, mincut approaches tend to get stuck
in locally optimal but globally inferior solutions. One method to get around this is to
conduct multiple runs of the algorithm with different initial configurations, and then select
the best available final solution.
2.2.2. Rectangular Dualization Techniques
Another technique used for floorplanning is the rectangular dualization method [Heller
82, Leinwald 84, Kozminski 84]. In this technique, a configuration is represented as a
graph in which the vertices represent the modules, and the edges represent the module
interconnections. The rectangular dual of this graph is constructed. The dual graph has
vertices which map to the rectangular faces of the modules and edges which correspond
to the adjacent sides of modules in the rectangular dual. Construction of the dual graph
involves branch and bound techniques to generate an exhaustive list of possible
configurations which yield minimum module area. A necessary condition to construct a
dual is that the original graph must be planar; non-planar graphs have to be planarized
before this method is applied. Non-planarity of the original graphs is due to the existence
of wiring crossovers that cannot be routed in the same plane. Consequently, planarization
is done by the introduction of some auxiliary modules which represent these wiring
crossovers.
Rectangular dualization is a very elegant graph theoretic characterization of the problem.
The representation of the problem is entirely geometric and mapping from dual graph to
floorplan and vice versa is very simple. One drawback of this approach is that it is
time-consuming, since it involves exhaustive evaluation of all possible duals of a given
graph. Another disadvantage of this approach is the possibility of the absence of any
satisfactory dual of a graph. As with the mincut approach, suboptimal solutions are very
likely here. In addition, during the planarization of the graph additional nodes
corresponding to wiring crossovers are introduced, creating a problem of determining
placement for these wiring crossovers in the dual. Fig.2-4 illustrates two rectangular dual
graphs and their equivalent geometric representation.
Figure 2-4: Rectangular Dualization (dual graphs and their equivalent slicing representation)
2.2.3. Simulated Annealing Techniques
Simulated annealing [Kirkpatrick 83] is an iterative improvement method for attacking
combinatorial problems. This algorithm follows the analogy of finding a minimum energy
state in a physical system by annealing. Physical annealing consists of heating some
material to very high temperatures until it melts, followed by a gradual, thermodynamically
reversible cooling until the material freezes. At each of these intermediate temperatures
the constituent components of the system, e.g., molecules or atoms, rearrange themselves
in lower and lower energy configurations. Finally, when the system is frozen and no further
rearrangements are possible, the configuration of the system is in the lowest possible
energy state, called the ground state. The simulated annealing algorithm, as its name
suggests, uses an analogy to this process of annealing. To optimize the arrangement of
components in some system, we assume a certain objective function, analogous to the
energy, which is to be minimized. Random perturbations, called moves, are made to the
system, analogous to random molecule rearrangement occurring in the physical system.
Similar to the temperature in the physical system, we have a control parameter T which
regulates the acceptance of perturbations in the system during simulated annealing.
Random perturbations are attempted, and then evaluated by computation of the objective
function. If the change in the objective function ΔE is negative, i.e., if this change results in
an improvement of the objective function, then the change is accepted. On the other hand,
if the change causes an increase in the objective function and worsens the arrangement,
the perturbation is accepted with a probability p(T, ΔE). Boltzmann-like probability
distributions are commonly used, for example:
p(T, ΔE) = e^(-ΔE/T)    (ΔE > 0)
Thermal equilibrium is simulated by attempting a sufficient number of moves at every
temperature so as to explore a large fraction of the state space. Subsequent lowering of
the temperature reduces the probability of accepting positive changes and fewer uphill
moves are accepted. Finally, when the system is frozen, essentially no uphill moves are
accepted and since the objective function is near a minimum, few downhill moves are
found. The pseudo-code given in Fig.2-5 illustrates the simulated annealing algorithm.
    start with a sufficiently high initial temperature (T = T0);
    while ("the state is still changing") {
        while ("state is not in thermal equilibrium at the current temperature") {
            make a random perturbation (move) to the configuration;
            evaluate the change in objective function (ΔE) due to this perturbation;
            if (improvement in the objective function, i.e. ΔE < 0)
                accept the change and update the configuration;
            else {
                evaluate the probability of acceptance p(T, ΔE);
                accept the move with this probability and update if necessary;
            }
        }
        lower the temperature;    /* T = αT */
    }

Figure 2-5: The Simulated Annealing Algorithm
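Written out in a conventional language, the loop of Fig.2-5 looks roughly as follows. Purely for illustration it is specialized to a toy one-dimensional minimization; the schedule constants and the objective are arbitrary choices, not values used in this thesis:

```python
import math
import random

def simulated_anneal(cost, perturb, state, t0=10.0, alpha=0.9,
                     moves_per_t=200, t_min=1e-3):
    """Generic annealing loop: downhill moves are always accepted, uphill
    moves with probability exp(-dE/T); cooling is geometric (T = alpha*T)."""
    t = t0
    e = cost(state)
    while t > t_min:
        for _ in range(moves_per_t):    # crude stand-in for "thermal equilibrium"
            candidate = perturb(state)
            de = cost(candidate) - e
            if de < 0 or random.random() < math.exp(-de / t):
                state, e = candidate, e + de
        t *= alpha                      # lower the temperature
    return state, e

# Toy objective with two minima; the deeper basin lies near x = -1.3.
f = lambda x: x**4 - 3 * x**2 + x
random.seed(0)                          # reproducible run
best, energy = simulated_anneal(f, lambda x: x + random.uniform(-0.5, 0.5), 2.0)
print(round(best, 2), round(energy, 2))
```

Starting from x = 2.0, in the shallower basin, the controlled acceptance of uphill moves lets the search cross the barrier and settle in the deeper minimum, which a greedy downhill search from the same start would miss.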
Simulated annealing has an advantage over greedy, downhill-only algorithms in its ability
to climb out of local minima. The presence of a controlled mechanism for the acceptance
of uphill moves is a critical new feature of these algorithms. Simulated annealing has been
used quite successfully to solve a variety of physical layout problems such as standard
cell placement [Sechen 84], macro cell placement [Jepsen 83], global routing [Vecchi
83], and gate matrix layout [Devadas 86].
We shall now briefly discuss some applications of simulated annealing in floorplanning.
Annealing approaches to floorplanning can be broadly classified into two categories
based on their problem representation. One method is a direct geometrical approach, in
which the floorplanning problem is modeled as a geometrical problem consisting of many
rectangles, each of which has to be placed to minimize the overall objective function.
Another method is to convert the floorplan to an abstract representation such as a polar
graph. Subsequently the transformed problem is annealed to get a solution which is then
mapped back to its geometrical equivalent.
Jepsen and Gelatt [Jepsen 83] propose a simulated annealing method for the placement
of macrocells with arbitrary rectangular sizes. This algorithm tries to minimize the total
wirelength of the placement, hence its objective function consists in part of a wirelength
estimator. Here annealing uses a direct geometric approach of moving the rectangles
around to find optimal placements. Consequently, the algorithm utilizes a move set
consisting of random relocations of the modules: moving a cell in either the horizontal or
vertical directions, rotating the cell in any of the four orientations, or reflecting the cell
along the vertical or horizontal axis. Apart from this, special macrocells like I/O cells are
further constrained in that they are allowed to move only along the periphery of the chip.
The key innovation here is that overlaps among the macrocells are allowed during
annealing. These allowed overlaps greatly simplify the move set, but they clearly
represent an infeasible solution. Consequently, overlaps are penalized by the addition of
an overlap penalty to the objective function. The annealing schedule lowers the
temperature by a constant factor α and identifies the stopping criterion for annealing when
no moves have been accepted for three successive temperatures.
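An overlap penalty of the kind described can be sketched as the total pairwise intersection area of the module rectangles; the linear weighting and the example rectangles below are illustrative assumptions, not the penalty function of [Jepsen 83]:

```python
def overlap_penalty(rects, weight=1.0):
    """Sum of pairwise overlap areas between axis-aligned rectangles
    given as (x, y, width, height); a feasible layout scores zero."""
    total = 0.0
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            x1, y1, w1, h1 = rects[i]
            x2, y2, w2, h2 = rects[j]
            dx = min(x1 + w1, x2 + w2) - max(x1, x2)
            dy = min(y1 + h1, y2 + h2) - max(y1, y2)
            if dx > 0 and dy > 0:
                total += dx * dy        # overlapping area of this pair
    return weight * total

# Two 4x4 cells overlapping in a 2x2 square, plus one disjoint cell
print(overlap_penalty([(0, 0, 4, 4), (2, 2, 4, 4), (10, 0, 4, 4)]))  # 4.0
```

Because the penalty decays to zero as overlaps shrink, the annealer can pass through infeasible intermediate states at high temperature yet be driven toward an overlap-free floorplan as the temperature falls.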
The TimberWolf package [Sechen 84] also includes a simulated annealing algorithm for
macrocell placement, and also anneals a direct geometric representation of the macrocell
placement problem. The objective function consists in part of wirelength minimization, and
an overlap penalty function similar to the one proposed in [Jepsen 83]. Another
component of the objective function reflects the cost of different I/0 locations on cells.
Pin locations are allowed to vary on individual modules, moving from site to site, where
each site has a limited capacity for pins. The objective function penalises pin sites which
exceed their allowable capacity. The proposed move set in this algorithm is richer than the
move set proposed in [Jepsen 83] and includes: single macro cell displacements along
any arbitrary direction, position swapping between two macro cells, aspect ratio changes
in the shape of a single macrocell, and assignment of pins to new sites. The annealing
schedule of TimberWolf also uses T_new = α·T_old, but varies the value of α dynamically
during the annealing process to proceed quickly through very hot and very cold
temperatures, and slowly through the critical intermediate temperatures. TimberWolf also
makes use of the concept of range limiting to avoid proposing unreasonably
large-perturbation moves at low temperatures. This ensures that a large percentage of moves
are not wastefully evaluated only to be rejected at low temperatures.
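Range limiting can be sketched as follows; the logarithmic shrink rate and the function names are illustrative assumptions rather than TimberWolf's actual schedule:

```python
import math

def move_window(temperature, hot_temperature, chip_span):
    """Range limiter (sketch): the maximum displacement allowed for a
    proposed move shrinks as the temperature falls, so low-temperature
    annealing does not waste effort on large perturbations that are
    almost certain to be rejected. The logarithmic shrink is an assumed
    rate, chosen only for illustration."""
    frac = math.log(1.0 + temperature) / math.log(1.0 + hot_temperature)
    return max(1.0, chip_span * min(1.0, frac))
```

At the hot starting temperature the window spans the whole chip; as the temperature drops the window contracts, and proposed displacements are drawn from inside it.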
The approach adopted by Otten [Otten 84] uses a polar graph representation of the
floorplan for annealing, and differs distinctly from the two previously discussed
approaches. Moves are essentially transformations on the polar graph: the polar graph
itself is annealed to get a solution. The fundamental move in this algorithm is an exchange
of positions of the macro cells. The distance between the swapping macro cells is a
parameter which is used effectively to range-limit the moves. Wirelength minimization is
the sole objective function of the algorithm. The move set always explores only feasible
placements which are represented by polar graphs. Overlaps cannot occur in any
floorplan produced by this method and hence the objective function does not contain any
penalty function for overlaps. The starting temperature is derived empirically by attempting
a few moves and determining a temperature that will allow a very high percentage of the
uphill moves to be accepted. The value of α is derived theoretically, unlike the use of an
empirical value of α as is the case with the previous two methods.
Another approach to floorplanning using simulated annealing has been proposed by
Wong and Liu [Wong 86]. This algorithm uses a slicing tree representation called a
Normalized Polish Expression. The Normalized Polish Expression consists of a string of
symbols. The symbols are classified either as operands or operators. Operands represent
the modules and operators define the slicing cuts which dissect the entire floorplan.
There are two types of operators corresponding to the vertical and horizontal cuts. An
expression defines a complete layout in terms of its equivalent slicing tree. The objective
function consists of a total wirelength metric and a total area estimator. Moves consist of
manipulating symbols in an expression, such as swapping two operands or swapping two
adjacent operands and an operator. Swapping two operands, or complementing
a subexpression, always results in a legal Normalized Polish Expression. On the other hand,
some moves, such as swapping an adjacent operator and an operand, may sometimes
yield an invalid Polish Expression. Hence the validity of this move must be established
before attempting it. The algorithm allows modules to have arbitrary rectilinear shapes
defined by a bounding curve. The bounding curve essentially determines the range of
feasible dimensions of the enclosing rectangular area of the module. A piecewise linear
bounding curve can define any rectilinear shape. The minimum area floorplan realization, a
task performed to evaluate a move, is done by adding the bounding curves of the modules
while walking up the slicing tree corresponding to the Polish expression. Incremental
methods of evaluating the minimum area floorplan realization are used to speed up
execution times. This representation of the floorplan reduces the number of neighbouring
states for each state and, consequently, enables the algorithm to search many feasible
floorplans very quickly.
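Evaluating a Polish expression can be sketched for the simplified case in which every module has a single fixed shape; the full bounding-curve machinery of [Wong 86] is omitted here:

```python
def floorplan_dims(expr, sizes):
    """Evaluate a Normalized Polish Expression given in postfix order.
    Operands are module names with fixed (width, height) in `sizes`;
    'V' is a vertical cut (operands placed side by side, widths add),
    'H' is a horizontal cut (operands stacked, heights add).
    Returns the (width, height) of the resulting slicing floorplan."""
    stack = []
    for sym in expr:
        if sym == 'V':
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((w1 + w2, max(h1, h2)))
        elif sym == 'H':
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((max(w1, w2), h1 + h2))
        else:
            stack.append(sizes[sym])
    return stack.pop()
```

For example, the expression `a b V` places modules a and b beside each other; moves such as swapping two operands simply permute symbols in the list and are re-evaluated by walking the expression again (or incrementally, as described above).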
Simulated annealing is an approach which has considerable flexibility compared to the
methods of mincut techniques and rectangular dualization. Many user-defined constraints
which cannot be easily handled by the previous two approaches can be implemented in
the algorithm by a simple change in the objective function. Another advantage of simulated
annealing is its controlled hill climbing mechanism which provides a way to climb out of
locally optimal solutions towards a globally optimal solution. However, these advantages
do not come without cost. Simulated annealing is a computationally expensive technique
and typically requires very long execution times. Various parameters in any actual
annealing algorithm must be tuned to a great degree to optimize performance, resulting in a
slight loss of generality. Nevertheless, the fact that simulated annealing is a general
approach for the solution of many different layout problems contributes to its popularity.
In the next chapter, we describe our own version of a floorplanning algorithm using
simulated annealing; we employ a direct geometrical representation similar to that used in
[Jepsen 83] and by Sechen [Sechen 84] in TimberWolf. The objective function consists
of a wirelength estimator, an area estimator and a penalty function for module overlaps. We
have used the idea of an overlap penalty function similar to that in [Jepsen 83] with some
modifications to more accurately reflect the overlap situation. The move set used by our
algorithm is specifically adapted to our specification of the floorplanning task and is richer
than the simple move set proposed in [Jepsen 83].
2.3. Optimization and Parallelism in Simulated Annealing
Simulated annealing essentially refines a random solution, cooling it from a high starting
temperature to the final frozen state through many intermediate temperatures. Computation
at every temperature involves the processing of thousands of moves. This means that
moves have to be proposed and evaluated, and configurations updated millions of times
during an entire annealing schedule. This is a computationally intensive process and
efforts have been made to optimize simulated annealing algorithms to improve their speed,
while at the same time maintaining the high quality of their solutions. Efforts to accelerate
annealing algorithms have been primarily in two directions. One method, focusing on serial
algorithms only, is to incorporate modifications in the algorithm to reduce the
computational complexity of the long sequence of moves to be evaluated. The other method
focuses on parallelism in annealing, using multiprocessors and parallel algorithm
partitioning strategies to accelerate the computation. This section reviews serial and
parallel strategies to accelerate annealing.
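The process just described, reduced to its essentials, is the familiar Metropolis loop; the sketch below is generic and does not represent the exact procedure of any particular algorithm discussed here:

```python
import math
import random

def anneal(state, propose, cost, t_start, alpha, moves_per_t, t_final):
    """Generic serial annealing loop (sketch): at each temperature,
    propose and evaluate a batch of moves, accepting uphill moves with
    the Metropolis probability exp(-delta/T); the temperature is then
    lowered geometrically by the factor alpha."""
    t, current = t_start, cost(state)
    while t > t_final:
        for _ in range(moves_per_t):
            candidate = propose(state)
            delta = cost(candidate) - current
            if delta <= 0 or random.random() < math.exp(-delta / t):
                state, current = candidate, current + delta
        t *= alpha
    return state, current
```

Even this toy loop makes the computational burden apparent: tens of temperatures times thousands of moves per temperature, each requiring a full cost evaluation.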
2.3.1. Serial Optimization
This subsection reviews serial strategies to accelerate annealing: optimal temperature
scales for annealing, rejectionless methods, optimal annealing schedules and error
tolerance in annealing.
Concepts of Scale: The cost function varies dynamically during annealing; Fig.2-6
illustrates this variation. As can be seen in Fig.2-6, the objective function does not
change appreciably at very high temperatures. Due to a high probability of acceptance of
uphill moves, annealing in this hot regime results in randomizing the configuration. This
suggests a modification to the basic serial algorithm which reduces the amount of high
temperature annealing to an extent sufficient to retain the optimality of the solution. White
[White 84] gives an empirical method to identify the optimum starting temperature based
on the parameters of the problem being solved. Certain assumptions are made regarding
the energy of the system, for example, the existence of finite energy maxima and energy
minima in the solution space. By using concepts from statistical thermodynamics, White
[White 84] shows that the standard deviation of the energy states defines a temperature
scale. These temperature scales identify the starting temperature to which the system
must be heated to obtain optimal solutions and also the freezing temperatures to which the
system must be cooled to get a good result. Knowledge of these temperatures tightens
the annealing schedule and results in faster annealing.
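A minimal sketch of this temperature-scale idea follows; the safety factor k is an assumed parameter introduced here for illustration, not a value prescribed by [White 84]:

```python
import statistics

def hot_temperature(delta_samples, k=20.0):
    """Temperature-scale heuristic (sketch): sample the cost changes of
    a handful of random moves on the problem instance, and set the
    starting temperature to a multiple of their standard deviation, so
    that nearly all uphill moves are initially accepted. The factor k
    is an assumed safety margin."""
    return k * statistics.pstdev(delta_samples)
```

A starting temperature large relative to the spread of energy states guarantees the initial randomizing behaviour without heating the system far beyond the point where that behaviour is already achieved.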
Rejectionless methods: In a standard annealing algorithm, every move must be evaluated
in its entirety before its acceptance criterion is determined. Rejected moves, therefore,
result in a waste of computation. Greene and Supowit [Greene 84] propose a
modification of the simulated annealing algorithm which involves fewer rejected moves.
The move proposal stage is biased towards moves which will be eventually accepted. For
each move, a value is stored which is a weighted function of the change in cost it causes.
Figure 2-6: Variation of the Cost Function During Annealing
A move is selected with a probability given by a function of this value. This is followed by
regular updating of the state. As can be expected, this modification does not yield any
improvement in computation time over the basic algorithm at high temperatures, when the
acceptance rate of moves is high. However, at low temperatures, when only a small
percentage of moves are accepted, significant speed-ups are obtained. A crossover point
is determined, specific to the problem being annealed, and the selection of moves is
changed to the rejectionless method dynamically during annealing after this crossover
point. Range limiters, which are employed in several annealing algorithms, use a broadly
similar concept for their operation. Rejectionless methods are especially attractive when
the time to calculate the expected change to the objective function due to a move is
considerably less than the time to evaluate a move in its entirety. Greene [Greene 84]
uses this approach for a logic partitioning problem where it is easy to quickly evaluate
the expected change in the move. More complex problems such as floorplanning are not
amenable to this technique since there is no method to quickly establish expected
changes caused by moves.
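The biased move-proposal stage can be sketched as a weighted selection over the expected cost changes of candidate moves; using the Metropolis acceptance probability as the weight is an illustrative choice:

```python
import math
import random

def pick_move(deltas, temperature):
    """Rejectionless-style selection (sketch): choose among candidate
    moves with probability proportional to their Metropolis acceptance
    weight, so that moves destined to be rejected are rarely proposed
    at all."""
    weights = [min(1.0, math.exp(-d / temperature)) for d in deltas]
    total = sum(weights)
    r = random.random() * total
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(deltas) - 1
```

At low temperatures the weights of strongly uphill moves vanish, so almost every proposed move is one that will be accepted; this is precisely the regime where the method pays off.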
Optimal Annealing Schedules: Choice of good annealing schedules increases the rate of
convergence of simulated annealing. Annealing schedules have been proposed with
optimal starting and stopping temperatures, temperature decrements, and thermal
equilibrium criteria. Huang et al. [Huang 86] have proposed an annealing schedule which
optimizes each of these parameters of the schedule to get higher performance. Their
starting temperature is effectively infinite since they accept every move. They determine
the next temperature by the assumption that at that temperature any configuration whose
cost is worse by 3σ than that of the present configuration must be accepted with a very high
probability, where σ is the standard deviation of energy states at this temperature. Since
thermal equilibrium is the establishment of a steady-state probability distribution of the
states of the system, the proposed annealing schedule identifies thermal equilibrium when
the ratio of the number of new states generated whose cost changes lie within a certain
fraction of σ of the average cost reaches a stable value. This speeds up the establishment
of the equilibrium condition. Results with new annealing schedules typically show about a
factor of 2 improvement in the rate of convergence.
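One decrement rule in the spirit of [Huang 86] can be sketched as follows; the control parameter lam and its default value are assumed tuning constants, not values taken from the paper:

```python
import math

def next_temperature(t, sigma, lam=0.7):
    """Huang-style temperature decrement (sketch): cool so that the
    expected decrease in average cost between temperatures stays within
    a fraction of sigma, the standard deviation of the cost at the
    current temperature. When the cost landscape is rough (large sigma)
    the schedule cools faster in absolute terms but stays within the
    same statistical distance of equilibrium."""
    return t * math.exp(-lam * t / sigma)
```

The decrement is adaptive: it is computed from statistics measured at the current temperature rather than from a fixed geometric factor.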
Error Tolerance: A recent result by Grover [Grover 86] explains why it is possible for the
simulated annealing algorithm to tolerate uncertainties. Uncertainties arise when the
evaluation of the objective function after a move has some error or ambiguity. The error
tolerance of simulated annealing implies that optimal solutions can be found even if the
exact value of the energy function can be in error by an amount ΔE. With concepts of
statistical mechanics, it is derived that the error tolerance of simulated annealing depends
very closely on the temperature of the system. It has been shown that when the error in
the energy function evaluation (ΔE) is very much smaller than the temperature T (|ΔE| ≪
T) the algorithm preserves its convergence properties in spite of the errors and converges
to a good solution. This fact can be exploited to accelerate annealing algorithms by using
fast, approximate methods of move evaluation at high temperatures instead of slow, exact
methods of evaluation. This constraint on the error tolerance denotes an upper limit for the
error tolerance; errors beyond this limit may affect the optimality and convergence of the
algorithm. This result presents a way to exploit parallelism by allowing fast, parallel
evaluations of moves with some errors.
2.3.2. Parallelism and Parallel Simulated Annealing
All the optimizations to the serial simulated annealing by way of modifications to move
computations or the annealing schedule have rarely contributed to speedups greater than
2. To obtain faster rates of convergence, efforts to accelerate simulated annealing
algorithms have been focussed more recently on the use of multiprocessors to exploit
parallelism inherent in annealing algorithms.
A close examination of simulated annealing algorithms reveals that there is potential
parallelism involved in the move evaluation process. Recent research in this area has
resulted in different ways of utilising this inherent parallelism to adapt annealing algorithms
to parallel execution on various multiprocessors. Speedups here are obtained by efficient
partitioning schemes and by the use of a large number of processors. To date, most
parallel algorithms published for simulated annealing have been implemented on shared-
memory machines. Shared-memory machines have a disadvantage in that they cannot be
trivially expanded past some fixed limits arising from processor memory bandwidth
limitations and bus limitations. Typical commercial shared-memory machines have up to
32 processors. Hypercube multiprocessors, on the other hand, are very nearly
incrementally expandable because they do not rely on global busses. Speedups are
limited almost entirely by algorithm performance. Current commercial hypercubes have 16
to 1024 processors. These considerations have prompted us to study parallel
implementations of simulated annealing on hypercube architectures. In the following two
sections we review some of the main ideas in previous parallel implementations, both on
shared-memory architectures and on message-passing architectures.
2.3.3. Shared-Memory Implementations
One of the earliest approaches to exploit parallelism in simulated annealing by Kravitz
[Kravitz 86a, Kravitz 86b] uses a shared memory multiprocessor to do standard cell
placement, and identifies different parallel partitioning strategies. They identify two basic
kinds of parallelism in simulated annealing: Parallel-moves, which involves simultaneous
evaluation of a number of moves, and move-decomposition, which consists of
decomposing a single move into subtasks each of which can be performed
simultaneously. It is noted that these two types of parallelism are essentially orthogonal:
one can perform many separate moves in parallel, and also decompose each move into
parallel subtasks.
For the Parallel-moves scheme, the concept of a Serializable subset is introduced;
moves which form a serializable subset can be evaluated in parallel due to their
non-interacting nature and give the same result as a serial evaluation of the moves in some
known order. A simple serializable subset is the set of moves consisting of one accepted
move and the rest being rejected moves. Parallel moves are implemented by evaluating
moves in parallel on all processors until the first move is accepted. The acceptance of a
move automatically aborts other parallel move evaluations. The necessary updates
corresponding to this accepted move are done and parallel move evaluations begin all
over again. Move-decomposition schemes are also employed which use functional move
decompositions that divide the entire move evaluation into functional subtasks and assign
the evaluation of each of these subtasks to different processors.
The Parallel-moves algorithm works very well at low temperatures of annealing. This is
so because at low temperatures very few moves are accepted and large serializable
subsets can be found. However, at high temperatures the functional decomposition
strategies yield better results than the parallel moves scheme. An adaptive strategy is
suggested which changes partitioning strategies during the cooling process to produce
the best speedup across the entire temperature range. Kravitz and Rutenbar [Kravitz
86a, Kravitz 86b] report speedups of about 3 for a 4-processor VAX 11/784
implementation.
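The serializable-subset scheme can be sketched as follows: several moves are evaluated against the same base state, and only the first accepted one is committed. Here the parallel workers are simulated serially, which by construction yields the same result as the genuinely concurrent version:

```python
import math
import random

def parallel_moves_step(state, propose, cost, temperature, n_workers):
    """One step of the parallel-moves scheme (sketch): up to n_workers
    moves are evaluated against the same base state; the first accepted
    move is committed and the remaining evaluations are discarded,
    which is equivalent to some serial ordering of rejections followed
    by one acceptance."""
    base = cost(state)
    for _ in range(n_workers):
        candidate = propose(state)
        delta = cost(candidate) - base
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            return candidate  # accepting aborts the other evaluations
    return state
```

At low temperatures most of the n_workers evaluations end in rejection, so the work discarded after an acceptance is small and nearly linear speedup is possible; at high temperatures the first worker usually accepts and the remaining processors contribute nothing, which is why the scheme degrades there.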
A parallel simulated annealing algorithm for macro cell placement is presented by
Casotto [Casotto 86]. Their objective function includes a wirelength estimator, a total
area estimator and a penalty function for the total overlap. A parallel moves scheme is
employed to exploit parallelism. Each processor has the responsibility to independently
propose, evaluate and accept moves pertaining to a certain set of modules. Since each
processor is evaluating moves in parallel and also accepting them asynchronously there is
some error involved in the move evaluation process. Every processor does not have
entirely correct information about the state of each module before it tries each new move,
unlike [Kravitz 86a, Kravitz 86b] who always accept only one move and throw away the
rest. The algorithm accepts all acceptable parallel moves and in the process introduces
uncertainty in the value of the objective function. Experimentally they show that such an
uncertainty does not cause any serious problems with the convergence properties of the
annealing algorithm, as predicted by the results of [Grover 86]. To force this uncertainty
to extremely small values at very low temperatures the concept of clustering cost is
introduced as part of the objective function. The clustering cost tries to force modules
which interact strongly amongst themselves to be allocated to the same processor node.
In effect, the partitioning of modules among physical processors is itself annealed, just as
the placement of cells on the chip is annealed. This clustering tries to find an optimal
partitioning of the modules that reduces the uncertainty of move evaluation by ensuring
that all the modules interacting with a move reside in the same processor; consequently,
the uncertainty in the move evaluation is reduced. Speedups of about 6 have been
reported while using 8 processors on a Sequent Balance 8000 shared-memory
multiprocessor.
Rose [Rose 86] proposes three parallel algorithms which replace different phases of
simulated annealing for a standard cell placement task. The first technique, referred to as
Heuristic Spanning, is used to entirely replace annealing in the hot regime. With the help of
some mincut based heuristics, Heuristic Spanning searches for coarse interim placements
i.e., the sort of placements found during high temperature annealing. Once the Heuristic
Spanning phase is over, the best partial solution thus obtained is selected and several
independent, low temperature annealings are done in parallel. Each processor thus tries to
improve this placement with low temperature annealing. When annealing is completed in
each processor, the best solution is accepted as the final solution. Fig.2-7 illustrates this
algorithm.
The second technique is called Multiple-Seed Collusion. Similar to the previous method,
each processor carries out annealing in parallel, independently of the other processors.
After a certain number of moves, the partial solution in each of the processors is
examined. The best solution is accepted, and this is selected as the next configuration
from which all the processors repeat the whole procedure of independent annealing. This
process, intuitively at least, enables quick identification of search paths that lead to non-
optimal solutions. These paths are then discarded from the search space, thereby
reducing the complexity of search. The granularity of this method, which is the number of
moves after which the processors synchronize to select the best partial solution amongst
them as the new seed, turns out to be an important parameter. If this parameter is too
small, the probabilistic hill climbing property is essentially destroyed, and the
Figure 2-7: Heuristic Spanning
convergence of the algorithm to optimal solutions is degraded. Also, this involves a large
interprocessor communication overhead. On the other hand, if this parameter is large the
search space is not reduced and the problem of expensive searching along non-optimal
search paths is not addressed efficiently. Rose [Rose 86] compares and contrasts these
two techniques and concludes that the Multiple-Seed Collusion method does not yield
very good solutions. Fig.2-8 illustrates the Multiple-Seed Collusion algorithm.
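One round of Multiple-Seed Collusion can be sketched as follows; anneal_n is an assumed helper representing a single worker's independent annealing run of a fixed number of moves:

```python
def multiple_seed_round(seed, n_workers, n_moves, anneal_n):
    """One synchronization round of Multiple-Seed Collusion (sketch):
    every worker anneals independently for n_moves starting from the
    shared seed configuration, then the best interim solution found by
    any worker becomes the seed for the next round. anneal_n is assumed
    to map (state, n_moves) to a (state, cost) pair."""
    results = [anneal_n(seed, n_moves) for _ in range(n_workers)]
    best_state, _ = min(results, key=lambda r: r[1])
    return best_state
```

The granularity parameter n_moves is exactly the quantity discussed above: too small and the synchronization destroys hill climbing while flooding the network, too large and the pruning of poor search paths never takes effect.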
The third approach used in [Rose 86] uses geographical partitioning of modules on
processors. Processors assume the responsibility to move only those modules which lie
in its area. Processors propose, evaluate and accept moves independently of moves
occurring on other processors. Due to this independence in move evaluation, an effort
must be made to maintain information integrity within reasonable limits. This integrity is
maintained by three different communication patterns among the processors.

Figure 2-8: Multiple-Seed Collusion

In the Gross Collusion method, the processors are always responsible for the same subset of modules.
After making a certain number of moves, all the processors send a message to the master
processor, which calculates the updated state of the system and sends it back to the
individual processors.
move is made in the Full Broadcast scheme. This scheme involves heavy message traffic
between processors and can result in significant communication overheads. To minimize
the message traffic generated by the Full Broadcast scheme the Need to Know scheme is
proposed. The Need to Know strategy involves interprocess communication only to
update the processors which need to know the update information during subsequent
move evaluations. This reduces interprocessor communication to minimal required levels.
Speedups of about 4 are reported for the Full Broadcast and the Need to Know strategies
running on a 5-processor multiprocessor.
2.3.4. Hypercube Implementations
An interesting solution of the travelling salesman problem (TSP) by simulated annealing
using a hypercube is given by Felten et al. [Felten 85]. There is no shared memory here,
and hence all synchronization is implemented by message-passing. Each processor is
assigned a set of cities and a random initial tour is chosen. A move constitutes the
swapping of the positions of the cities on the tour. This swapping can be between cities in
the same processor or between cities residing on different processors. This is followed
by a global update phase which enables the cities to redistribute themselves throughout
the hypercube. After the entire annealing process, the cities come to reside in their proper
nodes and hence they have migrated to their proper location in the tour. Evaluating
Hyperswaps, or swapping between cities residing on different processors, is done by
using the message links existing between adjacent nodes. Since the hyperswaps
represent large changes their acceptance is very low at low temperatures. They are
useful at high temperatures since they can force the system to diffuse quickly out of local
minima. At low temperatures the adjacent pair swaps are more predominant. With this
algorithm speedups of 55 for a 64 node 6 dimensional hypercube have been reported. The
TSP has a very elegant characterization, i.e., it has a simple move set and a simple
objective function. Moves do not interact very much and can be evaluated in parallel fairly
accurately, thereby maintaining information integrity in the parallel moves scheme. This
fact makes the parallel moves scheme quite successful and 86% utilization of processors
is reported.
Banerjee [Banerjee 86] presents a parallel simulated annealing algorithm for standard
cell placement on a hypercube. Their approach to the problem consists of partitioning the
modules by area amongst the processors. Moves consist of displacement moves and
swapping moves. The evaluation of these moves is performed in parallel with the help of
message passing. To help in the move evaluation, every processor also keeps all relevant
information about modules not in the area for which it is responsible. Once a move is
evaluated and its acceptance is decided, the necessary updates are made in individual
processors. Propagation of the update information to all other processors is done by the
use of a Hamiltonian circuit in the hypercube topology. This algorithm entails a very heavy
volume of message traffic which is disadvantageous. Another disadvantage of this
approach is that the communication overhead in message passing becomes very
expensive when the ratio of the communication time between processors to the
computation time on a single processor is significant. To cope with the heavy amounts of
message traffic entailed by this communication pattern a different strategy is suggested in
[Banerjee 87]. Broadcast trees are used for broadcasting information to all the
processors. Broadcast trees route broadcast messages in a hypercube topology in times
proportional to the dimension of the cube. Speedups are predicted in the range from 6 to
13 for a 6-dimensional hypercube. These speedups are predicted from simulation times on
a hypercube simulator.
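The dimension-by-dimension broadcast can be sketched as a schedule of message sends; this is the standard recursive-doubling pattern for hypercubes, not necessarily the exact trees used in [Banerjee 87]:

```python
def broadcast_schedule(dim, root=0):
    """Hypercube broadcast tree (sketch): at step k, every node that
    already holds the message forwards it to its neighbour across
    dimension k (node id XOR 2**k). A broadcast to all 2**dim nodes
    therefore completes in dim steps -- time proportional to the cube
    dimension, as noted above. Returns the list of (src, dst) sends
    performed at each step."""
    holders = {root}
    steps = []
    for k in range(dim):
        sends = [(node, node ^ (1 << k)) for node in sorted(holders)]
        holders |= {dst for _, dst in sends}
        steps.append(sends)
    return steps
```

The number of senders doubles every step (1, 2, 4, ...), which is why the schedule is logarithmic in the number of processors while a Hamiltonian-circuit propagation is linear.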
2.4. Motivation for Research
Simulated annealing is a technique which can be used to solve a variety of CAD
problems with extremely good results. However, given the extreme runtimes for typical
annealing algorithms, the need for methods to accelerate annealing techniques cannot be
overemphasized. The use of multiprocessors and parallel annealing strategies present
themselves as very interesting and useful areas of research. Most of the current research
in this area has been focussed towards parallel annealing algorithms on shared memory
machines. Our efforts in this area focus on the implementation of parallel simulated
annealing algorithms for machines with a hypercube message passing architecture.
This thesis examines different partitioning schemes to attack simulated annealing
problems on a hypercube multiprocessor. An effort has been made to identify the inherent
concurrency in a particular annealing algorithm, and deduce appropriate parallel algorithm
decomposition strategies. Since message passing primitives are our only tools for
synchronization and data sharing, we are forced to focus on optimizing the allocation of
computations and data to different processors. In an inappropriate decomposition, the
message communication overhead can sometimes become prohibitively expensive. This
fact has prompted us to study optimized patterns of message traffic for parallel annealing.
An effort has been made to partition the algorithm into large-grain subtasks to increase the
ratio of computation time to communication time.
We have chosen to implement a floorplanning algorithm as a typical application of
simulated annealing. Compared to the placement and routing problem, floorplanning has
many more degrees of freedom and hence a wider variety of solutions. The move
evaluation phase has many different subtasks, and has, therefore, a larger granularity than
the move evaluation phase for a placement or routing problem. This provides us with a
richer move set as compared to other simulated annealing problems. We have first
implemented a serial version of the floorplanner which serves as a vehicle for the parallel
implementations. The serial floorplanner that we have implemented is a "no frills"
floorplanner: it does not attempt to solve the floorplanning problem in its entirety. Instead,
we have made an attempt to capture the most important features of the floorplanning
problem which reflect the power of the simulated annealing technique, without making the
problem unnecessarily complex. The next chapter discusses the design of the serial
floorplanning algorithm.
Chapter 3
Serial Floorplanner
This research effort attempts to investigate some parallel approaches to floorplanning
using simulated annealing. The algorithms, both serial and parallel, which implement these
parallel approaches are collectively referred to as PASHA1. This chapter describes the
floorplanning algorithm which has been implemented in the serial version of PASHA.
Design considerations for this floorplanner are discussed. Simplifications of the problem,
which have been made to reduce the complexity of implementation, are critically reviewed.
This is followed by a performance evaluation of this serial implementation. Benchmarks are
run on the serial version of PASHA and the quality of the final solution is compared to that
obtained by another floorplanning program: MASON [Lapotin 85].
3.1. Approach to Floorplanning
The design for the serial version of PASHA has been influenced greatly by the macro cell
placement of Jepsen and Gelatt [Jepsen 83]. The approach that we have chosen uses a
representation of the problem which is akin to the geometric nature of the problem. We do
not use any indirect graph based representation, such as a polar graph or slicing tree, for
the layout. Instead, modules are represented by rectangles which are moved and resized
by the annealing algorithm. Annealing attempts to find an optimal arrangement for these
rectangles together with their optimal shapes and sizes.
1 PASHA: Parallel Approach to Simulated annealing on Hypercube Architectures
We now discuss a general set of objectives and constraints that characterize an ideal
floorplanner. Essential input consists of modules and their interconnections. Modules can
have varying shapes and sizes depending on the layout style of the cell. Consequently,
the specification of modules includes a list of such alternate shapes and sizes. The
objective of the floorplanning process is to choose optimal shapes and sizes for the
modules from among these prespecified alternatives. Besides their shapes and sizes, the
positions of the I/O connections on the boundaries of these modules are also variable,
depending on their internal layouts. Optimal positions of the I/O connections must be
determined during the floorplanning process to minimize the total wirelength. In addition,
some global topology constraints may also exist. These constraints, typically, force some
modules to be positioned in specific configurations; for example, we might force some
modules to be placed adjacent to other modules, or force modules to be placed only in
some fixed area of the chip. These constraints arise mainly due to I/O considerations. Bus
topology is another factor which critically affects system performance. Consequently,
optimum bus topology must also be determined, subject to a similar set of constraints.
The floorplanning area (the total acceptable area of the layout) is also usually
constrained. These constraints limit the size of the layout and may also restrict the aspect
ratios of the floorplan area. These constraints reflect fabrication and packaging
considerations. Floorplanning attempts to achieve a highly compact layout while
satisfying these constraints on the area. The layout, thus produced, must also ensure the
routability of all the nets.
In our implementation of a floorplanner, we have made a number of engineering
approximations and design judgements to reduce the complexity of the implementation
while still preserving the core of the problem. Instead of solving the floorplanning problem
in its entirety, our simplifications attempt to solve a sufficiently large subset of the actual
floorplanning problem. This subset accurately reflects the important characteristics of the
actual floorplanning problem. We shall now discuss these simplifications and their effects
on the floorplanning task.
We have chosen to calculate the net wirelengths by using the half perimeter method.
This method involves calculating the bounding box of each net (i.e., the bounding box of all
the modules which the net connects). The half perimeter of this bounding box
approximates the net wirelength. This method is chosen over other methods, such as
center-to-center wirelength evaluation and minimum Steiner tree estimations, because it
is a fairly accurate estimator of net wirelength and, more importantly, provides for faster
evaluations. Compared to the other wirelength estimation methods, this bounding box
metric always overestimates the wirelength of nets. The exact position of I/0 connections
on the boundary of each module is ignored in our simplification. This simplification affects
wirelength minimization minimally because of the overestimation of wirelength by the
bounding box metric.
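To make the metric concrete, the half-perimeter computation can be sketched as follows. The `Point` record and the function name are illustrative only; they do not reflect PASHA's actual data structures.

```c
/* One connection point per module the net touches (illustrative record). */
typedef struct { double x, y; } Point;

/* Half-perimeter wirelength of one net: build the bounding box of all
 * module positions the net connects, then return width + height. */
double half_perimeter(const Point *pins, int n)
{
    double xmin = pins[0].x, xmax = pins[0].x;
    double ymin = pins[0].y, ymax = pins[0].y;
    for (int i = 1; i < n; i++) {
        if (pins[i].x < xmin) xmin = pins[i].x;
        if (pins[i].x > xmax) xmax = pins[i].x;
        if (pins[i].y < ymin) ymin = pins[i].y;
        if (pins[i].y > ymax) ymax = pins[i].y;
    }
    return (xmax - xmin) + (ymax - ymin);
}
```

A single pass over the net's pins suffices, which is what makes this estimator faster than Steiner tree approximations.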
In an ideal floorplanner, the modules have unconstrained space in which to move around
during the annealing process. As annealing proceeds, the modules rearrange themselves
in close proximity to occupy a compact area. However, the implementation of such an
"infinite" space for the modules presents a difficult problem. Therefore, we have chosen
to represent this space as a finite area by establishing some auxiliary constraints on
moves which allow modules to move only within this finite space. The dimensions of this
area are determined as a function of the estimated area of the floorplan. The estimated
area in turn is a function of the actual module sizes. In our implementation, the size of this
"playing" space is given by c × (sum of the maximum module sizes), where the value of the
constant c is user-defined. The aspect ratio of the "playing" field is the same as the desired
aspect ratio. The choice of a "playing" field effectively produces a rigid boundary inside
which the modules are constrained to move. Moves which would take a module outside this
rigid boundary are termed illegal and are disallowed. This restriction on the floorplan area
keeps the aspect ratios of the final floorplan within reasonable limits of the desired value.
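A minimal sketch of the legality test follows, assuming a module is represented by its lower-left corner and dimensions (the `Rect` layout and field names are assumptions, not PASHA's real representation):

```c
typedef struct { double x, y, w, h; } Rect;  /* lower-left corner + size */

/* The "playing" field is a rigid boundary of size field_w x field_h.
 * A lateral shift by (dx, dy) is legal only if the shifted rectangle
 * stays entirely inside that boundary. */
int shift_is_legal(const Rect *m, double dx, double dy,
                   double field_w, double field_h)
{
    double nx = m->x + dx, ny = m->y + dy;
    return nx >= 0.0 && ny >= 0.0 &&
           nx + m->w <= field_w && ny + m->h <= field_h;
}
```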
Topology constraints, such as those which limit the position of modules to be in specific
areas of the floorplan, are not considered in our implementation. Such constraints tend to
clutter an otherwise clean characterization of the problem and hence we have avoided this
class of constraints altogether. Implementation of such constraints can be done with
minimal change in the problem representation. Future versions of PASHA will have the
capacity to tackle such constraints.
Our implementation does not allow bus constraints to be specified. Buses are unique
objects and it is desirable to handle them separately. Most floorplanning algorithms,
especially those with polar graphs and slicing trees, cannot handle different types of
objects. This is a basic drawback in their problem representation. A geometrical
representation like the one PASHA uses is easier to tailor to represent buses and bus
constraints. Another aspect of floorplanning which has not been implemented in the
current version of PASHA is the ability to handle external pins and pads.
Routing space for nets in the final floorplan is addressed by overestimating the areas of
the modules. Modules are expanded artificially just before the annealing process and
these expanded sizes are used during annealing. When annealing finally terminates and a
final floorplan is obtained, the modules are shrunk back to their original sizes. This results
in the creation of some routing space between modules. Presently, there is a user-specified
option to overestimate the size of a module by a fixed fraction of its area. A module which
has a large number of nets connected to it needs more routing space, so the amount of
overestimation in the area of the module should, consequently, be a function of the number
of nets which connect to the module. Future versions
of PASHA will incorporate this feature.
3.2. Annealing Algorithm Implementation
The design of a good annealing algorithm involves determination of essentially four
aspects of annealing: the move set, the objective function, the annealing schedule, and the
data structures. We shall briefly discuss these aspects with respect to our implementation
of an annealing algorithm for floorplanning.
3.2.1. Move Set
The move set for our annealing algorithm for floorplanning is designed to enable the
system to explore all possible degrees of freedom, and reach any feasible configuration.
The move set must specifically attempt to reconfigure the system by moving the modules
in the floorplan, and by exploring different shapes of the module. The move set for PASHA
is as follows:
Lateral shifts: These moves laterally shift the modules in any of the four compass directions. This is the most basic kind of movement the modules can make in order to rearrange themselves during the process of annealing. A movement of a module to an arbitrary location can be decomposed into at most two lateral shifts. The simplicity of this move and the possibility of decomposing all other movements into a sequence of lateral shifts prompted its inclusion in the move set.
Swap: Two modules can exchange their positions in the layout. Though this move can essentially be decomposed into a set of lateral shifts, we have incorporated it since it results in a sufficiently big perturbation to the system. Large perturbations help the system to climb out of local minima quickly, or to proceed downhill quickly.
Rotate: A module can be located in any orientation in the final floorplan. This move serves to explore optimum orientations of the modules in the floorplan. Rotation of a module is done along any of the four directions. Since we are dealing with Manhattan geometry alone, modules are only rotated in multiples of 90°.
Change size: To choose the optimum size of a module from among the specified sizes, this move simply explores alternative sizes. It is defined only for modules which have a list of alternate sizes specified. The move consists of picking a random size for the module from this list.
Fig.3-1 illustrates the different types of moves in the move set for PASHA. Moves are
Figure 3-1: Move set for PASHA (lateral shift, swap, rotate, change size)
chosen at random from this move set. However, the relative proportion in which different
types of moves are chosen from the move set is critical to the performance of the
algorithm. An empirically determined optimal proportion of move types can enhance the
convergence of the algorithm. For example, the TimberWolf package [Sechen 84] uses an
empirical ratio of 10 to 1 as the ratio of single module moves to module exchanges for
optimal results. In our implementation, we have maintained the same ratio of 10 to 1
between single module moves and module exchanges. Further, since our single module
moves comprise three different types of moves, we have chosen a ratio of 3:1:1 among
single module moves as the proportion of lateral shifts, rotates, and size changes,
respectively.
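One way to realize these proportions is to discretize them as integer weights: a 3:1:1 split of the ten single-module parts gives weights of 6:2:2 for shifts, rotates, and size changes, plus 1 for swaps, out of 11. A sketch (the enum names and the 6:2:2:1 discretization are our own reading, not PASHA's literal code):

```c
typedef enum { MOVE_SHIFT, MOVE_ROTATE, MOVE_RESIZE, MOVE_SWAP } MoveType;

/* Map a uniform random draw r in 0..10 to a move type, honoring the
 * 10:1 ratio of single-module moves to swaps and the 3:1:1 split among
 * single-module moves -- i.e. weights 6:2:2:1 out of 11. */
MoveType move_for_draw(int r)
{
    if (r < 6)  return MOVE_SHIFT;   /* 6/11: lateral shifts */
    if (r < 8)  return MOVE_ROTATE;  /* 2/11: rotations */
    if (r < 10) return MOVE_RESIZE;  /* 2/11: size changes */
    return MOVE_SWAP;                /* 1/11: module exchanges */
}
```

A caller would invoke it as `move_for_draw(rand() % 11)` each time a move is proposed.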
3.2.2. Objective Function
The objective function for the simulated annealing algorithm for floorplanning must
accurately quantify the goals of floorplanning within the framework of the various
constraints. Wirelength minimization is one of the primary objectives of floorplanning.
Consequently the objective function has a wirelength estimator. The wirelength estimator
employed in our implementation is the half perimeter metric. Floorplanning must also
attempt to pack the cells in the minimum possible area. Packaging considerations dictate
certain optimum aspect ratios of the chip area. Consequently, the floorplan area must be
optimized with respect to the total area and optimum aspect ratios. These considerations
are taken into account by the presence of an area estimator in the objective function.
An estimated area for the floorplan is calculated as the sum of the maximum possible
areas of the modules. The space in which the modules move around is made a fraction
greater than this estimated area. During the course of annealing, when the floorplan area
shrinks from the area of the "playing" field to smaller values, the floorplan aspect ratios
always remain within a fraction of the desired aspect ratios. Floorplans which have a
greater area than the estimated area can be packed into more compact layouts. On the
other hand, floorplans with areas smaller than the estimated area might imply some
residual overlaps. To account for this, the area estimator in the objective function is a
function of the difference between the floorplan area and estimated area.
The move set that we have chosen to implement perturbs the location of rectangles in
the floorplan. Such perturbations, as dictated by the move set, allow the rectangles to
overlap. Overlaps of modules in the fioorplan represent an infeasibility in the layout and
must be penalisedo The introduction of an overlap penalty function in the objective function
drives away overlaps during annealing. The TimberWolf package [’Sechen 84"1 uses a
simple overlap penalty function which is proportional to the square of the area of overlap
between modules. We have chosen to implement a more sophisticated overlap penalty
function, similar to the concept of a centre weighting function proposed in [Jepsen 83].
The motivation for implementing a centre weighting function comes from the fact that the
total overlap area is not a very good estimator of the overlap penalty since it does not
account for the position of the overlap with respect to the module. Overlaps confined to
the periphery of modules are less harmful than overlaps near their centers, but the simple
method based on total area of overlap evaluates both these cases identically. The method
we have implemented penalizes the overlap depending on its position with respect to the
module.
Our representation of an overlap weighting function consists of the construction of two
imaginary pyramids on the two overlapping modules. The bases of the pyramids are the
areas of the modules. The heights of the pyramids are equal and are user-defined. When two
modules overlap, the two pyramids intersect. To calculate the value of the centre-
weighting function, we roughly approximate the total intersected volume between the two
pyramids. This volume represents the overlap penalty. Fig.3-2 illustrates this center
weighting function for the overlap penalties.
This simple characterization yields a center weighting function which reflects the
overlap penalty more accurately. Small modules overlapping big modules are penalised
more accurately than with a simple overlap area measurement. The closer the overlap area
is to the centre of a module, the higher the overlap penalty. Consequently, overlaps are
repelled away from the centre of modules. Eventually, when annealing is complete very
few overlaps remain and the residual overlaps tend to be on the periphery of the modules
and not near the center. Fig.3-3 compares the two overlap penalty functions: the simple
overlap area estimator and our implementation of a centre-weighting function. Notice that
configurations C, D and E yield the same overlap penalty in the simple overlap area
function while the centre weighting function yields different overlap penalties. The overlap
Figure 3-2: Center Weighting Function for Overlap Cost Evaluation
penalty for configuration D is maximum due to the central overlap between the modules
while the peripheral overlaps of configuration C and E are penalised to a lesser extent. In
our implementation of the centre weighting function, when more than two modules overlap
with each other, all the pairwise overlap penalties are calculated and added to obtain the
total overlap penalty. The aggregate objective function is a simple weighted sum of the
values of the total wirelength, total area and the total overlap between modules. The
relative values of the weights attached to each aspect of the objective function are very
important. Biasing the weights towards one of the parameters yields solutions which are
optimal with respect to that parameter but non-optimal with respect to the others. These
weights must be carefully balanced so as to improve the final quality of the solution with
respect to all the parameters.
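The aggregate objective can be stated as a one-line weighted sum; the weight values shown are placeholders to be tuned per problem, as discussed above.

```c
typedef struct { double w_wire, w_area, w_overlap; } Weights;

/* Aggregate objective: weighted sum of the wirelength, area, and overlap
 * estimators.  Biasing any single weight optimizes that term at the
 * expense of the others. */
double objective(double wirelength, double area_term, double overlap,
                 const Weights *w)
{
    return w->w_wire    * wirelength
         + w->w_area    * area_term
         + w->w_overlap * overlap;
}
```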
Figure 3-3: Comparison of Overlap Penalty Functions (simple overlap area penalty function vs. centre weighting overlap penalty function, for configurations A through E)
3.2.3. Annealing Schedule
The choice of a good annealing schedule involves the determination of four essential
parameters: the starting temperature, a temperature reduction technique, a thermal
equilibrium criterion and finally the stopping criterion. For the annealing to proceed to a
globally optimal solution, the starting temperature must be sufficiently high for efficient
traversal of the search space, but not so high as to cause unnecessary and expensive
computation at high temperatures. We choose a starting temperature which is hot enough
to enable randomization of the system without unnecessary computation at high
temperatures. The algorithm dynamically determines the starting temperature for each
problem. A large number of random moves are initially proposed and evaluated. The
average value of the change in cost function due to these moves is determined. The
starting temperature is chosen such that a large percentage (~ 95%) of these moves
would be accepted. This method gives a very good estimate of the value of the starting
temperature.
The temperature of the system is lowered by a constant factor. This is implemented by
using a simple method where T_new = α × T_old (α is a constant less than 1). A value of α
greater than 0.95 gives a very conservative annealing schedule and can be very time
consuming; on the other hand, a value of 0.7 or less can result in quenching the system to
non-optimal solutions. After some experimentation we have chosen a value of 0.9 for α.
The criterion to decide thermal equilibrium is another aspect of the annealing schedule
which is highly empirical. Usually thermal equilibrium is said to be attained when a
sufficient number of moves have been tried to explore a large percentage of the search
space at that temperature. Typically, this is implemented by attempting a certain number
of moves per module. As the degrees of freedom of the problem increase more moves
must be attempted per cell to attain thermal equilibrium. Empirically, we have determined
that attempting 200 moves per module gives good results for the problems attempted.
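Putting the schedule together, a skeleton of the cooling loop looks like the following; the move body is elided, and the function simply returns the total number of moves attempted so the schedule's cost can be seen.

```c
#define ALPHA           0.9   /* geometric cooling factor */
#define MOVES_PER_CELL  200   /* attempted moves per module per temperature */

/* Skeleton of the annealing schedule: cool geometrically from t_start
 * down to the stopping temperature t_stop, attempting
 * MOVES_PER_CELL * n_modules moves at each temperature. */
long anneal_schedule(double t_start, double t_stop, int n_modules)
{
    long attempted = 0;
    for (double t = t_start; t >= t_stop; t *= ALPHA) {
        for (long i = 0; i < (long)MOVES_PER_CELL * n_modules; i++) {
            /* propose a move, evaluate the cost change,
               Metropolis accept/reject at temperature t */
            attempted++;
        }
    }
    return attempted;
}
```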
In keeping with our objective to implement a "no frills" floorplanner, this conservative
annealing schedule serves its purpose and produces good results. Many improvements
can be made to this conservative annealing schedule to speed up convergence; these have
not been implemented in the current version of PASHA but will be added in future
versions.
3.3. Performance Evaluation of the Serial Algorithm
We have implemented a serial version of PASHA, a floorplanner based on the annealing
algorithm discussed in previous sections. PASHA consists of approximately 3000 lines of
code written in C and runs under 4.2 BSD Unix. It accepts input in a very simple format,
similar to the one used by MASON [Lapotin 85]. Input consists of a list of alternative sizes
and shapes of the modules and a netlist. PASHA plots a picture of the final floorplan in
GKS format for GKS-supported graphic displays. Additional output routines are also
provided which enable the user to view intermediate configurations dynamically during
annealing.
To evaluate the quality of floorplans produced by PASHA, we have chosen three
benchmarks representing small, medium and large floorplanning problems. Benchmark A
is a small floorplanning problem with 20 modules. A medium size problem with about 40
modules is Benchmark B. Benchmark C contains 60 modules and is a large problem
obtained from industry. The weights associated with each aspect of the objective function
are tuned for each benchmark to obtain the best solution. The values of the
objective function and the CPU time are measured at the termination of annealing. Fig.3-4
shows the final floorplans for the different benchmarks. It must be mentioned that these
solutions can be further tuned, and they are presented here to demonstrate the fact that
PASHA performs reasonably as a "no frills" floorplanner. It can be observed that there are
a lot of residual overlaps in the solution for Benchmark C. Benchmark C has modules
ranging in complexity from a complete RAM to a single inverter. Consequently, modules
have widely varying sizes. Center weighting does not seem to compensate for this
problem perfectly, although we conjecture it probably works better than the simpler
overlap schemes. Typical floorplanning problems have modules with more similar
complexity and the serial version of PASHA with centre weighting is able to tackle such
problems fairly effectively. To illustrate this, we reduce the disparity of complexity among
modules in Benchmark C by combining several closely connected lower-level modules
into fewer high-level modules. This modified version yields better solutions. The modified
Benchmark C contains 32 modules of approximately equal complexity. The final solution
of this modified Benchmark is shown in Fig.3-4.
We have used MASON to compare the quality of solutions obtained by PASHA. However,
it must be noted that there are some factors which must be taken into account in making
this comparison. First, both MASON and PASHA have an extensive set of different tuning
parameters. One of the critical user-defined parameters in MASON is the relative use of
heuristic and exhaustive search methods. The wirelength metric used by MASON and
PASHA also differ: MASON uses a centre to centre approximation for the wirelength while
PASHA uses a bounding box approximation. Due to these factors, only rough comparisons
can be made between MASON and PASHA. Nevertheless, these comparisons are made to
demonstrate that PASHA gives reasonable solutions. Table 3-1 compares the wirelength
and area objective functions obtained by PASHA and MASON for the three benchmarks.
The wirelength of the final floorplan produced by MASON is processed to determine the
bounding box wirelength for the sake of comparison. It must be noted that definite
conclusions regarding the quality of solutions cannot be drawn by comparing these
values. Nevertheless, this comparison serves to establish that the solutions of PASHA are
reasonable and of comparable quality to those produced by another floorplanning tool.
It can be seen from Table 3-1 that PASHA gives solutions of comparable quality with
those produced by MASON. However, PASHA is very slow compared to MASON. MASON
uses a slicing tree approach and is, consequently, very fast. Unlike PASHA, MASON also
performs global routing of the final floorplan. On the other hand, the main advantage of
PASHA over MASON is its flexibility. It is easier to add new constraints to the objective
function in PASHA than in MASON. In addition, sometimes the most compact and optimal
floorplans cannot be represented as slicing trees. Due to the slicing tree approach,
Figure 3-4: Final Floorplans produced by PASHA (Benchmark A: 20 modules; Benchmark B: 38 modules; Benchmark C: 66 modules; Modified Benchmark C: 32 modules)
Benchmark                     Wirelength            Area
                            MASON    PASHA      MASON    PASHA
Benchmark A (20 Modules)     4538     4216      84882    81900
Benchmark B (40 Modules)     4970     4594      60800    50176

Table 3-1: Comparison between PASHA and MASON
MASON cannot find floorplans that do not have a slicing structure, whereas PASHA can
reach such a solution. To demonstrate this, we set up a synthetic problem with 9 modules
and a known non-slicing optimal packing. MASON and PASHA were run on this problem.
PASHA obtains the optimal solution for this problem, which MASON cannot obtain. Fig.3-5
illustrates the optimal solution for the synthetic benchmark and the results of PASHA and
MASON.
Figure 3-5: MASON and PASHA Solutions for a Non-Slicing Structure (optimal packing, MASON result, and PASHA result for modules 0 through 8)
We have discussed the serial implementation of PASHA in this chapter. This
implementation of an annealing algorithm for floorplanning is used as a vehicle in our
studies of parallel strategies for simulated annealing on a hypercube.
Chapter 4
Parallel Floorplanning Algorithms
This chapter deals with parallel simulated annealing algorithms for floorplanning. These
algorithms have been targeted towards implementation on a multiprocessor with a
message passing architecture, in particular, a hypercube. In this chapter we propose three
partitioning strategies for floorplanning by annealing on a hypercube. Details of these
strategies are discussed, along with a critical evaluation of their advantages and
disadvantages. We propose some approaches which modify the basic annealing algorithm
to create a greater degree of parallelism. The additional parallelism, achieved by the
introduction of error in move evaluation, is exploited for faster execution. We begin with a
brief review of hypercube architectures. This is followed by the discussion of uncertainty
in move evaluation caused by parallel move evaluation. The final sections describe three
proposed parallel floorplanning algorithms.
4.1. Hypercube Architecture
All our parallel annealing algorithms have been targeted towards a hypercube
multiprocessor. A typical hypercube is a distributed-memory, message-passing
multiprocessor: all the processors have local memory, and they synchronize their
computation by sending messages among themselves through an interconnection
network [Seitz 85]. The topology of the interconnection network is that of a hypercube,
where the nodes of the hypercube correspond to the individual processors and the edges
correspond to the message links between them. A hypercube of d dimensions consists of
2^d nodes. The nodes are tagged with binary coded integers from 0 through 2^d - 1. Two nodes
whose tags differ by exactly one bit are connected by a link, and since the tags are bit
strings of length d every node has exactly d links to other nodes. The nodes send
messages to adjacent nodes through these links. Messages sent between non-adjacent
nodes are routed through intermediate nodes until they reach their target node. Efficient
routing algorithms exist which route messages in such a way that the path length of the
message route is equal to the number of bits in which the binary tags of the source and
target nodes differ. For example, the binary tags of a source node and a target node in an
N-dimensional hypercube cannot differ by more than N bits and hence the maximum path
length for a message in this case is N. The routing is not guaranteed to be commutative,
i.e., a path from node i to node j is not necessarily the same as the path from j to i. Many
paths exist between any two nodes. These additional paths can be utilised to increase the
communication bandwidth or to enhance the fault tolerance of the hypercube. A simple
2-cube consists of four processors. A hypercube of any desired dimension can be
constructed with two hypercubes of the immediate lower dimension by connecting their
corresponding nodes. The number of interconnection links per processor, therefore,
grows only logarithmically with the number of processors. This is an advantage of the
hypercube topology, since a large number of processors can be used without prohibitively
complex interconnection networks. Fig.4-1 illustrates the topology of 2, 3 and 4-
dimensional hypercubes. Another advantage of the hypercube topology is that numerous
other network topologies such as trees, meshes, and rings can be easily mapped onto
hypercubes.
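The adjacency and routing-distance rules above reduce to a Hamming distance computation on the node tags; a minimal sketch:

```c
/* Number of bits in which two node tags differ (Hamming distance).
 * This equals the minimum routing distance between the two nodes. */
int hamming(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    int bits = 0;
    while (x) { bits += x & 1u; x >>= 1; }
    return bits;
}

/* Two hypercube nodes are joined by a link iff their tags differ in
 * exactly one bit. */
int adjacent(unsigned a, unsigned b) { return hamming(a, b) == 1; }
```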
Figure 4-1: Topology for 2, 3 and 4-Dimensional Hypercubes
4.2. Uncertainty in Parallel Move Evaluation
When moves are evaluated in parallel, each move cannot predict the changes caused by
other moves being concurrently evaluated. Sometimes it is possible for the parallel moves
to attempt to move the same object. Ambiguities in updating here must be resolved by
arbitrarily accepting one of the parallel moves while discarding all other parallel moves
which attempt to move the same object [Kravitz 86a, Kravitz 86b]. This results in wasted
move computations. One way of circumventing this problem is to use mutual exclusion
during the move generation stage; locking of objects serves the purpose of mutual
exclusion and prevents multiple parallel moves from moving the same object [Casotto 86].
To implement the locking arrangement, we use the concept of ownership of objects. Each
processor is allowed to generate, evaluate and update moves pertaining only to those
objects on which it has ownership rights. Ownership rights of a processor are unique and
each module has an exclusive owner.
There is another type of ambiguity associated with parallel move evaluation. The
evaluation of a move on a processor relies on the local state, i.e., the state of the system
as seen by that processor, to determine the cost function. Though the local evaluation is
correct, globally the move evaluation could be erroneous. This error is the cause of
uncertainty in parallel move evaluation. Grover [Grover 86] has theoretically examined the
effects of error in move evaluation on the convergence of simulated annealing algorithms.
From the equations of statistical mechanics, Grover [Grover 86] shows that as long as
the magnitude of error in the energy function, ΔE_max, is less than the temperature, the value
of the partition function, a measure of the state of the system, is not changed significantly.
An important conclusion from this is that the state of the system can tolerate errors as
long as the maximum error is less than the value of the temperature. This property allows
us to introduce errors into simulated annealing without affecting the convergence
properties of the algorithm. Clearly, the tolerance to errors is dynamically dependent on
the temperature: large errors can be tolerated at high temperatures but low temperatures
allow very small error tolerance.
This property of error tolerance is critical in the partitioning strategies we have
developed. To reduce message traffic while doing parallel moves, our strategies employ
long sequences of relatively fast, but possibly erroneous moves before global updates are
made. This property gives us a good idea about the error that can be introduced in the
annealing without loss of convergence.
4.3. Partitioning Strategy 1: Static Parallel Algorithm
In this section we present the first of the three partitioning strategies that we have
developed for PASHA. The parallel moves scheme presents itself as a fairly simple
strategy to exploit parallelism in annealing and forms the basis of this strategy.
Independent moves are performed in parallel by processors. Every parallel move is
accepted or rejected independent of the other moves. The processors update their state
depending on the moves they perform, completely independent of the moves in progress
in other processors. The state of the system in PASHA refers to the locations of modules
and the wirelengths of the nets connecting these modules. After a set of parallel moves is
performed on any processor, and the necessary updates made, the state of the system in
each processor is no longer identical. The next set of parallel moves are evaluated with
respect to different states in different processors. The state in each processor changes
with each accepted move and becomes more and more out of step with the states in the
other processors. This situation cannot be allowed to continue forever since the error in
the move evaluation successively increases; e.g., eventually every module will have been
moved, and no processor will have even an approximately correct state. To remedy this
situation and restore the integrity of the state among the processors, a global update is
performed. This update correlates the states from each processor and determines a new
global state which is then relayed to all the processors. Following each global update
phase all processors asynchronously continue parallel move evaluation.
To perform global updates, each processor sends a copy of its state to some
synchronizing processor. The synchronizing processor calculates the true global state
from the information about changes in the states done in each processor and then sends a
copy of this global state to each processor. Since this process of updating entails
messages between every processor and the synchronizing processor, the volume and
density of message traffic involved in a global update is heavy. The actual process of
determining the global state from the individual states of the processors is done using the
concept of object ownership as discussed previously. For floorplanning in PASHA, the
objects which are modified directly by moves are modules. Each processor owns a set of
modules. Ownership of a module gives a processor exclusive rights to move that module.
This implies that the change in the state in each processor is entirely due to modules
which a processor owns. This fact is used by the synchronizing node in determining a
global state of the system during a global update. Besides updating the global state of the
system, the synchronizing node also performs several other chores such as reallotment of
ownerships among modules, determination of equilibrium, and evaluation of the stopping
criterion. Note that swaps between two modules can be performed by a processor only
when both the modules being swapped are owned by that processor. Consequently,
random reallotment of ownership is performed after each global update. This enables any
two modules to be swapped eventually. If reallotment of module ownerships is not done,
the effectiveness of swaps to provide large perturbations necessary to climb out of local
minima is reduced. Our algorithm simply randomizes the ownerships of modules after
every global update. However, instead of simple random reallotment of ownership,
changes of module ownership also appear in the parallel macrocell algorithm of [Casotto
86] in such a way that strongly connected modules tend to be owned by the same
processors. This method is chosen to coerce strongly interacting modules to be owned
by the same processor during the "freezing" stages of annealing. This reduces the
number of interacting moves across processors and, consequently, reduces the error in
move evaluation at low temperatures.
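The ownership rule makes the synchronizing node's merge step simple: since only a module's owner may have moved it since the last update, the true global state takes each module's geometry from its owner's local copy. A sketch, with illustrative data structures (`local[p][m]` is processor p's copy of module m; `owner[m]` names m's owning processor):

```c
typedef struct { double x, y, w, h; } Rect;  /* module geometry */

/* Merge performed by the synchronizing node during a global update:
 * for each module, take the state reported by its owning processor. */
void merge_global_state(Rect *global, Rect **local,
                        const int *owner, int n_modules)
{
    for (int m = 0; m < n_modules; m++)
        global[m] = local[owner[m]][m];
}
```

The merged `global` array is then broadcast back to all processors, restoring a consistent state before the next sequence of parallel moves.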
Global updates result in considerable communication overheads due to the message
traffic. All processors except the synchronizing processor are idle during the global
updating which wastes computing resources. It is, therefore, desirable to reduce the
frequency of global updates. In our proposed scheme a sequence of moves is performed
by each processor before a global update. This makes global updates less frequent, but
also introduces more error into the computation since fewer updates mean that each
processor sees a less correct state of the system. The number of moves which each
processor performs before a global update is a crucial factor in determining the error in
evaluation and, consequently, the convergence of the algorithm.
To exploit an additional source of parallelism, we also introduce functional move
decomposition while evaluating individual moves. The task of a move evaluation is shared
between two processors. In every such pair of processors, one processor proposes a
move, evaluates a small part of the move, and decides whether the move is accepted, while
the complementary processor in the pair performs the remaining subtasks of move
evaluation. Since the first processor controls the move evaluation by actually proposing
and accepting moves it is referred to as the master processor. The complementary
processor is called the slave processor. During a global update it is the master processor
which sends out the local state to the synchronizing processor and receives the updated
global state. This global state is subsequently passed on to the slave processor.
We divide the entire hypercube into pairs of adjacent nodes. Each master-slave pair is
chosen to reside on physically adjacent nodes of the hypercube. Since they form a unit of
move computation, the message traffic between the two processors in the pair is relatively
high and keeping them as adjacent nodes reduces the communication overhead. The
hypercube topology enables us to divide the cube into such pairs of adjacent processors.
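Because two tags that differ in exactly one bit label adjacent hypercube nodes, pairing each node with the node whose tag differs in the lowest bit always yields adjacent pairs. A sketch of this convention (the function names are illustrative, not from the PASHA source):

```c
/* Pair each hypercube node with the node whose binary tag differs in
 * the lowest bit; such nodes are always physically adjacent. By
 * convention the even-tagged node of each pair acts as master and
 * its odd-tagged neighbour as slave. (Illustrative sketch.) */
int partner_of(int tag) { return tag ^ 1; }
int is_master(int tag)  { return (tag & 1) == 0; }
```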
We also chose not to employ a separate synchronizing node. Instead, one of the master
processors does double duty as a master processor as well as a synchronizing node. The
processor pairs are static and do not vary during execution. Moreover, the functional
subtasks performed by the processors also remain static and do not change during
annealing. Hence, we refer to this strategy as the static parallel algorithm. Fig.4-2
illustrates the static parallel algorithm on a 3-dimensional hypercube in which each
master-slave pair performs N local moves before a global update.
[Figure: each master-slave pair completes n moves; the synchronizing node then updates the global state.]
Figure 4-2: Static Parallel Algorithm on a 3-Dimensional Hypercube
4.4. Partitioning Strategy 2: Simple Pipeline Algorithm
Pipelining is a fundamental strategy used in increasing the throughput of any task. This
section examines pipelining with a similar objective: to increase the throughput of move
computation. We consider processors that are arranged logically and physically in a
contiguous sequence to form a pipeline. As a move propagates from the beginning to the
end of the pipeline, every stage performs a unique subtask of the move. The move
computation is completed in its entirety at the end of the pipeline. The first stage of the
pipeline proposes the move, the last stage of the pipeline performs the update
corresponding to the move, and the intermediate stages perform intermediate parts of the
move evaluation. Every move is broken into a number of functional subtasks and, hence,
this division of subtasks among the stages of the pipeline is a functional decomposition.
After a set of moves has been completed, only the last stage in the pipeline has the
updated state.
Due to the decomposition of the moves into smaller subtasks the computation time of
each subtask is small. Typical subtasks of a move in PASHA take a few milliseconds,
whereas the communication time for a message between adjacent nodes can be a
millisecond. Hence, the communication time between adjacent processors in the pipeline
can become comparable to the computation time of a subtask in a stage. This increases
the message communication overhead. One method of reducing this overhead is to amortize the communication cost over several move computations by increasing the ratio of computation time to communication time. We achieve this by
grouping moves together while sending them through the pipeline. All the moves in each
group are evaluated before sending them to the next stage in the pipeline. This increases
the computation time while keeping the communications time essentially constant, thereby
reducing the communications overhead. Grouping moves is essentially a parallel moves
scheme since all the moves grouped together are evaluated independently of each other.
The length of a pipeline, in such a case where the stages perform functionally different
subtasks, cannot be increased indefinitely due to the coarse-grained parallelism of functional
decomposition. To utilize more processors, we propose to employ multiple, parallel
pipelines. Consequently, global updates must be performed to synchronize the individual
states of all the pipes. A single processor performs these synchronizing functions. Notice
that a multiple pipeline strategy is similar to the static parallel algorithm in the sense that
the pipelines each process a group of moves before globally updating all the states in all
the pipes. We employ the concept of ownership, similar to the Static Parallel algorithm, to
avoid ambiguous moves. Modules are owned by pipes, instead of individual processors.
One of the main factors in deciding the length of each pipeline is that it must be a power
of 2 to enable a clean topological division of the hypercube into an integral number of
pipes. This restriction on the length of the pipeline also makes it possible to find adjacent
processors corresponding to adjacent stages in the pipeline. The message traffic between
adjacent stages in the pipeline is especially heavy and such an arrangement reduces the
communication overhead. If pipeline lengths of 2^k are used then an arrangement is always
possible which enables the adjacent stages of the pipeline to be topologically adjacent
nodes. A move computation in PASHA can be broken into roughly four functionally
different subtasks: move proposal and updating, wirelength evaluation, overlap evaluation
and area evaluation. Consequently, we have used a pipeline with 4 stages in our pipeline
algorithm. The first stage proposes a group of moves, evaluates the wirelength parameters
for all these moves in the group and sends these moves off to the second stage. The
second stage evaluates the change in overlaps due to these moves while the first stage is
proposing the next set of moves. In a similar manner the move propagates through the third
stage of the pipeline to the fourth and final stage where the move is accepted or rejected
followed by necessary updates in the system state.
An interesting observation about such a pipeline system is that there is no direct communication between the last and first stages of the pipeline. Due to this lack of direct
communication, update information present in the final stage of the pipeline does not pass
to the other stages in the pipeline. As a result of this, some of the moves proposed by the
first stage, which are perfectly legal with respect to its copy of the system state, become
illegal with respect to the system state of the last stage. In our algorithm such moves are
rejected, regardless of their cost. As we shall describe in the next
section, we have proposed certain modifications which allow a mechanism for updating
the stages of the pipeline with the state changes made in the last stage. Fig.4-3 illustrates
the pipeline algorithm for a four dimensional hypercube.
[Figure: each pipeline performs n moves; the synchronizing node then updates the state.]
Figure 4-3: Pipeline Algorithm for a 4-Dimensional Hypercube
4.5. Partitioning Strategy 3: Modified Pipeline Algorithm
The two previously discussed parallel algorithms have one common drawback: the
global update phase in each of the algorithms represents a serial bottleneck in an
otherwise parallel algorithm. We have attempted to address this problem in our design of
another partitioning strategy: the Modified Pipeline algorithm. This is the most complex
partitioning strategy we have adopted for PASHA. This algorithm, as the name suggests, is
similar to the pipeline scheme just discussed. However, some important distinctions exist
between this algorithm and the simple pipeline algorithm. Topologically, this algorithm is
structured in such a way that the last stage of each pipeline communicates with the first
stage of the pipeline. Moreover, individual pipes are arranged in such a way that the
interconnection between neighbouring pipes itself forms a ring. The ring connection
between individual pipes enables the pipes to pass state update information among
themselves, neighbour to neighbour, to reduce the number of global updates.
An object decomposition is used to split the move computation across the different
stages of the pipeline. Unlike the simple pipeline strategy, each stage in this algorithm
owns a set of objects like nets and modules. Each stage evaluates the contribution of
each of its owned objects to the move. Moves propagate through the pipeline and are
completely evaluated when they reach the last stage of the pipeline. The first stage
proposes the move and computes some part of the move before passing it on to the next
stage. After it passes on the move information to the next stage it immediately begins the
process of proposing another move. Each stage in turn computes a part of the move and
passes the information to the next stage and waits for the next move to be sent to it from
its previous stage. When the move reaches the last stage it has been completely
evaluated. The last stage then decides whether the move is accepted. If the move is
accepted, the last stage updates the state of the system and, unlike the simple Pipeline
algorithm, passes this updated state to the first stage of the pipe. This small detail differs
from the previous pipeline algorithm where there is no mechanism, local to the pipe, to
communicate the updated system state to the other stages of the pipe. Notice that this
updated system state which is sent back to the first stage of the pipeline "percolates"
through the rest of the pipe along with new moves. This percolation of the updated state
ensures that the error in a move evaluation is never worse than that caused by a few delayed updates. The delay, obviously, is the time it takes for the state update caused by an accepted move to travel from the last stage of the pipe to every other stage in the pipeline: the length of the pipeline.
As explained in the previous section, the length of the pipe must be a power of 2.
Moreover, since our algorithm uses object decomposition to partition the move
computation across the stages of the pipe, the number of stages in the pipeline is a
function of the number of objects. Unlike the functional move decomposition across the
stages in the simple pipeline algorithm, an object decomposition enables the division of an
entire move computation into many fine-grained subtasks. However, the number of
subtasks which a move can be decomposed into depends on the number of objects
involved in the problem. Consequently, the length of the pipeline cannot always be
increased when more processors are used. In such situations multiple pipes are used. The
presence of multiple pipes forces the need for a global update phase which synchronizes
the different streams of annealing in each of the pipes. In the global update phase of the
Static Parallel and Pipeline algorithms, messages are sent to the synchronizing node by all
nodes performing moves, after which the synchronizing node transmits messages
containing the updated state back to these nodes. The global update phase, therefore,
presents a serial bottleneck which must be avoided to improve performance. We propose
a new technique for global update which essentially distributes the process of a global
update among the processors and partially mitigates the serial bottleneck. In this approach
there is no complete global update in the strictest sense of the term. Instead, we replace
some of the synchronized global updates with a partial update that is distributed in this
sense: all nodes are not updated simultaneously and the update information may be
slightly stale by the time it reaches all the other nodes. Instead of updating all other
pipes, each pipe updates only its neighbour pipe. This kind of global update is always
incomplete. As discussed earlier, we have structured the pipes such that they form a ring
by themselves. Updating neighbour pipes which are interconnected to form a ring ensures
that eventually the state change in a pipe percolates to all the other pipes. For example,
after n updates (n being the number of pipes) the update information reaches all the other
pipes. We call this type of updating delayed global updating or lazy updating. Such a
partial updating mechanism implies that there is never a complete synchronization
between the multiple streams of annealing. The advantage of lazy updating is that it is
faster than synchronized global updates since no global synchronization messages are
sent and only local update messages are sent. However, in the absence of a synchronized
update the state as seen by the processors may become completely out of step. This can
cause a runaway effect on the magnitude of the error in move evaluations. To avoid this
runaway effect a mechanism for a complete global update is also implemented and the
ratio of complete global updates to the number of lazy updates is user-defined. This ratio
can be chosen as a tradeoff between the serial bottleneck of a global update and the
runaway of the error in move evaluation. Fig.4-4 illustrates the percolation of update
information in lazy updating.
The topology of processors and their interconnections as required by this algorithm can
be very easily mapped onto a hypercube. The hypercube topology allows us to structure
the pipes in such a way that the pipes themselves form a ring. Since the last stage of every
pipeline communicates with the first stage, the topology of each pipeline should also be
such that it forms a ring interconnection. Fig.4-5 illustrates a way of structuring a 3-
dimensional hypercube network consisting of 8 processors into four pipes, each of length
2 where the pipes themselves are connected in a circular fashion.
[Figure: over time steps 1 through 8, each pipe generates a set of moves and partially evaluates them, while update information passes from pipe to pipe.]
Figure 4-4: Percolation of Update Information among Pipelines: Lazy Updating
Similar to the Static Parallel algorithm and the Simple Pipeline algorithm, the existence of
multiple pipes calls for an implementation for ownership of modules by pipes. However, in
the Modified Pipeline algorithm two levels of ownership can be distinguished. The first
level of ownership of modules is among the pipes; individual pipes possess ownership
rights over individual modules. This level of ownership, which determines which modules a
pipe can move, is referred to as the move-proposal ownership. In addition, another set of
ownership rights exists among the individual stages of a pipeline. These rights determine
[Figure: logical structure showing connections within a pipe, connections between pipes, and unused connections, together with the actual physical structure on a 3-dimensional hypercube.]
Figure 4-5: Topology for the Modified Pipeline Algorithm on a 3-Dimensional Hypercube
the objects each stage is responsible for during move computation. This level of
ownership is referred to as the move-computation ownership. The move-computation
ownership includes both nets and modules, unlike the move-proposal ownership which
includes only modules. The move-proposal ownership must be dynamic during annealing
to enable complete exploration of the search space, while the move-computation
ownership among the stages of the pipeline need not be dynamic. Fig.4-6 illustrates the
modified pipeline algorithm for a four stage pipeline decomposition in a 4-dimensional
hypercube.
4.6. Comparison of Partitioning Strategies
Table 4-1 compares and contrasts our proposed partitioning strategies with respect to three components: how individual moves are decomposed into parallel subtasks on
cooperating processors; how complete, independent parallel moves are attempted; and
how state update information is distributed to keep uncertainty about the current system
state within acceptable limits.
Figure 4-6: Modified Pipeline Algorithm for a 4-Dimensional Hypercube
Static Parallel Algorithm

Move-Decomposition:
1. Functional decomposition of moves.
2. Two processors cooperating for a single move.

Parallel-Moves:
1. Multiple processor pairs, with each processor pair evaluating individual moves.
2. Groups of moves performed by each processor before a global update.

Update:
1. Synchronized global update by a single node: node 0.
2. Local updating done in the master processor. The slave processor is updated at the next evaluation.

Simple Pipeline Algorithm

Move-Decomposition:
1. Functional decomposition of moves.
2. Four processors per pipeline cooperating for a single move evaluation. Fixed pipeline length.

Parallel-Moves:
1. Multiple pipelines, each evaluating individual moves.
2. Groups of moves performed by each pipeline before a global update.

Update:
1. Synchronized global update by a single node: node 3.
2. Local updating done only in the last stage of the pipeline. Updates not passed to other stages.

Modified Pipeline Algorithm

Move-Decomposition:
1. Object decomposition of moves.
2. Pipelines of varying length (restricted to powers of 2). Processors in a pipeline cooperate for a single move evaluation.

Parallel-Moves:
1. Multiple pipelines, each generating a move. Besides move-generation ownership of modules, ownership of nets and modules for move-computation.
2. Groups of moves performed by each pipeline before a global update.

Update:
1. Lazy updating: each pipeline updates its successive neighbour pipe in the ring and is updated, in turn, by its previous neighbour.
2. Global updates provided to synchronize the states between pipelines; performed after several lazy updates have been made to ensure convergence.
3. Local updates within a pipeline are done in the last stage. Updates are passed back to the first stage.

Table 4-1: Comparison of Partitioning Strategies
Chapter 5
Parallel Implementation
This chapter examines implementation issues for the serial and parallel versions of
PASHA. The parallel programming environment in which all the parallel algorithms have
been implemented is reviewed. Other pertinent implementation issues, such as message
passing mechanisms, parallel programming details, data structures, and debugging in a
parallel environment, are discussed.
5.1. Parallel Programming Environment
All the parallel algorithms of PASHA have been implemented on an Intel iPSC hypercube.
In this section, we shall briefly review the hardware and software of the iPSC hypercube.
This is followed by discussion of message passing mechanisms on the iPSC.
5.1.1. iPSC Hardware and Software
The Intel iPSC is a commercially available parallel computer system with a hypercube
architecture. Individual processors on the nodes of the hypercube are Intel 80286
processors, each with Intel 80287 numeric processing units and 512Kb of memory. The
iPSC machine on which PASHA is implemented is a 4-dimensional hypercube (16
processors). This machine is a large memory version with 4Mb of memory per node. An
Intel 82586 communications coprocessor takes care of most of the communications while
iPSC (Intel Personal SuperComputer) is a trademark of Intel Corp.
reducing the communications overhead on the main node processor. The host machine
(called the cube manager) is both a control processor and a user interface to the nodes.
The host machine is an Intel 310 microcomputer running the Xenix operating system.
Adjacent nodes on the hypercube are connected by bidirectional communication links. All interprocessor communication is performed by message passing over these links. There
is an exclusive communication link to the host processor from each of the node
processors.
A typical application on the hypercube consists of two different programs: the host
program and the node program. Both the programs are compiled on the cube manager and
linked to different sets of libraries. The host machine first loads the node-kernel, followed
by the object-code on each node of the hypercube. Once the object-code is loaded on
the nodes, the nodes commence execution asynchronously, coordinated and
synchronized by messages from other nodes. These messages either contain data or
control information. After completion, results are communicated to the user by shipping
them back to the host, where necessary I/O is performed. Hypercubes of all dimensions
have essentially the same topology; consequently, typical applications are written in such
a way that they can be scaled to higher dimensions at runtime.
5.1.2. iPSC Interprocessor Communication Mechanisms
All communication between the nodes of the hypercube is done solely by messages. A
message consists of a string of bytes in a message buffer. Messages are limited in the
iPSC to a maximum length of 16K bytes. Messages which are longer than 1K are
automatically split into chunks of 1K, transparent to the user, and transmitted to the
destination processor. We notice that small messages have a high communication cost
per byte. Beyond a length of 1K the communication cost sharply rises and then falls again.
This can be explained by the extra communication overhead incurred when messages,
larger than 1K bytes, are split into chunks of 1K for separate transmission. To illustrate
this, we performed a very simple experiment. Messages of user-specified length are
passed around the nodes in the form of a ring. The elapsed time for the message beginning
from the time it is generated to the time it gets back to the original node after completing a
full circle is measured. Fig.5-1 shows the results of this experiment. As explained earlier,
the sharp increase when the message length is 1K can be noticed.
[Figure: communication time per message byte (in milliseconds) versus message length (in bytes, up to 2048), for a ring of length 4.]
Figure 5-1: Message Communication Overhead of the Intel iPSC Hypercube
Sending and receiving messages by a processor must be done by executing specific
software routines. These routines can be blocking or non-blocking. A blocking routine
waits for the completion of the operation (transmission or reception) it initiated before
returning to its calling process. On the other hand, non-blocking routines simply initiate
message transmission or reception and return to their calling process without waiting for
the completion of the initiated task. Care must be taken while using non-blocking routines:
it is possible to overwrite the message buffer and corrupt its contents before
the initiated message transmission/reception is complete. We use non-blocking routines
wherever possible to obtain parallelism in execution.
5.2. Parallel Implementation Details
This section describes the implementation details of the parallel versions of PASHA.
Some efficient message passing patterns are discussed. This is followed by a discussion
of some common concerns in controlling message traffic. Data structures, used for the
serial and parallel version of PASHA, are described. Finally some mechanisms used for
debugging in a parallel environment are mentioned.
5.2.1. Efficient Message Passing Patterns
In this section we shall discuss an efficient message passing pattern to perform the
global update phase which is an essential aspect of the three parallel implementations.
During global update all node processors send a message to a single synchronizing node
which sequentially receives these messages and then, after necessary updating, transmits
the updated system states back to each node by executing a set of sequential sends.
Global update is an O(n) operation, where n is the number of nodes in the cube. To expedite this process, we use the concept of a broadcast tree, which accomplishes the task of sending messages from one node to all nodes in the cube in O(log n) time.
If the binary tag of a node is x, then its neighbours in an n-dimensional hypercube can be determined by the following simple equation:

    x ⊕ 2^i   for i = 0, 1, 2, ..., n-1
A broadcast tree with its root at node 0 has a very simple algorithm [Brandenburg 86]. Every node sends the message only to those neighbours with tag x ⊕ 2^i such that 2^i > x.
Every node, except the root node 0, has exactly one parent from which it receives the
message and sends it to its children. Fig.5-2 illustrates a broadcast tree for a 4-
dimensional hypercube with the root at 0. The table in Fig.5-2 shows the exact sequence
in which the messages are sent according to this broadcast tree. The topology of a
Broadcast Tree with Root at Node 0

Time Step 1: 0→1
Time Step 2: 0→2, 1→3
Time Step 3: 0→4, 1→5, 2→6, 3→7
Time Step 4: 0→8, 1→9, 2→10, 3→11, 4→12, 5→13, 6→14, 7→15

Figure 5-2: Sequence of Messages for a Broadcast Tree
broadcast tree for a particular dimension remains the same and a broadcast tree with a
root other than node 0 can be easily derived from the broadcast tree with root 0.
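The forwarding rule can be sketched as a small routine that, for a node with tag x on a d-dimensional cube, computes the children to which the message must be forwarded (an illustrative sketch, not the PASHA source):

```c
/* Destinations to which node x forwards a broadcast rooted at node 0
 * on a d-dimensional hypercube: every neighbour x ^ (1 << i) with
 * (1 << i) > x. Returns the number of destinations written to dest[].
 * (Illustrative sketch.) */
int broadcast_children(int x, int d, int dest[])
{
    int i, n = 0;
    for (i = 0; i < d; i++)
        if ((1 << i) > x)
            dest[n++] = x ^ (1 << i);
    return n;
}
```

For d = 4 this reproduces the sequence in Fig.5-2: node 0 sends to 1, 2, 4, and 8, while node 3 sends only to 7 and 11.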
The topology of a hypercube allows us to organize the interconnection network of the
hypercube in a variety of ways depending on the application such as pairs of adjacent
processors for the Static Parallel Algorithm, and pipes/rings for the Pipeline and Modified
Pipeline Algorithms. Identification of these interconnection topologies is greatly facilitated by the use of a Binary Reflected Gray Code [McBryan 86]. This is a sequence of binary
numbers where each number differs from its neighbours by one bit in its binary
representation. The numbers in the BRGC are interpreted as processor numbers.
Consequently, processors with numbers adjacent in the BRGC sequence are physically
adjacent in the hypercube. The interconnection topology represented by the BRGC itself
is a ring since the BRGC sequence, by definition, wraps around, i.e., the last number in the
sequence is the neighbour of the first number in the sequence. Other interconnection
topologies can be deduced from this sequence. The Binary Reflected Gray Code (BRGC)
sequence used for our implementation is given in Fig.5-3 along with its representation of a
ring on the hypercube.
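The i-th BRGC entry has the well-known closed form i ^ (i >> 1), which makes both the ring and the adjacency test easy to compute (an illustrative sketch; the function names are not from the PASHA source):

```c
/* i-th entry of the Binary Reflected Gray Code: successive entries
 * differ in exactly one bit, so interpreting them as node tags
 * embeds a ring in the hypercube (the sequence wraps around). */
int brgc(int i) { return i ^ (i >> 1); }

/* Hypercube nodes are physically adjacent iff their tags differ in
 * exactly one bit, i.e. the XOR of the tags is a power of two. */
int hypercube_adjacent(int a, int b)
{
    int x = a ^ b;
    return x != 0 && (x & (x - 1)) == 0;
}
```

For a 3-dimensional cube this generates exactly the sequence of Fig.5-3: 0, 1, 3, 2, 6, 7, 5, 4, wrapping back to 0.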
5.2.2. Message Composition
All messages must be kept as short as possible for reasons discussed previously. To
ensure that messages are short, all excess message data must be reduced. This is done
by realising that some of the information about nets and modules is essentially static.
Once the information is separated into the static and dynamic categories only the dynamic
information needs to be sent as messages during annealing. The static information needs
to be passed just once at the beginning of the annealing. For example, the list of alternate
sizes that a module can take is static information, whereas the coordinates of the module are dynamic information.
Also, since message communication overhead is high, where we have the option, we
pack maximum information into a single message packet rather than send several
individual packets of information. Sending several unrelated pieces of data at the same
Binary Reflected Gray Code: 000 (0), 001 (1), 011 (3), 010 (2), 110 (6), 111 (7), 101 (5), 100 (4), wrapping around to 000 (0); interpreted as node tags, the sequence forms a ring on the hypercube.

Figure 5-3: Binary Reflected Gray Code and its Topology on a 3-Dimensional Hypercube
time is done simply by packing them into a single message. Messages are packed at the
source and unpacked at the destination. This keeps the number of messages transmitted
to a minimum and, consequently, reduces communication overhead.
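The packing itself can be sketched with a pair of helpers that copy items into a buffer at a running offset (an illustrative sketch of the idea, not the PASHA message format):

```c
#include <string.h>

/* Append len bytes from src into the message buffer at offset off;
 * returns the new offset. The destination node unpacks the items in
 * the same order with the mirror-image routine below.
 * (Illustrative sketch, not the PASHA message format.) */
size_t pack(char *buf, size_t off, const void *src, size_t len)
{
    memcpy(buf + off, src, len);
    return off + len;
}

size_t unpack(const char *buf, size_t off, void *dst, size_t len)
{
    memcpy(dst, buf + off, len);
    return off + len;
}
```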
5.2.3. Data Structures
There are three main data structures in the floorplanner: the modules structure, the nets
structure and the bin structure. The basic framework of data structures essentially
remains the same for both the serial and parallel implementations of PASHA with a few
minor modifications in each parallel implementations. A brief description of these data
structures is given below.
- Modules/Cells: Modules are basically rectangles represented by the following information:
    - The (x,y) coordinates of the lower left corner of the module.
    - The length and the width of the module.
    - An array of alternate sizes which the cell can assume, as well as the current dimensions.
    - A list of nets to which the cell is connected.
- Nets: This data structure stores the information about the nets needed for cost computation, and includes:
    - The coordinates of the bounding rectangle of the net, used to calculate the half-perimeter wirelength metric.
    - A list of modules to which the net is connected.
- Bin structure: The entire chip area is divided into a number of rectangular areas referred to as bins. Cells are represented by their four edges, and these bins keep track of the cell edges which fall in their area. The edges of the cells which lie in the area covered by a bin are sorted and stored within that bin in a circular, doubly linked list.4 The bins are implemented by splitting the entire area into vertical and horizontal strips; a two-dimensional array of pointers (vertical and horizontal) keeps track of the edges of the cells. Fig.5-4 illustrates the bin structure for a simple floorplan.
5.3. Debugging
Debugging the parallel versions of PASHA for the hypercube turned out to be a very
challenging task. All the applications were first developed on an Intel supplied hypercube
simulator running on a VAX 11/785 under 4.2BSD UNIX. The simulator uses UNIX process
creation and interprocess communication primitives to simulate processes on nodes and
communication between them. The first step in the implementation of the parallel versions
involved the implementation of the skeletal structure of the required message passing for
the particular strategy. This step enabled us to debug the message passing patterns and
to ensure, for example, that there were no potential deadlock situations. Once the
communication patterns were debugged, routines for move computation were added. Most
of these move computation routines were those used in the serial version of PASHA.
4 A version of PASHA uses a binary tree data structure instead of the doubly linked list, and will yield better performance when the average number of edges per bin is large.
Figure 5-4: Bin Data Structure
Once the application worked on the simulator, the next step was to port it to the iPSC. Debugging on the iPSC itself was minimal; primitive I/O techniques, such as printing data values into a common logfile, were used to debug the consolidated code. We note
that the absence of a distributed debugger on the hypercube is an inconvenience. The
following chapter discusses the results of the parallel implementations of PASHA.
Chapter 6
Performance Evaluation of Parallel Algorithms
This chapter evaluates the algorithms discussed in the previous chapters.
Multiprocessor experiments are performed and the results are analyzed. This chapter
begins with a brief discussion of the methodology adopted for evaluating the parallel
algorithms. This is followed by a presentation of the results which are then analyzed to
gain a critical insight into the behaviour of the parallel algorithms. The relative speedup
obtained in the parallel algorithms is presented. These algorithms are compared, with
respect to the quality of their solutions, to the serial algorithm. This is followed by a
discussion of the effect of parallelism on convergence.
6.1. Methodology
All the parallel versions of PASHA have been written in the C programming language and
have been implemented on the iPSC hypercube. The Static Parallel and the Simple Pipeline
algorithm implementations are each roughly 3000 lines of code, while the more
sophisticated Modified Pipeline algorithm consists of about 4000 lines of code.
We evaluate the comparative performance of these parallel algorithms on the small benchmark presented in Chapter 3 on page 44. This choice is motivated by the relatively short execution times required for this benchmark, which permit many experiments to be attempted. It is also, in a sense, a more challenging task than a larger benchmark, simply because it is harder to extract much parallelism from a small problem: we are annealing only 20 moveable objects on 16 processors.
The serial algorithm was tuned to a high degree for the small benchmark and these tuned
global parameters, such as the weights of the wirelength, overlap and area objective
functions, were used in the parallel algorithms.
Since annealing is a dynamic process, the time taken to perform all the moves at each temperature was recorded during the execution of all parallel annealing algorithms. A fixed number of moves is performed at each temperature; the number of moves attempted has been set to 200 times the number of modules in the problem, which was found to give good results. All the parallel algorithms are started from the same initial temperature as the serial algorithm. Their execution continues until the serial algorithm's stopping criterion, three consecutive temperatures with no change in the cost function, is satisfied. The quality of the solutions obtained is compared with the serial solution. The execution time was measured as elapsed time rather than CPU time. All the experiments on the hypercube were run with single-user status to reduce competition with other processes.
One common factor which is varied for all the parallel implementations is the number of
processors in the cube. Every algorithm has several parameters which were varied in the
course of the experiments to observe their effects on algorithm performance. For
example, the frequency of lazy updates in the Modified Pipeline algorithm is varied to
determine its effect on speedup and convergence. The number of processors and the
number of moves performed before a global update are parameters which decide the
magnitude of "parallelism" in the algorithm. A rough estimate of the "parallelism" is the
total number of moves performed by all the processors before a global update is made.
Comparison between different algorithms was performed by keeping this "parallelism"
constant. Speedup for an algorithm is a function of a variety of parameters; consequently, relative speedups are measured. We define relative speedup as the ratio of the elapsed time for the parallel algorithm running on the smallest allowable number of processors to the elapsed time for the parallel algorithm running on n processors, while keeping all the other parameters of the parallel algorithm the same.
6.2. Speedup Results
A small experiment on the serial algorithm measured the average amount of time spent in the different subtasks of accepting a move. A statistically large sample of moves was generated, evaluated and forced to be accepted. The time spent by each move in its different subtasks was then noted and averaged over the number of performed moves. The results of this experiment are shown in the pie-chart in Fig.6-1. Wirelength evaluation takes approximately 42% of the time, overlap penalty calculation takes about 32%, and area estimation takes about 3%. The remaining time is taken by the move proposal stage and the move updating stage. These times are measured for the small 20-module benchmark.
Dividing the move evaluation functionally, with different processors evaluating the wirelength, overlap and area objective functions, results in a heavily imbalanced decomposition.
Consequently, the Simple Pipeline algorithm that we have suggested is an ineffective
decomposition for this benchmark. However, the load balancing for the Static Parallel
algorithm seems to be fairly good since the master processor performs the move proposal
and wirelength evaluation tasks while the slave processor performs the overlap evaluation
and area estimation. Move updating is shared between the two; the master processor
updates the wirelength while the slave processor updates the overlaps and areas. Due to
its object decomposition of each move across the stages of a pipeline, each stage in the
Modified Pipeline algorithm performs every move evaluation subtask on the modules it
owns. The Simple Pipeline algorithm uses functional decomposition and, consequently,
different stages in the pipeline perform wirelength evaluation, area estimation, etc. Fig.6-1
clearly illustrates that load balancing for this case is not good. Consequently, we did not
perform any experiments on the Simple Pipeline algorithm.
Figure 6-1: Percentage of Time Spent in Each Move Task (Wirelength Evaluation 42%, Overlap Evaluation 32%, Move Proposal and Updating 23%, Area Evaluation 3%)
Fig.6-2 illustrates the execution times for Benchmark A with 20 modules using the Static Parallel algorithm. We vary the size of the hypercube, and vary the number of moves performed per processor before a global update. Notice that as the number of moves before a global update increases, the execution time decreases. This can be explained by the fact that the synchronization overhead of global updates is reduced by amortizing it over a large number of moves. An interesting point to note here is that applications with similar "parallelism", such as the 16-processor, 1-move-per-update case and the 8-processor, 2-moves-per-update case, have very similar execution times. We conclude that this simple measure of "parallelism", roughly defined as the product of the number of processors and the number of moves performed by each processor before a global update, is a fundamental factor in determining the speedup which can be obtained from the Static Parallel algorithm.
Figure 6-2: Execution Times for the Static Parallel Algorithm. (a): Total Execution Time (total time to floorplan, in minutes) and (b): Average Time per Temperature (in minutes), each plotted against the number of parallel moves per processor pair before a global update (1 to 9), for hypercube sizes up to 16 processors.
Fig.6-3 plots the execution times of the Modified Pipeline algorithm running the 20-module benchmark. The length of each pipeline is fixed at 1, and we vary the number of moves per pipeline before a lazy/global update. If we compare the execution times for the Static Parallel algorithm with identical parameters from Fig.6-2 with the Modified Pipeline algorithm with no lazy updates, we see that the Modified Pipeline algorithm is faster. This is due to the fact that unlike the Static Parallel algorithm, where two processors cooperate on a single move computation, the Modified Pipeline algorithm with a pipeline length of 1 has a single processor performing a complete move computation. This yields better load balancing between the processors and lower communication overheads and, consequently, faster execution times. Fig.6-3 also shows the effect of lazy updates on the
execution times. As expected, the execution time decreases with increasing number of
lazy updates, due to reduced frequency of costly global updates. Our design of lazy
updating was intended to reduce the communication overhead in global updating. In this
respect our algorithm has been quite successful, but it has been seen that complete lazy
updating does not allow convergence of the algorithm. There is a certain tradeoff between
the number of lazy updates to be used to reduce communication overheads and the
number of global synchronized updates used to preserve convergence. An interesting result is that as the number of lazy updates is increased, the execution times for large "parallelism" do not show significant improvement. However, increasing the number of
lazy updates shows significant improvement in the average time per temperature. This
anomaly can be explained by the fact that the introduction of error in annealing causes the
annealing to proceed through more temperatures before reaching an optimal solution. In
other words, with more lazy updates annealing at each temperature runs faster, but we
require more temperatures to reach the same stopping criterion.
Fig.6-4 illustrates the total execution times for the Modified Pipeline algorithm on the larger Benchmark B with 40 modules. Notice that in this example 16 processors are being used with pipeline length 1, and the maximum "parallelism" employs 8 moves in each pipeline before a lazy/global update. Even this large number of parallel moves does not appear to affect the solution drastically.
Figure 6-3: Effect of Lazy Updates in the Modified Pipeline Algorithm (8 processors, pipeline length 1). (a): Total Execution Time (in minutes) and (b): Average Time per Temperature (in minutes), each plotted against the number of parallel moves per pipe before a lazy/global update (1 to 4), for no lazy updates, a complete update after 1 lazy update, and a complete update after 3 lazy updates.
Figure 6-4: Total Execution Times for the Modified Pipeline Algorithm using Benchmark B (16 processors, pipeline length 1; total time to floorplan, in minutes, against the number of parallel moves per pipeline before a global update, with no lazy updates and with a complete update after 2 lazy updates).

Another experiment varied the length of the pipeline. It was seen that as the number of stages in the pipeline increased, execution times did not decrease as expected. In fact, the execution times increased almost linearly with larger lengths. This can be
attributed to improper load balancing between the stages of the pipeline. In particular, increased throughput in a pipeline scheme can be obtained only if the "filling" and "emptying" effects of pipelining are negligible relative to the total computation. These effects can be amortized only if a large group of moves propagates through the pipeline. Large groups of moves cannot be attempted in parallel for this benchmark since it consists of only 20 modules, so this benchmark is not suited to test this particular kind of parallelism. Fig.6-5 illustrates these results.
A larger fraction of the parallel moves attempted at high temperatures is accepted than at low temperatures. Consequently, the overhead involved in synchronizing the states after a global update must be high at high temperatures and should decrease as the temperature is lowered. This effect is illustrated in Fig.6-6, which plots the time taken per temperature for the Modified Pipeline
algorithm as a function of the temperature.

Figure 6-5: Execution Times for Different Pipeline Lengths (Modified Pipeline algorithm, 8 processors, 4 moves before a global update). (a): Total Execution Time (in minutes) and (b): Average Time per Temperature (in minutes), each plotted against the length of the pipeline (powers of 2 only).

Note that since the communication overhead is essentially constant over all temperatures, the reduction in the time per temperature
almost entirely reflects the reduction in updating overhead. The updating overhead
reflects both the updating within a stream of parallel moves and global updating between
different streams.
Figure 6-6: Variation of Time Taken per Temperature (Modified Pipeline algorithm, 8 processors; time per temperature, in seconds, plotted against temperature from 0.035 to 3500).
The relative speedup curves for the Static Parallel algorithm and the Modified Pipeline
algorithm are given in Fig.6-7. Note that for the Static Parallel Algorithm the smallest
allowable number of processors is 2.
6.3. Convergence Results
All this would be a futile exercise if it could not be demonstrated that the parallel implementations of PASHA do indeed converge to solutions of quality comparable to those obtained by the serial algorithm. Fig.6-8 tabulates the quality of results obtained for the different cases of the Static Parallel and Modified Pipeline algorithms. The quality of results is measured by the total wirelength, total area and the existence of residual overlap (i.e., module overlaps remaining at the end of annealing) as compared to the best results obtained from a tuned serial version of PASHA. Residual overlaps, if any, were always peripheral in nature and usually only between two modules. We note that the results are largely within 7-9% of the serial solutions. Due to its inherent statistical nature
Figure 6-7: Speedup for Parallel Algorithms (relative speedup against number of processors, with ideal linear speedup shown for reference). (a): Static Parallel Algorithm (6 moves before a global update, 4 to 16 processors). (b): Modified Pipeline Algorithm (6 moves before an update, 1 to 16 processors, with no lazy updates and with 1 lazy update).
the statistical sample size required for such a comparison ought to be larger, and comparisons should be made only between the distributions of results obtained from larger samples. Time constraints on this thesis forced us to use a very small sample size; nevertheless, the fact that the parallel solutions were reasonably close to the serial answers is very encouraging. In fact, in some cases they are even better than
the corresponding serial results. This can be attributed to the fact that parallel moves
enable the system to explore a greater breadth in the search space of solutions. This
observation is strengthened by the fact that small amounts of parallelism almost always
tend to give better solutions than the serial algorithm. For the sake of proper comparison, module sizes are not overestimated during annealing in either the serial or the parallel versions of PASHA.
The objective function that we use has three separate parts, and each of these components was observed to vary differently during annealing. The area reduces at early temperatures, followed a little later by the reduction in overlaps. Wirelengths tend to reduce last, during the final stages of annealing. This can also be
observed from Fig.6-8 where a majority of wirelength values for the parallel cases tend to
be higher than the serial value, indicating that the wirelength values tend to freeze out last
during annealing. Introducing error in annealing corresponds to the introduction of some uncontrolled hill climbing. Instead of using the entire cost function, we use the wirelength variation to illustrate this effect. The use of wirelength variation for this purpose is justified
by the fact that the variation in wirelength is analogous to the variation of the entire cost
function. Looking at the wirelength variation during annealing in the typical serial and
parallel cases shown in Fig.6-9, we can see that the parallel case fluctuates more than the
serial case before settling down to an optimal solution. This is equivalent to shifting the
entire temperature schedule towards lower temperatures. Temperatures in serial annealing
correspond to higher equivalent temperatures in parallel annealing.
Figure 6-8: Quality of Parallel Solutions. (a): Static Parallel Algorithm, tabulated by cube dimension and number of moves before a global update. (b): Modified Pipeline with Lazy Updates (8 processors), tabulated by type of update (complete global update, or complete update after 1, 2 or 3 lazy updates) and number of moves before an update. Each entry reports the parallel/serial wirelength ratio, the parallel/serial area ratio, and whether any residual overlap remained.
Figure 6-9: Wirelength Variation in a Serial and a Parallel Algorithm (wirelength plotted against temperature, from 0.0035 to 3500, for a typical parallel annealing curve and a serial annealing curve).
6.4. Summary
We have run several experiments on the Static Parallel algorithm and the Modified Pipeline algorithm. It was noted that to reduce the communication overhead it was essential to amortize this overhead over many move computations. Consequently, in cases where the number of moves before a global update was higher, execution times were better. As expected, for the Static Parallel algorithm the execution times improved with a greater number of processors.
The Modified Pipeline algorithm was faster than a Static Parallel algorithm running on the
same number of processors with the same degree of "parallelism". It was observed that
increasing the pipeline length in the Modified Pipeline did not reduce execution times for
this benchmark. This was due to the inability to extract large parallelism from this small
benchmark. The lazy updates in the Modified Pipeline reduced the average time per
temperature but did not significantly reduce the total execution time. A speedup by a factor of 4 was obtained with 16 processors for the Static Parallel algorithm, and a speedup by a factor of 6 was obtained for the Modified Pipeline algorithm on 16 processors with no lazy updates. Using lazy updates we obtained a speedup of 7.5 for the Modified Pipeline algorithm running on 16 processors.
All the parallel implementations yield solutions of high quality compared to the serial solutions. The number of parallel moves before a global update is crucial in determining the uncertainty in annealing and, consequently, the convergence of the
algorithm. In addition, it was found that lazy updates tend to disturb convergence to a
greater extent than global synchronized updates. The following chapter discusses the
conclusions and contributions of this thesis and identifies topics of future research in this
context.
Chapter 7
Conclusions
Annealing is a general purpose optimization method which holds great promise. The main
drawback of annealing is that it is computationally expensive. This research effort has
focused on the objective of accelerating annealing algorithms with the use of hypercube
multiprocessors. Floorplanning was chosen as a typical application of simulated
annealing. Several parallel algorithms were presented for partitioning the annealing
algorithm across the processors of a hypercube. We have shown that larger parallelism can be extracted by introducing a certain amount of error into the algorithm. We proposed some new partitioning strategies in PASHA which were implemented and tested on a 16-processor Intel iPSC hypercube. Two of these strategies were tested completely: Static
Parallel and Modified Pipeline. Results obtained show a very encouraging trend. A speedup
of 4 was obtained for the Static Parallel algorithm running on 16 processors. The Modified
Pipeline algorithm running on 16 processors yielded a speedup of roughly 6 when not
using lazy updates, while it gave a speedup of a factor of 7.5 with the use of lazy updates.
Solutions obtained by these algorithms are of comparable quality to those obtained in the serial case. This research opens up new directions towards which future work can be directed. The short-term goals prompted by this work include:
• Addition of some sophisticated annealing schedules to PASHA. We believe that the addition of sophisticated serial annealing schedules can easily improve the speed by a factor of 2.

• Evaluation of the performance of PASHA when running on other hypercube multiprocessors, such as the NCUBE hypercube, to determine the exact effect of communication time on the execution time of the algorithm.

• Minor tuning of the move evaluation tasks. This should lead to better load balancing in all the algorithms.

• Addition of new constraints to the floorplanning algorithm, such as orientation of pins on modules, bus constraints, etc.

• Improvement in the data structures to reduce move evaluation time. Already, the latest version of the serial PASHA incorporates some optimized data structures.
During the work with PASHA we have come across some areas where long-term goals can be focussed. Specifically, we feel that parallel annealing schedules form an area which deserves considerable investigation. Presently, efforts to parallelize annealing applications have used simple serial annealing schedules. Parallel simulated annealing differs in many respects from the serial algorithm, and annealing schedules which take the error caused by parallelism into account will greatly enhance the parallelism which can be exploited. Efforts must also be made to formalise parallel annealing algorithms with theoretical models. Such efforts will go a long way towards quantifying the effects of parallel moves and error on annealing.