Transcript of “Partitioning and Partitioning Tools” (graphics.stanford.edu/sss/barth_partitioning.pdf)
Partitioning and Partitioning Tools
Tim Barth
NASA Ames Research Center
Moffett Field, California 94035-1000 USA
1
Graph/Mesh Partitioning
• Why do it?
• The graph bisection problem
• What are the standard heuristic algorithms?
• What tools are available?
2
Why do it?
• Efficient utilization of distributed computational resources
– Equidistribution of workload among processors (load balancing)
– Minimized time spent in interprocessor communication
∗ Communication takes time, and it is not always possible to hide this latency in data transfer
∗ The cost of communication is often modeled by the linear relationship for n messages: Cost = Σ_{i=1}^n (α + β m_i), where α is the per-message latency, β the cost per unit of message length, and m_i the length of message i
(a) (b)
Figure 1: (a) Mesh partitioning with minimized number of messages, (b) Mesh
with minimized message length.
3
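As a concrete illustration of the cost model above, a short Python sketch; the α and β values below are made up for illustration, not measured machine constants:

```python
# Linear communication-cost model from the slide: each of the n messages
# pays a fixed latency alpha plus beta per unit of message length m_i.
# alpha/beta are illustrative placeholders, not real hardware numbers.

def comm_cost(message_sizes, alpha=1e-6, beta=1e-9):
    """Cost = sum_i (alpha + beta * m_i), one term per message."""
    return sum(alpha + beta * m for m in message_sizes)

# Same total data volume, packaged differently:
few_large = comm_cost([4000, 4000])   # 2 messages of 4000 units
many_small = comm_cost([100] * 80)    # 80 messages of 100 units
# The latency term alpha dominates when data is fragmented, so the
# many-small variant costs more despite moving the same volume.
```

This is why the figure contrasts minimizing the *number* of messages against minimizing the *length* of messages: which matters more depends on the α/β ratio of the machine.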
Why do it?
• As a strategy for reducing the overall arithmetic complexity of an algorithm
– Overlapping Schwarz methods
– “Divide and Conquer” methods, e.g. nested dissection of a matrix, Schur complement substructuring
– Multiscale methods, e.g. agglomeration multigrid
4
Why do it?
• Overlapping Schwarz methods
[Figure: global residual norm (log scale, 10⁰ down to 10⁻¹⁵) vs. Schwarz iterations (0–80) for a 2nd-order scheme, comparing 2 and 8 partitions with overlap 1, 2, and 3.]
5
Why do it?
• Overlapping Schwarz methods with subdomain size H, mesh cell size h, and overlap δ
Let A be the discretization matrix and M_as the additive Schwarz preconditioner. There exists a constant C independent of H and h such that the condition number κ satisfies
κ(M_as⁻¹ A) ≤ C H⁻² (1 + (H/δ)²). (1)
With a 2-level coarse space correction, there exists a constant C independent of H and h such that
κ(M_as⁻¹ A) ≤ C (1 + H/δ). (2)
6
Why do it?
• Substructuring
[A1 A2; A3 A4] [x1; x2] = [b1; b2],   A⁻¹ = [C1 C2; C3 C4]
with S = A4 − A3 A1⁻¹ A2 (the Schur complement of A1), and
C1 = A1⁻¹ + A1⁻¹ A2 S⁻¹ A3 A1⁻¹,  C2 = −A1⁻¹ A2 S⁻¹,  C3 = −S⁻¹ A3 A1⁻¹,  C4 = S⁻¹.
κ(M_Schur⁻¹ A) = C (1 + log(H/δ))
7
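The block-inverse identities on the substructuring slide can be checked numerically. A sketch using NumPy on a small random SPD matrix; the block sizes are arbitrary choices for the check:

```python
import numpy as np

# Verify the Schur-complement block-inverse formulas on a 6x6 SPD matrix
# split into 2x2 blocks (3+3 rows/columns, chosen arbitrarily).
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)          # SPD, so all principal blocks invert
A1, A2 = A[:3, :3], A[:3, 3:]
A3, A4 = A[3:, :3], A[3:, 3:]

A1inv = np.linalg.inv(A1)
S = A4 - A3 @ A1inv @ A2             # Schur complement of A1
Sinv = np.linalg.inv(S)

C1 = A1inv + A1inv @ A2 @ Sinv @ A3 @ A1inv
C2 = -A1inv @ A2 @ Sinv
C3 = -Sinv @ A3 @ A1inv
C4 = Sinv
Ainv = np.block([[C1, C2], [C3, C4]])

# The assembled blocks agree with the direct inverse up to round-off.
assert np.allclose(Ainv, np.linalg.inv(A))
```

The point of substructuring is that only S (the interface-sized system) must be solved globally; the A1 solves are local to subdomains.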
Graph Bisection (NP-hard)
Define a partitioning vector p ∈ Zⁿ which 2-colors the vertices of a graph
p = [+1, −1, −1, +1, +1, . . . , +1, −1]ᵀ (3)
[Figure: a graph whose vertices are labeled +1 or −1 according to the 2-coloring.]
• Minimize the cut-weight of the weighted graph
• Produce balanced partitions
8
Heuristic Graph Partitioning
Three commonly used partitioning techniques
• Recursive coordinate bisection
• Recursive Cuthill-McKee
• Recursive spectral bisection
9
Recursive Coordinate Bisection
• Spatial coordinates are sorted along alternating horizontal and vertical directions
• Divisors are found to balance partitions
10
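A minimal Python sketch of the two bullets above; the point set and recursion depth are illustrative, and the median divisor is the simplest balancing choice:

```python
# Recursive coordinate bisection: sort along alternating axes, split at the
# median so the two halves stay balanced, and recurse.

def rcb(points, depth, axis=0):
    """Bisect a list of (x, y) points into 2**depth balanced parts."""
    if depth == 0:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                 # median divisor balances the halves
    nxt = 1 - axis                      # alternate horizontal / vertical
    return rcb(pts[:mid], depth - 1, nxt) + rcb(pts[mid:], depth - 1, nxt)

# 4x4 grid of points -> 4 parts of 4 points each after two levels.
parts = rcb([(x, y) for x in range(4) for y in range(4)], depth=2)
```

RCB only looks at coordinates, not edges, so it is cheap but can cut many edges on irregular meshes; that is the gap the graph-based methods below address.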
Graph Ordering Cuthill-McKee
Algorithm: Graph ordering, Cuthill-McKee.
Step 1. Find vertex with lowest degree. This is the root vertex.
Step 2. Find all neighboring vertices connecting to the root by incident
edges. Order them by increasing vertex degree. This forms level 1.
Step 3. Form level k by finding all neighboring vertices of level k − 1
which have not been previously ordered. Order these new vertices by
increasing vertex degree.
Step 4. If vertices remain, go to step 3.
11
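The four steps above amount to a breadth-first traversal whose frontier is sorted by vertex degree. A sketch in Python; the adjacency-dict representation is an assumption of the sketch, not part of the slide:

```python
from collections import deque

def cuthill_mckee(adj):
    """Cuthill-McKee ordering of a connected graph given as
    {vertex: set(neighbors)}, following the level-structure algorithm."""
    degree = {v: len(adj[v]) for v in adj}
    root = min(adj, key=lambda v: degree[v])   # Step 1: lowest-degree root
    order, seen = [root], {root}
    queue = deque([root])
    while queue:                               # Steps 2-4: build levels
        v = queue.popleft()
        # Order each vertex's unvisited neighbors by increasing degree.
        for u in sorted((u for u in adj[v] if u not in seen),
                        key=lambda u: degree[u]):
            seen.add(u)
            order.append(u)
            queue.append(u)
    return order

# Path graph 0-1-2-3: endpoint 0 has lowest degree, so the ordering
# simply walks along the path.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
order = cuthill_mckee(adj)
```

Renumbering a matrix by this ordering gathers nonzeros toward the diagonal, which is the bandwidth reduction shown in Figure 2.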
Graph Ordering Cuthill-McKee
Matrix nonzero pattern
Figure 2: Natural Ordering (left) and Cuthill-McKee ordering (right)
12
Recursive Cuthill-McKee
• The level structure computed in Cuthill-McKee ordering is utilized
• Divisors are found to balance partitions
13
Recursive Spectral Bisection
Motivated by the observation that the cut-weight of a graph is precisely
Wc = (1/4) pᵀLp
Algorithm: Spectral Graph Bisection.
Step 1. Calculate the matrix L associated with the Laplacian of the
graph.
Step 2. Calculate the eigenvalues and eigenvectors of L.
Step 3. Order the eigenvalues in ascending order, λ1 ≤ λ2 ≤ λ3 ≤ · · · ≤ λn.
Step 4. Determine the smallest nonzero eigenvalue, λf and its associated
eigenvector xf (the Fiedler vector).
Step 5. Sort elements of the Fiedler vector.
Step 6. Choose a divisor at the median of the sorted list and 2-color
vertices of the graph which correspond to elements of the Fiedler vector
less than or greater than the median value.
14
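The algorithm above can be sketched with NumPy's dense eigensolver. This is fine only for toy graphs; practical spectral partitioners use sparse Lanczos-type solvers for the Fiedler vector:

```python
import numpy as np

def spectral_bisect(adj):
    """Spectral bisection per the slide: build the graph Laplacian, take
    the Fiedler vector, and 2-color vertices about its median."""
    n = len(adj)
    L = np.zeros((n, n))
    for v, nbrs in adj.items():
        L[v, v] = len(nbrs)            # diagonal: vertex degree
        for u in nbrs:
            L[v, u] = -1.0             # off-diagonal: -1 per edge
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = vecs[:, 1]               # smallest nonzero eigenvalue's vector
    median = np.median(fiedler)
    return np.where(fiedler <= median, -1, 1), L

# Two triangles joined by the single edge 2-3: the cut should be that edge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
p, L = spectral_bisect(adj)
cut_weight = p @ L @ p / 4             # Wc = (1/4) p^T L p from the slide
```

On this example the cut-weight formula evaluates to exactly 1: the lone bridge edge is the only edge crossing the partition, and both halves have three vertices.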
Recursive Spectral Bisection
15
Multilevel k-way Partitioning
• Use successive graph contraction to coarsen the graph
• Perform high-quality partitioning on the coarsened graph
• Prolongate to finer graphs with local interface optimization to improve the cut-weight
16
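One coarsening step of the scheme above can be sketched in Python. Real multilevel packages such as Metis use heavy-edge matching on weighted graphs; this unweighted greedy matching only illustrates the contraction mechanics:

```python
# One multilevel coarsening step: greedily match each vertex with an
# unmatched neighbor, then contract matched pairs into coarse vertices.

def contract(adj):
    """Return (coarse_adj, match) for one contraction of {v: set(nbrs)}."""
    match = {}                           # fine vertex -> coarse vertex id
    coarse_id = 0
    for v in adj:
        if v in match:
            continue
        partner = next((u for u in adj[v] if u not in match), None)
        match[v] = coarse_id
        if partner is not None:          # unmatched vertices contract alone
            match[partner] = coarse_id
        coarse_id += 1
    coarse = {c: set() for c in range(coarse_id)}
    for v, nbrs in adj.items():
        for u in nbrs:
            if match[v] != match[u]:     # drop edges internal to a pair
                coarse[match[v]].add(match[u])
    return coarse, match

# A 6-cycle contracts to a 3-cycle: three matched pairs become three
# coarse vertices, each still with two coarse neighbors.
adj = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
coarse, match = contract(adj)
```

Repeating this until the graph is tiny, partitioning the tiny graph well, and then undoing the contractions with local refinement at each level is the whole multilevel idea.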
Metis, ParMetis
• Extremely fast
• Parallel implementation (requires some initial partitioning)
• Supports weighted graphs by vertices or edges
• Supports incremental load balancing (repartitioning) with minimized data migration
19
Zoltan
• Relatively new package under development at Sandia under GPL
• Interfaces with Metis or Jostle
• Documentation suggests that the package will contain most of the commonly needed services for parallel scientific codes: partitioning, repartitioning, data migration, etc.
21
Partitioning Tools for SSS?
• Domain specific languages?
– Language for finite element methods
– Language for molecular dynamics
– <Insert your favorite problem domain here>
• Partial or full data dependency specification (analogous to scene graph specification in Java3D).
• Automatic tools for performance enhancement
– Use hardware performance statistics (memory access patterns) of previous executions in subsequent compilations
– Runtime data migration
22