Comput. Methods Appl. Mech. Engrg. 194 (2005) 3127–3146
www.elsevier.com/locate/cma
A parallel two-sided contact algorithm in ALE3D
Tim Pierce a,*, Garry Rodrigue b
a Lawrence Livermore National Laboratory, Mail Stop L-98, Livermore, CA 94551, United States
b Department of Applied Science, University of California at Davis, Davis, CA 95616, United States
Received 11 November 2003; accepted 20 August 2004
Abstract
A scalable parallel algorithm for treating two-sided contact in a finite-element multi-physics code (ALE3D) is pre-
sented. This method assumes that proximity between the two sides changes continuously, and uses a local search to
update proximity relations each cycle. The evolving communication pattern is treated entirely by local, nearest-neigh-
bor communication; there is no global communication. Opening and closing voids, overlapping and intersecting contact
surfaces, and a number of other special features are supported.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Contact surface; ALE; Finite element; Parallel computing
1. Introduction
ALE3D is an arbitrary Lagrangian–Eulerian (ALE) finite-element code that treats the fluid and elastic–plastic response of materials on an unstructured 3D grid. The major components of the code are explicit and implicit continuum mechanics, heat transfer, and chemistry [1]. ALE3D is used primarily to perform large-deformation transient dynamics simulations, including those involving strong shocks. It has also been applied to a number of metal-forming applications [2].
Problems of this type often involve contact surfaces: 2D surfaces defined by the meeting (potential or
actual) of separate material regions. An impenetrability constraint is enforced along the contact surface.
0045-7825/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.cma.2004.08.011
This work was performed under the auspices of the US Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.
* Corresponding author.
E-mail addresses: [email protected] (T. Pierce), [email protected] (G. Rodrigue).
However, the materials are (usually) free to slide relative to each other tangential to the contact surface. A
friction model may be used in the simulation of tangential sliding.
The finite element procedure begins with a discretization of the material regions into a mesh of zones, or
elements. The mesh in the parallel version of ALE3D is further decomposed into submeshes (called domains), and typically one domain is assigned to each processor of a parallel distributed-memory machine. This is a fairly standard approach to parallelizing finite-element codes [3], and a number of tools, such as METIS [4], have been developed to produce a well-balanced decomposition. Typically each zone in a domain is permanently assigned to that processor, i.e. the decomposition is static.
The force at a given node (mesh vertex), which ultimately dictates its displacement, is a function of the
stress in neighboring zones. If the neighboring zones fall into multiple domains, then inter-processor com-
munication will be required to calculate the force at the node. Since the decomposition is static, this com-
munication will have a static, local (or nearest-neighbor) topology.
Nodes on contact surfaces are exceptional, because the set of neighbor zones includes zones directly across from the node on the other side of the surface, and these change due to relative tangential motion
at the surface. Thus the calculation of nodal forces on contact nodes requires a dynamic communication
topology. A major part of the parallel contact problem consists of determining, at each time step, the cur-
rent neighbor zones of each contact node, finding where on the parallel machine those zones ‘‘live’’, and
bringing each contact node and all its current neighbors together on the same processor so that the nodal
force can be determined.
In surveying various approaches that have been taken to deal with parallel contact surfaces, it is necessary to distinguish between one-sided and two-sided contact. In one-sided contact, any node on a surface designated as a contact surface may come into contact with any face of that surface (excluding faces to
which the node is directly connected). Typical uses of one-sided contact surfaces are car crash simulations,
in which extensive buckling of shell elements occurs, and many-body collisions, where the union of the
surfaces of all bodies forms the one-sided contact surface.
One-sided contact is discontinuous in nature, see Fig. 1. One consequence is that the search for contact
must consider, for each face, all nodes within a geometrical region within which contact is possible.
An example of this is the one-sided contact algorithm in DYNA3D, [5,6], which sorts all contact nodes
into a number of spatial "buckets" in one or more dimensions. The minimum bucket size is chosen such that no contact face is spread over more than three adjacent buckets in each sort dimension. A node in
a given bucket can potentially interact only with faces that intersect that bucket. This bucket sort, an expen-
sive global operation, can be amortized over multiple cycles. The rebucketing frequency is determined by
the maximum velocity of any contact node.
To parallelize, a contiguous set of contact nodes in one or more buckets is assigned to a given processor.
All nodes in these buckets and neighboring buckets must be sent to that processor in order to perform the
contact enforcement. Note that this is in effect a second independent, dynamic decomposition, in addition
to the static bulk decomposition. This feature is shared by PRONTO, discussed next, and by the ALE3D contact algorithm.
PRONTO [7,8] sorts the set of contact nodes and faces of a single-sided contact surface in three dimen-
sions using recursive coordinate bisection (RCB) [9]. RCB provides dynamic load balancing efficiently.
Fig. 1. Nodes d and b come into contact with face x as the buckled portion of the surface moves to the right. Node c located between b
and d does not.
Fig. 2. Node d is nearest to face x at one point in time while node b is nearest to x at a later time. Thus, there is an intervening time
when c is nearest to x.
Nodes and faces are sent at each time cycle from their bulk domain to the RCB contact domain of the pre-
vious time cycle. The new RCB is calculated from there and a small subset of data is transferred from the
old RCB domain to the new RCB domain, taking advantage of the fact that the RCB decomposition evolves in an incremental fashion. A number of global operations with short message lengths are required
to orchestrate the parallel RCB procedure.
Two-sided contact as formulated in ALE3D is simpler than one-sided contact because of the additional
assumption that contact between the two sides evolves continuously, see Fig. 2. This assumption will be
further developed in the next section. It is precisely what allows the ALE3D dynamic domain decomposi-
tion to evolve with a complete absence of global communication. In short, a global search is initially per-
formed to find nearest nodes on the opposing side, and thereafter the search for nearest nodes is restricted
to the mesh neighborhood of the nearest node during the preceding time cycle.
2. The contact algorithm
2.1. Terminology
The problem mesh is the discretization of one or more physical bodies into zones. The mesh is then par-
titioned into bulk domains, each assigned to its own processor with separate memory. The processors share data by message passing.
The faces forming the exterior of the problem mesh are called the boundary faces. A two-sided contact
surface consists of a pair of boundary face subsets, the master side and the slave side, see Fig. 3(b). These
subsets are specified by the user, and presuppose some knowledge of which parts of the bodies will come in
contact. Each side of a contact surface is a 2D mesh. The relationship (or connectivity) between the various
faces and nodes of each mesh is fixed. Each face on a contact side always consists of the same nodes, and
each node is a vertex for the same set of faces.
Fig. 3. (a) Bulk decomposition into eight domains where each color represents a different bulk domain and (b) the red and blue
surfaces form the master and slave sides of one contact surface, while the green and yellow surfaces form the master and slave sides of a
different contact surface. (For interpretation of color the reader is referred to the web version of this article.)
Fig. 4. The neighborhood of node ‘‘a’’ consists of all circled nodes (plus ‘‘a’’ itself).
The neighborhood of a contact node consists of all the nodes within two faces of the specified node, see
Fig. 4. Although a uniform mesh is shown in Fig. 4 for simplicity, ALE3D supports arbitrarily connected
meshes.
Contact enforcement in explicit time-differencing codes is essentially a local process where each ‘‘patch’’
of nodes on one side of a contact surface is only affected by a patch of nodes directly opposite it. However,
the opposing patch of nodes may be continually changing as the two sides move relative to each other. The
order node of a given node on one side of a contact surface is the node nearest to it on the opposite side of the contact surface. Each contact node has an order node during a time cycle, but this order node constantly
changes as the problem evolves. The amount of change allowed in a given time cycle is limited; the order
node of a particular node at the next time cycle must share a face with its order node during the present time
cycle. In other words there must exist a face on the opposing side such that the previous and current order
node are both vertices. This allows us to limit the search for the new order node to a local search at each
time cycle.
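The local order-node search described above can be sketched as follows (a minimal illustration, not the ALE3D implementation; the data structures and names are our own): candidates for the new order node are restricted to nodes that share a face with the previous order node.

```python
import math

def update_order_node(p, prev_order, opp_faces, opp_coords):
    """Local order-node search: p is the contact node's position,
    opp_faces lists the opposing side's faces (tuples of node ids),
    opp_coords maps opposing node ids to coordinates."""
    # Candidate set: the previous order node plus every node that
    # shares a face with it.
    cand = {prev_order}
    for face in opp_faces:
        if prev_order in face:
            cand.update(face)
    # The new order node is the nearest candidate.
    return min(cand, key=lambda c: math.dist(p, opp_coords[c]))
```

With opposing faces (0,1), (1,2), (2,3) and a previous order node of 1, the candidate set is {0, 1, 2}; a node that has drifted toward node 2 picks up 2 as its new order node, while node 3 is never considered in this cycle.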
The locale of a contact node consists of the node itself, its neighborhood, its order node, and the neigh-
borhood of its order node, see Fig. 5. The locale of a node is the minimum amount of state required to perform the contact enforcement and subsequent order node update on that node. The locale is a more pre-
cise definition of the ‘‘patch’’ referred to earlier.
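The locale definition lends itself to a direct sketch (illustrative data structures, not ALE3D's): a node's two-face neighborhood is found by walking face adjacency twice, and the locale is the union of the node's neighborhood with its order node's neighborhood on the opposing side.

```python
def neighborhood(node, faces, depth=2):
    """All nodes within `depth` faces of `node`; faces are tuples of node ids."""
    adj = {}
    for face in faces:
        for n in face:
            adj.setdefault(n, set()).update(face)
    seen = {node}
    for _ in range(depth):
        # expand one face-hop: every node of every face touching the set so far
        seen = seen.union(*(adj.get(n, {n}) for n in seen))
    return seen

def locale(node, order_node, own_faces, opp_faces):
    """Node + its neighborhood + order node + the order node's neighborhood."""
    return neighborhood(node, own_faces) | neighborhood(order_node, opp_faces)
```

On a 1D chain of faces, the two-face neighborhood of s3 is {s1, ..., s5}, and with order node m3 on the opposing chain the locale adds {m1, ..., m5}.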
Contact enforcement is parallelized by partitioning the contact surface into a number of contact domains,
or CDs, and then assigning each contact domain to a processor. The nodal data resides permanently in the
bulk domains and is sent to the contact domains for contact enforcement during a time cycle. The updated
nodal data is then returned to the bulk domains.
Within a single contact domain, contact forces are balanced using a symmetric, momentum-conserving
approach that is modelled after the 2D method of HEMP [10]. This method has proven to be robust, stable, and accurate in the presence of strong shocks. Details may be found in [11].
The communication topology is the overall organization of inter-processor communication and deter-
mines the specific communication pattern between bulk domains and contact domains during a given time
a
Fig. 5. The locale of node ‘‘a’’ consists of all nodes marked by open circles.
cycle. The cross-section of the topology for a given processor is the set of processors with which it commu-
nicates. Classes of communication topology relevant to this paper are
1. Static. Each processor communicates with the same set of other processors (its "neighbors") during each time cycle.
2. Dynamic. Each processor's neighbors may change from one time cycle to the next.
3. Local. There is an upper bound on the number of processors that a given processor communicates with, independent of the total number of processors.
4. Global. The number of processors that a given processor communicates with increases as the overall number of processors increases.
Communication between bulk domains (BDs) and contact domains (CDs) is accomplished by:
1. Each BD determining what information to send to which CD.
2. Each CD determining from which BDs it will receive information.
Both of the above steps are necessary as each communication requires the sender and receiver to explic-
itly cooperate. BD messages cannot be simply sent from various places while the CD continuously polls for
incoming messages. The CD would never know when it had waited long enough (unless it waited until it got
messages from all BDs, which would imply global communication).
2.2. Contact domains
If a contact node is assigned to a particular contact domain then its state is updated and returned by that
contact domain during the process of contact enforcement. A given master node is always assigned to the
same contact domain. Slave nodes, on the other hand, are assigned to the contact domains of their order
nodes. The contact domain a slave node is assigned to may change from cycle to cycle because of relative
movement of the order nodes.
Simplified 1D diagrams will henceforth be used to represent the contact surface. Also, node neighborhoods will only include nodes within one face (or segment, in 1D) of a given node. The description of var-
ious scenarios is thus less cluttered while the extension to 2D surfaces is obvious.
A contact surface is shown in Fig. 6 at two different times, t1, t2.
If a node is assigned to a CD, then the CD also requires the node's locale, see Fig. 7.
The nodes m10 and s8 in Fig. 7 are referred to as ghost nodes of CD 5.
The set of master nodes, both assigned and ghost, that are required by a given contact domain is the
same each cycle, so the focus from now on will be on the more complicated and dynamic issue of slave
nodes.
Fig. 6. t1: s4 orders on m3, so that both m3 and s4 are assigned to CD 5. t2: The slave side has moved, so that s4 now orders on m4 and is assigned to CD 6.
Fig. 7. m9 is assigned to CD 5, so all circled nodes must be sent to CD 5. Nodes m10 and s8 are included even though they are assigned to a different CD.
2.3. Bulk domains
A bulk domain will send one of its local slave nodes to a particular CD when:
1. The slave node's current order node is statically assigned to the CD,
2. the slave node is the order node of a master node that is assigned to the CD, or
3. the slave node is in the neighborhood of another slave node of the first or second class.
The second and third classes are a consequence of the locale requirements for assigned nodes. A node in the
first class is called a rule 1 generator for the CD because it generates a list of accompanying nodes (its neigh-
bors) that must also be sent to the CD. A slave node is a rule 1 generator for exactly one CD, namely, its
assigned CD (which changes dynamically). A slave node in the second class is called a rule 2 generator for the CD. A slave node can be a rule 2
generator for multiple CDs (or none), see Fig. 8.
The rule 1 CD for a given slave node is determined at the end of the previous cycle, after the positions of the slave and all other nodes in its locale have been updated. At that point the new order node for the slave
is searched for, starting from the old order node (this is done in the contact domain). Once the new order
node is found, the order node's assigned CD becomes the slave's new rule 1 (i.e. assigned) CD.
The rule 2 CD list is built a little less directly. The slave was sent to a number of CDs as either an as-
signed node or a ghost node during the previous cycle. The contact domain determined the new order node of each of its assigned masters at the end of that cycle. If the slave node was chosen as an order node of any
assigned master node(s) on that CD, the slave node was ‘‘tagged’’. The CD returned to the BD a list of all
of the BD's slave nodes that were tagged.
If a slave node was sent to multiple CDs, then any or all of those CDs may have returned a tagged status
on that node. Any CD that did so is then added to the node's rule 2 CD list for the next (i.e. current) time
cycle. For example, both CD 2 and CD 3 in Fig. 8 will return a tag on node s4 to the BD(s) where s4 resides,
and s4's rule 2 CD list will be {CD 2, CD 3}.
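The tagging round trip can be sketched by inverting the master-to-order-node map (our own minimal data structures): each assigned master contributes its CD to the rule 2 list of the slave it orders on.

```python
def rule2_cd_lists(master_order_node, master_cd):
    """master_order_node: {master: slave it orders on};
    master_cd: {master: CD the master is assigned to}.
    Returns {slave: set of CDs that tag it} for the next cycle."""
    tags = {}
    for master, slave in master_order_node.items():
        tags.setdefault(slave, set()).add(master_cd[master])
    return tags
```

For the configuration of Fig. 8 (m5 and m6 both order on s4, with m5 assigned to CD 2 and m6 to CD 3), s4's rule 2 CD list comes out as {CD 2, CD 3}, matching the text.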
Fig. 8. Both m5 and m6 order on s4. Hence, s4 is a rule 2 generator for both CD 2 and CD 3 (and a rule 1 generator for CD 3).
Fig. 9. Nodes s1–s6 are split between BD 1 and BD 2; the circle represents the neighborhood of node s4.
2.4. Domain communication
2.4.1. BD ↔ BD
Consider the example in Fig. 9. If node s4 has rule 1 CD set to CD 3, then nodes s3, s4, and s5 must all be sent to CD 3. Since node s4 is on BD 2, BD 2 can determine to send nodes s4 and s5 to CD 3 by looking at s4's rule 1 CD. BD 1 determines whether to send s3 to CD 3 by first creating a ghost layer of nodes (called proxy nodes) around each bulk domain's local set. These nodes are generated by taking the union of the bulk domain's local nodes' neighbors (i.e. nodes within two faces of all the local nodes) and then removing the local nodes, so that only the proxy nodes for the bulk domain remain. A list of the proxy nodes is sent to the other bulk domains for matching against their own local nodes. A list (possibly null) of matches is then returned.
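A sketch of the proxy-layer construction just described (illustrative representation; `neighbors` maps each node to its two-face neighborhood):

```python
def proxy_nodes(local, neighbors):
    """Union of the local nodes' neighborhoods, minus the local nodes."""
    layer = set()
    for n in local:
        layer |= neighbors.get(n, set())
    return layer - set(local)

def match_proxies(proxies, other_bd_locals):
    """For each other BD, the subset of our proxies local there (possibly empty)."""
    return {bd: proxies & set(nodes) for bd, nodes in other_bd_locals.items()}
```

A BD holding {s4, s5, s6} with chain neighborhoods ends up with the proxy layer {s3, s7}, and matching against another BD's locals identifies which of those proxies that BD owns.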
A static communication topology is established between a bulk domain and the other bulk domains containing the proxy nodes. The topology and the list of nodes to be sent are static. Each bulk domain sends to the other BDs in its "neighborhood" (i.e. the BDs that have proxy copies of any of its local nodes) the latest rule 1 CD and rule 2 CD list for each of the shared nodes, and receives the same. In the example of Fig. 9, this allows BD 1 to know what s4's rule 1 and rule 2 CDs are.
For each local node in a bulk domain, a list of neighboring domains, if any, on which the node is proxy is built and permanently associated with the node. This is referred to as the node's neighbor BD list.
The BD ↔ BD communication topology is a local, nearest-neighbor topology because it is bounded by the product of the number of native nodes and the number of neighbors per node. In practice, each bulk domain will have on the order of eight neighbors, i.e., a 3 × 3 grid where the central element represents the local BD itself.
2.4.2. BD ↔ CD
Each contact domain builds a list of bulk domains called the next time BDs list (Section 2.5.2) from which it expects to receive data during the next time cycle. The contact domain preposts the "receives" from each bulk domain in its next_time_BDs list at the start of a time cycle's BD ↔ CD communication pass. A buffer is allocated on each bulk domain for each contact domain on its CD "send" list, containing state information such as the node's position, velocity, and order node, and its neighbor BD list, assigned (i.e. rule 1) CD, and list of ghost CDs. Upon completion, this buffer is sent to the destination CD, where the contact domain assembles it, together with the data from all other received buffers, into the CD's piece of the contact surface.
Fig. 10. CD ↔ CD communication between CD 2 and CD 3, involving slave node s7 and master nodes m5 and m6.
2.4.3. CD ↔ CD
Consider the situation in Fig. 10.
s7's order node is m6 at the start of the cycle, so s7 is assigned to CD 3. However, surface movement results in s7's new order node being m5. This calculation is done by CD 3, since that is where s7 is assigned. However, CD 2 also needs to know s7's new order node, because it has to accumulate s7's neighbor BD list into its next time BDs list, since this is where s7 will be assigned next cycle. This situation is accommodated by defining another communication topology between a contact domain and other contact domains where any of its ghosts are assigned or where any of its assigned nodes are ghosts. This is a local topology that is bounded by the total number of nodes in a contact domain. The typical number of neighbors is on the order of eight.
This communication topology is also used several times in a time cycle to communicate intermediate results during the course of the contact calculation. For example, in Fig. 10, node s7 needed the freshly updated coordinates of m5 (a ghost node to CD 3) in order to determine that its new order node is indeed m5.
To construct this topology, for each local node the bulk domain builds a list of which contact domains the local node will be sent to as a ghost. The bulk domain sends this list when it sends the node to its assigned CD. When it sends the node to any other CD as a ghost, it sends along with it the information of which CD the node is assigned to. Each CD constructs from this information a set of CDs where its assigned nodes are ghosts, and another set of CDs where its ghost nodes are assigned. Unlike the BD ↔ BD topology, the CD ↔ CD communication topology is dynamic.
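The two CD ↔ CD neighbor sets can be sketched like this (our own representation of the per-node bookkeeping the text describes):

```python
def cd_cd_neighbors(my_cd, assigned_cd, ghost_cds):
    """assigned_cd: {node: CD the node is assigned to};
    ghost_cds: {node: set of CDs holding the node as a ghost}."""
    assigned_nodes_ghost_on = set()   # CDs where my assigned nodes are ghosts
    ghost_nodes_assigned_on = set()   # CDs where my ghost nodes are assigned
    for node, cd in assigned_cd.items():
        if cd == my_cd:
            assigned_nodes_ghost_on |= ghost_cds.get(node, set())
        elif my_cd in ghost_cds.get(node, set()):
            ghost_nodes_assigned_on.add(cd)
    return assigned_nodes_ghost_on - {my_cd}, ghost_nodes_assigned_on
```

For Fig. 10, where s7 is assigned to CD 3 and ghosted on CD 2, CD 3 finds CD 2 in its first set and CD 2 finds CD 3 in its second, so each knows to exchange s7's updates with the other.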
2.5. Domain preparation
2.5.1. Bulk domain
The bulk domain builds a package of data to send to each of its neighbor contact domains after the BD ↔ BD communication pass by looping through all its local and proxy nodes and looking at each node's rule 1 CD and rule 2 CD list. The first time a CD is encountered, the BD creates a new list for nodes going to that CD and adds the CD to its list of CDs to send to.
If the node is local, it is added to the list of nodes being sent to the rule 1 CD (if not already on the list) and flagged as an assigned node. The BD adds all the local neighbors of the node to the lists of all CDs in the rule 1 and rule 2 CD lists, regardless of whether the node is local or proxy.
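The packaging loop above can be sketched as follows (illustrative structures; the particular rule 1 CD assignments in the example are invented, and `local_neighbors` gives each node's neighbors restricted to this BD):

```python
def build_send_lists(nodes, local, rule1, rule2, local_neighbors):
    """nodes: all local and proxy nodes on this BD; local: set of local nodes;
    rule1: {node: assigned CD}; rule2: {node: set of rule-2 CDs};
    local_neighbors: {node: its neighbors that are local to this BD}."""
    send = {}      # CD -> nodes to send there
    assigned = {}  # CD -> nodes flagged as assigned there
    for n in nodes:
        if n in local:
            send.setdefault(rule1[n], set()).add(n)
            assigned.setdefault(rule1[n], set()).add(n)
        # Local neighbors go to every CD in the node's rule-1/rule-2 lists,
        # whether the node itself is local or proxy.
        for cd in {rule1[n]} | rule2.get(n, set()):
            send.setdefault(cd, set()).update(local_neighbors.get(n, ()))
    return send, assigned
```

Note that a proxy node is never itself placed on a send list; it only generates its local neighbors, which is exactly the Fig. 9 situation where BD 1 ships s3 on behalf of s4.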
2.5.2. Contact domain
Each contact domain must know beforehand which bulk domains will send it messages, in order to pre-
post the receives of those messages. This determination is actually done at the end of the previous cycle.
After the order nodes have been updated at the end of the previous time cycle, each contact domain determines which slave nodes have order nodes that are among its assigned masters (the CD's new rule 1 generators) and which slave nodes are the order nodes of its assigned masters (the CD's rule 2 generators). The bulk domains that own these generators will communicate with the contact domain next (i.e. current) cycle. In addition, the contact domain will also receive all the neighbors of its generators, and these neighbors may not all be from the same bulk domains as the generators themselves (see Fig. 9).
To solve this problem, when a node is sent to a CD, extra data called the neighbor BD list is also sent with it. For a given node, its neighbor BD list is the set of all bulk domains on which the node is local or proxy. This is a static list, and can be computed during initialization. If a bulk domain appears in a given node's neighbor BD list, then that bulk domain contains at least one node in the neighborhood of the given node. Thus, the contact domain will also communicate with all bulk domains in the neighbor BD list of any of its generators.
Thus, at the end of the previous cycle the contact domain finds all slaves that order on its assigned mas-
ters and all slaves that are ordered on by its assigned masters. Then, the contact domain takes the union of
these nodes' neighbor BD lists in order to determine which BDs it will talk to during the next time cycle.
This list is stored in the next time BDs list until the start of the next (i.e. current) cycle.
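The end-of-cycle prediction reduces to a set union, sketched here with illustrative structures:

```python
def next_time_bds(rule1_gens, rule2_gens, neighbor_bd_list):
    """Union of the neighbor BD lists of all rule-1 and rule-2 generators."""
    bds = set()
    for node in set(rule1_gens) | set(rule2_gens):
        bds |= neighbor_bd_list[node]
    return bds
```

This is why the neighbor BD list travels with each node: a generator's neighbors may live on a different bulk domain than the generator itself (the s3/s4 situation of Fig. 9), yet the CD must prepost a receive from that domain too.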
2.6. Communication summary
The ALE3D parallel contact algorithm uses three types of communication topologies in dynamically
evolving the communication patterns of material contact.
BD ↔ BD communication. Ties the bulk domains together by sharing information near mutual boundaries, and thus allows a given bulk domain to "sense", through the involvement of its proxy nodes, when it is about to become involved in the work of a specific contact domain.
BD ↔ CD communication. Supplies the contact domain with the data needed to perform the contact enforcement, to update the order nodes of the CD's assigned nodes, and to predict, on the basis of the updated order nodes, which bulk domains will communicate with it during the next time cycle. Returns the updated nodal state for its assigned nodes to their BD homes.
CD ↔ CD communication. Supports the contact domain in determining which slaves order on its assigned master nodes, and thus which bulk domains the contact domain will communicate with during the next time cycle. CD ↔ CD communication also permits less redundant calculation of the ghost state, trading communication for redundant calculation.
The control flow for a single time cycle is given in Fig. 11.
2.7. A simplified example
A simple example is provided to demonstrate how the communication pattern evolves. In a more real-
istic example the surfaces would be 2D, and the node neighborhoods would include all nodes within two
faces. As a further simplification, all the nodes on one side of the surface will exactly match up with a node
on the other side, so that the rule 2 CD list will only contain the rule 1 CD. Nevertheless this example shows
much of the mechanism by which BDs and CDs start and stop communicating with each other in the course
of contact evolution.
Fig. 11. Control flow for a single time cycle.
1. Initial configuration; mi orders on si
s1 s2 s3 s4 s5 s6
m1 m2 m3 m4 m5 m6
2. Static layout of the contact surface into the six bulk domains. Bracketed nodes are proxy nodes:
BD 1: s1 s2 [s3]    BD 2: [s2] s3 s4 [s5]    BD 3: [s4] s5 s6
BD 4: m1 m2 [m3]    BD 5: [m2] m3 m4 [m5]    BD 6: [m4] m5 m6
3. Static BD ↔ BD topology:
BD 1 ↔ BD 2    BD 2 ↔ {BD 1, BD 3}    BD 3 ↔ BD 2
BD 4 ↔ BD 5    BD 5 ↔ {BD 4, BD 6}    BD 6 ↔ BD 5
4. Static master side contact decomposition. Assigned nodes are unbracketed and ghost nodes are bracketed:
CD 1: m1 [m2]    CD 2: [m1] m2 [m3]    CD 3: [m2] m3 [m4]
CD 4: [m3] m4 [m5]    CD 5: [m4] m5 [m6]    CD 6: [m5] m6
5. Contact domain configuration (assigned nodes unbracketed, ghost nodes bracketed):
CD 1: m1 [m2], s1 [s2]    CD 2: [m1] m2 [m3], [s1] s2 [s3]    CD 3: [m2] m3 [m4], [s2] s3 [s4]
CD 4: [m3] m4 [m5], [s3] s4 [s5]    CD 5: [m4] m5 [m6], [s4] s5 [s6]    CD 6: [m5] m6, [s5] s6
6. Contact surface movement during time cycle tn-1:
s2 s3 s4 s5 s6
m1 m2 m3 m4 m5 m6
7. CD allocation and order node updates at tn-1 (assigned nodes unbracketed, ghost nodes bracketed):
CD 1: m1 [m2], s1 [s2]    CD 2: [m1] m2 [m3], [s1] s2 [s3]    CD 3: [m2] m3 [m4], [s2] s3 [s4]
CD 4: [m3] m4 [m5], [s3] s4 [s5]    CD 5: [m4] m5 [m6], [s4] s5 [s6]    CD 6: [m5] m6, [s5] s6
8. The contact domains during time cycle tn (assigned nodes unbracketed, ghost nodes bracketed):
CD 1: m1 [m2], [s1] s2 [s3]    CD 2: [m1] m2 [m3], [s2] s3 [s4]    CD 3: [m2] m3 [m4], [s3] s4 [s5]
CD 4: [m3] m4 [m5], [s4] s5 [s6]    CD 5: [m4] m5 [m6], [s5] s6 [s7]    CD 6: [m5] m6, [s6] s7
A particular event is illustrated in Fig. 12. Note that BD 2 does not send anything to CD 1 during tn-1, as BD 2 → CD 1 is not part of the communication topology. Also, BD 2 finds that it will send data during tn to CD 1, and BD 2 → CD 1 gets added to the BD → CD topology. Similarly, communication paths can disappear when they are no longer needed. For example, CD 3 received s2 from BD 1 during time cycle tn-1, while in the next time cycle CD 3 no longer needs s2; consequently BD 1 → CD 3 is dropped out of the topology.
3. Timing studies
Scalability of the contact algorithm is tested by performing several timing studies. The test geometries
are rectangular, cylindrical, and spherical, and have been designed to isolate the contact scalability issues.
Scaled efficiency and fixed efficiency are both measured. Since contact surfaces are two-dimensional, the number of contact surface elements in conventional scaling grows as N^(2/3) of the overall element count N. If the overall amount of work per processor is held constant, contact work per processor actually shrinks for scaled speedup. This clouds the issue of scalability of the contact algorithm itself. Ideally one would hope to see super-linear scaling of the contact phase in this case, but quantifying this is difficult. The simplest solution is to hold the contact work per processor constant, by performing mesh refinement only in the dimensions of the contact surface.
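To make the N^(2/3) point concrete (our illustration): under conventional 3D refinement, the contact surface's share of the elements falls as N^(-1/3), so holding volume work per processor fixed shrinks contact work per processor.

```python
def contact_fraction(n_elements):
    """Contact-surface element count over total element count, ~ N^(-1/3)."""
    return n_elements ** (2.0 / 3.0) / n_elements
```

Refining a cube from 10^3 to 10^6 elements cuts the contact share tenfold, from about 10% to about 1%.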
Fixed efficiency tests are achieved by running a fixed problem size on various numbers of processors. Ideally each doubling of processors should cut the run time by half. However, as the amount of actual work per processor becomes small compared to the time spent in various fixed-overhead work, e.g., communication latency times, this ideal will inevitably fail at some point. The problem sizes chosen exhibit this failure at the largest processor counts run.
All tests were performed on the Lawrence Livermore National Laboratory ASCI "Y" machine, which is an IBM SP system with 488 nodes and 1.5 GBytes of memory per node. Each node consists of 4 PowerPC 604e processors, running at 332 MHz with a peak of 664 MFlops, that share the node's memory, for a total of 1952 processors.
Fig. 12. One particular event of the example.
3.1. A case for local communication
It was observed in the following sections that there was both a steady increase in typical cycle times and an increasing variance in the individual cycle times when running with larger numbers of processors. Indeed, some time cycles were four times as long as others. This occurred even in the fixed planes test (see below), where identical work and communication are performed during each time cycle. Moreover, which cycles took longer on a given run was entirely unrepeatable.
This behavior is now understood to be a result of system "daemon" processes that randomly interrupt various processors to perform housekeeping tasks. If each processor ran entirely independently and the system interruptions were uniformly distributed, then scalability would not be affected. However, if one processor is interrupted during processor intercommunication, then all processor communication must wait. While this is most dramatic for global operations such as MPI_Barrier and MPI_Allreduce, it also affects local communication patterns. To see this, consider Fig. 13.
Fig. 13. Five-processor example (A, B, C, D, E) where each processor communicates only with its adjacent neighbors.
If processor A is interrupted and temporarily cannot send the expected data to processors B and E, then
at some point processors B and E can go no further and cannot send the data to processors C and D. Thus,
processors C and D must wait for processor A to return from the interruption. This ‘‘daisy-chain’’ can be of
any length. The probability of a processor being interrupted by the system in any given time interval in-
creases as the processor count is increased, resulting in the increased run times observed as well as the var-
iance in cycle times.
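A back-of-the-envelope model of this effect (our illustration, not from the paper): if each processor is independently interrupted during a cycle with probability p, the chance that at least one processor, and hence the whole daisy-chain, stalls is 1 − (1 − p)^P, which grows toward 1 with the processor count P.

```python
def p_cycle_stalled(p, procs):
    """Probability that at least one of `procs` processors is interrupted,
    assuming independent interruptions with per-processor probability p."""
    return 1.0 - (1.0 - p) ** procs
```

At an assumed p = 0.1% per cycle, this gives roughly 0.8% for 8 processors but about 40% for 512, consistent with the growing mean and variance of cycle times observed at large processor counts.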
A simple test to measure this phenomenon is constructed where each processor communicates with six neighbor processors. For simplicity, the neighbors are the six with processor numbers (MPI rank) nearest to the given processor. The test is run for 1000 time cycles, where each processor sends and receives 2000 floating-point numbers to and from its neighbors during a time cycle. This test should ideally take the same amount of
time regardless of the total number of processors involved. The results are shown (normalized) in Fig.
14 and may seem to indicate that producing an algorithm that uses only local communication is futile.
However, a similar test was constructed which instead performs a global operation (MPI_Barrier) within
the loop. The results of this test are again shown in Fig. 14.
Since barriers are the simplest form of global communication, in that they involve no data transfer, it is
clear that global communication produces a much bleaker picture for scalability than local communication.
Thus, efforts expended in restricting communication to local interchanges only are worthwhile.
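The gap between the two curves in Fig. 14 can be reproduced qualitatively with a toy model (again an illustrative sketch, not the measured benchmark; the 1% interrupt probability and 10x delay are arbitrary assumptions). A "local" cycle waits only on two ring neighbors, while a "global" cycle is equivalent to a barrier over all processors:

```python
import random

def avg_cycle_time(num_procs, num_cycles, sync, p_interrupt=0.01, delay=10.0, seed=0):
    """Average time per cycle when each processor does one unit of work and
    is interrupted with probability `p_interrupt` (costing `delay` extra).

    sync='local'  : wait only for the two ring neighbors' previous cycle.
    sync='global' : every cycle ends with a barrier over all processors.
    """
    rng = random.Random(seed)
    finish = [0.0] * num_procs
    for _ in range(num_cycles):
        if sync == 'global':
            start = max(finish)  # barrier: nobody proceeds until all arrive
            finish = [start + 1.0 + (delay if rng.random() < p_interrupt else 0.0)
                      for _ in range(num_procs)]
        else:
            finish = [max(finish[i - 1], finish[i], finish[(i + 1) % num_procs]) + 1.0
                      + (delay if rng.random() < p_interrupt else 0.0)
                      for i in range(num_procs)]
    return max(finish) / num_cycles

for p in (64, 1024):
    print(p, "procs: local", round(avg_cycle_time(p, 200, 'local'), 2),
          "global", round(avg_cycle_time(p, 200, 'global'), 2))
```

With the barrier, the per-cycle cost saturates near the full interrupt delay as soon as the processor count is large enough that some processor is almost always interrupted; with local synchronization the degradation is far more gradual.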
3.2. Fixed planes test
Two rectangular boxes of equal size are oriented so that one sits on top of the other with perfect overlap,
see Fig. 15. Larger runs use a finer grid in the x- and y-directions while shrinking the box thickness in the
z-direction. The contact surface is defined to be the surface where the blocks meet. The mesh for the two
blocks is slightly different, as there is one more zone in the x- and y-directions of the bottom block than
in the top block. Thus, the contact nodes do not coincide except at the edges (the contact surface is
orthogonal to the z-direction). The entire machinery of the contact algorithm, including all communication,
is exercised even though there is no movement in this test. Computational load balance should be very good
and any non-scalability is attributable to communication.

Fig. 14. (a) Local communication (green line); (b) barrier communication (red line). Run time is the average of 4 runs. (For
interpretation of color the reader is referred to the web version of this article.)

Fig. 15. Fixed planes.

Table 1
Results for fixed planes test

Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 25,282 3160 12,961 1620 1.0
16 50,622 3164 25,763 1610 1.11
32 101,762 3180 51,521 1610 1.14
64 203,386 3178 102,597 1603 1.24
128 408,322 3190 205,441 1605 1.36
256 815,346 3185 409,481 1600 1.51
512 1,635,842 3195 820,481 1603 1.65
1024 3,264,994 3188 1,636,113 1598 2.57
Table 1 lists the parameters and results for a set of timed runs. Each test is approximately twice the size
of the previous test, and is executed on twice as many processors. All times are normalized.
There is a gradual increase in time attributable to system interrupts leading to communication blocks, as
expected from the previous discussion on IBM SP scalability issues. The overall increase in time is about
3× for a 128× increase in problem size and processor count. Fig. 16 compares the actual scaled efficiency
to the ideal. The normalized maximum run time is shown, which is simply the inverse of the scaled efficiency.
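Since perfect weak scaling would keep the normalized contact time at 1.0, the scaled efficiency plotted in Fig. 16 is just the inverse of the times in Table 1; a quick check:

```python
# Scaled efficiency for the fixed planes test: under perfect weak scaling
# the normalized contact time (Table 1) would stay at 1.0, so the scaled
# efficiency of Fig. 16 is simply its inverse.
procs     = [8, 16, 32, 64, 128, 256, 512, 1024]
norm_time = [1.0, 1.11, 1.14, 1.24, 1.36, 1.51, 1.65, 2.57]
efficiency = [1.0 / t for t in norm_time]
for p, e in zip(procs, efficiency):
    print(f"{p:5d} procs: scaled efficiency = {e:.2f}")
```

The efficiency declines gradually to about 0.39 at 1024 processors.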
A fixed-size test was also executed with various processor counts. The problem chosen is identical to the
one used for the scaled test with 256 processors. Results for the fixed test are compared with ideal scaling in
Fig. 17.
Fig. 16. Comparison of actual scaled efficiency (red line) to the ideal (green line) for fixed planes test. (For interpretation of color the
reader is referred to the web version of this article.)
Fig. 17. Comparison of actual efficiency (red line) with the ideal (green line) for fixed sized problem on the fixed planes test. (For
interpretation of color the reader is referred to the web version of this article.)
3.3. Sliding planes test
Two rectangular boxes are again modelled in this test. In this case they sit at an initial offset as shown in
Fig. 18(a). One block is given a velocity, so that at a later time they are in the configuration shown in Fig.
18(b).
This test exercises the ability of the algorithm to change its communication pattern over time. The
partitioning algorithm decomposes the problem into domains such that each domain is in one block or the
other but not both. This is a consequence of the rule that attempts to minimize connectivity between
domains. The most obvious cut is between the blocks along the contact surface, as there is no connectivity
between blocks.
The domains directly across from a given domain will change as the block slides, thus altering the
communication pattern. Individual slave nodes are assigned to the contact domain of their associated master
node. As the block slides, slaves that were previously ordered on masters located on the edge of a contact
domain will change ordering to new masters in a different contact domain. Thus, the individual data sent to
contact domains changes each time cycle even when the pattern of communicating domains does not.

Table 2 and Fig. 19 present the normalized results for the scaled sliding planes test.
It is clear from the above results that this test does not scale as well as the fixed planes test. The source of
the problem is poor load balance, a consequence of the way in which the slave nodes are assigned to
contact domains. As previously discussed, each slave node is dynamically assigned to the same contact
domain as its statically assigned master order node. Since the blocks are offset, all the slave nodes in the
offset region order on the limited set of master nodes along the edge of the master side closest to the offset
slave region. Fig. 20 depicts a simplified 1D example.
Fig. 18. Sliding planes.
Table 2
Results for sliding planes test
Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 25,600 3200 13,122 1640 1.0
16 51,072 3192 25,990 1624 1.10
32 102,400 3200 51,842 1620 1.20
64 204,288 3192 103,050 1610 1.40
128 409,600 3200 206,082 1610 1.78
256 817,152 3192 410,368 1603 2.15
512 1,628,400 3200 821,762 1605 3.05
1024 3,268,608 3192 1,637,922 1600 5.21
Fig. 19. Comparison of actual scaled efficiency to the ideal for sliding planes test.
Fig. 20. A simplified 1D example: the master side (top) is partitioned into contact domains A, B, and C; the slave side is on the
bottom. Each partition along the top represents a different contact domain.
Although the master side is well load-balanced, most of the slave side is nearest to the master nodes in
contact domain A. Thus, most of the slave nodes are assigned to contact domain A while a much smaller
number are assigned to contact domain B and none are assigned to contact domain C. This results in severe
load imbalance.
The above imbalance gets worse as the problem is scaled up. In Fig. 20, 50% of the slave side lies to the
left of the master side, and thus more than 50% of the slave side will be assigned to 33% of the contact
domains. If the mesh is refined 10 times with a proportional increase in processor count, yielding 30 contact
domains, then more than 50% of the slave nodes will be assigned to 1/30, or about 3%, of the contact
domains. This is a definite weakness in the current version of the contact algorithm, but it can be ‘‘fixed’’ by
using a different method of assigning slave nodes to contact domains.
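The 1D picture of Fig. 20 can be mimicked in a few lines. This sketch follows the assignment rule described above (each slave goes to the contact domain of its nearest master node) but is not the ALE3D code itself; the node counts and unit-interval geometry are illustrative assumptions:

```python
def domain_loads(ndom, nodes_per_dom=100):
    """Master nodes span [0, 1), split evenly among `ndom` contact domains;
    slave nodes span [-0.5, 0.5), i.e. a 50% offset as in Fig. 20.
    Each slave is assigned to the domain owning its nearest master node."""
    nmast = ndom * nodes_per_dom
    loads = [0] * ndom
    for i in range(nmast):
        s = -0.5 + i / nmast                             # slave position
        # nearest master index, clamped to the master interval [0, 1)
        nearest = min(max(round(s * nmast), 0), nmast - 1)
        loads[nearest // nodes_per_dom] += 1
    return loads

print(domain_loads(3))    # -> [250, 50, 0]: domain A swamped, C empty
print(domain_loads(30))   # over half of all slaves land in the first domain
```

With 3 contact domains most slaves land in domain A and none in the last domain; with 30 domains more than half of all slaves land in a single domain, reproducing the 1/30 figure quoted above.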
Fig. 21 shows the results for the fixed size scaling test for sliding planes.
Fig. 21. Comparison of actual efficiency with the ideal for a fixed sized problem on the sliding planes test.

3.4. Cylinder test

Fig. 22. Cylinder test.

Two concentric cylinders are modelled in this test, see Fig. 22. The outer cylinder is held fixed while the
inner cylinder is given a constant angular velocity. Each part of the contact surface in this problem
eventually comes into contact with many other parts of the surface. Consequently, there is a large amount of
change in the communication pattern over time. This test does not have the load-imbalance problem of
the sliding planes test and so scalability is improved. One full revolution of the interior cylinder is modelled
over the course of the test. The results for this test are shown in Table 3 and Fig. 23.
Fig. 24 shows the corresponding fixed size results for the problem used in the 256 processor scaled test.
It can be seen that the fixed-size problem becomes communication-bound past 512 processors. This is
not surprising: there are fewer than 400 contact nodes per contact domain at this point, and each processor
is left with little computation to balance the communication overhead.
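The arithmetic behind this observation, under the assumption of one contact domain per processor:

```python
# Fixed-size cylinder problem: the 256-processor scaled case (205,650
# contact nodes, Table 3) is reused at every processor count, so the
# per-domain node count shrinks as processors are added.
total_contact_nodes = 205_650
for procs in (256, 512, 1024):
    print(procs, "procs ->", total_contact_nodes // procs, "contact nodes/domain")
```

Past 512 processors each domain is left with only a few hundred contact nodes, too little computation to hide the communication overhead.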
Table 3
Results for cylinder test
Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 12,800 1600 6,642 830 1.0
16 25,536 1596 13,110 819 1.01
32 51,200 1600 26,082 815 1.13
64 102,144 1596 51,754 809 1.21
128 204,800 1600 103,362 808 1.40
256 408,576 1596 205,650 803 1.56
512 819,200 1600 411,522 804 1.85
1024 1,634,304 1596 819,847 801 2.37
Fig. 23. Comparison of actual scaled efficiency to the ideal for the cylinder test.
Fig. 24. Comparison of actual efficiency with the ideal for fixed sized problem on the cylinder test.
3.5. Concentric spheres test
This final test has two contact surfaces separating three concentric spherical shells. The outer and inner
shells are a metallic material, while the middle shell is an explosive material. Only one octant of the material
is modelled using symmetric boundary conditions. At the initial time the explosive is ignited, leading to a
rapid expansion of the material. Fig. 25 shows the initial problem configuration. While there is not a lot of
movement along the contact surfaces, there is some, and to capture the correct physics it is necessary to
allow the free motion. The most important requirement on the contact surfaces in this application is the
need to accurately model shock waves as they cross the surface.

Fig. 25. Initial configuration.

Table 4
Results for concentric spheres test

Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 21,240 2655 5312 664 1.0
16 42,350 2646 7669 479 0.90
32 84,860 2652 12,936 404 0.90
64 169,920 2655 20,740 324 1.09
128 338,800 2646 30,064 235 1.11
256 678,880 2652 50,948 199 1.37
512 1,359,360 2655 81,956 160 1.62
1024 2,710,400 2646 119,044 116 1.93
Note that unlike the previous tests, the ratio of contact nodes to overall zones decreases as this test is
scaled up, see Table 4.
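This is the expected surface-to-volume effect: the contact surfaces are two-dimensional while the zones fill a volume, so under uniform refinement the node-to-zone ratio should fall roughly like the inverse cube root of the zone count. The Table 4 data bear this out:

```python
# Table 4 data: zones fill a 3D volume while contact nodes lie on 2D
# surfaces, so the node-to-zone ratio should decay roughly like the
# inverse cube root of the zone count under uniform refinement.
zones = [21240, 42350, 84860, 169920, 338800, 678880, 1359360, 2710400]
nodes = [5312, 7669, 12936, 20740, 30064, 50948, 81956, 119044]
for z, n in zip(zones, nodes):
    print(f"zones={z:8,d}  contact nodes/zone = {n / z:.3f}")
```

The ratio drops monotonically from 0.25 at 8 processors to about 0.04 at 1024, which is why the contact work shrinks relative to the rest of the calculation.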
The shrinkage in the amount of actual computation initially leads to a reduced run time in the above
results. However, as the processor count increases, the amount of computation becomes less significant
relative to the communication time, and the test suffers the mild non-scalability discussed earlier. A plot
of the timing results is presented in Fig. 26.

Fig. 26. Comparison of actual scaled efficiency to the ideal for the concentric spheres test.

Fig. 27. Comparison of actual efficiency with the ideal for fixed sized problem on the concentric spheres test.
Fig. 27 shows the results of a fixed size scaling test. As in the other tests, the problem size previously used
for 256 processors is again used for all processor counts of the fixed size test.
References
[1] A. Anderson et al., User's Manual for ALE3D, Lawrence Livermore National Laboratory, 2003.
[2] R. Couch, R. McCallen, I. Otero, R. Sharp, 3D Metal Forming Applications of ALE Techniques, Simulation of Materials
Processing: Theory, Methods, and Applications, Balkema, Rotterdam, 1995.
[3] A. Koniges (Ed.), Industrial Strength Parallel Computing: Programming Massively Parallel Systems, Morgan Kaufmann
Publishers Inc., Los Altos, CA, 1999.
[4] G. Karypis, V. Kumar, METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 1.0, University
of Minnesota, Department of Computer Science, 1995.
[5] D.J. Benson, J.O. Hallquist, A single surface contact algorithm for the post-buckling analysis of shell structures, Comput.
Methods Appl. Mech. Engrg. 78 (1990).
[6] C.G. Hoover, A.J. DeGroot, J.D. Maltby, R.D. Procassini, ParaDyn: DYNA3D for massively parallel computers, in:
Presentation at Tri-Laboratory Engineering Conference on Computational Modelling, October 1995.
[7] S.W. Attaway, B.A. Hendrickson, S.J. Plimpton, D.R. Gardner, C.T. Vaughan, K.H. Brown, M.W. Heinstein, A parallel
contact detection algorithm for transient solid dynamics simulations using PRONTO3D, Comput. Mech. 22 (1998).
[8] K. Brown, S. Attaway, S. Plimpton, B. Hendrickson, Parallel strategies for crash and impact simulations, Comput. Methods
Appl. Mech. Engrg. 184 (2000).
[9] M.J. Berger, S.H. Bokhari, A partitioning strategy for nonuniform problems on multiprocessors, IEEE Trans. Comput. C-36 (1987)
570–580.
[10] M.L. Wilkins, Calculation of elastic–plastic flow, Lawrence Livermore National Laboratory, UCRL-7322, rev 1, 1969.
[11] T.G. Pierce, A parallel algorithm for contact in a finite element hydrocode, Lawrence Livermore National Laboratory, UCRL-
LR-154063.