Comput. Methods Appl. Mech. Engrg. 194 (2005) 3127–3146
www.elsevier.com/locate/cma
A parallel two-sided contact algorithm in ALE3D
Tim Pierce a,*, Garry Rodrigue b
a Lawrence Livermore National Laboratory, Mail Stop L-98, Livermore, CA 94551, United States
b Department of Applied Science, University of California at Davis, Davis, CA 95616, United States
Received 11 November 2003; accepted 20 August 2004
Abstract
A scalable parallel algorithm for treating two-sided contact in a finite-element multi-physics code (ALE3D) is pre-
sented. This method assumes that proximity between the two sides changes continuously, and uses a local search to
update proximity relations each cycle. The evolving communication pattern is treated entirely by local, nearest-neigh-
bor communication; there is no global communication. Opening and closing voids, overlapping and intersecting contact
surfaces, and a number of other special features are supported.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Contact surface; ALE; Finite element; Parallel computing
1. Introduction
ALE3D is an arbitrary Lagrangian–Eulerian (ALE) finite-element code that treats the fluid and elastic–plastic response of materials on an unstructured 3D grid. The major components of the code are explicit and implicit continuum mechanics, heat transfer, and chemistry [1]. ALE3D is used primarily to perform large-deformation transient dynamics simulations, including those involving strong shocks. It has also been applied to a number of metal-forming applications [2].
Problems of this type often involve contact surfaces: 2D surfaces defined by the meeting (potential or
actual) of separate material regions. An impenetrability constraint is enforced along the contact surface.
0045-7825/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.cma.2004.08.011
This work was performed under the auspices of the US Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.
* Corresponding author.
E-mail addresses: [email protected] (T. Pierce), [email protected] (G. Rodrigue).
However, the materials are (usually) free to slide relative to each other tangential to the contact surface. A
friction model may be used in the simulation of tangential sliding.
The finite element procedure begins with a discretization of the material regions into a mesh of zones, or
elements. The mesh in the parallel version of ALE3D is further decomposed into submeshes (called domains), and typically one domain is assigned to each processor of a parallel distributed-memory machine. This is a fairly standard approach to parallelizing finite-element codes [3], and a number of tools, such as METIS [4], have been developed to produce a well-balanced decomposition. Typically each zone in a domain is permanently assigned to that processor, i.e. the decomposition is static.
The force at a given node (mesh vertex), which ultimately dictates its displacement, is a function of the
stress in neighboring zones. If the neighboring zones fall into multiple domains, then inter-processor com-
munication will be required to calculate the force at the node. Since the decomposition is static, this com-
munication will have a static, local (or nearest-neighbor) topology.
Nodes on contact surfaces are exceptional, because the set of neighbor zones includes zones directly across from the node on the other side of the surface, and these change due to relative tangential motion
at the surface. Thus the calculation of nodal forces on contact nodes requires a dynamic communication
topology. A major part of the parallel contact problem consists of determining, at each time step, the cur-
rent neighbor zones of each contact node, finding where on the parallel machine those zones ‘‘live’’, and
bringing each contact node and all its current neighbors together on the same processor so that the nodal
force can be determined.
In surveying various approaches that have been taken to deal with parallel contact surfaces, it is necessary to distinguish between one-sided and two-sided contact. In one-sided contact, any node on a surface designated as a contact surface may come into contact with any face of that surface (excluding faces to
which the node is directly connected). Typical uses of one-sided contact surfaces are car crash simulations,
in which extensive buckling of shell elements occurs, and many-body collisions, where the union of the
surfaces of all bodies forms the one-sided contact surface.
One-sided contact is discontinuous in nature, see Fig. 1. One consequence is that the search for contact
must consider, for each face, all nodes within a geometrical region within which contact is possible.
An example of this is the one-sided contact algorithm in DYNA3D, [5,6], which sorts all contact nodes
into a number of spatial "buckets" in one or more dimensions. The minimum bucket size is chosen such that no contact face is spread over more than three adjacent buckets in each sort dimension. A node in
a given bucket can potentially interact only with faces that intersect that bucket. This bucket sort, an expen-
sive global operation, can be amortized over multiple cycles. The rebucketing frequency is determined by
the maximum velocity of any contact node.
To parallelize, a contiguous set of contact nodes in one or more buckets is assigned to a given processor.
All nodes in these buckets and neighboring buckets must be sent to that processor in order to perform the
contact enforcement. Note that this is in effect a second independent, dynamic decomposition, in addition
to the static bulk decomposition. This feature is shared by PRONTO, discussed next, and by the ALE3D contact algorithm.
PRONTO [7,8] sorts the set of contact nodes and faces of a single-sided contact surface in three dimen-
sions using recursive coordinate bisection (RCB) [9]. RCB provides dynamic load balancing efficiently.
Fig. 1. Nodes d and b come into contact with face x as the buckled portion of the surface moves to the right. Node c located between b
and d does not.
Fig. 2. Node d is nearest to face x at one point in time while node b is nearest to x at a later time. Thus, there is an intervening time
when c is nearest to x.
Nodes and faces are sent at each time cycle from their bulk domain to the RCB contact domain of the pre-
vious time cycle. The new RCB is calculated from there and a small subset of data is transferred from the
old RCB domain to the new RCB domain, taking advantage of the fact that the RCB decomposition evolves in an incremental fashion. A number of global operations with short message lengths are required
to orchestrate the parallel RCB procedure.
Two-sided contact as formulated in ALE3D is simpler than one-sided contact because of the additional
assumption that contact between the two sides evolves continuously, see Fig. 2. This assumption will be
further developed in the next section. It is precisely what allows the ALE3D dynamic domain decomposi-
tion to evolve with a complete absence of global communication. In short, a global search is initially per-
formed to find nearest nodes on the opposing side, and thereafter the search for nearest nodes is restricted
to the mesh neighborhood of the nearest node during the preceding time cycle.
2. The contact algorithm
2.1. Terminology
The problem mesh is the discretization of one or more physical bodies into zones. The mesh is then par-
titioned into bulk domains, each assigned to its own processor with separate memory. The processors share data by message passing.
The faces forming the exterior of the problem mesh are called the boundary faces. A two-sided contact
surface consists of a pair of boundary face subsets, the master side and the slave side, see Fig. 3(b). These
subsets are specified by the user, and presuppose some knowledge of which parts of the bodies will come in
contact. Each side of a contact surface is a 2D mesh. The relationship (or connectivity) between the various
faces and nodes of each mesh is fixed. Each face on a contact side always consists of the same nodes, and
each node is a vertex for the same set of faces.
Fig. 3. (a) Bulk decomposition into eight domains where each color represents a different bulk domain and (b) the red and blue
surfaces form the master and slave sides of one contact surface, while the green and yellow surfaces form the master and slave sides of a
different contact surface. (For interpretation of color the reader is referred to the web version of this article.)
Fig. 4. The neighborhood of node ‘‘a’’ consists of all circled nodes (plus ‘‘a’’ itself).
The neighborhood of a contact node consists of all the nodes within two faces of the specified node, see
Fig. 4. Although a uniform mesh is shown in Fig. 4 for simplicity, ALE3D supports arbitrarily connected
meshes.
Contact enforcement in explicit time-differencing codes is essentially a local process where each ‘‘patch’’
of nodes on one side of a contact surface is only affected by a patch of nodes directly opposite it. However,
the opposing patch of nodes may be continually changing as the two sides move relative to each other. The
order node of a given node on one side of a contact surface is the node nearest to it on the opposite side of the contact surface. Each contact node has an order node during a time cycle, but this order node constantly
changes as the problem evolves. The amount of change allowed in a given time cycle is limited; the order
node of a particular node at the next time cycle must share a face with its order node during the present time
cycle. In other words there must exist a face on the opposing side such that the previous and current order
node are both vertices. This allows us to limit the search for the new order node to a local search at each
time cycle.
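The local order-node search described above can be sketched as follows (a minimal illustration, not the ALE3D implementation; the data structures and names are our own): candidates for the new order node are restricted to nodes that share a face with the previous order node.

```python
import math

def update_order_node(p, prev_order, opp_faces, opp_coords):
    """Local order-node search: p is the contact node's position,
    opp_faces lists the opposing side's faces (tuples of node ids),
    opp_coords maps opposing node ids to coordinates."""
    # Candidate set: the previous order node plus every node that
    # shares a face with it.
    cand = {prev_order}
    for face in opp_faces:
        if prev_order in face:
            cand.update(face)
    # The new order node is the nearest candidate.
    return min(cand, key=lambda c: math.dist(p, opp_coords[c]))
```

With opposing faces (0,1), (1,2), (2,3) and a previous order node of 1, the candidate set is {0, 1, 2}; a node that has drifted toward node 2 picks up 2 as its new order node, while node 3 is never considered in this cycle.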
The locale of a contact node consists of the node itself, its neighborhood, its order node, and the neigh-
borhood of its order node, see Fig. 5. The locale of a node is the minimum amount of state required to perform the contact enforcement and subsequent order node update on that node. The locale is a more pre-
cise definition of the ‘‘patch’’ referred to earlier.
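The locale definition lends itself to a direct sketch (illustrative data structures, not ALE3D's): a node's two-face neighborhood is found by walking face adjacency twice, and the locale is the union of the node's neighborhood with its order node's neighborhood on the opposing side.

```python
def neighborhood(node, faces, depth=2):
    """All nodes within `depth` faces of `node`; faces are tuples of node ids."""
    adj = {}
    for face in faces:
        for n in face:
            adj.setdefault(n, set()).update(face)
    seen = {node}
    for _ in range(depth):
        # expand one face-hop: every node of every face touching the set so far
        seen = seen.union(*(adj.get(n, {n}) for n in seen))
    return seen

def locale(node, order_node, own_faces, opp_faces):
    """Node + its neighborhood + order node + the order node's neighborhood."""
    return neighborhood(node, own_faces) | neighborhood(order_node, opp_faces)
```

On a 1D chain of faces, the two-face neighborhood of s3 is {s1, ..., s5}, and with order node m3 on the opposing chain the locale adds {m1, ..., m5}.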
Contact enforcement is parallelized by partitioning the contact surface into a number of contact domains,
or CDs, and then assigning each contact domain to a processor. The nodal data resides permanently in the
bulk domains and is sent to the contact domains for contact enforcement during a time cycle. The updated
nodal data is then returned to the bulk domains.
Within a single contact domain, contact forces are balanced using a symmetric, momentum-conserving
approach that is modelled after the 2D method of HEMP [10]. This method has proven to be robust, stable, and accurate in the presence of strong shocks. Details may be found in [11].
The communication topology is the overall organization of inter-processor communication and deter-
mines the specific communication pattern between bulk domains and contact domains during a given time
a
Fig. 5. The locale of node ‘‘a’’ consists of all nodes marked by open circles.
cycle. The cross-section of the topology for a given processor is the set of processors with which it commu-
nicates. Classes of communication topology relevant to this paper are
1. Static. Each processor communicates with the same set of other processors (its "neighbors") during each time cycle.
2. Dynamic. Each processor's neighbors may change from one time cycle to the next.
3. Local. There is an upper bound on the number of processors that a given processor communicates with, independent of the total number of processors.
4. Global. The number of processors that a given processor communicates with increases as the overall number of processors increases.
Communication between bulk domains (BDs) and contact domains (CDs) is accomplished by:
1. Each BD determining what information to send to which CD.
2. Each CD determining from which BDs it will receive information.
Both of the above steps are necessary as each communication requires the sender and receiver to explic-
itly cooperate. BD messages cannot be simply sent from various places while the CD continuously polls for
incoming messages. The CD would never know when it had waited long enough (unless it waited until it got
messages from all BDs, which would imply global communication).
2.2. Contact domains
If a contact node is assigned to a particular contact domain then its state is updated and returned by that
contact domain during the process of contact enforcement. A given master node is always assigned to the
same contact domain. Slave nodes, on the other hand, are assigned to the contact domains of their order
nodes. The contact domain a slave node is assigned to may change from cycle to cycle because of relative
movement of the order nodes.
Simplified 1D diagrams will henceforth be used to represent the contact surface. Also, node neighborhoods will only include nodes within one face (or segment, in 1D) of a given node. The description of var-
ious scenarios is thus less cluttered while the extension to 2D surfaces is obvious.
A contact surface is shown in Fig. 6 at two different times, t1, t2.
If a node is assigned to a CD, then the CD also requires the node's locale, see Fig. 7.
The nodes m10 and s8 in Fig. 7 are referred to as ghost nodes of CD 5.
The set of master nodes, both assigned and ghost, that are required by a given contact domain is the
same each cycle, so the focus from now on will be on the more complicated and dynamic issue of slave
nodes.
Fig. 6. t1: s4 orders on m3, so that both m3 and s4 are assigned to CD 5. t2: The slave side has moved, so that s4 now orders on m4 and is assigned to CD 6.
Fig. 7. m9 is assigned to CD 5, so all circled nodes must be sent to CD 5. Nodes m10 and s8 are included even though they are assigned to a different CD.
2.3. Bulk domains
A bulk domain will send one of its local slave nodes to a particular CD when:
1. The slave node's current order node is statically assigned to the CD,
2. the slave node is the order node of a master node that is assigned to the CD, or
3. the slave node is in the neighborhood of another slave node of the first or second class.
The second and third classes are a consequence of the locale requirements for assigned nodes. A node in the
first class is called a rule 1 generator for the CD because it generates a list of accompanying nodes (its neigh-
bors) that must also be sent to the CD. A slave node is a rule 1 generator for exactly one CD, namely, its
assigned CD (which changes dynamically). A slave node in the second class is called a rule 2 generator for the CD. A slave node can be a rule 2
generator for multiple CDs (or none), see Fig. 8.
The rule 1 CD for a given slave node is determined at the end of the previous cycle, after the positions of the slave and all other nodes in its locale have been updated. At that point the new order node for the slave
is searched for, starting from the old order node (this is done in the contact domain). Once the new order
node is found, the order node's assigned CD becomes the slave's new rule 1 (i.e. assigned) CD.
The rule 2 CD list is built a little less directly. The slave was sent to a number of CDs as either an as-
signed node or a ghost node during the previous cycle. The contact domain determined the new order node of each of its assigned masters at the end of that cycle. If the slave node was chosen as an order node of any
assigned master node(s) on that CD, the slave node was ‘‘tagged’’. The CD returned to the BD a list of all
of the BD's slave nodes that were tagged.
If a slave node was sent to multiple CDs, then any or all of those CDs may have returned a tagged status
on that node. Any CD that did so is then added to the node's rule 2 CD list for the next (i.e. current) time
cycle. For example, both CD 2 and CD 3 in Fig. 8 will return a tag on node s4 to the BD(s) where s4 resides,
and s4's rule 2 CD list will be {CD 2, CD 3}.
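The tagging round trip can be sketched by inverting the master-to-order-node map (our own minimal data structures): each assigned master contributes its CD to the rule 2 list of the slave it orders on.

```python
def rule2_cd_lists(master_order_node, master_cd):
    """master_order_node: {master: slave it orders on};
    master_cd: {master: CD the master is assigned to}.
    Returns {slave: set of CDs that tag it} for the next cycle."""
    tags = {}
    for master, slave in master_order_node.items():
        tags.setdefault(slave, set()).add(master_cd[master])
    return tags
```

For the configuration of Fig. 8 (m5 and m6 both order on s4, with m5 assigned to CD 2 and m6 to CD 3), s4's rule 2 CD list comes out as {CD 2, CD 3}, matching the text.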
Fig. 8. Both m5 and m6 order on s4. Hence, s4 is a rule 2 generator for both CD 2 and CD 3 (and a rule 1 generator for CD 3).
Fig. 9. Nodes s1–s6 are split between BD 1 and BD 2; the circle represents the neighborhood of node s4.
2.4. Domain communication
2.4.1. BD ↔ BD
Consider the example in Fig. 9. If node s4 has rule 1 CD set to CD 3, then nodes s3, s4, and s5 must all be sent to CD 3. Since node s4 is on BD 2, BD 2 can determine to send nodes s4 and s5 to CD 3 by looking at s4's rule 1 CD. BD 1 determines whether to send s3 to CD 3 by first creating a ghost layer of nodes (called proxy nodes) around each bulk domain's local set. These nodes are generated by taking the union of the bulk domain's local nodes' neighbors (i.e. nodes within two faces of all the local nodes) and then removing the local nodes, so that only the proxy nodes for the bulk domain remain. A list of the proxy nodes is sent to the other bulk domains for matching against their own local nodes. A list (possibly null) of matches is then returned.
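A sketch of the proxy-layer construction just described (illustrative representation; `neighbors` maps each node to its two-face neighborhood):

```python
def proxy_nodes(local, neighbors):
    """Union of the local nodes' neighborhoods, minus the local nodes."""
    layer = set()
    for n in local:
        layer |= neighbors.get(n, set())
    return layer - set(local)

def match_proxies(proxies, other_bd_locals):
    """For each other BD, the subset of our proxies local there (possibly empty)."""
    return {bd: proxies & set(nodes) for bd, nodes in other_bd_locals.items()}
```

A BD holding {s4, s5, s6} with chain neighborhoods ends up with the proxy layer {s3, s7}, and matching against another BD's locals identifies which of those proxies that BD owns.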
A static communication topology is established between a bulk domain and the other bulk domains containing the proxy nodes. The topology and the list of nodes to be sent are static. Each bulk domain sends to the other BDs in its "neighborhood" (i.e. the BDs that have proxy copies of any of its local nodes) the latest rule 1 CD and rule 2 CD list for each of the shared nodes, and receives the same. In the example of Fig. 9, this allows BD 1 to know what s4's rule 1 and rule 2 CDs are.
For each local node in a bulk domain, a list of neighboring domains, if any, on which the node is proxy is built and permanently associated with the node. This is referred to as the node's neighbor BD list.
The BD ↔ BD communication topology is a local, nearest-neighbor topology because it is bounded by the product of the number of native nodes and the number of neighbors per node. In practice, each bulk domain will have on the order of eight neighbors, i.e., a 3 × 3 grid where the central element represents the local BD itself.
2.4.2. BD ↔ CD
Each contact domain builds a list of bulk domains called the next time BDs list (Section 2.5.2) from which it expects to receive data during the next time cycle. The contact domain preposts the "receives" from each bulk domain in its next_time_BDs list at the start of a time cycle's BD ↔ CD communication pass. A buffer is allocated on each bulk domain for each contact domain on its CD "send" list, containing state information such as the node's position, velocity, and order node, and its neighbor BD list, assigned (i.e. rule 1) CD, and list of ghost CDs. Upon completion, this buffer is sent to the destination CD, where the contact domain assembles it, together with the data from all other received buffers, into the CD's piece of the contact surface.
Fig. 10. CD ↔ CD communication between CD 2 and CD 3, involving slave node s7 and master nodes m5 and m6.
2.4.3. CD ↔ CD
Consider the situation in Fig. 10.
s7's order node is m6 at the start of the cycle, so s7 is assigned to CD 3. However, surface movement results in s7's new order node being m5. This calculation is done by CD 3, since that is where s7 is assigned. However, CD 2 also needs to know s7's new order node, because it has to accumulate s7's neighbor BD list into its next time BDs list, since this is where s7 will be assigned next cycle. This situation is accommodated by defining another communication topology between a contact domain and other contact domains where any of its ghosts are assigned or where any of its assigned nodes are ghosts. This is a local topology that is bounded by the total number of nodes in a contact domain. The typical number of neighbors is on the order of eight.
This communication topology is also used several times in a time cycle to communicate intermediate results during the course of the contact calculation. For example, in Fig. 10, node s7 needed the freshly updated coordinates of m5 (a ghost node to CD 3) in order to determine that its new order node is indeed m5.
To construct this topology, for each local node the bulk domain builds a list of which contact domains the local node will be sent to as a ghost. The bulk domain sends this list when it sends the node to its assigned CD. When it sends the node to any other CD as a ghost, it sends along with it the information of which CD the node is assigned to. Each CD constructs from this information a set of CDs where its assigned nodes are ghosts, and another set of CDs where its ghost nodes are assigned. Unlike the BD ↔ BD topology, the CD ↔ CD communication topology is dynamic.
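The two CD ↔ CD neighbor sets can be sketched like this (our own representation of the per-node bookkeeping the text describes):

```python
def cd_cd_neighbors(my_cd, assigned_cd, ghost_cds):
    """assigned_cd: {node: CD the node is assigned to};
    ghost_cds: {node: set of CDs holding the node as a ghost}."""
    assigned_nodes_ghost_on = set()   # CDs where my assigned nodes are ghosts
    ghost_nodes_assigned_on = set()   # CDs where my ghost nodes are assigned
    for node, cd in assigned_cd.items():
        if cd == my_cd:
            assigned_nodes_ghost_on |= ghost_cds.get(node, set())
        elif my_cd in ghost_cds.get(node, set()):
            ghost_nodes_assigned_on.add(cd)
    return assigned_nodes_ghost_on - {my_cd}, ghost_nodes_assigned_on
```

For Fig. 10, where s7 is assigned to CD 3 and ghosted on CD 2, CD 3 finds CD 2 in its first set and CD 2 finds CD 3 in its second, so each knows to exchange s7's updates with the other.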
2.5. Domain preparation
2.5.1. Bulk domain
The bulk domain builds a package of data to send to each of its neighbor contact domains after the BD ↔ BD communication pass by looping through all its local and proxy nodes and looking at each node's rule 1 CD and rule 2 CD list. The first time a CD is encountered, the BD creates a new list for nodes going to that CD and adds the CD to its list of CDs to send to.
If the node is local, it is added to the list of nodes being sent to the rule 1 CD (if not already on the list) and flagged as an assigned node. The BD adds all the local neighbors of the node to the lists of all CDs in the rule 1 and rule 2 CD lists, regardless of whether the node is local or proxy.
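The packaging loop above can be sketched as follows (illustrative structures; the particular rule 1 CD assignments in the example are invented, and `local_neighbors` gives each node's neighbors restricted to this BD):

```python
def build_send_lists(nodes, local, rule1, rule2, local_neighbors):
    """nodes: all local and proxy nodes on this BD; local: set of local nodes;
    rule1: {node: assigned CD}; rule2: {node: set of rule-2 CDs};
    local_neighbors: {node: its neighbors that are local to this BD}."""
    send = {}      # CD -> nodes to send there
    assigned = {}  # CD -> nodes flagged as assigned there
    for n in nodes:
        if n in local:
            send.setdefault(rule1[n], set()).add(n)
            assigned.setdefault(rule1[n], set()).add(n)
        # Local neighbors go to every CD in the node's rule-1/rule-2 lists,
        # whether the node itself is local or proxy.
        for cd in {rule1[n]} | rule2.get(n, set()):
            send.setdefault(cd, set()).update(local_neighbors.get(n, ()))
    return send, assigned
```

Note that a proxy node is never itself placed on a send list; it only generates its local neighbors, which is exactly the Fig. 9 situation where BD 1 ships s3 on behalf of s4.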
2.5.2. Contact domain
Each contact domain must know beforehand which bulk domains will send it messages, in order to pre-
post the receives of those messages. This determination is actually done at the end of the previous cycle.
After the order nodes have been updated at the end of the previous time cycle, each contact domain determines which slave nodes have order nodes that are among its assigned masters (the CD's new rule 1 generators) and which slave nodes are the order nodes of its assigned masters (the CD's rule 2 generators). The bulk domains that own these generators will communicate with the contact domain next (i.e. current) cycle. In addition, the contact domain will also receive all the neighbors of its generators, and these neighbors may not all be from the same bulk domains as the generators themselves (see Fig. 9).
To solve this problem, when a node is sent to a CD, extra data called the neighbor BD list is also sent with it. For a given node, its neighbor BD list is the set of all bulk domains on which the node is local or proxy. This is a static list, and can be computed during initialization. If a bulk domain appears in a given node's neighbor BD list, then that bulk domain contains at least one node in the neighborhood of the given node. Thus, the contact domain will also communicate with all bulk domains in the neighbor BD list of any of its generators.
Thus, at the end of the previous cycle the contact domain finds all slaves that order on its assigned mas-
ters and all slaves that are ordered on by its assigned masters. Then, the contact domain takes the union of
these nodes' neighbor BD lists in order to determine which BDs it will talk to during the next time cycle.
This list is stored in the next time BDs list until the start of the next (i.e. current) cycle.
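The end-of-cycle prediction reduces to a set union, sketched here with illustrative structures:

```python
def next_time_bds(rule1_gens, rule2_gens, neighbor_bd_list):
    """Union of the neighbor BD lists of all rule-1 and rule-2 generators."""
    bds = set()
    for node in set(rule1_gens) | set(rule2_gens):
        bds |= neighbor_bd_list[node]
    return bds
```

This is why the neighbor BD list travels with each node: a generator's neighbors may live on a different bulk domain than the generator itself (the s3/s4 situation of Fig. 9), yet the CD must prepost a receive from that domain too.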
2.6. Communication summary
The ALE3D parallel contact algorithm uses three types of communication topologies in dynamically
evolving the communication patterns of material contact.
BD ↔ BD communication. Ties the bulk domains together by sharing information near mutual boundaries, and thus allows a given bulk domain to "sense", through the involvement of its proxy nodes, when it is about to become involved in the work of a specific contact domain.
BD ↔ CD communication. Supplies the contact domain with the data needed to perform the contact enforcement, to update the order nodes of the CD's assigned nodes, and to predict, on the basis of the updated order nodes, which bulk domains will communicate with it during the next time cycle. Returns the updated nodal state for its assigned nodes to their BD homes.
CD ↔ CD communication. Supports the contact domain in determining which slaves order on its assigned master nodes, and thus which bulk domains the contact domain will communicate with during the next time cycle. CD ↔ CD communication also permits less redundant calculation of the ghost state, trading communication for redundant calculation.
The control flow for a single time cycle is given in Fig. 11.
2.7. A simplified example
A simple example is provided to demonstrate how the communication pattern evolves. In a more real-
istic example the surfaces would be 2D, and the node neighborhoods would include all nodes within two
faces. As a further simplification, all the nodes on one side of the surface will exactly match up with a node
on the other side, so that the rule 2 CD list will only contain the rule 1 CD. Nevertheless this example shows
much of the mechanism by which BDs and CDs start and stop communicating with each other in the course
of contact evolution.
Fig. 11. Control flow for a single time cycle.
1. Initial configuration; mi orders on si
s1 s2 s3 s4 s5 s6
m1 m2 m3 m4 m5 m6
2. Static layout of the contact surface into the six bulk domains. Bracketed nodes are proxy nodes:
BD 1: s1 s2 [s3]    BD 2: [s2] s3 s4 [s5]    BD 3: [s4] s5 s6
BD 4: m1 m2 [m3]    BD 5: [m2] m3 m4 [m5]    BD 6: [m4] m5 m6
3. Static BD ↔ BD topology:
BD 1 ↔ BD 2    BD 2 ↔ {BD 1, BD 3}    BD 3 ↔ BD 2
BD 4 ↔ BD 5    BD 5 ↔ {BD 4, BD 6}    BD 6 ↔ BD 5
4. Static master side contact decomposition. Assigned nodes are unbracketed and ghost nodes are bracketed:
CD 1: m1 [m2]    CD 2: [m1] m2 [m3]    CD 3: [m2] m3 [m4]
CD 4: [m3] m4 [m5]    CD 5: [m4] m5 [m6]    CD 6: [m5] m6
5. Contact domain configuration (assigned nodes unbracketed, ghost nodes bracketed):
CD 1: m1 [m2], s1 [s2]    CD 2: [m1] m2 [m3], [s1] s2 [s3]    CD 3: [m2] m3 [m4], [s2] s3 [s4]
CD 4: [m3] m4 [m5], [s3] s4 [s5]    CD 5: [m4] m5 [m6], [s4] s5 [s6]    CD 6: [m5] m6, [s5] s6
6. Contact surface movement during time cycle tn-1:
s2 s3 s4 s5 s6
m1 m2 m3 m4 m5 m6
7. CD allocation and order node updates at tn-1 (assigned nodes unbracketed, ghost nodes bracketed):
CD 1: m1 [m2], s1 [s2]    CD 2: [m1] m2 [m3], [s1] s2 [s3]    CD 3: [m2] m3 [m4], [s2] s3 [s4]
CD 4: [m3] m4 [m5], [s3] s4 [s5]    CD 5: [m4] m5 [m6], [s4] s5 [s6]    CD 6: [m5] m6, [s5] s6
8. The contact domains during time cycle tn (assigned nodes unbracketed, ghost nodes bracketed):
CD 1: m1 [m2], [s1] s2 [s3]    CD 2: [m1] m2 [m3], [s2] s3 [s4]    CD 3: [m2] m3 [m4], [s3] s4 [s5]
CD 4: [m3] m4 [m5], [s4] s5 [s6]    CD 5: [m4] m5 [m6], [s5] s6 [s7]    CD 6: [m5] m6, [s6] s7
A particular event is illustrated in Fig. 12. Note that BD 2 does not send anything to CD 1 during tn-1, as BD 2 → CD 1 is not part of the communication topology. Also, BD 2 finds that it will send data during tn to CD 1, and BD 2 → CD 1 gets added to the BD → CD topology. Similarly, communication paths can disappear when they are no longer needed. For example, CD 3 received s2 from BD 1 during time cycle tn-1, while in the next time cycle CD 3 no longer needs s2; consequently BD 1 → CD 3 is dropped out of the topology.
3. Timing studies
Scalability of the contact algorithm is tested by performing several timing studies. The test geometries
are rectangular, cylindrical, and spherical, and have been designed to isolate the contact scalability issues.
Scaled efficiency and fixed efficiency are both measured. Since contact surfaces are two-dimensional, the number of contact surface elements in conventional scaling grows as N^(2/3) of the overall element count N. If the overall amount of work per processor is held constant, contact work per processor actually shrinks for scaled speedup. This clouds the issue of scalability of the contact algorithm itself. Ideally one would hope to see super-linear scaling of the contact phase in this case, but quantifying this is difficult. The simplest solution is to hold the contact work per processor constant, by performing mesh refinement only in the dimensions of the contact surface.
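To make the N^(2/3) point concrete (our illustration): under conventional 3D refinement, the contact surface's share of the elements falls as N^(-1/3), so holding volume work per processor fixed shrinks contact work per processor.

```python
def contact_fraction(n_elements):
    """Contact-surface element count over total element count, ~ N^(-1/3)."""
    return n_elements ** (2.0 / 3.0) / n_elements
```

Refining a cube from 10^3 to 10^6 elements cuts the contact share tenfold, from about 10% to about 1%.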
Fixed efficiency tests are achieved by running a fixed problem size on various numbers of processors. Ideally each doubling of processors should cut the run time by half. However, as the amount of actual work per processor becomes small compared to the time spent in various fixed-overhead work, e.g., communication latency times, this ideal will inevitably fail at some point. The problem sizes chosen exhibit this failure at the largest processor counts run.
All tests were performed on the Lawrence Livermore National Laboratory ASCI "Y" machine, which is an IBM SP system with 488 nodes and 1.5 GBytes of memory per node. Each node consists of 4 PowerPC 604e processors, running at 332 MHz with a peak of 664 MFlops, that share the node's memory, for a total of 1952 processors.
Fig. 12. One particular event of the example.
3.1. A case for local communication
It was observed in the following sections that there was both a steady increase in typical cycle times and an increasing variance in the individual cycle times when running with larger numbers of processors. Indeed, some time cycles were four times as long as others. This occurred even in the fixed planes test (see below), where identical work and communication are performed during each time cycle. Moreover, which cycles took longer on a given run was entirely unrepeatable.
This behavior is now understood to be a result of system "daemon" processes that randomly interrupt various processors to perform housekeeping tasks. If each processor ran entirely independently and the system interruptions were uniformly distributed, then scalability would not be affected. However, if one processor is interrupted during processor intercommunication, then all processor communication must wait. While this is most dramatic for global operations such as MPI_Barrier and MPI_Allreduce, it also affects local communication patterns. To see this, consider Fig. 13.
Fig. 13. Five-processor example (A, B, C, D, E) where each processor communicates only with its adjacent neighbors.
If processor A is interrupted and temporarily cannot send the expected data to processors B and E, then
at some point processors B and E can go no further and cannot send the data to processors C and D. Thus,
processors C and D must wait for processor A to return from the interruption. This ‘‘daisy-chain’’ can be of
any length. The probability of a processor being interrupted by the system in any given time interval in-
creases as the processor count is increased, resulting in the increased run times observed as well as the var-
iance in cycle times.
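A back-of-the-envelope model of this effect (our illustration, not from the paper): if each processor is independently interrupted during a cycle with probability p, the chance that at least one processor, and hence the whole daisy-chain, stalls is 1 − (1 − p)^P, which grows toward 1 with the processor count P.

```python
def p_cycle_stalled(p, procs):
    """Probability that at least one of `procs` processors is interrupted,
    assuming independent interruptions with per-processor probability p."""
    return 1.0 - (1.0 - p) ** procs
```

At an assumed p = 0.1% per cycle, this gives roughly 0.8% for 8 processors but about 40% for 512, consistent with the growing mean and variance of cycle times observed at large processor counts.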
A simple test to measure this phenomenon is constructed where each processor communicates with six neighbor processors. For simplicity, the neighbors are the six with processor numbers (MPI rank) nearest to the given processor. The test is run for 1000 time cycles, where each processor sends and receives 2000 floating-point numbers to and from its neighbors during a time cycle. This test should ideally take the same amount of
time regardless of the total number of processors involved. The results are shown (normalized) in Fig.
14 and may seem to indicate that producing an algorithm that uses only local communication is futile.
However, a similar test was constructed which instead performs a global operation (MPI_Barrier) within
the loop. The results of this test are again shown in Fig. 14.
Since barriers are the simplest form of global communication, in that they involve no data transfer, it is
clear that global communication produces a much bleaker picture for scalability than local communication.
Thus, efforts expended in restricting communication to local interchanges only are worthwhile.
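The gap between the two curves in Fig. 14 can be reproduced qualitatively with a toy model (again an illustrative sketch, not the measured benchmark; the 1% interrupt probability and 10x delay are arbitrary assumptions). A "local" cycle waits only on two ring neighbors, while a "global" cycle is equivalent to a barrier over all processors:

```python
import random

def avg_cycle_time(num_procs, num_cycles, sync, p_interrupt=0.01, delay=10.0, seed=0):
    """Average time per cycle when each processor does one unit of work and
    is interrupted with probability `p_interrupt` (costing `delay` extra).

    sync='local'  : wait only for the two ring neighbors' previous cycle.
    sync='global' : every cycle ends with a barrier over all processors.
    """
    rng = random.Random(seed)
    finish = [0.0] * num_procs
    for _ in range(num_cycles):
        if sync == 'global':
            start = max(finish)  # barrier: nobody proceeds until all arrive
            finish = [start + 1.0 + (delay if rng.random() < p_interrupt else 0.0)
                      for _ in range(num_procs)]
        else:
            finish = [max(finish[i - 1], finish[i], finish[(i + 1) % num_procs]) + 1.0
                      + (delay if rng.random() < p_interrupt else 0.0)
                      for i in range(num_procs)]
    return max(finish) / num_cycles

for p in (64, 1024):
    print(p, "procs: local", round(avg_cycle_time(p, 200, 'local'), 2),
          "global", round(avg_cycle_time(p, 200, 'global'), 2))
```

With the barrier, the per-cycle cost saturates near the full interrupt delay as soon as the processor count is large enough that some processor is almost always interrupted; with local synchronization the degradation is far more gradual.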
3.2. Fixed planes test
Two rectangular boxes of equal size are oriented so that one sits on top of the other with perfect overlap,
see Fig. 15. Larger runs use a finer grid in the x- and y-directions while shrinking the box thickness in the
z-direction. The contact surface is defined to be the surface where the blocks meet. The mesh for the two
blocks is slightly different, as there is one more zone in the x- and y-directions of the bottom block than
in the top block. Thus, the contact nodes do not coincide except at the edges (the contact surface is
orthogonal to the z-direction). The entire machinery of the contact algorithm, including all communication,
is exercised even though there is no movement in this test. Computational load balance should be very good
and any non-scalability is attributable to communication.

Fig. 14. (a) Local communication (green line); (b) barrier communication (red line). Run time is the average of 4 runs. (For
interpretation of color the reader is referred to the web version of this article.)

Fig. 15. Fixed planes.

Table 1
Results for fixed planes test

Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 25,282 3160 12,961 1620 1.0
16 50,622 3164 25,763 1610 1.11
32 101,762 3180 51,521 1610 1.14
64 203,386 3178 102,597 1603 1.24
128 408,322 3190 205,441 1605 1.36
256 815,346 3185 409,481 1600 1.51
512 1,635,842 3195 820,481 1603 1.65
1024 3,264,994 3188 1,636,113 1598 2.57
Table 1 lists the parameters and results for a set of timed runs. Each test is approximately twice the size
of the previous test, and is executed on twice as many processors. All times are normalized.
There is a gradual increase in time attributable to system interrupts leading to communication blocks, as
expected from the previous discussion on IBM SP scalability issues. The overall increase in time is about
3× for a 128× increase in problem size and processor count. Fig. 16 compares the actual scaled efficiency
to the ideal. The normalized maximum run time is shown, which is simply the inverse of the scaled efficiency.
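Since perfect weak scaling would keep the normalized contact time at 1.0, the scaled efficiency plotted in Fig. 16 is just the inverse of the times in Table 1; a quick check:

```python
# Scaled efficiency for the fixed planes test: under perfect weak scaling
# the normalized contact time (Table 1) would stay at 1.0, so the scaled
# efficiency of Fig. 16 is simply its inverse.
procs     = [8, 16, 32, 64, 128, 256, 512, 1024]
norm_time = [1.0, 1.11, 1.14, 1.24, 1.36, 1.51, 1.65, 2.57]
efficiency = [1.0 / t for t in norm_time]
for p, e in zip(procs, efficiency):
    print(f"{p:5d} procs: scaled efficiency = {e:.2f}")
```

The efficiency declines gradually to about 0.39 at 1024 processors.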
A fixed-size test was also executed with various processor counts. The problem chosen is identical to the
one used for the scaled test with 256 processors. Results for the fixed test are compared with ideal scaling in
Fig. 17.
Fig. 16. Comparison of actual scaled efficiency (red line) to the ideal (green line) for fixed planes test. (For interpretation of color the
reader is referred to the web version of this article.)
Fig. 17. Comparison of actual efficiency (red line) with the ideal (green line) for fixed sized problem on the fixed planes test. (For
interpretation of color the reader is referred to the web version of this article.)
3.3. Sliding planes test
Two rectangular boxes are again modelled in this test. In this case they sit at an initial offset as shown in
Fig. 18(a). One block is given a velocity, so that at a later time they are in the configuration shown in Fig.
18(b).
This test exercises the ability of the algorithm to change its communication pattern over time. The
partitioning algorithm decomposes the problem into domains such that each domain is in one block or the
other but not both. This is a consequence of the rule that attempts to minimize connectivity between
domains. The most obvious cut is between the blocks along the contact surface, as there is no connectivity
between blocks.
The domains directly across from a given domain will change as the block slides, thus altering the
communication pattern. Individual slave nodes are assigned to the contact domain of their associated master
node. As the block slides, slaves that were previously ordered on masters located on the edge of a contact
domain will change ordering to new masters in a different contact domain. Thus, the individual data sent to
contact domains changes each time cycle even when the pattern of communicating domains does not.

Table 2 and Fig. 19 present the normalized results for the scaled sliding planes test.
It is clear from the above results that this test does not scale as well as the fixed planes test. The source of
the problem is poor load balance, a consequence of the way in which the slave nodes are assigned to
contact domains. As previously discussed, each slave node is dynamically assigned to the same contact
domain as its statically assigned master order node. Since the blocks are offset, all the slave nodes in the
offset region order on the limited set of master nodes along the edge of the master side closest to the offset
slave region. Fig. 20 depicts a simplified 1D example.
Fig. 18. Sliding planes.
Table 2
Results for sliding planes test
Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 25,600 3200 13,122 1640 1.0
16 51,072 3192 25,990 1624 1.10
32 102,400 3200 51,842 1620 1.20
64 204,288 3192 103,050 1610 1.40
128 409,600 3200 206,082 1610 1.78
256 817,152 3192 410,368 1603 2.15
512 1,628,400 3200 821,762 1605 3.05
1024 3,268,608 3192 1,637,922 1600 5.21
Fig. 19. Comparison of actual scaled efficiency to the ideal for sliding planes test.
Fig. 20. A simplified 1D example: the master side (top) is partitioned into contact domains A, B, and C; the slave side is on the
bottom. Each partition along the top represents a different contact domain.
Although the master side is well load-balanced, most of the slave side is nearest to the master nodes in
contact domain A. Thus, most of the slave nodes are assigned to contact domain A while a much smaller
number are assigned to contact domain B and none are assigned to contact domain C. This results in severe
load imbalance.
The above imbalance gets worse as the problem is scaled up. In Fig. 20, 50% of the slave side lies to the
left of the master side, and thus more than 50% of the slave side will be assigned to 33% of the contact
domains. If the mesh is refined 10 times with a proportional increase in processor count, yielding 30 contact
domains, then more than 50% of the slave nodes will be assigned to 1/30, or about 3%, of the contact
domains. This is a definite weakness in the current version of the contact algorithm, but it can be ‘‘fixed’’ by
using a different method of assigning slave nodes to contact domains.
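The 1D picture of Fig. 20 can be mimicked in a few lines. This sketch follows the assignment rule described above (each slave goes to the contact domain of its nearest master node) but is not the ALE3D code itself; the node counts and unit-interval geometry are illustrative assumptions:

```python
def domain_loads(ndom, nodes_per_dom=100):
    """Master nodes span [0, 1), split evenly among `ndom` contact domains;
    slave nodes span [-0.5, 0.5), i.e. a 50% offset as in Fig. 20.
    Each slave is assigned to the domain owning its nearest master node."""
    nmast = ndom * nodes_per_dom
    loads = [0] * ndom
    for i in range(nmast):
        s = -0.5 + i / nmast                             # slave position
        # nearest master index, clamped to the master interval [0, 1)
        nearest = min(max(round(s * nmast), 0), nmast - 1)
        loads[nearest // nodes_per_dom] += 1
    return loads

print(domain_loads(3))    # -> [250, 50, 0]: domain A swamped, C empty
print(domain_loads(30))   # over half of all slaves land in the first domain
```

With 3 contact domains most slaves land in domain A and none in the last domain; with 30 domains more than half of all slaves land in a single domain, reproducing the 1/30 figure quoted above.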
Fig. 21 shows the results for the fixed size scaling test for sliding planes.
Fig. 21. Comparison of actual efficiency with the ideal for a fixed sized problem on the sliding planes test.

3.4. Cylinder test

Fig. 22. Cylinder test.

Two concentric cylinders are modelled in this test, see Fig. 22. The outer cylinder is held fixed while the
inner cylinder is given a constant angular velocity. Each part of the contact surface in this problem
eventually comes into contact with many other parts of the surface. Consequently, there is a large amount of
change in the communication pattern over time. This test does not have the load-imbalance problem of
the sliding planes test and so scalability is improved. One full revolution of the interior cylinder is modelled
over the course of the test. The results for this test are shown in Table 3 and Fig. 23.
Fig. 24 shows the corresponding fixed size results for the problem used in the 256 processor scaled test.
It can be seen that the fixed-size problem becomes communication-bound past 512 processors. This is
not surprising: there are fewer than 400 contact nodes per contact domain at this point, and each processor
is left with little computation to balance the communication overhead.
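The arithmetic behind this observation, under the assumption of one contact domain per processor:

```python
# Fixed-size cylinder problem: the 256-processor scaled case (205,650
# contact nodes, Table 3) is reused at every processor count, so the
# per-domain node count shrinks as processors are added.
total_contact_nodes = 205_650
for procs in (256, 512, 1024):
    print(procs, "procs ->", total_contact_nodes // procs, "contact nodes/domain")
```

Past 512 processors each domain is left with only a few hundred contact nodes, too little computation to hide the communication overhead.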
Table 3
Results for cylinder test
Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 12,800 1600 6,642 830 1.0
16 25,536 1596 13,110 819 1.01
32 51,200 1600 26,082 815 1.13
64 102,144 1596 51,754 809 1.21
128 204,800 1600 103,362 808 1.40
256 408,576 1596 205,650 803 1.56
512 819,200 1600 411,522 804 1.85
1024 1,634,304 1596 819,847 801 2.37
Fig. 23. Comparison of actual scaled efficiency to the ideal for the cylinder test.
Fig. 24. Comparison of actual efficiency with the ideal for fixed sized problem on the cylinder test.
3.5. Concentric spheres test
This final test has two contact surfaces separating three concentric spherical shells. The outer and inner
shells are a metallic material, while the middle shell is an explosive material. Only one octant of the material
is modelled using symmetric boundary conditions. At the initial time the explosive is ignited, leading to a
rapid expansion of the material. Fig. 25 shows the initial problem configuration. While there is not a lot of
movement along the contact surfaces, there is some, and to capture the correct physics it is necessary to
allow the free motion. The most important requirement on the contact surfaces in this application is the
need to accurately model shock waves as they cross the surface.

Fig. 25. Initial configuration.

Table 4
Results for concentric spheres test

Procs. Total zones Zones/proc. Contact nodes Nodes/proc. Contact time
8 21,240 2655 5312 664 1.0
16 42,350 2646 7669 479 0.90
32 84,860 2652 12,936 404 0.90
64 169,920 2655 20,740 324 1.09
128 338,800 2646 30,064 235 1.11
256 678,880 2652 50,948 199 1.37
512 1,359,360 2655 81,956 160 1.62
1024 2,710,400 2646 119,044 116 1.93
Note that unlike the previous tests, the ratio of contact nodes to overall zones decreases as this test is
scaled up, see Table 4.
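This is the expected surface-to-volume effect: the contact surfaces are two-dimensional while the zones fill a volume, so under uniform refinement the node-to-zone ratio should fall roughly like the inverse cube root of the zone count. The Table 4 data bear this out:

```python
# Table 4 data: zones fill a 3D volume while contact nodes lie on 2D
# surfaces, so the node-to-zone ratio should decay roughly like the
# inverse cube root of the zone count under uniform refinement.
zones = [21240, 42350, 84860, 169920, 338800, 678880, 1359360, 2710400]
nodes = [5312, 7669, 12936, 20740, 30064, 50948, 81956, 119044]
for z, n in zip(zones, nodes):
    print(f"zones={z:8,d}  contact nodes/zone = {n / z:.3f}")
```

The ratio drops monotonically from 0.25 at 8 processors to about 0.04 at 1024, which is why the contact work shrinks relative to the rest of the calculation.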
The shrinkage in the amount of actual computation initially leads to a reduced run time in the above
results. However, as the processor count increases, the amount of computation becomes less significant
relative to the communication time, and the test suffers the mild non-scalability discussed earlier. A plot
of the timing results is presented in Fig. 26.

Fig. 26. Comparison of actual scaled efficiency to the ideal for the concentric spheres test.

Fig. 27. Comparison of actual efficiency with the ideal for fixed sized problem on the concentric spheres test.
Fig. 27 shows the results of a fixed size scaling test. As in the other tests, the problem size previously used
for 256 processors is again used for all processor counts of the fixed size test.
References
[1] A. Anderson et al., User's Manual for ALE3D, Lawrence Livermore National Laboratory, 2003.
[2] R. Couch, R. McCallen, I. Otero, R. Sharp, 3D Metal Forming Applications of ALE Techniques, Simulation of Materials
Processing: Theory, Methods, and Applications, Balkema, Rotterdam, 1995.
[3] A. Koniges (Ed.), Industrial Strength Parallel Computing: Programming Massively Parallel Systems, Morgan Kaufmann
Publishers Inc., Los Altos, CA, 1999.
[4] G. Karypis, V. Kumar, METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 1.0, University
of Minnesota, Department of Computer Science, 1995.
[5] D.J. Benson, J.O. Hallquist, A single surface contact algorithm for the post-buckling analysis of shell structures, Comput.
Methods Appl. Mech. Engrg. 78 (1990).
[6] C.G. Hoover, A.J. DeGroot, J.D. Maltby, R.D. Procassini, ParaDyn: DYNA3D for massively parallel computers, in:
Presentation at Tri-Laboratory Engineering Conference on Computational Modelling, October 1995.
[7] S.W. Attaway, B.A. Hendrickson, S.J. Plimpton, D.R. Gardner, C.T. Vaughan, K.H. Brown, M.W. Heinstein, A parallel
contact detection algorithm for transient solid dynamics simulations using PRONTO3D, Comput. Mech. 22 (1998).
[8] K. Brown, S. Attaway, S. Plimpton, B. Hendrickson, Parallel strategies for crash and impact simulations, Comput. Methods
Appl. Mech. Engrg. 184 (2000).
[9] M.J. Berger, S.H. Bokhari, A partitioning strategy for nonuniform problems on multiprocessors, IEEE Trans. Comput. C-36 (1987)
570–580.
[10] M.L. Wilkins, Calculation of elastic–plastic flow, Lawrence Livermore National Laboratory, UCRL-7322, rev 1, 1969.
[11] T.G. Pierce, A parallel algorithm for contact in a finite element hydrocode, Lawrence Livermore National Laboratory, UCRL-
LR-154063.