Chapter 21

PARALLEL EARTHQUAKE SIMULATIONS ON LARGE-SCALE MULTICORE SUPERCOMPUTERS

Xingfu Wu
Department of Computer Science & Engineering, and Institute for Applied Mathematics and Computational Science, Texas A&M University, College Station, TX 77843, USA

Benchun Duan
Department of Geology & Geophysics, Texas A&M University, College Station, TX 77843, USA

Valerie Taylor
Department of Computer Science & Engineering, Texas A&M University, College Station, TX 77843, USA

1. Overview of Earthquake Simulations

Earthquakes are among the most destructive natural hazards on our planet. Huge earthquakes striking offshore may cause devastating tsunamis, as evidenced by the 11 March 2011 Japan (moment magnitude Mw 9.0) and the 26 December 2004 Sumatra (Mw 9.1) earthquakes. Earthquake prediction (in terms of the precise time, place, and magnitude of a coming earthquake) is arguably unfeasible in the foreseeable future. To mitigate seismic hazards from future earthquakes in earthquake-prone areas, such as California and Japan, scientists have been using numerical simulations on ever-advancing modern computers over the past several decades to study earthquake rupture propagation along faults and seismic wave propagation in the surrounding media. In particular, ground motion simulations for past and possible future significant earthquakes have been performed to understand the factors that affect ground shaking in populated areas, and to provide ground shaking characteristics and synthetic seismograms for emergency preparation and the design of earthquake-resistant structures. These simulation results can guide the development of more rational seismic provisions, leading to safer, more efficient, and more economical structures in earthquake-prone regions.

1.1 Large-scale Ground Motion Simulations


Most earthquakes occur on tectonically active faults. A fault is a fracture in the Earth's crust or lithosphere along which one block of rock can slide past another. Although each of the two blocks moves with respect to the other due to plate tectonic processes, some areas of the fault may be locked by friction. Over decades, centuries, or even millennia, the shear (tangential) stress on these locked areas builds up, and elastic energy accumulates and is stored in the deformed rocks. When the shear stress exceeds the frictional strength somewhere along the fault surface, the fault breaks suddenly and slip (relative displacement of the two sides of the fault) occurs there. Under favorable conditions, the rupture propagates along the fault within seconds or minutes. During rupture propagation on the fault, the shear stress on the fault drops to a lower level, and the stored elastic energy is suddenly released. Some of the energy is radiated as seismic waves, which propagate within the Earth. When seismic waves arrive at the Earth's surface, they cause ground motion.

The governing equations for seismic wave propagation, and thus for ground motion simulations, are the equations of motion for a continuous medium. For a material volume V of a continuum with surface S, the equations of motion can be written as

ρ ü = ∇ · σ + ρ b,    (1)

where σ is the stress tensor, u is the displacement vector, b is the body force vector, ρ is density, and the double dots over u denote the second derivative in time (i.e., the acceleration). The first term on the right-hand side of Equation 1, the dot product of the operator ∇ with the stress tensor, is the divergence of the stress field, which is a vector. With boundary conditions (either prescribed traction or displacement) on the surface S and initial conditions of displacement and velocity in the volume V, Equation 1 governs wave propagation in the medium. Given a specific type of continuum (e.g., elastic or viscoelastic), a constitutive law that specifies how stress relates to strain (or strain rate) can be substituted into Equation 1.
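As a concrete illustration of this substitution (a standard special case, added here for reference rather than taken from the original text): for an isotropic, linear elastic medium the constitutive law is Hooke's law, and for a homogeneous medium Equation 1 then reduces to the familiar elastic (Navier) wave equation. In LaTeX notation,

    \sigma = \lambda \, \mathrm{tr}(\varepsilon)\, I + 2 \mu \, \varepsilon ,
    \qquad
    \varepsilon = \tfrac{1}{2}\left( \nabla u + (\nabla u)^{T} \right) ,

    \rho \, \ddot{u} = (\lambda + \mu)\, \nabla (\nabla \cdot u) + \mu \, \nabla^{2} u + \rho \, b ,

where λ and μ are the Lamé parameters and ε is the infinitesimal strain tensor.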

Earthquake sources in ground motion simulations are commonly characterized by kinematic models rather than dynamic models. A kinematic source model specifies the slip distribution on the fault and the temporal evolution of slip at a given point on the fault, without considering the driving forces that cause them. In contrast, a dynamic source model specifies initial stress conditions on the fault, and slip evolution and distribution are part of the solution. In particular, in a spontaneous rupture model (the type of dynamic source model we refer to hereafter in this chapter), rupture propagation is governed by a failure criterion (e.g., Mohr-Coulomb) and a friction law that specifies how frictional strength varies with slip, slip rate, and/or state variables during fault slipping. Rupture propagation and its radiated wave field are decoupled in a kinematic source model, while they are coupled in a dynamic source model, which is more realistic.

Numerical methods are needed for ground motion simulations with realistic geologic structures. Commonly used numerical methods in ground motion simulations are the finite difference method (FDM) and the finite element method (FEM). Other numerical methods in use include the boundary element method (BEM) and the spectral element method (SEM). Generally speaking, FDM is simpler and easier to implement in computer codes than FEM. Thus, FDM is widely used and is the most popular method in ground motion simulations. A comprehensive review of FDM and FEM with applications to seismic wave propagation and earthquake ground motion is given in [29].

Large-scale ground motion simulations with parallel computing have been performed in some earthquake-prone areas, particularly in Southern California. During the past decade, a group of researchers working on large-scale simulations has collaborated in building a modeling community within the Southern California Earthquake Center (SCEC). Ground motion prediction for possible scenario earthquakes on the San Andreas Fault (SAF) is one of the research activities in this community. These large-scale 3D ground motion simulations are very challenging: they are not only computationally intensive but also data intensive. Capturing the higher frequencies of ground motion that are of engineering interest (i.e., up to tens of Hertz) requires enormous computational resources. Each factor of two improvement in frequency resolution roughly requires an increase in spatial grid (mesh) size by a factor of eight and an increase in the number of simulation time steps by a factor of two (halving the grid spacing multiplies the number of grid cells by 2^3 = 8, and the numerical stability condition then also halves the time step), for a total increase of 16 in computational resources. Meanwhile, the size of the required input data (e.g., rock properties) and output data increases dramatically, imposing significant challenges for I/O. Thus, some of these simulations were performed on the largest supercomputers available at the time, such as TeraGrid.

A group of researchers at San Diego State University (SDSU) and the San Diego Supercomputer Center (SDSC) used an FDM approach to perform a series of large-scale ground motion simulations in Southern California for scenario earthquakes on the southern SAF, called TeraShake. The code used in TeraShake is the AWM (Anelastic Wave Model), which solves the 3D velocity-stress wave equation by a staggered-grid FDM with fourth-order spatial accuracy and second-order temporal accuracy. Anelastic wave propagation effects are accounted for by a coarse-grained implementation of the memory variables for a constant-Q solid [8], and Perfectly Matched Layer (PML) absorbing boundary conditions on the artificial model boundaries are implemented in the code [28]. The TeraShake1 calculations [36] simulate 4 minutes of 0-0.5 Hz ground motion in a 180,000 km2 area of southern California, for Mw 7.7 scenario earthquakes along the 200 km long section of the SAF between Cajon Creek and Bombay Beach at the Salton Sea. The source models in TeraShake1 are kinematic source models modified from the 2002 Mw 7.9 Denali, Alaska, earthquake. The SCEC Community Velocity Model (CVM) [27, 25] Version 3.0 is used for the material properties in the models. The main scientific findings from TeraShake1 include 1) the chain of sedimentary basins between San Bernardino and downtown Los Angeles forms an effective waveguide that produces high long-period ground motions over much of the greater Los Angeles region, and 2) northwestward rupture propagation is much more efficient in exciting these waveguide effects than southeastward rupture propagation.


The model domain in TeraShake1 is 600 km (NW) by 300 km (NE) by 80 km (depth). With a grid spacing of 200 meters, the volume is divided into 1.8 billion cubes. The simulations were performed on the 10-teraflops IBM Power4+ DataStar supercomputer at SDSC, using 240 processors and up to 19,000 CPU hours. Each scenario took about 24 hours of wall clock time for four minutes of wave propagation.

The TeraShake2 simulations [37] use a more complex source derived from spontaneous rupture models with small-scale stress-drop heterogeneity on a scale consistent with inferences from models of the 1992 Landers earthquake. These simulations predict a similar spatial pattern of peak ground velocity (PGV), but with the PGV extremes decreased by factors of 2-3 relative to TeraShake1, due to a less coherent wavefield radiated from the more complex source. The AWM code can also perform spontaneous rupture modeling, but it is limited to planar fault surfaces aligned with Cartesian coordinate planes normal to the free surface (vertical fault planes), which is common for standard FDMs. Thus, Olsen et al. [37] use a two-step approximate procedure to perform these simulations. In the first step, they perform spontaneous rupture modeling for a simplified planar fault geometry. The second step is essentially a separate kinematic simulation, using as a source the space-time history of fault slip from the first step, mapping the latter onto the five-segment SAF geometry. They use the same DataStar system as in TeraShake1. The first step, dynamic rupture simulations at high resolution (a grid spacing of 100 m), took 36,000 CPU hours using 1024 processors, and the second step of wave propagation runs with a grid spacing of 200 m took 14,000 CPU hours on 240 processors. Cui et al. [5] present detailed discussions of the optimization of the AWM code, the I/O handling and initialization, the scalability of the code up to 40k processors on the Blue Gene/L machine at the IBM TJ Watson Research Center, and the challenges of data archiving and management.

Several groups of researchers performed ground motion simulations for the great Southern California ShakeOut exercise [23]. The ShakeOut is a hypothetical seismic event of Mw 7.8 developed by a multidisciplinary group from the U.S. Geological Survey (USGS), with the collaboration of SCEC and the California Geological Survey, to improve public awareness and readiness for the next great earthquake along the southern SAF. Graves et al. [17] simulate broadband ground motion for the scenario earthquake with a kinematic source description [21] and examine ground motion sensitivity to rupture speed. Olsen et al. [38] simulate ground motion from an ensemble of seven spontaneous rupture models of Mw 7.8 northwest-propagating earthquakes on the southern SAF. They found a difference in ground motion extremes between kinematic and dynamic source models similar to that between TeraShake1 and TeraShake2, attributable to a less coherent wavefield excited by the complex rupture paths of the dynamic sources. Bielak et al. [3] present verification of the ShakeOut ground motion simulations with a kinematic source description by three groups using three independently developed codes. One group, CMU/PSC, uses an FEM approach known as Hercules [43] with an octree-based mesher.


The other two groups, SDSU/SDSC and URS/USC, use a staggered-grid FDM approach [36, 37, 17]. All three codes can run on parallel computers. They find that the results are in good agreement with one another, with small discrepancies attributed to inherent characteristics of the various numerical methods and their implementations.

The most recent large-scale ground motion simulation in southern California, called "M8", was performed on the Jaguar Cray XT5 at the National Center for Computational Sciences, using 223,074 cores with a sustained 220 Tflop/s for 24 hours, by Cui et al. [6]. M8 uses 436 billion cubes of 40 m edge length to represent the 3D lithosphere structure (the SCEC CVM Version 4.0) in a volume of 810 km by 405 km by 85 km, providing ground motion synthetics in southern California at frequencies up to 2 Hz. The code used in M8, called AWP-ODC, is a highly scalable, parallel version of the AWM with additional components, in particular for data-intensive I/O treatments (e.g., MPI-IO). The same two-step procedure as in TeraShake2 is used in M8 to account for source complexities revealed by spontaneous rupture models.

The TeraShake and ShakeOut simulations above [37, 38] have shown that dynamic source models with small-scale stress-drop heterogeneity predict more realistic ground motion extremes than the kinematic source models commonly used in ground motion simulations. More importantly, dynamic rupture models provide a means for scientists to explore the physical processes that control earthquake rupture propagation, and thus earthquake sizes and rupture paths in realistically complex fault systems, which are important inputs for seismic hazard analysis in earthquake-prone regions.

1.2 Dynamic Rupture Simulations

As mentioned in the above section, dynamic rupture models specify initial stress conditions on the fault and invoke a failure criterion and a friction law on the fault to solve for rupture propagation. Thus, rupture propagation in these dynamic rupture models is spontaneous (hence the term spontaneous rupture models) and obeys physical laws, including the failure criterion, the friction law, and the principles of continuum mechanics. In addition, rupture propagation on the fault is coupled with wave propagation in the surrounding medium through the stress field. For example, waves reflected from layer boundaries and the Earth's free surface may alter the stress state on the fault, thus affecting the rupture propagation that radiates seismic waves. Thus, spontaneous rupture models have become increasingly important in earthquake physics studies and ground motion simulations.

Two friction laws have been widely used in spontaneous rupture models. One is the slip-weakening friction law [22, 1, 7], in which the frictional coefficient on the fault plane drops linearly from a static value μs to a dynamic (sliding) value μd over a critical slip-weakening distance D0.
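In the notation just introduced, the linear slip-weakening law takes the standard form (written here for reference, with ℓ denoting the slip accumulated at a point on the fault; the frictional strength is this coefficient times the normal stress acting on the fault):

    \mu(\ell) =
    \begin{cases}
      \mu_s - (\mu_s - \mu_d)\, \ell / D_0 , & \ell < D_0 , \\
      \mu_d , & \ell \ge D_0 .
    \end{cases}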

The other is the rate- and state-dependent friction law derived from laboratory experiments of rock friction [13, 39], in which the frictional coefficient is a function of slip velocity and state variables.

Because of the involvement of fault friction, there are no analytical solutions for spontaneous rupture problems, and numerical methods are required. Among the earliest spontaneous rupture models were those constructed by Andrews [1] in 2D and Day [7] in 3D using the FDM approach. The former examined the variability of rupture speed on an idealized 2D fault and predicted a regime of supershear rupture that was later observed in natural earthquakes. The latter explored the effects of nonuniform prestress on rupture propagation. The standard FDM is limited to simulating spontaneous rupture propagation on vertical planar fault planes. FEM, BIEM (boundary integral equation method), and SEM have also been used in spontaneous rupture simulations. For example, Oglesby et al. [35, 34, 33] used an FEM approach to study effects of dipping fault geometry on rupture dynamics and ground motion. Aochi and Fukuyama [2] used a BIEM to study spontaneous rupture propagation of the 1992 Landers earthquake along a non-planar strike-slip fault system. Kame et al. [24] studied effects of prestress state and rupture velocity on dynamic fault branching using a BIEM approach.

One challenging issue in spontaneous rupture simulations is the verification and validation of computer codes implemented by different researchers based on the above methods. Verification refers to the comparison of results from different codes on an identical problem, while validation generally means comparison of simulation results against ground motion recordings from natural events and involves validation of not only the source process but also the path effect (including the velocity structure and the local site condition). A broad, rigorous community-wide exercise on verification of dynamic rupture codes has been underway in the SCEC/USGS community [19]. This exercise compares computer codes for rupture dynamics used by SCEC and USGS researchers to verify that these codes are functioning as expected for studying earthquake source physics and ground motion simulations [20]. More than 15 computer codes have been involved in the exercise, and results of some benchmark problems from some of these codes are publicly accessible on the web site, http://scecdata.usc.edu/cvws.

As analyzed by Day et al. [9], 3D spontaneous rupture simulations can be quite challenging in terms of required memory and processor power, because of the spatial resolution requirement for the cohesive zone at the rupture tip. Very few codes in the community can perform large-scale spontaneous rupture simulations on hundreds to thousands of processors, which are needed to construct spontaneous rupture models of large to huge earthquakes with reasonable resolution. The FDM code AWP-ODC used in the M8 simulations (with its earlier version AWM used in the TeraShake2 simulations) is, to our knowledge, one code that can run on thousands of processors. However, the code is limited to simulating spontaneous rupture propagation on a vertical planar fault. Most large to huge earthquakes occur on shallow-dipping thrust faults and often involve segmented faults with non-planar geometry, such as the 2004 Sumatra and the 2011 Japan earthquakes.


Duan and co-workers have been developing an explicit FEM code, EQdyna, to simulate spontaneous rupture propagation on geometrically complex faults and seismic wave propagation in an elastic or elastoplastic medium [14, 15, 10, 11]. The code has been verified in the SCEC/USGS dynamic code verification exercise on many benchmark problems. An OpenMP version of the code [44] was used to investigate the effects of prestress rotations along a shallow-dipping fault on rupture dynamics and near-field ground motion, motivated by relevant observations in the 2008 Wenchuan (China) Mw 7.9 earthquake [16]. Figure 1 shows snapshots of near-field ground motion from a spontaneous rupture model. In Figure 1, the black line is the trace of a shallow-dipping fault in the model, and the circle, triangle, plus, and cross signs denote the epicenter and the cities of Chengdu, Beichuan, and Wenchuan, respectively. The figure illustrates that the distribution of near-field ground velocity is strongly affected by the shallow-dipping fault geometry, with higher ground motion on the hanging wall side of the fault (below the black line in the figure).

We have been parallelizing EQdyna since 2008, aiming to perform large-scale spontaneous rupture and ground motion simulations for realistically complex fault systems and geologic structures. Based on what we learned from the OpenMP implementation [44], we developed an initial hybrid MPI/OpenMP implementation that took a 3D mesh as input, generated separately by a 3D mesh generator before the simulation execution [45]. In this chapter, we integrate the 3D mesh generator into the simulation and use MPI to parallelize it. We then illustrate an element-based partitioning scheme for explicit finite element methods, and evaluate its performance on quad- and hex-core Cray XT systems at Oak Ridge National Laboratory [31] using the SCEC benchmark TPV210. The experimental results indicate that the hybrid MPI/OpenMP implementation produces accurate output and scales well on these systems.


Figure 1. Snapshots of horizontal ground velocity from a simplified dynamic model of the 2008 Ms 8.0 Wenchuan earthquake

1.3 Earthquake Simulations and Data-Intensive Computing

The main input to large-scale earthquake simulations is large datasets that describe geologic structures (e.g., faults and the slip or stress on them) and rock properties (e.g., seismic velocities). These datasets drive a simulation pipeline consisting of a mesher, a solver, and a visualizer [42, 43, 26]. Generally, a mesh is generated to model the properties and geometry of an earthquake region. Then, a solver takes the mesh as input to conduct the numerical computation. The numerical results produced by the solver are correlated with the mesh structure by a visualizer to create earthquake ground motion images or animations. As earthquake simulations target hundred-million to multi-billion element problems, significant performance bottlenecks remain in storing, transferring, and reading/writing multi-terabyte to petabyte files between these components. In particular, I/O of such files remains a pervasive performance bottleneck on large-scale multicore supercomputers.

Generally speaking, large-scale earthquake simulations are data-intensive simulations; the TeraShake simulations [5], for example, produced more than 40 terabytes of data. These simulations revealed new insights into large-scale patterns of earthquake ground motion, including where the most intense impacts may occur in Southern California's sediment-filled basins during a magnitude 7.7 southern San Andreas Fault earthquake. Parallel I/O techniques such as MPI-IO, the I/O extension of the standardized MPI library [18], are used to overcome the I/O performance bottlenecks that occur in large-scale earthquake simulations by increasing the overall I/O bandwidth via using more disks in parallel, decreasing the I/O latency via reducing the number of disk accesses, and/or overlapping computation, communication, and I/O operations [4]. For the post-processing visualizer component in earthquake ground motion simulations, there are visualization challenges such as huge output data, time-varying data, unstructured meshes, multiple variables, and vector and displacement fields [26]. In this chapter, we focus on the earthquake simulation components mesher and solver.
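To make the MPI-IO idea concrete, the following minimal C sketch (not taken from any of the codes discussed here; the file name, data size, and layout are invented for the example) has every MPI process write its own contiguous block of a ground motion snapshot to one shared file with a single collective call, so the I/O library can aggregate the requests and spread them over many disks in parallel:

#include <mpi.h>
#include <stdlib.h>

/* Minimal MPI-IO sketch: every rank writes its local block of a snapshot
 * array to one shared file with a collective call, so the I/O library can
 * aggregate requests and stripe them over many disks. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const long long nlocal = 1000000;        /* values owned by this rank (example size) */
    double *snapshot = malloc(nlocal * sizeof(double));
    for (long long i = 0; i < nlocal; i++)   /* stand-in for simulation output */
        snapshot[i] = 0.0;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "snapshot.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at an offset determined by its rank: one collective
     * write replaces many small independent writes. */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, snapshot, (int)nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(snapshot);
    MPI_Finalize();
    return 0;
}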

The remainder of this chapter is organized as follows. Section 2 illustrates an element-based partitioning scheme and discusses our hybrid MPI/OpenMP parallel finite element earthquake rupture simulation in detail. Section 3 describes the architecture and memory hierarchy of the quad- and hex-core Cray XT systems used in our experiments. Section 4 discusses the benchmark problem TPV210 and verifies our simulation results. Section 5 evaluates and explores the performance characteristics of our hybrid MPI/OpenMP implementation and presents the experimental results. Section 6 concludes this chapter.

2. Hybrid MPI/OpenMP Parallel Finite Element Earthquake Simulations

In the finite element method, the data dependence is much more irregular than in the finite difference method, so it is generally more difficult to parallelize. Ding and Ferraro [12] discussed node-based and element-based partitioning strategies, found that the main advantage of the element-based partitioning strategy over the node-based strategy was its modular programming approach to the development of parallel applications, and developed an element-based concurrent partitioner for partitioning unstructured finite element meshes on distributed memory architectures. Tu et al. [43] parallelized an octree-based finite element simulation of earthquake ground motion to demonstrate the ability of their end-to-end approach to overcome the scalability bottlenecks of the traditional approach. Mahinthakumar and Saied [30] presented a hybrid implementation of an implicit finite element code developed for groundwater transport simulations, based on an original MPI code using a domain decomposition strategy, adding OpenMP directives to the code to use multiple threads within each MPI process on SMP clusters. Nakajima [32] presented a parallel iterative method in GeoFEM for the finite element method, which was node-based with overlapping elements, on the Earth Simulator, and explored a three-level hybrid parallel programming model: message passing (MPI) for inter-SMP-node communication, loop directives with OpenMP for intra-SMP-node parallelization, and vectorization for each processing element.

In this section, based on what we learned from our previous work [44, 45], we integrate a 3D mesh generator into the simulation and use MPI to parallelize it, illustrate an element-based partitioning scheme for explicit finite element methods, and discuss how to use hybrid MPI/OpenMP implementations efficiently in the earthquake simulations, not only to achieve multiple levels of parallelism but also to reduce the MPI communication overhead within a multicore node, by taking advantage of the globally shared address space and the on-chip high inter-core bandwidth and low inter-core latency of large-scale multicore systems.

2.1 Mesh Generation and Model Domain Partitioning

In our previous work [44, 45], we developed an initial hybrid MPI/OpenMP implementation of the sequential earthquake simulation code EQdyna that took a 3D mesh as input, generated separately by a 3D mesh generator before the simulation execution. As we discussed in that work, the earthquake simulation code is memory bound: as the number of elements increases, the system memory required to store the large arrays associated with the entire model domain increases dramatically. To overcome this limitation, in this chapter we integrate the 3D mesh generator into the simulation and use MPI to parallelize it.


To parallelize the 3D mesh generator, based on the number of MPI processes used, we partition the entire model domain by the coordinate along fault strike (e.g., the x-coordinate in a Cartesian coordinate system), as shown in Figure 2, so that we can define small arrays for each MPI process independently. Figure 2 gives a schematic diagram of the 3D mesh partitioning. Thus, the memory required for the large arrays that were associated with the entire model domain in a previous version of the code [45] decreases significantly.

Figure 2. Schematic diagram to show mesh and model domain partitioning
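As a small illustration of this one-dimensional along-strike decomposition (a sketch only, with our own variable names rather than those of the actual mesh generator), each MPI rank can compute which contiguous block of node planes along the x-coordinate it owns from the global node count nxt:

#include <mpi.h>
#include <stdio.h>

/* Sketch of the 1D along-strike domain decomposition: the nxt node planes
 * along the x-coordinate are divided as evenly as possible among the MPI
 * ranks, so each rank generates only its own sub-mesh. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int nxt = 829;               /* node planes along x (the 50 m TPV210 mesh in Table 2) */
    int base  = nxt / nprocs;          /* minimum planes per rank */
    int extra = nxt % nprocs;          /* the first 'extra' ranks get one more plane */

    int nx_local = base + (rank < extra ? 1 : 0);
    int x_start  = rank * base + (rank < extra ? rank : extra);
    int x_end    = x_start + nx_local - 1;

    /* Ranks that meet at x_start / x_end share boundary nodes and will
     * exchange nodal forces there (Section 2.3). */
    printf("rank %d owns x planes [%d, %d]\n", rank, x_start, x_end);

    MPI_Finalize();
    return 0;
}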

To facilitate message passing between adjacent MPI processes, based on the partitioning of the entire model domain by the coordinate along fault strike, we create a sub-mesh for each MPI process during the mesh generation step and record the shared boundary nodes between two adjacent MPI processes. This converts reading a large initial input mesh into computing and generating small mesh data for each MPI process. Note that, in this partitioning scheme, the maximum number of MPI processes that can be used is bounded by the total number of nodes along the x-coordinate.

2.2 Element-based Partitioning

In the FEM, elements are usually triangles or quadrilaterals in two dimensions, or tetrahedra or hexahedral bricks in three dimensions. In our explicit finite element earthquake simulation, we primarily use trilinear hexahedral elements to discretize a 3D model for computational efficiency, with wedge-shaped elements along the fault to characterize the dipping fault geometry, as illustrated in Figure 2.


We use a large buffer region with increasingly coarser element sizes away from the fault to prevent reflections from the artificial model boundaries from contaminating the phenomena being examined.

Figure 3. 2D geometry for EQdyna: 12 elements (boxes), each with four nodes (circles)

Figure 4. Element-based Partitioning Scheme

For simplicity, we discuss our partitioning scheme using the hypothetical 2D mesh shown in Figure 3, which has 12 elements (boxes), each with four adjacent nodes (circles). We propose an element-based partitioning scheme because the most time-consuming computation in the earthquake rupture simulation code is element-based.


Within one timestep, the element contributions (both internal force and hourglass force) to the nodal forces of an element's nodes are first calculated. Then, the contributions to a node's nodal force from all of its adjacent elements are assembled. For instance, the nodal force at node 1 only involves element 1, while the nodal force at node 5 involves elements 1, 2, 3, and 4; the nodal force at node 5 is the sum of the contributions from these four elements.

Figure 4 illustrates the element-based partitioning scheme for the finite element method, where the 2D domain is split into three components. In this scheme, we essentially partition the model domain based on element numbers. Each component consists of four elements and the nodes adjacent to them. A node that lies on the boundary between two components is called a boundary node. For example, nodes 7, 8 and 9 are the boundary nodes between the first two components, and nodes 13, 14 and 15 are the boundary nodes between the last two components. Updating the nodal force at a boundary node such as node 8 needs contributions from elements 3 and 4 in the first component and from elements 5 and 6 in the second component. This requires data exchange between the first two components. The element-based partitioning scheme extends naturally to large 3D datasets, and the method described in this section is applicable to more irregular meshes as well.

2.3 Hybrid Implementations

Multicore clusters provide a natural programming paradigm for hybrid programs. Generally, MPI is considered optimal for process-level coarse-grained parallelism, and OpenMP is optimal for loop-level fine-grained parallelism. Combining MPI and OpenMP parallelization to construct a hybrid program not only achieves multiple levels of parallelism but also reduces the MPI communication overhead, at the expense of introducing OpenMP overhead due to thread creation and increased memory bandwidth contention. Therefore, we use hybrid MPI/OpenMP to parallelize the finite element code, exploiting parallelism within a node (OpenMP) and parallelism between nodes (MPI), so that the parallel earthquake simulation can run on most supercomputers. Note that, in the hybrid MPI/OpenMP implementations, we separate MPI regions from OpenMP regions, and OpenMP threads do not call MPI subroutines.

Figure 5 shows the parallelism at the MPI and OpenMP levels within one timestep for the hybrid implementation of the earthquake simulation. As discussed in the previous section, using the element-based partitioning scheme we partition the 2D mesh geometry into three components and dispatch each component to an MPI process for MPI-level parallelism, so each MPI process is in charge of four elements and the nodes adjacent to them. Because the earthquake simulation is memory-bound, each MPI process is created on a different node, as illustrated in Figure 5: MPI process 0 runs on Node 0, process 1 on Node 1, and process 2 on Node 2.


On each node, OpenMP-level parallelism is achieved by applying the element-based partitioning scheme with OpenMP: each MPI process (the master thread) forks several threads to take advantage of the shared address space and the on-chip high inter-core bandwidth and low inter-core latency of the node, as sketched below.
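To show the structure this implies, here is a minimal C/OpenMP sketch of the element loop of one timestep on a single MPI process (invented array names and a single force component per node for brevity; EQdyna's actual force computation is not reproduced). Each thread processes a chunk of elements, and the scatter into nodal forces shared by adjacent elements is protected with an atomic update:

#include <omp.h>

#define NODES_PER_ELEM 4   /* 4 in the 2D example of Figure 3; 8 for hexahedra */

/* One timestep of the element loop on a single MPI process (sketch only).
 * nelem       : number of local elements
 * conn        : conn[e][a] = node index of local node a of element e
 * nodal_force : accumulated nodal forces, zeroed before the loop
 * The per-element force computation is a placeholder for the internal and
 * hourglass force calculations in the real code. */
void element_loop(int nelem, const int (*conn)[NODES_PER_ELEM],
                  double *nodal_force)
{
    #pragma omp parallel for schedule(static)
    for (int e = 0; e < nelem; e++) {
        double fe[NODES_PER_ELEM];

        /* Placeholder: compute this element's contribution to its nodes. */
        for (int a = 0; a < NODES_PER_ELEM; a++)
            fe[a] = 0.0;

        /* Assemble into the shared nodal-force array. Adjacent elements share
         * nodes, so the update must be protected. */
        for (int a = 0; a < NODES_PER_ELEM; a++) {
            int n = conn[e][a];
            #pragma omp atomic
            nodal_force[n] += fe[a];
        }
    }
}

Atomic updates are the simplest way to avoid write conflicts at shared nodes; mesh coloring or per-thread accumulation buffers are common alternatives when atomics become a bottleneck.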

Figure 5. Parallelism at MPI and OpenMP levels within one timestep

To update the nodal forces at the boundary nodes, data must be exchanged between two adjacent MPI processes via message passing. For each boundary node, such as node 7 shown in Figure 5, to update its nodal force at the end of each timestep we sum the nodal force at node 7 from process 0 and the nodal force at node 7 from process 1, and then use the sum to update the nodal force at node 7 on both processes 0 and 1. To update the nodal force at each boundary node at the end of each timestep, we use the following algorithm.

Algorithm: Update the nodal forces at boundary nodes:


Step 1: Partition the initial mesh based on the number of MPI processes to ensure load balancing, get the information about the shared boundary nodes between MPI processes i and i+1 from the mesh generator discussed in Section 2.1, and allocate a temporary array btmp holding the nodal forces at the shared boundary nodes,

Step 2: The MPI process i sends the array btmp to its neighbor process i+1 using MPI_Sendrecv,

Step 3: The MPI process i+1 receives the array from process i using MPI_Sendrecv. For each shared boundary node, it sums the nodal force from the array and the local nodal force at the shared boundary node, then assigns the summation to the nodal force at the shared node locally,

Step 4: The MPI process i+1 updates the array locally, and sends the updated array back to the MPI process i,

Step 5: The MPI process i receives the updated array, updates the nodal forces at the shared nodes locally, and deallocates the temporary array at the end of the timestep,

Step 6: Repeat the above Steps 1-5 for the next timestep.

The algorithm implements the straightforward data exchange illustrated in Figure 5, and it is efficient because it sends and receives small messages. This simplifies the programming effort and reduces the communication overhead. A sketch of the exchange in Steps 2-5 follows.
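A minimal C sketch of this exchange is shown below (our own variable and function names, not EQdyna's). For brevity it uses a symmetric variant: both neighbors swap their boundary forces in a single MPI_Sendrecv and add what they receive, which yields the same summed forces on both ranks as the send-sum-return sequence of Steps 2-5.

#include <mpi.h>

/* Reconcile nodal forces at the nb boundary nodes shared with one neighboring
 * rank (a sketch of Steps 2-5; each MPI process calls this once per existing
 * neighbor, with that neighbor's list of shared nodes).
 * neighbor       : rank of the adjacent MPI process (left or right along x)
 * boundary_nodes : local indices of the nodes shared with that neighbor
 * nodal_force    : this rank's nodal force array
 * btmp, brecv    : scratch buffers of length nb */
void exchange_boundary_forces(int neighbor, int nb,
                              const int *boundary_nodes,
                              double *nodal_force,
                              double *btmp, double *brecv)
{
    /* Pack the local forces at the shared nodes. */
    for (int k = 0; k < nb; k++)
        btmp[k] = nodal_force[boundary_nodes[k]];

    /* Swap boundary forces with the neighbor in one call (no deadlock,
     * since both sides post the send and the receive together). */
    MPI_Sendrecv(btmp,  nb, MPI_DOUBLE, neighbor, 0,
                 brecv, nb, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Both sides add the neighbor's contribution, so each shared node ends up
     * with the same assembled force on both ranks. */
    for (int k = 0; k < nb; k++)
        nodal_force[boundary_nodes[k]] += brecv[k];
}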

3. Experimental Platforms

Our hybrid parallel earthquake simulations have been tested on several systems [44, 45, 46]. In this chapter, we conduct our experiments on the Jaguar (Cray XT5 and XT4) supercomputers at Oak Ridge National Laboratory [31]. Table 1 shows their specifications and the compiler and options used for all experiments. All systems have private L1 and L2 caches per core and a shared L3 cache per socket. Jaguar is the primary system in the ORNL Leadership Computing Facility (OLCF). It consists of two partitions, XT5 and XT4, shown in Figure 6.

Table 1. Specifications of quad- and hex-core Cray XT Supercomputers

Configuration               JaguarPF (XT5)             Jaguar (XT4)
Total Cores                 224,256                    31,328
Total Nodes                 18,688                     7,832
Cores/Socket                6                          4
Cores/Node                  12                         4
CPU Type                    AMD 2.6GHz hex-core        AMD 2.1GHz quad-core
Memory/Node                 16GB                       8GB
L1 Cache/Core (private)     64KB                       64KB
L2 Cache/Core (private)     512KB                      512KB
L3 Cache/Socket (shared)    6MB                        2MB
Compiler                    ftn                        ftn
Compiler Options            -O3 -mp=nonuma -fastsse    -O3 -mp=nonuma -fastsse

Figure 6. Jaguar and JaguarPF System Architecture [31]

Figure 7. AMD hex-core Opteron chip architecture [31]

The Jaguar XT5 partition (JaguarPF) contains 18,688 compute nodes in addition to dedicated login/service nodes. Each compute node contains dual hex-core AMD Opteron 2435 (Istanbul, shown in Figure 7) processors running at 2.6 GHz, 16 GB of DDR2-800 memory, and a SeaStar2+ router.


The resulting partition contains 224,256 processing cores, 300 TB of memory, and a peak performance of 2.3 petaflop/s. The Jaguar XT4 partition (Jaguar) contains 7,832 compute nodes in addition to dedicated login/service nodes. Each compute node contains a quad-core AMD Opteron 1354 (Budapest) processor running at 2.1 GHz, 8 GB of DDR2-800 memory, and a SeaStar2 router. The resulting partition contains 31,328 processing cores, more than 62 TB of memory, over 600 TB of disk space, and a peak performance of 263 teraflop/s. The SeaStar2+ router (XT5 partition) has a peak bandwidth of 57.6 GB/s, while the SeaStar2 router (XT4 partition) has a peak bandwidth of 45.6 GB/s. The routers are connected in a 3D torus topology, which provides an interconnect with very high bandwidth, low latency, and extreme scalability.

4. Result Verification and Benchmark Problems

4.1 Benchmark Problem SCEC TPV210

To validate the hybrid MPI/OpenMP earthquake simulation code, we apply it to the SCEC/USGS benchmark problem TPV210, which is the convergence test of the benchmark problem TPV10 [19, 40]. In TPV10, a normal fault dipping at 60° (30 km long along strike and 15 km wide along dip) is embedded in a homogeneous half space. Pre-stresses are depth dependent, and the frictional properties are set to result in a subshear rupture. This benchmark problem is motivated by ground motion prediction at Yucca Mountain, Nevada, which is a potential high-level radioactive waste storage site [11, 20]. In TPV10, modelers are asked to run simulations with an element size of 100 meters on the fault surface. We refer to the edge length of the trilinear hexahedral elements near the fault as the element size in our study.

Table 2. Model parameters for TPV210

Element size              200 m          100 m          50 m           25 m
Total elements            6,116,160      24,651,088     98,985,744     419,554,200
Time step (s)             0.016          0.008          0.004          0.002
Termination time (s)      15             15             15             15
Required memory (GB)      ~6             ~24            ~94            ~380
nxt                       281            477            829            1483

In TPV210, we conduct a convergence test of the solution by simulating the same problem at a set of element sizes, i.e., 200 m, 100 m, 50 m, 25 m, and so on, where m stands for meters. Table 2 summarizes the model parameters for TPV210. The benchmark requires more memory with finer element sizes, because decreasing the element size increases the numbers of elements and nodes.

For example, at an element size of 50 m, the number of elements is approximately 100,000,000 and the memory requirement is around 94 GB. The simulation is memory-bound; the hybrid MPI/OpenMP parallel simulations discussed in Section 2 target this limitation to reduce the large memory requirements. Because the number of elements for a given discretization varies slightly with the number of MPI processes used, Table 2 gives only a rough estimate of this number. In the table, nxt is the number of nodes along the x-coordinate in a sequential simulation, which limits how many MPI processes one can use in a hybrid parallel simulation. For the sake of simplicity, we only use TPV210 with a 50 m element size as an example in this chapter.

4.2 Result Verification

Figure 8. Rupture time contours on the dipping fault plane for TPV210 with 50m element size


Figure 9. The dip-slip component of slip velocity at an on-fault station for TPV210 with 50m element size

Figure 10. The vertical component of particle velocity at an off-fault station for TPV210 with 50m element size

Figure 8 shows the rupture time contours (in seconds) on the 60° dipping fault plane. The red star denotes the hypocenter of the simulated earthquakes. Results from two simulations are plotted in the figure. One (black) is the result from a previous run with 50 m element size that was verified in the SCEC/USGS code validation exercise [40]. The other is the result from a run performed in this study on Jaguar XT5 with 50 m element size using 256 MPI processes with 12 OpenMP threads per MPI process. These two results essentially overlap, indicating that our current hybrid implementation gives accurate results.


Figures 9 and 10 compare the time histories of the dip-slip component of slip velocity at an on-fault station and the vertical component of particle velocity at an off-fault station from the two 50 m simulations discussed above. The locations of the stations are given in the figures. The result from the current hybrid implementation matches that from the verified run very well. This indicates that our hybrid MPI/OpenMP implementation is validated and produces accurate output for fault movement and ground shaking.

5. Performance Analysis

In this section, we analyze and compare the performance of the hybrid MPI/OpenMP finite element earthquake simulation on quad- and hex-core Cray XT systems. Note that TPN stands for Threads Per Node.

Figure 11. Function-level performance for TPV210 with 50m on Cray XT4


Figure 12. Relative Speedup for TPV210 with 50m on Cray XT4

Figure 11 presents the function-level performance of the hybrid MPI/OpenMP finite element earthquake simulation with 50 m element size on Cray XT4. There are seven main functions in the code: the functions Input and qdct2 are called once, and the functions updatedv, qdct3, hourglass, faulting and communication are within the main timestep loop. The function communication denotes MPI communication; the MPI communication overhead was measured on each master MPI process for all hybrid executions.


Figure 13. Function-level performance for TPV210 with 50m on Cray XT5

Figure 14. Relative Speedup for TPV210 with 50m on Cray XT5

Figure 12 shows the relative speedup for TPV210 with 50 m on Cray XT4, derived from the runtimes in Figure 11, where we take the relative speedup of the 1024-core execution to be 1024 and then calculate the relative speedup for up to 3200 cores. For 1024 cores, the hybrid execution on Cray XT4 uses 256 MPI processes with 1 MPI process per node and 4 OpenMP TPN. We observe that the hybrid execution on Cray XT4 scales well as the number of cores increases.

Figure 13 presents the function-level performance of the hybrid MPI/OpenMP finite element earthquake simulation with 50 m on Cray XT5. Figure 14 shows the relative speedup for TPV210 with 50 m on Cray XT5, derived from Figure 13, where we take the relative speedup of the 3072-core execution (256 nodes with 12 cores per node) to be 3072 and then calculate the relative speedup for up to 9600 cores. For 3072 cores, the hybrid execution on Cray XT5 uses 256 MPI processes with 1 MPI process per node and 12 OpenMP TPN. We observe that the hybrid execution on Cray XT4 has better scalability than that on Cray XT5. One reason is that, for strong-scaling scientific applications like our hybrid earthquake simulation, some parallelized loops become very small as the number of cores increases, which may cause more OpenMP overhead; the other reason is related to the memory subsystems and how efficiently they support OpenMP programming [46, 48].
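For reference, the relative speedup used here can be written as follows (our formulation of the normalization just described, with baseline core count p0 = 1024 on the XT4 and p0 = 3072 on the XT5):

    S(p) = p_0 \cdot \frac{T(p_0)}{T(p)} ,

where T(p) is the measured runtime on p cores; by construction S(p0) = p0, and S(p) = p corresponds to ideal linear scaling (the "Linear" curves in Figures 12 and 14).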

6. Conclusions

In this chapter, we reviewed large-scale ground motion and earthquake rupture simulations in the earthquake simulation community and discussed the relationships between data-intensive computing and earthquake simulations. We used a different approach that converts a data-intensive earthquake simulation into a computation-intensive one, significantly reducing I/O operations at the input stage: we integrated a 3D mesh generator into the simulation and used MPI to parallelize it. We illustrated an element-based partitioning scheme for explicit finite element methods. Based on this partitioning scheme and what we learned from our previous work, we implemented a hybrid MPI/OpenMP finite element earthquake simulation code that achieves multiple levels of parallelism. The experimental results demonstrated that the hybrid MPI/OpenMP code produces accurate output and scales well on the Cray XT4 and XT5 systems.

Because we partitioned the entire model domain by the coordinate along fault strike (e.g., the x-coordinate in a Cartesian coordinate system), the maximum number of MPI processes that can be used is bounded by the total number of nodes along the x-coordinate. This limits the scalability of the hybrid simulation. We also found that we could not use an arbitrary number of MPI processes for the hybrid execution, because load imbalance could cause large MPI communication overhead. For future work, we plan to further reduce the memory requirements of the hybrid simulation code by partitioning the entire model domain in the X-, Y- and Z-dimensions, and to consider the load balancing strategies discussed in [41] and the optimization strategies discussed in [47].

Acknowledgements

This work is supported by NSF grants CNS-0911023 and EAR-1015597, and by Award No. KUS-I1-010-01 made by King Abdullah University of Science and Technology (KAUST). The authors would like to acknowledge the National Center for Computational Sciences at Oak Ridge National Laboratory for the use of Jaguar and JaguarPF under the DOE INCITE project "Performance Evaluation and Analysis Consortium End Station".

References

1. D. J. Andrews, Rupture velocity of plane strain shear cracks, Journal of Geophysical Research, 81, 5679-5687, 1976.
2. H. Aochi and E. Fukuyama, Three-dimensional non-planar simulation of the 1992 Landers earthquake, Journal of Geophysical Research, 107(B2), 2035, doi:10.1029/2000JB000061, 2002.
3. J. Bielak, R. Graves, K. B. Olsen, et al., The ShakeOut Earthquake Scenario: Verification of Three Simulation Sets, Geophysical Journal International, 180(1), 375-404, 2010.
4. M. Cannataro, D. Talia, and P. K. Srimani, Parallel Data Intensive Computing in Scientific and Commercial Applications, Parallel Computing, 28, 2002.
5. Y. Cui, R. Moore, K. Olsen, et al., Toward Petascale Earthquake Simulations, Acta Geotechnica, DOI 10.1007/s11440-008-0055-2, 2008.
6. Y. Cui, K. B. Olsen, T. H. Jordan, et al., Scalable earthquake simulation on petascale supercomputers, SC10, 2010.
7. S. M. Day, Three-dimensional simulation of spontaneous rupture: The effect of nonuniform prestress, Bulletin of the Seismological Society of America, 72, 1881-1902, 1982.
8. S. M. Day and C. R. Bradley, Memory-efficient simulation of anelastic wave propagation, Bulletin of the Seismological Society of America, 91, 520-531, 2001.
9. S. M. Day, L. A. Dalguer, N. Lapusta, and Y. Liu, Comparison of finite difference and boundary integral solutions to three-dimensional spontaneous rupture, Journal of Geophysical Research, 110, B12307, doi:10.1029/2005JB003813, 2005.
10. B. Duan and S. M. Day, Inelastic Strain Distribution and Seismic Radiation From Rupture of a Fault Kink, Journal of Geophysical Research, 113, B12311, 2008.
11. B. Duan and S. M. Day, Sensitivity study of physical limits of ground motion at Yucca Mountain, Bulletin of the Seismological Society of America, 100(6), 2996-3019, 2010.
12. H. Ding and R. Ferraro, An Element-based Concurrent Partitioner for Unstructured Finite Element Meshes, IPPS'96, 1996.
13. J. H. Dieterich, Modeling of rock friction, 1. Experimental results and constitutive equations, Journal of Geophysical Research, 84, 2169-2175, 1979.
14. B. Duan and D. D. Oglesby, Heterogeneous Fault Stresses From Previous Earthquakes and the Effect on Dynamics of Parallel Strike-slip Faults, Journal of Geophysical Research, 111, B05309, 2006.
15. B. Duan and D. D. Oglesby, Nonuniform Prestress From Prior Earthquakes and the Effect on Dynamics of Branched Fault Systems, Journal of Geophysical Research, 112, B05308, 2007.
16. B. Duan, Role of initial stress rotations in rupture dynamics and ground motion: A case study with implications for the Wenchuan earthquake, Journal of Geophysical Research, 115, B05301, 2010.
17. R. W. Graves, B. Aagaard, K. Hudnut, L. Star, J. Stewart, and T. H. Jordan, Broadband simulations for Mw 7.8 southern San Andreas earthquakes: Ground motion sensitivity to rupture speed, Geophysical Research Letters, 35, L22302, 2008.
18. W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press, Cambridge, MA, 1999.
19. R. A. Harris, M. Barall, et al., The SCEC/USGS Dynamic Earthquake-rupture Code Verification Exercise, Seismological Research Letters, 80(1), 2009.
20. R. A. Harris, M. Barall, D. J. Andrews, et al., Verifying a computational method for predicting extreme ground motion, Seismological Research Letters, in press, 2011.
21. K. W. Hudnut, B. Aagaard, R. Graves, L. Jones, T. Jordan, L. Star, and J. Stewart, ShakeOut earthquake source description, surface faulting and ground motions, U.S. Geol. Surv. Open File Rep., 2008-1150, 2008.
22. Y. Ida, Cohesive force across the top of a longitudinal shear crack and Griffith's specific surface energy, Journal of Geophysical Research, 77, 3796-3805, 1972.
23. L. Jones, et al., The ShakeOut scenario, U.S. Geol. Surv. Open File Rep., 2008-1150, 2008.
24. N. Kame, J. R. Rice, and R. Dmowska, Effects of prestress state and rupture velocity on dynamic fault branching, Journal of Geophysical Research, 108(B5), 2265, 2003.
25. M. Kohler, H. Magistrale, and R. Clayton, Mantle heterogeneities and the SCEC three-dimensional seismic velocity model version 3, Bulletin of the Seismological Society of America, 93, 757-774, 2003.
26. K. Ma, A. Stompel, et al., Visualizing Very Large-Scale Earthquake Simulations, SC'03, November 15-21, 2003, Phoenix, Arizona, USA.
27. H. Magistrale, S. M. Day, R. W. Clayton, and R. W. Graves, The SCEC southern California reference three-dimensional seismic velocity model version 2, Bulletin of the Seismological Society of America, 90, S65-S76, 2000.
28. C. Marcinkovich and K. Olsen, On the implementation of perfectly matched layers in a three-dimensional fourth-order velocity-stress finite difference scheme, Journal of Geophysical Research, 108(B5), 2276, 2003.
29. P. Moczo, J. Kristek, M. Galis, et al., The finite-difference and finite-element modeling of seismic wave propagation and earthquake motion, Acta Phys. Slovaca, 57(2), 177-406, 2007.
30. G. Mahinthakumar and F. Saied, A Hybrid MPI-OpenMP Implementation of an Implicit Finite-Element Code on Parallel Architectures, International Journal of High Performance Computing Applications, 16(4), 2002.
31. NCCS Jaguar and JaguarPF, Oak Ridge National Laboratory, http://www.nccs.gov/computing-resources/jaguar/
32. K. Nakajima, OpenMP/MPI Hybrid vs. Flat MPI on the Earth Simulator: Parallel Iterative Solvers for Finite Element Method, ISHPC2003, LNCS 2858, 2003.
33. D. D. Oglesby, R. J. Archuleta, and S. B. Nielsen, The dynamics of dip-slip faults: Explorations in two dimensions, Journal of Geophysical Research, 105, 13643-13653, 2000.
34. D. D. Oglesby, R. J. Archuleta, and S. B. Nielsen, The three-dimensional dynamics of dipping faults, Bulletin of the Seismological Society of America, 90, 616-628, 2000.
35. D. D. Oglesby, R. J. Archuleta, and S. B. Nielsen, Earthquakes on dipping faults: the effects of broken symmetry, Science, 280, 1055-1059, 1998.
36. K. B. Olsen, S. M. Day, J. B. Minster, et al., Strong shaking in Los Angeles expected from southern San Andreas earthquake, Geophysical Research Letters, 33, 1-4, 2006.
37. K. B. Olsen, S. M. Day, J. B. Minster, et al., TeraShake2: Simulation of Mw 7.7 earthquakes on the southern San Andreas fault with spontaneous rupture description, Bulletin of the Seismological Society of America, 98, 1162-1185, 2008.
38. K. B. Olsen, S. M. Day, L. A. Dalguer, et al., ShakeOut-D: Ground motion estimates using an ensemble of large earthquakes on the southern San Andreas fault with spontaneous rupture propagation, Geophysical Research Letters, 36, L04303, 2009.
39. A. Ruina, Slip instability and state variable friction laws, Journal of Geophysical Research, 88, 10,359-10,370, 1983.
40. The SCEC/USGS Spontaneous Rupture Code Verification Project, http://scecdata.usc.edu/cvws.
41. V. Taylor, E. Schwabe, B. Holmer, and M. Hribar, Balancing Load versus Decreasing Communication: Parameterizing the Tradeoff, Journal of Parallel and Distributed Computing, 61, 567-580, 2001.
42. T. Tu, D. R. O'Hallaron, and O. Ghattas, Scalable Parallel Octree Meshing for Terascale Applications, Proceedings of the 2005 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC05), November 12-18, 2005, Seattle, Washington, USA.
43. T. Tu, H. Yu, L. Ramirez-Guzman, et al., From mesh generation to scientific visualization: an end-to-end approach to parallel supercomputing, Proceedings of the 2006 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC06), IEEE Computer Society, Tampa, Florida, 2006.
44. Xingfu Wu, Benchun Duan, and Valerie Taylor, An OpenMP Approach to Modeling Dynamic Earthquake Rupture Along Geometrically Complex Faults on CMP Systems, ICPP2009 SMECS Workshop, September 22-25, 2009, Vienna, Austria.
45. Xingfu Wu, Benchun Duan, and Valerie Taylor, Parallel simulations of dynamic earthquake rupture along geometrically complex faults on CMP systems, Journal of Algorithms and Computational Technology, 5(2), 313-340, 2011.
46. Xingfu Wu, Benchun Duan, and Valerie Taylor, Parallel Finite Element Earthquake Rupture Simulations on Quad- and Hex-core Cray XT Systems, the 53rd Cray User Group Conference (CUG2011), May 23-26, 2011, Fairbanks, Alaska.
47. Xingfu Wu, Valerie Taylor, Charles Lively, and Sameh Sharkawi, Performance Analysis and Optimization of Parallel Scientific Applications on CMP Clusters, Scalable Computing: Practice and Experience, 10(1), 2009.
48. Xingfu Wu and Valerie Taylor, Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Supercomputers, ACM SIGMETRICS Performance Evaluation Review, 38(4), March 2011.

Index terms (alphabetically):


Contour, Computation-intensive, Data-intensive computing, Dynamic rupture simulation, Earthquake, Element-based partitioning, Element size, Finite difference method, Finite element method, Ground motion simulation, Hybrid MPI/OpenMP, Mesher, Mesh generator, MPI, Multicore, Node-based partitioning, Parallel I/O, Parallel finite element method, Partitioning, OpenMP, Relative speedup, Rupture, Seismic wave propagation, Speedup, Spontaneous rupture propagation, Supercomputers, Verification, Validation, Visualizer