
Master's Dissertation
Structural Mechanics

FILIP JOHANSSON and FREDRIK HANSSON

IMPLEMENTING A COMPONENT BASED PARALLEL DISTRIBUTED FINITE ELEMENT SOLVER


Copyright © 2008 by Structural Mechanics, LTH, Sweden.
Printed by KFS i Lund AB, Lund, Sweden, March 2008.

For information, address:

Division of Structural Mechanics, LTH, Lund University, Box 118, SE-221 00 Lund, Sweden.
Homepage: http://www.byggmek.lth.se

Structural Mechanics, Department of Construction Sciences

Master’s Dissertation by

FILIP JOHANSSON and FREDRIK HANSSON

Supervisors:

Jonas Lindemann, PhD, LUNARC, Lund

IMPLEMENTING A COMPONENT BASED PARALLEL DISTRIBUTED FINITE ELEMENT SOLVER

ISRN LUTVDG/TVSM--08/5155--SE (1-89)
ISSN 0281-6679

Examiner:

Ola Dahlblom, Professor, Div. of Structural Mechanics


Abstract

Parallel computing is becoming more and more important in modern Finite Element software. As problems grow larger, computation on a single processor may not be fast enough. To overcome this problem, one can utilize parallel programming using e.g. SMP machines or clusters. Furthermore, when local compute resources are scarce, it can be quite convenient to take advantage of non-local resources for performing the calculations.

This thesis is a collaboration between the Division of Structural Mechanics, LTH and StruSoft AB in Malmö/Budapest. The aim is to address the issues above, specifically looking at the structural analysis program FEM-Design developed by StruSoft. There are several ways to parallelize code; the focus here is on OpenMP, PETSc and Intel MKL. These methods have been studied in order to conclude which one is the most suitable for existing Finite Element applications. In the end Intel MKL was chosen and implemented. Regarding the distributed computations, a realistic client/server application was developed using the Internet Communications Engine (Ice).

The parallel properties of the implementation were studied and, during a visit to the StruSoft Budapest branch, the implementation was integrated into FEM-Design. The results were astonishing, reaching speedups of a factor up to 360 compared to the original solver. Also, the scaling was almost linear for the implemented solver.


Acknowledgements

First of all we would like to thank our supervisor Jonas Lindemann for his support during the thesis process and many interesting discussions. We would also like to thank all the people at StruSoft in Malmö and Budapest. Especially Arpad Tornyos for his technical support during our visit to Budapest, and Tamas Racz for his help with communicating with the people in Budapest and his overall involvement in the work. We also owe Hakan Hansson a great deal of thanks for making this thesis possible. Furthermore, we would like to thank Kent Persson at the Division of Structural Mechanics for inspiring us to use Intel MKL. Finally, we would like to thank the opponents Mikael Mansson and Magnus Tingne for their many useful comments on this report.


Contents

1 Introduction

2 Parallel programming
  2.1 OpenMP
    2.1.1 Programming model
    2.1.2 Usage
    2.1.3 Code examples
  2.2 PETSc
    2.2.1 Programming model
    2.2.2 Usage
    2.2.3 Code examples
  2.3 Intel MKL
    2.3.1 The PARDISO solver
    2.3.2 Storage format
    2.3.3 PARDISO usage
    2.3.4 Code example

3 Parallel Solver for Plane Stress
  3.1 Original Application
  3.2 OpenMP Implementation
    3.2.1 Original Serial Solver
    3.2.2 Parallelizing the Solver using OpenMP
    3.2.3 Results with OpenMP
  3.3 PARDISO Implementation
    3.3.1 Banded Solution
    3.3.2 Results with Banded Solution
    3.3.3 Renumbered Banded Solution
    3.3.4 Results with Renumbered Banded Solution
    3.3.5 Coordinate Format Solution
    3.3.6 Results with Coordinate Format Solution

4 Integrating the Parallel Solver into FEM-Design
  4.1 The Implementation
  4.2 Results

5 Distributed Programming Tools
  5.1 Ice Overview
  5.2 The Slice Definition Language
  5.3 Principles of Ice Communication

6 Distributed Interface for the Parallel Solver
  6.1 The Slice Definition
  6.2 Client Implementation
    6.2.1 Client Code
  6.3 Server Implementation
    6.3.1 Server Code

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work

A Client/Server code
  A.1 Slice definition
  A.2 Client code
  A.3 Server Code

B Parallel application code
  B.1 Main program
  B.2 File reading/writing
  B.3 FEM calculations
  B.4 Execution
  B.5 System solving


Chapter 1

Introduction

Parallel computing is becoming more and more important in modern Finite Element software. As problems grow larger, computation on a single processor may not be fast enough. To overcome this problem, one can utilize parallel programming using e.g. SMP machines or clusters. Furthermore, when local compute resources are scarce, it can be efficient to take advantage of non-local resources for performing the calculations. The latter is often referred to as distributed computations.

The focus is on parallelizing existing Finite Element codes. For this purpose some different techniques have been studied: OpenMP, PETSc and Intel MKL. Also, the possibility of distributing the calculations to non-local resources using e.g. the Internet Communications Engine (Ice) is investigated. This thesis is a collaboration between the Division of Structural Mechanics, LTH and StruSoft AB in Malmö/Budapest. The aim is to address the issues above, specifically looking at the structural analysis program FEM-Design developed by StruSoft.

The structure of the report is as follows: Chapter 2 gives a brief description of the tools for parallelization mentioned above. Chapter 3 describes the process of implementing the parallel solver. In chapter 4, the integration of the parallel solver into FEM-Design is discussed. Chapter 5 gives an overview of distributed programming tools, specifically focusing on Ice. Chapter 6 treats the implementation of the distributed interface for the parallel solver. In chapter 7, some conclusions from the present work are discussed and some suggestions for future work are given. Most of the code written in this project is available in appendices A and B.


Chapter 2

Parallel programming

In this section some approaches to parallel programming will be reviewed. The purpose of parallel programming is to divide tasks between different CPUs in order to reduce the computational time. When doing this, it is important how the problem is divided and how the different processes (threads) communicate and synchronize with each other. One of the most widely used tools for handling this communication is the Message Passing Interface, MPI for short. The advantages of parallel programming are twofold: more CPU power can be utilized, and the total amount of memory increases with more CPUs. The latter is very significant for larger problems. Symmetric multi processing (SMP) systems like dual/quad core machines are becoming more and more popular, but far from all programs actually use the parallel capacity of these systems. As the CPU clock speed curve is starting to level out, parallel programming is becoming more and more important.

In this thesis, three common approaches to parallel programming are studied, in the context of a finite element solver:

OpenMP [5] A set of compiler directives that can be directly inserted into existing code.

PETSc [15] A library for solving Partial Differential Equations using MPI.

Intel MKL [4] A library of threaded math routines based on OpenMP. It is highly optimized for Intel processors.

It should be noted that the techniques studied are applied to medium sized finite element problems. Larger finite element problems often require some kind of domain decomposition scheme in addition to the techniques described in this report.

2.1 OpenMP

OpenMP stands for Open specifications for Multi Processing. It is a collaborative effort between interested parties from the hardware and software industry, government and academia. Essentially, OpenMP is an API for parallelizing existing code, giving the programmer full control over the parallelization. It is comprised of three primary API components:


• Compiler directives

• Runtime library routines

• Environment variables.

OpenMP is mainly designed for symmetric multiprocessing (SMP) architectures, i.e. systems with multiple CPUs on a shared memory. OpenMP is available for Fortran (77, 90 and 95) and C/C++. It has been implemented on many platforms including most Unix platforms and Windows NT. A full description of the OpenMP specification can be found in [5].

2.1.1 Programming model

OpenMP uses a shared memory process consisting of multiple threads. The fork-join model of parallel execution is utilized, see figure 2.1.

Figure 2.1: The master thread creates a team of parallel threads.

The OpenMP program is initially executed as a single process by the master thread. When entering a parallel region, the master thread creates a team of parallel threads; this is called fork. The statements in the program that are enclosed by the parallel region are then executed in parallel among the threads in the team. When all threads have finished their work in the parallel region, they synchronize and terminate, leaving only the master thread; this is called join. In the parallel region, the OpenMP memory model dictates that all variables except loop variables are by default shared by the different threads. One can alter this by specifying variables as private or shared before entering the parallel region. The model also supports nested parallelism, i.e. it is possible to have parallel regions within other parallel regions.
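To make the fork-join behaviour concrete, the following minimal sketch (not taken from the thesis code) prints one line per thread inside a parallel region; the serial parts before and after are executed by the master thread only.

program fork_join_sketch
   use omp_lib
   implicit none

   ! Serial part: only the master thread executes this.
   write(*,*) 'before the parallel region'

   !$omp parallel
   ! Fork: a team of threads executes this block concurrently.
   write(*,*) 'hello from thread', omp_get_thread_num(), &
              ' of', omp_get_num_threads()
   !$omp end parallel

   ! Join: the team has disbanded, only the master thread continues.
   write(*,*) 'after the parallel region'
end program fork_join_sketch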

2.1.2 Usage

OpenMP is used by inserting special compiler directives or runtime library routine calls into the existing code. The compiler directives are inserted as special comments (beginning with !$OMP in Fortran 90). Examples of directives are

!$OMP PARALLEL / !$OMP END PARALLEL begin and end a parallel region

!$OMP DO / !$OMP END DO this is a work sharing construct. It divides the enclosed loop between the different threads.

Other constructs in OpenMP include:

• Static/dynamic scheduling of work-division constructs.

• The possibility to divide code into different parallel running sections.

• Synchronization tools like barriers and the possibility to specify sections of code within parallel regions to be executed serially.

• Tools to lock/unlock certain parts of data structures such as matrices/vectors.

and much more, but obviously, a full description of OpenMP cannot be given here. A nice guide to using OpenMP can be found in [6].

2.1.3 Code examples

In Listing 2.1, the basic usage of OpenMP is illustrated. This program is a simple example that allocates two large arrays and performs some simple calculations on them. First, x is initialized, which is done serially. The maximum number of threads, which should match the environment variable OMP_NUM_THREADS (set to the number of processors for best performance), is printed using the OpenMP routine omp_get_max_threads(). The program then enters the parallel region where a is calculated. Here, the work is divided equally between the threads, but it is also possible to divide the work in different proportions using the SCHEDULE clause.

Listing 2.1: test.f90

program main
   use omp_lib
   integer :: i, j
   integer(8) :: size = 5e7
   real(8) :: start_time, finish_time
   real(8), dimension(:), allocatable :: a, x

   allocate(a(size))
   allocate(x(size))

   do i=1, size
      x(i) = (i+2)/1000
   end do

   start_time = omp_get_wtime()

   write(*,'(A,I)') 'Maximum number of threads: ', &
      omp_get_max_threads()

   !$OMP PARALLEL do
   do i=1, size
      a(i) = sin(x(i)/2)
   end do
   !$OMP END PARALLEL do

   deallocate(a)
   deallocate(x)

   finish_time = omp_get_wtime()

   write(*,'(A,F)') 'Time: ', finish_time - start_time

end program main

In this example, the result of one calculation did not depend on the result from another iteration; in the following example this is not the case.

!$OMP PARALLEL do shared(a)
! Parallel section executed by all threads
do i=1, size-1
   a(i) = a(i+1)
end do
! All threads join master thread and disband
!$OMP END PARALLEL do

This is because a(i+1) is needed in its unmodified form for the correct result to be obtained; however, since the loop is now divided between the CPUs, we cannot be certain of this. The above situation is an example of a so called race condition, meaning that the result of the code depends on the thread scheduling and the speed of each processor. In some cases, there are workarounds to these problems. For example, one could use OpenMP directives to enforce synchronization between threads, or one could modify the loop itself. Of course, it is necessary to evaluate the cost in terms of time and memory in order to see if it is worth it or not.
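As an illustration of the second workaround (modifying the loop itself), the dependence can be removed by taking a snapshot of the array before the parallel loop. This is a minimal sketch and not taken from the thesis code; the array a_old is introduced here only for illustration, and the extra copy costs both time and memory, as noted above.

program race_fix_sketch
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real(8), allocatable :: a(:), a_old(:)

   allocate(a(n), a_old(n))
   a = 1.0d0

   ! Snapshot of a: no thread can read an element that another
   ! thread has already overwritten.
   a_old = a

   !$omp parallel do
   do i=1, n-1
      a(i) = a_old(i+1)
   end do
   !$omp end parallel do

   deallocate(a, a_old)
end program race_fix_sketch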

2.2 PETSc

The Portable, Extensible Toolkit for Scientific Computation [15] is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It employs the MPI standard for all message-passing communication. The original purpose of PETSc was to enable its users to easily experiment with many different models, solving methods and discretizations, and to eliminate the MPI from MPI programming. It is a freely available research code usable from C/C++, Fortran 77/90 and Python. PETSc has run problems with over 500 million unknowns, and it has run efficiently on 6,000 processors. It has also been used in applications running at over 2 Teraflops. An overview of some of the components of PETSc can be seen in figure 2.2. This figure illustrates the hierarchical structure of PETSc; an important feature of the package is the possibility to start at a high level and work your way down in level of abstraction. The essential components of PETSc are:

Vec Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures.

Mat A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems.

PC A collection of sequential and parallel preconditioners, including (sequential) ILU(k), LU, and (both sequential and parallel) block Jacobi, overlapping additive Schwarz methods and (through BlockSolve95) ILU(0) and ICC(0).

KSP Parallel implementations of many popular Krylov subspace iterative methods, including GMRES, CG, CGS, Bi-CG-Stab, two variants of TFQMR, CR, and LSQR. All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods.

SNES Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.

TS Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.

Figure 2.2: The hierarchical structure of the PETSc library, as described in the PETSc manual [3].

In figure 2.3 a closer look is taken at the parallel numerical parts of PETSc. As can be seen, PETSc offers a wide range of iterative Krylov subspace solvers and preconditioners. The package also offers the possibility to store system matrices in many different formats.


Figure 2.3: An overview of the numerical libraries of PETSc, as described in the PETSc manual [15].

2.2.1 Programming model

PETSc employs the distributed memory model. Each process has its own address space in memory and, when needed, information is passed to other processes via MPI. To be more specific: in e.g. a finite element problem, each process will own a contiguous subset of rows of the system matrix and will primarily work on this subset, sending/receiving information to/from other processes only when needed. How to perform this subdivision is of course a different subject, and for that reason PETSc has integrated support for reordering packages like ParMetis. If the already included packages do not provide the needed features, external packages can easily be integrated. This programming model makes PETSc especially suitable for clusters.

2.2.2 Usage

Unlike OpenMP, PETSc is not easily integrated into existing codes. PETSc is a set of library interfaces, with its own matrix and vector structures, assembly and solving methods, organized in an object oriented way. The user declares the system matrices and vectors, and then uses PETSc to assemble, precondition and solve the problem by invoking calls to PETSc methods. Throughout the process, there is ample possibility for the user to customize and control the solution process. It is also possible to supply most options at the command line, which makes experimenting with different methods very easy. An example of how PETSc can be used is shown in the next section.


2.2.3 Code examples

Section 1.4 of the PETSc manual [15] contains examples that illustrate the basic usage of PETSc. These examples are written in the C programming language, but using PETSc from Fortran code is not very different. The linear system is solved using KSP, the interface to the preconditioners, Krylov subspace methods and direct linear solvers of PETSc.

2.3 Intel MKL

The Intel Math Kernel Library, MKL, provides Fortran routines and functions that perform operations on vectors and matrices, including sparse matrices. It has both C and Fortran interfaces. It is highly optimized for Intel processors but also works well on other architectures. Features include:

• Linear Algebra in the form of BLAS/LAPACK and ScaLAPACK routines and sparse solvers.

• Fast Fourier Transforms.

• Vector math library.

• Vector random number generators.

• LINPACK benchmark packages.

These features cover most functions needed in a modern finite element solver. BLAS, FFT and vector math are threaded using OpenMP. Intel MKL also includes an efficient parallel solver, PARDISO, which is described in the next section.
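As a small illustration of the threaded BLAS layer (this sketch is not part of the thesis code), a standard dgemm call is all that is needed; MKL parallelizes the multiplication internally according to OMP_NUM_THREADS (or MKL_NUM_THREADS). The matrix size below is arbitrary.

program mkl_dgemm_sketch
   implicit none
   integer, parameter :: n = 1000
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)

   allocate(a(n,n), b(n,n), c(n,n))
   a = 1.0d0
   b = 2.0d0
   c = 0.0d0

   ! C = 1.0*A*B + 0.0*C; the threading happens inside the library,
   ! no OpenMP directives are needed in the calling code.
   call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

   write(*,*) 'c(1,1) = ', c(1,1)
end program mkl_dgemm_sketch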

2.3.1 The PARDISO solver

The PARDISO solver was originally developed by Olaf Schenk et al. at the University of Basel. The solver has been licensed to Intel, who in turn have optimized it for Intel CPUs (though it still runs efficiently on e.g. AMD architectures). It is a high-performance, robust, memory efficient and easy to use software package for solving large sparse symmetric and unsymmetric linear systems of equations on shared memory multiprocessors. Parallelization is achieved using OpenMP. The PARDISO solver uses a direct LU algorithm (of course handling multiple right hand sides), but it also implements an iterative conjugate gradient method. The behaviour of the latter is not studied in this project. As seen in figure 2.4, the solver supports a wide range of sparse matrices. It has both an out-of-core and an in-core version. The out-of-core version writes the factorized matrices to temporary files on the hard drive. The solver has four distinct phases:

1. Reordering and symbolic factorization.

2. Numeric factorization.

3. Back substitution.

4. Memory release.



Figure 2.4: Sparse matrices supported by PARDISO, as described in the Intel MKL manual [4].

The reordering consists of a permutation vector being calculated; also, during the first phase, an elimination tree is calculated for the next phase. PARDISO uses supernodes with a two-level dynamic scheduling method to divide work between threads during the factorization phase. This method is quite advanced and is illustrated by figure 2.5.

Figure 2.5: A mapping of a hypothetical elimination tree with the two-level scheduling over four processors, as described in the article [1] by Olaf Schenk and Klaus Gartner.

The idea is to have all processors working as much as possible, all the time. The first level consists of nodes which are independent; the second level is comprised of nodes which are not independent and therefore must be global to each process. The tasks in the first level are managed by a queue Qs and in the second by a queue Qr. Suppose now that one process is executing a task in the second level, and suppose furthermore that this process now has to wait for some other process to finish another task before finishing its work. In this situation, the CPU should not wait in an idle state, because that would inhibit the performance of the solver. In PARDISO, the process would in this situation put the node back in Qr, pick another node from the queue and try factorizing this one instead. The pseudo-code for the factorization in PARDISO can be seen in [1]. This is an example of how locks in OpenMP are used. The next phase of the PARDISO solution is back substitution; in this step PARDISO uses perturbed pivoting when encountering numerical problems, refining the solution during the back substitution to compensate for the "errors" in the factorized matrices. Information on the number of perturbations and iterative refinements can be queried using special functions. In the final phase of the solution, structures allocated by PARDISO, e.g. the factorized matrices, are freed.

2.3.2 Storage format

For sparse matrices, it is more efficient to store only the non-zero elements. This assumes that the sparsity is large, i.e. the number of non-zeros is a small percentage of the total number of elements. There are a number of storage schemes for sparse matrices. The basic technique is to compress the non-zero elements into a linear array and provide arrays to describe the location of the non-zeros in the original matrix K.

The PARDISO solver uses the Compressed Sparse Row (CSR) format for sparse matrices. This is a row major format, i.e. the compression of K into a linear array is done by walking across each row and storing each non-zero element in the order it appears in the walk. For symmetric matrices, only non-zero elements of the upper triangular half of the matrix are stored.

The CSR storage format consists of three arrays:

values This array contains the non-zero elements of K. The elements are stored in the order they appear when walking across the rows in K.

ia Element i of this array gives the index in the values array of the first non-zero element of row i in K. The number of non-zeros of the i-th row is equal to ia(i+1) - ia(i), since the non-zeros are stored consecutively. An additional entry is added to the end of ia in order for this relationship to hold for the last row in K. This makes the length of ia equal to the number of rows in K plus 1.

ja Element i of this array gives the number of the column corresponding to the element values(i).
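As a small worked example (constructed for this text, not taken from the thesis), consider a symmetric 4 x 4 matrix stored by its upper triangle; the three CSR arrays then look as follows.

program csr_sketch
   implicit none
   ! Upper triangle of the symmetric matrix
   !     | 4 1 0 2 |
   !     | 1 5 0 0 |
   ! K = | 0 0 6 3 |
   !     | 2 0 3 7 |
   real(8) :: values(7)
   integer :: ja(7), ia(5), i

   values = (/ 4d0, 1d0, 2d0, 5d0, 6d0, 3d0, 7d0 /)
   ja     = (/ 1, 2, 4, 2, 3, 4, 4 /)   ! column index of each stored value
   ia     = (/ 1, 4, 5, 7, 8 /)         ! row i starts at values(ia(i))

   ! ia(i+1) - ia(i) gives the number of stored non-zeros in row i.
   do i=1, 4
      write(*,*) 'row', i, 'has', ia(i+1)-ia(i), 'stored non-zeros'
   end do
end program csr_sketch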

2.3.3 PARDISO usage

PARDISO is used by calling subroutines from Fortran or C. The Fortran interface of PARDISO is

SUBROUTINE pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
                   perm, nrhs, iparm, msglvl, b, x, error)
INTEGER*8 pt(64)
integer maxfct, mnum, mtype, phase, n, nrhs, error, ia(*), ja(*), &
        perm(*), iparm(*)
real(8) a(*), b(n,nrhs), x(n,nrhs)

mtype This parameter defines the matrix type. The values of mtype supported by PARDISO are:
1: real and structurally symmetric matrix
2: real and symmetric positive definite matrix
-2: real and symmetric indefinite matrix
3: complex and structurally symmetric matrix
4: complex and Hermitian positive definite matrix
-4: complex and Hermitian indefinite matrix
6: complex and symmetric matrix
11: real and unsymmetric matrix
13: complex and unsymmetric matrix

iparm Array of dimension 64 used to pass various parameters to PARDISO and to return some useful information after the execution of the solver. If iparm(1) = 0, then PARDISO fills iparm with default values. There is no default value for iparm(3), the number of processors to use, and this value has to be supplied by the user.

phase The execution phase of the solver.

n The number of equations.

a Array containing the non-zeros.

ia Array containing starting indices of each row.

ja Array containing the column indices for the corresponding value in a.

perm This array, of dimension n, holds the permutation vector. If A is the original matrix and B = PAP^T the permuted matrix, then row i of A is the perm(i) row of B. The permutation vector is only accessed if iparm(5) = 1.

nrhs The number of right-hand sides that need to be solved for.

msglvl If msglvl = 0 then PARDISO generates no output; if msglvl = 1 the solver prints statistical information.

b This array of dimension (n, nrhs) contains the right hand side vector/matrix B. On output, it contains the solution if iparm(6) = 1.

x On output, it contains the solution if iparm(6) = 0.

This interface is given for 64-bit architectures. For 32-bit architectures, the argument pt(64) must be defined as INTEGER*4.

2.3.4 Code example

The code example in Listing 2.2 is taken from the MKL manual and shows an example of how to use the PARDISO solver. The program computes the solution of a sparse symmetric linear system Ax = b where

    A =
        |  7.0   0.0   1.0   0.0   0.0   2.0   7.0   0.0 |
        |  0.0  -4.0   8.0   0.0   2.0   0.0   0.0   0.0 |
        |  1.0   8.0   1.0   0.0   0.0   0.0   0.0   5.0 |
        |  0.0   0.0   0.0   7.0   0.0   0.0   9.0   0.0 |
        |  0.0   2.0   0.0   0.0   5.0   1.0   5.0   0.0 |
        |  2.0   0.0   0.0   0.0   1.0  -1.0   0.0   5.0 |
        |  7.0   0.0   0.0   9.0   5.0   0.0  11.0   0.0 |
        |  0.0   0.0   5.0   0.0   0.0   5.0   0.0   5.0 |

and

    b = ( 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 )^T

First, the system of equations is defined. In the next step the iparm array is initialized. The number of processors to be used is set to the number of processors on the machine for best performance. The solver is then run in three execution steps. In the first step the solver performs a fill-reduction analysis and symbolic factorization; the phase parameter is set to 11. In the second step the solver performs the numerical factorization; with the phase parameter set to 22, the solver performs only the numerical factorization and no solve. In the last step the system is solved. Finally the used memory is released.

Listing 2.2: mkl exempel.f90

!----------------------------------------------------------------------
! Example program to show the use of the "PARDISO" routine
! for symmetric linear systems
!----------------------------------------------------------------------
! This program can be downloaded from the following site:
! http://www.computational.unibas.ch/cs/scicomp
!
! (C) Olaf Schenk, Department of Computer Science,
! University of Basel, Switzerland.
! Email: olaf.schenk@unibas.ch
!----------------------------------------------------------------------
PROGRAM mkl_exempel

use omp_lib

IMPLICIT NONE

!.. Internal solver memory pointer for 64-bit architectures
!..   INTEGER*8 pt(64)
!.. Internal solver memory pointer for 32-bit architectures
!..   INTEGER*4 pt(64)
!.. This is OK in both cases
INTEGER*8 pt(64)
!.. All other variables
INTEGER maxfct, mnum, mtype, phase, n, nrhs, error, msglvl
INTEGER iparm(64)
INTEGER ia(9)
INTEGER ja(18)
REAL*8 a(18)
REAL*8 b(8)
REAL*8 x(8)

INTEGER i, idum
REAL*8 waltime1, waltime2, ddum

!.. Fill all arrays containing matrix data.
DATA n /8/, nrhs /1/, maxfct /1/, mnum /1/
DATA ia /1,5,8,10,12,15,17,18,19/
ja = (/ 1,3,6,7, 2,3,5, 3,8, 4,7, 5,6,7, 6,8, 7, 8 /)
a  = (/ 7.d0, 1.d0, 2.d0, 7.d0, -4.d0, 8.d0, 2.d0, 1.d0, &
        5.d0, 7.d0, 9.d0, 5.d0,  1.d0, 5.d0, -1.d0, 5.d0, 11.d0, 5.d0 /)

!..
!.. Set up PARDISO control parameters
!..
do i = 1, 64
   iparm(i) = 0
end do

iparm(1) = 1   ! no solver default
iparm(2) = 2   ! fill-in reordering from METIS
iparm(3) = omp_get_max_threads()   ! number of processors,
                                   ! value of OMP_NUM_THREADS
iparm(4) = 0   ! no iterative-direct algorithm
iparm(5) = 0   ! no user fill-in reducing permutation
iparm(6) = 0   ! =0: solution on the first n components of x
iparm(7) = 16  ! default logical fortran unit number for output
iparm(8) = 9   ! number of iterative refinement steps
iparm(9) = 0   ! not in use
iparm(10) = 13 ! perturb the pivot elements with 1E-13
iparm(11) = 1  ! use nonsymmetric permutation and scaling MPS
iparm(12) = 0  ! not in use
iparm(13) = 0  ! not in use
iparm(14) = 0  ! Output: number of perturbed pivots
iparm(15) = 0  ! not in use
iparm(16) = 0  ! not in use
iparm(17) = 0  ! not in use
iparm(18) = -1 ! Output: number of nonzeros in the factor LU
iparm(19) = -1 ! Output: Mflops for LU factorization
iparm(20) = 0  ! Output: number of CG iterations
error = 0      ! initialize error flag
msglvl = 0     ! do not print statistical information
mtype = -2     ! real symmetric indefinite matrix

!.. Initialize the internal solver memory pointer. This is only
!   necessary for the FIRST call of the PARDISO solver.
do i = 1, 64
   pt(i) = 0
end do

!.. Reordering and symbolic factorization. This step also allocates
!   all memory that is necessary for the factorization.
phase = 11     ! only reordering and symbolic factorization
CALL pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
             idum, nrhs, iparm, msglvl, ddum, ddum, error)
WRITE(*,*) 'Reordering completed ... '
IF (error .NE. 0) THEN
   WRITE(*,*) 'The following ERROR was detected: ', error
   STOP
END IF
WRITE(*,*) 'Number of nonzeros in factors  = ', iparm(18)
WRITE(*,*) 'Number of factorization MFLOPS = ', iparm(19)

!.. Factorization.
phase = 22     ! only factorization
CALL pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
             idum, nrhs, iparm, msglvl, ddum, ddum, error)
WRITE(*,*) 'Factorization completed ... '
IF (error .NE. 0) THEN
   WRITE(*,*) 'The following ERROR was detected: ', error
   STOP
ENDIF

!.. Back substitution and iterative refinement
iparm(8) = 2   ! max numbers of iterative refinement steps
phase = 33     ! only solution (back substitution and refinement)
do i = 1, n
   b(i) = 1.d0
end do
CALL pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
             idum, nrhs, iparm, msglvl, b, x, error)
WRITE(*,*) 'Solve completed ... '
WRITE(*,*) 'The solution of the system is '
DO i = 1, n
   WRITE(*,*) ' x(', i, ') = ', x(i)
END DO

!.. Termination and release of memory
phase = -1     ! release internal memory
CALL pardiso(pt, maxfct, mnum, mtype, phase, n, ddum, idum, idum, &
             idum, nrhs, iparm, msglvl, ddum, ddum, error)

END


Chapter 3

Parallel Solver for Plane Stress

This section describes the implementation of a parallel solver using the techniques described earlier. The three different approaches described earlier are evaluated with emphasis on the needs of a medium sized finite element software company. The simple 2D plane stress application used as the base for the implementation is also described. Finally, the scaling, speed and accuracy of the chosen method are evaluated in test examples.

3.1 Original Application

The implementation started with an existing application for 2D plane stress calculations using linear triangular elements. This application was comprised of a user interface written in Python, linked to a calculation engine written in Fortran. The calculation engine used a symmetric banded matrix format with a Gaussian solver. The Triangle mesh generator [7], written by Jonathan Shewchuk, was used to generate the finite element mesh in the test examples.

A dependency diagram of the different modules can be seen in figure 3.1.

Figure 3.1: Dependency diagram for the original solver.

The different modules are:


inout read problem definition from input files.

fem element matrices, assembly functions, format conversion.

solve solve the problem, uses the Pardiso solver.

stress allocates system matrices, controls the assembly process, calls the format conversion and the solve routine, and finally deallocates the system matrices.

main Allocate topology/coordinate matrices. Read problem definition via inout, perform calculations via stress.

The source code for this application can be found in the appendix. In the following sections this code will be implemented as a parallel version using the techniques described in the initial chapters.

3.2 OpenMP Implementation

The first technique studied was OpenMP. For larger Finite Element problems, the most time consuming part is the solving process. It is therefore the part that this work focuses on. This section covers an OpenMP implementation of the existing serial Finite Element application.

3.2.1 Original Serial Solver

The solver that is used in the original application is a Gaussian solver. It handles an arbitrary number of load cases and it is banded, i.e. it expects the system matrix to be stored in a banded format. This format is used to conserve storage space in large Finite Element systems. Figure 3.2 illustrates the banded matrix format.

Figure 3.2: Illustration of the banded matrix format, taken from [14].

The first column of the banded matrix contains the diagonal elements. Each row contains bandwidth elements starting from, and including, the diagonal element. The solver also expects the matrix to be symmetric, as only the upper triangular part of the matrix is stored.
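The mapping between a full symmetric matrix and this banded storage can be sketched as follows (this snippet only illustrates the layout described above and is not taken from the thesis code; the names Kfull and S are introduced here for the example).

program band_store_sketch
   implicit none
   integer, parameter :: neq = 5, bw = 3
   real(8) :: Kfull(neq,neq), S(neq,bw)
   integer :: i, j

   Kfull = 0.0d0    ! fill Kfull with a symmetric matrix here
   S = 0.0d0

   ! Row i of S holds Kfull(i,i), Kfull(i,i+1), ..., Kfull(i,i+bw-1),
   ! i.e. the diagonal element followed by the rest of the band.
   do i=1, neq
      do j=i, min(i+bw-1, neq)
         S(i, j-i+1) = Kfull(i,j)
      end do
   end do
end program band_store_sketch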

The subroutine starts by saving the right hand side vectors r, given on input, into x. The second loop takes care of the forward reduction of the matrix, and finally the system is solved by back substitution.

3.2.2 Parallelizing the Solver using OpenMP

To parallelize the existing code, OpenMP directives will be added at suitable locations.

The first loop, saving the right hand side vectors, is very straightforward to parallelize with OpenMP.

!$omp parallel do
do n=1, nsize
   if (ifix(n) .eq. 0) then
      do ll=1, nlcs
         x(n,ll)=r(n,ll)
      end do
   end if
end do
!$omp end parallel do

The compiler directive !$omp parallel do is added at the beginning and !$omp end parallel do at the end of the loop to split it up into several threads. Parallelizing the forward reduction of the matrix, however, is more complicated. Beginning with the first inner loop, one can see that the value of kk depends on the other iterations.

!$omp parallel do private(kk,jj)
do ii=n, nx
   kk=ii-n+1                        ! modified incrementation
   do jj=1, nlcs
      r(ii,jj)=r(ii,jj)-s(n,kk)*x(n,jj)
   end do
end do
!$omp end parallel do

To overcome this, kk has to be labeled private, which means that it is visible to one thread only. But now, when changes are seen by one thread only, the expression kk = kk + 1 will not be correct, so we change it to kk = ii - n + 1. This will work because ii and n are visible to all threads in the parallel region. jj is also a local variable in the parallel region and likewise has to be labeled private. The second loop of the reduction part is not as complicated as the first one.

!$omp parallel do private(ilc)
do jeq=1, nsize
   if (ifix(jeq) .eq. 0) then
      do ilc=1, nlcs
         r(jeq,ilc)=x(jeq,ilc)
         x(jeq,ilc)=0.0d0
      end do
   end if
end do
!$omp end parallel do

ilc is a variable in the parallel region and is therefore labeled private.

3.2.3 Results with OpenMP

The OpenMP implementation was tested on a very simple model, consisting of a two dimensional rectangular plate divided into triangular elements by the Triangle mesh generator. The test machine had 4 Intel Xeon 5160 cores @ 3.0 GHz and 4 GB RAM. The original application was quite limited, only managing to calculate problems with up to approximately 5000 degrees of freedom before it began swapping to the hard drive. This is due to the very large bandwidth. When splitting up a region into several threads, the threads have to be forked, and this causes an overhead in the program. This overhead is in fact so large that these relatively small problems are calculated even faster by the original solver. To overcome this problem, one could develop an OpenMP parallelized solver from scratch instead of parallelizing existing routines.

3.3 PARDISO Implementation

The next approach taken was to implement the finite element application using the Intel MKL solver PARDISO. First, the subroutine bandsolve_omp was replaced with the subroutine pardiso_solve, shown in appendix B in the solve module. This subroutine basically contains the code in section 2.3.4, but with some modifications. The matrices in this case are real, symmetric and positive definite, thus the variable mtype is set to 2. Instead of initializing the number of equations, the CSR representation of the system matrix, the solution matrix and the right-hand-side matrix in the solver code, these are given as parameters to pardiso_solve from the execute subroutine in the stress module.

Next, the code was adapted to the new solve routine. The banded matrices must be replaced with matrices stored in the CSR format, which is the default format used by the PARDISO solver. There are several ways to achieve this: assemble directly into the CSR format, or assemble into a different format and then do a conversion to CSR. Different techniques for this are described in the following sections.

3.3.1 Banded Solution

In this approach the system matrix was, as before, assembled into the banded format and then converted to the CSR format in order to use the PARDISO solver. For this purpose a conversion subroutine, BANDtoCSR, shown in appendix B, was written. It stores the banded system matrix K into the arrays Kvec, ia and ja. The subroutine simply iterates over each row, and every non-zero element encountered is stored in Kvec. The variable nnz is increased for every non-zero element encountered. To take advantage of the symmetry, only the upper triangular part of the matrix is stored, and for this purpose the col_idx variable is used. It starts counting the column index from the diagonal on each row, and when a non-zero element is encountered the value of col_idx is inserted at index nnz in ja. When the first non-zero element of a row i is encountered, i.e. when element i of ia is empty, this element is set to the value of nnz, i.e. the index in Kvec of the first non-zero element of row i. Finally the dummy entry of ia is set to nnz + 1. The code listing 3.1 shows the iteration.

Listing 3.1: Format conversion from banded to CSR

ia = 0
do i=1, nnd*2
   col_idx = i
   do j=1, bw
      if (K(i,j) .ne. 0) then
         nnz = nnz+1           ! increment non-zero counter
         Kvec(nnz) = K(i,j)    ! insert the non-zero element
         ja(nnz) = col_idx     ! the inserted non-zero's column index
         if (ia(i) .eq. 0) then
            ia(i) = nnz
         end if
      end if
      col_idx = col_idx+1
   end do
end do

ia(nnd*2+1) = nnz+1            ! Add special last entry

The only part of the matrix relevant for the PARDISO solver is the submatrix with the elements corresponding to the degrees of freedom that are not prescribed. The subroutine submatrixCSR performs an in-place extraction of this submatrix. It also calculates a vector, updof, containing the unprescribed degrees of freedom. The purpose of updof is to extract the elements of the displacement vector and force vector that correspond to the unprescribed degrees of freedom. submatrixCSR is shown in its entirety in appendix B in the fem module. First the vector nbr_del_col_vec is calculated. Element i in nbr_del_col_vec tells how many columns from column 0 to i-1 correspond to a prescribed degree of freedom, i.e. how many columns will not be extracted to the submatrix. The column index in the submatrix can then be calculated as the column index of the original matrix minus nbr_del_col_vec(i). When nbr_del_col_vec has been calculated, the subroutine loops over all rows of K and performs the in-place extraction of the submatrix. The extraction code is shown in listing 3.2.

Listing 3.2: Submatrix extraction

nnz_sub = 0                         ! initialize number of non-zeros in submatrix
allocate(nbr_del_col_vec(neq))
nbr_del_col_vec = 0
nbr_del_col = 0
do i=1, neq
   if (Pre(i) .ne. 0) then
      nbr_del_col = nbr_del_col + 1
   end if
   nbr_del_col_vec(i) = nbr_del_col
end do

do i=1, neq                         ! loop over all rows of K
   if (Pre(i) .eq. 0) then          ! this dof is unprescribed
      row = row+1                   ! the submatrix has a new row
      updof(row) = i                ! dof i is unprescribed
      tmp_rowstart = nnz_sub+1      ! index where row in submatrix starts
      do j=1, ia(i+1)-ia(i)         ! loop over columns in this row
         col = ja(ia(i) + j - 1)
         if (Pre(col) .eq. 0) then
            nnz_sub = nnz_sub + 1
            K(nnz_sub) = K(ia(i) + j - 1)
            ja(nnz_sub) = col - nbr_del_col_vec(col)
         end if
      end do
      ia(row) = tmp_rowstart
   end if
end do

ia(row+1) = nnz_sub+1               ! Add special last entry
deallocate(nbr_del_col_vec)

At this point, when the system matrix was stored in the CSR format, one simply had to call the subroutine pardiso_solve to solve the system. The results are shown in the next section.

3.3.2 Results with Banded Solution

The code was tested on the same model as in the OpenMP section and on the same architecture. Even though this solution performed better than the OpenMP implementation, it was still very limited. The program managed to solve problems with up to about 6000 degrees of freedom, but for problems with some tens of thousands of degrees of freedom, it had to run for hours before returning a solution. The time consuming parts of the application were not the solver part; instead it was the assembling of the matrix, the format conversion and the extraction of the submatrix and force vector. Even though we used the memory efficient banded matrix format, the system matrix consumed a lot of memory. The main reason for this is that the bandwidth for the topology generated by Triangle is not very efficient. This resulted in a very memory consuming system matrix even for relatively small problems. In figure 3.3 the bandwidth for different problem sizes is shown. As the reader can see, the bandwidth for a system matrix with 6382 equations was 6200, i.e. almost full bandwidth. This resulted in extremely inefficient calculations.


Figure 3.3: Bandwidths for different number of degrees of freedom.

The relatively bad bandwidth is due to the numbering scheme used by Triangle. When generating a mesh with Triangle, the geometry is defined in an input file. In this file, the edge nodes are listed. When the mesh is generated for the geometry, the numbering of the edge nodes is the same for all meshes generated, i.e. they are forced into one position, and this results in the relatively bad bandwidth. To overcome this, some kind of renumbering of the topology has to be done.

3.3.3 Renumbered Banded Solution

To reduce the bandwidth, the TRIANGULATION_RCM program [16] was used. It is an executable FORTRAN 90 program, using double precision arithmetic, which computes the reverse Cuthill-McKee (RCM, an algorithm for reordering nodes in a graph) reordering for nodes in a triangulation composed of 3-node or 6-node triangles. The user supplies a node file and an element file, containing the coordinates of the nodes and the indices of the nodes that make up each triangle. Either 3-node or 6-node triangles may be used. The program reads the data, carries out the reordering algorithm and produces new node and element files corresponding to the reordered nodes.

To utilize TRIANGULATION_RCM, a program was written that reads the node and element files produced by Triangle and then produces new node and element files to be read by TRIANGULATION_RCM. Some changes were also made in the inout module of the application to make it read files in the new format. The bandwidth is now significantly reduced and the storage more memory efficient. The results are shown in the next section.

3.3.4 Results with Renumbered Banded Solution

Using the renumbering algorithm, the bandwidth became approximately five times smaller. Of course, this resulted in more effective calculations because a more memory conserving system matrix was produced. The size of the system matrix is proportional to the number of elements times the bandwidth, thus the application managed to solve problems approximately five times larger than with the previous approach (i.e. approximately 60 000 degrees of freedom). Still, this was too limited, because the size of the problems in real applications is up to a million degrees of freedom. A better approach is needed.

3.3.5 Coordinate Format Solution

Storing the system matrix in the banded format was not the optimal solution. Even when the nodes were renumbered to minimize the bandwidth, extremely many zeros were kept in memory. The optimal solution would be to assemble directly into the CSR format, or into a similar, memory conserving, matrix format and then convert to CSR. Because it is complicated to efficiently assemble directly into the CSR format, the latter alternative was chosen. The format chosen to assemble in is the coordinate (COO) format, a matrix format similar to the CSR format and very easy to assemble into. Like the CSR format, it consists of three vectors:

K This array contains the non-zero elements of the system matrix A. The elements are stored in the order they appear when walking across the rows in A.

ik element i of this array gives the number of the row corresponding to the element K(i).

jk element i of this array gives the number of the column corresponding to the element K(i).
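Using the same small upper triangle as in the CSR example of section 2.3.2, the COO representation stores the full row and column index of every non-zero. This sketch is constructed for this text and is not taken from the thesis.

program coo_sketch
   implicit none
   ! Upper triangle of the same symmetric 4 x 4 matrix as before,
   ! now in coordinate (COO) format.
   real(8) :: K(7)
   integer :: ik(7), jk(7), p

   K  = (/ 4d0, 1d0, 2d0, 5d0, 6d0, 3d0, 7d0 /)
   ik = (/ 1, 1, 1, 2, 3, 3, 4 /)   ! row index of each value
   jk = (/ 1, 2, 4, 2, 3, 4, 4 /)   ! column index of each value

   do p=1, 7
      write(*,*) 'K(', ik(p), ',', jk(p), ') =', K(p)
   end do
end program coo_sketch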

The assembly process is rewritten so that the element matrices are assembled directly into the COO format. To optimize the time complexity, the submatrix relevant for the PARDISO solver is calculated directly, instead of first calculating the full system matrix and then extracting the submatrix. As before, the nbr_del_col_vec vector is calculated to get the number of columns that will not be extracted. This vector is calculated by the subroutine get_nbr_del_col_vec, in the fem module, by looping through the Pre vector. This subroutine also calculates the updof vector, containing the unprescribed degrees of freedom, used to extract the force and displacement vectors relevant for the PARDISO solver. Furthermore, the elements of the force vector corresponding to the prescribed degrees of freedom are also calculated by assemCOO. The code in listing 3.3 shows the calculation.

Listing 3.3: Calculation of get_nbr_del_col_vec

nbr_del_col_vec = 0

do i=1, size(Pre)
   if (Pre(i) .ne. 0) then
      nbr_del_col = nbr_del_col + 1
   else
      updof_idx = updof_idx + 1
      updof(updof_idx) = i
   end if
   nbr_del_col_vec(i) = nbr_del_col
end do

The assembly of the system matrix in the COO format is done by the assemCOO subroutine shown in appendix B. The subroutine loops through the element matrix. The global coordinates of a matrix element are calculated with edof(i), edof(j), where i, j are the coordinates in the local element matrix. If the matrix element corresponds to a prescribed variable, the corresponding element of the force vector is updated. If not, the nnz counter is increased and Ke(i, j) is stored in K(nnz). Finally, the global coordinates are stored in ik(nnz) and jk(nnz) respectively. The code, shown in listing 3.4, is very simple.

Listing 3.4: Assembly into the coordinate format

n = size(Ke,1)
row = 0

do i=1, n
   iglob = edof(i)
   if (Pre(iglob) .eq. 0) then
      row = row + 1
      do j=1, n
         jglob = edof(j)
         if (Ke(i,j) .ne. 0) then
            if (Pre(jglob) .ne. 0) then
               F(jglob) = F(jglob) - Ke(i,j)*u(jglob)
            else if (jglob .ge. iglob) then
               nnz = nnz + 1
               K_sparse(nnz) = Ke(i,j)
               ik(nnz) = iglob - nbr_del_col_vec(iglob)
               jk(nnz) = jglob - nbr_del_col_vec(jglob)
            end if
         end if
      end do
   end if
end do

Now that the system matrix is in the COO format, it has to be converted to CSR in order to use PARDISO. For this, subroutines from SPARSKIT - a tool package for working with sparse matrices written by Yousef Saad - are used. The subroutine coicsr performs an in-place conversion from COO to CSR, i.e. the vectors K, ia and ja are overwritten and contain the system matrix in the CSR format on output. In the assembly process, entries with the same coordinates in the global matrix were placed one after another. To remove these duplicate entries, the subroutine clncsr, also from SPARSKIT, is used.
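The conversion that coicsr performs is essentially a counting sort on the row indices. The following C++ sketch is written here only to illustrate the idea; it is not the SPARSKIT routine (which works in place), it assumes 1-based indices as in the Fortran code, and it neither merges duplicate entries nor sorts the columns within each row:

#include <vector>

// Out-of-place COO -> CSR conversion for a matrix with n rows.
// val/rowInd/colInd hold the COO data with 1-based indices; rowPtr becomes
// the 1-based CSR row pointer array of length n+1.
void cooToCsr(int n,
              const std::vector<double>& val,
              const std::vector<int>& rowInd,
              const std::vector<int>& colInd,
              std::vector<double>& csrVal,
              std::vector<int>& csrCol,
              std::vector<int>& rowPtr)
{
    const int nnz = static_cast<int>(val.size());
    csrVal.assign(nnz, 0.0);
    csrCol.assign(nnz, 0);
    rowPtr.assign(n + 1, 0);

    // Count the number of entries in each row.
    for (int k = 0; k < nnz; ++k)
        ++rowPtr[rowInd[k]];

    // Cumulative sum: rowPtr[i-1] now gives the start of row i (1-based).
    rowPtr[0] = 1;
    for (int i = 1; i <= n; ++i)
        rowPtr[i] += rowPtr[i - 1];

    // Scatter each COO entry into the next free slot of its row.
    std::vector<int> next(rowPtr.begin(), rowPtr.end() - 1);
    for (int k = 0; k < nnz; ++k)
    {
        const int pos = next[rowInd[k] - 1]++;   // 1-based CSR position
        csrVal[pos - 1] = val[k];
        csrCol[pos - 1] = colInd[k];
    }
}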


3.3.6 Results with Coordinate Format Solution

To test the performance of this application, the same simple rectangular model as in the previous sections was used. As before, Triangle was used to generate the triangular mesh, and since the banded matrix format was no longer used there was no need to run the renumbering algorithm, so the Triangle mesh was used directly. The architecture used for testing was a four processor (AMD Opteron 2.4 GHz) Linux machine with 16 GB RAM. With this architecture the application managed problems with up to eleven million degrees of freedom before swapping to the hard drive. Different steps in the program were timed and a considerable performance improvement could be seen. In figure 3.4 the performance of the assembly routine assemCOO is shown, and one can see that this assembly is much faster than with the previous solutions. For eleven million degrees of freedom the system matrix is assembled in less than a minute. Also, assemCOO takes care of the force vector and submatrix extraction, which results in a large improvement as the very time consuming subroutines loadBANDmul and submatrixCOO can be eliminated.


Figure 3.4: Assembly times for different number of degrees of freedom.

After the assembly process the matrix format has to be converted from COO to CSR. The conversion times with the current solution are shown in figure 3.5. For eleven million degrees of freedom the conversion was performed in less than three minutes, so the overhead produced by not assembling directly into the CSR format is relatively small. The factorization times for different problem sizes are illustrated in figure 3.6. The largest problem was factorized in less than four minutes. The wallclock time of the pardiso solve subroutine was measured and the results are shown in figure 3.7. The largest problem was solved in about seven minutes. To illustrate how well the solver scales on more than one processor, the performance in GFLOPS during the factorization phase is plotted for different numbers of processors. The result is shown in figure 3.8. One can see that the solver scales very well; it is almost linear for up to four processors.



Figure 3.5: Conversion times for different number of degrees of freedom.


Figure 3.6: Factorization times for different number of degrees of freedom.


Figure 3.7: Total solve times for different number of degrees of freedom.



Figure 3.8: Performance in GFLOPS for different number of processors.


Chapter 4

Integrating the Parallel Solver into FEM-Design

To test the implemented solver it was integrated in the commercial finite element code FEM-Design. This chapter describes the implementation process and performance studies comparing the new solver with the existing solver.

4.1 The Implementation

The FEM-Design solver is implemented in Fortran. The first approach tried was to integrate the solver directly into the FEM-Design code. The problem was that the FEM-Design solver is written in Fortran 77 and the new solver is written in Fortran 90. Making the FEM-Design code compile with the Intel Fortran 90 compiler would take more time than was available in this project. As a temporary solution, the system matrix, assembled in the coordinate format, and the force vector were written from FEM-Design to file. A special application was written to handle the following steps:

FEM-data import The system matrix, the displacement vector and the force vector are read from the file produced by FEM-Design and saved into arrays.

Matrix conversion The system matrix is converted from the COO format to the CSR format in order to use the solver.

Solve The new solver is called.

Result export The displacement vector is written to file.

The application was compiled as an executable and called from FEM-Design directly after the FEM-data had been exported to file. The result was thereafter read from the file and illustrated graphically by FEM-Design.


4.2 Results

The first step was to verify that the solver generated correct results. For this, some calculations were done on a small model with both the new solver and FEM-Design. The mesh of this model is shown in figure 4.1. The results were identical to the eighth decimal. The next step was to measure and compare the calculation times for larger problems.

Figure 4.1: An illustration of the model used for result verification.

For this purpose a more complex model, shown in figure 4.2, was used. The structure was divided into three different meshes in order to test on different problem sizes. The testing architecture was Intel 2.4 GHz Dual Core machines with 2 GB RAM.

Figure 4.2: To the left the mesh is shown and to the right a three-dimensional illustration with and without mesh.

In figure 4.3 the solve times for FEM-Design are shown and in figure 4.4 the solve times for the parallel solver. The differences are noticeable: the largest example, with 630 000 equations, was solved in about six hours with FEM-Design while it was solved in about one minute with the parallel solver, i.e. a difference of almost a factor 360. One reason for this huge difference is that FEM-Design does a lot of reading and writing on the hard drive between different execution steps. The new solver, on the other hand, keeps all information in memory throughout the entire calculation, which is much more time efficient. Also, the skyline matrix format used in FEM-Design is very memory consuming. A more efficient solution would be to assemble directly into a less memory consuming format, e.g. the COO format.


Figure 4.3: Calculation times for the FEM-Design solver.


Figure 4.4: Calculation times for the parallel solver.

These problems were also calculated on a four processor (AMD Opteron 2.4 GHz) Linux machine with 16 GB RAM, to evaluate how well the parallel solver scaled on these kinds of problems. Figure 4.5 shows that it scales very well also for these problems, almost linearly for up to four processors.



Figure 4.5: Scaling for the parallel solver, the performance is measured in GFLOPS.


Chapter 5

Distributed Programming Tools

When problems are too large for local computing resources, it is beneficial to distribute the calculations to non-local resources. A company could have an in-house dedicated server or small cluster for handling these computations. This frees the user of the program to perform other tasks whilst waiting for the computations to finish. Of course, this could be done for example by transferring the input files to a remote computer and invoking the solver via a terminal. This alternative is not very user friendly, especially when considering commercial software. It should be more convenient for the user to utilize distributed computations.

Figure 5.1 illustrates the concept of Remote Procedure Calls in distributed computing. Instead of one object invoking a method in another object locally, the same method could be invoked remotely on another machine. For example, if FE modeling is being done on some low-performance machine, the time consuming solution of the system could be distributed to a better suited architecture via a Remote Procedure Call.

Figure 5.1: Remote Procedure Calls in distributed computing.

There exist many different programming tools for this, e.g. SOAP [9], Web Services [10], CORBA [8], Microsoft .NET [11], and Ice [12]. These tools can briefly be described as follows:

SOAP A protocol for exchanging XML-based messages over networks, normally using HTTP/HTTPS.


Web Services Basically, the term Web Service refers to clients and servers that communicate using XML messages that follow the SOAP standard; usually there is also a machine readable description of the operations supported by the server, written in the Web Services Description Language (WSDL).

CORBA Stands for Common Object Request Broker Architecture. It is a vendor-independent architecture and infrastructure that computer applications use to work together over networks, developed by the Object Management Group, OMG. Some of the people that developed CORBA are now maintaining Ice, so CORBA can be regarded as the predecessor of Ice, although CORBA is still maintained by OMG.

.NET Tools and libraries developed by Microsoft for connecting different applications developed for Microsoft platforms.

A disadvantage with .NET is that it only runs on Microsoft platforms. This makes it unsuitable for use with Linux based resources. SOAP has very serious performance penalties in this type of application, both in terms of network bandwidth and CPU overhead. Since our application will send large amounts of data, text based XML data exchange is not efficient. The same applies to Web Services, as they are based on the same technology. Another issue with Web Services is the lack of standardization. A more in-depth review of these considerations can be found in [12].

Ice, on the other hand, can be used in heterogeneous environments and is efficient in network bandwidth (sending data in binary form), memory use and CPU overhead. It is also easy to learn, has built in security and provides all features required for applications needing high-performance network transfers. Since this project focuses on transferring large amounts of data over heterogeneous systems, Ice was chosen for the distributed implementation. The next section describes the Ice architecture in more detail.

5.1 Ice Overview

Ice is object-oriented middleware¹ providing tools, APIs and library support for building object oriented client-server applications. In contrast to e.g. Microsoft .NET, Ice can be used in heterogeneous systems. This means that the client and server can run on different machine architectures or operating systems, be implemented in different programming languages, and also use a wide variety of networking technologies. Other advantages of using Ice include:

Synchronous/Asynchronous messaging For example, when running long calculations, the client doesn't have to wait for the results to be sent back from the server. Ice is inherently an asynchronous architecture but forces synchronous behaviour by default; it is a relatively easy task to change this.

Support for multiple interfaces There are facilities to provide multiple implementations of client/server interfaces while retaining a single object identity across these interfaces. This makes it easy to write different kinds of clients without making many changes, and version handling is simplified.

¹Middleware is computer software that connects software components or applications.


Implementation independence Clients are not aware of how servers implement their objects.

Threading support Ice is fully threaded and the APIs are thread-safe².

Transport independence Ice has support for both the TCP/IP and the UDP transport protocols. Also, there is support for SSL, which means that information that has to be sent over "unsafe" networks can be encrypted. The Ice tool Glacier2 is used to handle firewalls.

Location and server transparency The user doesn't have to worry about locating objects and managing the underlying transport mechanisms. To the programmer, the client and server appear "connection-less".

Source code availability The source code for Ice is fully available. However, use in commercial proprietary codes requires a licensing agreement with ZeroC.

In this section, a brief overview of the Ice architecture is given, essentially a selection of the contents of sections 2 and 32 in [12]. Of course, far from everything can be covered in this report. However, it is hoped that the description is clear enough that the reader can understand the later description of the actual implementation.

5.2 The Slice Definition Language

Ice provides a Remote Procedure Call, or RPC, protocol for invoking methods from client to server (or the other way around). This means that the client can cause a method/procedure to be executed in a different address space (e.g. on a different computer on a local network) without the programmer explicitly handling all the complex details of this remote interaction. Since Ice is supposed to be object oriented and language/platform independent, all these remote procedures have to be described in some object oriented, language independent way. This is where Slice comes in.

The Slice definition provides a contract by which the client and server must abide. Ice also provides special compilers that translate Slice definitions into language specific type definitions and APIs for a particular implementation language. These different translation algorithms are called language mappings. The defined types and APIs are then used by the developer to program specific functionality in either client or server. At the moment, Slice provides language mappings for C++, Java, C#, Visual Basic .NET, Python, Ruby, and PHP. Slice is a purely declarative language, because to be language independent, only interfaces and types can be described (not implementations). Listing 5.1 shows an example of a small Slice definition file.

Listing 5.1: Slice definition example

module Demo {
    interface Printer {
        void printString(string s);
    };
};

²A piece of code is thread-safe if it functions correctly during simultaneous execution by multiple threads.


In this C++ example, the language mapping for a module is, not surprisingly, a C++ namespace, and the interface within the module will be mapped to a C++ interface within the corresponding namespace. To generate these mappings, one has to invoke the slice2cpp compiler in the following way:

$ slice2cpp Printer.ice

This will generate two C++ source files:

Printer.h Header file with type definitions corresponding to the Slice definitions for the Printer interface. In our case, this file will also include a definition of a pure virtual function corresponding to the printString method. This means that all implementations of the Printer interface must implement this method. Of course, if both client and server are written in C++, this header must be included in both.

Printer.cpp Contains the source code for our Printer interface. This file also provides some basic tools for clients and servers, e.g. code that marshals³ data on the client side and correspondingly unmarshals the data on the server side. For example, the string s passed from client to server has to be converted to a binary format on the client side, sent to the server side, and converted back into the right format.
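To give an idea of how the generated code is used, a minimal client for the Printer interface could look roughly as follows (this sketch is written for illustration only; the object identity "SimplePrinter" and the port number are arbitrary example choices, and error handling is omitted):

#include <Ice/Ice.h>
#include <Printer.h>   // generated by slice2cpp

int main(int argc, char* argv[])
{
    // Initialize the Ice run-time and obtain a proxy to the remote object.
    Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);
    Ice::ObjectPrx base =
        ic->stringToProxy("SimplePrinter:default -p 10000");
    Demo::PrinterPrx printer = Demo::PrinterPrx::checkedCast(base);

    // Invoke the remote procedure; marshaling of the string argument
    // is handled by the generated code and the Ice run-time.
    printer->printString("Hello World!");

    ic->destroy();
    return 0;
}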

Figure 5.2 illustrates the situation when both client and server are written in C++. The client developer writes the client and the server developer writes the server; the only information these developers need for writing their end of the application is the Slice definition file. The final executables both use the C++ Ice run-time library and communicate via Remote Procedure Calls.


Figure 5.2: Development process if client and server share the same development environment. This process is also described in [12].

In the final step, the actual implementation of the client and server must be written. As Ice automatically handles all network communication, this is a relatively easy task. However, one needs a basic understanding of the principles of client/server communication in Ice, which is described in the next section.

³The process of transforming the memory representation of an object to a data format suitable for storage or transmission. In Ice, the data is transmitted in a binary format.


5.3 Principles of Ice Communication

Figure 5.3 illustrates the basic idea behind client/server communication in Ice. In this particular case, there are three servants instantiated on the server side. These are given the arbitrary names "A", "B" and "C". The client holds a reference, or proxy, to the servant "A" on the server side. Basically, this means that "A" is an incarnation of an implementation of a specific Slice defined interface and that the client can invoke remote procedure calls using this proxy, which also contains information on how to access the server. In this case there also exist other servants, but these are distinguished by their identities, which have to be specified by the programmer (one can also use UUIDs to simplify the process if many servants are needed). Now, when the client invokes an operation using its "A" proxy, the following happens on the server side: the object adapter, which is listening on its specified network endpoints, receives the call from the client, looks up the correct servant in its associated Active Servant Map (ASM) using the passed object identity and then dispatches the request to this servant. All data marshaling/unmarshaling is handled "under the hood", so the user need not worry about this.

Figure 5.3 is missing one key element in Ice, namely the entity called communicator. This is the main entry point to the Ice run-time and is used both on the client and the server side. The communicator handles the client/server side thread pools, configuration properties, object factories, loggers, statistics, the default router (e.g. used by Glacier2 for firewall handling), the default locator (resolving object identities to proxies), the plug-in manager and, last but not least, object adapters. For a full description of distributed programming with Ice, see [12].
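As a minimal sketch of these server side concepts, again using the Printer interface from listing 5.1 (the identity "SimplePrinter", the adapter name and the port are arbitrary example choices and not part of the application developed in this work), a servant could be implemented and registered with an object adapter roughly like this:

#include <Ice/Ice.h>
#include <Printer.h>   // generated by slice2cpp
#include <iostream>

// A servant: an implementation of the Slice defined Printer interface.
class PrinterI : public Demo::Printer
{
public:
    virtual void printString(const std::string& s, const Ice::Current&)
    {
        std::cout << s << std::endl;
    }
};

int main(int argc, char* argv[])
{
    Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);

    // The object adapter listens on the given endpoint and maps object
    // identities to servants through its active servant map (ASM).
    Ice::ObjectAdapterPtr adapter =
        ic->createObjectAdapterWithEndpoints("PrinterAdapter", "default -p 10000");
    adapter->add(new PrinterI, ic->stringToIdentity("SimplePrinter"));
    adapter->activate();

    ic->waitForShutdown();
    ic->destroy();
    return 0;
}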


Figure 5.3: Binding a request to the correct servant in Ice using direct binding. A more detailed description can be found in [12].


Chapter 6

Distributed Interface for the Parallel Solver

This chapter describes the implementation of the distributed interface for the previously developed parallel solver. The implementation is based on Ice [12], which is described earlier. The implemented system has been tested both locally on a Microsoft Windows system and remotely, with the server running on a LUNARC¹ GNU/Linux node and the client running on a Windows computer. In the current implementation both client and server are implemented in C++. The server code, however, links to a finite element solver implemented in Fortran. The approach for doing this is also described in this chapter.

6.1 The Slice Definition

The Slice definition for the client/server application is given in listing A.1 in appendix A. It is named remoteComputation.ice. As discussed earlier, when running the Slice compiler on this file, a header file remoteComputation.h and a source file remoteComputation.cpp will be generated, the latter containing, for example, code for marshalling/unmarshalling of data that the programmer need not be concerned with.

The first construct defined in the Slice definition is a module called remoteComputation. This module will be mapped into an equivalent namespace in C++. Everything else defined in this module will correspondingly be mapped to functions, classes and data structures in the generated C++ namespace. Thereafter some Slice sequences are defined; these will be mapped to C++ type definitions of std::vector. For example, sequence<int> will be mapped to std::vector< ::Ice::Int> in C++. These type definitions describe our input/output vectors and matrices and look like the following:

sequence<int> intVec;
sequence<intVec> intMat;
sequence<double> doubleVec;
sequence<doubleVec> doubleMat;

¹Center for Scientific and Technical Computing in Lund.


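In the generated remoteComputation.h, these sequences then appear, roughly (the exact qualification of the generated names may differ slightly), as typedefs of the following form inside the remoteComputation namespace:

namespace remoteComputation
{
    typedef ::std::vector< ::Ice::Int>    intVec;
    typedef ::std::vector<intVec>         intMat;
    typedef ::std::vector< ::Ice::Double> doubleVec;
    typedef ::std::vector<doubleVec>      doubleMat;
}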

Some exceptions are also defined in the Slice definition. RangeError is an exception thrown when there is some error in the solution. It is not implied that all solution errors are caused by range issues; a better name would have been SolutionError. RequestCanceledException is used in situations where client requests are canceled on the server side, for example if the server is forced to shut down. The RangeError exception is defined in the following way:

exception RangeError {};

Next, the interface for the remote procedure calls is defined:

interface remoteExecution
{
    ["ami", "amd"] void execute(doubleMat Coords,
                                intMat Edof,
                                doubleVec ElProp,
                                int nnd,
                                int nel,
                                int nnl,
                                int npv,
                                intVec pvind,
                                doubleVec pv,
                                intVec loadind,
                                doubleVec preloads,
                                out doubleMat ElemForces,
                                out doubleVec F,
                                out doubleVec u) throws RangeError;
};

An interface in Slice corresponds to an interface in C++, i.e. a class with all member functions purely virtual. In this case, the pure virtual function is the execute method. This method has to be implemented by any servant implementing the remoteExecution interface. The preceding metadata ["ami", "amd"] has to do with asynchronous method invocation/dispatch and will be discussed later. The input/output parameters of the function will be mapped in a way that follows the C++ coding standard. For example, the first parameter will be mapped into the corresponding C++ parameter const ::remoteComputation::doubleMat& and the last into ::remoteComputation::doubleVec&. The parameters are described as follows:

Coords The coordinate matrix.

Edof The topology matrix.

ElProp Element properties.


nnd The number of nodes in the mesh.

nel The number of elements.

nnl The number of prescribed nodal loads.

npv The number of prescribed variables.

pvind Indices of the prescribed variables.

pv Numerical values of the prescribed variables. The actual indices of the variables are given by the corresponding entries in the pvind vector.

loadind Indices of prescribed loads.

preloads Numerical values of prescribed loads. This is the same construct as for the prescribed variables. The reason for sending the data this way is to minimize overhead.

Elemforces The output element forces. This parameter is preceded by the Slice keyword out, which makes it possible to receive it as an output from the remote procedure call.

F The output force vector.

u The output displacement vector.

In case of a calculation error, the execute method can throw a RangeError. This is basically all that has to be specified to be able to produce a simple client/server application for distributed computations. In the next section the client implementation is described.

6.2 Client Implementation

The implemented client is a simple console based application written in C++, where all communication with the server is handled by the Ice runtime. The client reads input data from text files and invokes the execute procedure (described above) remotely. Since the purpose of the client is not to provide a full finite element application but to illustrate how one could take advantage of distributed computations, this is sufficient. A diagram describing the different classes in the client application can be seen in figure 6.1. The client depends on the Ice libraries. All information pertaining to client/server communication, defined in the Slice definition, is in the remoteComputation header, so this also has to be included. Furthermore, all reading from input files is done by the ElemData class. When invoking remote procedure calls on large problems, or when many clients are connected to the server, response times from the server can be very long. Ice is inherently an asynchronous library, but it forces synchronous behaviour by default. This means that, with the default Ice behaviour, the client would wait for a response from the server before continuing with program execution. This is not acceptable for any realistic finite element application. To overcome this problem, Ice provides tools for making asynchronous method invocations (AMI), where client execution continues after the remote procedure call.


Figure 6.1: Dependency diagram for the client application.

This is the reason for the "ami" metadata preceding the definition of the execute method in the Slice definition. This directive causes the compiler to generate code for asynchronous method invocation in the remoteComputation class. The user only has to create a callback object for the procedure call and specify what to do when receiving the results from the server (error handling also has to be specified). All this is done in the AMI_remoteExecution_executeI class, which is an implementation of the AMI_remoteExecution_execute interface. This interface is in turn generated (in the remoteComputation header) by the Slice compiler when specifying the "ami" metadata directive for the execute function. In the next section the actual code written is reviewed in detail.

6.2.1 Client Code

The code for the ElemData class is trivial and not relevant to distributed computations, so it is not discussed here. First the code for the main client program is presented, then it is shown how to handle the callback from the asynchronous method invocation in the AMI_remoteExecution_executeI class. The complete client code (except for ElemData) is included in appendix A.

Listing A.2 shows the code for the main client program. The client inherits from Ice::Application, which is a helper class that Ice provides in order to eliminate writing the same boiler plate code over and over. Among other things, it initiates a communicator and handles interrupts from the user.

The client has two public methods and one private method. The first public method is the run method. This is needed by the main method defined in Ice::Application and defines what is to happen when running the client. Basically, the run method replaces the actual main method, but is contained inside another method handling some basic initializations and interrupt listening. The main method of the client is shown below.

int
main(int argc, char* argv[])
{
    AsyncClient app;
    return app.main(argc, argv);
}

As the reader can see, the main method inherited from Ice::Application is called. This method in its turn calls the run method implemented by the derived application.

interruptCallback is an optional method that has to be defined if the statement

callbackOnInterrupt();

is specified in the run method. The purpose of this method is to override the default interrupt behaviour of Ice::Application.

The run method has two basic responsibilities:

1. Create a proxy to a remoteExecution servant.

2. Initiate the user menu and when prompted make the remote procedure call.

The proxy creation is illustrated below:

Ice::ObjectPrx base = communicator()->stringToProxy(
    "SimplePrinter:default -p 10000");
remoteExecutionPrx re = remoteExecutionPrx::checkedCast(base);
if (!re)
    throw "Invalid proxy";

AMI_remoteExecution_executePtr cb =
    new AMI_remoteExecution_executeI;

After these lines, the variable re holds a "remote object reference" to an instance of a class implementing the interface defined in the Slice definition. In this case the remote object is on the local computer (localhost), but this could easily be changed. On the last line in the preceding code snippet a callback object is created for the asynchronous method invocation. Next, the input data structures are defined and read by calling methods in the ElemData class:

int nnd, nel, nnl, npv;
doubleMat Coords, ElemForces;
intMat Edof;
vector<int> pvind, loadind;
vector<double> pv, preloads, F, u;
vector<double> ElProp;

::readProblemSize(nnd, nel, nnl, npv);
::readBCandElProp(nnl, npv, loadind,
    preloads, pvind, pv, ElProp);
::readNode(Coords, nnd);
::readEle(Edof, nel);


The remote procedure call is then executed in the following way:

re->execute_async(cb, Coords, Edof, ElProp,
    nnd, nel, nnl, npv, pvind, pv,
    loadind, preloads);

The execution will immediately continue after this call, and the callback object created earlier will handle the response when it arrives. Last in the run method, error handling is done. As shown by the code examples, implementing an Ice client is not very complicated.

To complete the asynchronous method implementation, the server response callbacks must also be implemented. The full code for AMI_remoteExecution_executeI can be found in listing A.3. There are two methods in this class: ice_response indicates that the operation completed successfully and ice_exception indicates that a local or user exception was raised. The definition of the ice_response method is given below.

virtual void ice_response(
    const ::remoteComputation::doubleMat& ElemForces,
    const ::remoteComputation::doubleVec& F,
    const ::remoteComputation::doubleVec& u)
{
    cout << "Finished Calculation" << endl;
    /* DO SOMETHING WITH THE RESULTS */
}

The ice_response method receives the output parameters defined in the Slice definition as input parameters. In this method the results can be modified or put into another data structure.

In the ice_exception method, the error, received as an input parameter, is printed.

virtual void ice_exception(
    const ::Ice::Exception& ex)
{
    try {
        ex.ice_throw();
    }
    catch (const Ice::LocalException& e) {
        cerr << "calculation failed " << e << endl;
    }
}

This completes the code for the asynchronous method invocation. Compared to other RPC libraries, Ice significantly reduces the amount of complexity and code needed to implement asynchronous behaviour.


6.3 Server Implementation

The server is also implemented in C++ and does not involve more work than writing a client; however, there are some issues to consider. If all of the server's threads (one thread in our case) are busy dispatching long-running operations, then no threads are available to process new requests. This may lead to long response times for client requests. Asynchronous Method Dispatch (AMD), the server side equivalent of AMI, addresses this scalability issue. When using AMD the server can suspend the processing of incoming requests in order to release the dispatch thread as soon as possible. In our case, this means queueing the request as a job in a work queue and forking a new thread responsible for performing the tasks in the queue.

An aspect that can make implementing the server difficult is linking C++ with Fortran. The solver is implemented in Fortran, and Ice and Slice do not have a Fortran binding. When building the server in Visual Studio, linking was accomplished using a dynamic link library (DLL) built with the Intel Visual Fortran compiler. When building on GNU/Linux using the Intel C++ compiler, linking was done directly to object files built with the Intel Fortran compiler. There are also issues in the way that matrices are stored in C++ and Fortran: Fortran uses column-major order while C++ uses row-major order. To solve this problem, and also to make the transition from std::vector to the arrays needed by Fortran, a modified version of the FMatrix [13] template class written by Carsten A. Arnholm is used.
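As a small illustration of what this involves (this sketch is not part of the actual server code; the Fortran subroutine name fsolve is made up, and the exact symbol name and calling convention depend on the compiler, although a lower-case name with a trailing underscore and arguments passed by reference is a common convention):

#include <vector>

// A hypothetical Fortran routine
//   subroutine fsolve(n, a, b)
// is typically exported by the Intel and GNU Fortran compilers as the
// C symbol "fsolve_", taking all arguments by reference:
//
//   extern "C" void fsolve_(int* n, double* a, double* b);

// Flatten a row-major C++ matrix (vector of rows) into the column-major
// storage order that Fortran expects: element (i,j) ends up at j*rows + i.
std::vector<double> toColumnMajor(const std::vector<std::vector<double> >& m)
{
    const int rows = static_cast<int>(m.size());
    const int cols = rows > 0 ? static_cast<int>(m[0].size()) : 0;
    std::vector<double> flat(rows * cols);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            flat[j * rows + i] = m[i][j];
    return flat;
}

int main()
{
    std::vector<std::vector<double> > m(2, std::vector<double>(2));
    m[0][0] = 4.0; m[0][1] = 1.0;
    m[1][0] = 1.0; m[1][1] = 3.0;

    std::vector<double> a = toColumnMajor(m);
    // a now holds { 4, 1, 1, 3 }: the first column followed by the second,
    // ready to be passed as the "a" argument of a Fortran routine.
    return 0;
}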

Figure 6.2 shows the different classes in the server application. As shown in this figure, all the information pertaining to a specific calculation is held within an object of the class Job. It also holds a callback object (the cb variable), similar to the AMI case. The execute method in this class starts jobs and handles the callbacks. The Job class has a private method solve_system which converts the input matrices using the FMatrix class, calls the Fortran routines for solving the system and finally converts the Fortran results to appropriate STL vectors.

Figure 6.2: Dependency diagram for the server application. The diagram also shows public methods without input parameters for some classes.

The WorkQueue holds a list of pointers to Jobs. These are smart pointers defined by Ice. A smart pointer is a pointer that automatically frees the associated memory when the pointer goes out of scope or is deleted, eliminating many of the memory management errors in C++. WorkQueue inherits from IceUtil::Thread and therefore implements a method, run, which is the method called by the Ice runtime when starting a new WorkQueue thread. It also has a destroy method handling the destruction of the queue and an add method for adding jobs to the queue.

The server itself is defined in a very similar way as the client, inheriting from Ice::Application and implementing a run and an interruptCallback method. On the server side, the remoteExecution interface from the Slice definition must be implemented. This is the responsibility of the remoteExecutionI class. Its execute_async method only adds a job to the work queue.

6.3.1 Server Code

In this section the most important parts of the server code are reviewed. All code, with the exception of the FMatrix template class, can be found in appendix A. Listing A.4 shows the server specific code; it is very similar to the client code, inheriting from Ice::Application. To handle remote method calls, a work queue is first instantiated. An object adapter and a servant for the remoteExecution interface are created, and the servant is added to the object adapter's active servant map (ASM).

workQueue = new WorkQueue();

Ice::ObjectAdapterPtr adapter
    = communicator()->createObjectAdapterWithEndpoints(
        "SimplePrinterAdapter", "default -p 10000");
Ice::ObjectPtr object = new remoteExecutionI(workQueue);
adapter->add(object,
    communicator()->stringToIdentity("SimplePrinter"));

The work queue thread is spawned, which means that the code in its run method is executed in a new thread.

workQueue->start();

The adapter is activated, and the server will actively start listening for requests over the specified network interfaces.

adapter->activate();

The server then waits for a shutdown request.


communicator()->waitForShutdown();

As the reader can see, writing this server is not very complicated. In the servant implementation (the implementation of the remoteExecution interface) the only thing that is done is calling the work queue's add method, adding a job to the work queue.

The code for the work queue is shown in listings A.5 and A.6. The code in the run method is illustrated below:

void
WorkQueue::run()
{
    IceUtil::Monitor<IceUtil::Mutex>::Lock lock(monitor);

    while (!done)
    {
        if (jobs.size() == 0)
        {
            monitor.wait();
        }
        if (jobs.size() != 0)
        {
            JobPtr currentJob = jobs.front();
            if (!done)
            {
                currentJob->execute();
                jobs.pop_front();
            }
        }
    }

    list<JobPtr>::const_iterator p;
    for (p = jobs.begin(); p != jobs.end(); ++p)
    {
        cout << "Canceled request" << endl;
        (*p)->notifyJobCancelled();
    }
}

The work queue thread waits until there are jobs to process, pops the one at the front and calls its execute method. When the server is shut down (which means that done is set to true), outstanding client requests are notified of the cancellation.


Chapter 7

Conclusions and Future Work

7.1 Conclusions

After reviewing the different approaches to parallelization, it was found that Intel MKL and the PARDISO solver were the most suitable alternative. The implemented parallel solver turned out to be very fast and scaled well on multiple processors. One problem with using the PARDISO solver is the required CSR sparse matrix format. After some experimentation with banded formats, a good solution was found in assembling into the COO format and then converting to CSR.

The solver was compared with FEM-Design's original skyline solver. The study showed that the new PARDISO based solver was considerably more efficient than the original solver. Besides the parallel computations, the differences also derive from the fact that FEM-Design's original solver uses a lot of disk IO even if there is memory available in the machine.

A client/server application for distributed computations was implemented in C++ using Ice. It turned out to be very convenient and efficient to use Ice for these kinds of applications. A special template class was adapted in order to link to the Fortran solver. The application uses asynchronous method invocation/dispatch: on the client side, the user shouldn't have to wait for computations to finish before performing other tasks, and the server has to be able to handle multiple client requests.

7.2 Future Work

A future task is looking at smart CSR assembly methods in order to eliminate the conversion overhead. This work focused mostly on Intel MKL for parallel computations. It is suggested that further testing of the PETSc libraries be done in the future. This is especially important when problems grow extremely large: since PETSc uses a distributed memory model via MPI, it is more suitable for large clusters like the ones at LUNARC.

The parallel solver should also be fully integrated in FEM-Design using the Intel Visual Fortran 10.0 compiler. This requires more work, as the existing code has to be adapted for the new compiler, which was not possible in the timeframe of this work. It would also be interesting to do more work on the distributed computations, for example making a full integration with FEM-Design and testing other language mappings. Also, a more object oriented interface could be developed, allowing computations on many different problems.


Appendix A

Client/Server code

In this appendix the code for the client/server application is shown. Some code is excluded because it is trivial and not relevant to the matter at hand.

A.1 Slice definition

Listing A.1: Slice definition

// ****************************************************************
//
// Slice definition file for distributed computations
//
// Written by: Filip Johansson and Fredrik Hansson
//
// ****************************************************************

module remoteComputation
{
    sequence<int> intVec;
    sequence<intVec> intMat;
    sequence<double> doubleVec;
    sequence<doubleVec> doubleMat;

    exception RangeError {};

    exception RequestCanceledException {};

    interface remoteExecution
    {
        ["ami", "amd"] void execute(doubleMat Coords,
                                    intMat Edof,
                                    doubleVec ElProp,
                                    int nnd,
                                    int nel,
                                    int nnl,
                                    int npv,
                                    intVec pvind,
                                    doubleVec pv,
                                    intVec loadind,
                                    doubleVec preloads,
                                    out doubleMat ElemForces,
                                    out doubleVec F,
                                    out doubleVec u) throws RangeError;
    };
};

A.2 Client code

Listing A.2: Main client application

// ***************************************************************
//
// Main client program. Uses asynchronous method invocation
// for the distributed computations.
//
// Written by: Filip Johansson and Fredrik Hansson
//
// ***************************************************************

#include <Ice/Ice.h>
#include <remoteComputation.h>

#include <AMI_remoteExecution_executeI.h>
#include <ElementData.h>

using namespace std;
using namespace remoteComputation;

class AsyncClient : public Ice::Application
{
public:
    virtual int run(int, char*[]);
    virtual void interruptCallback(int);

private:
    void menu();
};

int
main(int argc, char* argv[])
{
    AsyncClient app;
    return app.main(argc, argv);
}

int AsyncClient::run(int argc, char* argv[])
{
    callbackOnInterrupt();

    int status = 0;
    try {
        Ice::ObjectPrx base = communicator()->stringToProxy(
            "SimplePrinter:default -p 10000");
        remoteExecutionPrx re = remoteExecutionPrx::checkedCast(base);
        if (!re)
            throw "Invalid proxy";

        AMI_remoteExecution_executePtr cb =
            new AMI_remoteExecution_executeI;

        int nnd, nel, nnl, npv;
        doubleMat Coords, ElemForces;
        intMat Edof;
        vector<int> pvind, loadind;
        vector<double> pv, preloads, F, u;
        vector<double> ElProp;

        ::readProblemSize(nnd, nel, nnl, npv);
        ::readBCandElProp(nnl, npv, loadind,
            preloads, pvind, pv, ElProp);
        ::readNode(Coords, nnd);
        ::readEle(Edof, nel);

        char c;
        menu();
        do
        {
            cout << "==>";
            cin >> c;
            if (c == 'c')
                re->execute_async(cb, Coords,
                    Edof, ElProp, nnd,
                    nel, nnl, npv, pvind, pv,
                    loadind, preloads);
            else if (c == 'x')
                interruptCallback(0);
            else
            {
                cout << "unknown command `" << c << "'" << endl;
                menu();
            }
        }
        while (cin.good());

    } catch (const Ice::Exception& ex) {
        cerr << ex << endl;
        status = 1;
    } catch (const char* msg) {
        cerr << msg << endl;
        status = 1;
    }
    return status;
}

void
AsyncClient::interruptCallback(int)
{
    try
    {
        communicator()->destroy();
    }
    catch (const IceUtil::Exception& ex)
    {
        cerr << appName() << ": " << ex << endl;
    }
    catch (...)
    {
        cerr << appName() << ": unknown exception" << endl;
    }
    exit(EXIT_SUCCESS);
}

void
AsyncClient::menu()
{
    cout <<
        "usage:\n"
        "c: calculate\n"
        "x: exit\n";
}

Listing A.3: Class for handling the callback from the asynchronous method invocation

#include "remoteComputation.h"

using namespace std;

class AMI_remoteExecution_executeI :
    public remoteComputation::AMI_remoteExecution_execute
{
    virtual void ice_response(
        const ::remoteComputation::doubleMat& ElemForces,
        const ::remoteComputation::doubleVec& F,
        const ::remoteComputation::doubleVec& u)
    {
        cout << "Finished Calculation" << endl;
        /* DO SOMETHING WITH THE RESULTS */
    }

    virtual void ice_exception(
        const ::Ice::Exception& ex)
    {
        try {
            ex.ice_throw();
        }
        catch (const Ice::LocalException& e) {
            cerr << "calculation failed " << e << endl;
        }
    }
};

A.3 Server Code

Listing A.4: Main server application

// **********************************************************************
//
// Main server program for distributed computing application
//
// Written by: Filip Johansson and Fredrik Hansson
//
// **********************************************************************

#include <Ice/Ice.h>
#include <remoteComputation.h>

#include <WorkQueue.h>
#include <Job.h>
#include <Ice/Application.h>
#include <FMatrix.h>
#include <list>

typedef int INTEGER;
typedef double REAL;

using namespace std;
using namespace remoteComputation;

class ServerApp : virtual public Ice::Application {
public:

    virtual int run(int, char*[]) {

        callbackOnInterrupt();

        workQueue = new WorkQueue();   // Initiate the work queue

        Ice::ObjectAdapterPtr adapter
            = communicator()->createObjectAdapterWithEndpoints(
                "SimplePrinterAdapter", "default -p 10000");
        Ice::ObjectPtr object = new remoteExecutionI(workQueue);
        adapter->add(object,
            communicator()->stringToIdentity("SimplePrinter"));

        workQueue->start();
        // spawn workqueue thread

        adapter->activate();
        // adapter starts listening to network interface

        communicator()->waitForShutdown();

        if (interrupted()) {
            cerr << appName() <<
                ": received signal, shutting down" << endl;
        }
        adapter = 0;
        workQueue->getThreadControl().join();

        return 0;
    }

    virtual void interruptCallback(int)
    {
        cout <<
            "Received interrupt signal, terminating!" << endl;
        workQueue->destroy();

        communicator()->shutdown();
    }

private:
    WorkQueuePtr workQueue;
};

int main(int argc, char* argv[]) {
    ServerApp app;
    return app.main(argc, argv);
}

Listing A.5: The work queue header.

// **********************************************************************
//
// Header file for class handling the work queue in the AMD model
//
// Written by: Filip Johansson
//
// **********************************************************************

#ifndef WORK_QUEUE_H
#define WORK_QUEUE_H

#include <remoteComputation.h>

#include <IceUtil/Thread.h>
#include <IceUtil/Monitor.h>
#include <IceUtil/Mutex.h>
#include <Job.h>

#include <list>

class WorkQueue : public IceUtil::Thread
{
public:

    WorkQueue();

    virtual void run();

    void add(const remoteComputation::AMD_remoteExecution_executePtr& cb,
             const doubleMat& Coords,
             const intMat& Edof,
             const doubleVec& ElProp,
             int nnd,
             int nel,
             int nnl,
             int npv,
             const intVec& pvind,
             const doubleVec& pv,
             const intVec& loadind,
             const doubleVec& preloads);

    void destroy();

private:

    /* struct CallbackEntry
    {
        const remoteComputation::AMD_remoteExecution_executePtr cb;
        doubleMat Coords;
        intMat Edof;
        doubleVec ElProp;
        int nnd;
        int nel;
        int nnl;
        int npv;
        intVec pvind;
        doubleVec pv;
        intVec loadind;
        doubleVec preloads;
        doubleMat ElemForces;
        doubleVec F;
        doubleVec u;
    }; */

    IceUtil::Monitor<IceUtil::Mutex> monitor;
    std::list<JobPtr> jobs;
    bool done;
};

typedef IceUtil::Handle<WorkQueue> WorkQueuePtr;

#endif

Listing A.6: The work queue implementation.

// **********************************************************************
//
// Written by: Filip Johansson
//
// **********************************************************************

#include <Ice/Ice.h>
#include <WorkQueue.h>
#include <FMatrix.h>

using namespace std;

WorkQueue::WorkQueue() :
    done(false)
{
}

void
WorkQueue::run()
{
    IceUtil::Monitor<IceUtil::Mutex>::Lock lock(monitor);

    while (!done)
    {
        if (jobs.size() == 0)
        {
            monitor.wait();
        }
        if (jobs.size() != 0)
        {
            //
            // Get next work item.
            //
            JobPtr currentJob = jobs.front();

            if (!done)
            {
                //
                // Execute calculation / send callback to client
                //
                currentJob->execute();

                // Job executed, remove from queue
                jobs.pop_front();
            }
        }
    }

    //
    // Throw exception for any outstanding requests.
    //
    list<JobPtr>::const_iterator p;
    for (p = jobs.begin(); p != jobs.end(); ++p)
    {
        cout << "Canceled request" << endl;
        (*p)->notifyJobCancelled();
    }
}

void
WorkQueue::add(const remoteComputation::AMD_remoteExecution_executePtr& cb,
               const doubleMat& Coords,
               const intMat& Edof,
               const doubleVec& ElProp,
               int nnd,
               int nel,
               int nnl,
               int npv,
               const intVec& pvind,
               const doubleVec& pv,
               const intVec& loadind,
               const doubleVec& preloads)
{
    IceUtil::Monitor<IceUtil::Mutex>::Lock lock(monitor);

    if (!done)
    {
        //
        // Add work item.
        //
        JobPtr newJob = new Job(cb, Coords, Edof,
            ElProp, nnd, nel, nnl, npv,
            pvind, pv, loadind, preloads);

        if (jobs.size() == 0)
        {
            monitor.notify();
        }
        jobs.push_back(newJob);
    }
    else
    {
        //
        // Destroyed, throw exception.
        //
        cout << "Destroyed!!!" << endl;
        cb->ice_exception(RequestCanceledException());
    }
}

void
WorkQueue::destroy()
{
    IceUtil::Monitor<IceUtil::Mutex>::Lock lock(monitor);

    //
    // Set done flag and notify.
    //
    done = true;
    monitor.notify();
}

Listing A.7: Servant implementation

// **********************************************************************
// Implementation of remoteExecution servant.
// Written by: Filip Johansson and Fredrik Hansson
// **********************************************************************

#include <Ice/Ice.h>
#include <remoteComputation.h>

#include <WorkQueue.h>
#include <Job.h>
#include <Ice/Application.h>
#include <FMatrix.h>
#include <list>

typedef int INTEGER;
typedef double REAL;

using namespace std;
using namespace remoteComputation;

class remoteExecutionI : virtual public remoteExecution
{
public:
    remoteExecutionI(const WorkQueuePtr&);

    virtual void execute_async(
        const remoteComputation::AMD_remoteExecution_executePtr& cb,
        const doubleMat& Coords,
        const intMat& Edof,
        const doubleVec& ElProp,
        int nnd,
        int nel,
        int nnl,
        int npv,
        const intVec& pv_ind,
        const doubleVec& pv,
        const intVec& load_ind,
        const doubleVec& preloads,
        const Ice::Current&);

private:
    WorkQueuePtr workQueue;
};

remoteExecutionI::remoteExecutionI(const WorkQueuePtr& workQueue) :
    workQueue(workQueue)
{
}

void remoteExecutionI::execute_async(
    const remoteComputation::AMD_remoteExecution_executePtr& cb,
    const doubleMat& Coords,
    const intMat& Edof,
    const doubleVec& ElProp,
    int nnd,
    int nel,
    int nnl,
    int npv,
    const intVec& pv_ind,
    const doubleVec& pv,
    const intVec& load_ind,
    const doubleVec& preloads,
    const Ice::Current&)
{
    workQueue->add(cb, Coords, Edof, ElProp, nnd,
                   nel, nnl, npv, pv_ind, pv, load_ind, preloads);
}

Listing A.8: Header file for the Job class

#ifndef Job_h
#define Job_h

#include <remoteComputation.h>

#include <IceUtil/IceUtil.h>
using namespace remoteComputation;

class Job : public IceUtil::Shared
{
public:
    Job(const AMD_remoteExecution_executePtr& cb,
        const doubleMat& Coords,
        const intMat& Edof,
        const doubleVec& ElProp,
        int nnd,
        int nel,
        int nnl,
        int npv,
        const intVec& pv_ind,
        const doubleVec& pv,
        const intVec& load_ind,
        const doubleVec& preloads);

    void execute();

    void notifyJobCancelled()
    {
        cb->ice_exception(RequestCanceledException());
    }

private:
    bool solve_system();

    AMD_remoteExecution_executePtr cb;
    doubleMat Coords;
    intMat Edof;
    doubleVec ElProp;
    int nnd, nel, nnl, npv;
    intVec pv_ind, load_ind;
    doubleVec pv, preloads;
    doubleMat ElemForces;
    doubleVec F, u;
};

typedef IceUtil::Handle<Job> JobPtr;

#endif

Listing A.9: Implementation of the Job class

#include <Job.h>
#include <FMatrix.h>

Job::Job(const AMD_remoteExecution_executePtr& cb,
         const doubleMat& Coords,
         const intMat& Edof,
         const doubleVec& ElProp,
         int nnd,
         int nel,
         int nnl,
         int npv,
         const intVec& pv_ind,
         const doubleVec& pv,
         const intVec& load_ind,
         const doubleVec& preloads) :
    cb(cb), Coords(Coords), Edof(Edof), ElProp(ElProp), nnd(nnd),
    nel(nel), nnl(nnl), npv(npv), pv_ind(pv_ind),
    pv(pv), load_ind(load_ind), preloads(preloads)
{
}

void Job::execute()
{
    if (!solve_system()) {
        cb->ice_exception(RangeError());
        return;
    }
    cb->ice_response(ElemForces, F, u);
}

extern "C" __declspec(dllimport) short
__cdecl execute_fortran(REAL* Coords, INTEGER* Edof, REAL* ElProp, REAL* F,
                        REAL* u, INTEGER* Pre, INTEGER* nnd, INTEGER* nel,
                        INTEGER* npv, REAL* ElemForces);

bool Job::solve_system()
{
    FMATRIX<REAL>    Coords_fortran(Coords, nnd, 2);
    FMATRIX<INTEGER> Edof_fortran(Edof, nel, 6);
    FMATRIX<REAL>    ElProp_fortran(ElProp, 3, 1);
    FMATRIX<REAL>    ElemForces_fortran(nel, 3);

    FMATRIX<REAL> F_fortran(nnd*2, 1);
    F_fortran.initiate_values(load_ind, preloads);
    FMATRIX<REAL> u_fortran(nnd*2, 1);
    u_fortran.initiate_values(pv_ind, pv);
    FMATRIX<INTEGER> Pre_fortran(nnd*2, 1);
    Pre_fortran.initiate_values(pv_ind);

    execute_fortran(Coords_fortran, Edof_fortran, ElProp_fortran,
                    F_fortran, u_fortran, Pre_fortran, &nnd,
                    &nel, &npv, ElemForces_fortran);

    F_fortran.toSTL(F);
    u_fortran.toSTL(u);
    return true;
}


Appendix B

Parallel application code

In this appendix the code for the parallel application is shown. The application is divided into five modules.
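All of the listings below also use a small shared module, publicVars, which is not reproduced in this appendix. As a reading aid, a minimal sketch of what such a module needs to provide for the listings to compile is given here; its exact contents are an assumption, apart from the real-kind parameter ap that appears in literals such as 1.0_ap in the element routine.

! Minimal sketch only -- the actual publicVars module is not part of this appendix.
module publicVars
  implicit none
  ! Kind parameter used for literals such as 1.0_ap in elementMatrix;
  ! real(8) is used throughout the listings, so kind 8 is assumed here.
  integer, parameter :: ap = 8
end module publicVars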

B.1 Main program

Listing B.1: main.f90

program main

  use publicVars
  use inout
  use stress
  use fem
  use solve

  implicit none

  integer, parameter :: infile = 15, outfile = 16

  integer :: nnd, nel, nnl, npv, nl
  real(8), dimension(:,:), allocatable :: Coords, ElemForces
  real(8), dimension(:), allocatable :: F, u, ElProp

  ! Pre is the vector describing which node
  ! displacements that are prescribed
  integer, dimension(:,:), allocatable :: Edof
  integer, dimension(:), allocatable :: Pre
  character(60) geometryfile, bcelpropfile, outputfile
  integer :: Ktest(3,3), utest(3), ftest(3), K_sub(11), ia_sub(6), &
             ja_sub(11), Pre_sub(5), nnz_sub
  integer :: i, testvec(2), updof(3)

  geometryfile = "geometry.txt"
  bcelpropfile = "bcelprop.txt"
  outputfile = "utdata.dat"

  call readproblemsize(nnd, nel, nnl, npv)

  allocate(Coords(nnd,2))
  allocate(Edof(nel,6))
  allocate(ElProp(3))
  allocate(F(nnd*2))
  allocate(u(nnd*2))
  allocate(Pre(nnd*2))
  allocate(ElemForces(nel,3))

  call readbcandelprop(F, u, Pre, ElProp, nnl, npv, nnd, bcelpropfile)

  call readgeometry(Coords, Edof, nnd, nel)

  call execute(Coords, Edof, ElProp, F, u, Pre, nnd, &
               nel, npv, ElemForces)

  deallocate(Coords)
  deallocate(Edof)
  deallocate(ElProp)
  deallocate(F)
  deallocate(u)
  deallocate(Pre)
  deallocate(ElemForces)

end program main

B.2 File reading/writing

Listing B.2: inout.f90

module inout

  use publicVars
contains

  ! Subroutine that parses the input file and produces the system matrices.

  subroutine readproblemsize(nnd, nel, nnl, npv)

    integer :: nnd, nel, nnl, npv
    integer, parameter :: infile = 15

    open(unit=infile, file='meshing/triangle.1.node', access='sequential', &
         action='read', status='old')
    read(infile,*) nnd
    close(infile)

    open(unit=infile, file='meshing/triangle.1.ele', access='sequential', &
         action='read', status='old')
    read(infile,*) nel
    close(infile)

    open(unit=infile, file='bcelprop.txt', access='sequential', &
         action='read', status='old')
    read(infile,*) nnl, npv
    close(infile)

  end subroutine readproblemsize

  subroutine readgeometry(Coords, Edof, nnd, nel)

    integer :: i, j, dummy, nodeNbrs(3), Edof(nel,6)
    integer, parameter :: infile = 15
    real(8) :: Coords(nnd,2)

    open(unit=infile, file='meshing/triangle.1.node', access='sequential', &
         action='read', status='old')

    read(infile,*)
    write(*,*) 'nnd:'
    write(*,*) nnd
    write(*,*) 'nel:'
    write(*,*) nel
    do j=1,nnd
      read(infile,*) dummy, (Coords(j,i), i=1,2)
    end do

    close(infile)

    open(unit=infile, file='meshing/triangle.1.ele', access='sequential', &
         action='read', status='old')

    read(infile,*)

    do j=1,nel
      ! write(*,*) j
      read(infile,*) dummy, (nodeNbrs(i), i=1,3)
      do i=1,3
        Edof(j,2*i-1) = 2*nodeNbrs(i)-1
        Edof(j,2*i)   = 2*nodeNbrs(i)
      end do
    end do

    close(infile)

  end subroutine readgeometry

  subroutine readbcandelprop(F, u, Pre, ElProp, nnl, npv, nnd, inputfile)

    implicit none

    integer, parameter :: infile = 15
    integer :: nnl, npv, nnd, i, j, node, dof, Pre(nnd*2)
    real(8) :: F(nnd*2), u(nnd*2), ElProp(3)
    character*(*) inputfile

    open(unit=infile, file=inputfile, access='sequential', &
         action='read', status='old')

    read(infile,*) nnl, npv

    F = 0
    if (nnl > 0) then
      do j=1,nnl
        read(infile,*) node, dof, F(2*node+dof-2)
      end do
    end if

    Pre = 0
    u = 0
    do j=1,npv
      read(infile,*) node, dof, u(2*node+dof-2)
      Pre(2*node+dof-2) = 1
    end do

    read(infile,*) (ElProp(i), i=1,3)

  end subroutine readbcandelprop

  subroutine writeresult(F, u, Pre, nnd, nel, outputfile, ElemForces)

    integer, parameter :: outfile = 16
    real(8), dimension(nnd*2) :: F, u
    integer :: i, nnd, nel, Pre(nnd*2)
    character*(*) outputfile
    real(8), dimension(nel,3) :: ElemForces

    open(unit=outfile, file=outputfile, access='sequential', &
         action='write', status='unknown')

    write(outfile,*) 'Forces:'
    write(outfile,*) '-------------------------------------'
    write(outfile,'(T2,A,T15,A,T32,A)') 'Node', 'Fx [N]', 'Fy [N]'
    write(outfile,*) '-------------------------------------'
    do i=1,size(F,1)/2
      ! if (Pre(i) == 1) then
      write(outfile,'(T2,I0,T10,2F15.8)') i, F(2*i-1), F(2*i)
      ! end if
    end do
    write(outfile,*) '-------------------------------------'
    write(outfile,*) ' '
    write(outfile,*) ' '
    write(outfile,*) 'Displacements:'
    write(outfile,*) '-------------------------------------'
    write(outfile,'(T2,A,T10,A,T32,A)') 'Node', 'ux [m]', 'uy [m]'
    write(outfile,*) '-------------------------------------'
    do i=1,size(u,1)/2
      ! if (Pre(i) == 0) then
      write(outfile,'(T2,I0,T10,F10.6,T32,F10.6)') i, u(2*i-1), u(2*i)
      ! end if
    end do
    write(outfile,*) '-------------------------------------------------------------------------'
    write(outfile,*) ' '
    write(outfile,*) ' '
    write(outfile,*) 'Stresses:'
    write(outfile,*) '-------------------------------------------------------------------------'
    write(outfile,'(T2,A,T17,A,T37,A,T57,A)') 'Element', &
          'sigmax [N/m2]', 'sigmay [N/m2]', 'sigmaxy [N/m2]'
    write(outfile,*) '-------------------------------------------------------------------------'
    do i=1,size(ElemForces,1)
      write(outfile,'(T2,I0,T17,F17.7,T37,F17.7,T57,F17.7)') i, ElemForces(i,1), &
            ElemForces(i,2), ElemForces(i,3)
    end do
    write(outfile,*) '-------------------------------------------------------------------------'

    close(outfile)
  end subroutine writeresult

end module inout

B.3 FEM calculations

Listing B.3: fem.f90

module fem
  ! Module containing fe related methods like assembling and
  ! element methods
  ! also contains a function returning a systems bandwidth
  use publicVars
contains

  subroutine getnbrdelcolvec(nbrdelcolvec, Pre, updof)

    integer :: i, nbrdelcol = 0, updofidx = 0
    integer, dimension(:) :: nbrdelcolvec, Pre, updof

    nbrdelcolvec = 0

    do i=1,size(Pre)
      if (Pre(i) .ne. 0) then
        nbrdelcol = nbrdelcol+1
      else
        updofidx = updofidx+1
        updof(updofidx) = i
      end if
      nbrdelcolvec(i) = nbrdelcol
    end do

  end subroutine getnbrdelcolvec

  subroutine assemCOO(edof, Pre, u, F, K_sparse, Ke, nnz, ik, jk, nbrdelcolvec)

    integer :: i, j, nnz, n, row, iglob, jglob
    integer, dimension(:) :: ik, jk, edof, Pre, nbrdelcolvec
    real(8), dimension(:) :: K_sparse, u, F
    real(8), dimension(:,:) :: Ke

    n = size(Ke,1)
    row = 0

    do i=1,n
      iglob = edof(i)
      if (Pre(iglob) .eq. 0) then
        row = row+1
        do j=1,n
          jglob = edof(j)
          if (Ke(i,j) .ne. 0) then
            if (Pre(jglob) .ne. 0) then
              F(jglob) = F(jglob) - Ke(i,j)*u(jglob)
            else if (jglob .ge. iglob) then
              nnz = nnz+1
              K_sparse(nnz) = Ke(i,j)
              ik(nnz) = iglob - nbrdelcolvec(iglob)
              jk(nnz) = jglob - nbrdelcolvec(jglob)
            end if
          end if
        end do
      end if
    end do

  end subroutine assemCOO

  subroutine COOtoCSR(K_sparse, ik, jk, nnz, neq)

    ! internal work arrays for sparskit2 methods
    integer, allocatable :: indu(:)
    integer, allocatable :: iwk(:)

    real(8), dimension(:) :: K_sparse
    integer, dimension(:) :: ik, jk
    integer :: nnz, neq

    allocate(indu(neq))
    allocate(iwk(neq+1))

    ! write(*,*) 'before coicsr'
    !----- Convert to compressed sparse row format -----
    call coicsr(neq, nnz, 1, K_sparse, jk, ik, iwk)

    ! write(*,*) 'before clncsr'
    !----- Add duplicate elements and sort vectors -----
    call clncsr(3, 1, neq, K_sparse, jk, ik, indu, iwk)

    nnz = ik(neq+1)-1

    deallocate(indu, iwk)   ! deallocate internal work arrays

  end subroutine COOtoCSR
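  ! -----------------------------------------------------------------------
  ! Illustration (added comment, not part of the original source):
  ! a small worked example of the two storage formats handled above, for
  ! the symmetric matrix below, where only the upper triangle is stored,
  ! as in assemCOO:
  !
  !        | 4  1  0 |
  !    A = | 1  5  2 |
  !        | 0  2  6 |
  !
  ! COO (as produced by assemCOO):
  !    K_sparse = (4, 1, 5, 2, 6)
  !    ik       = (1, 1, 2, 2, 3)      row indices
  !    jk       = (1, 2, 2, 3, 3)      column indices
  !
  ! CSR (after coicsr/clncsr, the layout expected by PARDISO):
  !    values   = (4, 1, 5, 2, 6)
  !    columns  = (1, 2, 2, 3, 3)
  !    rowptr   = (1, 3, 5, 6)         rowptr(i) points to the first entry
  !                                    of row i; rowptr(n+1) = nnz+1
  ! -----------------------------------------------------------------------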

  subroutine coicsr(n, nnz, job, a, ja, ia, iwk)
    integer ia(nnz), ja(nnz), iwk(n+1)
    real*8 a(*)
    !------------------------------------------------------------------------
    ! IN-PLACE coo-csr conversion routine.
    !------------------------------------------------------------------------
    ! this subroutine converts a matrix stored in coordinate format into
    ! the csr format. The conversion is done in place in that the arrays
    ! a, ja, ia of the result are overwritten onto the original arrays.
    !------------------------------------------------------------------------
    ! on entry:
    !---------
    ! n   = integer. row dimension of A.
    ! nnz = integer. number of nonzero elements in A.
    ! job = integer. Job indicator. when job=1, the real values in a are
    !       filled. Otherwise a is not touched and the structure of the
    !       array only (i.e. ja, ia) is obtained.
    ! a   = real array of size nnz (number of nonzero elements in A)
    !       containing the nonzero elements
    ! ja  = integer array of length nnz containing the column positions
    !       of the corresponding elements in a.
    ! ia  = integer array of length nnz containing the row positions
    !       of the corresponding elements in a.
    ! iwk = integer work array of length n+1
    ! on return:
    !----------
    ! a
    ! ja
    ! ia  = contains the compressed sparse row data structure for the
    !       resulting matrix.
    ! Note:
    !-------
    !       the entries of the output matrix are not sorted (the column
    !       indices in each are not in increasing order) use coocsr
    !       if you want them sorted.
    !------------------------------------------------------------------------
    ! Coded by Y. Saad, Sep. 26 1989
    !------------------------------------------------------------------------
    real*8 t, tnext
    logical values
    !------------------------------------------------------------------------
    values = (job .eq. 1)
    ! find pointer array for resulting matrix.
    do 35 i=1,n+1
      iwk(i) = 0
35  continue
    do 4 k=1,nnz
      i = ia(k)
      iwk(i+1) = iwk(i+1)+1
4   continue
    !------------------------------------------------------------------------
    iwk(1) = 1
    do 44 i=2,n
      iwk(i) = iwk(i-1) + iwk(i)
44  continue
    !
    ! loop for a cycle in chasing process.
    !
    init = 1
    k = 0
5   if (values) t = a(init)
    i = ia(init)
    j = ja(init)
    ia(init) = -1
    !------------------------------------------------------------------------
6   k = k+1
    ! current row number is i. determine where to go.
    ipos = iwk(i)
    ! save the chased element.
    if (values) tnext = a(ipos)
    inext = ia(ipos)
    jnext = ja(ipos)
    ! then occupy its location.
    if (values) a(ipos) = t
    ja(ipos) = j
    ! update pointer information for next element to come in row i.
    iwk(i) = ipos+1
    ! determine next element to be chased,
    if (ia(ipos) .lt. 0) goto 65
    t = tnext
    i = inext
    j = jnext
    ia(ipos) = -1
    if (k .lt. nnz) goto 6
    goto 70
65  init = init+1
    if (init .gt. nnz) goto 70
    if (ia(init) .lt. 0) goto 65
    ! restart chasing --
    goto 5
70  do 80 i=1,n
      ia(i+1) = iwk(i)
80  continue
    ia(1) = 1
    return
    !----------------- end of coicsr ----------------------------------------
    !------------------------------------------------------------------------
  end subroutine coicsr

  subroutine clncsr(job, value2, nrow, a, ja, ia, indu, iwk)
    ! .. Scalar Arguments ..
    integer job, nrow, value2
    ! ..
    ! .. Array Arguments ..
    integer ia(nrow+1), indu(nrow), iwk(nrow+1), ja(*)
    real*8 a(*)
    ! ..
    !
    ! This routine performs two tasks to clean up a CSR matrix
    !  -- remove duplicate/zero entries,
    !  -- perform a partial ordering, new order: lower triangular part,
    !     main diagonal, upper triangular part.
    !
    ! On entry:
    !
    ! job = options
    !       0 -- nothing is done
    !       1 -- eliminate duplicate entries, zero entries.
    !       2 -- eliminate duplicate entries and perform partial ordering.
    !       3 -- eliminate duplicate entries, sort the entries in the
    !            increasing order of column indices.
    !
    ! value2 -- 0 the matrix is pattern only (a is not touched)
    !           1 matrix has values too.
    ! nrow      -- row dimension of the matrix
    ! a, ja, ia -- input matrix in CSR format
    !
    ! On return:
    ! a, ja, ia -- cleaned matrix
    ! indu      -- pointers to the beginning of the upper triangular
    !              portion if job > 1
    !
    ! Work space:
    ! iwk -- integer work space of size nrow+1
    !
    ! .. Local Scalars ..
    integer i, j, k, ko, ipos, kfirst, klast
    real*8 tmp
    ! ..
    !
    if (job .le. 0) return
    !
    ! .. eliminate duplicate entries --
    ! array INDU is used as marker for existing indices, it is also the
    ! location of the entry.
    ! IWK is used to store the old IA array.
    ! matrix is copied to squeeze out the space taken by the duplicated
    ! entries.
    !
    do 90 i = 1, nrow
      indu(i) = 0
      iwk(i) = ia(i)
90  continue
    iwk(nrow+1) = ia(nrow+1)
    k = 1
    do 120 i = 1, nrow
      ia(i) = k
      ipos = iwk(i)
      klast = iwk(i+1)
100   if (ipos .lt. klast) then
        j = ja(ipos)
        if (indu(j) .eq. 0) then
          ! .. new entry ..
          if (value2 .ne. 0) then
            if (a(ipos) .ne. 0.0D0) then
              indu(j) = k
              ja(k) = ja(ipos)
              a(k) = a(ipos)
              k = k + 1
            end if
          else
            indu(j) = k
            ja(k) = ja(ipos)
            k = k + 1
          end if
        else if (value2 .ne. 0) then
          ! .. duplicate entry ..
          a(indu(j)) = a(indu(j)) + a(ipos)
        end if
        ipos = ipos + 1
        go to 100
      end if
      ! .. remove marks before working on the next row ..
      do 110 ipos = ia(i), k - 1
        indu(ja(ipos)) = 0
110   continue
120 continue
    ia(nrow+1) = k
    if (job .le. 1) return
    !
    ! .. partial ordering ..
    ! split the matrix into strict upper/lower triangular
    ! parts, INDU points to the beginning of the upper part.
    !
    do 140 i = 1, nrow
      klast = ia(i+1) - 1
      kfirst = ia(i)
130   if (klast .gt. kfirst) then
        if (ja(klast) .lt. i .and. ja(kfirst) .ge. i) then
          ! .. swap klast with kfirst ..
          j = ja(klast)
          ja(klast) = ja(kfirst)
          ja(kfirst) = j
          if (value2 .ne. 0) then
            tmp = a(klast)
            a(klast) = a(kfirst)
            a(kfirst) = tmp
          end if
        end if
        if (ja(klast) .ge. i) then
          klast = klast - 1
        end if
        if (ja(kfirst) .lt. i) then
          kfirst = kfirst + 1
        end if
        go to 130
      end if
      !
      if (ja(klast) .lt. i) then
        indu(i) = klast + 1
      else
        indu(i) = klast
      end if
140 continue
    if (job .le. 2) return
    !
    ! .. order the entries according to column indices
    !    bubble-sort is used
    !
    do 190 i = 1, nrow
      do 160 ipos = ia(i), indu(i)-1
        do 150 j = indu(i)-1, ipos+1, -1
          k = j - 1
          if (ja(k) .gt. ja(j)) then
            ko = ja(k)
            ja(k) = ja(j)
            ja(j) = ko
            if (value2 .ne. 0) then
              tmp = a(k)
              a(k) = a(j)
              a(j) = tmp
            end if
          end if
150     continue
160   continue
      do 180 ipos = indu(i), ia(i+1)-1
        do 170 j = ia(i+1)-1, ipos+1, -1
          k = j - 1
          if (ja(k) .gt. ja(j)) then
            ko = ja(k)
            ja(k) = ja(j)
            ja(j) = ko
            if (value2 .ne. 0) then
              tmp = a(k)
              a(k) = a(j)
              a(j) = tmp
            end if
          end if
170     continue
180   continue
190 continue
    return
    !---- end of clncsr ------------------------------------------------------
    !--------------------------------------------------------------------------
  end subroutine clncsr

  subroutine submatrixCSR(neq, K, ia, ja, nnz_sub, Pre, updof)
    ! Performs an in-place extraction of the submatrix to be used for the
    ! pardiso solver, also calculates the vector updof containing the
    ! unprescribed degrees of freedom

    integer, dimension(:) :: Pre
    real(8), dimension(:) :: K
    integer, dimension(:) :: ia, ja, updof
    integer, dimension(:), allocatable :: nbrdelcolvec
    integer :: i, neq, nnz, nnz_sub, row = 0, tmp_rowstart, nbrdelcol, col

    nnz_sub = 0   ! initialize number of non-zeros in submatrix

    allocate(nbrdelcolvec(neq))
    nbrdelcolvec = 0
    nbrdelcol = 0

    do i=1,neq
      if (Pre(i) .ne. 0) then
        nbrdelcol = nbrdelcol + 1
      end if
      nbrdelcolvec(i) = nbrdelcol
    end do

    do i=1,neq                       ! loop over all rows of K
      if (Pre(i) .eq. 0) then        ! this dof is unprescribed
        row = row+1                  ! the submatrix has a new row
        updof(row) = i
        tmp_rowstart = nnz_sub+1     ! index where row in submatrix starts
        do j=1,ia(i+1)-ia(i)         ! loop over columns in this row
          col = ja(ia(i) + j - 1)
          if (Pre(col) .eq. 0) then  ! found a nonzero to be added
            nnz_sub = nnz_sub+1
            K(nnz_sub) = K(ia(i) + j-1)              ! overwrite into K
            ja(nnz_sub) = col - nbrdelcolvec(col)    ! overwrite the column into the old vector
          end if
        end do
        ia(row) = tmp_rowstart
      end if
    end do

    ia(row+1) = nnz_sub+1   ! Add special last entry

    deallocate(nbrdelcolvec)

  end subroutine submatrixCSR

  subroutine loadBANDmul(bw, neq, K, u, f)

    integer :: bw, neq, j, i
    real(8), dimension(:,:) :: K
    real(8), dimension(:) :: u, f

    do j=1,neq                ! loop through the displacement vector
      if (u(j) .ne. 0) then   ! if variable prescribed (not to zero) perform multiplication
        do i=1,neq
          if (i .gt. j .and. i-j .le. bw) then        ! below diagonal, make symmetric shift
            f(i) = f(i) - K(j,i-j+1)*u(j)
          else if (i .le. j .and. j-i .le. bw) then   ! above diagonal
            f(i) = f(i) - K(i,j-i+1)*u(j)
          end if
        end do
      end if
    end do

  end subroutine loadBANDmul

  subroutine elementMatrix(Ke, Edof, Coords, ElProp, elnum)

    ! f2py intent(in,out) Ke

    integer :: elnum
    real(8) :: E, v, t, detC, x1, x2, x3, y1, y2, y3
    real(8), dimension(6,6) :: C, invC, Ke
    real(8), dimension(3,3) :: D
    real(8), dimension(3,6) :: Q
    real(8), dimension(:,:) :: Coords
    real(8) :: ElProp(:)
    integer, dimension(:,:) :: Edof

    E = ElProp(1)
    v = ElProp(2)
    t = ElProp(3)

    x1 = Coords(Edof(elnum,2)/2,1)
    x2 = Coords(Edof(elnum,4)/2,1)
    x3 = Coords(Edof(elnum,6)/2,1)
    y1 = Coords(Edof(elnum,2)/2,2)
    y2 = Coords(Edof(elnum,4)/2,2)
    y3 = Coords(Edof(elnum,6)/2,2)

    C(1,:) = (/ 1.0_ap, x1, y1, 0.0_ap, 0.0_ap, 0.0_ap /)
    C(2,:) = (/ 0.0_ap, 0.0_ap, 0.0_ap, 1.0_ap, x1, y1 /)
    C(3,:) = (/ 1.0_ap, x2, y2, 0.0_ap, 0.0_ap, 0.0_ap /)
    C(4,:) = (/ 0.0_ap, 0.0_ap, 0.0_ap, 1.0_ap, x2, y2 /)
    C(5,:) = (/ 1.0_ap, x3, y3, 0.0_ap, 0.0_ap, 0.0_ap /)
    C(6,:) = (/ 0.0_ap, 0.0_ap, 0.0_ap, 1.0_ap, x3, y3 /)

    D(1,:) = E/(1-v*v)*(/ 1.0_ap, v, 0.0_ap /)
    D(2,:) = E/(1-v*v)*(/ v, 1.0_ap, 0.0_ap /)
    D(3,:) = E/(1-v*v)*(/ 0.0_ap, 0.0_ap, (1-v)/2 /)

    Q(1,:) = (/ 0.0_ap, 1.0_ap, 0.0_ap, 0.0_ap, 0.0_ap, 0.0_ap /)
    Q(2,:) = (/ 0.0_ap, 0.0_ap, 0.0_ap, 0.0_ap, 0.0_ap, 1.0_ap /)
    Q(3,:) = (/ 0.0_ap, 0.0_ap, 1.0_ap, 0.0_ap, 1.0_ap, 0.0_ap /)

    detC = x1*(y2-y3) + x2*(y3-y1) + x3*(y1-y2)

    call matinv(6, C, invC)

    Ke = matmul(matmul(matmul(matmul(transpose(invC), &
         transpose(Q)), D), Q), invC)*0.5*detC*t

    return

  end subroutine elementMatrix


  integer function bandWidth(Edof)

    integer, dimension(:,:) :: Edof
    integer :: elemWidth(size(Edof,1))

    do i=1,size(Edof,1)
      elemWidth(i) = maxval(Edof(i,:)) - minval(Edof(i,:)) + 1
    end do

    bandWidth = maxval(elemWidth)

  end function bandWidth

  subroutine countnnz(K, bw, nnd, nnz)
    integer :: bw, nnz, nnd
    real(8), dimension(:,:) :: K

    nnz = 0
    do i=1,nnd*2
      do j=1,bw
        if (K(i,j) .ne. 0) then
          nnz = nnz + 1
        end if
      end do
    end do

  end subroutine countnnz

  subroutine BANDtoCSR(K, Kvec, ia, ja, bw, nnd)

    integer, dimension(:) :: ia, ja
    real(8), dimension(:,:) :: K
    real(8), dimension(:) :: Kvec
    integer :: colidx, bw, nnz = 0

    ia = 0

    do i=1,nnd*2
      colidx = i
      do j=1,bw
        if (K(i,j) .ne. 0) then
          nnz = nnz + 1        ! increment non-zero counter
          Kvec(nnz) = K(i,j)   ! insert the non-zero element
          ja(nnz) = colidx     ! the inserted non-zero's column index
          if (ia(i) .eq. 0) then
            ia(i) = nnz
          end if
        end if
        colidx = colidx + 1
      end do
    end do

    ia(nnd*2+1) = nnz+1   ! Add special last entry

  end subroutine BANDtoCSR


  subroutine assem(Edof, Ke, elnum, K)

    ! f2py intent(in,out) K

    integer, dimension(:,:) :: Edof
    real(8), dimension(:,:) :: Ke, K
    integer :: elnum, n
    integer, dimension(size(Edof,2)) :: eldof

    eldof = Edof(elnum,:)

    n = size(Ke,1)
    do j=1,n
      do i=j,n
        if (eldof(i) >= eldof(j)) then
          if (abs(Ke(j,i)) .gt. 1e-12) then
            K(eldof(j), eldof(i)-eldof(j)+1) = &
              K(eldof(j), eldof(i)-eldof(j)+1) + Ke(j,i)
          end if
        else
          if (abs(Ke(i,j)) .gt. 1e-12) then
            K(eldof(i), eldof(j)-eldof(i)+1) = &
              K(eldof(i), eldof(j)-eldof(i)+1) + Ke(j,i)
          end if
        end if
      end do
    end do

  end subroutine assem

  subroutine matinv(n, a, ai)

    !
    ! This subroutine inverts a matrix "a" and returns the inverse in "ai"
    ! n  - Input by user, an integer specifying the size of the matrix to be inverted.
    ! a  - Input by user, an n by n real array containing the matrix to be inverted.
    ! ai - Returned by subroutine, an n by n real array containing the inverted matrix.
    ! d  - Work array, an n by 2n real array used by the subroutine.
    ! io - Work array, a 1-dimensional integer array of length n used by the subroutine.
    !
    ! From http://www.mae.usu.edu/faculty/wphillips/MAE6510.html
    ! Modified by Jonas Lindemann
    !

    implicit real (A-I,M-Z)

    integer :: n, i, j, m
    real(8) :: a(n,n)
    real(8) :: ai(n,n)

    integer, allocatable :: io(:)
    real(8), allocatable :: d(:,:)

    ! !$omp parallel private(i,j,m,io,d)

    allocate(io(n))
    allocate(d(n,2*n))

    ! Fill in the "io" and "d" matrix.
    ! ********************************
    do i=1,n
      io(i) = i
    end do
    do i=1,n
      do j=1,n
        d(i,j) = a(i,j)
        if (i .eq. j) then
          d(i,n+j) = 1.0_8
        else
          d(i,n+j) = 0.0_8
        end if
      end do
    end do

    ! Scaling
    ! *******
    do i=1,n
      m = 1
      do k=2,n
        if (abs(d(i,k)) .gt. abs(d(i,m))) m = k
      end do
      tmp = d(i,m)
      do k=1,2*n
        d(i,k) = d(i,k)/tmp
      end do
    end do

    ! Lower Elimination
    ! *****************
    do i=1,n-1

      ! Pivoting
      ! ********
      m = i
      do j=i+1,n
        if (abs(d(io(j),i)) .gt. abs(d(io(m),i))) m = j
      end do
      itmp = io(m)
      io(m) = io(i)
      io(i) = itmp

      ! Scale the Pivot element to unity
      ! ********************************
      r = d(io(i),i)
      do k=1,2*n
        d(io(i),k) = d(io(i),k)/r
      end do

      ! ********************************
      do j=i+1,n
        r = d(io(j),i)
        do k=1,2*n
          d(io(j),k) = d(io(j),k) - r*d(io(i),k)
        end do
      end do
    end do

    ! Upper Elimination
    ! *****************
    r = d(io(n),n)
    do k=1,2*n
      d(io(n),k) = d(io(n),k)/r
    end do
    do i=n-1,1,-1
      do j=i+1,n
        r = d(io(i),j)
        do k=1,2*n
          d(io(i),k) = d(io(i),k) - r*d(io(j),k)
        end do
      end do
    end do

    ! Fill Out "ai" matrix
    ! ********************
    do i=1,n
      do j=1,n
        ai(i,j) = d(io(i),n+j)
      end do
    end do

    deallocate(io)
    deallocate(d)

    ! !$omp end parallel

    return

  end subroutine matinv
end module fem
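The band storage convention used by assem, loadBANDmul and BANDtoCSR above is only implicit in their indexing. The following stand-alone snippet, written here purely for illustration (it is not part of the original sources), shows the assumed mapping from a full symmetric matrix entry A(i,j) to the upper-band array K(i, j-i+1), with the semi-bandwidth bw as returned by bandWidth.

program band_storage_demo
  ! Illustrative sketch only: the upper-band storage K(i, j-i+1) assumed by
  ! the assem/loadBANDmul/BANDtoCSR routines in fem.f90.
  implicit none
  integer, parameter :: n = 4, bw = 2   ! bw = semi-bandwidth incl. diagonal
  real(8) :: A(n,n), K(n,bw)
  integer :: i, j

  ! A small symmetric test matrix with semi-bandwidth 2
  A = 0.0_8
  do i = 1, n
    A(i,i) = 4.0_8
    if (i < n) then
      A(i,i+1) = -1.0_8
      A(i+1,i) = -1.0_8
    end if
  end do

  ! Store the upper band only: full column index j maps to band column j-i+1
  K = 0.0_8
  do i = 1, n
    do j = i, min(n, i+bw-1)
      K(i, j-i+1) = A(i,j)
    end do
  end do

  do i = 1, n
    write(*,'(2F6.1)') K(i,:)
  end do
end program band_storage_demo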

B.4 Execution

Listing B.4: stress.f90

module stress

  use publicVars
  use fem
  use solve
  use inout

contains

  subroutine execute(Coords, Edof, ElProp, F, u, Pre, nnd, &
                     nel, npv, ElemForces)

    ! f2py intent(in,out) F
    ! f2py intent(in,out) u
    ! f2py intent(in,out) ElemForces

    implicit none

    integer :: nnd, nel, bw, ierr, i, j, nnz, npv, neq
    real(8), dimension(:,:) :: Coords
    integer, dimension(:,:) :: Edof
    real(8), dimension(:,:) :: ElemForces
    real(8), dimension(:) :: F, u

    real(8), dimension(:), allocatable :: u_pardiso, F_pardiso
    integer, dimension(:) :: Pre
    real(8) :: start, end

    real(8) omp_get_wtime
    external omp_get_wtime

    integer, dimension(:), allocatable :: updof   ! vector containing the unprescribed variables
    integer :: nnz_sub
    real(8), dimension(:,:), allocatable :: K
    ! Sparse system matrix storage
    real(8), dimension(:), allocatable :: Kvec
    integer, dimension(:), allocatable :: ia, ja
    ! Element matrix
    real(8), dimension(6,6) :: Ke
    ! Element stresses and properties
    real(8) :: sigmae(3), ElProp(3)
    integer, dimension(:), allocatable :: nbrdelcolvec, ik, jk
    real(8), dimension(:), allocatable :: K_sparse
    integer omp_get_thread_num
    external omp_get_thread_num

    neq = nnd*2

    allocate(nbrdelcolvec(size(Pre)))
    allocate(K_sparse(nel*21))
    allocate(ik(nel*21))
    allocate(jk(nel*21))
    allocate(updof((nnd*2-npv)))

    call getnbrdelcolvec(nbrdelcolvec, Pre, updof)
    nnz = 0

    start = omp_get_wtime()
    do i=1,nel
      call elementMatrix(Ke, Edof, Coords, ElProp, i)
      call assemCOO(Edof(i,:), Pre, u, F, K_sparse, Ke, nnz, ik, jk, nbrdelcolvec)
    end do
    end = omp_get_wtime()
    write(*,'(A,F)') 'assem time: ', end-start

    deallocate(nbrdelcolvec)

    start = omp_get_wtime()
    call COOtoCSR(K_sparse, ik, jk, nnz, neq)
    end = omp_get_wtime()
    write(*,'(A,F)') 'coo to csr time: ', end-start

    allocate(F_pardiso(size(updof)))
    allocate(u_pardiso(nnd*2-npv))

    F_pardiso = F(updof)

    u_pardiso = 0
    u = 0

    start = omp_get_wtime()
    call pardiso_solve(neq-npv, K_sparse, ik, jk, u_pardiso, F_pardiso)
    end = omp_get_wtime()
    write(*,'(A,F)') 'pardiso time: ', end-start

    u(updof) = u_pardiso

    deallocate(K_sparse)
    deallocate(ik)
    deallocate(jk)
    deallocate(updof)
    deallocate(u_pardiso)
    deallocate(F_pardiso)

    !$omp parallel do default(private) shared(Edof, Coords, ElProp, u, ElemForces, nel)
    do i=1,nel
      call calcelementforces(Edof, Coords, ElProp, i, sigmae, u)
      ElemForces(i,:) = sigmae
    end do
    !$omp end parallel do

    do i=1,nel
      do j=1,3
        write(*,*) ElemForces(i,j)
      end do
    end do

  end subroutine execute
end module stress

B.5 System solving

Listing B.5: solve.f90


module solve

contains

  subroutine pardiso_solve(n, a, ia, ja, x, b)
    ! .. Internal solver memory pointer for 64-bit architectures
    ! .. INTEGER*8 pt(64)
    ! .. Internal solver memory pointer for 32-bit architectures
    ! .. INTEGER*4 pt(64)
    ! .. This is OK in both cases
    INTEGER*8 pt(64)
    ! integer pt(64)
    ! .. All other variables
    integer :: maxfct, mnum, mtype, phase, n, nrhs, error, msglvl
    integer :: iparm(64)
    integer, dimension(:) :: ia, ja
    real(8), dimension(:) :: a, b, x
    integer :: i, idum
    real(8) :: waltime1, waltime2, ddum, mem
    integer omp_get_max_threads
    external omp_get_max_threads

    ! .. Fill all arrays containing matrix data.
    nrhs = 1
    maxfct = 1
    mnum = 1

    ! ..
    ! .. Set up PARDISO control parameters
    ! ..
    do i = 1, 64
      iparm(i) = 0
    end do

    iparm(1) = 1    ! no solver default
    iparm(2) = 2    ! fill-in reordering from METIS
    iparm(3) = omp_get_max_threads()   ! number of processors, value of OMP_NUM_THREADS
    iparm(4) = 0    ! no iterative-direct algorithm
    iparm(5) = 0    ! no user fill-in reducing permutation
    iparm(6) = 0    ! =0 solution on the first n components of x
    iparm(7) = 16   ! default logical fortran unit number for output
    iparm(8) = 9    ! number of iterative refinement steps
    iparm(9) = 0    ! not in use
    iparm(10) = 13  ! perturb the pivot elements with 1E-13
    iparm(11) = 1   ! use nonsymmetric permutation and scaling MPS
    iparm(12) = 0   ! not in use
    iparm(13) = 0   ! not in use
    iparm(14) = 0   ! Output: number of perturbed pivots
    iparm(15) = 0   ! not in use
    iparm(16) = 0   ! not in use
    iparm(17) = 0   ! not in use
    iparm(18) = -1  ! Output: number of nonzeros in the factor LU
    iparm(19) = -1  ! Output: Mflops for LU factorization
    iparm(20) = 0   ! Output: number of CG iterations
    error = 0       ! initialize error flag
    msglvl = 1      ! print statistical information
    mtype = 2       ! symmetric positive definite matrix

    ! .. Initialize the internal solver memory pointer. This is only
    !    necessary for the FIRST call of the PARDISO solver.
    do i = 1, 64
      pt(i) = 0
    end do

    ! .. Reordering and Symbolic Factorization. This step also allocates
    !    all memory that is necessary for the factorization
    phase = 11   ! only reordering and symbolic factorization
    write(*,'(A,I,A)') 'Solving system on ', iparm(3), ' processors'
    CALL pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, &
                 nrhs, iparm, msglvl, ddum, ddum, error)

    WRITE(*,*) 'Reordering completed ... '
    IF (error .NE. 0) THEN
      ! WRITE(*,*) 'The following ERROR was detected: ', error
      STOP
    END IF

    ! WRITE(*,*) 'Number of nonzeros in factors  = ', iparm(18)
    ! WRITE(*,*) 'Number of factorization MFLOPS = ', iparm(19)

    ! .. Factorization.
    phase = 22   ! only factorization
    CALL pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, &
                 nrhs, iparm, msglvl, ddum, ddum, error)

    ! WRITE(*,*) 'Factorization completed ... '
    IF (error .NE. 0) THEN
      ! WRITE(*,*) 'The following ERROR was detected: ', error
      STOP
    ENDIF

    ! .. Back substitution and iterative refinement
    iparm(8) = 2   ! max number of iterative refinement steps
    phase = 33     ! solve and iterative refinement
    CALL pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, &
                 nrhs, iparm, msglvl, b, x, error)

    do i=1,10
      write(*,*) x(i)
    end do

    ! WRITE(*,*) 'Solve completed ... '
    write(*,'(A,I)') 'iparm(7)  = ', iparm(7)
    write(*,'(A,I)') 'iparm(8)  = ', iparm(8)
    write(*,'(A,I)') 'iparm(14) = ', iparm(14)

    ! .. Termination and release of memory
    phase = -1   ! release internal memory
    CALL pardiso(pt, maxfct, mnum, mtype, phase, n, ddum, idum, idum, &
                 idum, nrhs, iparm, msglvl, ddum, ddum, error)
  end subroutine pardiso_solve

end module solve

