
Performance Comparison of Public Domain MPI

Implementations using Application Benchmarks

Yuan WAN

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2006


Abstract

The primary aim of this project is to investigate the performance of parallel scientific application codes on Linux clusters using different MPI implementations. To this end, several scientific applications have been ported to a Linux cluster and to a Solaris SMP machine, and benchmarked with the different MPI implementations installed. The application benchmark performance can be understood on the basis of the low-level results from standard MPI benchmark suites. Overall, the best implementation has been found to be MPICH2. The performance of the new OpenMPI implementation was acceptable but not as good as MPICH2 and LAM/MPI, and there were also some problems with its installation on the Solaris SMP machine. LAM/MPI proved able to take advantage of the high-performance InfiniBand interconnect and of the shared memory of multi-core nodes on another cluster machine.


Contents

Acknowledgements
Chapter 1 Introduction
Chapter 2 MPI Implementations and Machine Platforms
  2.1 MPI Implementations
    2.1.1 OpenMPI
    2.1.2 MPICH2
    2.1.3 LAM/MPI
  2.2 Machine Platforms
    2.2.1 e-Science cluster
    2.2.2 Lomond
    2.2.3 Scaliwag
    2.2.4 HPCx
Chapter 3 Methodology
  3.1 Performance Measure Method
  3.2 Plotting Benchmark Results
  3.3 Benchmark Strategy
  3.4 Correctness Check
  3.5 Summary
Chapter 4 Low Level MPI Benchmark
  4.1 Introduction of Intel MPI Benchmarks
  4.2 Benchmark results on Lomond
  4.3 Benchmark results on e-Science
  4.4 Porting IMB
  4.5 Summary
Chapter 5 Coursework Code Benchmark
  5.1 Introduction of the code
  5.2 Porting code
  5.3 Optimization of Coursework Code
  5.4 Benchmark on Lomond
  5.5 Benchmark on e-Science Cluster
  5.6 Summary
Chapter 6 PCHAN Benchmark
  6.1 Introduction of the code
  6.2 Porting code
  6.3 Problems and solutions in PCHAN porting work
  6.4 Benchmark on Lomond
  6.5 Benchmark on e-Science Cluster
  6.6 Summary
Chapter 7 NAMD Benchmark
  7.1 Introduction of the code
  7.2 Porting code
  7.3 Benchmark on Lomond
  7.4 Summary
Chapter 8 LAMMPS Benchmark
  8.1 Introduction of the code
  8.2 Porting code
  8.3 Benchmark on e-Science cluster
  8.4 Summary
Chapter 9 Infiniband and Shared Memory
  9.1 Introduction of Infiniband and Shared Memory
  9.2 Application Benchmark
  9.3 Summary
Chapter 10 Conclusion
Appendix A Code Porting Detail
  A.1 Porting MPI Coursework Code
  A.2 Porting PCHAN
  A.3 Porting NAMD
  A.4 Porting LAMMPS
Appendix B Work Plan and Final Outcome
  B.1 Original Plan
  B.2 Current states after two months
  B.3 Changes of original work plan
  B.4 Final Porting Outcome
Bibliography


List of Tables

Table 3.1 The proportion of PCHAN running time
Table 3.2 Example - Coursework Benchmark on e-Science Cluster with OpenMPI
Table 4.1 P2P message passing performance order of different MPI on Lomond
Table 4.2 P2P message passing performance of different MPI on e-Science cluster
Table 6.2 Parallel running time values of different optimisation levels
Table 6.1 PCHAN building parameter values for different MPI
Table 8.1 LAMMPS low-level Makefile flag values of different MPI on e-Science cluster
Table B.1 Original work plan timetable
Table B.2 Project Risk
Table B.3 Machine states
Table B.4 MPI install states
Table B.5 Code porting states
Table B.6 Final Project States


List of Figures

Figure 3.1 LAMMPS melt example Benchmark on e-Science cluster using MPICH2
Figure 3.2 NAMD apoa1 Benchmark on HPCx using native MPI
Figure 3.3 Example 1 - Absolute Time of NAMD apoa1 Benchmark on Lomond
Figure 3.4 Example 2 - Speedup of Coursework Code 1000x1000 Benchmark on Lomond
Figure 3.5 An example of CPU Time per Step - PCHAN T1 benchmark on e-Science
Figure 4.1 Pingpong bandwidth of middle messages on Lomond
Figure 4.2 Pingpong bandwidth of large messages on Lomond
Figure 4.3 IMB Pingpong Benchmark latency results on Lomond
Figure 4.4 IMB Allreduce Benchmark performance of small size message on Lomond
Figure 4.5 Pingpong bandwidth of middle messages on e-Science cluster
Figure 4.6 Pingpong bandwidth of large messages on e-Science cluster
Figure 4.7 IMB Pingpong Benchmark latency results on e-Science cluster
Figure 4.8 8-byte message Allreduce performance on e-Science cluster
Figure 4.9 192-byte message Allreduce performance on e-Science cluster
Figure 4.10 The Makefile template for OpenMPI on e-Science cluster
Figure 5.1 MPI Coursework Code updating image
Figure 5.4 Halo-swap optimisation of Coursework code
Figure 5.5 The MPI Coursework Code 1000x1000 Benchmark on Lomond
Figure 5.6 MPI Coursework Code 1000x1000 Benchmark results on e-Science Cluster
Figure 5.7 Speedup of Coursework Code 1000x1000 Benchmark on e-Science Cluster
Figure 5.8 Zoom in - Coursework Code 1000x1000 Benchmark on e-Science Cluster
Figure 6.1 PCHAN T1 Benchmark on HPCx Phase2a using native MPI
Figure 6.4 (L) The composition of parallel running time
Figure 6.5 (R) Extrapolated target parallel running time
Figure 6.9 CPU Time per Step of PCHAN T1 benchmark on Lomond
Figure 6.10 Vampir MPI trace views of PCHAN T1 benchmark
Figure 6.11 Low-level Pingpong benchmark result on Lomond
Figure 6.12 CPU Time per Step of PCHAN T1 benchmark on e-Science Cluster
Figure 6.13 Bandwidth results of Pingpong benchmark on e-Science cluster
Figure 6.14 Latency results of Pingpong benchmark on e-Science cluster
Figure 7.1 NAMD Apoa1 Benchmark on HPCx Phase2a using native MPI
Figure 7.5 NAMD Apoa1 benchmark results on Lomond
Figure 7.6 Speedup of NAMD Apoa1 benchmark results on Lomond
Figure 8.1 LAMMPS Rhodopsin benchmark on HPCx Phase2a using native MPI
Figure 8.4 Results of LAMMPS Rhodopsin benchmark on e-Science cluster
Figure 8.5 Vampir MPI trace views of LAMMPS Rhodopsin benchmark
Figure 8.6 Low-level Pingpong benchmark result
Figure 8.7 Low-level Allreduce benchmark result
Figure 9.1 The architecture of Infiniband
Figure 9.2 Multi-core compute node
Figure 9.3 Coursework code benchmark using two interconnects
Figure 9.4 Coursework code benchmark in multi and single mode on Scaliwag
Figure 5.2 Makefile for Coursework Code on Lomond using OpenMPI
Figure 5.3 Script for porting Coursework Code on Lomond using OpenMPI
Figure 6.2 Makefile target parameters for Lomond using MPICH2
Figure 6.3 Makefile target parameters for e-Science cluster using MPICH2
Figure 6.6 The procedure of OpenMPI installation with ifort on e-Science cluster
Figure 6.7 The procedure of MPICH2 installation with ifort on e-Science cluster
Figure 6.8 The procedure of LAM/MPI installation with ifort on e-Science cluster
Figure 7.2 NAMD porting procedure - install TCL, FFTW and plug-in
Figure 7.3 NAMD porting procedure - install Charm++
Figure 7.4 NAMD porting procedure - build NAMD
Figure 8.2 Installation of FFTW 2.1.5
Figure 8.3 LAMMPS low-level Makefile for MPICH2 on e-Science cluster


Acknowledgements

I would like to thank my supervisor, Dr David Henty, for his guidance throughout this project. I would like to thank Alan Gray of EPCC for providing me with the PCHAN code and with the NAMD and LAMMPS binaries on HPCx. I would also like to thank Dr Joachim Hein for giving me the apoa1 benchmark input for NAMD.


Chapter 1

Introduction

Today, the Message Passing Interface (MPI), a library specification for message passing, is widely used for solving significant scientific and engineering problems [1] on both massively parallel machines and workstation clusters. “The MPI specification was developed by the MPI Forum, a group of software developers, computer vendors, academics, and computer-science researchers whose goal was to develop a standard for writing message-passing programs that would be efficient, flexible, and portable.” [2] In 1993 an important outcome was published, which became known as the MPI standard (MPI-1); a revised version followed in 1995. The MPI standard has been well received, and today there are several public implementations [2]. MPI-2, which contains additions to MPI-1 including process creation and management, one-sided communication, extended collective operations, external interfaces, I/O and additional language bindings, was published in July 1997. Based on these two standards, a number of vendors and users have developed more than a dozen implementations. Normally, vendors install their own MPI implementations, optimised for the specific architecture and software of their machines. (For example, Sun MPI includes all MPI 1.1-compliant routines and a subset of the MPI-2-compliant routines; it is integrated with the Sun Cluster Runtime Environment and provides optimised collectives for SMPs.) These implementations are called native MPI because of their association with the machines, and using the native MPI gives the best performance in most cases. However, since Donald Becker and Thomas Sterling proposed the idea of building systems from COTS (Commodity Off The Shelf) components to satisfy specific computational requirements, following the success of the Beowulf Project begun in 1994, an increasing number of academic and research communities have preferred to build their HPC machine as a Linux cluster (Beowulf


machine) based on nodes of commodity workstations or servers connected by a high-performance interconnect, with an open-source (Linux) software infrastructure [11]. This kind of machine greatly reduces the cost of solving complicated scientific and engineering problems, from high-end HPC level down to mass-market server or even PC level. One issue with a Linux cluster is that there is no native MPI installed; in order to run message-passing parallel code on such machines, the standard approach is to install a public domain implementation of MPI. In recent years there has been tremendous progress in the development of public domain MPI, which benefits from the spirit of open source. On the one hand, well-known MPI implementations such as MPICH and LAM/MPI have been extensively optimised by academic institutes and researchers from all over the world; on the other hand, new implementations such as OpenMPI and MPICH2 have been released on the basis of their predecessors. This leaves users with the question of which MPI implementation is best. In fact, it is quite hard to reach an obvious conclusion, as all of these MPIs are designed for portability rather than for one specific machine. The purpose of this project is to investigate which MPI implementation is the most efficient on different machines. The project begins by installing the different MPI versions and running standard MPI test suites such as the Intel MPI Benchmarks. This gives low-level benchmark results, including message-passing latency and bandwidth and collective operation performance. The project then progresses to look at the performance of scientific application codes, which forms the main content of this dissertation. Three application codes of diverse roles and languages are selected from earlier research; they are ported to different machines and benchmarked with different MPI implementations. The application benchmark results are analysed together with the low-level results, and a reasonable conclusion can then be drawn. Another aim of the project is to check whether the public domain MPI implementations can profit from InfiniBand interconnects. InfiniBand is quickly becoming the interconnect of choice for many HPC applications because of its compelling price/performance and standards-based technology [3]. It is supported by some recently released MPIs, such as OpenMPI and the latest version of LAM/MPI. A comparison of application performance over the InfiniBand and Ethernet interconnects is carried out to investigate and assess the benefit of InfiniBand. In addition, it is interesting to investigate whether a public domain MPI implementation can


take advantage of shared-memory architectures. One of the machine platforms in this project, Lomond, is a Solaris SMP machine. Conclusions can be drawn by comparing the code benchmark results on this machine using the public domain MPIs with those using the native MPI configured for shared memory.


Chapter 2

MPI Implementations and Machine Platforms

This chapter introduces the MPI implementations studied in this project and the machine platforms on which the applications were benchmarked. Three public domain MPI implementations were selected for this project: OpenMPI, MPICH2 and LAM/MPI. The application benchmarks were performed on Lomond, the e-Science cluster, Scaliwag and HPCx.

2.1 MPI Implementations

2.1.1 OpenMPI

With the rapid change of the High Performance Computing landscape, systems comprising thousands to hundreds of thousands of processors vary from tightly integrated high-end HPC systems to clusters of PCs and servers [4]. Some systems are even built with non-uniform computing facilities, such as variations in processor type and in the bandwidths and latencies between processors. This wide variety of platforms and environments creates a strong requirement for a portable, stable, efficient and fault-tolerant MPI implementation. OpenMPI is a completely new MPI-2 compliant implementation (the project started in 2004) combining technologies and resources from prior research and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, FT-MPI and PACX-MPI projects. OpenMPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI [4].


High performance drivers have been developed for OpenMPI to support multiple communication protocols, including TCP/IP, shared memory, Myrinet and InfiniBand. This allows OpenMPI to support a wide range of parallel machines efficiently. In addition, OpenMPI can transparently fragment and stripe messages over multiple network devices, to provide maximum bandwidth to applications and to handle network failures [4]. OpenMPI is designed around a component architecture called the Modular Component Architecture (MCA). There are three levels in this architecture: the MCA itself, component frameworks and components [4]. The MCA manages the component frameworks and provides services to them. Each component framework is in charge of a single task, such as MPI collective communication. Components are self-contained software units that have well-defined interfaces and can be deployed and composed with other components [4]. The component architecture provides OpenMPI with both a stable platform for third-party research and the ability to compose independent software units at run time.
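As an illustration of how the MCA is exposed to users, components can be selected at run time through MCA parameters passed to mpirun. The following sketch is only an example: which components are actually available depends on the local build.

ompi_info | grep btl                                 # list the byte-transfer (BTL) components in this installation
mpirun -np 4 --mca btl tcp,sm,self ./application     # run 4 processes using only the TCP, shared-memory and self BTLs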

2.1.2 MPICH2

MPICH is a public domain implementation developed at Argonne National Laboratory and Mississippi State University [5]. Gropp and Lusk began to develop MPICH at the Supercomputing ’92 conference, and the first version of MPICH was implemented within a few days. Following the MPI specification as it developed, the MPICH implementation was complete, portable, fast and available immediately when the MPI standard was released in May 1994 [12]. Over the next ten years it was freely available and ran on a wide variety of systems. Today, MPICH is no longer being developed. A new version named MPICH2 was released in 2005; it is a completely new implementation of MPI that includes all the features of MPI-1 and MPI-2. The goals of MPICH2 are to provide an MPI implementation for important platforms, including clusters, SMPs and massively parallel processors, and to provide a vehicle for MPI implementation research and for developing new and better parallel programming environments [6]. The current version of MPICH2 (1.0.3) has replaced MPICH except in the case of clusters with heterogeneous data representations.


Replacing the old p4 portable programming environment used by MPICH, MPICH2 starts processes via remote shell commands (rsh or ssh) and collects and distributes the necessary information at startup time [7]. MPICH2 also separates process management from communication. The default run-time environment consists of a set of daemons called MPD. Communication among the machines is established by MPD before the application processes start up, so it can give a clearer picture of which communication cannot be established [7].
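As a sketch of the MPD workflow just described (the host file name and process counts are illustrative):

mpdboot -n 4 -f mpd.hosts    # start a ring of mpd daemons on the hosts listed in mpd.hosts
mpdtrace                     # check which hosts have joined the ring
mpiexec -n 8 ./application   # launch the MPI job through the daemon ring
mpdallexit                   # shut the ring down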

2.1.3 LAM/MPI

LAM/MPI is a high quality public domain MPI implementation originally developed by the Ohio Supercomputer Center and later adopted by the Laboratory for Scientific Computing at the University of Notre Dame. The first stable and robust version for various UNIX environments, 6.2b, was released soon after it was adopted by Notre Dame, and in the autumn of 2001 LAM moved to Indiana University. The latest version, LAM 7.1.2, provides a complete implementation of the MPI-1.2 standard and most of the MPI-2 standard. A partial list of the most commonly used MPI-2 features LAM supports is given below [8]:

• Process Creation and Management
• One-sided Communication (partial implementation)
• MPI I/O (using ROMIO)
• MPI-2 Miscellany:
  o mpiexec
  o Thread Support (MPI_THREAD_SINGLE to MPI_THREAD_SERIALIZED)
  o User functions at termination
  o Language interoperability
• C++ Bindings

LAM/MPI provides users not only with the standard MPI API, but also with several debugging and monitoring tools. While specifically targeted at heterogeneous clusters of Unix workstations, LAM runs on a wide variety of Unix platforms, from desktop workstations to large "supercomputers" (and everything in between) [8]. As a direct predecessor of OpenMPI, LAM/MPI also supports various


communication protocols (TCP, Gigabit Ethernet, Myrinet and InfiniBand) with little overhead. LAM/MPI uses one of its RPI modules (tcp, gm, ib, lamd, sysv, usysv) to perform point-to-point message passing.
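A typical LAM/MPI session on a cluster therefore looks roughly like the following sketch. The host file name is illustrative, and the -ssi option for selecting an RPI module is shown on the assumption that the tcp module is built into the local installation:

lamboot -v hostfile                       # boot the LAM run-time environment on the listed hosts
mpirun -np 4 -ssi rpi tcp ./application   # run 4 processes, selecting the tcp RPI module
lamhalt                                   # shut the LAM run-time environment down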

2.2 Machine platforms

2.2.1 e-Science cluster

The e-Science cluster is a newly constructed Linux cluster built by the e-Science MSc at the University of Edinburgh. The machine is currently composed of one master node named lab01 and 14 available slave nodes named dev01–dev15 (dev04 is down).

• Processor: 3.0 GHz Pentium 4 processor
• Cache: 12 Kbyte + 8 Kbyte L1 cache (trace and data) on the P4; 512 Kbyte L2 cache on the P4 processor
• Memory: 1 GB memory on each node
• Node: one dual-core Xeon processor on the master node; one 3.0 GHz Pentium 4 processor on each compute node; 12 compute nodes available
• O/S: Sun Solaris 10 on the master node and GNU Linux on the slave nodes
• Compiler: Sun C, C++ & FORTRAN compilers natively installed on the master; GNU C, C++ & FORTRAN compilers on the slave nodes
• Native MPI: None
• Job Scheduler: None

Since there is no scheduler on this machine yet, jobs can only be run interactively. To reduce interference between runs, the 14 nodes can be divided into different partitions (e.g. dev02 and dev03 run the 2-processor benchmarks, while dev05–dev08 take charge of the 4-processor benchmarks). In addition, benchmarks on larger numbers of processors (over 10) should be repeated several times and an average result taken.
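As a sketch of how such a partition can be expressed, all three MPI implementations accept a plain host file listing the nodes to be used; the file names below are purely illustrative:

# hosts.2proc - used for the 2-processor benchmarks
dev02
dev03

# hosts.4proc - used for the 4-processor benchmarks
dev05
dev06
dev07
dev08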


2.2.2 Lomond

Lomond is one of the University of Edinburgh’s HPC systems and is used for research and teaching. It is an SMP machine consisting of 52 Sun UltraSPARC III processors in a single cabinet [10].

• Processor: 900 MHz UltraSPARC III processor
• Cache: 64 Kbyte L1 cache, 4-way set associative with 32-byte lines; 8 Mbyte L2 cache, direct mapped with 64-byte lines
• Memory: 1 GB memory associated with each processor, giving 48 Gbyte of shared memory in total
• Front-end node: 4 of these processors running code interactively, with 7.2 Gflops peak performance
• Back-end node: 48 of these processors running code through Sun Grid Engine, with 86.4 Gflops peak performance
• O/S: Sun Solaris 9
• Compiler: Sun C, C++ & FORTRAN compilers natively installed, together with GNU C, C++ & FORTRAN compilers
• Native MPI: Sun native MPI
• Job Scheduler: Sun Grid Engine

2.2.3 Scaliwag

Scaliwag is a 1+32 node (66 processor) IBM e-325 cluster computer with InfiniBand, SCI and Gigabit Ethernet interconnects [9]. The master node contains two AMD Opteron 246 (2.0 GHz) processors, with 4 GB DDR PC2700 ECC memory [9]. The Scaliwag machine is provided by Daresbury Laboratory, CCLRC.

• Processor: 2.0 GHz AMD Opteron 246 processors, two per node
• Cache: 1 Mbyte L2 cache per core, with ECC data cache protection
• Memory: 2 GB DDR PC2700 ECC registered memory per node
• Master node: dual AMD Opteron 246 (2.0 GHz) processors, with 4 GB DDR PC2700 ECC registered memory and 2 x 73 Gbyte U320 SCSI hard disks


• Slave node: dual AMD Opteron 246 (2.0 GHz) processors, with 2 GB DDR PC2700 ECC registered memory, a 73 Gbyte U320 SCSI hard disk and a PCI-X (100 MHz) HCA card; 32 slave nodes available
• O/S: SLES 8.1 (SuSE Linux Enterprise Server)
• Compiler: Portland Group C & FORTRAN compilers installed, as are GNU C, C++ & FORTRAN compilers
• Native MPI: Scali MPI
• Job Scheduler: Sun Grid Engine

2.2.4 HPCx

HPCx Phase 2a is an IBM POWER5 575 system composed of 102 IBM eServer 575 LPARs connected by IBM’s High Performance Switch. It was recently upgraded from Phase 2 and gained an improved memory architecture.

• Processor: 1.5 GHz IBM POWER5 processor
• Cache: 32 Kbyte L1 data cache + 64 Kbyte L1 instruction cache; 1.9 Mbyte combined L2 cache shared by the two processors on one chip; 36 Mbyte L3 cache shared by two processors
• Memory: 32 Gbyte of memory shared by the 16 processors in each eServer node
• Node: 96 IBM eServer 575 compute nodes plus 6 IBM eServer 575 nodes for login and disk I/O; each node contains 16 POWER5 processors, giving 1536 processors for computation in total
• O/S: AIX
• Compiler: xlf_r (f77), xlf90_r (f90), xlc_r (C), xlC_r (C++), javac (Java)
• Native MPI: IBM MPI
• Job Scheduler: IBM LoadLeveler


Chapter 3

Methodology

This chapter explains the methods used during application benchmarking. The first part shows how application code performance is measured and benchmarked; the second part shows how the benchmark results are plotted on a graph.

3.1 Performance Measure Method

When benchmarking scientific application codes, there is more than one time value that can be used to measure performance. Instead of using the total time, it is much better to divide it into several parts. For most scientific application codes, the whole run consists of initialisation containing I/O operations, one or more major loops containing most of the processing and calculation, and then final I/O operations. The time spent on the start and end parts is basically fixed within a limited range and is not influenced by the number of steps (it may be influenced by the number of processors and the I/O file size), but the time in the major loop increases if the code runs more steps. It is important to be able to benchmark codes accurately without having to use enormous amounts of compute time. For example, PCHAN produces three time values: initialisation time, loop time and wrap-up time. Although the total time is the sum of these three values, it is dominated by the loop time, and if the code runs many steps the initialisation and wrap-up times can even be neglected. As an example, see Table 3.1, which comes from the results of the PCHAN T1 benchmark on Lomond using the native MPI.

No. of steps   Loop Time   Init Time + Wrap-up Time   Total Time
1              97.98%      2.02%                      100.00%
10             99.78%      0.22%                      100.00%

Table 3.1 The proportion of PCHAN running time


It is preferable to measure the performance of application codes using the major loop time rather than the total time. First, scientific application codes may run thousands of steps in real scientific research, so in that case the loop time is very close to the total time. Second, if the major loop(s) do not contain I/O operations, the loop time is better suited to comparing MPI performance. In addition, the time per step can easily be calculated from the loop time, and this value can be used to estimate the performance of a code in real use by multiplying the time per step by the number of steps. This is an economical and useful approach, as one code may run hundreds of thousands of steps for over a week in production, and it is unwise to do benchmarks on the same scale in a performance study. There remains the question of whether the time spent on each step is constant from the beginning to the end. Some codes have an equal workload on each step (e.g. LAMMPS), but other codes may spend more time on the early steps than on the later steps (e.g. NAMD). One way to clarify the distribution of workload is to plot the time per step for every n steps; Figures 3.1 and 3.2 show two examples.

Figure 3.1 LAMMPS melt example Benchmark on e-Science cluster using MPICH2 (loop time in seconds against step number, for 1 CPU and 4 CPUs)


Figure 3.2 NAMD apoa1 Benchmark on HPCx using native MPI (time per step in seconds against step number, for 16, 32, 64, 128 and 256 CPUs)

The first figure shows that approximately equal time is spent on each step for LAMMPS, so this time can be used directly to estimate the performance of the code in real use. The second figure shows clearly that more time is spent on the first 200 steps for NAMD; there are several humps around steps 200, 300 and 400, and the time per step stays constant after step 500. A more complex method must therefore be applied to estimate the performance of NAMD over more steps.
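To make the measurement concrete, the following is a minimal sketch of how the loop time and time per step can be collected in an MPI code. The work routine, step count and output format are illustrative and are not taken from any of the benchmarked applications:

#include <stdio.h>
#include <mpi.h>

/* stand-in for the computation and communication of one step of the major loop */
static void do_step(void)
{
}

int main(int argc, char *argv[])
{
    const int nsteps = 1000;
    int rank, nproc, step;
    double t0, t1, t2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    t0 = MPI_Wtime();
    /* ... initialisation and input I/O would go here ... */
    t1 = MPI_Wtime();

    for (step = 0; step < nsteps; step++) {
        do_step();
    }
    t2 = MPI_Wtime();
    /* ... final I/O and wrap-up would go here ... */

    if (rank == 0) {
        double loop_time = t2 - t1;
        printf("initialisation time = %g s\n", t1 - t0);
        printf("loop time           = %g s\n", loop_time);
        printf("time per step       = %g s\n", loop_time / nsteps);
        /* CPU time per step, the quantity plotted in this project */
        printf("CPU time per step   = %g cpu*s\n", loop_time * nproc / nsteps);
    }

    MPI_Finalize();
    return 0;
}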

3.2 Plotting Benchmark Results

This section goes through several ways of plotting the benchmark results, and one method is selected for comparing application performance with different MPI implementations. We assume the absolute time for running N steps on P processors is TP.

Absolute Time (TP)

This is the simplest way to plot the performance: the original results are plotted directly, without any change. The resulting graph is a set of descending curves that clearly show with which MPI the code performs better. However, this kind of graph does not show scaling information well, and it is difficult to compare the speedup of different MPI implementations. Another drawback of the absolute time graph is that the curves of different MPIs may be too close together to tell apart. This is because the values span a large range, so the slight differences between MPIs cannot be seen on the graph; in other words, the obvious performance gap between different processor counts hides the small gap between different MPI implementations. Figure 3.3 is an example of this kind of graph.


Figure 3.3 Example 1 - Absolute Time of NAMD apoa1 Benchmark on Lomond (time in seconds against number of CPUs, for Sun MPI, MPICH2 and LAM/MPI)

Speed Up (T1/TP)

This method shows the benefit of parallelisation and plots several ascending curves. By comparing with linear speedup, it is an effective way to study the scaling of different MPI implementations. However, speedup is only a relative value and is not a good measure of code performance: a high speedup value for one MPI does not mean it also has good absolute performance. If performance is affected by cache sizes, the derived speedup can even exceed linear speedup, which makes the speedup curve confusing and hard to interpret. In addition, the speedup curves of different MPI implementations can also be too close together to tell apart. Figure 3.4 is an example of this kind of graph; the large cache effect clearly makes the speedup curves exceed the linear target.

Figure 3.4 Example 2 - Speedup of Coursework Code 1000x1000 Benchmark on Lomond (speedup against number of CPUs, for OpenMPI, MPICH2, LAM/MPI, Sun MPI and the linear target)


Efficiency (T1/TP/P*100%)

Efficiency is a value equivalent to speedup which shows the effect of parallelisation. One benefit of this method is that it gives all the results the same target value to compare against, 100%. However, it inherits almost all the disadvantages of the speedup method. In this project we are really interested in the speed of a code: scaling is a useful property, but all that really matters is absolute speed.

CPU Time per Step (TP*P/N)

This metric describes the total work time summed over the processors for one step. The ideal graph is a set of horizontal lines, which corresponds to 100% efficiency or linear speedup. Compared with the methods above, CPU Time per Step has several advantages: it directly shows the performance of the code as a time value and also contains some scaling information. In addition, it avoids a problem the methods above have in common: the gap between the curves of different MPI implementations can actually be seen on the graph, because the CPU Time per Step values fall within a small range. Even if two curves are still close, the graph can be zoomed in to obtain a satisfactory view. Figure 3.5 is an example of a CPU Time per Step graph. This kind of graph is used for the MPI performance comparisons in this project.

Figure 3.5 An example of CPU Time per Step - PCHAN T1 benchmark on the e-Science cluster (CPU time per step in cpu*sec against number of CPUs, for OpenMPI, MPICH2 and LAM/MPI)

3.3 Benchmark Strategy

Normally, when benchmarking an application code, the code is run with a fixed number of steps (N) on an increasing number of processors (P), and each run is timed separately. The typical results are a group of decreasing values (TP). There is a problem with this strategy in real benchmark work: on huge machines with large numbers of processors, such as HPCx, too many steps make the runs on small numbers of processors take too long, perhaps hours or even days, which is quite inefficient; similarly, too few steps make the runs on large numbers of processors too short, perhaps just one second or less, which is often not enough for good precision. Sometimes it is impossible to find one number of steps that satisfies both ends. A more flexible strategy can effectively avoid this conflict; note that it is only suitable for application codes which spend the same time on each step (i.e. the Coursework Code and LAMMPS). The idea is to assign fewer steps to small numbers of processors and more steps to large numbers of processors, so that all the absolute time results fall within a sensible range. One equation can be used as the basic model of this strategy: N / P = const. If a code runs n steps on one processor, it runs n*p steps on p processors. Finally, all results are converted into CPU Time per Step, TP*P/N, to make them comparable with each other. Table 3.2 shows a real use of this strategy.

P    Steps, N   TP (sec)   CPU Time per Step (cpu*sec)
1    4000       55.33      0.0138
2    8000       65.31      0.0163
4    16000      84.05      0.0210
8    32000      174.55     0.0436
12   48000      242.67     0.0607

Table 3.2 Example - Coursework Benchmark on e-Science Cluster with OpenMPI
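As a worked example of the conversion, the 8-processor entry in Table 3.2 gives a CPU Time per Step of TP*P/N = 174.55 x 8 / 32000 ≈ 0.0436 cpu*sec, which matches the value in the last column.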

3.4 Correctness Check

It is important to check that an application code runs correctly during application benchmarking, as the performance results of an incorrect code are meaningless. But the codes are quite different from each other, so there is no silver bullet for all of them. This section lists the way correctness is checked for each code. The MPI Coursework Code is a 2D-decomposition image processing code written by the author of this dissertation; PCHAN is a parallel turbulent fluid flow code; NAMD and LAMMPS are both parallel molecular dynamics codes. All these codes are introduced in detail in later chapters.

The MPI Coursework Code

The loop termination criterion, delta, is also used to check the correctness of the code. Since delta is calculated from the results of an Allreduce operation, it represents, to some extent, the combined result over all points. A simple but good method is to check the value of the last delta output; a more exact way is to check several of the preceding delta outputs as well. (The MPI Coursework Code was updated to print out the delta value each time it is calculated.)

PCHAN

PCHAN does not print out any statistics within or after the major loop, so the only way to ensure the code works correctly is to compare output files. PCHAN leaves a statistics file, stats.dat, which contains 121x9 values; the diff tool can be used to compare these files.

NAMD and LAMMPS

NAMD and LAMMPS are both molecular dynamics codes, so the methods to check their correctness are similar. Both codes print out energy values every fixed number of steps; comparing these values, or just the last output, checks whether the code runs without problems.

3.5 Summary

This chapter summarises the series of methods used in the application benchmarks. The timing results of each benchmark are reduced to the time per step of the major loop. These results are converted to CPU time per step by multiplying by the number of processors used, and the CPU time per step values are plotted in the performance figures. To keep the benchmarking time within a limited range, a code which does a fixed amount of work on each step can be benchmarked with a number of steps that varies with the number of processors involved. It is necessary to verify the correctness of a code before using its performance results, and there is no silver bullet for checking all the codes.


Chapter 4

Low Level MPI Benchmark

An MPI implementation’s performance is usually studied first through commonly used basic MPI operations, as these are the building blocks of a complicated application code. A low-level MPI benchmark suite, the Intel MPI Benchmarks (IMB), was used in this project to measure the elementary MPI operation performance of OpenMPI, MPICH2 and LAM/MPI. Parts of the results will be used to analyse and understand the application performance of the different MPI implementations in later chapters; those results are given in this chapter in advance. In addition, since the IMB benchmark work was re-done by the author, some porting information is included in this chapter as well.

4.1 Introduction of Intel MPI Benchmarks

This project uses the Intel MPI Benchmarks (IMB) suite (version 2.3) to measure commonly used MPI functions and to plot a low-level performance picture of a range of public domain MPI implementations. The Intel MPI Benchmarks are the successor of the well-known Pallas MPI Benchmarks (PMB) [24]. The suite is written entirely in C plus standard MPI and is designed to provide a concise set of benchmarks targeted at measuring the most important MPI functions [24]. The IMB 2.3 package contains three parts: IMB-MPI1 (MPI-1 standard function benchmarks), IMB-EXT (MPI-2 one-sided communication benchmarks) and IMB-IO (MPI-2 I/O benchmarks) [24]. A separate executable can be built for each part. This chapter includes the results of the Pingpong and Allreduce benchmarks, both of which belong to IMB-MPI1.

(Note: for completeness, the IMB benchmarks were re-run by the author. For further information on low-level benchmarks, see [29].)


The Pingpong benchmark tests simple MPI point-to-point message passing operations; it reports both the message-passing latency and the bandwidth for different message sizes. The Allreduce benchmark measures the performance of the collective communication function MPI_Allreduce across different numbers of processors.
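A minimal sketch of the kind of point-to-point measurement the Pingpong benchmark performs is shown below. It is illustrative only and is not the IMB source: the message size and repetition count are arbitrary, and IMB's exact timing and reporting conventions differ in detail.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const int nbytes = 16384;   /* one "middle" message size */
    const int reps   = 1000;
    char *buf = malloc(nbytes);
    int rank, i;
    double t0, t1, half_rtt;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, nbytes);

    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {            /* rank 0 sends first, then waits for the echo */
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* rank 1 echoes every message back */
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        half_rtt = 0.5 * (t1 - t0) / reps;   /* time for one one-way transfer */
        printf("time per message = %.2f usec\n", 1.0e6 * half_rtt);
        printf("bandwidth        = %.2f Mbyte/sec\n", nbytes / half_rtt / 1.0e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}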

4.2 Benchmark results on Lomond

The Pingpong benchmark consists of a group of point-to-point message passing operations (MPI_Send and MPI_Recv) over a range of message sizes. Figure 4.1 records the Pingpong results for middle-sized messages and Figure 4.2 plots the Pingpong results for large messages.

Figure 4.1 Pingpong bandwidth of middle messages on Lomond (bandwidth in Mbyte/sec against message size in bytes, for OpenMPI, MPICH2, LAM/MPI and Sun MPI)

Figure 4.1 shows clearly that, in the mid-size range, the bandwidth of all MPI implementations keeps increasing as the message size grows. However, the bandwidth of OpenMPI was not as good as that of the other three MPIs; it only achieved around 50% of their bandwidth in the Pingpong benchmark. This obvious gap will have a big effect on the performance of applications which rely on middle-sized messages. In contrast, the bandwidth curves of the other three MPIs are quite close. Both MPICH2 and LAM/MPI perform as well as or better than the native Sun MPI. Their high bandwidth makes them competitive with the native MPI for small and middle-scale applications, especially LAM/MPI, which had the best mid-size message bandwidth on Lomond.


Figure 4.2 Pingpong bandwidth of large messages on Lomond (bandwidth in Mbyte/sec against message size in bytes, for OpenMPI, MPICH2, LAM/MPI and Sun MPI)

For large messages, the bandwidth results do not follow the same pattern as the mid-size range. One obvious change is that LAM/MPI loses its leading position and falls behind OpenMPI. Combined with the mid-size results, it can be seen that LAM/MPI achieved only a very limited bandwidth increase from the mid-size range to the large-size range, so the former leader was caught up by OpenMPI, whose bandwidth kept increasing strongly. This trend, which can also be seen in the figure above, indicates that LAM/MPI will perform poorly with very large messages. (Figure 4.2 is included here for the large-message applications discussed in later chapters.) MPICH2 and Sun MPI still kept a certain advantage over OpenMPI, and MPICH2 even outperformed Sun MPI to take first place. In fact, MPICH2 showed native-level or better performance in all ranges of the IMB Pingpong benchmark. It can be foreseen that application codes dominated by point-to-point communication will get the best performance with MPICH2.


Figure 4.3 IMB Pingpong Benchmark latency results on Lomond (latency in usec: OpenMPI 17.34, MPICH2 5.91, LAM/MPI 2.91, Sun MPI 4.29)

Figure 4.3 shows the message-passing latency of the different MPIs. The latency is the time spent passing one zero-length message, and the performance of problems with small and middle message sizes is largely determined by this value. In Figure 4.3, the latency bar of OpenMPI is clearly much higher than those of the other three MPIs: the latency of OpenMPI is about 3 times that of MPICH2, 6 times that of LAM/MPI and 4 times that of Sun MPI. This means OpenMPI had to spend much longer starting a communication than the other MPIs in the Pingpong benchmark. The other three MPIs have similar latency, with LAM/MPI outperforming Sun MPI to get the best result. Although MPICH2 took longer than Sun MPI and LAM/MPI to start a point-to-point communication, it was not left far behind. Based on the above bandwidth and latency results, Table 4.1 proposes an ordering of the point-to-point communication performance of the different MPIs for different message-size ranges on Lomond.

Size     1st        2nd       3rd       4th
small    LAM/MPI    Sun MPI   MPICH2    OpenMPI
middle   LAM/MPI    MPICH2    Sun MPI   OpenMPI
large    MPICH2     Sun MPI   LAM/MPI   OpenMPI

Table 4.1 P2P message passing performance order of different MPI on Lomond

Figure 4.4 shows the results of another important MPI-1 benchmark, Allreduce.


Figure 4.4 IMB Allreduce Benchmark performance for a small (8-byte) message on Lomond (Allreduce time in usec against number of processors, for OpenMPI, MPICH2, LAM/MPI and Sun MPI)

Figure 4.4 plots the performance of the MPI_Allreduce operation on an 8-byte message for the different MPI implementations on different numbers of processors. Although the time cost keeps increasing with the number of processors, Lomond’s shared memory makes the rise very slow for all MPI implementations except OpenMPI. The gap between OpenMPI’s performance curve and those of the other MPIs grows steadily with the number of processors; with 16 processors, the time cost of OpenMPI was over three times that of the other MPIs. The results indicate that OpenMPI will be at an even greater disadvantage when running an application with many MPI_Allreduce operations. In contrast, the other three MPIs show relatively close performance in the Allreduce benchmark: MPICH2 shows native-level performance and LAM/MPI falls slightly behind. The competition between MPICH2 and LAM/MPI is quite interesting here, because for short messages LAM/MPI beats MPICH2 on point-to-point communication while MPICH2 performs better on collective communication. Which one wins for a given code is determined by the proportion of these two kinds of operations in that code.
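For reference, the measurement behind Figure 4.4 can be approximated by a simple timing loop like the sketch below; it is illustrative rather than the actual IMB harness, and the repetition count is arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const int reps = 1000;
    double in = 1.0, out, t0, t1;
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all processes together */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        /* reduce one double (8 bytes), as in the small-message case of Figure 4.4 */
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        printf("%d processes: %.2f usec per 8-byte Allreduce\n",
               size, 1.0e6 * (t1 - t0) / reps);
    }

    MPI_Finalize();
    return 0;
}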

4.3 Benchmark results on e-Science

The same benchmarks were run on the e-Science cluster as well. However, the e-Science cluster is a new cluster without a scheduler installed, so users may only run their codes interactively on its nodes. One problem with interactive running is that there is no guarantee that the processors running a code are not also being used by the system or by other users, which could distort the benchmark results. To deal with this, the Pingpong and Allreduce benchmarks with each MPI were repeated 10 times on this machine, and the results were then averaged after discarding the two extreme values. A typical example is the following group of latency results obtained with the same MPI: (250.48, 63.46, 63.65, 99.71, 249.97, 94.84, 62.77, 250.23, 256.72, 250.98).
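As an illustration of this averaging scheme, the small C routine below (a sketch written for this discussion, not part of IMB or of the benchmark scripts used in the project) drops the largest and smallest of the repeated measurements and averages the rest; applied to the example group quoted above it gives about 165.4 usec.

#include <stdio.h>

/* Average n repeated measurements after discarding the single highest
 * and single lowest value, as described above. Assumes n > 2. */
double trimmed_average(const double *t, int n)
{
    double sum = 0.0, min = t[0], max = t[0];
    int i;
    for (i = 0; i < n; i++) {
        sum += t[i];
        if (t[i] < min) min = t[i];
        if (t[i] > max) max = t[i];
    }
    return (sum - min - max) / (n - 2);
}

int main(void)
{
    /* the example group of latency results quoted in the text (usec) */
    double lat[10] = {250.48, 63.46, 63.65, 99.71, 249.97,
                      94.84, 62.77, 250.23, 256.72, 250.98};
    printf("trimmed average latency: %.2f usec\n", trimmed_average(lat, 10));
    return 0;
}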


Figure 4.5 records the Pingpong results for middle-size messages and Figure 4.6 plots the Pingpong results for large messages.

[Figure 4.5: Pingpong bandwidth (Mbyte/sec) against message size (0 to 20000 bytes) for OpenMPI, MPICH2 and LAM/MPI.]

Figure 4.5 Pingpong bandwidth of middle messages on e-Science cluster

Note: in Figure 4.5, the bandwidth curves of MPICH2 and LAM/MPI almost overlap, appearing as a single curve.

Figure 4.5 shows that all the MPI implementations achieve higher bandwidth as the message size grows, which is always the case because the fixed latency is amortised over larger messages. MPICH2 and LAM/MPI had almost the same bandwidth in this section of the Pingpong benchmark (MPICH2 slightly better). OpenMPI fell behind them, but not by far, and the gap gradually narrowed; by the end of the three curves OpenMPI had caught up with MPICH2 and LAM/MPI. In short, a code dominated by point-to-point communication of middle-size messages will perform better with MPICH2 or LAM/MPI than with OpenMPI.


[Figure 4.6: Pingpong bandwidth (Mbyte/sec) against message size (up to about 1,200,000 bytes) for OpenMPI, MPICH2 and LAM/MPI.]

Figure 4.6 Pingpong bandwidth of large messages on e-Science cluster

Note: in Figure 4.6, the bandwidth curves of MPICH2 and LAM/MPI almost overlap, appearing as a single curve.

In the large-message section MPICH2 and LAM/MPI still obtained similar bandwidth in the Pingpong benchmark (MPICH2 slightly better), and the growth of their bandwidth had almost stopped. Unlike in the middle-size section, OpenMPI outperformed MPICH2 and LAM/MPI, achieving higher bandwidth, and the difference kept widening as the message size increased; the average bandwidth of OpenMPI in this section is about twice that of the other two MPI. This differs from the results on Lomond, where OpenMPI did not perform better than MPICH2 and LAM, and suggests that OpenMPI is better suited to the cluster machine than to the shared-memory machine.

[Figure 4.7: bar chart of zero-byte message latency (usec) on the e-Science cluster. Values: OpenMPI 165.42, MPICH2 119.56, LAM/MPI 109.96.]

Figure 4.7 IMB Pingpong Benchmark latency results on e-Science cluster


Figure 4.7 shows the latency of the different MPI on the e-Science cluster. Compared with the latency results on Lomond, all MPI show a notable increase in latency on the e-Science cluster, because the nodes are connected by Gigabit Ethernet, which cannot pass messages as efficiently as shared memory. Nevertheless, the three MPI keep the same latency order as on Lomond: LAM/MPI needs the shortest time to start a communication, MPICH2 has slightly longer latency than LAM, and OpenMPI takes much longer. However, the difference between OpenMPI's latency and that of the other two MPI is much smaller here: on Lomond the ratios of OpenMPI's latency to those of MPICH2 and LAM are about 3 and 6, whereas on the e-Science cluster they are 1.38 and 1.65, and the gap between MPICH2 and LAM is also much reduced. This suggests that, for codes dominated by small or middle-size messages, the performance differences between the MPI implementations may be smaller on the e-Science cluster than on Lomond. Based on the above bandwidth and latency results, a ranking of the point-to-point communication performance of the different MPI in each message-size section on the e-Science cluster is proposed in Table 4.2.

size      1st        2nd        3rd
small     LAM/MPI    MPICH2     OpenMPI
middle    MPICH2     LAM/MPI    OpenMPI
large     OpenMPI    MPICH2     LAM/MPI

Table 4.2 P2P message passing performance order of different MPI on e-Science cluster

Figures 4.8 and 4.9 below show the Allreduce benchmark performance for 8-byte and 192-byte messages respectively3.

3 An extra 192-byte Allreduce result was plotted for the e-Science cluster because it is needed for the performance analysis of LAMMPS, which was not ported to Lomond; this graph was therefore not plotted for Lomond.


[Figure 4.8: Allreduce time (usec) for an 8-byte message on 2, 4, 8 and 12 processors, for OpenMPI, MPICH2 and LAM/MPI.]

Figure 4.8 8 bytes message Allreduce Performance on e-Science cluster

[Figure 4.9: Allreduce time (usec) for a 192-byte message on 2, 4, 8 and 12 processors, for OpenMPI, MPICH2 and LAM/MPI.]

Figure 4.9 192 bytes message Allreduce performance on e-Science cluster

The Allreduce benchmark was run twice on the e-Science cluster, with messages of 8 bytes and 192 bytes. In Figure 4.8 all the performance curves rise as processors are added: the more processors involved in the collective communication, the longer it takes. The steep gradient indicates that adding processors places a heavier burden on the e-Science cluster than on Lomond because of the difference in their interconnects. Of the three MPI implementations, MPICH2 had the best collective communication performance in this benchmark, its curve always lying lowest. LAM/MPI had similar performance on fewer than 4 processors, but its Allreduce time increased quickly beyond 4 processors, so that it ended up taking the most time. OpenMPI performed badly on small numbers of processors, but the slow growth of its time cost eventually put it in the middle position between MPICH2 and LAM/MPI.


Figure 4.9 shows curves of very similar shape to those in Figure 4.8; the ordering of the results did not change when the message size was increased from 8 bytes to 192 bytes.

4.4 Porting IMB

After unpacking, the directory contains four subdirectories: ./doc, ./src, ./license and ./version_news. In ./src there is one main Makefile and several Makefile templates named make_xxx. 1. Provide one template for each MPI on Lomond and the e-Science cluster; for example, Figure 4.10 gives the Makefile template for OpenMPI on the e-Science cluster.

MPI_HOME = /home/s0566708/ompi

MPI_INCLUDE = $(MPI_HOME)/include

LIB_PATH =

LIBS =

CC = ${MPI_HOME}/bin/mpicc

OPTFLAGS = -O

CLINKER = ${CC}

LDFLAGS =

PPFLAGS =

Figure 4.10 the Makefile template for OpenMPI on e-Science cluster

By replacing MPI_HOME with the install location of another MPI, a series of templates for the different MPI implementations can be made from the above template: make_ompi, make_mpich2, make_lam, etc.

2. Include the template in the main Makefile; for example, add the line include make_ompi to the Makefile to use the OpenMPI template of Figure 4.10.

3. Since each part of IMB can be compiled separately, type make IMB-MPI1 to build the IMB-MPI1 part, which contains the Pingpong and Allreduce benchmarks.


4. Run the benchmarks.
OpenMPI:
mpirun -hostfile ~/mytest/ompi/hostfile_2 -np 2 ./IMB-MPI1 Pingpong
mpirun -hostfile ~/mytest/ompi/hostfile_8 -np 8 ./IMB-MPI1 Allreduce
MPICH2:
mpiexec -n 2 ./IMB-MPI1 Pingpong
mpiexec -n 8 ./IMB-MPI1 Allreduce
LAM/MPI:
mpirun -np 2 ./IMB-MPI1 Pingpong
mpirun -np 8 ./IMB-MPI1 Allreduce

4.5 Summary

Intel MPI Benchmarks provide a concise set of benchmarks targeted at the most important MPI functions [24]. Two benchmarks from the MPI1 part were applied in this chapter: the Pingpong benchmark tested the bandwidth and latency of point-to-point communication, and the Allreduce benchmark measured the performance of collective communication. LAM/MPI could start a point-to-point message transfer in the shortest time, which makes it most suitable for applications using small or middle-size messages, but its limited bandwidth hurts its performance on codes with large messages. MPICH2 achieved native-level bandwidth and latency and performed well in every message-size section. OpenMPI had the best bandwidth for large messages and the best Allreduce result for 8-byte messages, but it took much longer to start a communication. However, real application codes are usually dominated by more than one kind of MPI function, and it is hard to judge which implementation will give the best performance from benchmarks of a single MPI function alone. For instance, if a middle-message code contains both many P2P communications and many collective communications, which MPI gives better performance, MPICH2 or LAM/MPI? It is hard to tell; the only way to find out is to benchmark the code with the different MPI implementations. The next chapter introduces an MPI Coursework code which matches this case, and the question above will be answered by the end of that chapter.


Chapter 5

Coursework Code Benchmark

This chapter introduces an MPI 2D-decomposition image-processing code originally written for the MPI Coursework. Some changes were made to this code and it was ported as an MPI application benchmark in this project. The results of this benchmark on both Lomond and the e-Science cluster are contained in this chapter, followed by an analysis of the results.

5.1 Introduction of the code

The MPI coursework code is a parallel image-processing code written entirely in C by the author of this dissertation. The image data is distributed over the processors with a regular 2D decomposition and updated in parallel through a number of iterations. Between each round of updates, adjacent processors must swap halos with each other by non-blocking communication to stay synchronised. After each update, the stopping criterion delta is computed through a global Allreduce operation to check whether the loop should stop [28]. The original code calculates delta every iteration; this is wasteful in the early stages, as the delta calculation involves expensive collective communication. The function dynamic_delta is designed to remove unnecessary global operations: it works out the gap, in iterations, until the criterion next needs to be calculated, based on the current value of delta and the final target value. With this function switched on, the procedure involves far fewer global operations, reducing both the amount of communication and the amount of computation.


To benchmark problems of any size, the code was modified: the I/O read at the beginning was replaced by a routine that randomly generates floating-point numbers between -1.0 and 1.0 and assigns them to the master buffer as image data. The code can therefore now test problems of various sizes, which is very useful as it can produce benchmarks for an arbitrary image size. The coursework code is timed both over the loop body and over the whole procedure including I/O; the time per step is obtained by dividing the loop time by the number of iterations. Figure 5.1 shows an example of the MPI Coursework Code updating an image of 192x360 pixels on four processors. The middle image shows the 2D decomposition onto a computing grid of four processors; the blue lines represent the boundaries of the sub-images.

Figure 5.1 MPI Coursework Code updating image
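To make the replacement of the file read concrete, here is a minimal sketch (the function name and interface are invented for illustration; the real routine is part of the author's code) that fills the master buffer with random floats in [-1.0, 1.0]:

#include <stdlib.h>

/* Fill the master image buffer with pseudo-random values in [-1.0, 1.0]
 * instead of reading them from a file (illustrative sketch only). */
void fill_random_image(float *buf, int nx, int ny, unsigned int seed)
{
    int i;
    srand(seed);
    for (i = 0; i < nx * ny; i++)
        buf[i] = 2.0f * ((float)rand() / (float)RAND_MAX) - 1.0f;
}

int main(void)
{
    static float image[192 * 360];   /* size of the example image in Figure 5.1 */
    fill_random_image(image, 192, 360, 12345u);
    return 0;
}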

5.2 Porting code

The MPI Coursework Code was ported to Lomond, the e-Science Cluster, Scaliwag4 and HPCx. Before building a code on a machine with a public domain MPI, it is necessary to set the environment variable(s) that specify which MPI is to be used and where its executables are located. A Makefile can then be generated by a configure script or written by hand. Figure 5.2 in Appendix A lists the Makefile5 used to compile the Coursework Code on Lomond with OpenMPI. This Makefile can be reused to compile the MPI Coursework Code on both Lomond and the e-Science cluster with OpenMPI, MPICH2 and LAM/MPI, because every public domain MPI implementation names its MPI C compiler wrapper mpicc.

4 Porting the Coursework Code on Scaliwag is discussed in Chapter 11.
5 There was no opportunity to benchmark the coursework code with optimisation flags on Scaliwag; to obtain comparable results, no optimisation flags were applied when compiling the coursework code on any machine.


Figure 5.3 in Appendix A shows a script to build the Coursework Code and run it on Lomond using OpenMPI. To build the Coursework Code on the e-Science Cluster, or on Lomond with another MPI, simply change the environment variable settings.

5.3 Optimization of Coursework Code

Before benchmarking the Coursework Code, one important optimisation was made. In the original code, each processor swapped halos with its neighbouring processors through four ordered groups of communication (MPI_Issend + MPI_Recv + MPI_Wait): send to the upper neighbour, receive from the upper neighbour, wait, then repeat the same group of operations with the lower, left and right neighbours in turn, giving four synchronisation points per halo swap. These operations were replaced by a single group of non-blocking communications containing all the original transfers (four MPI_Isend + four MPI_Irecv + one MPI_Waitall), so the halo swap no longer imposes an order and has only one synchronisation point, reducing the communication cost. The old and new codes were run6 on both Lomond and the e-Science Cluster to test the effect of the halo-swap optimisation; Figure 5.4 gives the results of this test.

Figure 5.4 Halo-swap optimisation of the Coursework Code: CPU time per step of the old and new code on Lomond (left) and the e-Science cluster (right)

6 Using small (500x500) benchmark and LAM/MPI


Note: the two curves in the left-hand plot of Figure 5.4 have been zoomed in, but they are still too close to distinguish.

In the left-hand graph of Figure 5.4, the performance curves of the old and new code almost coincide: the halo-swap optimisation had no effect on Lomond. The shared-memory architecture of Lomond allows message passing with very low latency, and the SGE scheduler ensures that all CPU cycles are dedicated to this task, so there is little chance of any processor being left waiting during the halo swap. Even with the new method reducing the possibility of waiting, no large performance improvement can be expected on Lomond. The same optimisation, however, effectively improved the performance and scaling of the Coursework Code on the e-Science Cluster. Unlike Lomond, the e-Science cluster is a distributed-memory machine connected by Gigabit Ethernet; such a system cannot achieve latency as low as a shared-memory machine when passing messages. In addition, application codes on this machine are run interactively, which means a CPU may be interrupted by the system or by other users in the middle of a run, so at each synchronisation point fast processes may wait for slow ones for a long time. Reducing the number of synchronisation operations therefore avoids a lot of unnecessary waiting time. These two different results indicate that the low-latency feature of a shared-memory machine can deliver high performance for an MPI code even if the code itself is not very efficient, whereas a Linux cluster is highly sensitive to the efficiency of the code: optimising the message passing is a key element in obtaining satisfactory performance, whichever MPI is used.
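To make the two halo-swap patterns concrete, the following minimal MPI sketch (illustrative only, not the author's Coursework Code; buffer names and the halo size are invented) shows the optimised scheme discussed in this section: all four exchanges are posted as non-blocking operations and completed with a single MPI_Waitall, in place of the four ordered MPI_Issend/MPI_Recv/MPI_Wait groups of the original code.

#include <mpi.h>
#include <stdio.h>

#define NHALO 1000   /* halo length in doubles; arbitrary for this sketch */

int main(int argc, char **argv)
{
    MPI_Comm cart;
    MPI_Request reqs[8];
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    int rank, size, up, down, left, right, i, j;
    double sendbuf[4][NHALO], recvbuf[4][NHALO];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 2D process grid; neighbours outside the grid become MPI_PROC_NULL */
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_shift(cart, 0, 1, &up, &down);
    MPI_Cart_shift(cart, 1, 1, &left, &right);

    for (i = 0; i < 4; i++)
        for (j = 0; j < NHALO; j++)
            sendbuf[i][j] = (double)rank;   /* dummy halo data */

    /* new pattern: post all sends/receives, then one synchronisation point */
    {
        int nbr[4] = {up, down, left, right};
        for (i = 0; i < 4; i++) {
            MPI_Irecv(recvbuf[i], NHALO, MPI_DOUBLE, nbr[i], 0, cart, &reqs[i]);
            MPI_Isend(sendbuf[i], NHALO, MPI_DOUBLE, nbr[i], 0, cart, &reqs[4 + i]);
        }
        MPI_Waitall(8, reqs, MPI_STATUSES_IGNORE);
    }

    if (rank == 0)
        printf("halo swap completed on %d processes\n", size);
    MPI_Finalize();
    return 0;
}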

5.4 Benchmark on Lomond

The MPI Coursework Code has three benchmarks, processing images of 500x500, 1000x1000 and 4000x4000 pixels; they represent small, middle and big problems. The increasing image size requires more memory at runtime, and the messages passed between processors also grow. In addition, there are two running modes for each benchmark: dynamic off and dynamic on. In the former mode the program calculates the loop-stopping criterion (delta) in every iteration; in the latter mode it calculates delta in a more economical way, roughly one calculation every twenty-eight iterations on average, which effectively reduces the number of Allreduce global operations.


All three benchmarks and both modes were tested in this project.
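The saving from the dynamic-on mode can be illustrated with a small self-contained toy program (this is not the author's dynamic_delta routine; the geometric decay of delta and the interval formula are invented purely for illustration): the stopping criterion is evaluated only every 'gap' iterations, with the gap estimated from how far the current delta still is from the target.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double delta = 1.0, target = 1e-4, decay = 0.999;
    int iter = 0, checks = 0, gap = 1, since_check = 0;

    while (1) {
        delta *= decay;            /* stands in for one image-update step   */
        iter++;
        if (++since_check < gap)
            continue;              /* skip the expensive global check       */
        checks++;                  /* in the real code this is an Allreduce */
        since_check = 0;
        if (delta < target)
            break;
        /* estimate the remaining iterations and check again about half-way */
        gap = (int)(0.5 * log(target / delta) / log(decay));
        if (gap < 1)
            gap = 1;
    }
    printf("%d iterations, %d criterion evaluations\n", iter, checks);
    return 0;
}

Checking every iteration would perform one global operation per step, so the ratio of the two numbers printed by the sketch indicates the kind of reduction in collective communication that such a scheme can give.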

[Figure 5.5: CPU time per step (cpu*sec) on 1 to 32 processors for OpenMPI, MPICH2, LAM/MPI and Sun MPI, each in dynamic-off and dynamic-on mode.]

Figure 5.5 the MPI Coursework Code 1000x1000 Benchmark on Lomond

Figure 5.5 plots the results of the 1000x1000 benchmark in both dynamic-off and dynamic-on mode. As the number of processors increases, more and more of the data can reside in L2 cache instead of main memory, which makes all the performance curves fall at the beginning7. Once more than four processors are used all the data is held in cache, the cache effect stops, and the CPU-time-per-step curves no longer fall but stay roughly horizontal, with a slight tail-off at the end.

7 The L2 cache on Lomond is 8 Mbyte per processor, and the memory required for a 1000x1000 benchmark is 32 Mbyte. Adding processors increases the total cache size, so more data can be kept in cache instead of memory, which improves the parallel performance; this effect is a benefit of the architecture, not of the MPI implementations.

The upper group of curves in Figure 5.5 represents the running times in dynamic-off mode. In this group MPICH2 and LAM/MPI showed performance and scaling very close to the native Sun MPI, and at times even exceeded it. Compared with MPICH2, the LAM/MPI curve is more stable in its later section, and on small numbers of processors LAM/MPI performed slightly better than MPICH2. The performance of OpenMPI was clearly still not as good as the other MPI on Lomond, and its earlier tail-off shows that its scaling was also worse. The lower group of curves in Figure 5.5 represents the running times in dynamic-on mode, in which collective communication is effectively reduced.


In this group the gap between OpenMPI's curve and those of the other MPI became much smaller than in the upper group, which means OpenMPI profited more from the reduction in collective communication, allowing its performance to come close to that of the other MPI.

The application performance can be understood from the low-level performance. Since the Coursework Code was written entirely by the author of this dissertation, it is easy to know exactly what MPI communication it performs within the major loop: four groups of point-to-point communication plus one Allreduce collective communication in each step. The message size of the P2P communication is about 8000 bytes and the message size of the collective communication is 8 bytes; with the dynamic-on strategy the collective communication is reduced to once every 28 steps on average. The low-level benchmarks show that MPICH2 and LAM/MPI both have small-message bandwidth and latency similar to the native MPI in the Pingpong benchmark, and that these three MPI have similar Allreduce performance as well, which explains why MPICH2, LAM/MPI and Sun MPI deliver such close application performance in both dynamic-off and dynamic-on mode. Compared with MPICH2, LAM/MPI and Sun MPI, OpenMPI's bandwidth is lower and it takes longer to start a communication; the Allreduce benchmark also shows that OpenMPI is more expensive in collective communication. It is therefore not hard to understand why OpenMPI is left far behind in application performance. The fact that OpenMPI gets closer to the other MPI in dynamic-on mode indicates that it is highly sensitive to the cost of collective communication: a code with few collective operations is needed for this MPI to achieve satisfactory performance.
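As a rough way of seeing how these low-level numbers combine into a per-step communication cost, the following sketch models one step as four 8000-byte point-to-point exchanges plus one 8-byte Allreduce every n steps (n = 1 in dynamic-off mode, about 28 in dynamic-on mode); the latency, bandwidth and Allreduce inputs are placeholders to be filled in from the measured results, not measured values themselves.

#include <stdio.h>

/* Per-step communication cost of the Coursework Code as described above:
 * four point-to-point halo exchanges of about 8000 bytes plus one 8-byte
 * Allreduce every 'n' steps (n = 1 in dynamic-off mode, ~28 in dynamic-on). */
static double step_comm_cost(double latency_us, double bw_MBps,
                             double allreduce_us, double allreduce_every_n)
{
    double p2p_us = latency_us + 8000.0 / bw_MBps;  /* bytes/(Mbyte/s) = usec */
    return 4.0 * p2p_us + allreduce_us / allreduce_every_n;
}

int main(void)
{
    /* placeholder inputs: substitute the measured Pingpong/Allreduce values */
    double lat = 10.0, bw = 100.0, ar = 100.0;
    printf("dynamic off: %.1f usec/step\n", step_comm_cost(lat, bw, ar, 1.0));
    printf("dynamic on : %.1f usec/step\n", step_comm_cost(lat, bw, ar, 28.0));
    return 0;
}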

5.5 Benchmark on e-Science Cluster

The 500x500 and 1000x1000 benchmarks were run with both delta-calculation modes on the e-Science cluster. Figure 5.6 shows the results of the 1000x1000 benchmark.


[Figure 5.6: CPU time per step (cpu*sec) on 1 to 12 processors for OpenMPI, MPICH2 and LAM/MPI, each in dynamic-off and dynamic-on mode.]

Figure 5.6 MPI Coursework Code 1000x1000 Benchmark results on e-Science Cluster

Note: in the lower group, the three dynamic-on curves are so close that they appear to stick together.

The result plots are a series of ascending curves: as more processors are involved, more time has to be spent on communication and synchronisation between processors, and this extra cost increases the total running time and pushes the performance curves upwards. Compared with the performance curves on Lomond, the curves on the e-Science cluster tail off earlier and more sharply, because the e-Science cluster is a distributed-memory machine whose processors are connected by Gigabit Ethernet; although this is a natural architecture for an MPI parallel job, message passing on this machine cannot be as efficient as on a shared-memory machine. Figure 5.6 also shows that in dynamic-on mode the performance of the code improves enormously whichever MPI is used, and the dynamic-on mode effectively suppresses the tail-off when many processors are involved. Evidently Allreduce is such a costly global operation that half or more of the whole running time is spent on it; over-use of this operation can badly affect the scaling of a code, with a large amount of performance lost to communication costs between processors. When benchmarking the Coursework code in dynamic-off mode, MPICH2 has a certain performance advantage, LAM/MPI is ranked in the middle and OpenMPI comes last. In Figure 5.6 the performance difference between these MPI only becomes clear once more than four processors are used, which shows that the main performance gap is caused by scaling. Figure 5.7 gives the speedup plot for the same benchmark results; the speedup curve of MPICH2 is clearly closer to the linear line.


This scaling advantage gives MPICH2 higher performance than the other two MPI on the e-Science cluster.

[Figure 5.7: speedup against number of processors (up to 12) for OpenMPI, MPICH2 and LAM/MPI, with the ideal linear line for comparison.]

Figure 5.7 Speedup of Coursework Code 1000x1000 Benchmark on e-Science Cluster

Although MPICH2 has a certain advantage in dynamic-off mode, this superiority disappears as soon as the benchmark is switched to dynamic-on mode. With the large number of global operations removed, all three MPI show very similar performance, so that their curves almost stick together in Figure 5.6.

Figure 5.8 Zoom in - Coursework Code 1000x1000 Benchmark on e-Science Cluster

If we zoom in on these curves in Figure 5.8 so that they separate, we find that OpenMPI is still a little slower and that LAM/MPI becomes the fastest. This shows that LAM/MPI and OpenMPI profit more from the change of mode; on the other hand, it also indicates that these two MPI suffer from global operations, while MPICH2 is able to achieve good performance and scaling in a task with busy global communication.


The conclusions of the last paragraph are supported by the low-level benchmark results on the e-Science cluster. LAM/MPI obtained the highest bandwidth for small and middle messages and had the shortest latency, which makes it the best MPI for a code dominated by middle-size-message P2P communication. But LAM/MPI also has the worst Allreduce performance, and this drawback allowed MPICH2 to outperform it in dynamic-off mode, which provides a collective-communication-rich environment.

5.6 Summary

The MPI Coursework code was updated in this project to provide several benchmarks of different problem sizes, and the two dynamic strategies allow the code to present environments dominated by different MPI operations. MPICH2 and LAM/MPI both provide native-level application performance on Lomond, whichever dynamic strategy is applied. OpenMPI's performance falls far behind the other three MPI in dynamic-off mode, as determined by its poor low-level benchmark results; however, since OpenMPI is very sensitive to collective communication, the performance difference between it and the other MPI shrinks considerably in dynamic-on mode. Supported by its best-in-class collective communication performance, MPICH2 beat LAM/MPI in dynamic-off mode on the e-Science cluster, but it was outperformed by LAM/MPI in dynamic-on mode: MPICH2 loses its advantage when collective communication is reduced, while LAM/MPI keeps its advantage in bandwidth and latency for middle-size P2P communication. OpenMPI's performance ranked last in both modes.


Chapter 6

PCHAN Benchmark

This chapter describes the procedure of porting and benchmarking PCHAN, a scientific application code used in recent research on HPCx. The performance of the different MPI implementations is then compared using the PCHAN T1 benchmark results on both Lomond and the e-Science cluster.

6.1 Introduction of the code

PCHAN was developed by the UK Turbulence Consortium (UKTC) and is designed to simulate fluid flow and study turbulence using Direct Numerical Simulation (DNS) techniques [14]. It employs many advanced features including high-order central differencing, a shock-preserving advection scheme, entropy splitting of the Euler terms and a stable boundary scheme [14]. PCHAN is a Fortran 90 code and can run in parallel using MPI on a range of high-performance platforms. In this project it was ported to both Lomond and the e-Science cluster and benchmarked with OpenMPI, MPICH2, LAM/MPI and native Sun MPI (the latter only on Lomond). The benchmark used is a simple turbulent channel flow on a 120x120x120 grid8. PCHAN was benchmarked on HPCx in earlier research comparing the performance of HPCx phase2 with phase2a [15]; the T3 benchmark was shown to achieve high performance and nearly ideal scaling on HPCx, and to do even better on phase2a because of the improvements in cache and memory. Figure 6.1 shows the results of the T1 benchmark used in this project on HPCx phase2a. The positive gradient of the curve demonstrates that the scaling of T1 is not as ideal as that of T3 in the earlier research.

8 The T1 (120x120x120) benchmark was used in this project because T2 (240x240x240) and T3 (360x360x360) require too much memory per processor, more than the e-Science cluster and Lomond can provide.


The main communication within PCHAN is halo-swapping between adjacent computational sub-domains. The large problem size of the T3 benchmark gives a small surface-area-to-volume ratio for each sub-domain [14]; the smaller T1 grid gives a correspondingly larger ratio, so communication costs constitute a bottleneck for the T1 benchmark to a certain extent. It can be anticipated that the effect of communication cost on the benchmark will be even worse on a system without a very efficient interconnect (such as the e-Science Cluster), where the performance of the MPI library is critical to the overall performance of PCHAN.
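As a back-of-envelope illustration of the surface-to-volume argument (assuming an idealised cubic decomposition over P processors, which is not necessarily how PCHAN actually partitions the grid):

#include <math.h>
#include <stdio.h>

int main(void)
{
    int grids[2] = {120, 360};      /* T1 and T3 grid edge lengths */
    int P = 27;                     /* example processor count     */
    int g;

    for (g = 0; g < 2; g++) {
        double side = grids[g] / cbrt((double)P);   /* sub-domain edge */
        printf("N=%d: sub-domain edge %.1f, surface/volume = %.3f\n",
               grids[g], side, 6.0 / side);
    }
    return 0;
}

At the same processor count the T1 sub-domains are three times smaller in each dimension than the T3 ones, so their surface-to-volume ratio, and hence the relative weight of halo communication, is roughly three times larger.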

[Figure 6.1: CPU time per step (cpu*sec) of the T1 benchmark on 1 to 256 processors using the native IBM MPI.]

Figure 6.1 PCHAN T1 Benchmark on HPCx Phase2a using native MPI

6.2 Porting code

When porting the code, the scripts in the PCHAN package produce one separate executable for each processor count on a given target machine; the executables of the T1 benchmark on Lomond and the e-Science cluster are T1-1.x, T1-2.x … T1-48.x. They were produced by the following steps: 1. Edit the Makefile to add the parameters for the specific targets, Lomond and e-Science cluster with the different MPI implementations, as described below. Figures 6.2 and 6.3 in Appendix A list the parameters for Lomond and the e-Science cluster with MPICH2.

CPP=cpp specifies the default cpp as the C preprocessor that handles the macros in the .F Fortran 90 source files; each .F file is converted to a .f file with the same base name.

OPTIONS2=-DMPI defines the MPI macro to activate the MPI-specific code,


which is enclosed in conditional compilation.

CPPFLAGS specifies the options used when preprocessing the original code.

FC specifies the parallel Fortran 90 compiler executable used to compile the .f files to .o files.

FFLAGS gives the Fortran 90 compiler optimisation options.

LIBS specifies the libraries needed in the compile and link phases.

Table 6.1 in Appendix A contains the values of the four key parameters for all the MPI implementations used in this project.

Note that LAM/MPI doesn’t provide mpif90 wrapper, but its mpif77 does support Fortran 90 codes. To ensure an f90 compiler is used, we gave –showme option to see the name of the back-end compiler that will be invoked, and also added –f77=%none to shut down f77 support of the f90 compiler. The OpenMPI implementation on Lomond had the same issue, as mpif90 wrapper was not built9. mpif77 –f77=%none is used to replace mpif90.

2. Edit the compile script to specify the right target.
3. Edit the compT1 script to uncomment the appropriate compile calls, so that T1 executables are built for the required numbers of processors.
4. Build the set of executables for the T1 benchmark by executing ./compT1.
5. Finally a set of executables T1-1.x, T1-2.x … T1-48.x is produced. To run the PCHAN T1 benchmark, use the following format:
mpirun -np n ./T1-n.x < T1.in
T1.in holds the input values and parameters of the T1 benchmark. The code produces an ASCII static file after running.

9 All the MPI implementations were installed on Lomond and Scaliwag by Erik McClements; the MPI installation work on the e-Science cluster was done by Yuan WAN, the author of this dissertation.


The porting procedure is now complete.

6.3 Problems and solutions in PCHAN porting work

It is known that porting scientific applications is not a trivial job, and some unexpected problems unavoidably arise during the procedure. In this section the problems and difficulties are summarised and the ways they were solved are described.

Problems on Lomond

The porting problem on Lomond was mainly caused by the OpenMPI installation. As the OpenMPI installation failed with the default Sun Forte Developer 7 Fortran 95 compiler10 (version 7.0), it was instead installed with the Sun Studio 10 Fortran 95 compiler11 (version 8.1). However, most applications improve their runtime performance significantly with Fortran compilers later than version 8.0: at the high optimisation levels -xO4 or -xO5 the newer compiler may inline contained procedures, including those with assumed-shape, allocatable or pointer arguments, and so obtain higher performance than the older versions [16]. Since the aim of the benchmark was to compare the performance of different MPI implementations, using a compiler with different performance would be unfair and would produce incorrect conclusions. One possible solution is to estimate the T1 benchmark results that OpenMPI would have obtained if it had been installed with the default f90 (version 7.0). Since the new compiler gains its extra performance through deeper inlining, it can be treated as a higher optimisation level of the default Fortran compiler. We name 'f90 (v7.0) -xO5'12 level I, 'f90 (v8.1) -xO5' level II and 'f90 (v8.1) -xO3' level III; the sequential runtimes of the T1 benchmark at levels I, II and III are 265.591 sec, 115.673 sec and 157.322 sec per step.

10 Located at /opt/SUNWspro/bin on Lomond.
11 Located at /opt/SUNWspro10/SUNWspro/bin on Lomond.
12 The full optimisation options of level I are '-xO5 -r8const -xtypemap=real:64 -xchip=ultra3 -xarch=v8plusb -xcache=64/32/4:8192/64/1'; levels II and III use the same options with -xO5 and -xO3 respectively.

The parallel running time is composed of two parts: the communication time Tcomm and the calculation time Tcalc.


Tcomm reflects the performance of the MPI library, while Tcalc is determined by the compiler and the optimisation level (see Figure 6.4).

Figure 6.4 (left) the composition of the parallel running time; Figure 6.5 (right) extrapolating the target parallel running time
Note: Figures 6.4 and 6.5 use assumed curve values to illustrate the evaluation method; they were not plotted from real running-time results in this project.

The following calculation computes TI13, the target parallel running time of level I, from TII and TIII, the known parallel running times of levels II and III (see Figure 6.5). For a fixed value on the CPU axis there are three parallel running time values t1, t2 and t3, lying on the TI, TII and TIII curves respectively. As discussed above, each is composed of two parts:

t1 = Tcomm1 + Tcalc1
t2 = Tcomm2 + Tcalc2
t3 = Tcomm3 + Tcalc3

Since all these times were measured with the same MPI (OpenMPI), they share the same communication cost:

Tcomm1 = Tcomm2 = Tcomm3

We roughly measure the performance of the different optimisation levels by their sequential running times (level I: 265.591 sec, level II: 115.673 sec, level III: 157.322 sec) and obtain the proportional relationship

Tcalc1 : Tcalc2 : Tcalc3 = 265.591 : 115.673 : 157.322

13 TI represents the group of running time results on one performance curve; the same holds for TII and TIII.


From these relations,

t3 - t2 = Tcalc3 - Tcalc2 = ((Tcalc3 - Tcalc2) / Tcalc1) × Tcalc1

so that

Tcalc1 = (Tcalc1 / (Tcalc3 - Tcalc2)) × (t3 - t2)

Substituting this into the expression for t1:

t1 = Tcomm1 + Tcalc1 = (t2 - Tcalc2) + Tcalc1 = t2 + ((Tcalc1 - Tcalc2) / (Tcalc3 - Tcalc2)) × (t3 - t2)

t1 = ((Tcalc1 - Tcalc2) / (Tcalc3 - Tcalc2)) × t3 + ((Tcalc3 - Tcalc1) / (Tcalc3 - Tcalc2)) × t2

Replacing the two fractions by their values from the sequential-time ratio, the target parallel running time can be worked out:

t1 = 3.602 × t3 - 2.602 × t2

Applying the above formula to each point on the TII and TIII curves gives the TI curve. Table 6.2 contains the running time values on the TII and TIII curves and the extrapolated values for the target TI curve (cpu time per step).

P     TII       TIII      TI
1     106.899   146.611   249.942
2     104.230   149.536   267.422
4     109.840   155.172   273.126
8     119.280   165.408   285.433
16    119.360   177.600   329.140
24    119.496   174.552   317.808
32    122.368   180.512   331.803

Table 6.2 parallel running time values at the different optimisation levels
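The extrapolation in Table 6.2 can be reproduced mechanically; the short program below (written for this discussion, not part of the project's scripts) applies t1 = 3.602 × t3 - 2.602 × t2 to the measured level II and level III columns:

#include <stdio.h>

int main(void)
{
    int    p[7]    = {1, 2, 4, 8, 16, 24, 32};
    double tII[7]  = {106.899, 104.230, 109.840, 119.280, 119.360, 119.496, 122.368};
    double tIII[7] = {146.611, 149.536, 155.172, 165.408, 177.600, 174.552, 180.512};
    int i;

    for (i = 0; i < 7; i++)
        printf("P=%2d  TI ~ %7.3f cpu*sec per step\n",
               p[i], 3.602 * tIII[i] - 2.602 * tII[i]);
    return 0;
}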

Now we can estimate the T1 benchmark results that would have been measured using OpenMPI with the default f90.

Problems on e-Science cluster

One problem on the e-Science cluster is that there is no Fortran 90 compiler in the GNU tools (gcc, g++, g77) on which the public domain MPI is built, and g77 does not accept Fortran 90 source, so an f90 compiler had to be found. Fortunately, an Intel Fortran Compiler (version 8.1.025) is installed on the e-Science cluster. To invoke this compiler we need to apply for a new license and add its location to the environment variable INTEL_LICENSE_FILE as follows:


export INTEL_LICENSE_FILE=/home/s0566708/intel/licenses:$INTEL_LICENSE_FILE

We can then call ifort to compile Fortran codes. OpenMPI, MPICH2 and LAM/MPI were re-installed with the Intel Fortran Compiler by specifying ifort as the value of the relevant variable (FC or F90); Figures 6.6-6.8 in Appendix A list the procedure for installing OpenMPI, MPICH2 and LAM/MPI with the Intel Fortran Compiler. Another problem appeared when running PCHAN on the e-Science cluster: reading the input file fchan_120cube.bin caused an error. The reason is that fchan_120cube.bin is a binary file whose format cannot be read on the Intel P4 processors, although the UltraSPARC processors on Lomond accept it without problems. To solve this, the original code was changed to read an ASCII-formatted input file. The ASCII input file was created on Lomond by adding a few lines to the original code there to write out, in ASCII, the data read in from the binary input; PCHAN on the e-Science cluster could then use this ASCII file as the replacement input. To use the new ASCII input file, the input file name and the binary flag in the configuration file T1.in were changed; there is no need to change the PCHAN code itself, as it determines the type of the input file from the binary flag in T1.in and applies the appropriate read operations for each type. With these problems on the e-Science cluster solved, we can turn to the benchmark work.

6.4 Benchmark on Lomond

PCHAN T1 benchmark was done on Lomond using OpenMPI and MPICH2 compared with native Sun MPI14. T1 benchmark is a simple turbulent channel flow on a 120x120x120 grid and the number of steps was specified as 10.

14 It is a pity that LAM/MPI was not used to benchmark PCHAN on Lomond, although the porting work with LAM/MPI was successful. The likely reason is that LAM daemons were left behind on the back end of Lomond: the problem appeared after an unexpected crash of a LAM run on the back end, where the code quit halfway through and never called lamhalt to clear the daemons. Since the daemons remained on some of the 48 back-end processors, it was hard to locate and clean them all.


The results of each run contain four time values: initial time, time for one step, wrap-up time and total time. As mentioned in Chapter 5, the value we care about is the time for one step, which is calculated from the time spent in the major loop, itself accounting for over 99% of the total time. Figure 6.9 shows the CPU time per step of the T1 benchmark results on Lomond.

[Figure 6.9: CPU time per step (cpu*sec) on 1 to 32 processors for Sun MPI, MPICH2 and OpenMPI.]

Figure 6.9 CPU Time per Step of PCHAN T1 benchmark on Lomond

Note: the performance curve of OpenMPI was extrapolated from the two other groups of OpenMPI parallel running time results obtained with f90 version 8.1; the extrapolated curve looks reasonable in Figure 6.9.

As expected, Sun MPI showed good scaling; its curve is even better than the ideal horizontal line, with the total time cost decreasing when running on more than 24 processors. This can be explained by a cache effect: the total L2 cache reaches 256 Mbyte when 32 CPUs are involved, so about one fifth of the data can reside in the much faster L2 cache instead of memory (the total data when running the PCHAN T1 benchmark is about 1.4 Gbyte), which contributes to the performance improvement. MPICH2 again showed excellent performance, as it did in the MPI Coursework Code benchmark: its curve is very close to that of native Sun MPI, showing that MPICH2 delivers native-level performance when running PCHAN on Lomond. OpenMPI cannot match Sun MPI and MPICH2, and the difference grows at higher processor counts, indicating that OpenMPI's heavy communication cost seriously affects its scaling.


To understand the application results, we must establish what kind of code PCHAN is. Vampir provides powerful graphical MPI trace tools that help analyse the behaviour of a PCHAN run.

Figure 6.10 Vampir MPI trace views of the PCHAN T1 benchmark
Note: Figure 6.10 gives MPI trace views produced with Vampir. View (A) records all communication actions within one step; view (B) records the point-to-point communication functions; view (C) shows the call tree, which lists statistics for the MPI functions used. The most frequent functions have been circled.

The Vampir MPI trace analysis shows that PCHAN is not a communication-heavy code. Figures 6.10 (A) and (B) are Vampir timelines of the PCHAN T1 benchmark: there are only four small groups of point-to-point message passing within each step. The call tree also reveals that the point-to-point communication functions MPI_Isend and MPI_Recv dominate the message-passing calls in the whole program, and the messages identified in the point-to-point communication are around 768,000 bytes in size.


The communication character of PCHAN can therefore be summarised as infrequent point-to-point communication of large messages. The performance of an application with this character tends to be governed by the point-to-point bandwidth in the large-message section. Figure 6.11 reproduces the results of the Pingpong low-level benchmark on Lomond.

[Figure 6.11: Pingpong bandwidth (Mbyte/sec) against message size on Lomond for OpenMPI, MPICH2, LAM/MPI and Sun MPI, with the 768,000-byte PCHAN message size marked.]

Figure 6.11 low-level Pingpong benchmark result on Lomond

In the low-level benchmark OpenMPI achieved only about 80% of the bandwidth of MPICH2 and Sun MPI in the large-message section, while its latency was up to 4 times higher. This poor P2P bandwidth left OpenMPI's application performance far behind. MPICH2 has slightly better bandwidth than Sun MPI but longer latency, which in the end gives MPICH2 application performance very similar to that of native Sun MPI.

6.5 Benchmark on e-Science Cluster

The PCHAN T1 benchmark was run on the e-Science cluster with the whole suite of public domain MPI implementations, the number of steps again set to 10. As before, the value we care about is the time per step. Figure 6.12 shows the zoomed performance curves of the T1 benchmark results on the e-Science cluster.


[Figure 6.12: CPU time per step (cpu*sec) of the T1 benchmark on 2 to 12 processors for OpenMPI, MPICH2 and LAM/MPI.]

Figure 6.12 CPU Time per Step of PCHAN T1 benchmark on e-Science Cluster

The three curves in Figure 6.12 represent the PCHAN application performance of the three public domain MPI implementations. All of them fall at the beginning, are followed by a flat section, and rise at the end. The negative gradient at the beginning is not caused by increasing cache size: the e-Science cluster provides 1 Mbyte of L2 cache per processor (2 Mbyte per two-processor node), while PCHAN requires over 1.4 Gbyte of memory at runtime, so the ratio of required memory to available cache (about 700:1) is far too large for cache to have an obvious effect on performance. The real cause of the improvement is a memory effect. When running PCHAN on one processor, the required memory (over 1.2 Gbyte) exceeds the memory available to one processor (1.0 Gbyte), causing data to be swapped between memory and disk, which costs a great deal of performance. On two processors the total memory is still not large enough (about 800 Mbyte is required on each processor), so part of the data still has to be swapped in and out at times, costing a lot of extra time15. As the number of processors increases, the sub-domain assigned to each processor becomes small enough for the data to be held entirely in memory; the swapping disappears, which appears as a performance improvement.

15 This is a common problem on distributed-memory machines: a code that requests a large amount of memory at runtime often cannot be run on a small number of processors because of the memory limit, even though each processor is fast enough in many cases. A shared-memory machine (Lomond) or a cluster of SMPs (HPCx) does not have this problem.


When more processors are used the memory effect disappears, but the communication cost between processors grows quickly, so the performance curves are soon pulled up again. LAM/MPI outperformed MPICH2, providing the best performance, although the difference between these two MPI is hard to see before the 12-processor point. Compared with LAM/MPI and MPICH2, OpenMPI's performance fell behind, but not by much. It is known from the earlier Vampir trace analysis that PCHAN does not make many MPI calls: there are a few, but large, point-to-point communications within each step, and these determine the application performance of each MPI.

[Figure 6.13: Pingpong bandwidth (Mbyte/sec) against message size on the e-Science cluster for OpenMPI, MPICH2 and LAM/MPI, with the 768,000-byte PCHAN message size marked.]

Figure 6.13 bandwidth results of Pingpong benchmark on e-Science cluster

Note: in Figure 6.13, the bandwidth curves of MPICH2 and LAM/MPI almost overlap, appearing as a single curve.

[Figure 6.14: bar chart of zero-byte message latency (usec) on the e-Science cluster. Values: OpenMPI 165.42, MPICH2 119.56, LAM/MPI 109.96.]

Figure 6.14 latency results of Pingpong benchmark on e-Science cluster


The bandwidth and latency of the low-level Pingpong benchmark in Figures 6.13 and 6.14 explain why LAM/MPI outperformed MPICH2 in the T1 benchmark: LAM/MPI and MPICH2 have nearly the same bandwidth for P2P communication of large messages, but LAM/MPI can start a communication faster than MPICH2, and this latency advantage lets LAM/MPI beat its opponent. The application performance of OpenMPI is somewhat unexpected: OpenMPI provided higher bandwidth than the other MPI in the low-level benchmark, yet its performance fell behind in the T1 benchmark. A possible reason is that OpenMPI's application performance is held back by its high latency.

A simple quantification was made to compare the effects of latency and bandwidth. From the IMB Pingpong results, OpenMPI, with about 85 Mbyte/sec bandwidth, takes roughly 8000 usec to perform one P2P communication of an 800,000-byte message, so its latency (165.42 usec) accounts for only about 2.1% of the total communication time. On this reasoning latency should have little effect compared with bandwidth, and OpenMPI would be expected to give the best PCHAN performance on the e-Science cluster, which is not consistent with the real application benchmark results. A possible explanation is that more than one message is being sent at a time, so MPICH2 may achieve good bandwidth for application codes that send multiple messages even though it is not as good in the single-message Pingpong. The unexpected OpenMPI behaviour is hard to explain at this point and requires further study.

6.6 Summary

PCHAN was developed by UKTC and uses Direct Numerical Simulation (DNS) techniques to study turbulent fluid flow. Compared with the other application codes used in this project (NAMD and LAMMPS), it is not a big code, only around 12,000 lines of Fortran 90, and its porting procedure is simpler and faster than that of the MD codes because no other packages or libraries are required. Thanks to its small size and simple porting procedure it is a very portable application code, and the porting work with all three MPI implementations succeeded on both machines. The T1 benchmark results on Lomond show that MPICH2 should be recommended there, because it gave fully native-level performance, matching Sun MPI.


MPICH2 also exhibited better stability than OpenMPI and LAM/MPI during the porting and benchmarking work. However, MPICH2's 'best' title is taken by LAM/MPI on the e-Science cluster: although MPICH2 still gave satisfactory performance, it was outperformed by LAM/MPI on the performance curve. Compared with the other two MPI, OpenMPI's performance ranked last on both machines.


Chapter 7

NAMD Benchmark

NAMD is a molecular dynamics application code commonly used in biology and computational chemistry research. This chapter demonstrates how to build NAMD on Lomond and compares the results of the Apoa1 benchmark using a range of public domain MPI implementations.

7.1 Introduction of the code

NAMD was developed by the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, University of Illinois [17]. It is a molecular dynamics program designed for high-performance simulation of large biomolecular systems. NAMD uses object-oriented technology and is built on the Charm++ parallel objects system, allowing it to run with good scaling on a range of parallel platforms, from high-end HPC machines with hundreds of processors to commodity clusters connecting tens of processors with Gigabit Ethernet [18]. The latest release at the time of this project (version 2.6b1) was used. Compared with the prior version, 2.6b1 has several advantages, including [19]:
- Ports to Itanium, Altix, and Opteron/Athlon64/EMT64.
- Improved serial performance on POWER and PowerPC.
- Adaptive biasing force free energy calculations.
- Tcl-based boundary potentials.
- Reduced memory usage for unusual simulations.
- Support for OPLS force field.
The Apoa1 benchmark was used in this project; it is a commonly used MD benchmark containing a 108 x 108 x 80 PME (Particle Mesh Ewald) grid.


Apoa1 was also used in the earlier performance research on HPCx phase2a. Figure 7.1 shows the Apoa1 benchmark results on HPCx, running for 1000 steps; the results form a slowly ascending curve, showing that NAMD achieved good scaling on HPCx. Unlike PCHAN, NAMD does not appear to be sensitive to memory size: according to the earlier report it did not benefit from the improved memory architecture going from phase2 to phase2a.

[Figure 7.1: CPU time per step (cpu*sec) of the Apoa1 benchmark on 16 to 256 processors using the native IBM MPI.]

Figure 7.1 NAMD Apoa1 Benchmark on HPCx Phase2a using native MPI

7.2 Porting code

It is known that porting NAMD is non-trivial because, unlike the Coursework Code and PCHAN, NAMD relies on a number of other packages to work correctly, including FFTW, TCL, Charm++ and plug-ins. These packages have to be installed before NAMD itself. As the porting work on the e-Science cluster failed, the following section only describes the porting work on Lomond.

Install packages16

16 Before the NAMD work began, the environment variables PATH and LD_LIBRARY_PATH were set to select the public domain MPI to be used.


Since precompiled Solaris binaries of FFTW, TCL and the plug-ins are available on the NAMD website, installing these packages on Lomond is very simple: download the Solaris versions of the FFTW, TCL and plug-in packages from http://www.ks.uiuc.edu/Research/namd/libraries/ and unpack them into the NAMD porting directory on Lomond, following the steps shown in Figure 7.2 in Appendix A.

When building Charm++, several arguments must be chosen for the build script in the Charm++ source directory. The full command syntax is build <target> <version> [options] [--basedir=dir] [--libdir=dir] [--incdir=dir]. <target> specifies the parts of Charm++ to compile; "charm++" is the most commonly used target and compiles the key Charm++ executables and runtime libraries. <version> defines the CPU, operating system and communication layer of the machine; it combines three factors: the way to communicate, the operating system, and other options. Here "mpi-" was selected to make Charm++ communicate using MPI calls and "sol" indicates that the Solaris system is used on Lomond; no other options were needed. It is critical to specify which MPI to use by setting basedir to the install location of that MPI. Charm++ can then be installed by executing the build script; Figure 7.3 in Appendix A gives the operations to build Charm++ with MPICH2. To use another MPI, just replace the value of basedir with the install location of the MPI to be used.

An error saying 'cannot find command mpiCC' occurred while building Charm++ with MPICH2, because MPICH2's C++ compile command (mpicxx) does not match the default MPI C++ command name (mpiCC). One way to solve this is to make a soft link from mpiCC to mpicxx in the shell; the other is to change the value of CMK_CXX in the file charm-5.9/src/arch/mpi-sol/conv-mesh.sh to mpicxx. Re-building Charm++ then completes without error.

Build NAMD

After all the packages were installed, the porting work proceeded to building NAMD itself, which takes three steps. First, edit the configuration files under the NAMD_2.6b1_Source directory to give the correct locations of the installed packages; these files are ./Make.charm, arch/Solaris-Sparc.fftw, arch/Solaris-Sparc.tcl and arch/Solaris-Sparc-MPI.arch.


Build NAMD

After all the packages were installed, the porting work proceeded to building NAMD itself, which takes three steps. First, edit the configuration files under the NAMD_2.6b1_Source directory to specify the correct locations of the installed packages; these files are ./Make.charm, arch/Solaris-Sparc.fftw, arch/Solaris-Sparc.tcl and arch/Solaris-Sparc-MPI.arch. Then configure NAMD with the auto-configure script and several arguments; this step creates the compilation environment and the Makefile. Finally, enter the newly created directory and execute the Makefile to compile NAMD 2.6b1. The executable is named namd2. Figure 7.4 in Appendix A gives the procedure for building NAMD.

A simple example, alanin, was used to test the namd2 executable with the different MPI implementations:

Sun MPI: mprun -np 4 ./namd2 src/alanin
MPICH2: mpiexec -n 4 ./namd2 src/alanin
LAM/MPI: mpirun -np 4 ./namd2 src/alanin

The NAMD porting work is complete at this point.

7.3 Benchmark on Lomond

There are four parameter files in the Apoa1 benchmark: two CHARMM force fields in X-PLOR format (par_all22_prot_lipid.xplor and par_all22_popc.xplor), an X-PLOR format PSF file describing the molecular structure (apoa1.psf) and the initial coordinates of the molecular system in the form of a PDB file (apoa1.pdb) [23]. There is also a configuration file, apoa1.namd, which links all these parameters together for NAMD and contains all the other critical settings of the Apoa1 benchmark. It shows that the Apoa1 benchmark has 7426 seeds and a 108 x 108 x 80 PME grid. The number of run steps was set to 50, because Lomond is not as fast as HPCx and 50 steps can be completed in a reasonable time. The benchmark was run with native Sun MPI, MPICH2 and LAM/MPI separately. Unfortunately, the porting work with OpenMPI failed17, so this MPI was dropped from the NAMD benchmarking. Figure 7.5 shows the NAMD Apoa1 benchmark results on Lomond.

17 The compilation of NAMD with OpenMPI succeeded, but the executable crashed at start-up when run on Lomond.
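A benchmark run itself presumably followed the same pattern as the alanin test in section 7.2; for example, with MPICH2 on 16 processors (the path to apoa1.namd is an assumption):

$ mpiexec -n 16 ./namd2 apoa1/apoa1.namd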


[Plot: CPU time per step against number of CPUs (1 to 32) for Sun MPI, MPICH2 and LAM/MPI]

Figure 7.5 NAMD Apoa1 benchmark results on Lomond

The Apoa1 benchmark results are ascending curves no matter which MPI is used; the increasing time per step comes from the additional cost of communication between the growing number of processors. The three curves are very close to each other up to 16 CPUs, which indicates that Sun MPI, MPICH2 and LAM/MPI provided similar performance for the NAMD Apoa1 benchmark when using no more than 16 processors on Lomond. In the low-level PingPong and SendRecv benchmarks these three MPI implementations achieved quite similar bandwidth and latency; for example, the latency difference between the fastest MPI (Sun MPI) and the second (MPICH2) was less than 13%, and this difference kept decreasing as the number of processors increased. When more than 16 processors were used, the performance differences between the MPI implementations began to be magnified: LAM/MPI was no longer able to keep the same level of scaling as the other two. This was a foreseeable result, as LAM/MPI's collective communication performance was not as good as that of MPICH2 and Sun MPI in the low-level benchmarks. In addition, MPICH2 slightly outperformed Sun MPI at high processor counts in the Apoa1 benchmark, which can be explained by MPICH2's small bandwidth advantage over Sun MPI in the PingPong benchmark. Nevertheless, a conclusion that MPICH2 is definitely better than the native MPI would not be entirely correct, because their results are in fact still very close compared with those of LAM/MPI. Turning to the speedup graph, it is very clear that the MPICH2 and Sun MPI speedup curves basically overlap and lie above the LAM/MPI curve, which shows that these two MPI implementations achieved almost the same scaling in the Apoa1 benchmark. See the speedup curves in figure 7.6.
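For reference, the speedup plotted in figure 7.6 is presumably the standard definition (the dissertation does not state it explicitly), where T(p) is the time per step on p processors and the "Linear" curve is the ideal case:

S(p) = \frac{T(1)}{T(p)}, \qquad S_{\mathrm{linear}}(p) = p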


[Plot: speedup against number of CPUs (up to 32) for Sun MPI, MPICH2, LAM/MPI and the linear ideal]

Figure 7.6 Speedup of NAMD Apoa1 benchmark results on Lomond

A more reasonable reading of the above benchmark results is therefore that MPICH2 and Sun MPI achieved very similar performance in the NAMD Apoa1 benchmark, and that both outperform LAM/MPI when running on more than 16 processors on Lomond.

7.4 Summary

NAMD is a C++ parallel molecular dynamics code developed by the University of Illinois. It is built on Charm++ and many other packages, which give NAMD powerful functionality and good scaling on a range of HPC machines, but also make its porting work complicated and difficult. In the Apoa1 benchmark on Lomond, MPICH2 showed similar performance to the native MPI. Although LAM/MPI also sustained close performance at first, it eventually lost out to MPICH2 and Sun MPI at high processor counts. OpenMPI did not survive the porting work, so it was dropped from the benchmarking. On the e-Science cluster the porting work failed entirely; to compare the MPI performance of MD codes there, a backup code, LAMMPS, was benchmarked on that machine (see the next chapter).


Chapter 8

LAMMPS Benchmark

Due to the failure to port NAMD to the e-Science cluster, LAMMPS was benchmarked on this machine to provide a comparison of MPI performance. As a backup code, LAMMPS has many NAMD-like features (it is a molecular dynamics code, written in C++, and depends on other packages), and even its porting procedure is quite similar. This chapter introduces LAMMPS and its porting to the e-Science cluster, as well as the benchmark results using all three public domain MPI implementations.

8.1 Introduction of the code

LAMMPS, which stands for Large-scale Atomic/Molecular Massively Parallel Simulator, is a molecular dynamics code that models an ensemble of particles in all kinds of states. It uses a variety of force fields and boundary conditions to model atomic, polymeric, biological, metallic or granular systems. Although LAMMPS can run efficiently on a single-processor PC, it is designed to run in parallel on a range of HPC platforms; any machine with a C++ compiler and an MPI implementation should be able to run LAMMPS in parallel [20]. To use the particle-particle particle-mesh (PPPM) option in LAMMPS for long-range Coulombics, a 1d FFT library must be installed on the platform [21]. LAMMPS was first developed by Sandia National Laboratories, a US Department of Energy facility, with funding from the DOE [20]. It is a copyrighted code that is distributed free of charge; provided the GPL LICENSE file and source file headers are kept, anyone may use or modify the code and publish a new release. The latest release is version 17 July 2006; version 12 April 2006 was benchmarked in this project [20].


The main reason LAMMPS was used in this project is that the NAMD porting work failed on the e-Science cluster; porting and benchmarking a NAMD-like code on that machine could still provide some comparisons. As a backup code, LAMMPS shares many characteristics with NAMD: both are molecular dynamics codes primarily designed for modelling biological molecules; both are written in C++; both work with external FFT packages; and both use spatial-decomposition approaches. Since LAMMPS was used in previous research on HPCx, pre-compiled binaries of different versions exist on that machine, and it is interesting to look at the performance and scaling of LAMMPS on HPCx. The following figure shows the Rhodopsin benchmark performance of LAMMPS version 12 April 2006 on HPCx with the native MPI.

[Plot: CPU time per step (cpu*sec) against number of CPUs (4 to 256) for IBM MPI]

Figure 8.1 LAMMPS Rhodopsin benchmark on HPCx Phase2a using native MPI

The LAMMPS performance curve on HPCx has a similar shape to the NAMD curves: it gradually ascends as the number of processors increases. The growing gradient indicates that communication between large numbers of processors took a considerable share of the time in the Rhodopsin benchmark; the more processors were used, the lower the parallel efficiency of the code.

8.2 Porting code

Similar to NAMD, porting LAMMPS consists of two steps: installing the FFTW library and compiling LAMMPS against a specific MPI.


Install FFTW18

FFTW is a library of C subroutines for computing the Discrete Fourier Transform (DFT) in one or more dimensions; it supports arbitrary input sizes and both real and complex data [22]. Two versions of FFTW are currently maintained, FFTW3 and FFTW2, which have incompatible APIs. In this project FFTW version 2.1.5 was installed, because MPI parallel transforms are still only available in FFTW2. The operations in figure 8.2 in Appendix A built the FFTW complex and real transform libraries along with the test programs on the e-Science cluster. Since no compiler was specified, gcc was used as the default C compiler on the GNU/Linux system. The make install command installed the fftw and rfftw libraries in the directory specified by the --prefix flag.

Compiling LAMMPS

LAMMPS provides a top-level Makefile in the src directory and a group of pre-configured low-level Makefiles (Makefile.*) in the src/MAKE directory to automate the compilation of LAMMPS for particular machines and options. Unfortunately, the low-level Makefile had to be configured manually in this project, as none of the pre-configured low-level Makefiles fits the combination of machine and options used here. The low-level Makefiles to be configured were named Makefile.ompi (using OpenMPI), Makefile.mpich2 (using MPICH2) and Makefile.lam (using LAM/MPI), and it was convenient to derive all three from one pre-configured Makefile, Makefile.g++, which is set up for Linux + g++ compiler + MPICH + FFTW. Figure 8.3 in Appendix A lists the low-level Makefile for MPICH2 on the e-Science cluster, Makefile.mpich2, and the following list explains the steps taken to turn the original Makefile.g++ into Makefile.mpich2 (a short excerpt of the resulting settings is sketched after the list):

Keep g++ as the C++ compiler and linker on the e-Science cluster.

CCFLAGS specifies the include directories of MPICH2 and the FFTW library via the -I switch. It also selects FFTW for the DFFT by defining -DFFT_FFTW.

18 Here FFTW can be replaced by other vendor-provided libraries, such as the DFT libraries of Intel, DEC, SGI and SCSL.


LINKFLAGS specifies the lib directories of MPICH2 and the FFTW library via the -L switch.

USRLIB states which user libraries are linked into the LAMMPS build: -lfftw for FFTW and -lmpich for MPICH2.

Ignore all dependency operations and settings.

Makefile.ompi and Makefile.lam were produced by similar steps with different flag values; Table 8.1 in Appendix A lists the values of the key flags in Makefile.ompi, Makefile.mpich2 and Makefile.lam.
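For orientation, the key settings of Makefile.mpich2 end up as in the excerpt below (taken from figure 8.3 and table 8.1 in Appendix A); the final make invocation is an assumption based on the usual LAMMPS convention that make <name> in the src directory uses MAKE/Makefile.<name> and produces lmp_<name>:

# excerpt from src/MAKE/Makefile.mpich2
CC        = g++
CCFLAGS   = -g -O -I/home/s0127384/mpich2/include -I/home/s0566708/fftw/include \
            -DFFT_FFTW -DGZIP -DMPICH_IGNORE_CXX_SEEK
LINK      = g++
LINKFLAGS = -g -O -L/home/s0127384/mpich2/lib -L/home/s0566708/fftw/lib
USRLIB    = -lfftw -lmpich

# assumed build command, run from the LAMMPS src directory
$ make mpich2     # should produce the executable lmp_mpich2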

Problem in building LAMMPS

When building LAMMPS with MPICH2 on the e-Science cluster, one problem caused the following error message:

"SEEK_SET is #defined but must not be for the C++ binding of MPI"

This is a particular issue of MPICH2 for C++: SEEK_SET, SEEK_CUR and SEEK_END are used by both stdio.h and the MPI C++ interface, which is a bug in the MPI-2 standard [7]. There are two ways to solve the problem: add

#undef SEEK_SET
#undef SEEK_END
#undef SEEK_CUR

before mpi.h is included, or add the definition -DMPICH_IGNORE_CXX_SEEK to the command line [7]. The latter method was applied in this project, and -DMPICH_IGNORE_CXX_SEEK was added to CCFLAGS in Makefile.mpich2 (see table 8.1).

LAMMPS was successfully built on the e-Science cluster using OpenMPI, MPICH2 and LAM/MPI; the executables were named lmp_ompi, lmp_mpich2 and lmp_lam respectively. To run LAMMPS in parallel on the e-Science cluster, use the following commands:


OpenMPI: mpirun -np 4 lmp_ompi < in.melt
MPICH2: mpiexec -n 4 lmp_mpich2 < in.melt
LAM/MPI: mpirun -np 4 lmp_lam < in.melt

8.3 Benchmark on e-Science cluster

The Rhodopsin benchmark used in previous HPCx research was adopted in this project. It models an all-atom rhodopsin protein in a solvated lipid bilayer with the CHARMM force field, long-range Coulombics via the particle-particle particle-mesh method and SHAKE constraints [25]. It contains 32,000 atoms and runs for 100 timesteps. Figure 8.4 shows the LAMMPS Rhodopsin benchmark results with the different MPI implementations on the e-Science cluster.
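A Rhodopsin run presumably followed the same pattern as the in.melt example in section 8.2; the input file name in.rhodo is the conventional name in the LAMMPS bench directory and is an assumption here:

$ mpiexec -n 8 lmp_mpich2 < in.rhodo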

[Plot: CPU time per step (cpu*sec) against number of CPUs (1 to 12) for OpenMPI, MPICH2 and LAM/MPI]

Figure 8.4 Results of LAMMPS Rhodopsin benchmark on e-Science cluster

In figure 8.4 all the CPU time per step curves ascend as the number of processors increases. Since the e-Science cluster does not have a highly efficient interconnect like HPCx, it is hard to keep the curves flat as more processors are used; the increasing cost of communication pushes the performance curves up, no matter which MPI is used. As they ascend, the three MPI implementations gradually form different performance curves according to their different scaling, and the gaps between them become more and more obvious. The three implementations had very close performance on 1 processor, but LAM/MPI gradually fell behind once more than 2 processors were used. OpenMPI and MPICH2 kept similar performance until more than 4 processors were involved, after which MPICH2 gradually pulled ahead of OpenMPI to achieve the best performance. Although OpenMPI lost out to MPICH2, it still kept its advantage over LAM/MPI. This result can be understood through an MPI trace view and the low-level benchmark results.

Figure 8.5 gives MPI trace views produced with Vampir. View (A) records the MPI_Send, MPI_Irecv and MPI_Wait function group in the timeline; view (B) records the MPI_Allreduce function in the timeline; the last view (C) shows the call tree, which lists statistics for the MPI functions used. The most frequently called functions have been circled.

Figure 8.5 Vampir MPI trace views of the LAMMPS Rhodopsin benchmark: (A) point-to-point communication in the timeline, (B) collective communication in the timeline, (C) call tree

The timeline views and call tree in Vampir indicate that the communications of LAMMPS are dominated by two kinds of operation: point-to-point communication (the MPI_Send, MPI_Irecv, MPI_Wait function group) and collective communication (the MPI_Allreduce function). In addition, the identified messages show that the point-to-point communications carry messages of around 140,000 bytes, while the collective communications operate on messages of 192 bytes. The performance of all these functions was benchmarked in chapter 6 with the Intel MPI Benchmarks, and some relevant results are quoted here to help understand the application benchmark results: figure 8.6 and figure 8.7 are quoted from the Pingpong benchmark and the Allreduce benchmark respectively.

[Plot: bandwidth (Mbyte/sec) against message size (0 to 1,200,000 bytes) for OpenMPI, MPICH2 and LAM/MPI]

Figure 8.6 low-level Pingpong benchmark result

[Plot: Allreduce time (usec) against number of CPUs (2 to 12) for OpenMPI, MPICH2 and LAM/MPI]

Figure 8.7 low-level Allreduce benchmark result

The two figures above show that the LAMMPS Rhodopsin benchmark results are consistent with the low-level Allreduce results in their performance ordering, which indicates that an MPI implementation's Allreduce performance largely decides its LAMMPS performance. Although OpenMPI has much higher point-to-point bandwidth than the other two MPI implementations, this advantage does not help OpenMPI beat MPICH2 in the real application benchmark.


Because one Allreduce operation takes much longer than one point-to-point message, the advantage in P2P communication is not enough to offset the performance gap in the Allreduce function. Counting from the Vampir timeline, there are over 12 Allreduce operations within one run step, so LAMMPS performance is clearly determined by these frequent Allreduce calls.

8.4 Summary

LAMMPS is a NAMD-like molecular dynamics code. It was ported to the e-Science cluster as a backup for NAMD in this project. MPICH2 ultimately achieved the best performance in the LAMMPS Rhodopsin benchmark, followed by OpenMPI, with LAM/MPI last. This ordering is determined by their performance in the low-level Allreduce benchmark: the Vampir MPI trace view showed that the cost of frequent Allreduce calls dominated the total communication cost of the LAMMPS application.


Chapter 9

Infiniband and Shared Memory

Although the traditional cluster design (ordinary PCs connected by Gigabit Ethernet) makes it easy to build one's own cluster, clusters of this kind have a serious limit on scaling: the inefficient communication mode (message passing over Gigabit Ethernet) makes it hard for a parallel code to profit from a large number of processors. New features such as Infiniband and shared memory have therefore been added to modern cluster designs to improve scaling. However, to have a real effect on application codes, these design features must also be supported by the MPI libraries. This chapter discusses whether the public domain MPI implementations can profit from these two features on a cluster machine.

9.1 Introduction of Infiniband and Shared Memory

The interconnect provides communication between processes on the individual servers. It plays a vital role in a cluster machine because it enables individual compute nodes to work together in an organized way to achieve high performance; an efficient interconnect helps exploit the full potential of the compute nodes and is therefore the key component affecting a cluster's scaling and performance [26]. The traditional design usually connects compute nodes with Gigabit Ethernet, a popular interconnect using the standard TCP/IP protocol. Since TCP/IP is already widely used in networking, this choice allows a cluster to be built flexibly and at minimal cost. However, the application benchmark results on the e-Science cluster reveal a serious drawback of this simple design: cluster machines of this kind cannot sustain good enough scaling to provide high performance on a large number of compute nodes.


In the MPI Coursework code 500x500 benchmark on the e-Science cluster, up to 80% of the parallel efficiency was lost when running on 12 processors. This is because TCP/IP is a high-CPU-overhead, high-latency protocol and is not well suited to HPC [26]. To improve cluster scaling to HPC level, newer technologies, represented by Infiniband and shared memory, have been added to modern cluster designs.
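For reference, the parallel efficiency quoted here is presumably the usual definition (a standard formula, not stated explicitly in the dissertation), where T(p) is the run time on p processors:

E(p) = \frac{T(1)}{p\,T(p)}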

Infiniband was developed in the late 1990s by the InfiniBand Trade Association and quickly became an interconnect of choice for HPC applications [26]. As a specialized cluster interconnect, it provides lower CPU usage and lower latency than Gigabit Ethernet, passing messages in a more efficient way. Infiniband has a point-to-point switched-fabric architecture that connects different end points, including host channel adapters (HCAs) used to connect host processors and target channel adapters (TCAs) used to connect independent I/O devices [26]. Infiniband gives applications user-level access to the HCA, transferring data directly from application memory to remote memory and bypassing the CPU and operating system, so that CPU overhead is reduced and translated into a latency improvement [26]. This high-bandwidth, low-latency mechanism makes Infiniband more efficient than Gigabit Ethernet. Figure 9.1 shows the Infiniband architecture.

[Diagram: processor nodes (CPU, MEM, HCA) and I/O/storage modules (TCA, controller) connected through a fabric of Infiniband switches]

Figure 9.1 The architecture of Infiniband


Note: Figure 9.1 is reproduced from Figure 1 in [26]

The shared memory feature replaces single-processor compute nodes in the cluster with symmetric multiprocessing (SMP) nodes, which contain more than one processor sharing a common memory. SMP clusters used to be a complex and costly design that was not suited to low-cost Linux clusters, but the spread of multi-core technology means this design can now be implemented easily on a cluster machine. Multi-core technology places more than one processing core (normally two) on a single die; the cores have their own L2 caches but share the same memory. The benchmark results on Lomond show that reading and writing shared memory is a highly efficient way to pass messages, so shared-memory nodes help improve the scaling and performance of the whole cluster. Figure 9.2 shows the architecture of a compute node with multi-core processors.

[Diagram: a dual-core compute node with two CPUs, each with its own L2 cache, sharing one memory]

Figure 9.2 A multi-core compute node

9.2 Application Benchmark

Today both Infiniband and shared memory can be found on many cluster machines because of their advantages, but to get a real performance benefit from these features the installed MPI must be able to support them. In this project an application code was benchmarked on the Scaliwag machine, which is composed of 32 Opteron dual-core processors connected by three different interconnects, including Infiniband and Gigabit Ethernet, in order to investigate whether the public domain MPI implementations on Scaliwag could profit from Infiniband and shared memory. The MPI Coursework code was selected for this benchmark because it was written entirely by the author of this dissertation, so the code's functions, operations and architecture were completely known without requiring an MPI trace analysis by an external tool. Due to the limited time available on Scaliwag, only LAM/MPI was properly installed on this machine, and it therefore became the only MPI examined in this benchmark. The introduction on the LAM website says that LAM/MPI was designed to support Infiniband: "ib RPI SSI module provides low latency, high bandwidth of Infiniband networks using the Mellanox Verbs Interface (VAPI)." [27] To run an application with LAM/MPI over Infiniband or Gigabit Ethernet, use the following commands:

Infiniband: mpirun -np $p -ssi rpi ib ./foo
Gigabit Ethernet: mpirun -np $p -ssi rpi tcp ./foo

To investigate the effect of Infiniband on LAM/MPI, the results of two separate groups of tests are compared: one running the MPI Coursework Code on various numbers of processors over Gigabit Ethernet, the other running the same code on the same range of processor counts over Infiniband. The coursework code was tested with both the dynamic-on and dynamic-off strategies. Figure 9.3 plots the results of the Coursework code 1000x1000 benchmark over Gigabit Ethernet and Infiniband.

[Plot: CPU time per step (cpu*sec) against number of CPUs (2 to 32) for the four combinations ib/tcp and dynamic on/off]

Figure 9.3 Coursework code benchmark with using two interconnects

In figure 9.3, ib: Infiniband; tcp: Gigabit Ethernet; off: dynamic off; on: dynamic on.


There are four time-per-step curves in figure 9.3. They represent the coursework code performance of LAM/MPI using either the Infiniband or the Gigabit Ethernet interconnect, and applying either the dynamic-off or the dynamic-on strategy, on Scaliwag. The two curves for the dynamic-on strategy are clearly lower than the other two. The dynamic-off strategy performs an Allreduce operation in every iteration, so the huge cost of this operation dominates the whole communication cost; with the dynamic-on strategy the number of Allreduce calls is cut to roughly 1/28 of that, and the saving is translated into a performance improvement. That is why the dynamic-on curves sit lower in figure 9.3.

Another observation is that both Infiniband curves lie below the Gigabit Ethernet curves for the same collective communication strategy, and the former appear more stable and flatter than their counterparts. This result shows that LAM/MPI does profit from the Infiniband interconnect in both scaling and performance. Figure 9.3 shows that by using Infiniband the CPU time per step decreases by up to about 30% in dynamic-on mode and up to about 50% in dynamic-off mode. Presumably Infiniband reduced the latency of collective communication in these benchmarks, so it had more effect on a code heavy in collective communication.

The two dynamic-off curves dip around the 32-CPU point, and the other two curves slow their ascent at the same location. This apparent anomaly was caused by the growing aggregate cache during the benchmarks: the 1000x1000 coursework code benchmark uses 32 Mbyte of memory in total19, and there is 1 Mbyte of L2 cache per core on Scaliwag. When 32 processors were involved, most of the data used by the code could reside in L2 cache instead of memory, which sped up reads and writes and improved the overall scaling and performance; but this was not an effect of the MPI libraries or communication features.

The next step is to investigate the effect of shared memory. To specify the node-usage mode applied in a benchmark, different arguments are given to the -pe flag on the command line or in the script:

Single processor mode: -pe mpich $p
Multi processor mode: -pe mpich-multi $p

19 Four 1000x1000 double-precision arrays are defined in the code; the total memory required is 4 x 1000 x 1000 x 8 bytes = 32 Mbyte.


To investigate the effect of shared memory on LAM/MPI, the results of two separate groups of tests are compared: one running the MPI Coursework Code on various numbers of processors using two processors in every node, the other running the same code on the same numbers of processors using only one processor per node. The coursework code was tested with both the dynamic-on and dynamic-off strategies, and the interconnect used was Infiniband. Figure 9.4 plots the results of the Coursework code 1000x1000 benchmark in both multi and single mode.

[Plot: CPU time per step (cpu*sec) against number of CPUs (4 to 20) for the four combinations multi/single and dynamic on/off]

Figure 9.4 Coursework code benchmark in multi and single mode on Scaliwag

In figure 9.4, multi: multi-processor mode; single: single-processor mode; off: dynamic off; on: dynamic on.

There are four time-per-step curves in figure 9.4, representing the coursework code performance of LAM/MPI in single and multi processor mode, with the dynamic-off and dynamic-on strategies, on Scaliwag. As in figure 9.3, the two curves using the dynamic-on strategy are clearly lower than the other two; the reduction in Allreduce operations again brings a huge performance improvement, and this effect is decisive. Under the same dynamic strategy, the two curves for multi-processor mode lie below the two single-processor curves in figure 9.4, which shows that shared memory is fully supported by LAM/MPI on the Scaliwag machine. As the number of processors increases, the two dynamic-off curves tend to converge, unlike the other two curves, which remain separated by nearly the same amount. A possible reason is that when more processors were used, the amount of inter-node traffic also increased: although fast in-node message passing could guarantee efficient point-to-point communication within a compute node, it had to be synchronized with the relatively slow Infiniband or Gigabit Ethernet interconnect during collective communication. The growing inter-node traffic eventually became the bottleneck for a code heavy in collective communication, so that shared memory gradually lost its effect. Unfortunately there was no time to run the relevant low-level benchmarks on Scaliwag, so this guess could not be verified in this project.

9.3 Summary

Infiniband and shared memory are two important features for improving communication efficiency on a cluster machine, and they are used on many newly designed clusters. The coursework code benchmark on Scaliwag showed that both Infiniband and the shared memory feature are supported by LAM/MPI. In addition, the comparison of the results obtained with the two different strategies indicated that the effect of both features depends on the code: a code heavy in collective communication magnifies the effect of the inter-node Infiniband but tends to make the in-node shared memory useless, and vice versa.


Chapter 10

Conclusion

With the maturing and spread of cluster technology in recent years, more and more people have begun to use Linux cluster machines to obtain high performance at low cost in scientific research and business. Although this choice can save a lot of money, most Linux clusters are not as stable and efficient as vendor-provided HPC machines. One important reason is that there are many software packages and tools with similar functions, developed on various platforms for different target machines, and there is no uniform standard or guide telling people which of them to install on their machine. An improper choice reduces the performance of the cluster at the software layer. It is therefore important to compare the performance of interchangeable software, libraries and toolkits on specific platforms; the results can guide others in getting the best performance out of a cluster of a given type.

As one of the most important software components on a cluster machine, an MPI implementation directly affects the parallel scaling and performance of the cluster. In this project three recently developed and popular MPI implementations, OpenMPI, MPICH2 and LAM/MPI, were benchmarked using a series of application codes on different machines to find out which has the best performance.

OpenMPI is a completely new MPI-2 compliant implementation combining technologies and resources from the prior LAM/MPI, LA-MPI, FT-MPI and PACX-MPI projects. It was designed around a component architecture intended to make OpenMPI stable and flexible. The IMB low-level benchmark results showed that OpenMPI performed poorly in both the Pingpong and Allreduce benchmarks on Lomond (a shared-memory machine): it had lower point-to-point bandwidth and took much longer to start passing a message than the other MPI implementations, and it was also the most expensive at executing a collective communication.


These unsatisfactory low-level results made it impossible for OpenMPI to achieve good performance on application codes: it fell behind MPICH2 and LAM/MPI in both the Coursework code 1000x1000 benchmark and the PCHAN T1 benchmark. There is no reason to use OpenMPI on a Lomond-like shared memory machine.

OpenMPI did not have the best overall low-level performance on the e-Science cluster either, but the gap between OpenMPI and the other MPI implementations was not as large as on Lomond, and it had two strong points: the highest bandwidth on long messages, and Allreduce performance ranked between MPICH2 and LAM/MPI. To some extent these two points helped OpenMPI in the application benchmarks on the e-Science cluster: it performed very close to the other MPI implementations in the Coursework Code and even outperformed LAM/MPI in the LAMMPS Rhodopsin benchmark. Although OpenMPI performed better on the e-Science cluster than on Lomond, this new MPI is still not good enough to be recommended on the e-Science cluster.

MPICH2 is a new version of MPICH with support for the MPI-2 functions. It aims to provide an MPI implementation for important platforms, including clusters, SMPs and massively parallel processors. There is good reason to believe MPICH2 is well suited to an SMP machine, because it produced excellent results in the IMB low-level benchmarks: it not only showed high bandwidth and short latency in point-to-point communication, it also achieved the best collective communication performance. Supported by these results, MPICH2 provided the best performance for all the application codes ported to Lomond, especially the codes heavy in collective communication. MPICH2 is clearly the preferred MPI implementation on Lomond apart from the native MPI.

As on Lomond, MPICH2 had good low-level benchmark results on the e-Science cluster, with a particularly clear advantage in collective communication. This advantage let MPICH2 beat the other two MPI implementations on the MPI Coursework Code and on LAMMPS, which makes frequent Allreduce calls. This shows that, with the proper software, even a very simple Linux cluster with only an Ethernet interconnect can achieve good performance for some problems (i.e. LAMMPS) on more than 10 CPUs. MPICH2's performance did fall behind LAM/MPI's in the PCHAN T1 benchmark: without its collective communication advantage, MPICH2's bandwidth and latency were still not quite at the level of LAM/MPI. Although MPICH2 was not the absolute number one on the e-Science cluster, it had obvious advantages on the collective-communication-heavy codes (the Coursework Code and LAMMPS).


The performance difference from LAM/MPI on PCHAN was also very slight, so MPICH2 still qualifies as the best MPI on the e-Science cluster.

LAM/MPI is a high-quality public domain MPI implementation with a long history, and its performance has been proven by thousands of users. Version 7.1.2, tested in this project, provides a complete implementation of the MPI 1.2 standard and most of the MPI-2 standard. LAM/MPI showed a distinctive profile in the low-level benchmarks: it had very short latency, which lets it start a communication in the shortest time, but its poor collective communication performance weakened its competitiveness in many application benchmarks. Essentially, the application performance is determined by the type of code: in a code with little collective communication, such as PCHAN, LAM/MPI beat MPICH2 and achieved the best performance, but in collective-communication-heavy codes such as LAMMPS or NAMD its performance was seriously limited. To be portable across platforms, LAM/MPI supports many communication protocols and interconnects (TCP, Gigabit Ethernet, Myrinet and InfiniBand). The Coursework Code benchmark on Scaliwag showed that LAM/MPI could profit from a high-performance interconnect, Infiniband, and from the shared memory provided by multi-core compute nodes. Since the other MPI implementations had not been fully set up on Scaliwag, LAM/MPI is the preferred MPI on that machine for the moment.

In summary, according to the results of the low-level and application benchmarks on both Lomond and the e-Science cluster, MPICH2 is recommended for installation and use on both Lomond-like SMP machines and Ethernet-based Linux clusters like the e-Science cluster. It was only possible to test a single MPI implementation, LAM/MPI, on a machine with an Infiniband interconnect; LAM/MPI was able to take advantage of the better performance of Infiniband, giving increased application performance compared with Ethernet.

Further study

Given more time, it would be interesting to do more work on Scaliwag when the machine finishes its upgrade and returns. First, OpenMPI and MPICH2 would be installed; the same application codes (MPI Coursework Code, PCHAN, NAMD and LAMMPS) could then be ported to this machine and used to benchmark the application performance of all three MPI implementations in this project.


It would also be worth spending more time understanding why OpenMPI did not do better for PCHAN on the e-Science cluster. In addition, OpenMPI is a new MPI implementation that is still under development, and it will be interesting to check whether a future OpenMPI release can perform better than MPICH2 and LAM/MPI.


Appendix A

Code Porting Detail

A.1 Porting MPI Coursework Code

Figure 5.2 lists the Makefile used to compile the Coursework Code on Lomond with OpenMPI.

# Makefile to compile Coursework Code on Lomond using ompi
MF= Makefile
CC= mpicc
CFLAGS=-lm
EXE= imgprocess

INC= arralloc.h \
     prototypes.h

SRC= cio.c \
     arralloc.c \
     communicate.c \
     process.c \
     main.c

OBJ = $(SRC: .c=.o)

.c.o:
	$(CC) $(CFLAGS) -c $(SRC)

all: $(EXE)

$(OBJ): $(MF)

$(OBJ): $(INC)

$(EXE): $(OBJ)
	$(CC) $(CFLAGS) -o $(EXE) $(OBJ)

clean:
	rm -f imgprocess *.o

Figure 5.2 Makefile for Coursework Code on Lomond using OpenMPI

This Makefile can be reused to compile the MPI Coursework Code on both Lomond and the e-Science cluster with OpenMPI, MPICH2 and LAM/MPI, because every public domain MPI implementation names its MPI C compiler wrapper mpicc. Figure 5.3 shows a script to port the Coursework Code and build it on Lomond using OpenMPI.

#!/usr/bin/bash
export PATH=$PATH:/home/emcclem/ompi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/emcclem/ompi/lib
echo 'compiling'
make

Figure 5.3 Script for porting Coursework Code on Lomond using OpenMPI

A.2 Porting PCHAN

Figures 6.2 and 6.3 list the target parameters for compiling PCHAN on Lomond and on the e-Science cluster with MPICH2.

pdns3d-epccsun-mpich2:
	make $(PROGRAM) "EXE=$(EXE)" \
	"OPTIONS=$(OPTIONS)" \
	"OPTIONS2=-DEPCC -DMPI" \
	"CPP=cpp" \
	"CPPFLAGS=-C -I/home/emcclem/mpich2shm/include" \
	"FC=/home/emcclem/mpich2shm/bin/mpif90" \
	"FFLAGS=-xO5 -r8const -xtypemap=real:64 -xchip=ultra3 -xarch=v8plusb -xcache=64/32/4:8192/64/1" \
	"LFLAGS=" \
	"LIBS=-lmvec -L/home/emcclem/mpich2shm/lib"

Figure 6.2 Makefile target parameters for Lomond using MPICH2


pdns3d-eScience-mpich2:
	make $(PROGRAM) "EXE=$(EXE)" \
	"OPTIONS=$(OPTIONS)" \
	"OPTIONS2=-DMPI" \
	"CPP=cpp" \
	"CPPFLAGS=-C -I/home/s0566708/mpich2/include" \
	"FC=/home/s0566708/mpich2/bin/mpif90" \
	"FFLAGS=-O3" \
	"LFLAGS=" \
	"LIBS=-L/home/s0566708/mpich2/lib"

Figure 6.3 Makefile target parameters for e-Science cluster using MPICH2

Table 6.1 below contains the values of the four key parameters for all the MPI implementations used in this project.

Machine: Lomond

ompi:
  CPPFLAGS=-C -I/home/emcclem/ompi/include
  FC=/home/emcclem/ompi/bin/mpif77
  FFLAGS=-f77=%none -xO5 -r8const -xtypemap=real:64 -xchip=ultra3 -xarch=v8plusb -xcache=64/32/4:8192/64/1
  LIBS=-lmvec -L/home/emcclem/ompi/lib

mpich2:
  CPPFLAGS=-C -I/home/emcclem/mpich2shm/include
  FC=/home/emcclem/mpich2shm/bin/mpif90
  FFLAGS=-xO5 -r8const -xtypemap=real:64 -xchip=ultra3 -xarch=v8plusb -xcache=64/32/4:8192/64/1
  LIBS=-lmvec -L/home/emcclem/mpich2shm/lib

lam:
  CPPFLAGS=-C -I/home/emcclem/lam/include
  FC=/home/emcclem/lam/bin/mpif77
  FFLAGS=-f77=%none -xO5 -r8const -xtypemap=real:64 -xchip=ultra3 -xarch=v8plusb -xcache=64/32/4:8192/64/1
  LIBS=-lmvec -L/home/emcclem/lam/lib

Sun MPI:
  CPPFLAGS=-C -I/opt/SUNWhpc/include
  FC=/opt/SUNWhpc/bin/mpf90
  FFLAGS=-xO5 -r8const -xtypemap=real:64 -xchip=ultra3 -xarch=v8plusb -xcache=64/32/4:8192/64/1
  LIBS=-lmvec -L/opt/SUNWhpc/lib

Machine: e-Science cluster

ompi:
  CPPFLAGS=-C -I/home/s0566708/ompi/include
  FC=/home/s0566708/ompi/bin/mpif90
  FFLAGS=-O3
  LIBS=-L/home/s0566708/ompi/lib

mpich2:
  CPPFLAGS=-C -I/home/s0566708/mpich2/include
  FC=/home/s0566708/mpich2/bin/mpif90
  FFLAGS=-O3
  LIBS=-L/home/s0566708/mpich2/lib

lam:
  CPPFLAGS=-C -I/home/s0566708/lam/include
  FC=/home/s0566708/lam/bin/mpif77
  FFLAGS=-O3
  LIBS=-L/home/s0566708/mpich2/lib

Table 6.1 PCHAN building parameter values for the different MPI implementations

Note that LAM/MPI does not provide an mpif90 wrapper, but its mpif77 does support Fortran 90 codes. To ensure an f90 compiler is used, the -showme option was given to see the name of the back-end compiler that would be invoked, and -f77=%none was added to turn off the f77 mode of the f90 compiler. The OpenMPI installation on Lomond had the same issue, as its mpif90 wrapper was not built20; mpif77 -f77=%none is used in place of mpif90.

Figures 6.6-6.8 list the procedures used to install OpenMPI, MPICH2 and LAM/MPI with the Intel Fortran Compiler.

#! /bin/bash
$ gunzip -c openmpi-1.0.2.tar.gz | tar xf -
$ cd openmpi-1.0.2
# configure OpenMPI
$ ./configure --prefix=/home/s0566708/ompi --enable-mpi-f90 \
    --enable-mpi-f77 --enable-mpi-cxx --enable-mpi-profiling \
    --with-f90-max-array=3 \
    F90=/usr/local/Cluster-Apps/intel_fc_8.1.025/bin/ifort
# build and install
$ make all install

Figure 6.6 the procedure of OpenMPI installation with ifort on e-Science cluster

20 All MPI implementations on Lomond and Scaliwag were installed by Erik McClements; the MPI installation work on the e-Science cluster was done by Yuan WAN, the author of this dissertation.


#! /bin/bash
$ gunzip -c mpich2-1.0.3.tar.gz | tar xf -
$ cd mpich2-1.0.3
# configure MPICH2
$ ./configure --prefix=/home/s0566708/mpich2 --enable-f90 --enable-f77 --enable-cxx \
    F90=/usr/local/Cluster-Apps/intel_fc_8.1.025/bin/ifort
# build MPICH2
$ make
# install MPICH2 commands
$ make install

Figure 6.7 the procedure of MPICH2 installation with ifort on e-Science cluster

#! /bin/bash
$ gunzip -c lam-7.1.2.tar.gz | tar xf -
$ cd lam-7.1.2
# configure LAM/MPI
$ ./configure --prefix=/home/s0566708/lam \
    FC=/usr/local/Cluster-Apps/intel_fc_8.1.025/bin/ifort
# build LAM/MPI
$ make
# install LAM/MPI
$ make install

Figure 6.8 the procedure of LAM/MPI installation with ifort on e-Science cluster

A.3 Porting NAMD

Figures 7.2-7.4 list the whole procedure of porting NAMD to Lomond with MPICH2.

1. Create the porting directory:
$ mkdir /home/ywan/msc_project/namd_port/
$ cd /home/ywan/msc_project/namd_port/

2. Download the tcl, fftw and plugins libraries:
$ mkdir fftw
$ cd fftw
$ wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-solaris.tar.gz
$ gunzip -c fftw-solaris.tar.gz | tar xf -
$ cd ..
$ mkdir plugins
$ cd plugins
$ wget http://www.ks.uiuc.edu/Research/namd/libraries/plugins-2.5/plugins-SOLARIS2.tar.gz
$ gunzip -c plugins-SOLARIS2.tar.gz | tar xf -
$ cd ..
$ mkdir tcl
$ cd tcl
$ wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl-solaris.tar.gz
$ gunzip -c tcl-solaris.tar.gz | tar xf -
$ cd ..

Figure 7.2 NAMD porting procedure - install TCL, FFTW and plug-ins

3. Unpack NAMD and the matching Charm++ source code and enter the directory:
$ gunzip -c NAMD_2.6b1_Source.tar.gz | tar xf -
$ cd NAMD_2.6b1_Source
$ tar xf charm-5.9.tar
$ cd charm-5.9

4. Build the Charm++ library:
$ ./build charm++ mpi-sol --basedir=/home/emcclem/mpich2
$ cd ..

Figure 7.3 NAMD porting procedure - install Charm++

To build Charm++ with another MPI, simply replace the value of basedir with the install location of the MPI to be used.

5. Edit the configuration files: ./Make.charm, arch/Solaris-Sparc.fftw, arch/Solaris-Sparc.tcl, arch/Solaris-Sparc-MPI.arch

6. Set up the build directory and compile:
$ ./config tcl fftw plugins Solaris-Sparc-MPI
$ cd Solaris-Sparc-MPI
$ make

Figure 7.4 NAMD porting procedure – build NAMD

A.4 Porting LAMMPS

Figure 8.2 lists how to install FFTW 2.1.5 on the e-Science cluster.

$ gunzip -c fftw-2.1.5.tar.gz | tar xf -
$ cd fftw-2.1.5
$ ./configure --prefix=/home/s0566708/fftw
$ make
$ make install

Figure 8.2 Installation of FFTW 2.1.5

Figure 8.3 lists the low-level Makefile for MPICH2 on the e-Science cluster, Makefile.mpich2.

# mpich2 = RedHat Linux box, g++, MPICH2, FFTW

SHELL = /bin/sh

# System-specific settings
CC = g++
CCFLAGS = -g -O -I/home/s0127384/mpich2/include \
          -I/home/s0566708/fftw/include -DFFT_FFTW -DGZIP \
          -DMPICH_IGNORE_CXX_SEEK
DEPFLAGS = -D
LINK = g++
LINKFLAGS = -g -O -L/home/s0127384/mpich2/lib \
            -L/home/s0566708/fftw/lib
USRLIB = -lfftw -lmpich
SYSLIB =
ARCHIVE = ar
ARFLAGS = -rc
SIZE = size

...

# %.d:%.cpp
#	$(CC) $(CCFLAGS) $(DEPFLAGS) $< > $@

# Individual dependencies
# DEPENDS = $(OBJ:.o=.d)
# include $(DEPENDS)

Figure 8.3 LAMMPS low-level Makefile for MPICH2 on e-Science cluster

Makefile.ompi and Makefile.lam were similar to Makefile.mpich2. Table 8.1 includes the values of the key flags in Makefile.ompi, Makefile.mpich2 and Makefile.lam.

Machine: e-Science cluster

ompi:
  CCFLAGS=-g -O -I/home/s0127384/ompi/include \
    -I/home/s0127384/ompi/include/openmpi/ompi \
    -I/home/s0566708/fftw/include -DFFT_FFTW -DGZIP
  LINKFLAGS=-g -O -L/home/s0127384/ompi/lib -L/home/s0566708/fftw/lib
  USRLIB=-lfftw -pthread -lmpi_cxx -lmpi -lorte -lopal -lutil -lnsl -ldl -Wl --export-dynamic -lutil -lnsl -lm -ldl

mpich2:
  CCFLAGS=-g -O -I/home/s0127384/mpich2/include \
    -I/home/s0566708/fftw/include -DFFT_FFTW -DGZIP -DMPICH_IGNORE_CXX_SEEK
  LINKFLAGS=-g -O -L/home/s0127384/mpich2/lib -L/home/s0566708/fftw/lib
  USRLIB=-lfftw -lmpich

lam:
  CCFLAGS=-g -O -I/home/s0127384/lam/include \
    -I/home/s0566708/fftw/include -DFFT_FFTW -DGZIP
  LINKFLAGS=-g -O -L/home/s0127384/lam/lib -L/home/s0566708/fftw/lib
  USRLIB=-lfftw -llammpio -llammpi++ -lmpi -pthread -llam -laio -laio -lutil -ldl

Table 8.1 LAMMPS low-level Makefile flag values of different MPI on e-Science cluster


Appendix B

Work Plan and Final Outcome

B.1 Original Plan

According to the work plan shown in the presentation on 29th May, the project was focused on:
- four machine platforms: e-Science cluster, Scaliwag, Lomond, HPCx
- three MPI implementations: OpenMPI, MPICH2, LAM/MPI
- three application codes: PCHAN, NAMD, CASTEP

The timetable was as follows:

17/4 - 14/5: Understand project aim and content; make work plan; apply for machine accounts; download MPI implementations; pick out application codes from earlier research
15/5 - 28/5: Be familiar with machine operations; install MPI implementations; request application codes and apply for licences; run application codes on HPCx; prepare for presentation
29/5: Presentation of work plan
29/5 - 11/6: Complete MPI installations; complete code running on HPCx; port the codes to Linux clusters
12/6 - 25/6: Benchmark MPI implementations with application codes; table and plot the results
26/6 - 30/6: Write mid-term report
30/6: Deadline of mid-term report
01/7 - 16/7: Analyze the benchmark results; give conclusions together with the low-level benchmark results
17/7 - 30/7: Look at the influence of compiler choice on final performance; study further features of one MPI implementation; optimise application performance with one MPI version
31/7 - 25/8: Summary and conclusion; final dissertation writing and modifying; deeper analysis and discussion of the project topic
25/8: Deadline of final dissertation
26/8 - 03/9: Prepare for final presentation
04/9: Final presentation

Table B.1 Original work plan timetable

Risk | Probability | Impact | Exposure | Strategy | Happened in this project
Delay in machine account application | 50 | 15 days | 750 | Focus work on other machines | Yes
Machine break down | 25 | 30 days | 750 | Focus work on other machines | Yes
Machine busy | 100 | 3 days | 300 | Do some text work while job is waiting | Yes
MPI install fails on some machine | 75 | 7 days | 350 | Drop this MPI on this machine | Yes
Code unavailable or takes long time to get | 50 | 7 days | 350 | Port and benchmark available code first | Yes
All codes are hard to port to clusters | 25 | 20 days | 500 | Replace by backup codes | Yes
Lack of experience of code porting and benchmarking | 100 | 3 days | 300 | Take some time brushing up; consult experts | Yes
Supervisor away | 100 | 2 days | 200 | E-mail contact | Yes

Table B.2 Project Risk

Table B.2 lists several risks that could delay the whole project. The biggest risks come from the machines, as all the code porting and benchmarking work must be done on specific Linux clusters. Some machine account applications may take a long time to be approved. Although unlikely, a machine may break down for a long time because of system upgrades or accidents. Even if all machines are available during the project, some may be so busy at times that a job stays in the queue for days. One strategy against machine risk is a dynamic, overlapping schedule: when a machine is unavailable or busy, do other writing work while waiting for results, so that machines and people work in parallel rather than sequentially. The other strategy is to have more than one cluster available, reducing the risk that no machine can be used; in this project, the e-Science cluster and Scaliwag can each act as a backup for the other. Another big risk is code porting. It is known from earlier research that some codes are quite nasty to port even though they run efficiently, and it is not wise to spend months on porting a single code; if an application code cannot be ported, it can be dropped and replaced by a backup code. A further potential risk not listed in the table is the low-level part of the project: although the low-level benchmarking is not the task of this dissertation, the application benchmark results cannot be understood and explained without the low-level results.

B.2 Current states after two months

The following tables show the current state of the project after two months' work.

Machine | Account | Architecture | Usable
e-Science cluster | Yes | Yes | Yes
Scaliwag | Yes | Yes | upgrading
Lomond | Yes | Yes | Yes
HPCx | Yes | Yes | Yes

Table B.3 Machine states


Table B.3 shows that all the machines are now ready. However, the Scaliwag account was only received a few days ago, and that machine is about to undergo an O/S upgrade within a week, with no indication of when it will return.

Machine | Native MPI | OpenMPI | MPICH2 | LAM/MPI
e-Science cluster | - | Yes | Yes | Yes
Scaliwag | Scali MPI | No | (MPICH) | Yes
Lomond | Sun MPI | Yes | Yes | Yes
HPCx | IBM MPI | No | No | No

Table B.4 MPI installs states

Currently, apart from the Scaliwag machine, which has only just become accessible, all three MPI implementations have been installed on the e-Science cluster and Lomond. HPCx, unfortunately, does not seem to be compatible with these public domain MPI implementations.

Machine | PCHAN compile | PCHAN run | NAMD compile | NAMD run | CASTEP compile | CASTEP run
e-Science cluster | - | - | - | - | - | -
Scaliwag | - | - | - | - | - | -
Lomond | ok | crash | in process | - | - | -
HPCx | ok | ok | in process | binary | - | -

Table B.5 Code porting states

It is obvious from table B.5 that the code porting work has not gone as planned. There has been no chance to compile and run the CASTEP code: the CD containing the code took several weeks to arrive and was unfortunately damaged, and the second CD took another two weeks and was only received a few days ago. For the other two codes, the original plan was to compile and run them successfully on HPCx and Lomond first and then port them to the e-Science cluster, but both met with problems. PCHAN passed the compilation phase but crashed when it began to run; the likely reason is the stack limit setting on memory. The NAMD serial executable has been compiled, but the parallel version is still in progress, as compiling NAMD is a complicated procedure. There has been no chance to port code to Scaliwag, as access to this machine was only granted several days ago.


B.3 Changes to the original work plan

Since porting the codes and part of the MPI installation work have proved harder than expected, some changes have been made to the original plan in order to cope with these setbacks and keep the project moving.

The application code benchmarks with public domain MPI on HPCx have been temporarily dropped, as the installation of the public domain MPI implementations failed on this machine. The installation can be retried later if time allows. Benchmark work is performed on the other three machine platforms.

Current code porting and benchmark work is focused on the MPI coursework code (a 2D decomposition image update code) as a backup for the other three codes. This code was selected because it is easy to reconfigure and port, so some benchmark results can be obtained quickly for analysis. In fact, this code had already been ported and benchmarked successfully with all of the MPI implementations installed on Lomond and the e-Science cluster before this report was written.
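
For context (an illustrative sketch only, not the actual coursework code), such a code typically creates a Cartesian process grid and exchanges halo rows and columns between neighbours on every iteration; the local array sizes, non-periodic boundaries and variable names below are assumptions.

    /* halo2d.c -- skeleton of a 2D decomposition with halo exchange (illustrative sketch).
     * In the real code the local array would hold one block of the image being updated. */
    #include <mpi.h>

    #define NX 256   /* assumed local interior size in x */
    #define NY 256   /* assumed local interior size in y */

    int main(int argc, char *argv[])
    {
        int dims[2] = {0, 0}, periods[2] = {0, 0};
        int size, left, right, down, up;
        double local[NX + 2][NY + 2] = {{0.0}};   /* interior plus one halo cell per side */
        MPI_Comm cart;
        MPI_Datatype column;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Let MPI choose a balanced 2D process grid and build a Cartesian communicator. */
        MPI_Dims_create(size, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

        /* Neighbours in each direction; MPI_PROC_NULL is returned at the image edges. */
        MPI_Cart_shift(cart, 0, 1, &left, &right);
        MPI_Cart_shift(cart, 1, 1, &down, &up);

        /* A strided datatype describing one column of the local array (halo rows included). */
        MPI_Type_vector(NX + 2, 1, NY + 2, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* One halo exchange: rows are contiguous in memory, columns use the vector type. */
        MPI_Sendrecv(&local[1][1],      NY, MPI_DOUBLE, left,  0,
                     &local[NX + 1][1], NY, MPI_DOUBLE, right, 0, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&local[NX][1],     NY, MPI_DOUBLE, right, 1,
                     &local[0][1],      NY, MPI_DOUBLE, left,  1, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&local[0][NY],     1,  column,     up,    2,
                     &local[0][0],      1,  column,     down,  2, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&local[0][1],      1,  column,     down,  3,
                     &local[0][NY + 1], 1,  column,     up,    3, cart, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }

Because the communication pattern is the same whichever MPI library is linked, results obtained with this code can be compared directly across the installed implementations.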

It is urgent to benchmark the coursework code with the installed MPI implementations on Scaliwag within the next few days, because that machine is about to have its operating system upgraded and it is not known when it will return.

The work for the next two weeks is to port PCHAN, NAMD and CASTEP to the e-Science cluster. If any code still fails to port, the backup codes (LAMMPS, DL_POLY) will be adopted promptly.

B.4 Final Porting Outcome

Application           Machine     Native MPI   OpenMPI   MPICH2   LAM/MPI
Coursework Code (C)   HPCx        ok           -         -        -
                      Lomond      ok           ok        ok       ok
                      e-Science   -            ok        ok       ok
                      Scaliwag    ok           ok        -        ok
PCHAN (F90)           HPCx        ok           -         -        -
                      Lomond      ok           ok        ok       miss
                      e-Science   -            ok        ok       ok
NAMD (C++)            HPCx        ok           -         -        -
                      Lomond      ok           miss      ok       ok
LAMMPS (C++)          HPCx        ok           -         -        -
                      e-Science   -            ok        ok       ok

Table B.6 Final project status

Table B.6 lists the final status of the project. On the whole, the benchmark work was completed successfully: four codes (the MPI coursework code, PCHAN, NAMD and LAMMPS) were ported and benchmarked using the different MPI implementations on the target machine platforms. There were, however, some missing and failed cases which need to be explained.

The MPI installation work on HPCx still failed, so the codes could only be benchmarked with the native IBM MPI on that machine.

The native MPI is IBM MPI on HPCx, Sun MPI on Lomond and Scali MPI on Scaliwag; there is no native MPI on the e-Science cluster.

There was not enough time to install all of the public domain MPI implementations on Scaliwag, as this machine was available for less than one week; in particular, MPICH2 was not installed on it.

Although the PCHAN porting work with LAM/MPI succeeded on Lomond, LAM/MPI then went down on the Lomond back end and could not be repaired, so the PCHAN benchmark with LAM/MPI on Lomond was dropped.

Porting NAMD on Lomond with OpenMPI failed: the code crashed at start-up for a reason that could not be established, so that benchmark had to be dropped.

As the porting of NAMD on the e-Science cluster was unsuccessful, LAMMPS was used there as its backup (both are molecular dynamics codes written in C++).

Key to Table B.6:
ok      the code was ported and benchmarked with this MPI
miss    the porting work failed and was dropped for this MPI
-       this MPI is not installed on the machine
