
Exploring Performance Improvement for Java-based Scientific Simulations that use the Swarm Toolkit

Xiaorong Xiang and Gregory Madey
Department of Computer Science and Engineering

University of Notre Dame
Notre Dame, IN 46556 USA

Abstract

There has been growing interest in using the Java programming language for scientific and engineering applications, because Java offers several features that traditional languages (C, C++, and FORTRAN) lack, including portability, a garbage collection mechanism, built-in threads, and the RMI mechanism. However, Java's historically poor performance has kept it from being widely used in scientific applications. Although research and development on Java runtime environments, including JIT compilers, JVM improvements, and high performance compilers, has significantly sped up Java programs, we believe that hand-optimizing Java source code for speed, reliability, scalability, and maintainability is also crucial during the program development stage. Natural organic matter (NOM), a mixture of organic molecules with different types of structure and composition, micro-organisms, and their environment form a complex system. The NOM simulator is an agent-based stochastic model for simulating the behavior of molecules over time. The simulator is built using the Java programming language and the Swarm agent-based modeling library. We analyze the NOM simulation model from several aspects: runtime optimization, database access, object usage, and parallel and distributed computing. The NOM simulation model possesses most of the characteristics of general scientific applications, so these techniques and analysis approaches can also be applied to other scientific applications. We expect that our experiences can help other developers using Java/Swarm find a suitable way to achieve higher performance in their applications.

1 Introduction

C, C++, and FORTRAN have traditionally been used for modeling scientific applications. Since the Java programming language was introduced by Sun Microsystems in the mid-1990s, there has been growing interest in using it for scientific and engineering applications, because Java offers several features that the traditional languages lack. Scientists use various platforms (such as Windows, Unix, and MacOS) for their scientific studies, and it is impossible to deploy an application written in C, C++, or FORTRAN on another platform without rebuilding the application or changing the code. One of the most attractive features of Java is its portability: “write once, run anywhere.” The Java runtime environment (JVM) provides automatic garbage collection, which relieves the programmer of explicit memory management. Java's built-in threads implementation and Remote Method Invocation (RMI) mechanism make parallel and distributed computing easy.

Although the Java language offers many attractive features and the J2EE architecture makes Java a potential language for scientific applications, performance remains a prime concern for program developers using Java. The portability and memory management implementations in the JVM impose a performance penalty. Ashworth (1999) [1] discussed several issues related to the use of Java for high performance scientific applications.

However, much research has been done to reduce the performance gap between Java and other programming languages. This research includes Just-in-Time (JIT) compilers that compile bytecode into native code on the fly just before execution, adaptive compiler technology (Sun's HotSpot VM), third-party optimizing compilers that compile Java source code into optimized bytecode, and high performance compilers that compile Java source into native code for a particular architecture (e.g., the IBM High-Performance Compiler for Java for the RS6000 architecture). Bull (2001) [2] rewrote the Java Grande Benchmarks in C and FORTRAN and compared the performance of these languages in different Java runtime environments on different hardware platforms. The results demonstrate that the performance gap is quite small on some platforms.

Runtime environment optimization is an important aspect to study in order to achieve high performance. For high performance scientific applications it is also necessary to determine and understand, from a software engineering perspective, the factors that affect performance, and to identify and eliminate the bottlenecks that limit scalability at the software development stage.

In this paper, several approaches for analyzing and improving the performance of a particular scientific application, the NOM simulation model, are described. The NOM simulation model is intended to simulate NOM behavior over time in a complex system. Such behavior includes transport through soil pores, adsorption to or desorption from mineral surfaces, and interactions of molecules with each other. The simulation model was built using the Java programming language and the Swarm library. It is a typical distributed, stochastic scientific application that uses an agent-based modeling approach. It generates a substantial data set that must be stored in a remote database, where it can be manipulated to produce useful information. The application simulates the behavior of a large number of molecules, and a run of the NOM simulation model often lasts for days. Several aspects of the simulation model have been analyzed, including runtime optimization, database access, object usage, and parallel and distributed computing.

The NOM simulation model possesses the characteristics of typical scientific applications, so these techniques and analytical approaches can generally be used in other scientific applications. We expect that our experiences can help other scientific application developers find a suitable way of tuning their applications for higher performance.

2 Profiling tool

OptimizeIt is a Java/J2EE performance tuning tool developed by Borland. It is a commercial tool, and a trial version can be downloaded from their Web site 1. OptimizeIt can display information about heap allocation, garbage collection, active threads, and class loading in the form of graphs. CPU sampling information is also displayed in a tree structure. These data can be exported to a formatted text file (HTML). Information about object allocation and deallocation can be viewed in a user-selected order. OptimizeIt is a powerful tool for detecting memory leaks and CPU performance bottlenecks in Java applications. The screen images shown in Appendix C reflect some of the features of OptimizeIt while it was used for profiling the NOM simulation model.

3 Data Structure

Selecting the appropriate data structure is important in a scientific application. Different data structures have significant impacts on performance, and no single data structure is appropriate for all situations. The best way to decide which type of data structure to use in an application is to create benchmarks that reflect how the structure is used in the application.

In the NOM simulation model, each molecule object needs to be held in a collection. Individual molecule objects are accessed randomly at each time step, and they can also be added to or removed from this collection. A List data structure is a suitable choice to store, retrieve, and manipulate molecule objects. The Java 2 platform provides two List implementations: ArrayList and LinkedList. ArrayList is a random access list, and LinkedList is a sequential access list. The same operation can have very different performance characteristics on the two types of List. Positional access, accessing an object through its index in the List, is a linear-time operation on a sequential access list and a constant-time operation on a random access list. However, the remove operation on a random access list is more expensive than on a sequential access list [3].

An algorithm for manipulating random access lists can exhibit quadratic behavior when it is applied to sequential access lists. In the NOM simulation model, in order to randomize the order in which the molecules in a List are accessed, the List needs to be shuffled at the beginning of each time step. The Collections class in the Java 2 platform provides a shuffle method to randomize the order of objects in a List. The shuffle algorithm used in the implementation has linear time complexity for random access lists and quadratic complexity for sequential access lists. To avoid the expensive operation that would result from shuffling a sequential access list, the shuffle method converts the list into an array before shuffling and converts the shuffled array back into the list.

1 http://www.borland.com/optimizeit/

Three benchmarks have been created to test the performance of different List structures and operations. These benchmarks reflect exactly how the data structures are used in the NOM simulation model. In benchmark A, the MoleculeList is implemented using ArrayList, objects are accessed using get(), and the MoleculeList is randomized using shuffle(). In benchmark B, the MoleculeList is implemented using LinkedList, objects are accessed using get(), and shuffle() is used. In benchmark C, the MoleculeList is implemented using LinkedList, objects are accessed using an iterator, and shuffle() is used. In each benchmark, the add and remove operations are measured, and the overall performance is measured to reflect the NOM application behavior.
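As an illustration, benchmarks A and C above can be sketched as follows; the Integer list here is a stand-in for the MoleculeList, and the list size is arbitrary rather than the paper's grid-derived size:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Sketch of benchmarks A and C: positional get() versus iterator
// traversal after a shuffle(), on a random access and a sequential
// access List.
class ListBenchSketch {
    // Benchmark A/B style access: get(i) is O(1) on ArrayList, O(n) on LinkedList.
    static long traverseByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) sum += list.get(i);
        return sum;
    }

    // Benchmark C style access: O(1) per step on both List types.
    static long traverseByIterator(List<Integer> list) {
        long sum = 0;
        for (int v : list) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < 20_000; i++) { arrayList.add(i); linkedList.add(i); }
        for (List<Integer> list : Arrays.asList(arrayList, linkedList)) {
            Collections.shuffle(list); // linear for both: a sequential list is copied to an array first
            long t0 = System.nanoTime();
            traverseByIndex(list);
            long t1 = System.nanoTime();
            traverseByIterator(list);
            long t2 = System.nanoTime();
            System.out.printf("%s: get() %d ms, iterator %d ms%n",
                    list.getClass().getSimpleName(),
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }
}
```

On a typical JVM the get() traversal of the LinkedList is dramatically slower than the other three runs, which is the quadratic behavior the text describes.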

Figure 1 shows the relationship between the different types of List and the different operations. It also illustrates the behavior of the different operations when the List size is doubled. In the NOM application, molecules are uniformly distributed on the grid; doubling the grid size therefore doubles the number of molecules and the List size.

Figure 1: Performance comparison of different operations for ArrayList and LinkedList; the simulation runs for 500 time steps. Top: grid size is 400 x 300. Bottom: grid size is 800 x 300.

The overall performance gain for ArrayList is partly offset by the add and remove operations. When the grid size is doubled, the time for randomly accessing the LinkedList increases more than 6 times. For this particular application, ArrayList is the best choice for implementing the MoleculeList: it achieves a maximum 2.8 speedup in this benchmark with grid size 400 x 300, and the speedup increases as the grid size grows.

4 Object reuse

Objects play a vital role in object-oriented programming languages such as C++ and Java. The Java Virtual Machine (JVM) manages memory automatically using a garbage collection mechanism: Java developers can allocate objects as necessary without considering deallocation, while C++ programmers must manually specify where a heap object is to be reclaimed by coding a delete statement. Creating objects excessively not only increases the memory footprint and the CPU time spent on garbage collection, it also increases the possibility of memory leaks.

Sosnoski (1999) [4] showed that object allocation in a Java program running on a JVM takes 50 percent longer than in equivalent C++ code. This is caused by the overhead of adding internal information, used later by the garbage collection process, when allocating objects on the heap.

Different JVMs, including different versions of the Sun JVM and the IBM JVM, use various techniques to automatically discover objects that are no longer referenced and to recycle their memory periodically. Despite the automatic nature of the garbage collection process, a potential disadvantage is that it adds overhead that can affect program performance, even in JVMs such as the Sun HotSpot JRE, where garbage collection runs in a separate thread.

Although the JVM is responsible for reclaiming unused memory, Java programmers still need to put effort into making it clear to the JVM which objects are no longer referenced [5]. For example, in some situations a programmer needs to set a reference to null manually in order to release the object. Although the memory leaks that are common in C++ are less likely to happen in Java, they can still occur due to poor design or simple coding errors.

An elegant way to reduce the overhead of creating and destroying objects, and thereby improve performance, is object reuse. It can also reduce the probability of memory leaks. In order to reuse a certain type of object, several steps need to be followed:


• Isolating objects

Because of the overhead of object reuse, such as managing the object pools, only objects that need to be created and destroyed frequently are good candidates for this technique. For a large scale application, a powerful profiling tool is necessary to help developers identify such objects. OptimizeIt was chosen for the development process of the NOM simulation model.

In the NOM simulation model, at each time step a certain number of molecules can enter or leave the system. The molecule objects need to be created and destroyed frequently, making them candidates for reuse.

• Optimizing object size

The size of an object affects not only the memory footprint but also the CPU time. It is worthwhile to estimate the size of a particular object and the number of instances of a given class in memory. A trade-off can be made between reducing the object size and keeping precision.

• Reinitializing object

Although Sun's HotSpot VMs radically improved the performance of object allocation and garbage collection, object allocation and instantiation still have a significant cost, especially when the object size is small.

A micro-benchmark was created to measure the time to create or reinitialize the Molecule objects in the NOM simulation model. In this benchmark, 1000 Molecule objects are created and then reinitialized to their original state. The average creation time for these Molecule objects is 53.85 milliseconds, while the average reinitialization time is 3.9 milliseconds.

• Creating object pool

In order to reuse certain types of objects, these objects need to be kept in a collection in memory, the so-called object pool. An object pool stores a free list of objects. Generally, an object pool can be implemented using a Vector, LinkedList, ArrayList, or raw array. Selecting a suitable data structure depends on the type of operations used for managing the pool.

In the NOM simulation model, the object pool has been implemented as a First-In-First-Out (FIFO) queue. When a new molecule object needs to be created, the method first checks the object pool. If an object is available in the pool, it is removed from the queue and reinitialized. If no object is available, a new object is created. When a molecule object leaves the system, it is added at the end of the queue. The data structure chosen for the object pool implementation is LinkedList.

Object reuse is a simple and elegant way to conserve memory and enhance speed. By sharing and reusing objects, processes or threads are not slowed down by the instantiation and loading time of new objects, or by the overhead of excessive garbage collection.
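A minimal sketch of the FIFO pool described above; the Molecule class and its reinit() method are hypothetical stand-ins for the model's actual classes:

```java
import java.util.LinkedList;

// Sketch of the object pool: a FIFO queue of molecules that have left
// the system, reused instead of allocating fresh objects.
class MoleculePool {
    static class Molecule {
        int state;
        void reinit(int state) { this.state = state; } // restore to a usable initial state
    }

    private final LinkedList<Molecule> free = new LinkedList<>(); // the free list (FIFO)

    // Reuse a pooled object if one is available, otherwise allocate.
    Molecule acquire(int state) {
        Molecule m = free.isEmpty() ? new Molecule() : free.removeFirst();
        m.reinit(state);
        return m;
    }

    // Called when a molecule leaves the system: keep it for later reuse.
    void release(Molecule m) {
        free.addLast(m);
    }

    int freeCount() { return free.size(); }
}
```

acquire() replaces `new Molecule(...)` at the points where molecules enter the system, and release() replaces dropping the reference when they leave.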

5 Database connection and database query

Database design and optimization play a critical role in the performance and scalability of scientific applications. Most modern databases, called database management systems (DBMS), are built for speed and scalability. They are distributed systems consisting of a number of concurrently running processes and threads on various machines that work together efficiently to deliver data quickly and at scale. The basic purpose of a database is to store, manage, and access data on one or several distributed machines.

For an application written in the Java programming language, communication with a database is accomplished through a Java Database Connectivity (JDBC) driver, with all database input/output expressed in SQL (Structured Query Language). Database vendors provide their own JDBC drivers that conform to the common Java API defined by Sun Microsystems. Using JDBC allows developers to change the database location, port, and vendor with minimal changes to the code.

Database query operations include data retrieval, insertion, updating, and deletion, as well as table creation and alteration. In a Web-based scientific application, data retrieval, insertion, updating, and deletion are the four most frequently used operations. JDBC offers several advanced techniques that allow the programmer to write high performance queries [6]. These techniques are presented to show their significant impact on performance and scalability.

• Connection Pooling

Establishing a connection to a database is a time-consuming process, especially for a short query. It is therefore reasonable to repeatedly reuse a connection object that connects to the same database. Connection pooling is a runtime object pool used to cache connections. A connection pool is normally used in a Web-based application connected to a database that a number of clients access frequently. Database vendors and application server vendors provide connection pooling utilities for managing the connections between Web applications and databases in an efficient and reliable way.

• Prepared statements

Query processing is the process of resolving a SQL query. It can be broken down into three basic phases: query parsing, query plan generation and optimization, and plan execution [6]. The query parsing phase is a syntax-checking process that ensures the string-based SQL query statement is legal. If a SQL statement or a set of similar SQL statements needs to be executed repeatedly, the cost of the parsing process can be reduced by caching previously parsed queries. JDBC provides this capability with the PreparedStatement object. Unlike a Statement object, which must be sent to the database for parsing each time, a PreparedStatement is compiled in advance and can be executed as many times as needed. The speedup of a query varies with the type of query (SELECT, INSERT, UPDATE, DELETE) and its complexity (e.g., whether more than one table is involved).

• Batch updates

For a large scale scientific application, the application and the database normally reside on physically separate machines. Substantial network latency can lead to very inefficient query execution. JDBC provides the batch update approach, which reduces the effect of network latency by executing a number of queries in one network round trip. Query processing, however, is not necessarily faster with the batch update approach than with a PreparedStatement. The trade-off comes in two forms: (1) batch updates can only be combined with a Statement object, and (2) the data must be sent from the client to the server in one large round trip. The batch update approach offers more benefit when the network connection is slow.

• Transaction management

In SQL terms, a transaction is a series of operations that must be executed as a single logical unit. When a connection is created using JDBC, the database is set to autocommit mode and each SQL statement is treated as a separate transaction. To allow two or more statements to be grouped into a transaction, autocommit mode must be disabled using Connection.setAutoCommit(false). Treating a group of operations as a transaction is a safe way to guarantee the integrity of a database. For example, if a bank customer wishes to transfer funds from a savings account to a checking account, two update operations must be sent. If these two calls are treated as two separate transactions, one may succeed while the other fails due to the network or other factors, leaving inconsistent data in the database.

Explicitly committing a group of SQL statements is not only a safe approach but also has a large performance impact, because the overhead of the commit operation is reduced. In the NOM model, all the data and information pertaining to the reacted molecules in the system are stored in the database at the end of each time step. The insertions for all the reacted molecules at each time step are treated as one transaction to ensure data consistency in the system.
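The combination of a PreparedStatement with one transaction per time step can be sketched as follows; the table and column names are hypothetical placeholders for the model's actual schema, and the rollback on failure is added here to illustrate the consistency argument above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: one PreparedStatement reused for every insertion in a time
// step, all committed as a single transaction.
class TimeStepWriter {
    static final String INSERT_SQL =
            "INSERT INTO reacted_molecules (time_step, molecule_id, mass) VALUES (?, ?, ?)";

    // records[i] = {moleculeId, mass} for one reacted molecule
    static void writeTimeStep(Connection con, int timeStep, double[][] records)
            throws SQLException {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false); // group the inserts into one transaction
        try (PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            for (double[] r : records) {
                ps.setInt(1, timeStep);
                ps.setInt(2, (int) r[0]);
                ps.setDouble(3, r[1]);
                ps.executeUpdate(); // statement is parsed once, executed many times
            }
            con.commit(); // one commit per time step
        } catch (SQLException e) {
            con.rollback(); // keep the database consistent on failure
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }
}
```

The same structure works with any JDBC driver; only the Connection (here obtained elsewhere, e.g. from a pool) is vendor-specific.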

The NOM core simulation engine connects to an Oracle database using the JDBC thin driver. Figure 2 shows the performance comparison for data insertions using five approaches. The simulation runs for 96 time steps, with a total of 1782 insertions.

The benchmark consists of five cases. In case 1, each insertion for each reacted molecule is treated as a single transaction with a Statement object. In case 2, each insertion for each reacted molecule is treated as a single transaction with a PreparedStatement object. In case 3, the group of insertions for all reacted molecules in one time step is treated as a single transaction with a Statement object. In case 4, the group of insertions for all reacted molecules in one time step is treated as a single transaction with a PreparedStatement object. In case 5, the batch update approach is applied.

Case 4, the PreparedStatement object with transaction management, has the best performance in the NOM application, offering a 3.03 times speedup relative to case 1 in this benchmark.


Figure 2: Comparison of data insertions using five approaches in the NOM simulation model

6 Parallel data output with Java threads

Querying a database is an I/O bound process, especially when the database server and the application server are on different machines, a common architectural design in large-scale scientific applications. For several independent queries, it is more efficient to make use of idle CPU cycles and execute the queries in parallel instead of one after the other. For database queries that only read data from a database, parallelism can easily be accomplished using the Java built-in threads and connection pooling technologies.

Large-scale scientific applications are not only I/O bound but also CPU bound. If the application server has multiple processors, multithreading can be used to overlap computation and communication; more specifically, parallelism can be achieved by overlapping the computation and the I/O. There are, however, trade-offs between simplicity of programming and performance. When an application not only reads but also writes data, several programming issues need to be considered to prevent deadlock and race conditions.

In the NOM simulation model, large amounts of data need to be written to the database at each time step. The average time for one record insertion using a PreparedStatement object with transaction management is 4.7 milliseconds, as shown in the previous section. In order to parallelize the data writes to the database, a buffer (FIFO queue) is allocated and an extra thread is created. While the computational thread adds objects to the queue, the I/O thread removes objects from the queue and writes their data to the database. If there are no objects in the queue, the data-writing thread busy-waits. If the number of objects in the queue equals the queue size, the computation thread waits. To safely add to and remove from the queue, all accesses to the queue are synchronized.
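The buffer between the two threads can be sketched as a bounded FIFO queue with synchronized access. Note one deliberate difference: where the model's data-writing thread busy-waits on an empty queue, this sketch blocks with wait()/notifyAll(), which achieves the same coordination without consuming CPU cycles:

```java
import java.util.LinkedList;
import java.util.Queue;

// Bounded FIFO buffer between the computation thread (producer) and the
// I/O thread (consumer). All queue access is synchronized, as in the text.
class OutputBuffer<T> {
    private final Queue<T> queue = new LinkedList<>();
    private final int capacity;

    OutputBuffer(int capacity) { this.capacity = capacity; }

    // Computation thread: blocks while the buffer is full.
    synchronized void put(T item) {
        while (queue.size() == capacity) waitQuietly();
        queue.add(item);
        notifyAll(); // wake the I/O thread if it was waiting on an empty queue
    }

    // I/O thread: blocks while the buffer is empty, then takes the oldest item.
    synchronized T take() {
        while (queue.isEmpty()) waitQuietly();
        T item = queue.remove();
        notifyAll(); // wake the computation thread if it was waiting on a full queue
        return item;
    }

    private void waitQuietly() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

The computation thread calls put(molecule) inside its time-step loop, while the I/O thread loops on take() and writes each record to the database.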

The NOM simulation model has been tested with runs of various lengths, from 96 time steps to 579 time steps; refer to Figure 3.

Figure 3: Overlapping computation and I/O using Java threads. Left: the number of data insertions over the time steps. Right: the speedup from using a separate thread for data output.

Using a separate thread for data output yields an average speedup of 1.3 relative to the single-threaded model for the NOM application. In the NOM simulation model, the computation takes more time than the I/O at each time step. If the computation took much less time than the I/O, more I/O threads could be used to execute the data writing.

7 Choosing JVM

Various Java Virtual Machines (JVMs), such as the IBM JVM and the Sun HotSpot JVM, have been implemented in conformance with the Java Virtual Machine Specification [7]. The runtime and the compiler are the two main parts of the Sun HotSpot architecture. The compiler portion of a JVM translates bytecode into native machine instructions to improve execution speed. The runtime portion of a JVM includes a bytecode interpreter, memory management, garbage collection, and a low-level task handler [3].

Sun Microsystems implements two types of HotSpot JVM, the Client VM and the Server VM, to meet the requirements of different applications. Compared with server-side programs, client-side programs often require a smaller RAM footprint and a faster start-up time. The two HotSpot JVMs share the same runtime portion; the main difference between them is in their compiler technologies. The Server VM contains a highly advanced adaptive compiler that includes many of the optimization technologies used in C++ compilers [3].

The application of the NOM simulation model has been benchmarked using the two different runtime modes of Sun JVM 1.4.1_01 on Red Hat Linux 8.0, for 500 time steps and 1500 time steps. Figure 4 shows that as the grid size and the number of time steps increase, the Server VM offers higher performance than the Client VM. The results show that choosing a different JVM can produce significant differences in performance.

Figure 4: Performance comparison for a NOM application run on the Sun Client VM and the Sun Server VM with different grid sizes.

Ladd (2003) [8] showed benchmark results for both the Sun and IBM JVMs, versions 1.3.1 and 1.4.1_01, on a Linux platform. According to Ladd, JVM version 1.3.1 performs better than version 1.4.1, and the IBM JVM performs better than the Sun JVM. Choosing the appropriate JVM for a particular scientific application also involves consideration of the hardware architecture and the operating system.

8 High performance compilers

Traditionally, a Java program is compiled into bytecode using a compiler, and a Java Virtual Machine (JVM) is needed to read in and interpret the bytecode.

Jikes2, a faster compiler developed by IBM, can translate Java source code into bytecode 10 times faster than the javac from Sun JDK 1.4.1_01. Although it does not necessarily generate faster-running code, it can speed up the development process for large applications.

GCJ3, the GNU Compiler for the Java language, can compile Java source code into either bytecode or native code. GCJ has been integrated into GCC. Bothner (2003) [9] discussed the advantages, features, and limitations of GCJ in detail. Ladd (2003) [8] ran several benchmarks using GCJ on the Linux platform and showed a performance gain of GCJ over other JVMs. The GCJ compiler is a project under development, and several limitations still exist. For example, GCJ cannot compile Java programs that use Swing. Using GCJ, a Java application that uses the Swarm library can be compiled into native code, but this does not improve its performance.

There are many other runtime environment optimizers and high performance compilers. The High Performance Compiler for Java from IBM alphaWorks can be used on the OS/2, AIX, and Windows NT platforms. JOVE4 can likewise only be used on Windows machines. The TowerJ environment5, developed by Tower Technology Corporation, is another example of a high performance compiler; it is available for the Solaris and Linux platforms.

9 Scalability

For large scale scientific applications written in Java, scalability can be improved using a parallel programming model. The parallel model can run with a single JVM on a shared memory multiprocessor system, or with multiple JVMs on a distributed memory system. The Java built-in threads mechanism is a convenient method for implementing parallelism in shared memory environments. However, for large scale applications that require large amounts of memory and CPU time, distributing the application over multiple JVMs on a distributed memory system, with some message passing mechanism for inter-VM communication, is a suitable way to address the requirements.

2http://www-124.ibm.com/developerworks/oss/jikes/
3http://gcc.gnu.org/java/
4http://www.instantiations.com/jove/
5http://www.towerj.com

The standard Java Thread class is appropriate for the parallel programming paradigm in a single-JVM environment. Since most scientific applications are CPU bound, the number of threads should match the number of processors in the hardware architecture in order to avoid context switching and achieve the best performance. Additionally, repeated thread creation and destruction should be avoided by creating and managing a thread pool.
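As a concrete illustration of the thread-pool advice above, the sketch below sizes a fixed pool to the processor count and reuses its threads across tasks. It relies on the java.util.concurrent executor framework, which was standardized after this paper was written (in J2SE 5.0); the class name, the four-partition split, and the per-partition work are hypothetical stand-ins, not code from the NOM simulator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadPoolSketch {
    public static void main(String[] args) throws Exception {
        // Size the pool to the number of processors: for a CPU-bound
        // simulation, extra threads only add context-switching overhead.
        int nProcs = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nProcs);

        // The pool's threads are created once and reused across tasks,
        // avoiding repeated thread creation and destruction.
        List<Future<Long>> partials = new ArrayList<>();
        for (int part = 0; part < 4; part++) {        // 4 grid partitions (stand-in)
            final int p = part;
            partials.add(pool.submit(() -> {
                long sum = 0;                          // stand-in per-partition work
                for (int s = 0; s < 1000; s++) sum += p + s;
                return sum;
            }));
        }
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        System.out.println("total=" + total);          // 4*499500 + 6000 = 2004000
    }
}
```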

OpenMP [10], an open standard for shared memory directives, defines directives for FORTRAN, C, and C++. OpenMP provides a portable and scalable model that offers a simple and flexible interface for developing parallel applications on shared memory systems. JOMP [11] provides a set of OpenMP-like directives and library routines for supporting shared memory parallel programming in Java. It uses Java threads as the underlying parallel model and is most useful for parallelizing scientific applications at the loop level.

For distributed computing, Java provides a communication mechanism using sockets and RMI (Remote Method Invocation). Java RMI [12] is a message passing paradigm based on the RPC (Remote Procedure Call) mechanism. RMI is primarily intended for use in the client-server model rather than the peer-to-peer communication model. On the other hand, the explicit use of sockets is too low-level for developing a parallel application [13].

For C, C++, and FORTRAN, an explicit message passing interface (MPI) [14] standard has been defined to support communication within an application in cluster environments. MPI is the most widely used standard for inter-process communication in high-performance computing. MPICH [15] and LAM/MPI [16] are two successful examples of portable MPI implementations for traditional languages. Programming with MPI is relatively straightforward because it supports the single program multiple data (SPMD) model of parallel computing, wherein a group of processes cooperate by executing identical program images on local data values.
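The SPMD model described above can be illustrated with the rank arithmetic every process would run to choose its own slice of the grid. This is a plain-Java sketch of the partitioning idea only; the method name and the 800-column example are invented for illustration and do not come from the paper.

```java
// Sketch of SPMD-style grid partitioning: every process runs the same
// program and uses its rank to pick the slice of data it owns.
public class SpmdPartition {
    // Returns {start, length} of the column range owned by `rank`
    // when `cols` grid columns are split among `size` processes.
    static int[] partition(int cols, int size, int rank) {
        int base = cols / size, rem = cols % size;
        int start = rank * base + Math.min(rank, rem);
        int len = base + (rank < rem ? 1 : 0);  // spread the remainder evenly
        return new int[] { start, len };
    }

    public static void main(String[] args) {
        // e.g. an 800-column grid on 4 nodes -> 200 columns per node
        for (int rank = 0; rank < 4; rank++) {
            int[] p = partition(800, 4, rank);
            System.out.println("rank " + rank + ": start=" + p[0] + " len=" + p[1]);
        }
    }
}
```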

In 1998, a group of researchers in the Java Grande Forum worked on a specification of an MPI-like application programming interface for message passing in Java (MPJ) [17]. Current implementations of MPJ fall into two categories: wrappers around existing native MPI libraries, or pure Java implementations. NPAC's mpiJava [18] is an example of the wrapper approach, using the Java Native Interface (JNI) to execute native calls. The JMPI project [19] implements message passing with Java RMI and object serialization. The jmpi implementation [20] is built upon the JPVM system. MPIJ [21] is a Java-based implementation of MPI integrated with DOGMA (Distributed Object Group Metacomputing Architecture). A pure Java MPJ implementation is usually slower than a wrapper around an existing MPI library, but pure Java implementations are more reliable, stable, and secure [20]. Getov (2001) [13] ran an experiment that used the IBM High Performance Compiler for Java (HPCJ), which generates native code for the RS6000 architecture, to evaluate the performance of MPJ on an IBM SP2 distributed-memory parallel machine. The results show that when using such a compiler, the MPJ communication components are as fast as those of MPI.
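Since the pure-Java MPJ implementations mentioned above rely on RMI and object serialization, a minimal round-trip through Java's serialization machinery shows the kind of marshalling such a layer performs for each message. The Molecule class below is a hypothetical stand-in; the paper does not show the simulator's actual molecule class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializeSketch {
    // Hypothetical minimal stand-in for a NOM molecule.
    static class Molecule implements Serializable {
        int x, y;
        double weight;
        Molecule(int x, int y, double weight) { this.x = x; this.y = y; this.weight = weight; }
    }

    public static void main(String[] args) throws Exception {
        Molecule m = new Molecule(3, 7, 412.5);

        // Serialize to a byte array -- this is what a pure-Java MPJ
        // layer would push through an RMI call or a socket stream.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(m);
        byte[] wire = bos.toByteArray();

        // Deserialize on the "receiving" side and check the payload.
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wire));
        Molecule copy = (Molecule) in.readObject();
        System.out.println("x=" + copy.x + " y=" + copy.y + " weight=" + copy.weight);
    }
}
```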

9.1 Parallel implementations

The scalability of the NOM simulation model involves two aspects: the required total number of simulation time steps and the grid size. Two parallel programming models have been implemented for the NOM simulation model. The Java thread version is implemented using built-in Java threads and runs on a single JVM. The distributed memory model is implemented using the mpiJava library with LAM/MPI.

In the sequential implementation of the NOM complex system simulation, the behavior of individual molecules is simulated using the agent-based modeling approach. Molecules reside in cells on a 2D grid, and individual molecules can be transported through the soil medium via water flow. At each time step, a molecule can move from one cell to another when a random event occurs. The time needed to finish a simulation is largely determined by the number of molecules in the system and the number of time steps.

9.1.1 Java threads version

In order to parallelize the programming model, the original grid is split equally into two subset grids (e.g., an 800x300 grid is separated into two 400x300 grids). Two threads are created; each thread has its own grid object in which to place molecules and a collection to hold molecules. In each time step, the computation for individual molecules is executed on the two threads concurrently.

When a molecule moves across the boundary, it is removed from the grid and placed into a local buffer in the current thread. At the end of the time step, a barrier is used to synchronize the two threads. After both threads reach this state, one of the threads executes the exchange operation, updating the state of the two grids and the two MoleculeLists. The threads synchronize again once this thread has finished setting the boundary conditions. Figure 5 shows the design.

Figure 5: The design for parallelism of the NOM simulation model using Java threads.
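A minimal sketch of the barrier-and-exchange design just described, using java.util.concurrent.CyclicBarrier (which postdates the paper; the original implementation would have used its own barrier). The barrier action runs in exactly one thread once both workers arrive, mirroring the single-thread exchange step; class names, the integer "molecule ids", and the initial grid contents are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CyclicBarrier;

public class BarrierExchangeSketch {
    // Per-thread buffers of molecule ids that crossed the boundary.
    static final List<Integer> buf0 = Collections.synchronizedList(new ArrayList<>());
    static final List<Integer> buf1 = Collections.synchronizedList(new ArrayList<>());
    static final List<Integer> grid0 = new ArrayList<>(Arrays.asList(1, 2, 3));
    static final List<Integer> grid1 = new ArrayList<>(Arrays.asList(4, 5, 6));

    public static void main(String[] args) throws Exception {
        // The barrier action runs in exactly one thread after both
        // workers arrive: it performs the exchange, as in Figure 5.
        CyclicBarrier barrier = new CyclicBarrier(2, () -> {
            grid1.addAll(buf0); buf0.clear(); // molecules crossing left -> right
            grid0.addAll(buf1); buf1.clear(); // molecules crossing right -> left
        });

        Thread t0 = new Thread(() -> step(grid0, buf0, 3, barrier));
        Thread t1 = new Thread(() -> step(grid1, buf1, 4, barrier));
        t0.start(); t1.start(); t0.join(); t1.join();
        System.out.println("grid0=" + grid0 + " grid1=" + grid1);
    }

    static void step(List<Integer> grid, List<Integer> buf, int crossing, CyclicBarrier barrier) {
        grid.remove(Integer.valueOf(crossing)); // molecule leaves this grid...
        buf.add(crossing);                      // ...into the local exchange buffer
        try { barrier.await(); } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```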

9.1.2 MPJ version

In the distributed memory model, each processor runs its own copy of the code. The ModelSwarm object contains a subset of the original grid. These subsets of the grid have equal size in order to ensure computational balance across the nodes of the cluster. The basic design is similar to the Java thread model. Each node in the cluster performs the computation on its own subset of the grid. When a molecule crosses the boundary, it is added to the local buffer. MPI.COMM_WORLD.Barrier is used to synchronize the processes at each time step. Molecule objects that cross the boundary are sent to their neighbor grids using the blocking send and receive modes MPI.COMM_WORLD.Sendrecv, MPI.COMM_WORLD.Send, and MPI.COMM_WORLD.Recv. Figure 6 shows this design. The molecule list and grid on each machine are updated after all the sending and receiving have finished.

Figure 6: The design for parallelism of the NOM simulation model using MPJ.

9.2 Performance results

In order to measure the performance of the parallel implementations, a Linux cluster was built. The cluster consists of 4 PCs; each PC has dual 650 MHz Intel processors and runs the Red Hat Linux 8.0 operating system. The Java code was implemented and compiled using Sun's 1.4.1_01 Java Development Kit and executed on Sun's Java Virtual Machine.

Figure 7 presents the results for the sequential version and the Java thread version of the NOM application, run for 1500 time steps with various grid sizes. Both versions ran on a single dual-processor Intel computer.

The experiments on the distributed-memory parallel model of the NOM simulation were run on the cluster of four PCs. LAM/MPI 6.5.9 and mpiJava were used to build the execution environment. The NOM application was run on 2 and 4 machines.

Figure 8 shows the performance comparison between the sequential programming model and the MPJ model running on 4 nodes of the cluster. Both models ran for 500 and 1500 time steps. The figure shows that when the problem size is small, the communication between nodes and the grid and MoleculeList maintenance offset the performance gained by distributing the job across different computers.


Figure 7: Performance comparison between the sequential programming model and the Java thread model in the NOM application (one thread vs. two threads on a single dual-CPU computer).

As the problem size grows (a larger grid or more time steps), however, a performance improvement appears from distributing the job across multiple nodes. Compared to the ideal performance gain of a factor of 4, the efficiency is low.

Figure 9 illustrates the case where the job is distributed over four nodes of the cluster and run for 500 time steps with no synchronization between nodes. Instead of sending the molecules that cross the boundary to other nodes, they are wrapped around to the other side of the local subset grid. The figure shows that as the grid size increases, the speedup approaches the ideal linear speedup of 4.

10 Conclusion

In this paper, several approaches for improving the performance and scalability of a typical scientific application, the NOM simulation model, have been presented. These approaches are summarized in Table 1. The speedup varies with the problem size and the situation. The speedups shown in the table come from the benchmarks described in the previous sections; each number represents the approximate highest performance gain observed in that benchmark.

Figure 8: Performance comparison between the sequential programming model and the MPJ model running on 4 machines for 1500 time steps.

Figure 9: Performance comparison between the sequential programming model and the MPJ model on 4 machines without synchronization and molecule exchange.


Table 1: Summary of the performance improvements

Approach             | Speedup | Comments
---------------------|---------|------------------------------------------
Data structure       | 2.8     | Using ArrayList provides higher overall performance than using LinkedList in the benchmark.
Object reuse         | -       | Reduces the memory footprint, the garbage collection cycles, and the overhead of object allocation and deallocation. The performance gain is relatively small compared to the overall computation time in the NOM simulation model.
JDBC                 | 3       | JDBC tuning can affect the performance of data I/O.
Parallel data output | 1.3     | Overlapping computation and I/O using Java threads can improve performance.
Java runtime         | 1.4     | The Sun Server VM has higher performance than the Sun Client VM for large-scale scientific applications.
Java threads model   | 1.1     | The performance of the Java threads model depends largely on the JVM implementation and how efficiently the operating system handles threads.
MPJ model            | 1.5     | Distributing the job over 4 nodes of a cluster yields relatively large performance gains when the problem size is big and communication between nodes is minimized; this is still much lower than the ideal gain of 4.

Besides the approaches listed in the table, using a native code compiler that compiles Java source code to native code can increase performance for some applications.

Program profiling is a crucial step for high performance computation in Java-based applications. Two major aspects, CPU time and memory usage, need to be monitored.

Selecting the appropriate runtime environment for a particular application is important. Besides the JVMs evaluated here, the IBM JVM is also worth investigating.

Using Java built-in threads to parallelize Java applications is a convenient approach. How much performance can be gained from this parallelism depends on the JVM implementation and on how efficiently the operating system handles Java threads. Our experiments were run on a dual-CPU PC with the Linux operating system; it would be worthwhile to extend these experiments to the Windows or Solaris operating systems.

Distributing jobs over multiple machines in a cluster environment is an efficient approach for large problems. However, the communication among nodes and the grid and list maintenance offset the performance gain. To reduce this overhead, instead of synchronizing all the processes at each time step, they can be synchronized every 5 time steps or more. For this particular application, it is also possible to distribute the job over multiple machines using MPJ and combine the results in the database at the end of the simulation, so that there is no communication at all after the job is distributed. However, both of these approaches require validation.
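The suggestion above, synchronizing every 5 time steps instead of every step, can be sketched as a loop that only calls the barrier when the step counter is a multiple of the synchronization interval. The sketch assumes java.util.concurrent (which postdates the paper) and counts barrier trips in place of the real exchange work; all names are invented for illustration.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class SparseSyncSketch {
    public static void main(String[] args) throws Exception {
        final int STEPS = 20, SYNC_EVERY = 5;   // synchronize every 5 steps
        AtomicInteger exchanges = new AtomicInteger();
        // The barrier action stands in for the molecule-exchange step.
        CyclicBarrier barrier = new CyclicBarrier(2, exchanges::incrementAndGet);

        Runnable worker = () -> {
            for (int step = 1; step <= STEPS; step++) {
                // ... local computation for this time step would go here ...
                if (step % SYNC_EVERY == 0) {    // barrier (and exchange) only
                    try { barrier.await(); }     // every SYNC_EVERY steps
                    catch (Exception e) { throw new RuntimeException(e); }
                }
            }
        };
        Thread t0 = new Thread(worker), t1 = new Thread(worker);
        t0.start(); t1.start(); t0.join(); t1.join();
        System.out.println("exchanges=" + exchanges.get()); // 4 instead of 20
    }
}
```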

References

[1] Mike Ashworth. The potential of Java for high performance applications. In The First International Conference on the Practical Application of Java, pages 19-33, 1999.

[2] J. M. Bull, L. A. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for scientific applications. In Proceedings of the 2001 Joint ACM-ISCOPE Conference on Java Grande, pages 97-105, June 2001.

[3] Steve Wilson and Jeff Kesselman. Java Platform Performance: Strategies and Tactics, Appendix B. Addison-Wesley, 2000.

[4] Dennis M. Sosnoski. Java performance programming, part 1: Smart object-management saves the day. JavaWorld, Nov 1999. http://www.javaworld.com/javaworld/jw-11-1999/jw-11-performance.html.

[5] Steve Wilson and Jeff Kesselman. Java Platform Performance: Strategies and Tactics, Appendix A. Addison-Wesley, 2000.

[6] Greg Barish. Building Scalable and High-Performance Java Web Applications Using J2EE Technology. Addison-Wesley, 2002.

[7] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification, Second Edition. Addison-Wesley, 1999.

[8] Scott Robert Ladd. Benchmarking compilers and languages for IA32. http://www.coyotegulch.com/reviews/almabench.html, January 2003.

[9] Per Bothner. Compiling Java with GCJ. Linux Journal, Jan 2003. http://www.linuxjournal.com/article.php?sid=4860.

[10] OpenMP Architecture Review Board. OpenMP C and C++ application programming interface. Technical report, OpenMP Architecture Review Board, 1998. Available from http://www.openmp.org.

[11] Mark Bull and Mark Kambites. JOMP, an OpenMP-like interface for Java. In ACM 2000 Java Grande Conference. ACM, 2000. Available from http://www.epcc.ed.ac.uk/research/jomp.

[12] Sun Microsystems. Java remote method invocation specification. Technical report, Sun Microsystems, 1998. Available at http://java.sun.com/products/jdk/rmi/.

[13] V. Getov, G. von Laszewski, M. Philippsen, and I. Foster. Multiparadigm communications in Java for grid computing. Communications of the ACM, 44(10):118-125, 2001.

[14] MPI. http://www-unix.mcs.anl.gov/mpi/.

[15] MPICH: A portable implementation of MPI.http://www-unix.mcs.anl.gov/mpi/mpich/.

[16] LAM/MPI parallel computing.http://www.lam-mpi.org/.

[17] Bryan Carpenter, Vladimir Getov, Glenn Judd, Anthony Skjellum, and Geoffrey Fox. MPJ: MPI-like message passing for Java. Concurrency: Practice and Experience, 12(11), 2000.

[18] Mark Baker, Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko, and Sang Lim. mpiJava: An object-oriented Java interface to MPI. In International Workshop on Java for Parallel and Distributed Computing, IPPS/SPDP 1999, April 1999.

[19] Steven Morin, Israel Koren, and C. Mani Krishna. JMPI: Implementing the message passing standard in Java. In International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops. IEEE, 2002.

[20] Kivanc Dincer. Ubiquitous message passing interface implementation in Java: jmpi. In 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IEEE, 1999.

[21] Glenn Judd, Mark Clement, and Quinn Snell. DOGMA: Distributed Object Group Metacomputing Architecture. In Concurrency: Practice and Experience, volume 10. ACM 1998 Workshop on Java for High-Performance Network Computing, 1998.
