
DCA++: Winning the Gordon Bell Prize with Generic Programming

Michael S. Summers

May 3, 2009

Abstract

The 2009 Gordon Bell prize was won by the DCA++ code, the first code in history to run at a sustained 1.35 petaflop rate. While many GB prize winning codes have been written in FORTRAN, the DCA++ code is fully object-oriented C++ and makes heavy use of generic programming. This paper discusses how the DCA++ team simultaneously achieved world-class performance and the maintainability and elegance of modern software practice.

1 DCA++: The Rest of the Story

The DCA++ team is a large, diverse team with a joint skill set that covers all aspects of designing, developing, hosting, running, and using HPC applications. The story of this team’s scientific pursuits and how they won the Gordon Bell prize has been well told by the team’s leadership. For a quick overview, the funding sources, and the full list of team members, see Thomas Schulthess’s “The DCA++ story” [TSCH09]. What has been mentioned but not presented in detail is the software development story behind DCA++.

The team’s software development experience influenced the code’s development. The author’s own experience started in the early 70’s, using FORTRAN on a CDC 6600 (designed by Seymour Cray). In the 80’s he researched prototype command and control systems for the Air Force. Some of these systems ran on artificial intelligence machines such as the Symbolics 3600. Finally his experience came full circle back to scientific computing with Version 2 of the DCA++ code.

Figure 1: The code of DCA++ Version 2 was designed using skills first learned on a Symbolics Lisp Machine.

Researchers working in the 80’s with Symbolics systems were asked to demonstrate very complex hybrid AI/numerical applications. Because of the nature of the problems they faced, the research often involved a mix of computer science, physics, and engineering. Since they needed to explore many possibilities they also had to develop and re-develop their software very quickly.

These researchers recognized that they were jointly developing both procedural and ontological knowledge and that they were encoding this knowledge into the software they developed. They understood that in order to manage the complexity and cost of their software they needed the ability to define software abstractions, and they developed languages and tools to this end.

However, it was not the languages and tools that managed the complexity of their code. It was the identification and use of correct, efficient abstractions (concepts, in generic programming terminology) that managed complexity both in their minds and in their code. The structure of the code had to reflect the most efficient way to think about the problem. In cases where there were real-time requirements, the abstraction process had to be interwoven with performance considerations.

Looking back, the Symbolics represents more than a hardware and software platform. It represents the collective software development wisdom of its time.

This paper is the story of the application of this wisdom to the DCA++ code.


Cray User Group 2009 Proceedings


2 DCA++ Requirements

The DCA++ code is a multi-scale, quantum field theory application. At the bulk scale the “bath” (see Figure 2) is represented with a self energy function. At the atomic scale it uses a very large set of disorder configurations to model the many impurities that would be found in a superconducting material. Each disorder configuration (the cluster of Figure 2) is represented by a single-band 2D Hubbard model.

The self energy of the bath is used to initialize the clusters, and the correlation functions computed from the clusters can be used to compute a new self energy. The code iterates until a self-consistent self energy and correlation function are found. The computation of each cluster’s correlation functions involves a Monte Carlo integration (represented by the dice in the figure).

Figure 2: The DCA++ code computes correlation functions from a description of the material lattice. Material properties such as the superconducting transition temperature, Tc, can be calculated from the correlation functions.

The structure of this problem has at least three levels of hierarchical parallelism (see Figure 3). The available processors are first divided up into teams, one for each disorder configuration. Each disorder team can then be divided into Monte Carlo integration sub-teams.

Each integration sub-team uses a separate Markov chain to generate a sequence of integrand values. The processing of each Markov chain involves an expensive update procedure which can optionally be handled by a sub-sub-team of processors. The update procedure involves some linear algebra which is the current bottleneck in the system. The code spends 95% of its time in these procedures. Depending on the hardware it may be appropriate for the bottleneck sub-sub-teams to share memory.

Figure 3: The DCA++ code has a natural parallelism structure that can make use of a very large number of processors.

The DCA++ code was constructed in two stages. The first stage, Version 1, was developed to demonstrate that the team’s algorithms could be implemented efficiently and that the code could be organized to scale to 50k or more processors. While Version 1 was written in C++ and made some use of object orientation and generic programming, it was not written in a way that would reduce the cost of adding future extensions.

As Figure 4 shows, Version 2, a complete rewrite, was required to meet all of Version 1’s requirements and also to be written in a way that would reduce the cost of developing the many envisioned extensions. The allocation of requirements between Versions 1 and 2 was also chosen to match the work loads and skills of the team’s principal software developers. The principal developer of Version 1, Gonzalo Alvarez, an excellent and efficient coder, was foremost a physicist. The version that he produced provided the author (only a “wannabe” quantum mechanic) with the basis he needed to refactor Version 1 into Version 2. This pairing of skill sets has worked well for us.

3 Refactoring

The basic idea of refactoring is presented in Figure 5. As every code is developed, it increases in both functionality and complexity. As the complexity of the code grows it becomes harder and harder to change, until it reaches a ceiling where adding new capabilities is too costly. While it may appear that the code is at an impasse at this point, it is usually not so.


Figure 4: The DCA++ requirements distributed between versions.

Much of the complexity of the code is not intrinsic to the problem the code is solving. During the refactoring process code is rewritten (often just a portion of it) so that 1) the non-intrinsic complexity is greatly reduced and 2) the new abstractions used in the code change the rate at which complexity grows with additional functionality. This is illustrated in Figure 5 by the reduced slope after refactoring.

Figure 5: Refactoring enables us to afford more functionality.

It is usually not possible to avoid refactoring by comprehensive upfront design. This is especially true for research code. There are several reasons for this:

• The design of the program’s abstractions involves many trade-offs. We often need experience with actual systems before we can make good design decisions.

• As researchers we are learning about the problem we are trying to solve as we develop the code.

• As researchers we are also discovering the best algorithms to use as we develop the code.

• Many refactoring opportunities can be identified by simply looking for redundant (or nearly redundant) code.

• Upfront designs of a system of abstractions/concepts are usually too general. Developers end up writing more code than necessary and leave this code untested.

4 The Race to the Finish

Our experience developing the DCA++ code and submitting it for the Gordon Bell prize provides an interesting illustration of the refactoring process.

Figure 6 presents the two-year time frame prior to the Gordon Bell DCA++ runs in November of 2008. During the first year Gonzalo Alvarez was writing Version 1 and the author was working on the underlying symmetry package which is shared by both systems. When the symmetry package was finished and integrated into Version 1, work on Version 2 began.

Development of Version 1 continued. As it demonstrated its functionality and performance it became the production version. It was used to demonstrate that the code could scale to the order of 50k processors, and it was expected to be the Gordon Bell code.

Figure 6: The time line of the DCA++ development.

Initially the performance runs were to be on the “old” Jaguar XT4, but later in the year it became clear that the XT5 (with its 150k processors) would be available for the Gordon Bell Prize runs. There was a narrow time window in which the new machine would be available for this purpose. By that time several things had changed.

• We had developed a very detailed automated testing system to verify that Version 2 was producing exactly the same results as Version 1. This system showed that Version 2 was very close to meeting its functional requirements.

• After sorting out some issues, Version 2 appeared to be running as fast as Version 1.

• We had decided that we should modify the codes so that the Monte Carlo integration would operate in single precision. Due to the nature of the integration process, we knew it was unnecessary to run this part of the code in double precision. It was also clear that our performance would increase significantly if we made this change.

Figure 7: During the Gordon Bell runs a need to modify the collective MPI approach was discovered.

In the end it was a very close race; Version 2 passed all of its acceptance tests days before the window on the XT5 opened. However, during the runs the refactoring paid off in two ways:

• The precision change turned out to be fairly trivial for Version 2 (an afternoon’s work) and difficult for Version 1. This change resulted in 1.35 petaflops instead of 0.7 petaflops.

• Running on 150k processors versus 50k processors revealed a weakness in the MPI collective communication approach that we had previously taken. The same approach was used by both versions. However, Version 2 required a change to only a few lines of code in one inherited method, whereas the same change in Version 1 had to be made in various places.

As a result, Version 1 never actually crossed the finish line in terms of petaflop performance.

Figure 8: The use of generic programming made the change of precision easy in Version 2. The actual change from dgemm to sgemm did not require any code changes. It was performed automatically at compile time by the C++ template and overloaded function matching.

5 After the Gordon Bell Runs

After the Gordon Bell runs, Version 2 became the production code. Gonzalo Alvarez now maintains this code and jealously guards it against unvalidated modifications, since he (and the other physicists on the team) use the code for their work. The author works on the development versions. Just before the Gordon Bell runs we switched our version control system from Subversion (SVN) to Mercurial (Hg), a distributed system. As Figure 6 shows, we now use Hg to manage all of the development and production branches of the code, periodically merging development capabilities back into the production version.

The development branches provide functionality which:

• Replaces the MpiSplit-based processor topology with a multi-dimensional MpiCart topology (see Figure 9). This allows us to construct multi-dimensional disorder configurations. It also allows us to add another hierarchical layer in the parallel processing. This will permit us to experiment with combining results from many DCA++ runs, each of which has its momentum structure shifted from the others.

• Provides the capability to exhaustively generate many combinations of disorder configurations.

• Provides a lightweight JavaScript Object Notation (JSON) parser written in C++ with a few special features that allow us to handle large arrays efficiently.


• Replaces the Delayed Update Algorithm with a much faster algorithm.

During all of this development activity, low-level refactoring continues as we learn new ways to make the code simpler and more succinct.

Figure 9: The development branch of Version 2 has changed from an MpiSplit-based topology to a Cartesian topology.

We have also been experimenting with the “D” programming language. The “D” language is simpler than C++, its compilers are easier to write, and it has a much cleaner template system. The very useful metaprogramming capabilities of C++ were accidental in nature. Because of this they are unnecessarily awkward. This makes “D”’s template system very attractive.

So far we have been able to install the GDC compiler on smoky.ccs.ornl.gov, construct the equivalent of mpic++, and write a small test MPI application. This application links with MPI, BLAS, and LAPACK, which is all we require of DCA++. The next step is to write some small benchmark codes and evaluate them vis-à-vis their C++ equivalents.

Figures 10 and 11 contain summary information about the current development branch. As the figures show, the system has a minimum of external dependencies (requiring only MPI and linear algebra packages) and is almost entirely generic.

As we approach Version 3, we will be reworking our testing framework. Comparison runs with Version 1 are now insufficient, since we will be testing many new capabilities. We will be automating and extending our unit testing framework. The theoretical side of the team will be identifying new system-level tests. However, as we continue to refactor the system we find that its clarity makes it easier to verify through simple audit.

Figure 10: 97% of the DCA++ code is Object Oriented; 89% of it is Generic.

Figure 11: 97% of the DCA++ code is Object Oriented; 89% of it is Generic.

“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. It demands the same skill, devotion, insight, and even inspiration as the discovery of the simple physical laws which underlie the complex phenomena of nature.” [Hoare1980]

6 Summary

Physicists generally understand that working with complex physical phenomena requires a foundation in mathematics. Analogously, we have come to understand that the development of good scientific code also requires a foundation. This foundation is a mastery of software technologies like generic programming.


The mathematics that a physicist uses to describe nature makes what he writes clearer to the mathematically trained, but at the same time it makes it more difficult to understand by those who do not have the requisite training. Similarly, the use of software abstraction technologies makes the code easier to work with for those who have training and harder to work with for those who don’t.

Our assumption is that in the case of scientific codes such as DCA++, the use of “SWAT” development teams will be the most productive. Clearly those who learn the mathematics of quantum mechanics may also learn the analogous software technologies.

It has been our experience that it is relatively easy to develop software that is both high performance and well designed. Performance issues do affect the design of the code and the choice of the abstractions used. The developer must be constantly aware of the performance implications of his design decisions. Fortunately, compile-time technologies such as generic programming help to improve performance more often than they cause problems.

7 About the Author

Michael Summers
Oak Ridge National Laboratory
S&T Staff
Computer Science and Mathematics Div.
1 Bethel Valley Road, Oak Ridge, TN
(865) [email protected]

References

[TSCH09] Thomas C. Schulthess, [email protected], “The DCA++ Story.” Advanced Scientific Computing Advisory Committee Meeting, Washington DC, March 3-4, 2009. http://www.er.doe.gov/ASCR/ASCAC/Meetings/Mar09/Schulthess.pdf

[Hoare1980] Charles A. R. Hoare, 1980 Turing Award Lecture. Communications of the ACM 24(2) (February 1981): pp. 75-83.
