IMPLEMENTING CFD (COMPUTATIONAL FLUID DYNAMICS) IN OPENCL FOR BUILDING SIMULATION

Yue Wang, Ali Malkawi, Yun Yi
T.C. Chan Center, University of Pennsylvania

ABSTRACT
Though researchers in computer graphics have started to use the GPGPU (General-Purpose Graphics Processing Unit) method to speed up their procedural programs, these techniques are seldom used in the building simulation field. It is possible to apply the GPGPU method to many simulation scenarios (e.g. human evacuation, shadow simulation) to speed up performance. In this paper, CFD is used as an example to show how the GPGPU method can benefit building simulation. CFD is widely used for building performance analysis; however, CFD is computationally expensive, so large-scale problems often take a long time to simulate.

INTRODUCTION
Computational Fluid Dynamics (CFD) is a branch of fluid mechanics that uses numerical methods and algorithms to solve and analyze problems that involve fluid flows. Computers are used to perform the millions of calculations required to simulate the interaction of liquids and gases with surfaces defined by boundary conditions.
The foundation of CFD is the Navier-Stokes equations. The Navier-Stokes equations, named after Claude-Louis Navier and George Gabriel Stokes, describe the motion of fluid substances, which have the ability to flow and offer no lasting resistance to deformation. These equations arise from applying Newton's second law to fluid motion, together with the assumption that the fluid stress is the sum of a diffusing viscous term proportional to the gradient of velocity, plus a pressure term.
Since air distribution within a room is usually turbulent, the Reynolds-averaged Navier-Stokes (RANS) equations and an accompanying turbulence model are used instead of direct simulation of the traditional NS equations. The k-epsilon model is one of the most common RANS turbulence models. The model is widely used in building science research, especially for indoor air quality and thermal distribution simulation (Cheng et al., 2008), (Sun et al., 2007), (Tominaga et al., 2002), (Neofytou et al., 2006), (Selvam, 1996).
The k-epsilon model is an industry standard for airflow simulation in building performance analysis, and most papers published in the building simulation field use this model as the primary engine. However, even with this time-averaged approach, the Navier-Stokes equations are notable for the length of time they take to solve.
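For reference, the incompressible Navier-Stokes equations described above, and the standard k-epsilon model referred to throughout this paper, take their usual textbook forms:

    \nabla \cdot \mathbf{u} = 0, \qquad
    \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
        = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} + \mathbf{f},

where u is the velocity, p the pressure, ρ the density, ν the kinematic viscosity and f a body force. The k-epsilon model closes the RANS equations with two transport equations and an eddy viscosity:

    \frac{\partial k}{\partial t} + \nabla\cdot(k\mathbf{u})
        = \nabla\cdot\!\left[\left(\nu + \frac{\nu_t}{\sigma_k}\right)\nabla k\right] + P_k - \varepsilon,
    \frac{\partial \varepsilon}{\partial t} + \nabla\cdot(\varepsilon\mathbf{u})
        = \nabla\cdot\!\left[\left(\nu + \frac{\nu_t}{\sigma_\varepsilon}\right)\nabla\varepsilon\right]
          + C_{1\varepsilon}\frac{\varepsilon}{k}P_k - C_{2\varepsilon}\frac{\varepsilon^{2}}{k},
    \qquad \nu_t = C_\mu \frac{k^{2}}{\varepsilon},

with the standard Launder-Spalding constants C_\mu = 0.09, C_{1\varepsilon} = 1.44, C_{2\varepsilon} = 1.92, \sigma_k = 1.0, \sigma_\varepsilon = 1.3, and P_k the turbulence production term.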

Though researchers continuously refine the numerical algorithms to improve efficiency, CFD remains non-linear and can only be used to solve a limited set of simple building simulation problems on modern CPUs. For this reason, in recent years researchers have been exploring new ways to speed up the simulation. The GPGPU (General-Purpose Graphics Processing Unit) method is one of these tentative approaches. The GPU has attracted attention for numerical computing because its structure is highly parallelized and optimized to achieve high performance for image processing. GPUs perform floating-point calculations that translate quickly into shading operations thanks to the accelerated hardware. Moreover, GPU speed has improved dramatically over the past five years, and GPU floating-point throughput is now much greater than that of CPUs (NVIDIA, 2007).
After the potential to do general-purpose computation on a GPU (a.k.a. GPGPU) was realized, several companies published specifications and implementations of computation frameworks for their own GPUs. Compute Unified Device Architecture (CUDA) was developed by NVIDIA Corporation in 2007 (NVIDIA, 2007). AMD offers a similar SDK for their ATI-based GPUs called the Stream SDK (formerly CTM, Close To Metal), designed to compete directly with NVIDIA's CUDA (ATI, 2009). DirectCompute was developed by Microsoft to take advantage of the massively parallel processing power of a modern graphics processing unit (GPU) to accelerate PC application performance in Microsoft Windows Vista or Windows 7 (Microsoft). Several subsequent papers used GPGPU frameworks such as CUDA to run CFD simulations on the GPU (Corrigan et al.), (Zuo and Chen, 2010). Unfortunately, there are still many problems with the CUDA framework. Portability is the most important issue: CUDA programs cannot be run on a traditional CPU.
In 2008, Apple Inc. developed a new technology called OpenCL, which harnesses the power of GPU computation for general-purpose numerical calculations. With the support of AMD, Intel, and NVIDIA, Apple proposed OpenCL to the Khronos Group (creators of OpenGL, a cross-platform computer graphics API) as the basis for a new standard. Demonstrating the strength of the proposal, OpenCL was expanded to include digital signal processors (DSPs) and other specialized processor architectures. It was ratified as a royalty-free open standard in December 2008 (Khronos Group).


On August 28, 2009, Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL (Apple). AMD and NVIDIA are closely following, each having released OpenCL implementations in beta (AMD), (NVIDIA).
With the availability of these tools, it is now possible to apply the OpenCL computing model to many simulation scenarios (e.g. human evacuation, shadow simulation) to speed up performance. The GPU is good at computing many similar floating-point calculations simultaneously. Therefore, if all the elements, or agents, in a mathematical model share the same governing equation, it is possible to speed up performance by parallel computing. For example, in Helbing's human evacuation model (Helbing, 1995), (Helbing, 2000), all the pedestrians share the same social force equation, so parallel computing is possible by creating numerous threads, with each thread predicting the movement of one agent. In this paper, CFD is used as an example to show how the GPGPU method can benefit building simulation, since in a CFD computation all the elements (grid cells) usually share the same governing equation, the Navier-Stokes equation.
The performance benefit of GPGPU computing gives hope of solving CFD problems more efficiently. In 2010, Wangda Zuo optimistically wrote, "It is possible to implement the CFD solver on the GPU by using a similar strategy. One can also expect that the speed of CFD simulations on the GPU should be faster than that on the CPU. For the CFD codes written in C language, the implementation will be relatively easy since only the parallel computing part needs to be rewritten in CUDA" (Zuo and Chen, 2010). However, no research in the building simulation field has achieved this goal in a shipped production tool, due to many technical difficulties. The major problem is that most mature CFD programs have a large source code base. For example, OpenFOAM (OpenCFD Ltd., 2009), an open source CFD toolbox, has millions of lines of code. When compiled into binaries, the program consumes about 200 megabytes of disk space. It is virtually impossible to convert such large portions of code into GPU source code, since GPU code and CPU code differ in many ways, and converting CPU code to GPU code might take even more time than writing the CPU code itself, inasmuch as GPU programming requires detailed knowledge of the hardware and proficient GPGPU programming and debugging experience (GPU programming tools are not as advanced as CPU ones). These factors make GPU programming much harder than CPU programming. Moreover, CFD software programs do not use a simple C programming paradigm; instead, they usually have very advanced software architecture and use generic programming and object-oriented programming extensively to make the code maintainable.

OpenFOAM, for example, heavily depends on C++ features like operator overloading, class inheritance, and templates. This makes it extremely difficult to convert to OpenCL or CUDA code without losing the programming flexibility the software provided before. As a result, converting such a program requires rewriting the code in C first, which in the end increases the source code footprint several times over.
There are several ways to simplify the numerical solution of the Navier-Stokes equations to make CFD calculations faster. One of the most popular is Fast Fluid Dynamics (FFD), which breaks the Navier-Stokes equations into several sub-equations and solves them one by one. The FFD scheme was originally proposed for computer visualization and computer games (Stam, 1999), (Harris, 2003), (Song et al., 2005). FFD creates a test bed for carrying out experiments of fluid simulation on the GPU. FFD's algorithm is a simple four-step solver and can be written within one or two hundred lines of code (LOC), compared to CFD's millions of LOC. All FFD procedures are structured as simple iterations over all the grid cells, which makes parallelization possible. Traditional CFD, by contrast, does not simply iterate through all the grid elements, but solves a non-linear equation set instead. This structure makes writing an FFD solver and converting the code to a GPU version practically feasible, and many open source FFD libraries and applications are widely available for download. Some recent papers apply FFD to building simulation. In 2007, Qingyan Chen's group published a proceedings paper describing initial work to validate FFD for room airflow at the Building Simulation 2007 conference (Zuo and Chen, 2007). They published a comprehensive conclusion in Indoor Air in 2009 (Zuo and Chen, 2009). The results showed that FFD is about 50 times faster than CFD. FFD could correctly predict laminar flow, such as a laminar flow in a lid-driven cavity at Re = 100. However, this research also showed that FFD has problems computing turbulent flows due to the lack of turbulence treatments. Although FFD can capture the major pattern of the flow, it cannot compute the flow profile as accurately as CFD does. Researchers tried to improve FFD by adding some simple turbulence treatments, but no general improvement was found.
Though various researchers use FFD as a simple test bed to show the possibility of running fluid simulation on the GPU, no paper has been published in the building simulation field that runs CFD with a fully fledged RANS model on top of the GPU, and there is also no available code ready to be used for production, due to the above technical difficulties.

METHODOLOGY
Given that it is not practical to port a large code base from CPU to GPU architecture due to the aforementioned technical difficulties, and that CFD programs usually have a large source code footprint with complicated software design and implementation, this research adopts a selective approach instead of translating the entire fully fledged RANS solver to OpenCL. This research performs the traditional performance tuning that is widely used in software engineering. In software engineering, performance tuning should follow these steps, and this research strictly follows them:

1. Assess the problem and establish numeric valuesthat categorize acceptable behavior.

2. Measure the performance of the system beforemodification.

3. Identify the part of the system that is critical forimproving the performance. This is called thebottleneck.

4. Modify that part of the system to remove the bot-tleneck.

5. Measure the performance of the system aftermodification.

In the first step, we use OpenFOAM (OpenCFD Ltd., 2009) as the initial code base. The execution binaries are compiled, and case files are simulated using this engine as the reference. Several unit testing procedures were written in this research to ensure that whenever the source code is optimized, the simulation results remain similar to the reference implementation results (with a small tolerance left for the rounding error of floating-point calculations).
Secondly, this research adopts a state-of-the-art time-profiling tool to measure the performance of the unmodified program. In this case, Apple's Xcode Instruments tool is used for timing. This software is built on top of Sun's DTrace utilities and is capable of timing various aspects of system performance, including the cumulative execution time consumed by each function in the source code, with great precision.
Thirdly, the function that consumes most of the running time, the bottleneck, is found. Later, this research will show that one tiny function, the conjugate gradient solver, takes most of the running time. This behavior is well known in numerical analysis and can be explained by the fact that it is the dominant procedure in finite methods.
Fourthly, the bottleneck function is rewritten in OpenCL with exactly the same mathematical algorithm, with all the iterations running in parallel. Even though the conjugate gradient is a small function, the conversion takes great effort; the reasons are explained later.
Lastly, with the unit tests performed to ensure that the CPU and GPU versions give the same results for the simulation cases, this research compares the execution times of the traditional program and its GPU counterpart to show the performance improvement.
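As an illustration of the tolerance check used in the first step, a minimal C sketch is given below; the function name, field layout, and tolerance value are ours for illustration, not taken from the actual test suite.

    #include <math.h>
    #include <stdio.h>

    /* Compare a solver result field against the reference implementation,
       allowing a small relative tolerance for floating-point rounding. */
    static int fields_match(const double *ref, const double *test,
                            int n, double rel_tol)
    {
        for (int i = 0; i < n; i++) {
            /* Guard against division by very small reference values. */
            double denom = fabs(ref[i]) > 1e-12 ? fabs(ref[i]) : 1.0;
            if (fabs(ref[i] - test[i]) / denom > rel_tol) {
                fprintf(stderr, "mismatch at cell %d: %g vs %g\n",
                        i, ref[i], test[i]);
                return 0;
            }
        }
        return 1;  /* all cells within tolerance */
    }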

SPEEDING UP SIMULATION USING OPENCL
Though in theory it is possible to convert the entire program into an OpenCL version and run it at full speed, this kind of transition is unlikely to happen, since it takes too much effort. Given that GPU programming is much harder than traditional programming, it is not practical to port such a large code base to an OpenCL program. This is why the fluid papers that incorporate GPGPU use a much simpler but less precise algorithm or model and only work on one or two hard-coded cases for analysis; so far, this has been the only feasible way to implement a GPU fluid solver.
For CFD-related problems, most of the computation happens in only a few pieces of code. For example, in OpenFOAM there are thousands of C++ APIs and functions, such as fvm::, fvc::, interpolation, matrix solvers, turbulence models, etc. However, benchmarking with profiling tools (such as Instruments on Mac OS X, based on Sun's DTrace utility) shows that a single procedure, the conjugate gradient method, takes more than 90% of the overall running time. The running time of the other procedures is negligible.
This behavior is easy to explain. Traditional CFD programs divide the space into grids and apply the Navier-Stokes equations, turbulence models, boundary conditions and so on to each grid cell, and then solve the problem using the Finite Difference Method, Finite Volume Method, or Finite Element Method. The most common way to solve the Navier-Stokes equations in industry is the finite volume method. Most commercial or open source CFD codes have a finite volume solver built in, so all the differential equations are turned into linear form when this method is applied. The program then solves the linear system using the conjugate gradient method.
For example, Dean and Glowinski (1993) list all the procedures required to discretize the Navier-Stokes equations into several linear equations using the finite element approach. Although the process is complicated, most operations are just one-pass scalar-vector multiplications or vector-vector additions, which are relatively simple and require little time. Solving the linear equations, however, takes more time. Since the Navier-Stokes equation is non-linear, although it can be written in a linearized (matrix manipulation) form, both the left side and the right side of the linear equation contain unknown variables. Many iterations are required for convergence, and each iteration is itself a conjugate gradient solve. The ADFC project (the ADFC team) implements the Dean and Glowinski (1993) paper in C++, and preliminary benchmarks show results similar to OpenFOAM's (90% of the running time is devoted to the conjugate gradient method, implemented in the source file gradiente_conjugado.c).
In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. The conjugate gradient method is an iterative method, so it can be applied to sparse systems that are too large to be handled by direct methods such as the Cholesky decomposition. Such systems often arise when numerically solving partial differential equations.
The algorithm for solving Ax = b, where A is a real, symmetric, positive-definite matrix, is detailed below. The input vector x_0 can be an approximate initial solution or 0.

1. r_0 = b − A x_0
2. p_0 = r_0
3. k = 0
4. while true:
   (a) α_k = (r_k^T r_k) / (p_k^T A p_k)
   (b) x_{k+1} = x_k + α_k p_k
   (c) r_{k+1} = r_k − α_k A p_k
   (d) if ||r_{k+1}|| < ε, break
   (e) β_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
   (f) p_{k+1} = r_{k+1} + β_k p_k
   (g) k = k + 1
5. return x_{k+1}
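To make the later parallelization discussion concrete, a minimal serial C sketch of the algorithm above follows. It uses a dense matrix-vector product for brevity, whereas the actual solvers operate on sparse matrices, and it checks convergence at the top of the loop rather than after the residual update.

    #include <math.h>
    #include <stdlib.h>

    /* Solve Ax = b for a symmetric positive-definite n x n matrix A
       (dense here for brevity) with the conjugate gradient method.
       x holds x_0 on entry and the solution on return. */
    void conjugate_gradient(const double *A, const double *b, double *x,
                            int n, double eps, int max_iter)
    {
        double *r  = malloc(n * sizeof *r);
        double *p  = malloc(n * sizeof *p);
        double *Ap = malloc(n * sizeof *Ap);
        double rr = 0.0;
        for (int i = 0; i < n; i++) {          /* r_0 = b - A*x_0, p_0 = r_0 */
            double s = 0.0;
            for (int j = 0; j < n; j++) s += A[i * n + j] * x[j];
            r[i] = b[i] - s;
            p[i] = r[i];
            rr += r[i] * r[i];
        }
        for (int k = 0; k < max_iter && sqrt(rr) >= eps; k++) {
            double pAp = 0.0;
            for (int i = 0; i < n; i++) {      /* Ap = A*p_k */
                double s = 0.0;
                for (int j = 0; j < n; j++) s += A[i * n + j] * p[j];
                Ap[i] = s;
                pAp += p[i] * Ap[i];
            }
            double alpha = rr / pAp;           /* step length alpha_k */
            double rr_new = 0.0;
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];          /* x_{k+1} = x_k + alpha*p_k */
                r[i] -= alpha * Ap[i];         /* r_{k+1} = r_k - alpha*A*p_k */
                rr_new += r[i] * r[i];
            }
            double beta = rr_new / rr;         /* beta_k */
            for (int i = 0; i < n; i++)
                p[i] = r[i] + beta * p[i];     /* p_{k+1} = r_{k+1} + beta*p_k */
            rr = rr_new;
        }
        free(r); free(p); free(Ap);
    }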

The conjugate gradient method can be written in a small piece of code. For example, the conjugate gradient source files in OpenFOAM, PCG.C and PBiCG.C, are only about 200 lines each with comments, or about 70 lines with comments stripped. This helps make GPU acceleration possible.
The conjugate gradient method can be highly parallelized. Before introducing our GPGPU method step by step, three elementary GPU parallel algorithms are provided. These GPU methods are highly parallelized, so computing any of them is highly efficient on a many-core GPU. For a vector of size n, we can create n threads and run the n kernel instances simultaneously.
The first GPU algorithm is saxby, which computes s = ax + by. This algorithm is simple: it creates a number of threads equal to the number of elements in the vector, and each GPU thread i computes s_i = a x_i + b y_i.
The second GPU algorithm is product, which computes the product of a sparse matrix and a dense vector. The sparse matrix is stored in CRS (Compressed Row Storage) form to reduce the matrix storage (the zeros are not stored in memory) as well as to increase efficiency (reducing the O(n^2) problem to an O(n) one, since the number of non-zeros per row is bounded). Each GPU thread i calculates the i-th element of the resulting vector.
The Compressed Row Storage (CRS) format puts the subsequent non-zeros of the matrix rows in contiguous memory locations. Assuming we have a non-symmetric sparse matrix A, we create three vectors: one for floating-point numbers (val), and the other two for integers (col_ind, row_ptr). The val vector stores the values of the non-zero elements of the matrix A, as they are traversed in a row-wise fashion. The col_ind vector stores the column indexes of the elements in the val vector; that is, if val(k) = a_{i,j} then col_ind(k) = j. The row_ptr vector stores the locations in the val vector that start a row; that is, if val(k) = a_{i,j} then row_ptr(i) ≤ k < row_ptr(i + 1). By convention, we define row_ptr(n + 1) = nnz + 1, where nnz is the number of non-zeros in the matrix A. The storage savings of this approach are significant: instead of storing n^2 elements, we need only 2 nnz + n + 1 storage locations.
The third algorithm is dot_product, which calculates the dot product of two vectors x and y. This sounds simple, but it is the most complicated of the three, because it requires a way to add all the x_i y_i values together simultaneously in order to utilize the parallel computing power. There is extensive research on this topic, and the most mature strategy is the parallel prefix reduction method (Harris). Sketches of all three kernels are given below.
With these three fundamental algorithms, the conjugate gradient method above can be implemented. Step 1 is split into two sub-steps: first compute the product of A and x_0, then apply the saxby routine with a = 1 and b = −1. Step 2 is solely a memory copy. Step 3 does not involve GPU computing. Step 4 (a) uses the product method to compute A p_k, then uses the parallel prefix reduction dot_product to compute both r_k^T r_k and p_k^T (A p_k). Steps 4 (b) and 4 (c) are two saxby procedures. Step 4 (d) involves computing the norm of r_{k+1}; this can be handled by the parallel prefix reduction dot_product, using ||r_{k+1}|| = sqrt(r_{k+1}^T r_{k+1}). Step 4 (e) is similar to step 4 (a), involving two parallel prefix reduction dot_product calls. Step 4 (f) is similar to steps 4 (b) and 4 (c), involving one saxby. Step 4 (g) and step 5 can be done by the CPU, since no parallel computing is required.
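The three building blocks might look like the following OpenCL kernel sketches. These are simplified illustrations rather than the exact code shipped in OpenCLCG.C; the argument names are ours, and we assume double-precision support (cl_khr_fp64), a global work size equal to the zero-padded vector length, and a power-of-two work-group size.

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    /* saxby: s = a*x + b*y, one work-item per vector element. */
    __kernel void saxby(double a, __global const double *x,
                        double b, __global const double *y,
                        __global double *s)
    {
        int i = get_global_id(0);
        s[i] = a * x[i] + b * y[i];
    }

    /* product: sparse matrix-vector product in CRS form; work-item i
       computes row i of the result (0-based row_ptr with n+1 entries). */
    __kernel void product(__global const double *val,
                          __global const int *col_ind,
                          __global const int *row_ptr,
                          __global const double *x,
                          __global double *result)
    {
        int i = get_global_id(0);
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_ind[k]];
        result[i] = sum;
    }

    /* dot_product: each work-group sums its x[i]*y[i] terms in local
       memory by parallel prefix reduction and emits one partial sum;
       the host, or a repeated launch, adds the partial sums. */
    __kernel void dot_product(__global const double *x,
                              __global const double *y,
                              __global double *partial,
                              __local double *scratch)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        scratch[lid] = x[gid] * y[gid];
        barrier(CLK_LOCAL_MEM_FENCE);
        /* Halve the active work-items each step until one sum remains. */
        for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
            if (lid < offset)
                scratch[lid] += scratch[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }

The reduction kernel is launched repeatedly (or its partial sums added on the host) until a single scalar remains.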

Finally, the original PCG.C and PBiCG.C source files were replaced with the newly written, OpenCL-parallelized OpenCLCG.C. The changed code is hosted on Google Code (http://code.google.com/p/freefoam-bin/). The change to the code is massive and took a great amount of human labor, because many technical difficulties were encountered during the modification:

• GPU programming is by nature much harder than traditional programming. Most programming interfaces and instructions are low-level and require knowledge of the hardware to achieve maximum performance, yet few technical details are published by the GPU vendors. There are also few mature tools for runtime debugging and profiling compared to what the traditional CPU has. Since GPU programming is hardware programming by nature, and there is no hardware or software memory protection, it is also easy to crash the entire operating system with even the slightest mistake (such as an out-of-bounds pointer).

• OpenFOAM (and most other mature CFD codes) uses the C++ programming language to abstract higher-level quantities (vectors or tensors) as classes, and uses advanced C++ features (operator overloading, templates, generic libraries) for rapid development. However, none of these features is available in OpenCL. The procedure therefore had to be converted to a clean C implementation, with additional utility functions and glue code to convert between C data structures and C++ classes.

• The converted C program then had to be converted to a GPU version. Each iteration had to be abstracted as a thread, and all the variables had to be manually allocated, deallocated, copied, sent and received, in the proper place and at the proper time. After the conversion, the code expanded almost tenfold, and many of its parts are solely memory-object handling code.

In a nutshell, the original source code, with comments purged, was about 70 lines and has now expanded to more than 600 lines, which include the three kernels described before as well as many important routines for controlling the GPU: creating memory allocations, performing the calculation, and transferring data between GPU and CPU in both directions. Since all solvers and models are directly based on these matrix solvers, our modification makes it possible to accelerate any solver, so all the cases that were supported by the program can be simulated without any problem.
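To give a flavor of the memory-handling code mentioned above, here is a compressed host-side sketch for a single saxby launch, with error handling omitted. The context, queue, and built kernel are assumed to be created elsewhere; in the real solver, buffers would be created once and reused across iterations, since repeated allocation and transfer would dominate the running time.

    #include <CL/cl.h>

    /* Upload x and y, run the saxby kernel, and read back s. */
    void run_saxby(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                   double a, const double *x, double b, const double *y,
                   double *s, size_t n)
    {
        size_t bytes = n * sizeof(double);

        /* Allocate device memory and copy the input vectors to the GPU. */
        cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, (void *)x, NULL);
        cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, (void *)y, NULL);
        cl_mem ds = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(double), &a);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &dx);
        clSetKernelArg(kernel, 2, sizeof(double), &b);
        clSetKernelArg(kernel, 3, sizeof(cl_mem), &dy);
        clSetKernelArg(kernel, 4, sizeof(cl_mem), &ds);

        /* One work-item per vector element. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        /* Copy the result back to the host (blocking read). */
        clEnqueueReadBuffer(queue, ds, CL_TRUE, 0, bytes, s, 0, NULL, NULL);

        clReleaseMemObject(dx); clReleaseMemObject(dy); clReleaseMemObject(ds);
    }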

CASE STUDIES
CAVITY CASE BENCHMARK
This research is benchmarked on an Intel Xeon CPU with a frequency of 3.60 GHz. We used a case with the space divided into a 500×500 grid and let the OpenFOAM program simulate one time step. Different GPU cards were used to test the program.
For the CPU case, this research used the unmodified OpenFOAM solver to simulate the case. The conjugate gradient solver required 81.1 seconds to accomplish the work, while the other procedures required 3.55 seconds. This confirms our previous statement that the conjugate gradient is the bottleneck.
This research then simulated the case using different GPUs with the OpenCLCG.C solver. The GeForce 9400M card takes 32.04 seconds for the conjugate gradient procedure, the GeForce 9800 GTX card takes 8.03 seconds, and the Quadro FX 5800 card takes 2.81 seconds, which is 28.86 times faster than the CPU.
The performance of the cards varies because each card uses different technology and configurations. For example, the Quadro card uses 512-bit GDDR3 memory, and its memory bandwidth is twice that of the GeForce 9800 GTX. The 240 processing cores inside the Quadro FX 5800, compared to 16 in the GeForce 9400M, can run many more threads concurrently. It is expected that with the latest generation of cards, such as Fermi, the performance can be even better.
As demonstrated by the benchmark in Figure 1, instead of taking 81.1 seconds, the conjugate gradient now takes about 2.81 seconds to finish, even faster than the other procedures, which take 3.55 seconds. So the conjugate gradient procedure is no longer the bottleneck.

Figure 1: OpenCL benchmark for a 500×500 regular mesh (timing result for one time step).

    Hardware                  Conjugate Gradient Solver (s)   Others (fvc, fvm, interpolation, disk writing, etc.) (s)
    Intel Xeon CPU 3.60GHz    81.1                            3.55
    GeForce 9400M             32.04                           3.55
    GeForce 9800 GTX           8.03                           3.55
    Quadro FX 5800             2.81                           3.55

PURE CG BENCHMARK
The performance speed-up of the GPU code depends heavily on the problem size. Problems with different sizes of banded matrices (band size 4) were solved by the conjugate gradient solver using the Quadro FX 5800 card, and the execution times in relation to problem size are shown in Figure 2.
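Kernel-only execution times such as those in Figure 2 can be obtained with OpenCL's built-in event profiling. The sketch below is one way to do this, not necessarily the timing method used in this research; it assumes a command queue created with CL_QUEUE_PROFILING_ENABLE.

    #include <CL/cl.h>

    /* Returns the device-side execution time of one kernel launch,
       in seconds. */
    double time_kernel(cl_command_queue queue, cl_kernel kernel, size_t n)
    {
        cl_event ev;
        cl_ulong start, end;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, &ev);
        clWaitForEvents(1, &ev);  /* block until the kernel finishes */
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof start, &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof end, &end, NULL);
        clReleaseEvent(ev);
        return (end - start) * 1e-9;  /* timestamps are in nanoseconds */
    }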

Figure 2: GPU and CPU solving time of the conjugate gradient method (solving time in seconds on a log scale from 1E-04 to 1E+00, versus problem sizes from 64 to 1448).

Clearly, for each single iteration, the execution time of the CPU procedure grows linearly with the size of the linear system. This is because the number of floating-point operations (FLOPs) is a linear function of the matrix and vector sizes. For the GPU, however, the linear relation only holds when the problem size is relatively large (in this case, N > 256, which we call the turning point). For small problem sizes, the GPU execution time is not linear, because when an OpenCL procedure is started, it takes some time to do initialization, kernel compilation, data loading and pushing, etc. For really small problem sizes, the GPU might even be slower than the CPU, while for larger problems these overheads can be neglected. Not until the problem size reaches a certain limit (called the turning point in the following discussion) does the performance grow linearly.
Note that in Zuo's research (Zuo and Chen, 2010) the turning point is 1E5, while in this research it is 256. This is because the two simulation procedures are different. Zuo ran the full FFD on the GPU, while this research performs only the CG on the GPU. FFD has only four equations and takes a short simulation time, but putting the entire program on top of the GPU requires many more memory objects, kernels, and other resources, so the initialization, data transport, etc., take much longer. In this research, the conjugate gradient's resource requirements are more lightweight, but the actual calculation is more intense, so the turning point is lower.
Though not sufficient for real-time dynamic fluid simulation of large buildings, the performance improvement is good enough to enable various lines of research in building simulation. One possible area is the external coupling of CFD and energy simulation. According to (Bouwkunde, 2005), the minimum time step for a coupled simulation of energy simulation and CFD is 2 hours or less. However, in that research, 10 days of spring conditions, instead of an annual simulation (which is important for year-span building analysis), were simulated, simply because even such a short span took their fastest machine 19 hours of running time. With the estimated speed-up presented before, the annual case simulation becomes possible, and its total running time can be reduced to within a couple of hours, comparable to the time used for energy simulation [1], while the traditional CPU-based method would take months.

HOT ROOM CASE
The previous two cases compared the performance of the conjugate gradient alone and of the CFD as a whole, showing a great speed-up over the traditional code implementation. However, the accuracy of the OpenCL code is still open to question. To show that the OpenCL-enhanced version can meet the precision requirements of building simulation studies, in this case we use the original hot room case provided with the official OpenFOAM distribution as the testing example. The performance improvement is similar (25.4×) to the previous experiments, so here the research focuses only on numerical precision.

[1] With about a 30× speed-up and a 2-hour time step, the estimated running time is 11 hours. However, their paper was written in 2005, while this research uses a much more modern CPU. Taking the CPU speed improvement over the past 6 years into consideration, the total simulation time is estimated at several hours.

In this case, there is a room of 10×6×2 meters with a box of 1×1×0.5 meters representing a heat source. The temperature of the walls, ceiling and floor is set to 300 K, and the temperature of the heater to 500 K. The standard k-ε model is used, and a steady case is simulated. The grid size is modified to 400×200×400. Refer to the OpenFOAM manual and case file (OpenCFD Ltd., 2009) for the setup.

Figure 3: Temperature comparison at y = 0.7 × Height and z = 0.5 × Depth (temperature in K, ranging from 310 to 360, versus distance from the left wall over total width; CPU vs. GPU).

After the iteration has completed, the temperature values along the x direction at the intersection line of the plane y = 0.7H and the plane z = 0.5D are extracted from the result files and compared. As can be seen in Figure 3, the CPU version's results and the GPU version's are almost the same. This is because the precision of the GPU used in this study is sufficient to perform 32-bit floating-point scientific calculations. The tiny difference between the CPU and GPU results is related to rounding errors during the numerical evaluations.

CONCLUSION
As can be seen from the first case analysis, the method presented in this paper can increase the performance of CFD computation to a great extent, with much less effort than porting the entire code base. It is expected that this method can be applied to many other CFD codes as well, because most CFD codes use finite methods. According to the second case, the turning point of this method is lower than in previous studies of GPGPU fluid simulation, which means that for almost any building simulation problem, ranging from coarse grids to fine grids, the GPU is always faster.
The most important feature of the method, as can be seen from the third case, is high precision: its precision is at the same level as the CPU calculation. While FFD uses different equations and algorithms, this method follows the same governing equations and numerical methods, which makes it highly accurate and thus applicable to research and engineering work.
It is still possible to port other portions of the source code to OpenCL to eliminate more bottlenecks and thus make the program even faster. It is also possible to use preconditioned methods to further improve the speed of the conjugate gradient. However, these are beyond the scope of this paper.
This paper demonstrated a way to speed up a CFD program without losing any features or functionality.

REFERENCES
AMD. OpenCL and the ATI Stream SDK v2.0. http://developer.amd.com/sdks/AMDAPPSDK/assets/ATI_Stream_SDK_Release_Notes.pdf.
Apple. Technology brief: OpenCL, taking the graphics processor beyond graphics. Apple Developers Connection. http://www.abc.it/assets/files/pdf/Snow_Leop_OpenCL.pdf.
ATI. 2009. ATI Stream Computing: technical overview. http://www.cct.lsu.edu/~scheinin/Parallel/StreamComputingOverview.pdf.
Bouwkunde, F. 2005. External coupling between building energy simulation and computational fluid dynamics. PhD thesis, Technische Universiteit Eindhoven.
Cheng, W. C., Liu, C.-H., and Leung, D. Y. C. 2008. Computational formulation for the evaluation of street canyon ventilation and pollutant removal performance. Atmospheric Environment, 42:9041–9051.
Corrigan, A., Camelli, F., et al. Running unstructured grid based CFD solvers on modern graphics hardware. 19th AIAA Computational Fluid Dynamics Conference, June 22-25, San Antonio, Texas.
Dean and Glowinski. 1993. On some finite element methods for the numerical simulation of incompressible viscous flow. In Incompressible Computational Fluid Dynamics: Trends and Advances, Cambridge University Press, pages 17–65.
Harris, M. Optimizing parallel reduction in CUDA. NVIDIA CUDA SDK. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf.
Harris, M. J. 2003. Real-time cloud simulation and rendering. PhD thesis, University of North Carolina at Chapel Hill.
Helbing, D. 1995. Social force model for pedestrian dynamics. Physical Review E, 51:4282–4286.
Helbing, D. 2000. Simulating dynamical features of escape panic. Nature, 407:487–490.
Khronos Group. OpenCL 1.0 specification. http://www.khronos.org/opencl/.
Microsoft. DirectCompute. http://msdn.microsoft.com/en-us/directx/default.aspx.
Neofytou, P., Venetsanos, A. G., et al. 2006. CFD simulations of the wind environment around an airport terminal building. Environmental Modelling & Software, 21(4):520–524.
NVIDIA. NVIDIA OpenCL programming overview. http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingOverview.pdf.
NVIDIA. 2007. NVIDIA CUDA Compute Unified Device Architecture programming guide (version 1.1). NVIDIA Corporation, Santa Clara, CA. http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
OpenCFD Ltd. 2009. OpenFOAM 1.6 user guide. http://openfoam.org/.
Selvam, R. P. 1996. Computation of flow around Texas Tech building using k-epsilon and Kato-Launder k-epsilon turbulence model. Engineering Structures, 18(11):856–860.
Song, O.-Y., Shin, H., and Ko, H.-S. 2005. Stable but nondissipative water. ACM Transactions on Graphics, 24(1):81–97.
Stam, J. 1999. Stable fluids. Proceedings of the 26th International Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '99, Los Angeles.
Sun, H., Zhao, L., and Zhang, Y. 2007. Evaluating RNG k-epsilon models using PIV data for airflow in animal buildings at different ventilation rates. ASHRAE Transactions, 113:8.
the ADFC team. ADFC Navier-Stokes solver. http://sourceforge.net/projects/adfc/.
Tominaga, Y., Mochida, A., et al. 2002. Comparison of performance of various revised k-epsilon models applied to CFD analysis of the flowfield around a high-rise building. Journal of Architecture, Planning and Environmental Engineering, 556:47–54.
Zuo, W. and Chen, Q. 2007. Validation of fast fluid dynamics for room airflow. Proceedings of the 10th International IBPSA Conference, Building Simulation 2007, Beijing, China.
Zuo, W. and Chen, Q. 2009. Real-time or faster-than-real-time simulation of airflow in buildings. Indoor Air, 19(1):33–44.
Zuo, W. and Chen, Q. 2010. Fast and informative flow simulations in a building by using fast fluid dynamics model on graphics processing unit. Building and Environment, 45:747–757.