
Description

A simple and efficient computational framework is presented for the simulation of variably saturated flow in porous media. In this modeling approach the Cellular Automata (CA) concept is implemented. The computational domain is thus discretized with a regular grid, and simple rules govern the evolution of the physical phenomena. The inherent simplicity of the CA concept and its natural parallelism make the parallel implementation of algorithms very efficient, especially for the simulation of large scale phenomena. This is a very important feature, as it makes it possible to incorporate the CA computational framework into a more general catchment-scale distributed hydrological model for the detailed simulation of the soil water balance, or into other types of models that simulate the dynamics of water pressure heads and of soil saturation, such as models of rainfall-triggered landslides and of solute and contaminant transport in agricultural soils. The CUDA architecture is utilized in order to take advantage of the computational capabilities of modern GPUs. Particular attention was given to the utilization of the different available memory types. Constant and texture memory are extensively used to accelerate memory accesses, while shared memory is used to exploit the locality of thread computations and optimize each block’s memory accesses. The presented model was applied to various test cases and showed good agreement with published results, as well as scalability with increasing thread and block size.

Transcript of: A parallel computational framework for the simulation of variably saturated flow based on the Cellular Automata concept using CUDA architecture


A parallel computational framework for the simulation of variably saturated flow based on the Cellular Automata concept using CUDA architecture.

Paolo Burlando (1), Grigorios G. Anagnostopoulos (1) and Adamos Kyriakou (2)

Institute of Environmental Engineering (1), Computer Vision Lab (2), ETH Zurich, Switzerland. Correspondence: [email protected]

1. Introduction

A simple and efficient computational framework is presented for the simulation of variably saturated flow in porous media. In this modeling approach the Cellular Automata (CA) concept is implemented.

• The CA approach is efficient for the simulation of large scale phenomena.

• The inherent simplicity of the CA concept and its natural parallelism make its implementation within the CUDA framework straightforward.

2. Computational algorithm

According to the macroscopic CA notion, the computational domain consists of a two- or three-dimensional lattice, composed of rectangular or prismatic cells respectively. Every cell of the lattice communicates with its neighbors only through its faces.

[Figure: a central cell (0,0,0) exchanges the face fluxes Q0–Q5 with its six neighbors (1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1) and (0,0,-1).]

Coupling the discrete formulation of the mass balance of an arbitrary cell with the Darcy–Buckingham law, one can compute the head at time t + Δt:

$$
h_c^{t+\Delta t} \;=\; \frac{\displaystyle \sum_{\alpha \in I} \frac{K_{\alpha c}\, A_{\alpha c}}{l_{\alpha c}}\, h_\alpha^{t} \;+\; \frac{V_c\, \sigma(\psi_c)}{\Delta t}\, h_c^{t} \;+\; \sum_{\alpha \in I_0} Q_{\alpha}^{bound} \;+\; S_c}{\displaystyle \sum_{\alpha \in I} \frac{K_{\alpha c}\, A_{\alpha c}}{l_{\alpha c}} \;+\; \frac{V_c\, \sigma(\psi_c)}{\Delta t}}
$$

The above equation is applied in all the cells of the lattice, except those with a Dirichlet boundary condition, whose hydraulic head is fixed throughout the simulation.
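To make the update rule concrete, the following CUDA kernel is a minimal sketch of how this explicit head update could be evaluated, one thread per cell. It assumes the 1D cell array and per-cell neighbor index list described in Section 5, with the face conductance terms K_{αc}A_{αc}/l_{αc} precomputed per face; all names (updateHead, nbr, cond, etc.) are hypothetical and do not come from the authors' code.

```cuda
// Sketch of the explicit head update for one iteration (hypothetical names,
// not the authors' implementation). Each thread updates one cell.
__global__ void updateHead(const float* __restrict__ hOld,    // h^t per cell
                           float*       __restrict__ hNew,    // h^(t+dt) per cell
                           const int*   __restrict__ nbr,     // 6 neighbor indices per cell (-1 = none)
                           const float* __restrict__ cond,    // K_ac * A_ac / l_ac per face
                           const float* __restrict__ storage, // V_c * sigma(psi_c) / dt per cell
                           const float* __restrict__ qBound,  // summed boundary fluxes per cell
                           const float* __restrict__ src,     // source/sink term S_c per cell
                           const char*  __restrict__ isDirichlet,
                           int nCells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nCells) return;

    if (isDirichlet[c]) {            // Dirichlet cells keep their prescribed head
        hNew[c] = hOld[c];
        return;
    }

    float num = storage[c] * hOld[c] + qBound[c] + src[c];
    float den = storage[c];

    // Accumulate the contributions of the (up to) six face neighbors
    for (int f = 0; f < 6; ++f) {
        int n = nbr[6 * c + f];
        if (n < 0) continue;          // face on the domain boundary
        float g = cond[6 * c + f];
        num += g * hOld[n];
        den += g;
    }
    hNew[c] = num / den;
}
```

Keeping two head arrays and swapping them between iterations (double buffering) avoids read/write races between neighboring threads.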

3. Verification of the algorithm

The presented algorithm was tested against known benchmark cases available from the literature, in order to evaluate its performance. These include experimental data, analytical solutions and numerical experiments (Anagnostopoulos and Burlando, 2011).

An example is the infiltration experiment of Vauclin et al (1979), which is used to evaluate the ability of the model to simulate the transient position of the water table in a laboratory-scale soil box.

[Figure: simulated water table position (water depth in m versus distance in m) at t = 2, 3, 4 and 8 hrs, compared with the experimental data of Vauclin et al (1979).]

4. CUDA Architecture

CUDA is a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.

[Figure: CUDA hardware model (Figure 4-2 of the CUDA Programming Guide): a set of SIMT multiprocessors with on-chip shared memory, each with an instruction unit, per-processor registers, constant and texture caches, and access to device memory.]

CUDA comes with a software environment that allows developers to use C as a high-level programming language.

5. Implementation and performance

The most challenging issue is that the domain can have an irregular geometry, which makes it harder to exploit locality in the thread computations and to use the shared memory.

For the runs we used an NVIDIA Quadro 2000 graphics card with 192 CUDA cores, installed in a PC with an Intel Xeon processor at 2.93 GHz. The benchmark case of Vauclin et al (1979) was used to assess the performance of the code for grid dimensions of increasing size (scale effect).

Parallelization strategy:

• The cell values are stored in a 1D array, and for each cell the indices of its neighboring cells are also stored. Both of these arrays reside in global memory.

• Simulation constants are stored in the constant memory.

• Soil properties for each soil class are stored in the texture memory.

• Atomic operations are used in order to check for convergence at every iteration.

• The shared memory is used to accelerate the atomic operations and the block’s memory accesses (a sketch of this convergence check is given after this list).
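As a possible illustration of the last two points, the sketch below reduces each block's head changes in shared memory and issues only one atomic operation per block to raise a global "not converged" flag. The kernel name, the flag-based scheme and the assumption of a power-of-two block size are illustrative choices, not the authors' implementation.

```cuda
// Sketch of a per-iteration convergence check (hypothetical names): each block
// reduces its threads' results in shared memory, then one thread per block
// raises a global flag atomically if any head change exceeds the tolerance.
__global__ void checkConvergence(const float* __restrict__ hOld,
                                 const float* __restrict__ hNew,
                                 int nCells, float tol,
                                 int* notConverged)           // global flag, zeroed before launch
{
    extern __shared__ int sFlag[];                            // one int per thread
    int tid = threadIdx.x;
    int c   = blockIdx.x * blockDim.x + tid;

    // 1 if this thread's cell has not yet converged, 0 otherwise
    sFlag[tid] = (c < nCells && fabsf(hNew[c] - hOld[c]) > tol) ? 1 : 0;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sFlag[tid] |= sFlag[tid + s];
        __syncthreads();
    }

    // One atomic per block instead of one per cell
    if (tid == 0 && sFlag[0]) atomicOr(notConverged, 1);
}
```

The host zeroes notConverged before each iteration, launches the kernel with blockDim.x * sizeof(int) bytes of dynamic shared memory, copies the flag back, and stops the iterative loop once it remains zero.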

Results and conclusions:

1"

10"

100"

1000"

10000"

100000"

1000" 10000" 100000" 1000000" 10000000"Speed%(%cells/sec%)%

Number%of%Cells%

CPU" GPU"

0"

10"

20"

30"

40"

50"

60"

70"

80"

90"

1000" 10000" 100000" 1000000" 10000000"

Speed%Up%Factor%(.)%

Number%of%Cells%

• The speed-up factor increases with the grid dimension. As the domain size increases, more of the GPU's computational resources are exploited.

• Our framework is very attractive for basin-scale simulations (e.g. in natural hazard assessment), where the grid sizes can become extremely large.

References

[1] G.G. Anagnostopoulos, P. Burlando (2011). Object-oriented computational framework for the simulation of variably saturated flow, using a reduced complexity model. Submitted to Environmental Modelling & Software.

[2] M. Vauclin, D. Khanji, G. Vachaud (1979). Experimental and numerical study of a transient, two-dimensional unsaturated-saturated water recharge problem. Water Resources Research, Vol. 15.

[3] NVIDIA (2010). CUDA Programming Guide, Version 3.0. Available: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf

Abstract No: IN13A-1317