LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSORfdubois/organisation/05dec08/... · LBM...

21
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR CETHIL UMR5008 GPU COMPUTING PROCESSOR Frédéric Kuznik , [email protected] 1 LBE Workshop -15/05/2008 - CEMAGREF Antony

Transcript of LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSORfdubois/organisation/05dec08/... · LBM...

LBM BASED FLOW SIMULATION USING

GPU COMPUTING PROCESSOR

CETHILUMR5008

GPU COMPUTING PROCESSOR

Frédéric Kuznik, [email protected]

11LBE Workshop - 15/05/2008 - CEMAGREF Antony

Framework

Introduction

Hardware architecture

CUDA overviewCUDA overview

Implementation details

A simple case: the lid driven cavity problem

22LBE Workshop - 15/05/2008 - CEMAGREF Antony

Presentation of my activities

CETHIL: Thermal Sciences Center of Lyon

Current work: simulation of fluid flows with

heat transfer

CETHILUMR5008

33

Use/development of Thermal LBM1,2 LBM computation using GPU3

Objective: coupling efficiently the two approaches

1 A double population lattice Boltzman method with non-uniform mesh for the simulation of natural convection in a

square cavity, Int. J. Heat and Fluid Flow, vol. 28(5), pp. 862-870, 2007

2 Numerical prediction of natural convection occuring in building components: a double population lattice Boltzmann

method, Numerical Heat Transfer A, vol. 52(4), pp. 315-335, 2007

3 LBM based flow simulation using GPU computing processor, Computers & Mathematics with Applications, under

reviewLBE Workshop - 15/05/2008 - CEMAGREF Antony

Introduction

Due to the market (realtime and

high definition 3-D graphics), GPU

has evolved into highly parrallel,

multithreaded, multi-core

44

(from CUDA Programming Guide 06/07/2008)

GFLOPS=109 floating point operations per second

multithreaded, multi-core

processor.

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Introduction

The bandwidth of GPU has also

evolved for faster graphics

purpose.

55

(from CUDA Programming Guide 06/07/2008)

purpose.

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Previous works using GPU & LBM

GPU Cluster for high performance computing (Fan et al.

2004): 32 nodes cluster of GeForce 5800 ultra

LB-Stream Computing or over 1 Billion Lattice Updates per

second on a single PC (J. Tölke ICMMES 2007): GeForce 8800

ultra

66

ultra

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Hardware

Commercial graphics card (can beincluded in a desktop PC)

240 processors running at1.35GHz 1.35GHz

1 Go DDR3 with 141.7GB/s bandwidth

Theoretical peak at 1000GFLOPS

Price around 500€

77

nVIDIA GeForce GTX 280

LBE Workshop - 15/05/2008 - CEMAGREF Antony

GPU Hardware Architecture

◊ 15 texture processors (TPC)

◊ 1 TPC is composed of 2

streaming multiprocessors (SM)

◊ 1 SM is composed of 8

88

◊ 1 SM is composed of 8

streaming processors (SP) and 2

special functions units

◊ Each SM has 16kB of shared

memory

LBE Workshop - 15/05/2008 - CEMAGREF Antony

GPU Programming InterfaceNVIDIA® CUDATM C language programming environment

(version 2.0)

Definitions:

Device=GPU

Host=CPU

Kernel=function that is called from the host and runs on

99

Kernel=function that is called from the host and runs onthe device

A cuda kernel is exucuted by an array of threads

LBE Workshop - 15/05/2008 - CEMAGREF Antony

GPU Programming Interface

A kernel is executed by a gridof thread block

1 thread block is executed by 1multiprocessor

A thread block contains a

1010

A thread block contains amaximum of 1024 threads

Threads are executed byprocessors within a singlemultiprocessor

=> Threads from differentblocks cannot cooperate

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Kernel memory access

Registers

Shared memory: on-chip (fast),

small (16kB)

Global memory: off-chip, large

1111

Global memory: off-chip, large

The host can read or write

global memory only.

LBE Workshop - 15/05/2008 - CEMAGREF Antony

LBM ALGORITHM

Step 1: Collision (Local – 70% computational time)

Step 2: Propagation (Neighbors – 28% computational time)

1212

Step 2: Propagation (Neighbors – 28% computational time)

+ Boundary conditions

(from BERNSDORF J., How to make my LB-code faster – software planning,

implementation and performance tuning, ICMMES’08, Netherlands)

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Implementation details

Thread block

1313

Grid of blocksLattice grid

1- Decomposition of the fluid domain into a lattice grid

2- Indexing the lattice nodes using thread ID

3- Decomposition of the CUDA domain into thread blocks

4- Execution of the kernel by the thread blocks

Translation of lattice grid

into CUDA topology

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Pseudo-code

Combine collision and propagation steps:

for each thread block

for each thread

load fi in shared memory

1414

i

compute collision step

do the propagation step

end

end

Exchange informations across boundaries

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Application: 2D lid driven cavity

Problem description u=U0, v=0

u=0, v=0u=0, v=0

D2Q9 MRT model of d’Humières (1992)

1515

y

x

u=0, v=0u=0, v=0

u=0, v=0

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Results Re=1000

Reference data from Erturk et al. 2005:

0.1

0.2

0.3

0.7

0.8

0.9

1

1616

xv

velo

city

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.5

-0.4

-0.3

-0.2

-0.1

0

0.1

data from Erturk et al.LBM

u velocity

y

-0.25 0 0.25 0.5 0.75 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

data from Erturk et al.LBM

Vertical velocity profil for x=0.5 Horizontal velocity profil for y=0.5

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Performances

Mesh

grid size

Number of Threads

16 32 64 128 256

256² 71.0 71.0 71.1 71.1 70.9

512² 284.2 282.7 284.3 284.3 283.7

1717

512² 284.2 282.7 284.3 284.3 283.7

1024² 346.4 690.2 953.0 997.5 1007.9

2048² 240.0 583.0 803.3 961.3 1001.5

3072² 289.3 572.4 845.2 941.2 974.2

Performances in MLUPS

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Performances: CPU comparisons

80

100

120

140

160

180

200

256²

512²

1024²

GP

U/C

PU

1818

0

20

40

60

80

0 128 256

1024²

2048²

3072²

Number of Threads per block

GP

U/C

PU

CPU = Pentium IV, 3.0 GHz

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Conclusions

The GPU allows a gain of about 180 for the calculation time

Multiple GPU server exists: NVIDIA TESLA S1070 composed of

1919

4 GTX 280 (theoretical gain 720)

The evolution of GPU is not finished !

But using double precision floating point, the gain falls to 20 !

LBE Workshop - 15/05/2008 - CEMAGREF Antony

Outlooks

A workgroup concerning LBM and GPU

1 master student with Prof. Bernard TOURANCHEAU (ENS Lyon –

INRIA) and a PhD student in 2009

1 master student with Prof. Eric Favier (ENISE – DIPI)

2020

A workgroup concerning Thermal LBM …?

Everybody is welcome to join the workgroup !

x

y

0 0.5 10

0.5

1

x

y

0 0.5 10

0.5

1

x

y

0 0.5 10

0.5

1

Ra=103 Ra=104

Ra=105 Ra=106x

y

0 0.5 10

0.5

1

LBE Workshop - 15/05/2008 - CEMAGREF Antony

THANK YOU FOR YOUR ATTENTION

2121

THANK YOU FOR YOUR ATTENTION

[email protected]

LBE Workshop - 15/05/2008 - CEMAGREF Antony