A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries
SC13, November 21st 2013
Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler, Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany


Page 1:

A Framework for Hybrid Parallel Flow Simulations

with a Trillion Cells in Complex Geometries

SC13, November 21st 2013

Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler, Ulrich Rüde

Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Page 2:

Outline

• waLBerla Framework
• Lattice Boltzmann Method
• Benchmarked Test Cases
• Benchmark Results
• Conclusion & Future Work

SC13, Denver Christian Godenschwager November 21st 2013

Page 3:

The waLBerla Framework

Page 4:

waLBerla – an HPC Framework

• Focus on lattice Boltzmann method
• Written in C++
• Contains hand-crafted, machine-specific high-performance compute kernels
• Also generic, easily adaptable compute kernels for prototyping
• Modules for handling complex geometries
• Particulate flow simulations by coupling with our physics engine pe
• Models for multiphase and free surface flows

Page 5:

waLBerla – an HPC Framework

• Hybridly parallelized (MPI + OpenMP)
• No data structures growing with the number of processes involved
• Scales from laptop to recent petascale machines
• Parallel output
• Portable across compilers and operating systems (e.g., llvm/clang)
• Automated tests / CI servers
• Open-source release in early 2014

Page 6:

Examples

Study of the hemodynamic impact of stenoses in coronary arteries.

Turbulent flow (Re = 11,000) around a sphere (Ehsan Fattahi, Daniel Weingaertner)

Page 7:

Examples

Liquid-gas-solid flow simulation: stable floating positions of box-shaped particles (Simon Bogner)

Constructing a hollow cylinder by electron beam melting (Matthias Markl, Regina Ammer)

Rigid bodies simulated with pe

Page 8:

Lattice Boltzmann Method (LBM)

Page 9:

Lattice Boltzmann Method

• Explicit, mesoscopic method for solving fluid flow problems (or heat transport and arbitrary advection-diffusion equations)
• Discretization of the Boltzmann equation
• Provides a solution to the Navier-Stokes equations at low Mach numbers
• Based on a uniformly structured, Cartesian grid of cells

Page 10:

Lattice Boltzmann Method

Lattice Boltzmann equation (single relaxation time, SRT):

  f_i(x + e_i δt, t + δt) = f_i(x, t) − [f_i(x, t) − f_i^eq(u(x, t), ρ(x, t))] / τ

Equilibrium distribution function:

  f_i^eq(u, ρ) = ω_i ρ [1 + (e_i · u)/c_s² + (e_i · u)²/(2c_s⁴) − u²/(2c_s²)]

Macroscopic quantities (density, momentum density):

  ρ = Σ_i f_i        ρu = Σ_i e_i f_i

Lattice Boltzmann equation (two relaxation time, TRT):

  f_i(x + e_i δt, t + δt) = f_i(x, t) − [f_i⁺(x, t) − f_i^eq,⁺(u(x, t), ρ(x, t))] / λ_0
                                      − [f_i⁻(x, t) − f_i^eq,⁻(u(x, t), ρ(x, t))] / λ_1

The TRT model can improve the accuracy and stability of the LBM.
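The SRT update above can be sketched in a few lines of NumPy. This is an illustrative toy kernel on a D2Q9 lattice with periodic boundaries (lattice units with c_s² = 1/3, so the equilibrium coefficients become 3, 9/2 and 3/2), not waLBerla's optimized implementation; grid size and relaxation time are arbitrary choices for the example:

```python
import numpy as np

# D2Q9 lattice velocities e_i and weights w_i (lattice units, c_s^2 = 1/3)
E = np.array([[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1],
              [1, 1], [-1, -1], [1, -1], [-1, 1]])
W = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, u):
    """f_i^eq = w_i * rho * (1 + e.u/cs2 + (e.u)^2/(2 cs4) - u^2/(2 cs2))."""
    eu = np.einsum('id,xyd->ixy', E, u)      # e_i . u in every cell
    u2 = np.einsum('xyd,xyd->xy', u, u)      # u . u in every cell
    return W[:, None, None] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*u2)

def srt_step(f, tau):
    """One SRT lattice Boltzmann step: collide, then stream (periodic)."""
    rho = f.sum(axis=0)                                    # density
    u = np.einsum('id,ixy->xyd', E, f) / rho[..., None]    # velocity
    f = f - (f - equilibrium(rho, u)) / tau                # collision
    for i, (ex, ey) in enumerate(E):                       # streaming
        f[i] = np.roll(np.roll(f[i], ex, axis=0), ey, axis=1)
    return f

# a small density perturbation relaxing on an 8x8 periodic grid
rho0 = 1.0 + 0.01 * np.sin(2*np.pi*np.arange(8)/8)[:, None] * np.ones((8, 8))
f = equilibrium(rho0, np.zeros((8, 8, 2)))
for _ in range(20):
    f = srt_step(f, tau=0.8)
```

Because the collision relaxes toward an equilibrium built from the same density and momentum, both quantities are conserved exactly; this is a quick sanity check for any LBM kernel.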

Page 11:

LBM computationally

Streaming step
Collision step (illustrated on the D2Q9 lattice)

Page 12:

LBM computationally

Streaming step / collision step for D3Q19:
• 19 loads
• 198 flops (TRT)
• 19 stores (+19 loads)
• 305 bytes per cell update

Page 13:

LBM Data Structures

• Domain partitioning into blocks, each containing a uniform grid of cells (uniform block decomposition)
• Ghost layer (halo) exchange of the outer cell layer(s)
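The ghost-layer mechanism can be illustrated with a serial toy version. Below is a sketch for a 1D chain of blocks with one halo cell per side; NumPy arrays stand in for per-block grids, and the function name is an illustrative invention (the framework itself does this with MPI messages between processes):

```python
import numpy as np

def exchange_ghost_layers(blocks):
    """Copy each block's outermost interior layer into the neighbouring
    block's ghost (halo) layer. 1D chain of blocks, one ghost cell per side."""
    for left, right in zip(blocks, blocks[1:]):
        right[0] = left[-2]   # left block's last interior cell -> right block's halo
        left[-1] = right[1]   # right block's first interior cell -> left block's halo

# two blocks with 4 interior cells each (index 0 and -1 are ghost cells)
a = np.array([0.0, 1, 2, 3, 4, 0])
b = np.array([0.0, 5, 6, 7, 8, 0])
exchange_ghost_layers([a, b])
# a's right ghost cell now holds 5.0, b's left ghost cell now holds 4.0
```

After the exchange, each block can run its compute kernel purely locally: the stream step reads neighbour data only from the halo.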

Page 14:

Benchmarked Testcases

Page 15:

Testcases

Lid Driven Cavity (LDC) flow
● Dense
● One block per process
● No load balancing

Flow through coronary arteries
● Sparse, but coherent
● Volume fraction 0.3%
● Multiple blocks per process
● Load balancing required

Page 16:

Complex Geometry Initialization

• Complex geometry given by a surface
• Add regular block partitioning
• Discard empty blocks
• Allocate block data
• Load balancing

Page 17:

Complex Geometry Initialization

• Complex geometry given by a surface
• Add regular block partitioning
• Discard empty blocks
• Allocate block data
• Load balancing

File size for 500,000 blocks: ~40 MB

⇒ Separate the domain partitioning from the simulation phase
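The initialization pipeline can be sketched end-to-end in a toy version. The boolean `inside` voxel mask, the block size, and the round-robin assignment are illustrative stand-ins for waLBerla's actual geometry handling and load balancer:

```python
import numpy as np
from itertools import product

def initialize_blocks(inside, block_size, num_processes):
    """Regular block partition of a voxelized geometry, discard blocks
    that contain no fluid cells, then assign the survivors to processes
    (round-robin as a stand-in for a real load balancer)."""
    kept = []
    for origin in product(*(range(0, n, block_size) for n in inside.shape)):
        region = tuple(slice(o, o + block_size) for o in origin)
        if inside[region].any():          # discard empty blocks
            kept.append(origin)
    assignment = {blk: i % num_processes for i, blk in enumerate(kept)}
    return kept, assignment

# toy geometry: a thin diagonal channel through an 8^3 domain
inside = np.zeros((8, 8, 8), dtype=bool)
for i in range(8):
    inside[i, i, i] = True

blocks, owner = initialize_blocks(inside, block_size=4, num_processes=2)
# only the 2 of 8 candidate blocks touched by the channel survive
```

Discarding empty blocks is what makes the sparse artery case feasible: at a 0.3% fluid fraction, almost all candidate blocks never allocate any cell data.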

Page 18:

Domain Partitioning

block size → #blocks (dx = 0.2 mm, target: ≤ 200 blocks)

14³ → 649
18³ → 413
23³ → 277
29³ → 201
30³ → 190
31³ → 184
33³ → 154
37³ → 149

Page 19:

Coronary Artery Testcase Initialization

Domain partitioning of the coronary tree dataset, one block per process:

512 processes → 485 blocks
458,752 processes → 458,184 blocks

Page 20:

Hardware

                 JUQUEEN                            SuperMUC
Site:            Forschungszentrum Jülich, Germany  LRZ, Garching (Munich), Germany
System:          IBM Blue Gene/Q                    IBM, Intel Sandy Bridge-EP
Nodes:           28,672                             9,216
Cores:           458,752                            147,456
Peak:            5.9 Petaflops                      3.2 Petaflops
Main memory:     448 TB                             288 TB
Network:         5D torus                           non-blocking tree / 4:1 pruned tree

Page 21:

Benchmark Results

Lid Driven Cavity

Page 22:

• SuperMUC - single socket

[Chart: SuperMUC - LDC - Weak; MLUP/s vs. cores (1-8). Series SRT1: a naive, straightforward implementation, already quite optimized.]

Page 23:

• SuperMUC - single socket

[Chart build-up: same weak-scaling plot with the SRT2 kernel added (series: SRT2, SRT1).]

Page 24:

• SuperMUC - single socket

[Chart build-up: the SRT kernel added (series: SRT, SRT2, SRT1).]

Page 25:

• SuperMUC - single socket

[Chart build-up: the TRT kernel added (series: SRT, TRT, SRT2, SRT1).]

Page 26:

• SuperMUC - single socket ⇒ limited by memory bandwidth

[Chart: same plot (SRT, TRT, SRT2, SRT1) with the memory bandwidth limit drawn in.]

Page 27:

• JUQUEEN - single node ⇒ limited by memory bandwidth

[Chart: JUQUEEN - LDC - Weak; MLUP/s vs. cores (1-16); series: SRT, TRT; hybrid version (4 threads per core); bandwidth limit drawn in.]

Page 28:

• SuperMUC - TRT kernel

[Chart: SuperMUC - LDC - Weak; MLUP/s per core vs. cores (32 to 131,072); series: 16P 1T, 4P 4T, 2P 8T (#processes per node, #threads per process).]

Page 29:

• SuperMUC - TRT kernel

[Chart: same plot, annotated where the job starts to span 2 islands.]

Page 30:

• SuperMUC - TRT kernel

[Chart: SuperMUC - LDC - Weak; MLUP/s per core (left axis) and communication share in % (right axis, 0-60) vs. cores (32 to 131,072); series: 16P 1T, 4P 4T, 2P 8T, Comm; '2 islands' annotated.]

Page 31:

• JUQUEEN - TRT kernel

[Chart: JUQUEEN - LDC - weak; MLUP/s per core vs. cores (32 to 524,288); series: 64P 1T, 16P 4T, 8P 8T (#processes per node, #threads per process).]

1.93 × 10¹² cells updated per second (19 values per cell)
⇒ 383 TFlop/s (6.5% of peak)
⇒ 800 TB/s (67% of peak)
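The quoted floating-point rate is consistent with the per-cell operation count from the D3Q19/TRT slide (198 flops per cell update); a quick back-of-the-envelope check:

```python
cell_updates_per_s = 1.93e12   # full-machine JUQUEEN rate from this slide
flops_per_update = 198         # TRT kernel flop count (D3Q19 slide)

tflops = cell_updates_per_s * flops_per_update / 1e12
# ~382 TFlop/s, matching the quoted 383 TFlop/s (6.5% of 5.9 Pflops peak)
```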

Page 32:

Benchmark Results

Coronary Artery Tree

Page 33:

• JUQUEEN - TRT kernel

[Chart: JUQUEEN - COR - weak; MFLUP/s per core (left axis, 0-3) and fluid fraction (right axis, 0-1) vs. cores (512 to 524,288); series: Efficiency, Fluid Fraction.]

1.03 trillion load-balanced lattice cells, dx = 1.3 μm

Page 34:

• JUQUEEN - TRT kernel - dx = 0.05 mm

[Chart: JUQUEEN - COR - strong; time steps/s (left axis, 0-1000) and MFLUP/s per core (right axis, 0-1) vs. cores (512 to 524,288); series: Efficiency, Performance.]

Page 35:

• SuperMUC - TRT kernel - dx = 0.1 mm

[Chart: SuperMUC - COR - strong; time steps/s (left axis, 0-7000) and MFLUP/s per core (right axis, 0-1.8) vs. cores (32 to 32,768); series: Efficiency, Performance.]

Page 36:

Conclusion & Future Work

Page 37:

Conclusion & Future Work

• waLBerla runs efficiently on current petascale supercomputers
• Excellent scaling properties
• Execution rates of up to 6638 LBM time steps per second in strong scaling settings
• Discretization of a coronary artery tree into 1,033,660,569,847 load-balanced lattice cells
• Future: grid refinement and dynamic load balancing
• Useful for particulate flows with fully resolved particles

Page 38:

Thank you!