Peta-Scale Simulations with the HPC Software Framework waLBerla: Massively Parallel AMR for the Lattice Boltzmann Method
SIAM PP 2016, Paris
April 15, 2016
Florian Schornbaum, Christian Godenschwager, Martin Bauer, Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
• Introduction
  • The waLBerla Simulation Framework
  • An Example Using the Lattice Boltzmann Method
• Parallelization Concepts
  • Domain Partitioning & Data Handling
• Dynamic Domain Repartitioning
  • AMR Challenges
  • Distributed Repartitioning Procedure
  • Dynamic Load Balancing
  • Benchmarks / Performance Evaluation
• Conclusion
Introduction
• The waLBerla Simulation Framework
• An Example Using the Lattice Boltzmann Method
Introduction
• waLBerla (widely applicable Lattice Boltzmann framework from Erlangen):
  • main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM); now also implementations of other methods, e.g., phase field
  • at its very core designed as an HPC software framework:
    • scales from laptops to current petascale supercomputers
    • largest simulation: 1,835,008 processes (IBM Blue Gene/Q @ Jülich)
    • hybrid parallelization: MPI + OpenMP
    • vectorization of compute kernels
  • written in C++(11), growing Python interface
  • support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, llvm/clang, IBM XL)
  • automated build and test system
Introduction
• AMR for the LBM – example (vocal fold phantom geometry):
  • DNS (direct numerical simulation)
  • Reynolds number: 2500 / D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells / 1 ↔ 5 levels
  • without refinement: 311 times more memory and 701 times the workload
Parallelization Concepts
• Domain Partitioning & Data Handling
Parallelization Concepts
• domain partitioning into blocks: the simulation domain occupies only part of the block-partitioned space, so empty blocks are discarded
• static block-level refinement: octree partitioning within every block of the initial partitioning (→ forest of octrees)
Parallelization Concepts
• setup pipeline: static block-level refinement (→ forest of octrees) → static load balancing → allocation of block data (→ grids)
• load balancing can be based on either space-filling curves (Morton or Hilbert order) using the underlying forest of octrees, or on graph partitioning (METIS, …) – whatever fits the needs of the simulation best (see the sketch below)
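To make the space-filling-curve option concrete, here is a minimal C++ sketch of Morton-order balancing: interleave the bits of each block's coordinates, sort the blocks along the resulting curve, and cut the curve into chunks of roughly equal size. All names are our own illustration, not waLBerla's actual interface.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Spread the lower 21 bits of v so that two zero bits sit between
// consecutive bits (standard 3D Morton "magic bits" sequence).
static std::uint64_t spreadBits(std::uint64_t v) {
    v &= 0x1fffff; // keep 21 bits per coordinate
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

std::uint64_t mortonCode(std::uint64_t x, std::uint64_t y, std::uint64_t z) {
    return spreadBits(x) | (spreadBits(y) << 1) | (spreadBits(z) << 2);
}

struct Block { std::uint64_t x, y, z; int targetRank; };

// Sort the blocks along the Morton curve, then cut the curve into
// numProcesses contiguous chunks of (roughly) equal block count.
void balanceAlongCurve(std::vector<Block>& blocks, int numProcesses) {
    std::sort(blocks.begin(), blocks.end(),
              [](const Block& a, const Block& b) {
                  return mortonCode(a.x, a.y, a.z) < mortonCode(b.x, b.y, b.z);
              });
    for (std::size_t i = 0; i < blocks.size(); ++i)
        blocks[i].targetRank =
            static_cast<int>(i * numProcesses / blocks.size());
}
```

Hilbert order works the same way with a different, more locality-friendly curve; per-block weights can replace the equal-count cut.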
Parallelization Concepts
• separation of domain partitioning from simulation (optional): the result of the setup pipeline (static block-level refinement → static load balancing → allocation of block data) can be written to disk and read back via compact (KiB/MiB) binary MPI IO
• data & data structure are stored perfectly distributed → no replication of (meta) data!
Parallelization Concepts
• all parts of the setup pipeline are customizable via callback functions in order to adapt to the underlying simulation (a sketch of such a callback interface follows below):
  1) discarding of blocks
  2) (iterative) refinement of blocks
  3) load balancing
  4) block data allocation†
† support for an arbitrary number of block data items (each of arbitrary type)
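As a hedged illustration of this callback design (hypothetical names and signatures, not the framework's real API), the four customization points could be wired up like this:

```cpp
#include <functional>
#include <vector>

// Minimal stand-in for one block of the domain partitioning.
struct Block {
    int level = 0;         // refinement level
    double workload = 1.0; // weight used by the load balancer
    int targetRank = 0;    // process the block is assigned to
};

// Hypothetical setup object: every stage of the initialization pipeline
// is a user-supplied callback, mirroring the four customization points.
struct PartitioningSetup {
    // 1) return true if the block lies outside the simulation domain
    std::function<bool(const Block&)> discardBlock;
    // 2) return true if the block should be split once more (called iteratively)
    std::function<bool(const Block&)> refineBlock;
    // 3) assign a target process rank to every block
    std::function<void(std::vector<Block>&, int /*numProcesses*/)> balance;
    // 4) allocate the per-block data items (grids, ...)
    std::function<void(Block&)> allocateData;
};

// Example: discard blocks that do not intersect the geometry
// (hypothetical helper 'intersectsGeometry'):
//   setup.discardBlock = [](const Block& b) { return !intersectsGeometry(b); };
```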
Parallelization Concepts
• different “views” on / representations of the domain partitioning:
  • forest of octrees: octrees are not explicitly stored, but implicitly defined via block IDs (a sketch of such an ID encoding follows below)
  • 2:1 balanced grid (used for the LBM on refined grids)
  • distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
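A minimal sketch of how block IDs can define the octrees implicitly (our own simplified encoding, not necessarily waLBerla's exact bit layout): one marker bit plus three branch bits per level, so parent/child navigation is pure bit arithmetic and no tree nodes ever need to be stored.

```cpp
#include <cstdint>

// A block ID as a bit string: one leading 1-bit marker followed by three
// bits (the octant index, 0..7) per refinement level. A 64-bit ID can
// thus address 21 levels.
using BlockId = std::uint64_t;

constexpr BlockId rootId() { return 1u; }                  // marker bit only

constexpr BlockId childId(BlockId parent, unsigned octant) {
    return (parent << 3) | (octant & 7u);                  // descend one level
}

constexpr BlockId parentId(BlockId id) { return id >> 3; } // ascend one level

inline unsigned level(BlockId id) {
    unsigned l = 0;
    while (id > 1u) { id >>= 3; ++l; }                     // strip 3 bits/level
    return l;
}
```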
our parallel implementation [1] of local grid refinement for the LBM based on [2] shows excellent performance:
• simulations with in total close to one trillion cells
• close to one trillion cell updates per second (with 1.8 million threads)
• strong scaling: more than 1000 time steps / sec. → 1 ms per time step
[1] F. Schornbaum and U. Rüde, “Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-Uniform Grids,” SIAM Journal on Scientific Computing (accepted for publication), http://arxiv.org/abs/1508.07982
[2] M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker, “A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes,” International Journal for Numerical Methods in Fluids
Dynamic Domain Repartitioning
• AMR Challenges
• Distributed Repartitioning Procedure
• Dynamic Load Balancing
• Benchmarks / Performance Evaluation
AMR Challenges
• challenges because of the block-structured partitioning:
  • only entire blocks split/merge (only few blocks per process)
    ⇒ sudden increase/decrease of memory consumption by a factor of 8 (in 3D) (→ octree partitioning & same number of cells for every block)
    ⇒ “split first, balance afterwards” probably won’t work
  • for the LBM, all levels must be load-balanced separately
⇒ for good scalability, the entire pipeline should rely on perfectly distributed algorithms and data structures → no replication of (meta) data of any sort!
Dynamic Domain Repartitioning
1) split/merge decision: callback function to determine which blocks must split and which blocks may merge
2) skeleton data structure creation: lightweight blocks (few KiB) with no actual data; 2:1 balance is automatically preserved, forcing splits where necessary (see the sketch below)
[figure: split/merge example; different colors (green/blue) illustrate the process assignment]
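A simplified, serial sketch of the forced-split rule (our illustration; the real skeleton structure is distributed): a split is propagated to any neighbor that would otherwise end up more than one level coarser.

```cpp
#include <queue>
#include <vector>

// Lightweight stand-in for a skeleton block (no simulation data attached).
struct SkeletonBlock {
    int level = 0;
    bool markedForSplit = false;
    std::vector<SkeletonBlock*> neighbors;
};

// Propagate forced splits until no block has a neighbor that is more than
// one level finer: whenever a block is about to split, every neighbor that
// would end up two or more levels coarser must be split as well.
void enforceTwoToOneBalance(std::vector<SkeletonBlock*>& blocks) {
    std::queue<SkeletonBlock*> work;
    for (SkeletonBlock* b : blocks)
        if (b->markedForSplit) work.push(b);
    while (!work.empty()) {
        SkeletonBlock* b = work.front(); work.pop();
        const int levelAfterSplit = b->level + 1;
        for (SkeletonBlock* n : b->neighbors)
            if (!n->markedForSplit && levelAfterSplit - n->level > 1) {
                n->markedForSplit = true; // forced split
                work.push(n);
            }
    }
}
```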
Dynamic Domain Repartitioning
3) load balancing: callback function to decide to which process each block must migrate (the skeleton blocks actually move to this process)
   • lightweight skeleton blocks allow multiple migration steps to different processes (→ enables balancing based on diffusion)
   • links between skeleton blocks and the corresponding real blocks are kept intact while skeleton blocks migrate
   • for global load balancing algorithms, balance is achieved in one step → skeleton blocks immediately migrate to their final processes
Dynamic Domain Repartitioning
4) data migration: the links between skeleton blocks and the corresponding real blocks are used to perform the actual data migration (includes refinement and coarsening of block data)
   • implementation for grid data:
     • coarsening → senders coarsen the data before sending it to the target process (see the sketch below)
     • refinement → receivers refine on the target process(es)
• key parts are customizable via callback functions in order to adapt to the underlying simulation:
  1) decision which blocks split/merge
  2) dynamic load balancing
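A schematic of the sender-side coarsening for cell-centered grid data (our simplification; the actual LBM coarsening also has to treat the distribution functions, which we omit): 2×2×2 fine cells are averaged into one coarse cell, so only one eighth of the data has to be sent.

```cpp
#include <cstddef>
#include <vector>

// Average 2x2x2 groups of fine cells into one coarse cell so that only the
// coarse data (1/8 of the volume) is shipped to the target process.
std::vector<double> coarsenForSend(const std::vector<double>& fine,
                                   std::size_t nx, std::size_t ny, std::size_t nz) {
    std::vector<double> coarse(nx / 2 * ny / 2 * nz / 2);
    auto f = [&](std::size_t x, std::size_t y, std::size_t z) {
        return fine[(z * ny + y) * nx + x]; // row-major fine-grid access
    };
    for (std::size_t z = 0; z < nz / 2; ++z)
        for (std::size_t y = 0; y < ny / 2; ++y)
            for (std::size_t x = 0; x < nx / 2; ++x) {
                double sum = 0.0;
                for (int dz = 0; dz < 2; ++dz)
                    for (int dy = 0; dy < 2; ++dy)
                        for (int dx = 0; dx < 2; ++dx)
                            sum += f(2 * x + dx, 2 * y + dy, 2 * z + dz);
                coarse[(z * (ny / 2) + y) * (nx / 2) + x] = sum / 8.0;
            }
    return coarse;
}
```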
Dynamic Load Balancing
1) space-filling curves (Morton or Hilbert):
   • every process needs global knowledge (→ allgather)
   ⇒ scaling issues (even if it’s just a few bytes from every process)
2) load balancing based on diffusion (a sketch of one diffusion step follows below):
   • iterative procedure (= repeat the following multiple times)
   • communication with neighboring processes only
   ⇒ calculate a “flow” for every process-process connection
   ⇒ use this “flow” as a guideline for deciding where blocks need to migrate in order to achieve balance
   ⇒ runtime & memory independent of the number of processes (true in practice? → benchmarks)
   • useful extension (benefits outweigh the costs): allreduce to check for early abort & to adapt the “flow”
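A heavily simplified, serial sketch of one diffusion iteration (our illustration; in the real, perfectly distributed setting each rank exchanges its load only with its neighbors via MPI):

```cpp
#include <cstddef>
#include <vector>

// One process in the process graph. This serial sketch keeps all processes
// in one vector for readability.
struct Process {
    double load = 0.0;                  // e.g., number of blocks on one level
    std::vector<std::size_t> neighbors; // ranks of neighboring processes
};

// One first-order diffusion iteration: every process-process connection
// carries a flow proportional to the load difference. The returned per-rank
// net outflow is the "guideline" telling the balancer how much work should
// leave (positive) or enter (negative) each process.
std::vector<double> diffusionStep(std::vector<Process>& procs, double alpha) {
    std::vector<double> outflow(procs.size(), 0.0);
    std::vector<double> newLoad;
    newLoad.reserve(procs.size());
    for (const Process& p : procs) newLoad.push_back(p.load);
    for (std::size_t i = 0; i < procs.size(); ++i)
        for (std::size_t j : procs[i].neighbors)
            if (i < j) { // visit each undirected connection once
                const double flow = alpha * (procs[i].load - procs[j].load);
                newLoad[i] -= flow;  newLoad[j] += flow;
                outflow[i] += flow;  outflow[j] -= flow;
            }
    for (std::size_t i = 0; i < procs.size(); ++i) procs[i].load = newLoad[i];
    return outflow;
}
// alpha must be small enough for stability, e.g. 1 / (max. degree + 1).
```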
LBM AMR - Performance
• Benchmark Environments:
  • JUQUEEN (5.0 PFLOP/s): Blue Gene/Q, 459K cores, 1 GB/core; compiler: IBM XL / IBM MPI
  • SuperMUC (2.9 PFLOP/s): Intel Xeon, 147K cores, 2 GB/core; compiler: Intel XE / IBM MPI
• Benchmark (LBM D3Q19 TRT): lid-driven cavity; domain partitioning with 4 grid levels
LBM AMR - Performance
• during this refresh process, all cells on the finest level are coarsened and the same amount of fine cells is created by splitting coarser cells → 72 % of all cells change their size
LBM AMR - Performance
• avg. blocks/process (max. blocks/proc.):

| level | initially | after refresh | after load balance |
|-------|-----------|---------------|--------------------|
| 0 | 0.383 (1) | 0.328 (1) | 0.328 (1) |
| 1 | 0.656 (1) | 0.875 (9) | 0.875 (1) |
| 2 | 1.313 (2) | 3.063 (11) | 3.063 (4) |
| 3 | 3.500 (4) | 3.500 (16) | 3.500 (4) |
LBM AMR - Performance
• SuperMUC – space-filling curve: Morton
[chart: time in seconds (0–3.5) required for the entire refresh cycle (uphold 2:1 balance, dynamic load balancing, split/merge blocks, migrate data) on 1024 / 8192 / 65,536 cores, for 209,671 / 497,000 / 970,703 cells per core (up to 14 / 33 / 64 billion cells in total)]
LBM AMR - Performance
• SuperMUC – diffusion load balancing
[chart: same benchmark with diffusion-based load balancing – the refresh-cycle time is almost independent of the number of processes!]
LBM AMR - Performance
• JUQUEEN – space-filling curve: Morton
[chart: time in seconds (0–12) for the entire refresh cycle on 256 / 4096 / 32,768 / 458,752 cores, for 31,062 / 127,232 / 429,408 cells per core (up to 14 / 58 / 197 billion cells in total); hybrid MPI+OpenMP version with SMP: 1 process ⇔ 2 cores ⇔ 8 threads]
LBM AMR - Performance
• JUQUEEN – diffusion load balancing
[chart: same benchmark with diffusion-based load balancing – the refresh-cycle time is almost independent of the number of processes!]
LBM AMR - Performance
• JUQUEEN – diffusion load balancing
[chart: number of diffusion iterations (axis: 0–12) until the load is perfectly balanced, for 256 / 4096 / 32,768 / 458,752 cores]
LBM AMR - Performance
• impact on performance / overhead of the entire dynamic repartitioning procedure?
• it depends …
  • … on the number of cells per core
  • … on the actual runtime of the compute kernels (D3Q19 vs. D3Q27, additional force models, etc.)
  • … on how often dynamic repartitioning happens
• previous lid-driven cavity benchmark: overhead ≙ 1 to 3 (diffusion) or 1.5 to 10 (curve) time steps
⇒ in practice, a lot of time† is spent just to determine whether or not the grid must be adapted, i.e., whether or not refinement must take place
† often the entire overhead of AMR
LBM AMR - Performance
• AMR for the LBM – example (vocal fold phantom geometry):
  • DNS (direct numerical simulation); Reynolds number: 2500 / D3Q27 TRT
  • 24,054,048 ↔ 315,611,120 fluid cells / 1 ↔ 5 levels
  • processes: 3584 (on SuperMUC phase 2)
  • runtime: c. 24 h (3 × c. 8 h)
  • load balancer: space-filling curve (Hilbert order)
  • time steps: 180,000 / 2,880,000 (finest grid)
  • refresh cycles: 537 (→ refresh every 335 time steps)
  • without refinement: 311 times more memory and 701 times the workload
Conclusion
Conclusion & Outlook
• the approach to massively parallel grid repartitioning,
  … using a block-structured domain partitioning and
  … employing a lightweight “copy” of the data structure during dynamic load balancing,
  is paying off and working extremely well:
• we can handle 10¹¹ cells (> 10¹² unknowns) … with 10⁷ blocks and 1.83 million threads
• outlook – resilience (using ULFM): store redundant, in-memory “snapshots” → if one or multiple processes fail → restore their data on different processes → perform dynamic repartitioning → continue :-)
THANK YOU FOR YOUR ATTENTION!
QUESTIONS ?