Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework waLBerla

SIAM PP 2016, April 13th, 2016

Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde

Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany


Outline

• Motivation

• waLBerla Framework

• Phase-Field Method in waLBerla

• Single-Core Optimizations

• Asynchronous Communication

• I/O and Post-processing

• In-Situ Processing with Python

• Summary and Outlook

Optimization of a Phase-Field Method using HPC Framework waLBerla
Martin Bauer - Chair for System Simulation, FAU Erlangen-Nürnberg – April 13, 2016

Motivation

• a large domain is required to reduce the influence of the boundaries

• some physical patterns (spirals) occur only in highly resolved simulations

• hence: simulate big domains in 3D

• an unoptimized, general-purpose phase-field code from KIT is available

• goal: write an optimized parallel version for a specific model


The waLBerla Framework

waLBerla Framework

• waLBerla: widely applicable Lattice-Boltzmann from Erlangen

• HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)

• evolved into a general framework for algorithms on block-structured grids

• www.walberla.net

Figures: Vocal Fold Study (Florian Schornbaum); Fluid Structure Interaction (Simon Bogner); Free Surface Flow

waLBerla Framework

• Written in C++ with Python extensions

• Hybrid parallelization (MPI + OpenMP)

• No data structures that grow with the number of processes involved

• Scales from laptops to recent petascale machines

• Parallel I/O

• Portable across compilers and operating systems (e.g. llvm/clang)

• Automated tests / CI servers

• Open Source release planned

Block-structured Grids

• Domain decomposition and distribution to processes:

  • regular decomposition into blocks containing uniform grids

  • grid refinement: octree-like decomposition

In most cases, if a regular decomposition of a uniform grid is used, exactly one block is assigned to each process.

Forest of octrees: each block contains a uniform grid of the same size → 2:1 balance between neighboring cells on level transitions.
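The regular decomposition with one block per process can be sketched as follows. This is a minimal illustration, not waLBerla API: the function name `decompose`, the rank ordering, and the requirement that the domain size is divisible by the block count are all assumptions made for the example.

```python
from itertools import product

def decompose(domain, blocks):
    """Split a uniform grid into a regular array of equally sized blocks.

    domain: (nx, ny, nz) number of cells in the whole domain
    blocks: (bx, by, bz) number of blocks per axis
    Returns a list indexed by rank; each entry holds the per-axis
    half-open cell ranges (start, stop) owned by that rank.
    """
    for n, b in zip(domain, blocks):
        if n % b != 0:
            raise ValueError("domain size must be divisible by block count")
    size = tuple(n // b for n, b in zip(domain, blocks))
    ranges = []
    # One block per process: the rank is the position in this enumeration
    # (an assumed ordering, chosen only for the example).
    for idx in product(*(range(b) for b in blocks)):
        ranges.append(tuple((i * s, (i + 1) * s) for i, s in zip(idx, size)))
    return ranges

blocks = decompose((64, 64, 32), (2, 2, 1))
```

Here `blocks[rank]` gives the cell ranges of the uniform grid stored by that process; every block has the same size, which matches the "uniform grid of the same size" property stated on the slide.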

Hybrid Parallelization

• Distributed memory parallelization: MPI

  • data exchange on borders between blocks via ghost layers

  • support for overlapping communication and computation

  • some advanced models require more complex communication patterns (e.g. free-surface and fluid-structure interaction)

Figure labels: sender process, receiver process

(slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply)
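The ghost-layer exchange between two neighboring blocks can be illustrated in one dimension. This is a simplified sketch: both blocks live in the same Python process and the exchange is a plain copy, whereas in an MPI code the border values would travel in messages between sender and receiver process. The function name and the one-cell ghost width are assumptions.

```python
def exchange_ghost_layers(left, right):
    """Copy border cells of two neighboring 1D blocks into each other's ghost cells.

    Each block is a list laid out as [ghost, interior..., ghost].
    """
    right[0] = left[-2]   # last interior cell of `left` -> left ghost of `right`
    left[-1] = right[1]   # first interior cell of `right` -> right ghost of `left`

left = [0.0, 1.0, 2.0, 3.0, 0.0]
right = [0.0, 4.0, 5.0, 6.0, 0.0]
exchange_ghost_layers(left, right)
```

After the exchange, a stencil evaluated on the last interior cell of `left` can read its right neighbor from the local ghost cell without touching the other block.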


Phase field in waLBerla

Phase field algorithm

• two lattices (fields):

  • phase field 𝜙 with 4 components

  • chemical potential 𝜇 with 2 components

• storing two time steps in “src” and “dst” fields
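The "src"/"dst" scheme can be sketched as a double-buffered update: each step reads only the old time step and writes only the new one, then the two fields swap roles. The 3-point average below is a placeholder stencil chosen for brevity, not the phase-field kernel from the talk.

```python
def step(src, dst):
    """One explicit update: read only from src, write only to dst.

    The 3-point average is a placeholder stencil, not the phase-field kernel.
    """
    n = len(src)
    for i in range(1, n - 1):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    dst[0], dst[-1] = src[0], src[-1]   # keep boundary values fixed

src = [0.0, 0.0, 3.0, 0.0, 0.0]
dst = [0.0] * len(src)
for _ in range(2):
    step(src, dst)
    src, dst = dst, src   # swap roles instead of copying data
```

After the loop, `src` holds the most recent time step. Swapping references avoids copying a full field per time step.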


Optimizations of Phase Field algorithm

Optimization

Model Layer

• moving window technique

• simplifications due to the special setup (e.g. analytical temperature gradient)

• shortcuts: some terms can be neglected in certain cell types

Algorithm Layer

• access patterns / stencils

• overlapping computation and communication

• eliminate common subexpressions

Hardware Layer

• SIMDification

• Memory layout (AoS vs. SoA)
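The difference between the two memory layouts named above comes down to how a (cell, component) pair maps to a linear index. The sketch below shows both mappings; the function names are illustrative, and the slide does not state which layout was chosen for which field.

```python
def aos_index(cell, comp, n_cells, n_comps):
    """Array of Structures: all components of one cell are adjacent."""
    return cell * n_comps + comp

def soa_index(cell, comp, n_cells, n_comps):
    """Structure of Arrays: one component of all cells is adjacent."""
    return comp * n_cells + cell

n_cells, n_comps = 8, 4   # 4 components, like the phase field in this talk

# Stride between the same component of two neighboring cells:
aos_stride = aos_index(1, 0, n_cells, n_comps) - aos_index(0, 0, n_cells, n_comps)
soa_stride = soa_index(1, 0, n_cells, n_comps) - soa_index(0, 0, n_cells, n_comps)
```

With SoA the same component of consecutive cells is contiguous (stride 1), which is typically what vector loads across cells need; with AoS all components of one cell sit together (stride equal to the component count), which can help when a kernel touches every component of a cell at once.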

Single Core Optimization Results

• test system: one SuperMUC core (Intel Xeon E5-2680 8C)

• for both kernels: systematic performance engineering leads to 80x faster code

Figure: bar chart of MLUP/s (for the 𝜙-kernel only, axis from 0 to 12), with separate bars for interface, solid and liquid cells. Variants shown: general purpose C code; basic waLBerla implementation; with SIMD intrinsics, single cell; with T(z) optimization; with staggered buffer; with shortcuts (cellwise branching). Speedups annotated in the chart: 6, 28, 59, 67, 101.
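The chart's unit, MLUP/s, stands for million lattice (cell) updates per second. The sketch below shows how such a rate and a speedup are computed; the cell count, step count, wall time and rates in the example are hypothetical values, not measurements from the talk.

```python
def mlups(cells, time_steps, seconds):
    """Million lattice (cell) updates per second."""
    return cells * time_steps / seconds / 1e6

def speedup(optimized_mlups, baseline_mlups):
    """Ratio of two update rates measured on the same problem."""
    return optimized_mlups / baseline_mlups

# Hypothetical measurement: 60^3 cells, 100 time steps, 10.8 s wall time.
rate = mlups(60 ** 3, 100, 10.8)

# Hypothetical rates, for illustration only.
factor = speedup(12.0, 1.5)
```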


Algorithm Layer Optimization


Communication Overlap

communication of 𝜇 can be overlapped without kernel adaptations
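A possible shape of this overlap is sketched below. Two points are assumptions of the example and not stated on the slide: that the 𝜙 update reads 𝜇 only at the cell itself (so it never touches 𝜇 ghost cells), and that a background thread may stand in for non-blocking MPI communication. The kernel is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def exchange_mu_ghosts(mu_left, mu_right):
    """Stand-in for the ghost-layer exchange of mu between two blocks."""
    mu_right[0] = mu_left[-2]
    mu_left[-1] = mu_right[1]

def phi_kernel(phi, mu):
    """Placeholder phi update that reads mu only at the same cell,
    so it never touches the mu ghost cells at index 0 and -1."""
    interior = [p + 0.5 * m for p, m in zip(phi[1:-1], mu[1:-1])]
    return [phi[0]] + interior + [phi[-1]]

mu_left, mu_right = [0.0, 1.0, 2.0, 0.0], [0.0, 3.0, 4.0, 0.0]
phi_left = [0.0, 10.0, 20.0, 0.0]

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(exchange_mu_ghosts, mu_left, mu_right)  # start communication
    phi_left = phi_kernel(phi_left, mu_left)                      # compute meanwhile
    pending.result()                                              # wait before mu ghosts are read
```

Because the kernel and the exchange touch disjoint cells, the kernel itself needs no changes; only the order of "start communication, compute, wait" differs from a non-overlapped time step.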


Scaling Results

Figure panels: SuperMUC, JUQUEEN


I/O and Postprocessing

Managing I/O

• I/O is necessary to store results (frequently) and for checkpointing (seldom)

• for highly parallel simulations, the output of results quickly becomes a bottleneck

• Example: storing one time step of a (2420 x 2420 x 1474) domain as a voxel file: 386 GB

• Solution: generate a surface mesh from the voxel data during the simulation, locally on each process, using a marching cubes algorithm

• one mesh for each phase boundary

• mesh size: < 10 MB
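The 386 GB figure can be checked with a short calculation. The sketch assumes that the voxel file stores all six field components per cell (4 for 𝜙, 2 for 𝜇) in double precision and that "GB" is read as 2^30 bytes; neither assumption is stated on the slide, but together they reproduce the number.

```python
nx, ny, nz = 2420, 2420, 1474
values_per_cell = 4 + 2   # assumption: 4 phase-field + 2 chemical-potential components
bytes_per_value = 8       # assumption: double precision

cells = nx * ny * nz
size_bytes = cells * values_per_cell * bytes_per_value
size_gib = size_bytes / 2 ** 30   # about 385.9
```

Compared with a surface mesh below 10 MB, this is a reduction of more than four orders of magnitude per stored time step.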

• surface meshes are still unnecessarily finely resolved: one triangle per interface cell

• hierarchical mesh coarsening and reduction during simulation

• quadric edge reduce algorithm (cglib)

• crucial: mesh reduction step preserves boundary vertices

• result: one coarse mesh with a size on the order of several MB

Figure labels: local fine meshes generated by marching cubes; one coarse mesh on root
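The hierarchical pattern of merging and coarsening toward the root process can be sketched as a pairwise tree reduction. This is a toy model under strong assumptions: a mesh is represented only by its triangle count, `coarsen` is a placeholder that caps the count instead of running a quadric edge reduction, and the preservation of boundary vertices is not modeled at all.

```python
def coarsen(triangles, target):
    """Placeholder for mesh reduction: only caps the triangle count."""
    return min(triangles, target)

def hierarchical_reduce(local_meshes, target):
    """Pairwise tree reduction of per-process triangle counts toward rank 0."""
    meshes = [coarsen(m, target) for m in local_meshes]
    stride = 1
    while stride < len(meshes):
        for rank in range(0, len(meshes) - stride, 2 * stride):
            # `rank` receives the mesh of `rank + stride`, merges, and coarsens again
            meshes[rank] = coarsen(meshes[rank] + meshes[rank + stride], target)
        stride *= 2
    return meshes[0]

small = hierarchical_reduce([10, 20, 30], 500)
capped = hierarchical_reduce([1000, 800, 50, 1200], 500)
```

Coarsening after every merge keeps the mesh held by any single process bounded by `target`, so the root never has to hold all fine meshes at once.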

Python Coupling

• extracting relevant data while the simulation is running

• direct, efficient array access via the Python numpy package; data is shared, not copied

• using the boost::python library to connect C++ code with Python

• further applications:

  • flexible configuration

  • model development: Matlab-like functionality available

Diagram labels: waLBerla python_coupling module, boost::python, libpython
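"Shared, not copied" can be illustrated with Python's buffer protocol, the same mechanism that lets numpy wrap existing memory. The sketch uses only the standard library: an `array` stands in for a simulation field and a `memoryview` for the array view handed to a script. It does not use or reproduce the waLBerla Python interface.

```python
from array import array

field = array("d", [0.0, 1.0, 2.0, 3.0])   # stands in for a simulation field
view = memoryview(field)                    # shares the buffer, no copy
interior = view[1:3]                        # slicing a memoryview is also zero-copy

interior[0] = 42.0                          # write through the view
```

The write through `interior` is visible in `field` itself, because both refer to the same memory. For large fields this avoids duplicating the data on every access from the scripting side.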


Simplify Workflow

Python Coupling

• Method 1: Using Python from C++ (host language: C++)

• Method 2: Using C++ from Python (host language: Python)

Diagram labels: waLBerla python_coupling module, boost::python, libpython, Python interpreter, walberla.so


Demo


Summary

Summary / Outlook

Summary

• an efficient phase-field algorithm is necessary to simulate certain physical effects (spirals)

• systematic performance engineering on several levels

• speedup by a factor of 80 compared to the original version

• parallel processing of output data during the simulation to reduce the result file size

• coupling to the Python scripting language for in-situ processing

Outlook

• GPU implementation

• coupling to the Lattice Boltzmann Method

• improve the discretization scheme (implicit method)


Thank you!

Questions?