Performance Optimization of a Massively Parallel Phase-Field Method

Using the HPC Framework waLBerla

SIAM PP 2016, April 13th 2016

Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde

Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Outline

• Motivation
• waLBerla Framework
• Phase-Field Method in waLBerla
• Single-Core Optimizations
• Asynchronous Communication
• I/O and Post-processing
• In-Situ Processing with Python
• Summary and Outlook

Optimization of a Phase-Field Method using HPC Framework waLBerla
Martin Bauer - Chair for System Simulation, FAU Erlangen-Nürnberg – April 13, 2016

Motivation

• large domain required to reduce boundary influence
• some physical patterns only occur in highly resolved simulations (spiral)
• simulate big domains in 3D
• unoptimized, general-purpose phase-field code from KIT available
• goal: write optimized parallel version for specific model

The waLBerla Framework

waLBerla Framework

• widely applicable Lattice-Boltzmann from Erlangen
• HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)
• evolved into a general framework for algorithms on block-structured grids
• www.walberla.net

[Figures: Vocal Fold Study (Florian Schornbaum); Fluid Structure Interaction (Simon Bogner); Free Surface Flow]

waLBerla Framework

• Written in C++ with Python extensions
• Hybridly parallelized (MPI + OpenMP)
• No data structures growing with number of processes involved
• Scales from laptop to recent petascale machines
• Parallel I/O
• Portable (Compiler/OS), e.g. llvm/clang
• Automated tests / CI servers
• Open Source release planned

Block-structured Grids

• Domain Decomposition & Distribution to Processes:
• regular decomposition into blocks containing uniform grids
• grid refinement: octree-like decomposition

In most cases, if a regular decomposition of a uniform grid is used, exactly one block is assigned to each process.

Forest of octrees: each block contains a uniform grid of the same size → 2:1 balance between neighboring cells on level transitions
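
The regular decomposition described on this slide can be sketched in a few lines. This is only an illustration: `decompose` is a hypothetical helper, not part of waLBerla, and it assumes the domain size is exactly divisible by the number of blocks in each direction. Each block is assigned to one process rank, matching the "one block per process" case.

```python
from itertools import product

def decompose(domain, blocks):
    """Split a uniform grid of size `domain` = (nx, ny, nz) into
    blocks[0] x blocks[1] x blocks[2] equally sized blocks.

    Returns a dict: rank -> (lower corner, upper corner), upper exclusive.
    Hypothetical helper for illustration; assumes exact divisibility.
    """
    for n, b in zip(domain, blocks):
        if n % b != 0:
            raise ValueError("domain size must be divisible by block count")
    size = tuple(n // b for n, b in zip(domain, blocks))
    layout = {}
    # One block per rank, enumerated in lexicographic block order
    for rank, idx in enumerate(product(*(range(b) for b in blocks))):
        lo = tuple(i * s for i, s in zip(idx, size))
        hi = tuple(l + s for l, s in zip(lo, size))
        layout[rank] = (lo, hi)
    return layout

layout = decompose((8, 8, 4), (2, 2, 1))
```

Here `layout` holds four blocks of 4 x 4 x 4 cells each. A refined (octree-like) decomposition would need a tree structure instead of this flat enumeration.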

Hybrid Parallelization

• Distributed Memory Parallelization: MPI
• data exchange on borders between blocks via ghost layers
• support for overlapping communication and computation
• some advanced models require more complex communication patterns (e.g. free-surface and fluid-structure interaction)

[Figure: ghost-layer data exchange between sender process and receiver process]

(slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply)
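
A minimal 1D sketch of the ghost-layer idea, assuming one ghost cell on each side of a block and two neighboring blocks held in the same Python process. In a distributed run the two copies below would be an MPI send/receive pair; the function names are illustrative and not waLBerla's API.

```python
def exchange_ghost_layers(left, right):
    """Copy border cells into the neighbor's ghost cells (1D, one ghost layer).

    Each block is a list laid out as [ghost, interior..., ghost].
    """
    right[0] = left[-2]   # left block's last interior cell -> right block's left ghost
    left[-1] = right[1]   # right block's first interior cell -> left block's right ghost

def smooth(block):
    """3-point average over interior cells; border cells read the ghost cells."""
    interior = [
        (block[i - 1] + block[i] + block[i + 1]) / 3.0
        for i in range(1, len(block) - 1)
    ]
    return [block[0]] + interior + [block[-1]]

left = [0.0, 1.0, 2.0, 3.0, 0.0]
right = [0.0, 4.0, 5.0, 6.0, 0.0]
exchange_ghost_layers(left, right)
left_new = smooth(left)
```

After the exchange, the stencil at the last interior cell of `left` sees the neighbor's value 4.0, so the result is the same as if both blocks were one contiguous grid.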

Phase field in waLBerla

Phase field algorithm

• two lattices (fields):
• phase field 𝜙 with 4 components
• chemical potential 𝜇 with 2 components
• storing two time steps in “src” and “dst” fields
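
The "src"/"dst" scheme can be sketched as follows, assuming an explicit update in which the new time step depends only on the old one. The 3-point average with periodic boundaries is a placeholder stencil, not the phase-field kernel.

```python
def step(src, dst):
    """Explicit update: dst is computed from src only (periodic 3-point average)."""
    n = len(src)
    for i in range(n):
        dst[i] = (src[i - 1] + src[i] + src[(i + 1) % n]) / 3.0

src = [0.0, 0.0, 3.0, 0.0, 0.0]
dst = [0.0] * len(src)

for _ in range(2):
    step(src, dst)
    src, dst = dst, src  # swap roles instead of copying data
```

After each step the two fields exchange roles, so no cell data is copied between time steps; `src` always refers to the most recent state.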

Optimizations of Phase Field algorithm

Optimization

Model Layer

• moving window technique
• simplifications due to special setup (e.g. analytical temperature gradient)
• shortcuts: some terms can be neglected in certain cell types

Algorithm Layer

• access patterns / stencils
• overlapping computation and communication
• eliminate common subexpressions

Hardware Layer

• SIMDification
• Memory layout (AoS vs. SoA)
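
To illustrate the memory layout item (AoS vs. SoA), the sketch below computes the linear offsets of a multi-component field under both layouts. The function names are made up for this example; the component count of 4 matches the phase field 𝜙 from the slides.

```python
def aos_index(cell, comp, n_cells, n_comps):
    """Array of Structures: all components of one cell are adjacent."""
    return cell * n_comps + comp

def soa_index(cell, comp, n_cells, n_comps):
    """Structure of Arrays: all cells of one component are adjacent."""
    return comp * n_cells + cell

n_cells, n_comps = 8, 4

# Offsets touched when reading component 0 of four consecutive cells
aos_offsets = [aos_index(c, 0, n_cells, n_comps) for c in range(4)]
soa_offsets = [soa_index(c, 0, n_cells, n_comps) for c in range(4)]
```

With SoA the four reads are contiguous (offsets 0, 1, 2, 3), which suits SIMD loads across neighboring cells; with AoS they are strided by the component count (0, 4, 8, 12). Which layout performs better still depends on how a kernel accesses the components.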

Single Core Optimization Results

• test system: one SuperMUC core (Intel Xeon E5-2680 8C)

• for both kernels: systematic performance engineering leads to 80x faster code

[Figure: bar chart of MLUP/s (for 𝜙-kernel only), axis from 0 to 12, shown separately for interface, solid and liquid cells. Variants compared: general purpose C code; basic waLBerla implementation; with SIMD intrinsics, single cell; with T(z) optimization; with staggered buffer; with shortcuts (cellwise branching). Speedup annotations: 6, 28, 59, 67, 101.]
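
MLUP/s stands for million lattice (cell) updates per second. A small sketch of how the metric and a speedup factor are computed; the cell counts and timings below are invented inputs for illustration only and are not measurements from the talk.

```python
def mlups(cells, time_steps, seconds):
    """Million lattice-cell updates per second."""
    return cells * time_steps / seconds / 1e6

# Invented example timings, NOT results from the slides
baseline = mlups(1_000_000, 10, 5.0)    # 2.0 MLUP/s
optimized = mlups(1_000_000, 10, 0.5)   # 20.0 MLUP/s
speedup = optimized / baseline
```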

Algorithm Layer Optimization

Communication Overlap

communication of 𝜇 can be overlapped without kernel adaptations
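
A generic way to overlap communication and computation is to start the ghost-layer exchange, update the cells that do not depend on ghost data, and only then wait for the exchange and update the border cells. The sketch below emulates this in 1D with a worker thread standing in for non-blocking communication. It is not waLBerla's implementation, and since the slide states that 𝜇 needs no kernel adaptations, the scheme used there may well differ from this inner/border split.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_ghost_values(neighbor_border):
    """Stand-in for a non-blocking ghost-layer exchange (MPI in a real run)."""
    return list(neighbor_border)

def update(cells, i):
    """3-point average at index i."""
    return (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0

def step_with_overlap(block, left_value, right_value):
    """block = [ghost, interior..., ghost]; returns the updated interior cells."""
    n = len(block)
    new = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. start the communication
        pending = pool.submit(fetch_ghost_values, (left_value, right_value))
        # 2. meanwhile, update cells that do not read ghost cells
        for i in range(2, n - 2):
            new[i] = update(block, i)
        # 3. wait for the ghost values, then update the outermost interior cells
        block[0], block[-1] = pending.result()
        for i in (1, n - 2):
            new[i] = update(block, i)
    return [new[i] for i in range(1, n - 1)]

result = step_with_overlap([0.0, 1.0, 2.0, 3.0, 4.0, 0.0], 0.0, 5.0)
```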

Scaling Results

[Figures: scaling results on SuperMUC and JUQUEEN]

I/O and Postprocessing

Managing I/O

• I/O necessary to store results (frequently) and for checkpointing (seldom)
• for highly parallel simulations, the output of results quickly becomes a bottleneck
• Example: storing one time step of a (2420 x 2420 x 1474) domain as voxel file: 386 GB
• Solution: generate surface mesh from voxel data during simulation, locally on each process, using a marching cubes algorithm
• one mesh for each phase boundary
• mesh size: < 10 MB
• surface meshes are still unnecessarily finely resolved: one triangle per interface cell
• hierarchical mesh coarsening and reduction during simulation
• quadric edge reduce algorithm ( cglib )
• crucial: mesh reduction step preserves boundary vertices
• result: one coarse mesh with size in the order of several MB

[Figure: local fine meshes generated by marching cubes, reduced to one coarse mesh on root]
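
The 386 GB quoted for one voxel time step can be reproduced with simple arithmetic under two assumptions that the slides do not state explicitly: all six field values per cell (4 for 𝜙, 2 for 𝜇) are written in double precision, and "GB" is counted in units of 2^30 bytes.

```python
nx, ny, nz = 2420, 2420, 1474
values_per_cell = 4 + 2   # assumption: 4 phase-field + 2 chemical-potential values
bytes_per_value = 8       # assumption: double precision

cells = nx * ny * nz                                  # 8,632,333,600 cells
size_bytes = cells * values_per_cell * bytes_per_value
size_gib = size_bytes / 2**30                         # roughly 386
```

Compared with a surface mesh of a few MB, this is a reduction of about five orders of magnitude per stored time step.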

Python Coupling

Simplify Workflow

• extracting relevant data while simulation is running
• direct, efficient array access via Python numpy package: data is shared, not copied
• using boost::python library to connect C++ code with Python
• further applications:
• flexible configuration
• model development: Matlab-like functionality available

[Diagram labels: waLBerla python_coupling module, boost::python, libpython]
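
The "shared, not copied" behaviour can be illustrated with Python's buffer protocol, which NumPy arrays also take part in. The sketch below uses only the standard library (`array` and `memoryview`) as stand-ins for a field owned by the C++ side and its view in Python; it does not use waLBerla's actual Python interface.

```python
from array import array

# Stand-in for a simulation field owned by the C++ side
field = array("d", [0.0, 1.0, 2.0, 3.0])

view = memoryview(field)   # shares the underlying buffer, no copy
copy = field.tolist()      # independent copy, for comparison

view[1] = 42.0             # write through the view
```

The write through `view` is visible in `field`, while `copy` keeps the old value. For in-situ analysis of large fields, avoiding the copy saves both memory and time.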

• Method 1: Using Python from C++ (Host Language: C++)
• Method 2: Using C++ from Python (Host Language: Python)

[Diagram labels: Python interpreter, walberla.so]

Demo

Summary / Outlook

Summary

• efficient phase field algorithm necessary to simulate certain physical effects (spiral)
• systematic performance engineering on several levels
• speedup by factor of 80 compared to original version
• parallel output data processing during simulation to reduce result file size
• coupling to Python scripting language for in-situ processing

Outlook

• GPU implementation
• coupling to Lattice Boltzmann Method
• improve discretization scheme (implicit method)

Thank you!

Questions?