Massively Parallel Phase Field Simulations using HPC Framework waLBerla


Page 1: Massively Parallel Phase Field Simulations using HPC ...

Massively Parallel Phase Field Simulations using HPC Framework waLBerla

SIAM CSE 2015, March 15th 2015

Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich Rüde

Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Page 2: Massively Parallel Phase Field Simulations using HPC ...

Outline

• Motivation

• waLBerla Framework

• Phase Field Method
  • Overview

• Optimizations

• Performance Modelling

• Managing I/O

• Summary and Outlook


Page 3: Massively Parallel Phase Field Simulations using HPC ...

Motivation

• large domains are required to reduce boundary influence

• some physical patterns only occur in highly resolved simulations (spirals)

• simulate big domains in 3D

• an unoptimized, general-purpose phase field code from KIT is available

• goal: write an optimized, parallel version for this specific model


Page 4: Massively Parallel Phase Field Simulations using HPC ...

The waLBerla Framework

Page 5: Massively Parallel Phase Field Simulations using HPC ...


waLBerla Framework

• widely applicable Lattice Boltzmann from Erlangen

• HPC software framework, originally developed for CFD simulations with the Lattice Boltzmann Method (LBM)

• evolved into a general framework for algorithms on structured grids

• coupling with the in-house rigid body physics engine pe


Vocal Fold Study (Florian Schornbaum)

Fluid Structure Interaction (Simon Bogner)

Free Surface Flow

Page 6: Massively Parallel Phase Field Simulations using HPC ...


Block Structured Grids

• structured grid

• domain is decomposed into blocks

• blocks are the container data structure for the simulation data (lattice)

• blocks are the basic unit of load balancing


Page 7: Massively Parallel Phase Field Simulations using HPC ...

Hybrid Parallelization

• distributed memory parallelization: MPI

• data exchange on borders between blocks via ghost layers

• support for overlapping communication and computation

• some advanced models (e.g. free surface) require more complex communication patterns

(diagram: ghost layer exchange between a sender process and a receiver process)

(slightly more complicated for non-uniform domain decompositions, but the same general ideas still apply)

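To illustrate the ghost layer idea, here is a minimal sketch of a 1D ghost layer exchange with plain MPI; the function name and the 1D decomposition are assumptions for illustration and do not reflect waLBerla's actual communication API.

```cpp
#include <mpi.h>
#include <vector>

// f holds one left ghost cell, n interior cells, and one right ghost cell.
// Each process swaps its outermost interior values with its neighbours.
void exchangeGhostLayers(std::vector<double> &f, int rank, int numProcs) {
    const int n     = static_cast<int>(f.size()) - 2;                // interior cells
    const int left  = (rank > 0)            ? rank - 1 : MPI_PROC_NULL;
    const int right = (rank < numProcs - 1) ? rank + 1 : MPI_PROC_NULL;

    // send the first interior cell to the left neighbour,
    // receive the right neighbour's value into the right ghost cell
    MPI_Sendrecv(&f[1],     1, MPI_DOUBLE, left,  0,
                 &f[n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // send the last interior cell to the right neighbour,
    // receive the left neighbour's value into the left ghost cell
    MPI_Sendrecv(&f[n], 1, MPI_DOUBLE, right, 1,
                 &f[0], 1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```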

Page 8: Massively Parallel Phase Field Simulations using HPC ...

Phase field in waLBerla

Page 9: Massively Parallel Phase Field Simulations using HPC ...


Phase field algorithm


• two lattices (fields):
  • phase field 𝜙 with 4 entries per cell
  • chemical potential 𝜇 with 2 entries per cell

• storing two time steps in “src” and “dst” fields

• spatial discretization: finite differences

• temporal discretization: explicit Euler method
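A minimal sketch of the src/dst sweep pattern described above, assuming a flattened field container and a placeholder 7-point Laplacian as the right-hand side; this is illustrative only, not the actual waLBerla kernel or the real phase field equations.

```cpp
#include <vector>
#include <utility>

// Hypothetical flattened field: nx * ny * nz cells with 'comps' values per cell.
struct Field {
    int nx, ny, nz, comps;
    std::vector<double> data;
    double &operator()(int x, int y, int z, int c) {
        return data[(((static_cast<std::size_t>(z) * ny + y) * nx + x) * comps) + c];
    }
};

// One explicit Euler step: read only from src, write only to dst,
// then swap the roles of the two fields for the next time step.
void sweep(Field &src, Field &dst, double dt) {
    for (int z = 1; z < src.nz - 1; ++z)
        for (int y = 1; y < src.ny - 1; ++y)
            for (int x = 1; x < src.nx - 1; ++x)
                for (int c = 0; c < src.comps; ++c) {
                    // placeholder finite-difference right-hand side (7-point Laplacian)
                    double lap = src(x+1,y,z,c) + src(x-1,y,z,c)
                               + src(x,y+1,z,c) + src(x,y-1,z,c)
                               + src(x,y,z+1,c) + src(x,y,z-1,c)
                               - 6.0 * src(x,y,z,c);
                    dst(x,y,z,c) = src(x,y,z,c) + dt * lap;
                }
}

// usage per time step: sweep(phiSrc, phiDst, dt); std::swap(phiSrc, phiDst);
```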

Page 10: Massively Parallel Phase Field Simulations using HPC ...


Phase field algorithm


• two lattices (fields):
  • phase field 𝜙 with 4 components
  • chemical potential 𝜇 with 2 components

• storing two time steps in “src” and “dst” fields

Page 11: Massively Parallel Phase Field Simulations using HPC ...


Phase field algorithm


FLOPs per cell: 940

Loads / Stores per cell: 34

Page 12: Massively Parallel Phase Field Simulations using HPC ...


Phase field algorithm


FLOPs per cell: 2214

Loads / Stores per cell: 168

Page 13: Massively Parallel Phase Field Simulations using HPC ...


Roofline Performance Model


performance data per cell:

  FLOPs                            3154
  Loads / Stores                    202
  Loads from RAM                    101   (remaining accesses are served from cache)
  FLOP per double loaded from RAM  31.2

Sandy Bridge architecture:

  RAM bandwidth per core            6.4 GB/s
  peak FLOP/s per core @ 2.7 GHz   21.6 GFLOP/s
  machine balance                    25 FLOP/double

⇒ compute bound
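Written out, the compute-bound classification follows from comparing the kernel's arithmetic intensity with the machine balance quoted above:

\[
I \;=\; \frac{3154\ \text{FLOP per cell}}{101\ \text{doubles loaded per cell}} \;\approx\; 31.2\ \frac{\text{FLOP}}{\text{double}}
\;>\; B_m \;=\; 25\ \frac{\text{FLOP}}{\text{double}}
\quad\Rightarrow\quad \text{compute bound.}
\]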

Page 14: Massively Parallel Phase Field Simulations using HPC ...

Optimizations of Phase Field algorithm

Page 15: Massively Parallel Phase Field Simulations using HPC ...


Optimization Roadmap


• single core optimizations
  • based on the results of the performance model
  • save floating point operations, pre-compute and store values where possible
  • presented here using the 𝜇-sweep as an example

• scaling
  • performance behavior of the parallelization
  • challenges related to input/output
  • performance data presented for SuperMUC

Page 16: Massively Parallel Phase Field Simulations using HPC ...


Implementation in waLBerla


• starting point: general prototyping code

• new model-specific implementation in waLBerla

• performance-guided design
  • no indirect or virtual calls
  • optimized traversal over the grid

Page 17: Massively Parallel Phase Field Simulations using HPC ...


Implementation in waLBerla


Step 1: replace / remove expensive operations

• pre-compute common subexpressions

• fast inverse square root approximation
  • replace division and sqrt operations with bit-level operations and adds/muls

• reduce the number of divisions using table lookups where possible
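As an illustration of the bit-level trick, here is the classic single-precision fast inverse square root with one Newton-Raphson refinement step; the production code presumably uses a double-precision variant and its own accuracy and constant choices.

```cpp
#include <cstdint>
#include <cstring>

// Fast approximation of 1/sqrt(x): a bit-level initial guess refined by one
// Newton-Raphson iteration, replacing a division and a sqrt with adds/muls.
inline float fastInvSqrt(float x) {
    float half = 0.5f * x;
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));   // reinterpret the float as an integer
    bits = 0x5f3759df - (bits >> 1);        // magic constant: initial estimate of x^(-1/2)
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    y = y * (1.5f - half * y * y);          // one Newton step: y <- y * (1.5 - 0.5 * x * y^2)
    return y;
}
```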

Page 18: Massively Parallel Phase Field Simulations using HPC ...


Gibbs Energy subterm pre-computation


• many quantities depend on local temperature only

• in this scenario temperature is a function of one coordinate: T = 𝑇(𝑧)

• these quantities can be computed once for each (𝑥, 𝑦)-slice

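A sketch of the slice-wise pre-computation, with a placeholder temperature profile and a placeholder temperature-only subterm standing in for the real Gibbs energy terms:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Since T = T(z), any purely temperature-dependent subterm is constant within
// an (x, y)-slice and can be evaluated once per z instead of once per cell.
void sweep(std::vector<double> &field, int nx, int ny, int nz) {
    for (int z = 0; z < nz; ++z) {
        const double T        = 300.0 + 0.5 * z;       // placeholder temperature profile T(z)
        const double tempTerm = std::exp(-1.0 / T);     // placeholder temperature-only subterm
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                const std::size_t idx = (static_cast<std::size_t>(z) * ny + y) * nx + x;
                field[idx] *= tempTerm;                 // reuse the per-slice value in every cell
            }
    }
}
```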

Page 19: Massively Parallel Phase Field Simulations using HPC ...


Gibbs Energy subterm pre-computation


Page 20: Massively Parallel Phase Field Simulations using HPC ...


SIMD


• single instruction, multiple data (SIMD)

• architecture-specific instruction sets
  • Intel: SSE, AVX, AVX2
  • Blue Gene: QPX

• modern compilers do auto-vectorization

• still beneficial to write SIMD instructions explicitly via intrinsics

• problem: separate code for each architecture

• lightweight SIMD abstraction layer in waLBerla to write portable code (see the sketch below)

(diagram: the AVX instruction vaddpd adds the packed doubles 𝑎3…𝑎0 in ymm0 and 𝑏3…𝑏0 in ymm1, writing 𝑐3…𝑐0 back to ymm0)
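A rough sketch of what such an abstraction layer can look like (hypothetical types and function names, not waLBerla's actual SIMD interface): kernels are written against a generic vector type, and one header per architecture maps it to intrinsics, with a scalar fallback.

```cpp
#include <immintrin.h>

#ifdef __AVX__
struct double4 { __m256d v; };
inline double4 load(const double *p)           { return { _mm256_loadu_pd(p) }; }
inline void    store(double *p, double4 a)     { _mm256_storeu_pd(p, a.v); }
inline double4 operator+(double4 a, double4 b) { return { _mm256_add_pd(a.v, b.v) }; }
inline double4 operator*(double4 a, double4 b) { return { _mm256_mul_pd(a.v, b.v) }; }
#else
// scalar fallback with the same interface (SSE or QPX versions would look alike)
struct double4 { double v[4]; };
inline double4 load(const double *p)           { return { { p[0], p[1], p[2], p[3] } }; }
inline void    store(double *p, double4 a)     { for (int i = 0; i < 4; ++i) p[i] = a.v[i]; }
inline double4 operator+(double4 a, double4 b) { double4 r; for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i]; return r; }
inline double4 operator*(double4 a, double4 b) { double4 r; for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i]; return r; }
#endif

// portable kernel written once against the abstraction: c[i] = a[i] + b[i], n divisible by 4
void addArrays(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += 4)
        store(c + i, load(a + i) + load(b + i));
}
```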

Page 21: Massively Parallel Phase Field Simulations using HPC ...


SIMD


Page 22: Massively Parallel Phase Field Simulations using HPC ...


Buffering of staggered values

• to calculate the divergence, values at staggered grid positions are required

• these values can be buffered

• more loads and stores, but fewer floating point operations

• the same technique can also be applied in the 𝜙-sweep

(figure: pre-computed values at staggered grid positions)
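A 1D illustration of the buffering idea with a placeholder face value formula (not the actual phase field flux): each staggered value is computed once and reused by the two neighbouring cells, trading a few extra loads and stores for fewer floating point operations.

```cpp
#include <vector>

// The divergence in cell i needs the flux at the staggered positions i-1/2 and
// i+1/2. Buffering the last face value means every face is evaluated only once.
void divergence(const std::vector<double> &phi, std::vector<double> &div, double invDx) {
    const int n = static_cast<int>(phi.size());
    double fluxLeft = 0.5 * (phi[0] + phi[1]);                    // face value at 1/2 (placeholder)
    for (int i = 1; i < n - 1; ++i) {
        const double fluxRight = 0.5 * (phi[i] + phi[i + 1]);     // face value at i+1/2, computed once
        div[i] = (fluxRight - fluxLeft) * invDx;
        fluxLeft = fluxRight;                                     // buffer for the next cell
    }
}
```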

Page 23: Massively Parallel Phase Field Simulations using HPC ...


Buffering of staggered values


80× faster compared to the original version

Page 24: Massively Parallel Phase Field Simulations using HPC ...


Intranode Scaling


intranode weak scaling on SuperMUC

Page 25: Massively Parallel Phase Field Simulations using HPC ...

Single Node Optimization Summary

Single Node Optimizations

• replace/remove expensive operations like square roots and divisions

• pre-compute and buffer values where possible

• SIMD intrinsics

Percent of Peak on SuperMUC

  𝜙-Sweep            21 %
  𝜇-Sweep            27 %
  Complete Program   25 %

Why not 100% Peak?

• unbalanced number of multiplications and additions

• divisions are counted as 1 FLOP, but they cost 43 times as much as a multiplication or addition

Page 26: Massively Parallel Phase Field Simulations using HPC ...


Scaling


• scaling on SuperMUC up to 32,768 cores

• ghost layer based communication

• communication hiding
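A schematic of the communication hiding pattern with non-blocking MPI (hypothetical buffers and ranks; the waLBerla implementation is more general): start the ghost layer exchange, update the cells that do not need ghost data while the messages are in flight, then finish the exchange and update the boundary cells.

```cpp
#include <mpi.h>
#include <vector>

void timeStep(std::vector<double> &sendBuf, std::vector<double> &recvBuf,
              int leftRank, int rightRank) {
    MPI_Request req[2];

    // start the non-blocking ghost layer exchange
    MPI_Irecv(recvBuf.data(), static_cast<int>(recvBuf.size()), MPI_DOUBLE,
              rightRank, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()), MPI_DOUBLE,
              leftRank, 0, MPI_COMM_WORLD, &req[1]);

    // ... update all interior cells here; they do not depend on ghost data,
    //     so this work overlaps with the message transfer ...

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    // ... update the boundary cells here using the freshly received ghost layer ...
}
```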

Page 27: Massively Parallel Phase Field Simulations using HPC ...


Managing I/O


• I/O is necessary to store results (frequently) and for checkpointing (seldom)

• for highly parallel simulations, the output of results quickly becomes a bottleneck

• example: storing one time step of a 940 × 940 × 2080 domain takes 87 GB

• solution: generate a surface mesh from the voxel data during the simulation, locally on each process, using a marching cubes algorithm

• one mesh for each phase boundary
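The quoted size is roughly what one would expect if all six double-precision values per cell (four phase field entries plus two chemical potential entries) are written out, a back-of-the-envelope estimate:

\[
940 \times 940 \times 2080\ \text{cells} \;\times\; 6\ \tfrac{\text{doubles}}{\text{cell}} \;\times\; 8\ \tfrac{\text{B}}{\text{double}} \;\approx\; 88\ \text{GB},
\]

which is in the same range as the 87 GB quoted above.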

Page 28: Massively Parallel Phase Field Simulations using HPC ...


Managing I/O


• surface meshes are still unnecessarily finely resolved: one triangle per interface cell

Page 29: Massively Parallel Phase Field Simulations using HPC ...


Managing I/O


• quadric edge reduction algorithm (cglib)

• crucial: the mesh reduction step preserves boundary vertices

• hierarchical mesh coarsening and reduction during the simulation

• result: one coarse mesh on the root process with a size on the order of several MB

(diagram: local fine meshes generated by marching cubes are hierarchically reduced into one coarse mesh on the root process)

Page 30: Massively Parallel Phase Field Simulations using HPC ...

Summary

Page 31: Massively Parallel Phase Field Simulations using HPC ...

Summary / Outlook

Summary

• an efficient phase field implementation is necessary to simulate certain physical effects (spirals)

• systematic performance engineering on several levels

• speedup by a factor of 80 compared to the original version

• reached around 25% of peak performance on SuperMUC

• parallel processing of output data during the simulation to reduce result file size

Outlook

• GPU implementation

• coupling to the Lattice Boltzmann Method

• improved discretization scheme (implicit method)


Page 32: Massively Parallel Phase Field Simulations using HPC ...

Thank you!

Questions?