HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456...

18
Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich www.scs.ch HP2C Final Meeting Stencil Library and Dycore Rewrite Tobias Gysi

Transcript of HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456...

Page 1: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich www.scs.ch

HP2C Final Meeting Stencil Library and Dycore Rewrite Tobias Gysi

Page 2: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 2

Outline

•  Recent developments •  Software managed caching

•  Benchmarks & performance

•  Next steps

•  Surprise

•  Goodbye

Page 3: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 3

Recent Developments – Documentation

•  Stencil library documentation •  Describes shared infrastructure and stencil library

•  Detailed reference documentation for stencil library developers

•  Assumes solid C++ and Boost MPL knowledge

•  Documentation phase was used for minor code cleanups

•  https://hpcforge.org

Page 4: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 4

Recent Developments – Dycore

•  Additional HP2C dycore features: •  Relaxation (Carlos Osuna)

•  Saturation Adjustments (Carlos Osuna)

•  Strang splitting support for AdvectionPD

•  Other developments are work in progress, e.g. other advection schemes

Page 5: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 5

Recent Developments – Stencil Library

•  Functionality improvements •  Refactoring temporary fields (Ben Cumming)

•  Dynamic indexing (Ben Cumming)

•  Switch case DSEL extension which supports runtime switches

•  Performance improvements:

•  Re-implementation of software managed caching

•  Do not allocate temporary fields which are not accessed due to caches

Page 6: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 6

Software Managed Caching

•  Software controlled cache management •  Data is placed explicitly in the caches

•  Data is removed from the caches if it is not needed anymore

•  Application knowledge helps to used small caches efficiently

•  Do not cache fields accessed once

•  Do not write back modified cache lines not accessed anymore

•  Immediately remove data from cache if it is not used anymore

•  We have implemented software managed caching for the GPU backend

Page 7: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 7

Software Managed Caching – KCache

•  Sliding window in K direction is stored in GPU registers •  Used to buffer multiple K levels no accesses in IJ

•  Optionally filled and / or flushed from the underlying field

•  E.g. buffer all intermediate fields of z advection

Important for efficient communication between

K loop levels

K /* stage A */ temp(i,j,k+1) = 2.0; /* stage B */ result(i,j,k) = temp(i,j,k+1) + temp(i,j,k) + temp(i,j,k-1);

Stage A

Stage B

Page 8: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 8

Software Managed Caching – KCache Example

Keep all AdvectionPDZ intermediates in registers: define_loops(

define_sweep<cKIncrement>(

define_caches( KCache<data_in, cFill, KWindow<-2,1>, KRange<FullDomain,0,0> >(), KCache<sqrtgrs, cFill, KWindow<-2,2>, KRange<FullDomain,0,0> >(), KCache<icr, cLocal, KWindow<-1,2>, KRange<FullDomain,-2,2> >(), KCache<wcfrac, cLocal, KWindow<-1,2>, KRange<FullDomain,0,1> >(),

KCache<fip, cLocal, KWindow<-1,2>, KRange<FullDomain,0,1> >(), KCache<fim, cLocal, KWindow<-1,2>, KRange<FullDomain,0,1> >(), KCache<fcr, cLocal, KWindow<-1,2>, KRange<FullDomain,0,1> >(), KCache<flux, cLocal, KWindow<-2,1>, KRange<FullDomain,0,1> >() ), define_stages(

StencilStage<ICRStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,-2,0> >(),

StencilStage<FIPAndFIMStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,-1,0> >(),

StencilStage<FCRStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,1,1> >(),

StencilStage<FluxStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,2,2> >(),

StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,2,2> >()

)

)

)

Page 9: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 9

Software Managed Caching – IJCache

•  IJ plane is stored in shared memory •  Used to buffer multiple IJ positions no accesses in K

•  Simple local cache not filled and / or flushed from the underlying field

•  E.g. buffer intermediate values of horizontal diffusion

K if(/* in range of stage A */) { temp(i,j,k) = 2.0; } __syncthreads(); if(/* in range of stage B */) { result(i,j,k) = temp(i,j,k) + temp(i+1,j,k); }

Important for efficient communication between

threads

Stage A

Stage B

Page 10: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 10

Software Managed Caching – IJCache Example

Keep all HorizontalDiffusion intermediates in shared memory: define_loops(

define_sweep<cKIncrement>(

define_caches( IJCache<lap, KRange<FullDomain,0,0> >(), IJCache<flx, KRange<FullDomain,0,0> >(), IJCache<fly, KRange<FullDomain,0,0> >(), IJCache<rxp, KRange<FullDomain,0,0> >(),

IJCache<rxm, KRange<FullDomain,0,0> >() ), define_stages(

StencilStage<LapStage, IJRange<cComplete,-2,2,-2,2>, KRange<FullDomain,0,0> >(),

StencilStage<FluxStage, IJRange<cComplete,-2,1,-2,1>, KRange<FullDomain,0,0> >(),

StencilStage<RXStage, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(),

StencilStage<LimitFluxStage, IJRange<cIndented,-1,0,-1,0>, KRange<FullDomain,0,0> >(),

StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >()

)

)

)

Page 11: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 11

Benchmarks

•  CPU performance was measured on a single socket of Piz Daint •  Sandy Bridge E5-2670 @ 2.60GHz

•  With hyperthreading

•  GPU performance was measured on my Windows PC

•  Tesla K20c (roughly 10-20% slower than a K20x)

•  ECC on

•  CUDA 5.5

•  Single node measurements on a 128x128 data set

Page 12: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 12

Benchmark – Caching vs. no Caching

1.2

1.2

1.3

1.4

1.7

1.3

1.7

1.5

1.6

1.6

1.5

1.6

3.1

0.9

1.8

1.5

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

HorizontalDiffusionU

HorizontalDiffusionV

AdvectionPDPrepareStepWConRho

HorizontalDiffusionWPP

VerticalDiffusionT

VerticalDiffusionUVW

VerticalDiffusionTracers2

VerticalAdvectionPPTP

AdvectionPDX

VerticalAdvectionUVW

AdvectionPDY

HorizontalDiffusionTracers

AdvectionPDZ

FastWavesUV

FastWavesWPPTP

Total

Caching Speedup

Caching improves the GPU performance by a factor 1.5x

Page 13: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 13

Benchmark – CPU vs. GPU

2.1 2.8

1.8 5.1

4.7 5.5

2.0 7.2

5.9 1.5

4.2 2.6

4.1 2.6

4.4 2.6

3.1

0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0

VerticalDiffusionTracers2 VerticalDiffusionUVW

VerticalAdvectionPPTP VerticalDiffusionT

HorizontalAdvectionPPTP HorizontalAdvectionUV VerticalAdvectionUVW SaturationAdjustment

HorizontalAdvectionWWCFastWavesDivergence

AdvectionPDX HorizontalDiffusionTracers

AdvectionPDY FastWavesUV AdvectionPDZ

FastWavesWPPTP Total Dycore

Speedup CPU vs. GPU

A tesla K20c shows a 3.1x speedup over a Sandy Bridge socket

Page 14: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 14

Next Steps

Ideally there are two development branches 1.  Continuous development

2.  Next generation

Continuous development:

•  Maintenance of current library

•  Usability improvements

•  Performance improvements

Next Generation:

•  Incremental development based on current library

•  Extend the scope of the library so that it works for other applications

Page 15: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 15

Next Steps – Continuous Development

Usability improvements: •  Remove FlatCoordinates and TerrainCoordinates?

•  Specify for every stage parameter if it is an input or an output

•  More concise caching syntax

•  More concise temporary buffer syntax

•  Allow writes to center position only

Performance improvements:

•  IJK caches

•  Additional CUDA tuning options (read-only caching / launch configuration)

2 known inconsistencies: •  Boundaries (Apply vs. IJRange) •  Temporary field scope (sweep vs. stencil)

Page 16: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 16

Next Steps – Next Generation

Support other applications: •  E.g. ICON

Extensions for regular grids:

•  Support boundary conditions in all 3 dimensions

•  Support different stencils at the domain boundaries in all 3 dimensions

•  Support flexible parallelization e.g. full 3D, sequential K and parallel IJ etc…

Extensions for grids on a sphere:

•  Support for block structured grids

•  Support indirect addressing and neighbor iteration

Page 17: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 17

Surprise

•  SteLLa means “Stencil Loop Language”

Page 18: HP2C Final Meeting Stencil Library and Dycore Rewrite · Supercomputing Systems AG Phone +41 43 456 16 00 Technopark 1 Fax +41 43 456 16 10 8005 Zürich HP2C Final Meeting Stencil

Zürich 16.07.13 © by Supercomputing Systems AG 18

Goodbye