Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing...

18
Sandia National Laboratories is a multi mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. Roshan Quadros, Brian Carnes, Madison Brewer, and Byron Hanks DOE Centers of Excellence Performance Portability Meeting Aug 22-24, 2017 Denver Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE

Transcript of Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing...

Page 1: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

Photos placed in horizontal position with even amount

of white spacebetween photos

and header

Photos placed in horizontal position

with even amount of white space

between photos and header

Sandia National Laboratories is a multi mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

RoshanQuadros,BrianCarnes,MadisonBrewer,andByronHanks

DOECentersofExcellencePerformancePortabilityMeetingAug22-24,2017Denver

ScalingPost-MeshingOperationsonNextGenerationPlatforms

SAND No: SAND2017-8825 PE

Page 2: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

Parallel decomposition,refinement, smoothing, & projection

CUBIT

Percept

CUBITandPerceptcombinedworkflowforgeneratinglargemeshes

Initial Mesh Generation: advanced meshing

algorithms for Tri, Quad, Tet, and Hex mesh

generation

CUBITprovidesextensivecapabilitiesforpreparinggeometryandgeneratinganinitialmeshhttps://cubit.sandia.gov

Page 3: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

Post-MeshingOperations:Refinement->Projection->Smoothing

Mesh refinement workflow:• Generate refined meshes in

memory from an existing mesh

• Project new boundary nodes onto geometry

• Smooth interior mesh nodes to improve mesh quality

Supported mesh types:• block-structured• unstructured• hybrid

Page 4: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

Post-MeshingOperator#1:Projection

Page 5: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

GeometryKernel:OpenNURBS

§ OpenSourcefromwww.rhino3d.com

§ Lightweight&easytoPort

§ Queryoperationsarethreadsafe

§ Supportsvariouscurves&surfacedefinitions

Page 6: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

ParallelKernel:ProjectPointsonaNURBSSurface

Page 7: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

ProgrammingModel:MPI+Kokkos (OpenMP)Threelevelsofparallelismisrequired:

1) DistributedmemoryparallelismviaMPI

2) SharedmemorythreadlevelparallelismontheMICdeviceusingKokkoswithOpenMP runtime

3) VectorizationforVectorProcessingUnit(VPU)

Hardware: Trinity testbed containing 72 core KNL

Image courtesy of http://www.hotchips.org

Page 8: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

ProgrammingModel:MPI+Kokkos (OpenMP){ // MPI distributes data to n processes

ON_Surface *surface;ON_3dPoint *buff_p;MPI_Comm_size( MPI_COMM_WORLD, &numtasks);…if ( rank == 0 ){

for( int r=1; r < numtasks; r++){…ierr = MPI_Send ( p_start, num_pnts*3, MPI_DOUBLE, r, Tag, MPI_COMM_WORLD);

}} else{

ierr = MPI_Recv ( buff_p, num_pnts*3, MPI_DOUBLE, MPI_ANY_SOURCE, Tag, MPI_COMM_WORLD, &status );

}projection_method( surface, buff_p, num_pnts );

}void projection_method( const ON_Surface *surface, ON_3dPoint *buff_p, const int num_pnts ){

…// Kokkos handles thread level parallelismKokkos::parallel_for ( num_pnts, KOKKOS_LAMBDA(const int i ){

...// OpenNURBS API for projecting a point on a surface double u, v;surface->GetClosestPoint( buff_p[i], &u, &v );ON_3dPoint projected_pt = surface->PointAt( u, v );

});}

Page 9: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

PointProjectionScalingResultsonKNL

933.071108.85

10

20

40

80

160

320

640

1 10 100 1000

Time(sec)

Threadsperrank

mpi_1 mpi_2 mpi_4 mpi_8 mpi_16 mpi_32 mpi_64

Speedup (MPI 32 ranks) = 28X Speedup (MPI + Kokkos) = 77X

Page 10: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

ThreadAffinityonKNL

1

10

100

1000

1 10 100 1000Threads per KNL

Tim

e (s

ec)

Blue - ScatterRed - Compact

Page 11: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

Post-meshingOperator#2:RefinementStructured grid refinement was a relatively simple process§ Removed pointers and references to main memory objects§ Replaced structured grid data structures with Kokkos views§ Algorithm :

§ Allocate new mesh (view)§ For each block (in serial)

§ For each element in old mesh (in parallel)§ Interpolate coordinates for new mesh§ Transfer existing node coordinates

Page 12: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

ScalabilityofRefinement

Sequence of meshes• 0.5M, 4M, 33M elements• multiple blocks (12)

Compare• serial• GPU• threading (OpenMP)

Better scalability with increasing problem size

Blocks were refined sequentially

Page 13: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

Post-MeshingOperator#3:SmoothingSmoothingstructuredgridswasmorecomplexprocess§ computeglobalqualitymetricandgradient§ nonlinearCGoptimizationwithlinesearch§ communicationbetweenstructuredblocks

(gradients)Example:smoothingalargecubewithinitialrandomlyperturbednodes

Untangling: invalid elements Smoothing: metric/gradientUntangling: metric/gradient

Page 14: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

§ Abstractmeshinterfaceforbothstructuredandunstructured§ highcostofkernelcalls§ sub-optimalinterfacetostructuredgrid§ functionsnotsafeforthreadsorGPU

§ Suggestionforabstractinterfaces:§ designedwithsharedmemory(Kokkos)fromstart§ otherwise,optforspecializedinterfaces.

SmoothingPitfall:AbstractInterfaces

StructuredMesh

UnstructuredMesh

Smoother<MeshBase>• dependsonmany

virtualfunctionsinMeshType class

MeshBase

Page 15: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

§ Totalmetric(longdouble)wassumofindividualmetrics(double)§ ProblematicforGPUbuildsasCUDAonlyusesupto64bits(double

precision)forfloatingpointrepresentation.§ Causesillegalmemoryaccess:

§ cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered

§ CertainSTLclassesandfunctionsareproblematiconGPUs§ array,unorderedmap,vector,set,…

§ array=>Kokkos::Array§ unorderedmap =>Kokkos::UnorderedMap

§ std::maxwasrewrittenlocally

SmoothingPitfall:HardwareDifferences

Page 16: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

§ MemorylayoutinitiallycausedverypoorperformanceontheGPU§ Smoothingtestcase:perturbedcubefollowedbymeshsmoothing

SmoothingPitfall:MemoryLayout

Page 17: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

FutureWork§ Projection:

• StudyhighwatermarkofmemoryusagefordifferentcombinationofMPIranksandThreadsperrank

§ UnstructuredRefinement:• Morecomplicatedalgorithmthanstructured• Example:determinenumberofnewnodes

• useKokkosmaptostoreneedednodes• valuesstoredforeverymeshedge/face• mapalsousedtointerpolatenewcoordinates

§ Smoothing:• Investigateothersmoothingalgorithms(ellipticsmoother)

Page 18: Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing Operations on Next Generation Platforms SAND No: SAND2017-8825 PE. Parallel decomposition,

ThankYou