Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing...

Post on 07-Aug-2020

0 views 0 download

Transcript of Scaling Post-Meshing Operations on Next Generation Platforms€¦ · Scaling Post-Meshing...

Photos placed in horizontal position with even amount

of white spacebetween photos

and header

Photos placed in horizontal position

with even amount of white space

between photos and header

Sandia National Laboratories is a multi mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

RoshanQuadros,BrianCarnes,MadisonBrewer,andByronHanks

DOECentersofExcellencePerformancePortabilityMeetingAug22-24,2017Denver

ScalingPost-MeshingOperationsonNextGenerationPlatforms

SAND No: SAND2017-8825 PE

Parallel decomposition,refinement, smoothing, & projection

CUBIT

Percept

CUBITandPerceptcombinedworkflowforgeneratinglargemeshes

Initial Mesh Generation: advanced meshing

algorithms for Tri, Quad, Tet, and Hex mesh

generation

CUBITprovidesextensivecapabilitiesforpreparinggeometryandgeneratinganinitialmeshhttps://cubit.sandia.gov

Post-MeshingOperations:Refinement->Projection->Smoothing

Mesh refinement workflow:• Generate refined meshes in

memory from an existing mesh

• Project new boundary nodes onto geometry

• Smooth interior mesh nodes to improve mesh quality

Supported mesh types:• block-structured• unstructured• hybrid

Post-MeshingOperator#1:Projection

GeometryKernel:OpenNURBS

§ OpenSourcefromwww.rhino3d.com

§ Lightweight&easytoPort

§ Queryoperationsarethreadsafe

§ Supportsvariouscurves&surfacedefinitions

ParallelKernel:ProjectPointsonaNURBSSurface

ProgrammingModel:MPI+Kokkos (OpenMP)Threelevelsofparallelismisrequired:

1) DistributedmemoryparallelismviaMPI

2) SharedmemorythreadlevelparallelismontheMICdeviceusingKokkoswithOpenMP runtime

3) VectorizationforVectorProcessingUnit(VPU)

Hardware: Trinity testbed containing 72 core KNL

Image courtesy of http://www.hotchips.org

ProgrammingModel:MPI+Kokkos (OpenMP){ // MPI distributes data to n processes

ON_Surface *surface;ON_3dPoint *buff_p;MPI_Comm_size( MPI_COMM_WORLD, &numtasks);…if ( rank == 0 ){

for( int r=1; r < numtasks; r++){…ierr = MPI_Send ( p_start, num_pnts*3, MPI_DOUBLE, r, Tag, MPI_COMM_WORLD);

}} else{

ierr = MPI_Recv ( buff_p, num_pnts*3, MPI_DOUBLE, MPI_ANY_SOURCE, Tag, MPI_COMM_WORLD, &status );

}projection_method( surface, buff_p, num_pnts );

}void projection_method( const ON_Surface *surface, ON_3dPoint *buff_p, const int num_pnts ){

…// Kokkos handles thread level parallelismKokkos::parallel_for ( num_pnts, KOKKOS_LAMBDA(const int i ){

...// OpenNURBS API for projecting a point on a surface double u, v;surface->GetClosestPoint( buff_p[i], &u, &v );ON_3dPoint projected_pt = surface->PointAt( u, v );

});}

PointProjectionScalingResultsonKNL

933.071108.85

10

20

40

80

160

320

640

1 10 100 1000

Time(sec)

Threadsperrank

mpi_1 mpi_2 mpi_4 mpi_8 mpi_16 mpi_32 mpi_64

Speedup (MPI 32 ranks) = 28X Speedup (MPI + Kokkos) = 77X

ThreadAffinityonKNL

1

10

100

1000

1 10 100 1000Threads per KNL

Tim

e (s

ec)

Blue - ScatterRed - Compact

Post-meshingOperator#2:RefinementStructured grid refinement was a relatively simple process§ Removed pointers and references to main memory objects§ Replaced structured grid data structures with Kokkos views§ Algorithm :

§ Allocate new mesh (view)§ For each block (in serial)

§ For each element in old mesh (in parallel)§ Interpolate coordinates for new mesh§ Transfer existing node coordinates

ScalabilityofRefinement

Sequence of meshes• 0.5M, 4M, 33M elements• multiple blocks (12)

Compare• serial• GPU• threading (OpenMP)

Better scalability with increasing problem size

Blocks were refined sequentially

Post-MeshingOperator#3:SmoothingSmoothingstructuredgridswasmorecomplexprocess§ computeglobalqualitymetricandgradient§ nonlinearCGoptimizationwithlinesearch§ communicationbetweenstructuredblocks

(gradients)Example:smoothingalargecubewithinitialrandomlyperturbednodes

Untangling: invalid elements Smoothing: metric/gradientUntangling: metric/gradient

§ Abstractmeshinterfaceforbothstructuredandunstructured§ highcostofkernelcalls§ sub-optimalinterfacetostructuredgrid§ functionsnotsafeforthreadsorGPU

§ Suggestionforabstractinterfaces:§ designedwithsharedmemory(Kokkos)fromstart§ otherwise,optforspecializedinterfaces.

SmoothingPitfall:AbstractInterfaces

StructuredMesh

UnstructuredMesh

Smoother<MeshBase>• dependsonmany

virtualfunctionsinMeshType class

MeshBase

§ Totalmetric(longdouble)wassumofindividualmetrics(double)§ ProblematicforGPUbuildsasCUDAonlyusesupto64bits(double

precision)forfloatingpointrepresentation.§ Causesillegalmemoryaccess:

§ cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered

§ CertainSTLclassesandfunctionsareproblematiconGPUs§ array,unorderedmap,vector,set,…

§ array=>Kokkos::Array§ unorderedmap =>Kokkos::UnorderedMap

§ std::maxwasrewrittenlocally

SmoothingPitfall:HardwareDifferences

§ MemorylayoutinitiallycausedverypoorperformanceontheGPU§ Smoothingtestcase:perturbedcubefollowedbymeshsmoothing

SmoothingPitfall:MemoryLayout

FutureWork§ Projection:

• StudyhighwatermarkofmemoryusagefordifferentcombinationofMPIranksandThreadsperrank

§ UnstructuredRefinement:• Morecomplicatedalgorithmthanstructured• Example:determinenumberofnewnodes

• useKokkosmaptostoreneedednodes• valuesstoredforeverymeshedge/face• mapalsousedtointerpolatenewcoordinates

§ Smoothing:• Investigateothersmoothingalgorithms(ellipticsmoother)

ThankYou