Advanced OpenMP Constructs, Tuning, and Tools Poster Print ... · Poster Print Size: This poster...

Post on 14-Sep-2020

5 views 0 download

Transcript of Advanced OpenMP Constructs, Tuning, and Tools Poster Print ... · Poster Print Size: This poster...

PosterPrintSize:Thispostertemplateis48”highby36”wide.Itcanbeusedtoprintanyposterwitha4:3aspectraAo.

Placeholders:ThevariouselementsincludedinthisposterareonesweoCenseeinmedical,research,andscienAficposters.Feelfreetoedit,move,add,anddeleteitems,orchangethelayouttosuityourneeds.Alwayscheckwithyourconferenceorganizerforspecificrequirements.

ImageQuality:YoucanplacedigitalphotosorlogoartinyourposterfilebyselecAngtheInsert,Picturecommand,orbyusingstandardcopy&paste.Forbestresults,allgraphicelementsshouldbeatleast150-200pixelsperinchintheirfinalprintedsize.Forinstance,a1600x1200pixelphotowillusuallylookfineupto8“-10”wideonyourprintedposter.Topreviewtheprintqualityofimages,selectamagnificaAonof100%whenpreviewingyourposter.Thiswillgiveyouagoodideaofwhatitwilllooklikeinprint.Ifyouarelayingoutalargeposterandusinghalf-scaledimensions,besuretopreviewyourgraphicsat200%toseethemattheirfinalprintedsize.Pleasenotethatgraphicsfromwebsites(suchasthelogoonyourhospital'soruniversity'shomepage)willonlybe72dpiandnotsuitableforprinAng.

[Thissidebarareadoesnotprint.]

ChangeColorTheme:Thistemplateisdesignedtousethebuilt-incolorthemesinthenewerversionsofPowerPoint.Tochangethecolortheme,selecttheDesigntab,thenselecttheColorsdrop-downlist.Thedefaultcolorthemeforthistemplateis“Office”,soyoucanalwaysreturntothataCertryingsomeofthealternaAves.

PrinAngYourPoster:Onceyourposterfileisready,visitwww.genigraphics.comtoorderahigh-quality,affordableposterprint.EveryorderreceivesafreedesignreviewandwecandeliverasfastasnextbusinessdaywithintheUSandCanada.Genigraphics®hasbeenproducingoutputfromPowerPoint®longerthananyoneintheindustry;daAngbacktowhenwehelpedMicrosoC®designthePowerPointsoCware.USandCanada:1-800-790-4001Email:info@genigraphics.com

[Thissidebarareadoesnotprint.]

AdvancedOpenMPConstructs,Tuning,andToolsatNERSC

AhanaRoyChoudhury1,Yun(Helen)He2,andAliceKoniges21Computer&InformaAonSciencesDepartment,UniversityofAlabamaatBirmingham

2NERSC,LawrenceBerkeleyNaAonalLaboratory,Berkeley,CA

1.  hnps://compuAng.llnl.gov/tutorials/openMP/2.  hnp://www.eweek.com/c/a/ApplicaAon-Development/Oracle-and-Java-7-The-Top-10-Developer-Features-6261453.   UsingOpenMPbyBarbaraChapman,GabrieleJost,RuudvanderPas,TheMITPress,MIT,2008.4.  hnps://soCware.intel.com/en-us/get-started-with-advisor-threading-linux5.  hnps://soCware.intel.com/en-us/intel-inspector-xe6.  hnps://www.olcf.ornl.gov/wp-content/uploads/2013/02/Cray_Reveal-HP1.pdf

7.  hnp://www.nersc.gov/users/computaAonal-systems/cori/applicaAon-porAng-and-performance/improving-openmp-scaling/

8.  hnps://soCware.intel.com/en-us/arAcles/avoiding-and-idenAfying-false-sharing-among-threads9.  hnps://soCware.intel.com/en-us/arAcles/finding-your-memory-access-performance-bonlenecks10. hnps://www.nersc.gov/assets/Uploads/Nested-OpenMP-NUG-20151008.pdf11. hnp://www.nersc.gov/users/training/events/advanced-openmp-training-february-4-2016/12. hnps://drive.google.com/a/lbl.gov/file/d/0B9D5EnxRqcaZalR5WEh6bkhxNGs/view

References

•  Whichpartofthecodecanbesafelyparallelized?•  Howtodetectandavoiddataraces?•  Howtodetectandavoidmemoryleaks?•  HowcantoolsbeusedtogetsuggesAonsonvariablescopeandOpenMPcompilerdirecAves?•  HowtodetecttopAme-consumingloops?•  Howtodetectandremovefalsesharing?•  Ensuringdesirableprocessandthreadaffinity•  UsingNestedOpenMP

QuesNonsandChallenges

•  Falsesharingoccurswhenthreadsondifferentprocessorsanempttomodifyvariablesthatresideonthesamecacheline.

•  ItcausesperformancedegradaAonduetocoherenceissuesandshouldbeavoided.•  IntelVTuneAmplifier isaperformanceanalysis tool thathelps toanalyze thealgorithmand idenAfy

whereandhowapplicaAonscanbenefitfromavailablehardwareresources.•  TodetectfalsesharingusingVTuneAmplifier,wewroteacodethatcausesfalsesharing,detectedthe

problemand then resolved itusingpadding.Wealsonoted that somecompileropAmizaAonchoicesremovefalsesharing.

BeforeFalseSharingRemovalARerfalseSharingRemoval

Abstract

•  Thread affinity binds each process or thread to run on a specific subset of processors, to take advantage of memory locality.

•  Improper process/thread affinity could slow down code performance significantly.

UsingtheOMP_PROC_BINDenvironmentvariable

ProcessandThreadAffinity

In order to achieve good performance in OpenMP codes, it is important to use advanced OpenMPconceptslikeprocessandthreadaffinityandnestedOpenMP.Itisequallycrucialtoavoidfalsesharingandover-subscripAon.Wehaveexploredhow tools canbeused to facilitate theprocessof tuningOpenMPcodesaswellasworkedonthreadandprocessaffinityandnestedOpenMP.Futurework involvesusingnestedOpenMPtospeedupfullapplicaAons.

Conclusions

NERSC’snextgeneraAonsupercomputersystems,e.g.,CoriPhase2withKNL (IntelKnightsLanding)architecture,havealargenumberofcorespernode,so,itisimportanttoconsideradvancedOpenMPconceptsinordertoachieveopAmalperformanceonsuchsystems.In thisproject,wehavewrinenand implementedcode snippetsandused themto testa varietyoftools for improving OpenMP performance and detailed their usage on various NERSC systemsincludingKNL.Wehaveexplored thedetecAonof issues such as false sharing anddata racesusingtools and how they can be resolved. We have also explored advanced OpenMP concepts such asprocessandthreadaffinityandnestedOpenMP.

ImprovingPerformanceofSampleCodesusingTools

IdenAfyTopTimeConsumingLoopsusingSurveyReport

InsertAnnotaAonsforParallelRegionsandRecompileCode

BasedontheSuitabilityReport,ModifytheCodebyParallelizingtheLoop

UseSuitabilityReporttoPredicttheSpeed-upoftheApplicaAonbasedonAnnotaAons

CheckDependenciestoRemoveDataSharingProblems

ForkandJoinModelWhentouseNestedOpenMP Thread Affinity for Nested OpenMP

NestedOpenMP

OMP_NUM_THREADS=4,3OMP_PROC_BIND=spread,close OMP_PROC_BIND=spread,spread

Level 2 threadsin same teambindtothreadsonsamecore

Level 2 threadsin same teamevenly spreadon threads ondifferentcores

Scopingtheloop

ResolveunresolvedVariables

0

20

40

60

80

100

120

140

160

2 4 8 16 32 64

Time(s)

#ofthreads

AdvisorEsNmateandActualWallClockTime

EsAmatedAme

ActualTime

•  IntelAdvisorhelpstoensurethatFortran,CandC++applicaAonstakefullperformanceadvantageoftoday'sprocessors.

•  ThreadingAdvisorisathreadingdesignandprototypingtoolthathelpstoanalyze,design,tune,andcheckthreadingdesignopAonswithoutdisrupAngnormaldevelopment.

ThreadingDesignusingIntelAdvisor

Comparison between Advisor esAmated and measuredwall clock Ames. The % variaAon range is 3-15% andincreaseswithincreasingnumbersofthreads.

ParallelizingCodesusingCrayReveal

ThreadingandMemoryErrorDetecNonusingIntelInspector

•  RevealisatooldevelopedbyCraythatispartoftheCrayPerCoolssoCwarepackage.•  HelpstoidenAfytopAme-consumingloops,dependenciesandvectorizaAon.•  Loop scope analysis provides variable scope and compiler direcAve suggesAons for inserAngOpenMP.

•  Intel Inspector is adynamicmemoryand threadingerror checking tool forusersdeveloping serialandmulAthreadedapplicaAons.

ExampleforDetecNngDataRaceExampleforDetecNngMemoryProblems

ThreadingDesignusingIntelAdvisor

ImprovingPerformanceofSampleCodesusingTools(conNnued)

Thislinecausesfalsesharing

Toachievemorefine-grainedthreadparallelism:•  When the top level OpenMP loop does not use all

availablethreads•  WhenmulAplelevelsofOpenMPloopsarenoteasily

collapsed•  ForcertaincomputaAonintensivekernels•  FormulA-threadedMKL(IntelMathKernelLibrary)