Advanced OpenMP Constructs, Tuning, and Tools Poster Print ... · Poster Print Size: This poster...

1
Advanced OpenMP Constructs, Tuning, and Tools at NERSC Ahana Roy Choudhury 1 , Yun (Helen) He 2 , and Alice Koniges 2 1 Computer & InformaAon Sciences Department, University of Alabama at Birmingham 2 NERSC, Lawrence Berkeley NaAonal Laboratory, Berkeley, CA 1. hnps://compuAng.llnl.gov/tutorials/openMP/ 2. hnp://www.eweek.com/c/a/ApplicaAon-Development/Oracle-and-Java-7-The-Top-10-Developer-Features-626145 3. Using OpenMP by Barbara Chapman, Gabriele Jost, Ruud van der Pas, The MIT Press, MIT, 2008. 4. hnps://soCware.intel.com/en-us/get-started-with-advisor-threading-linux 5. hnps://soCware.intel.com/en-us/intel-inspector-xe 6. hnps://www.olcf.ornl.gov/wp-content/uploads/2013/02/Cray_Reveal-HP1.pdf 7. hnp://www.nersc.gov/users/computaAonal-systems/cori/applicaAon-porAng-and-performance/improving-openmp- scaling/ 8. hnps://soCware.intel.com/en-us/arAcles/avoiding-and-idenAfying-false-sharing-among-threads 9. hnps://soCware.intel.com/en-us/arAcles/finding-your-memory-access-performance-bonlenecks 10. hnps://www.nersc.gov/assets/Uploads/Nested-OpenMP-NUG-20151008.pdf 11. hnp://www.nersc.gov/users/training/events/advanced-openmp-training-february-4-2016/ 12. hnps://drive.google.com/a/lbl.gov/file/d/0B9D5EnxRqcaZalR5WEh6bkhxNGs/view References Which part of the code can be safely parallelized? How to detect and avoid data races? How to detect and avoid memory leaks? How can tools be used to get suggesAons on variable scope and OpenMP compiler direcAves? How to detect top Ame-consuming loops? How to detect and remove false sharing? Ensuring desirable process and thread affinity Using Nested OpenMP QuesNons and Challenges False sharing occurs when threads on different processors anempt to modify variables that reside on the same cache line. It causes performance degradaAon due to coherence issues and should be avoided. Intel VTune Amplifier is a performance analysis tool that helps to analyze the algorithm and idenAfy where and how applicaAons can benefit from available hardware resources. To detect false sharing using VTune Amplifier, we wrote a code that causes false sharing, detected the problem and then resolved it using padding. We also noted that some compiler opAmizaAon choices remove false sharing. Before False Sharing Removal ARer false Sharing Removal Abstract Thread affinity binds each process or thread to run on a specific subset of processors, to take advantage of memory locality. Improper process/thread affinity could slow down code performance significantly. Using the OMP_PROC_BIND environment variable Process and Thread Affinity In order to achieve good performance in OpenMP codes, it is important to use advanced OpenMP concepts like process and thread affinity and nested OpenMP. It is equally crucial to avoid false sharing and over-subscripAon. We have explored how tools can be used to facilitate the process of tuning OpenMP codes as well as worked on thread and process affinity and nested OpenMP. Future work involves using nested OpenMP to speed up full applicaAons. Conclusions NERSC’s next generaAon supercomputer systems, e.g., Cori Phase 2 with KNL (Intel Knights Landing) architecture, have a large number of cores per node, so, it is important to consider advanced OpenMP concepts in order to achieve opAmal performance on such systems. In this project, we have wrinen and implemented code snippets and used them to test a variety of tools for improving OpenMP performance and detailed their usage on various NERSC systems including KNL. We have explored the detecAon of issues such as false sharing and data races using tools and how they can be resolved. We have also explored advanced OpenMP concepts such as process and thread affinity and nested OpenMP. Improving Performance of Sample Codes using Tools IdenAfy Top Time Consuming Loops using Survey Report Insert AnnotaAons for Parallel Regions and Recompile Code Based on the Suitability Report, Modify the Code by Parallelizing the Loop Use Suitability Report to Predict the Speed- up of the ApplicaAon based on AnnotaAons Check Dependencies to Remove Data Sharing Problems Fork and Join Model When to use Nested OpenMP Thread Affinity for Nested OpenMP Nested OpenMP OMP_NUM_THREADS=4,3 OMP_PROC_BIND=spread,close OMP_PROC_BIND=spread,spread Level 2 threads in same team bind to threads on same core Level 2 threads in same team evenly spread on threads on different cores Scoping the loop Resolve unresolved Variables 0 20 40 60 80 100 120 140 160 2 4 8 16 32 64 Time(s) # of threads Advisor EsNmate and Actual Wall Clock Time EsAmated Ame Actual Time Intel Advisor helps to ensure that Fortran, C and C++ applicaAons take full performance advantage of today's processors. Threading Advisor is a threading design and prototyping tool that helps to analyze, design, tune, and check threading design opAons without disrupAng normal development. Threading Design using Intel Advisor Comparison between Advisor esAmated and measured wall clock Ames. The % variaAon range is 3-15% and increases with increasing numbers of threads. Parallelizing Codes using Cray Reveal Threading and Memory Error DetecNon using Intel Inspector Reveal is a tool developed by Cray that is part of the Cray PerCools soCware package. Helps to idenAfy top Ame-consuming loops, dependencies and vectorizaAon. Loop scope analysis provides variable scope and compiler direcAve suggesAons for inserAng OpenMP. Intel Inspector is a dynamic memory and threading error checking tool for users developing serial and mulAthreaded applicaAons. Example for DetecNng Data Race Example for DetecNng Memory Problems Threading Design using Intel Advisor Improving Performance of Sample Codes using Tools (conNnued) This line causes false sharing To achieve more fine-grained thread parallelism: When the top level OpenMP loop does not use all available threads When mulAple levels of OpenMP loops are not easily collapsed For certain computaAon intensive kernels For mulA-threaded MKL (Intel Math Kernel Library)

Transcript of Advanced OpenMP Constructs, Tuning, and Tools Poster Print ... · Poster Print Size: This poster...

Page 1: Advanced OpenMP Constructs, Tuning, and Tools Poster Print ... · Poster Print Size: This poster template is 48” high by 36” wide. It can be used to print any poster with a 4:3

PosterPrintSize:Thispostertemplateis48”highby36”wide.Itcanbeusedtoprintanyposterwitha4:3aspectraAo.

Placeholders:ThevariouselementsincludedinthisposterareonesweoCenseeinmedical,research,andscienAficposters.Feelfreetoedit,move,add,anddeleteitems,orchangethelayouttosuityourneeds.Alwayscheckwithyourconferenceorganizerforspecificrequirements.

ImageQuality:YoucanplacedigitalphotosorlogoartinyourposterfilebyselecAngtheInsert,Picturecommand,orbyusingstandardcopy&paste.Forbestresults,allgraphicelementsshouldbeatleast150-200pixelsperinchintheirfinalprintedsize.Forinstance,a1600x1200pixelphotowillusuallylookfineupto8“-10”wideonyourprintedposter.Topreviewtheprintqualityofimages,selectamagnificaAonof100%whenpreviewingyourposter.Thiswillgiveyouagoodideaofwhatitwilllooklikeinprint.Ifyouarelayingoutalargeposterandusinghalf-scaledimensions,besuretopreviewyourgraphicsat200%toseethemattheirfinalprintedsize.Pleasenotethatgraphicsfromwebsites(suchasthelogoonyourhospital'soruniversity'shomepage)willonlybe72dpiandnotsuitableforprinAng.

[Thissidebarareadoesnotprint.]

ChangeColorTheme:Thistemplateisdesignedtousethebuilt-incolorthemesinthenewerversionsofPowerPoint.Tochangethecolortheme,selecttheDesigntab,thenselecttheColorsdrop-downlist.Thedefaultcolorthemeforthistemplateis“Office”,soyoucanalwaysreturntothataCertryingsomeofthealternaAves.

PrinAngYourPoster:Onceyourposterfileisready,visitwww.genigraphics.comtoorderahigh-quality,affordableposterprint.EveryorderreceivesafreedesignreviewandwecandeliverasfastasnextbusinessdaywithintheUSandCanada.Genigraphics®hasbeenproducingoutputfromPowerPoint®longerthananyoneintheindustry;daAngbacktowhenwehelpedMicrosoC®designthePowerPointsoCware.USandCanada:1-800-790-4001Email:[email protected]

[Thissidebarareadoesnotprint.]

AdvancedOpenMPConstructs,Tuning,andToolsatNERSC

AhanaRoyChoudhury1,Yun(Helen)He2,andAliceKoniges21Computer&InformaAonSciencesDepartment,UniversityofAlabamaatBirmingham

2NERSC,LawrenceBerkeleyNaAonalLaboratory,Berkeley,CA

1.  hnps://compuAng.llnl.gov/tutorials/openMP/2.  hnp://www.eweek.com/c/a/ApplicaAon-Development/Oracle-and-Java-7-The-Top-10-Developer-Features-6261453.   UsingOpenMPbyBarbaraChapman,GabrieleJost,RuudvanderPas,TheMITPress,MIT,2008.4.  hnps://soCware.intel.com/en-us/get-started-with-advisor-threading-linux5.  hnps://soCware.intel.com/en-us/intel-inspector-xe6.  hnps://www.olcf.ornl.gov/wp-content/uploads/2013/02/Cray_Reveal-HP1.pdf

7.  hnp://www.nersc.gov/users/computaAonal-systems/cori/applicaAon-porAng-and-performance/improving-openmp-scaling/

8.  hnps://soCware.intel.com/en-us/arAcles/avoiding-and-idenAfying-false-sharing-among-threads9.  hnps://soCware.intel.com/en-us/arAcles/finding-your-memory-access-performance-bonlenecks10. hnps://www.nersc.gov/assets/Uploads/Nested-OpenMP-NUG-20151008.pdf11. hnp://www.nersc.gov/users/training/events/advanced-openmp-training-february-4-2016/12. hnps://drive.google.com/a/lbl.gov/file/d/0B9D5EnxRqcaZalR5WEh6bkhxNGs/view

References

•  Whichpartofthecodecanbesafelyparallelized?•  Howtodetectandavoiddataraces?•  Howtodetectandavoidmemoryleaks?•  HowcantoolsbeusedtogetsuggesAonsonvariablescopeandOpenMPcompilerdirecAves?•  HowtodetecttopAme-consumingloops?•  Howtodetectandremovefalsesharing?•  Ensuringdesirableprocessandthreadaffinity•  UsingNestedOpenMP

QuesNonsandChallenges

•  Falsesharingoccurswhenthreadsondifferentprocessorsanempttomodifyvariablesthatresideonthesamecacheline.

•  ItcausesperformancedegradaAonduetocoherenceissuesandshouldbeavoided.•  IntelVTuneAmplifier isaperformanceanalysis tool thathelps toanalyze thealgorithmand idenAfy

whereandhowapplicaAonscanbenefitfromavailablehardwareresources.•  TodetectfalsesharingusingVTuneAmplifier,wewroteacodethatcausesfalsesharing,detectedthe

problemand then resolved itusingpadding.Wealsonoted that somecompileropAmizaAonchoicesremovefalsesharing.

BeforeFalseSharingRemovalARerfalseSharingRemoval

Abstract

•  Thread affinity binds each process or thread to run on a specific subset of processors, to take advantage of memory locality.

•  Improper process/thread affinity could slow down code performance significantly.

UsingtheOMP_PROC_BINDenvironmentvariable

ProcessandThreadAffinity

In order to achieve good performance in OpenMP codes, it is important to use advanced OpenMPconceptslikeprocessandthreadaffinityandnestedOpenMP.Itisequallycrucialtoavoidfalsesharingandover-subscripAon.Wehaveexploredhow tools canbeused to facilitate theprocessof tuningOpenMPcodesaswellasworkedonthreadandprocessaffinityandnestedOpenMP.Futurework involvesusingnestedOpenMPtospeedupfullapplicaAons.

Conclusions

NERSC’snextgeneraAonsupercomputersystems,e.g.,CoriPhase2withKNL (IntelKnightsLanding)architecture,havealargenumberofcorespernode,so,itisimportanttoconsideradvancedOpenMPconceptsinordertoachieveopAmalperformanceonsuchsystems.In thisproject,wehavewrinenand implementedcode snippetsandused themto testa varietyoftools for improving OpenMP performance and detailed their usage on various NERSC systemsincludingKNL.Wehaveexplored thedetecAonof issues such as false sharing anddata racesusingtools and how they can be resolved. We have also explored advanced OpenMP concepts such asprocessandthreadaffinityandnestedOpenMP.

ImprovingPerformanceofSampleCodesusingTools

IdenAfyTopTimeConsumingLoopsusingSurveyReport

InsertAnnotaAonsforParallelRegionsandRecompileCode

BasedontheSuitabilityReport,ModifytheCodebyParallelizingtheLoop

UseSuitabilityReporttoPredicttheSpeed-upoftheApplicaAonbasedonAnnotaAons

CheckDependenciestoRemoveDataSharingProblems

ForkandJoinModelWhentouseNestedOpenMP Thread Affinity for Nested OpenMP

NestedOpenMP

OMP_NUM_THREADS=4,3OMP_PROC_BIND=spread,close OMP_PROC_BIND=spread,spread

Level 2 threadsin same teambindtothreadsonsamecore

Level 2 threadsin same teamevenly spreadon threads ondifferentcores

Scopingtheloop

ResolveunresolvedVariables

0

20

40

60

80

100

120

140

160

2 4 8 16 32 64

Time(s)

#ofthreads

AdvisorEsNmateandActualWallClockTime

EsAmatedAme

ActualTime

•  IntelAdvisorhelpstoensurethatFortran,CandC++applicaAonstakefullperformanceadvantageoftoday'sprocessors.

•  ThreadingAdvisorisathreadingdesignandprototypingtoolthathelpstoanalyze,design,tune,andcheckthreadingdesignopAonswithoutdisrupAngnormaldevelopment.

ThreadingDesignusingIntelAdvisor

Comparison between Advisor esAmated and measuredwall clock Ames. The % variaAon range is 3-15% andincreaseswithincreasingnumbersofthreads.

ParallelizingCodesusingCrayReveal

ThreadingandMemoryErrorDetecNonusingIntelInspector

•  RevealisatooldevelopedbyCraythatispartoftheCrayPerCoolssoCwarepackage.•  HelpstoidenAfytopAme-consumingloops,dependenciesandvectorizaAon.•  Loop scope analysis provides variable scope and compiler direcAve suggesAons for inserAngOpenMP.

•  Intel Inspector is adynamicmemoryand threadingerror checking tool forusersdeveloping serialandmulAthreadedapplicaAons.

ExampleforDetecNngDataRaceExampleforDetecNngMemoryProblems

ThreadingDesignusingIntelAdvisor

ImprovingPerformanceofSampleCodesusingTools(conNnued)

Thislinecausesfalsesharing

Toachievemorefine-grainedthreadparallelism:•  When the top level OpenMP loop does not use all

availablethreads•  WhenmulAplelevelsofOpenMPloopsarenoteasily

collapsed•  ForcertaincomputaAonintensivekernels•  FormulA-threadedMKL(IntelMathKernelLibrary)