Lecture 16: Analytical Modeling of Parallel Programs: Metrics and Analysis

CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh
Topics

• Introduction
• Programming on shared memory systems (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Topic Overview

• Introduction
• Performance Metrics for Parallel Systems
  – Execution Time, Overhead, Speedup, Efficiency, Cost
• Amdahl’s Law
• Scalability of Parallel Systems
  – Isoefficiency Metric of Scalability
• Minimum Execution Time and Minimum Cost-Optimal Execution Time
• Asymptotic Analysis of Parallel Programs
• Other Scalability Metrics
  – Scaled speedup, Serial fraction
Analytical Modeling: Sequential Execution Time

• The execution time of a sequential algorithm
  – Asymptotic execution time as a function of input size
    • Identical on any serial platform

Count the number of operations
• Big-O notation
  – O(1), O(N), O(N²), O(N log N), O(N³), …
Parallel Execution Time

• Parallel execution time is a function of:
  – input size
  – number of processors (machine performance)
  – communication parameters of the target platform (network)
• Implications
  – Must analyze a parallel program for a particular target platform
    • Communication characteristics can differ by more than O(1)
  – Parallel program = parallel algorithm + platform
Overhead in Parallel Programs

If using two processors, shouldn’t a program run twice as fast?
– Not all parts of the program are parallelized
– A certain amount of overhead is incurred when running in parallel
Overheads in Parallel Programs

• Interprocess interactions:
  – Communication
    • Data movement
  – Synchronization / contention
• Idling:
  – Load imbalance
  – Synchronization
    • Synchronization itself has overhead
  – Serial components
• Excess computation
  – Computation not performed by the serial version
    • E.g., replicated computation to minimize communication
Topic Overview

• Introduction
• Five Performance Metrics for Parallel Systems
  – Execution Time, Overhead, Speedup, Efficiency, Cost
• Amdahl’s Law
• Scalability of Parallel Systems
  – Isoefficiency Metric of Scalability
• Minimum Execution Time and Minimum Cost-Optimal Execution Time
• Asymptotic Analysis of Parallel Programs
• Other Scalability Metrics
  – Scaled speedup, Serial fraction
Performance Metrics #1: Execution Time

Does a parallel program run faster than its sequential version?
• Serial time TS: time elapsed between the start and the end of serial execution
• Parallel time TP: time elapsed between the first process starting and the last process ending
Performance Metrics #2: Parallel Overhead

What is the cost of enabling parallelism?
• Tall: the total time collectively spent by all the processors
  – Tall = p TP (p is the number of processors)
• TS: serial execution time
• Total parallel overhead To:
  – To = Tall − TS = p TP − TS
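As a quick sanity check on these definitions, here is a minimal Python sketch; the timings below are invented for illustration, not taken from any measurement:

```python
# Total parallel overhead: To = p*TP - TS.
# All timings are hypothetical, chosen only to exercise the formula.

def total_overhead(p, t_parallel, t_serial):
    """Time spent by all processors beyond the useful serial work."""
    return p * t_parallel - t_serial

t_s = 100.0   # serial execution time TS (seconds, made up)
t_p = 30.0    # parallel execution time TP on p processors
p = 4

t_all = p * t_p                      # Tall = 120.0 processor-seconds
t_o = total_overhead(p, t_p, t_s)    # To = 20.0 seconds of overhead
```

A perfectly parallelized program with zero overhead would have t_p = t_s / p and To = 0.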
Performance Metrics #3: Speedup

What is the benefit from increasing parallelism?
• Speedup S = TS / TP
  – The ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements
Performance Metrics: Example

Adding n numbers
• Sequential: Θ(n)
• Using n processing elements:
  – If n is a power of two, done in log n steps by propagating partial sums up a logical binary tree of processors
Performance Metrics: Example – cont’d

• Σ(i..j) denotes the sum of the numbers with consecutive labels from i to j
• Analysis:
  – An addition takes tc
  – A communication takes ts + tw
  – tc and (ts + tw) are constant
• Sequential and parallel time:
  – TS = Θ(n)
  – TP = Θ(log n)
• Speedup:
  – S = Θ(n / log n)
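The Θ(n / log n) speedup can be played out numerically with a toy cost model; the unit costs tc, ts, tw below are hypothetical stand-ins for the constants in the analysis:

```python
import math

def serial_time(n, t_c=1.0):
    """TS: n - 1 additions, each costing t_c."""
    return (n - 1) * t_c

def parallel_time(n, t_c=1.0, t_s=1.0, t_w=1.0):
    """TP: log2(n) tree steps, each one addition plus one message."""
    assert n & (n - 1) == 0, "n must be a power of two"
    steps = int(math.log2(n))
    return steps * (t_c + t_s + t_w)

n = 1024
speedup = serial_time(n) / parallel_time(n)   # grows like n / log n
```

With 1024 numbers the model gives a speedup of about 34, far below the 1024 processors used; efficiency is the topic of a later slide.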
Performance Metrics #3: Speedup

• The yardstick: TS
  – Many serial algorithms are available, each with a different asymptotic execution time
  – The parallelization of those algorithms varies too

http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations
Speedup Example: Sorting

http://en.wikipedia.org/wiki/Sorting_algorithm

• The serial execution time for bubble sort: 150 seconds
• Odd-even parallel bubble sort: 40 seconds
• The speedup: 150/40 = 3.75
  – But is this really a fair assessment of the system?
• What if serial quicksort only took 30 seconds?
  – In this case, the speedup is 30/40 = 0.75, a more realistic assessment
• In reality, take the best sequential program as the baseline
  – Not even the parallel program running with 1 PE
  – We do this in our assignment
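The baseline argument is easy to state in code, reusing the (hypothetical) timings from this slide:

```python
# Speedup depends entirely on which serial baseline you choose.
t_bubble = 150.0     # serial bubble sort (seconds, from the slide)
t_odd_even = 40.0    # parallel odd-even bubble sort
t_quick = 30.0       # serial quicksort: the best serial program

flattering = t_bubble / t_odd_even   # 3.75, against a poor baseline
honest = t_quick / t_odd_even        # 0.75, against the best baseline
```

Against the best serial algorithm, the "parallel speedup" here is actually a slowdown.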
Performance Metrics: Speedup Bounds

• Speedup, in theory, should be upper bounded by p
  – We can only expect a p-fold speedup if we use p times as many resources
• Theoretically, a speedup greater than p is possible only if each processor spends less than TS / p time solving the problem
  – This violates the rule of using the best sequential program as the baseline
• Speedups:
  – Linear
  – Sublinear
  – Superlinear
• In practice, superlinear speedup is possible
Performance Metrics: Superlinear Speedups

The parallel algorithm does less work than its serial version
• Searching for node ‘S’ in an unstructured tree
• Parallel search with two PEs using depth-first traversal:
  – PE 0, searching the left subtree, expands only the shaded nodes before the solution is found by PE 1
  – PE 1 searches the right subtree
• The serial algorithm expands the entire tree
  – It does more work than the parallel algorithm
Performance Metrics: Superlinear Speedups

Resource-based superlinearity
• Parallel execution:
  – The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity
• Example: A processor with 64 KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory.
• If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!
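The 2.43 figure follows from comparing average memory access times, weighted by the hit/miss fractions on this slide; this short sketch just redoes that arithmetic:

```python
# Access times in ns (from the slide's example).
cache, dram, remote = 2.0, 100.0, 400.0

# Serial: 80% cache hits, 20% DRAM.
serial_access = 0.80 * cache + 0.20 * dram                    # 21.6 ns

# Two PEs: 90% cache hits, 8% local DRAM, 2% remote memory.
parallel_access = 0.90 * cache + 0.08 * dram + 0.02 * remote  # 17.8 ns

# Two processors, each enjoying the faster effective access time.
speedup = 2 * serial_access / parallel_access                 # ~2.43
```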
Performance Metrics #4: Efficiency

• The fraction of time for which a processor performs useful work
  – E = S / p
• Bounds:
  – Theoretically, 0 ≤ E ≤ 1
    • The larger, the better; E = 1 means zero overhead
  – Practically, E > 1 if superlinear speedup is achieved
• Previous example: adding N numbers using N PEs
  – Speedup: S = Θ(N / log N)
  – Efficiency: E = S / N = Θ((N / log N) / N) = Θ(1 / log N)
    • Very low when N is big
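Dropping the Θ constants, E behaves like 1 / log2(n); a two-line Python check shows how quickly it decays (treat these as trends, not exact efficiencies):

```python
import math

def efficiency(n):
    """E = Theta(1 / log n) for adding n numbers on n PEs (constants dropped)."""
    return 1.0 / math.log2(n)

# Efficiency collapses as n grows: 0.25, 0.1, 0.05.
values = [efficiency(n) for n in (16, 1024, 2**20)]
```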
Example: Image Filtering (e.g., Edge Detection)

http://en.wikipedia.org/wiki/Edge_detection

• Apply a 3x3 template to each pixel of the image
  – Stencil computation
• Serial performance: TS = 9 tc n²
  – Each pixel needs 9 multiply-add (MA) operations
    • Each MA takes constant time tc
  – An n x n image has n² pixels
Edge Detection: Parallel Version

• Partition the image equally into vertical segments, each with n² / p pixels
• Computation by each PE: 9 tc n² / p
• Communication by each PE: 2 (ts + tw n)
  – The boundary of each segment is 2n pixels
    • Two boundaries: left and right
  – Each boundary exchange takes ts + tw n
• Parallel performance: TP = 9 tc n² / p + 2 (ts + tw n)
Edge Detection: Parallel Speedup and Efficiency

• Serial performance: TS = 9 tc n²
• Parallel performance: TP = 9 tc n² / p + 2 (ts + tw n)
• Speedup: S = TS / TP = 9 tc n² / (9 tc n² / p + 2 (ts + tw n))
• Efficiency: E = S / p
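A numerical instance of this model, with machine constants invented purely for illustration (they are not measurements of any real platform):

```python
def t_serial(n, t_c):
    """TS = 9 * t_c * n^2."""
    return 9 * t_c * n * n

def t_parallel(n, p, t_c, t_s, t_w):
    """TP = 9 * t_c * n^2 / p + 2 * (t_s + t_w * n)."""
    return 9 * t_c * n * n / p + 2 * (t_s + t_w * n)

# Hypothetical constants (arbitrary time units).
t_c, t_s, t_w = 1.0, 100.0, 0.5
n, p = 4096, 64

S = t_serial(n, t_c) / t_parallel(n, p, t_c, t_s, t_w)
E = S / p   # close to 1 here: communication is dwarfed by computation
```

With a large image and modest p, the 2(ts + tw n) term is negligible and efficiency stays near 1; shrinking n or growing p makes it dominate and E falls.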
Performance Metrics #5: Cost

The product of parallel execution time and number of PEs: p TP
• The total amount of time spent by all PEs to solve the problem
• Cost-optimal: parallel cost ≅ serial cost
  – ~0 overhead
  – E = Θ(1), since E = TS / (p TP)
Cost: An Example

Adding n numbers on n PEs
• Serial performance: TS = Θ(n)
• Parallel performance: TP = Θ(log n)
• Cost: p TP = Θ(n log n)
• Cost-optimal or not?
  – E = Θ(n) / Θ(n log n) = Θ(1 / log n)
  – Not cost-optimal
• Why not optimal:
  – Waste of CPU cycles after step 1
    • Fewer and fewer PEs do useful work across the log n steps; in the last step only core 0 is active
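Asymptotically (constants dropped), the cost gap between the parallel and serial algorithm is exactly the log factor; a short check:

```python
import math

def parallel_cost(n):
    """p * TP = n * log2(n) for adding n numbers on n PEs (constants dropped)."""
    return n * math.log2(n)

def serial_cost(n):
    """TS = Theta(n)."""
    return float(n)

n = 1024
ratio = parallel_cost(n) / serial_cost(n)   # log2(n): the gap grows with n
```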
Topic Overview

• Introduction
• Performance Metrics for Parallel Systems
  – Execution Time, Overhead, Speedup, Efficiency, Cost
• Amdahl’s Law
• Scalability of Parallel Systems
  – Isoefficiency Metric of Scalability
• Minimum Execution Time and Minimum Cost-Optimal Execution Time
• Asymptotic Analysis of Parallel Programs
• Other Scalability Metrics
  – Scaled speedup, Serial fraction
Amdahl’s Law

• Amdahl’s argument
• The word “law” is often used by computer scientists for an observed phenomenon (e.g., Moore’s Law), not a theorem that has been proven in a strict sense.

Gene Amdahl, “Validity of the single processor approach to achieving large-scale computing capabilities”, AFIPS Conference Proceedings, 30:483–485, 1967.
Amdahl’s Law for Parallelism

• The enhanced fraction F is sped up through parallelism; with perfect parallelism (linear speedup), the speedup of F is N on N processors
• Overall speedup: S = 1 / ((1 − F) + F / N)
• Speedup upper bound (as N → ∞): S → 1 / (1 − F)
  – 1 − F is the sequential portion of the program
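The two formulas above can be checked directly; F = 0.9 below is an arbitrary example fraction, not from any particular program:

```python
def amdahl_speedup(F, N):
    """Overall speedup S = 1 / ((1 - F) + F / N)."""
    return 1.0 / ((1.0 - F) + F / N)

F = 0.9                      # hypothetical parallel fraction
bound = 1.0 / (1.0 - F)      # upper bound as N -> infinity: 10x

# No processor count gets past the bound set by the serial fraction.
speedups = [amdahl_speedup(F, N) for N in (2, 16, 1024)]
```

Even with 1024 processors the speedup stays just under 10, because 10% of the work is serial.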
Amdahl’s Law Usefulness

• Amdahl’s law is valid for traditional problems and has several useful interpretations.
• Some textbooks show how Amdahl’s law can be used to analyze the efficiency of parallel algorithms:
  – E = (1 / ((1 − F) + F / N)) / N = 1 / (N (1 − F) + F)
  – If we increase N and the problem size grows at a certain rate (so F increases), we can still keep E constant
• Amdahl’s law shows that efforts to further reduce the sequential fraction of the code may pay off in large performance gains.
• Hardware that achieves even a small decrease in the percentage of work executed sequentially may be considerably more efficient.
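The efficiency identity above can be explored in a few lines; the scaling schedule F = 1 − 0.1/N is a made-up example of F growing with problem size, not a claim about any particular workload:

```python
def amdahl_efficiency(F, N):
    """E = 1 / (N * (1 - F) + F), from E = S / N with Amdahl's speedup."""
    return 1.0 / (N * (1.0 - F) + F)

# Fixed F: efficiency decays as N grows.
fixed = [amdahl_efficiency(0.9, N) for N in (1, 10, 100)]

# F rising with N (hypothetical schedule): efficiency stays near constant.
scaled = [amdahl_efficiency(1.0 - 0.1 / N, N) for N in (1, 10, 100)]
```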
Amdahl’s Law for Parallelism

• However, for a long time Amdahl’s law was viewed as a fatal flaw in the usefulness of parallelism:
  – It focuses on a particular algorithm and problem size, and does not consider that other algorithms with more parallelism may exist, or scalability issues
  – Amdahl’s law applies only to “standard” problems where superlinearity cannot occur
  – Gustafson’s Law: the proportion of the computation that is sequential normally decreases as the problem size increases
• Currently, it is generally accepted by parallel computing professionals that Amdahl’s law is not a serious limit on the benefit and future of parallel computing.

Compilers and More: Is Amdahl’s Law Still Relevant? Michael Wolfe, http://www.hpcwire.com/2015/01/22/compilers-amdahls-law-still-relevant/, 01/22/2015
References

• Adapted from slides “Principles of Parallel Algorithm Design” by Ananth Grama
• “Analytical Modeling of Parallel Systems”, Chapter 5 in Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, Addison Wesley, 2003.