Lecture 16: Analytical Modeling of Parallel Programs: Metrics and Analysis

CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh
Topics

• Introduction
• Programming on shared memory systems (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Topic Overview

• Introduction
• Performance Metrics for Parallel Systems
  – Execution Time, Overhead, Speedup, Efficiency, Cost
• Amdahl’s Law
• Scalability of Parallel Systems
  – Isoefficiency Metric of Scalability
• Minimum Execution Time and Minimum Cost-Optimal Execution Time
• Asymptotic Analysis of Parallel Programs
• Other Scalability Metrics
  – Scaled speedup, Serial fraction
Analytical Modeling: Sequential Execution Time

• The execution time of a sequential algorithm
  – Asymptotic execution time as a function of input size
    • Identical on any serial platform

Count the number of operations
• Big-O notation
  – O(1), O(N), O(N²), O(N log N), O(N³), …
Parallel Execution Time

• Parallel execution time is a function of:
  – input size
  – number of processors (machine performance)
  – communication parameters of the target platform (network)
• Implications
  – Must analyze a parallel program for a particular target platform
    • Communication characteristics can differ by more than O(1)
  – Parallel program = parallel algorithm + platform
Overhead in Parallel Programs

If using two processors, shouldn’t a program run twice as fast?
– Not all parts of the program are parallelized
– A certain amount of overhead is incurred when running in parallel
Overheads in Parallel Programs

• Interprocess interactions:
  – Communication
    • Data movement
  – Synchronization / contention
• Idling:
  – Load imbalance
  – Synchronization
    • Synchronization itself has overhead
  – Serial components
• Excess computation
  – Computation not performed by the serial version
    • E.g., replicated computation to minimize communication
Topic Overview

• Introduction
• Five Performance Metrics for Parallel Systems
  – Execution Time, Overhead, Speedup, Efficiency, Cost
• Amdahl’s Law
• Scalability of Parallel Systems
  – Isoefficiency Metric of Scalability
• Minimum Execution Time and Minimum Cost-Optimal Execution Time
• Asymptotic Analysis of Parallel Programs
• Other Scalability Metrics
  – Scaled speedup, Serial fraction
Performance Metrics #1: Execution Time

Does a parallel program run faster than its sequential version?
• Serial time TS: time elapsed between the start and the end of serial execution
• Parallel time TP: time elapsed between the first process starting and the last process ending
Performance Metrics #2: Parallel Overhead

What is the cost of enabling parallelism?
• Tall: the total time collectively spent by all the processors
  – Tall = p TP (p is the number of processors)
• TS: serial execution time
• Total parallel overhead To:
  – To = Tall − TS = p TP − TS
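As a quick sanity check on these definitions, here is a minimal Python sketch; the timings below are invented for illustration, not taken from any measurement:

```python
# Total parallel overhead: To = p*TP - TS.
# All timings are hypothetical, chosen only to exercise the formula.

def total_overhead(p, t_parallel, t_serial):
    """Time spent by all processors beyond the useful serial work."""
    return p * t_parallel - t_serial

t_s = 100.0   # serial execution time TS (seconds, made up)
t_p = 30.0    # parallel execution time TP on p processors
p = 4

t_all = p * t_p                      # Tall = 120.0 processor-seconds
t_o = total_overhead(p, t_p, t_s)    # To = 20.0 seconds of overhead
```

A perfectly parallelized program with zero overhead would have t_p = t_s / p and To = 0.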
Performance Metrics #3: Speedup

What is the benefit from increasing parallelism?
• Speedup S = TS / TP
  – The ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements
Performance Metrics: Example

Adding n numbers
• Sequential: Θ(n)
• Using n processing elements:
  – If n is a power of two, done in log n steps by propagating partial sums up a logical binary tree of processors
Performance Metrics: Example – cont’d

• Σ(i..j) denotes the sum of the numbers with consecutive labels from i to j
• Analysis:
  – An addition takes tc
  – A communication takes ts + tw
  – tc and (ts + tw) are constant
• Sequential and parallel time:
  – TS = Θ(n)
  – TP = Θ(log n)
• Speedup:
  – S = Θ(n / log n)
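The Θ(n / log n) speedup can be played out numerically with a toy cost model; the unit costs tc, ts, tw below are hypothetical stand-ins for the constants in the analysis:

```python
import math

def serial_time(n, t_c=1.0):
    """TS: n - 1 additions, each costing t_c."""
    return (n - 1) * t_c

def parallel_time(n, t_c=1.0, t_s=1.0, t_w=1.0):
    """TP: log2(n) tree steps, each one addition plus one message."""
    assert n & (n - 1) == 0, "n must be a power of two"
    steps = int(math.log2(n))
    return steps * (t_c + t_s + t_w)

n = 1024
speedup = serial_time(n) / parallel_time(n)   # grows like n / log n
```

With 1024 numbers the model gives a speedup of about 34, far below the 1024 processors used; efficiency is the topic of a later slide.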
Performance Metrics #3: Speedup

• The yardstick: TS
  – Many serial algorithms are available, each with a different asymptotic execution time
  – The parallelization of those algorithms varies too

http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations
Speedup Example: Sorting

http://en.wikipedia.org/wiki/Sorting_algorithm

• The serial execution time for bubble sort: 150 seconds
• Odd-even parallel bubble sort: 40 seconds
• The speedup: 150/40 = 3.75
  – But is this really a fair assessment of the system?
• What if serial quicksort only took 30 seconds?
  – In this case, the speedup is 30/40 = 0.75, a more realistic assessment
• In reality, take the best sequential program as the baseline
  – Not even the parallel program running with 1 PE
  – We do this in our assignment
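The baseline argument is easy to state in code, reusing the (hypothetical) timings from this slide:

```python
# Speedup depends entirely on which serial baseline you choose.
t_bubble = 150.0     # serial bubble sort (seconds, from the slide)
t_odd_even = 40.0    # parallel odd-even bubble sort
t_quick = 30.0       # serial quicksort: the best serial program

flattering = t_bubble / t_odd_even   # 3.75, against a poor baseline
honest = t_quick / t_odd_even        # 0.75, against the best baseline
```

Against the best serial algorithm, the "parallel speedup" here is actually a slowdown.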
Performance Metrics: Speedup Bounds

• Speedup, in theory, should be upper bounded by p
  – We can only expect a p-fold speedup if we use p times as many resources
• Theoretically, a speedup greater than p is possible only if each processor spends less than TS / p time solving the problem
  – This violates the rule of using the best sequential program as the baseline
• Speedups:
  – Linear
  – Sublinear
  – Superlinear
• In practice, superlinear speedup is possible
Performance Metrics: Superlinear Speedups

The parallel algorithm does less work than its serial version
• Searching for node ‘S’ in an unstructured tree
• Parallel search with two PEs using depth-first traversal:
  – PE 0, searching the left subtree, expands only the shaded nodes before the solution is found by PE 1
  – PE 1 searches the right subtree
• The serial algorithm expands the entire tree
  – It does more work than the parallel algorithm
Performance Metrics: Superlinear Speedups

Resource-based superlinearity
• Parallel execution:
  – The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity
• Example: A processor with 64 KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory.
• If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!
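The 2.43 figure follows from comparing average memory access times, weighted by the hit/miss fractions on this slide; this short sketch just redoes that arithmetic:

```python
# Access times in ns (from the slide's example).
cache, dram, remote = 2.0, 100.0, 400.0

# Serial: 80% cache hits, 20% DRAM.
serial_access = 0.80 * cache + 0.20 * dram                    # 21.6 ns

# Two PEs: 90% cache hits, 8% local DRAM, 2% remote memory.
parallel_access = 0.90 * cache + 0.08 * dram + 0.02 * remote  # 17.8 ns

# Two processors, each enjoying the faster effective access time.
speedup = 2 * serial_access / parallel_access                 # ~2.43
```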
Performance Metrics #4: Efficiency

• The fraction of time for which a processor performs useful work
  – E = S / p
• Bounds:
  – Theoretically, 0 ≤ E ≤ 1
    • The larger, the better; E = 1 means zero overhead
  – Practically, E > 1 if superlinear speedup is achieved
• Previous example: adding N numbers using N PEs
  – Speedup: S = Θ(N / log N)
  – Efficiency: E = S / N = Θ((N / log N) / N) = Θ(1 / log N)
    • Very low when N is big
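Dropping the Θ constants, E behaves like 1 / log2(n); a two-line Python check shows how quickly it decays (treat these as trends, not exact efficiencies):

```python
import math

def efficiency(n):
    """E = Theta(1 / log n) for adding n numbers on n PEs (constants dropped)."""
    return 1.0 / math.log2(n)

# Efficiency collapses as n grows: 0.25, 0.1, 0.05.
values = [efficiency(n) for n in (16, 1024, 2**20)]
```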
Example: Image Filtering (e.g., Edge Detection)

http://en.wikipedia.org/wiki/Edge_detection

• Apply a 3x3 template to each pixel of the image
  – Stencil computation
• Serial performance: TS = 9 tc n²
  – Each pixel needs 9 multiply-add (MA) operations
    • Each MA takes constant time tc
  – An n x n image has n² pixels
Edge Detection: Parallel Version

• Partition the image equally into vertical segments, each with n² / p pixels
• Computation by each PE: 9 tc n² / p
• Communication by each PE: 2 (ts + tw n)
  – The boundary of each segment is 2n pixels
    • Two boundaries: left and right
  – Each boundary exchange takes ts + tw n
• Parallel performance: TP = 9 tc n² / p + 2 (ts + tw n)
Edge Detection: Parallel Speedup and Efficiency

• Serial performance: TS = 9 tc n²
• Parallel performance: TP = 9 tc n² / p + 2 (ts + tw n)
• Speedup: S = TS / TP = 9 tc n² / (9 tc n² / p + 2 (ts + tw n))
• Efficiency: E = S / p
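A numerical instance of this model, with machine constants invented purely for illustration (they are not measurements of any real platform):

```python
def t_serial(n, t_c):
    """TS = 9 * t_c * n^2."""
    return 9 * t_c * n * n

def t_parallel(n, p, t_c, t_s, t_w):
    """TP = 9 * t_c * n^2 / p + 2 * (t_s + t_w * n)."""
    return 9 * t_c * n * n / p + 2 * (t_s + t_w * n)

# Hypothetical constants (arbitrary time units).
t_c, t_s, t_w = 1.0, 100.0, 0.5
n, p = 4096, 64

S = t_serial(n, t_c) / t_parallel(n, p, t_c, t_s, t_w)
E = S / p   # close to 1 here: communication is dwarfed by computation
```

With a large image and modest p, the 2(ts + tw n) term is negligible and efficiency stays near 1; shrinking n or growing p makes it dominate and E falls.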
Performance Metrics #5: Cost

The product of parallel execution time and number of PEs: p TP
• The total amount of time spent by all PEs to solve the problem
• Cost-optimal: parallel cost ≅ serial cost
  – ~0 overhead
  – E = Θ(1), since E = TS / (p TP)
Cost: An Example

Adding n numbers on n PEs
• Serial performance: TS = Θ(n)
• Parallel performance: TP = Θ(log n)
• Cost: p TP = Θ(n log n)
• Cost-optimal or not?
  – E = Θ(n) / Θ(n log n) = Θ(1 / log n)
  – Not cost-optimal
• Why not optimal:
  – Waste of CPU cycles after step 1
    • Fewer and fewer PEs do useful work across the log n steps; in the last step only core 0 is active
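Asymptotically (constants dropped), the cost gap between the parallel and serial algorithm is exactly the log factor; a short check:

```python
import math

def parallel_cost(n):
    """p * TP = n * log2(n) for adding n numbers on n PEs (constants dropped)."""
    return n * math.log2(n)

def serial_cost(n):
    """TS = Theta(n)."""
    return float(n)

n = 1024
ratio = parallel_cost(n) / serial_cost(n)   # log2(n): the gap grows with n
```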
Topic Overview

• Introduction
• Performance Metrics for Parallel Systems
  – Execution Time, Overhead, Speedup, Efficiency, Cost
• Amdahl’s Law
• Scalability of Parallel Systems
  – Isoefficiency Metric of Scalability
• Minimum Execution Time and Minimum Cost-Optimal Execution Time
• Asymptotic Analysis of Parallel Programs
• Other Scalability Metrics
  – Scaled speedup, Serial fraction
Amdahl’s Law

• Amdahl’s argument
• The word “law” is often used by computer scientists for an observed phenomenon (e.g., Moore’s Law), not a theorem that has been proven in a strict sense.

Gene Amdahl, “Validity of the single processor approach to achieving large-scale computing capabilities”, AFIPS Conference Proceedings, 30:483–485, 1967.
Amdahl’s Law for Parallelism

• The enhanced fraction F is sped up through parallelism; with perfect parallelism (linear speedup), the speedup of F is N on N processors
• Overall speedup: S = 1 / ((1 − F) + F / N)
• Speedup upper bound (as N → ∞): S → 1 / (1 − F)
  – 1 − F is the sequential portion of the program
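The two formulas above can be checked directly; F = 0.9 below is an arbitrary example fraction, not from any particular program:

```python
def amdahl_speedup(F, N):
    """Overall speedup S = 1 / ((1 - F) + F / N)."""
    return 1.0 / ((1.0 - F) + F / N)

F = 0.9                      # hypothetical parallel fraction
bound = 1.0 / (1.0 - F)      # upper bound as N -> infinity: 10x

# No processor count gets past the bound set by the serial fraction.
speedups = [amdahl_speedup(F, N) for N in (2, 16, 1024)]
```

Even with 1024 processors the speedup stays just under 10, because 10% of the work is serial.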
Amdahl’s Law Usefulness

• Amdahl’s law is valid for traditional problems and has several useful interpretations.
• Some textbooks show how Amdahl’s law can be used to analyze the efficiency of parallel algorithms:
  – E = (1 / ((1 − F) + F / N)) / N = 1 / (N (1 − F) + F)
  – If we increase N and the problem size grows at a certain rate (so F increases), we can still keep E constant
• Amdahl’s law shows that efforts to further reduce the sequential fraction of the code may pay off in large performance gains.
• Hardware that achieves even a small decrease in the percentage of work executed sequentially may be considerably more efficient.
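The efficiency identity above can be explored in a few lines; the scaling schedule F = 1 − 0.1/N is a made-up example of F growing with problem size, not a claim about any particular workload:

```python
def amdahl_efficiency(F, N):
    """E = 1 / (N * (1 - F) + F), from E = S / N with Amdahl's speedup."""
    return 1.0 / (N * (1.0 - F) + F)

# Fixed F: efficiency decays as N grows.
fixed = [amdahl_efficiency(0.9, N) for N in (1, 10, 100)]

# F rising with N (hypothetical schedule): efficiency stays near constant.
scaled = [amdahl_efficiency(1.0 - 0.1 / N, N) for N in (1, 10, 100)]
```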
Amdahl’s Law for Parallelism

• However, for a long time Amdahl’s law was viewed as a fatal flaw in the usefulness of parallelism:
  – It focuses on a particular algorithm and problem size, and does not consider that other algorithms with more parallelism may exist, or scalability issues
  – Amdahl’s law applies only to “standard” problems where superlinearity cannot occur
  – Gustafson’s Law: the proportion of the computation that is sequential normally decreases as the problem size increases
• Currently, it is generally accepted by parallel computing professionals that Amdahl’s law is not a serious limit on the benefit and future of parallel computing.

Compilers and More: Is Amdahl’s Law Still Relevant? Michael Wolfe, http://www.hpcwire.com/2015/01/22/compilers-amdahls-law-still-relevant/, 01/22/2015
References

• Adapted from slides “Principles of Parallel Algorithm Design” by Ananth Grama
• “Analytical Modeling of Parallel Systems”, Chapter 5 in Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, Addison Wesley, 2003.