Parallel Processing Chapter - 2

Parallel Processing
Prof. Mitul K. Patel, Assistant Professor, Computer Engineering Department
Shree Swami Atmanand Saraswati Institute of Technology, Surat
January 2013
Prepared By: Prof. Mitul K. Patel


Outline
1. Introduction to subject
2. Parallel Programming Platforms
   - Implicit Parallelism: Trends in Microprocessor Architectures
   - Limitations of Memory System Performance
   - Dichotomy of Parallel Computing Platforms
   - Communication Model of Parallel Platforms
   - Physical Organization of Parallel Platforms
   - Communication Costs in Parallel Machines
   - Messaging Cost Models and Routing Mechanisms
   - Mapping Techniques

Introduction to subject
- Subject Name: Parallel Processing
- Subject Code: 180702
- Evaluation Scheme: (table from the slides, not reproduced in the transcript)

Reference Books
Text Books:
1. Introduction to Parallel Computing, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Pearson Publication
2. Introduction to Parallel Processing, M. Sasikumar, Dinesh Shikhare, P. Raviprakash, PHI Publication
Reference Books:
1. Introduction to Parallel Programming, Steven Brawer
2. Introduction to Parallel Processing, M. Sasikumar, Dinesh Shikhare, P. Ravi Prakash
3. Parallel Computers: Architecture and Programming, V. Rajaraman, C. Siva Ram Murthy

Parallel Programming Platforms: Topic Overview
- Implicit Parallelism: Trends in Microprocessor Architectures
- Limitations of Memory System Performance
- Dichotomy of Parallel Computing Platforms
- Communication Model of Parallel Platforms
- Physical Organization of Parallel Platforms
- Communication Costs in Parallel Machines
- Messaging Cost Models and Routing Mechanisms
- Mapping Techniques
Scope of Parallelism
- Conventional architectures coarsely comprise a processor, a memory system, and the datapath.
- Each of these components presents significant performance bottlenecks.
- Parallelism addresses each of these components in significant ways.
- Different applications utilize different aspects of parallelism - e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
- It is important to understand each of these performance bottlenecks.

Implicit Parallelism: Trends in Microprocessor Architectures
- Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
- Higher levels of device integration have made available a large number of transistors.
- The question of how best to utilize these resources is an important one.
- Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
- The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
Pipelining and Superscalar Execution
- Pipelining overlaps the various stages of instruction execution to achieve performance.
- At a high level of abstraction, an instruction can be executed while the next one is being decoded and the one after that is being fetched.
- This is akin to an assembly line for the manufacture of cars.

- Pipelining, however, has several limitations.
- The speed of a pipeline is eventually limited by the slowest stage.
- For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
- However, in typical program traces, every fifth or sixth instruction is a conditional jump! This requires very accurate branch prediction.
- The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.

- One simple way of alleviating these bottlenecks is to use multiple pipelines.
- The question then becomes one of selecting these instructions.

Superscalar Execution: An Example
(figure from the slides, not reproduced in the transcript)
- In the example, there is some wastage of resources due to data dependencies.
- The example also illustrates that different instruction mixes with identical semantics can take significantly different execution times.
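The point about instruction mixes with identical semantics can be made concrete with a small, hypothetical C sketch (not from the slides). Both functions below compute the same sum of four values, but the first forms a serial chain of dependent additions, while the second pairs two independent additions that a superscalar processor could issue in the same cycle:

```c
#include <assert.h>

/* Serial chain: each addition depends on the previous result, so the
 * three adds must execute one after another (dependency depth 3). */
double sum_chain(double a, double b, double c, double d) {
    double t = a + b;   /* depends on a, b             */
    t = t + c;          /* depends on the previous add */
    t = t + d;          /* depends on the previous add */
    return t;
}

/* Tree shape: the first two additions are independent of each other
 * and can be co-scheduled, leaving only one dependent add (depth 2). */
double sum_tree(double a, double b, double c, double d) {
    double t1 = a + b;  /* independent of t2 */
    double t2 = c + d;  /* independent of t1 */
    return t1 + t2;     /* the only dependent add */
}
```

Both versions return the same result; only the shape of the dependency graph, and hence the achievable instruction-level parallelism, differs.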
Superscalar Execution
- Scheduling of instructions is determined by a number of factors:
  - True Data Dependency: the result of one operation is an input to the next.
  - Resource Dependency: two operations require the same resource.
  - Branch Dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
- The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
- The complexity of this hardware is an important constraint on superscalar processors.

Superscalar Execution: Issue Mechanisms
- In the simpler model, instructions can be issued only in the order in which they are encountered. That is, if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order issue.
- In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue.
- Performance of in-order issue is generally limited.
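The difference between the two issue mechanisms can be counted explicitly. The following C sketch is a toy model (an assumption, not from the slides): a dual-issue processor, unit-latency instructions, and at most one dependency per instruction, with each instruction's result available one cycle after issue:

```c
#include <assert.h>

#define WIDTH 2   /* dual-issue width (an assumption for this sketch) */
#define MAXN 64   /* maximum instructions handled by this toy model   */

/* dep[i] is the index of an earlier instruction whose result
 * instruction i consumes, or -1 if instruction i is independent. */

/* In-order issue: a blocked instruction stalls everything behind it. */
int cycles_in_order(const int dep[], int n) {
    int issue[MAXN];              /* cycle in which each instruction issued */
    int cycle = 1, used = 0;      /* current cycle and slots used in it     */
    for (int i = 0; i < n; i++) {
        int ready = (dep[i] < 0) ? 1 : issue[dep[i]] + 1;
        if (used == WIDTH || ready > cycle) {
            cycle = (ready > cycle) ? ready : cycle + 1;
            used = 0;
        }
        issue[i] = cycle;
        used++;
    }
    return cycle;
}

/* Dynamic (out-of-order) issue: each instruction takes the earliest
 * cycle in which its operand is ready and an issue slot is free. */
int cycles_dynamic(const int dep[], int n) {
    int issue[MAXN], used[MAXN] = {0};
    int last = 0;
    for (int i = 0; i < n; i++) {
        int c = (dep[i] < 0) ? 1 : issue[dep[i]] + 1;
        while (used[c] == WIDTH) c++;   /* skip cycles that are full */
        issue[i] = c;
        used[c]++;
        if (c > last) last = c;
    }
    return last;
}
```

For the trace {I1; I2 depends on I1; I3; I4 depends on I3}, in-order issue needs three cycles (I2 blocks I3 in cycle 1), while dynamic issue co-schedules I1 with I3 and I2 with I4, finishing in two.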
Superscalar Execution: Efficiency Considerations
- Not all functional units can be kept busy at all times.
- If, during a cycle, no functional units are utilized, this is referred to as vertical waste.
- If, during a cycle, only some of the functional units are utilized, this is referred to as horizontal waste.
- Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
- Conventional microprocessors typically support four-way superscalar execution.

Very Long Instruction Word (VLIW) Processors
- The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
- To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
- These instructions are packed and dispatched together, hence the name very long instruction word.
- This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
- Variants of this concept are employed in the Intel IA64 processors.

Very Long Instruction Word (VLIW) Processors: Considerations
- Issue hardware is simpler.
- The compiler has a bigger context from which to select co-scheduled instructions.
- Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative.
- Branch and memory prediction is more difficult.
- VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution, and branch prediction are critical.
- Typical VLIW processors are limited to 4-way to 8-way parallelism.
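Loop unrolling, named above as one of the critical compiler techniques for VLIW (and superscalar) machines, can be illustrated with a hypothetical C sketch. The unrolled version keeps four independent accumulators, so each iteration offers four additions with no dependencies among them that could be bundled into one wide instruction word:

```c
#include <assert.h>

/* Straightforward sum: every addition depends on the previous one,
 * forming a single serial dependency chain through 'total'. */
double sum_simple(const double *x, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}

/* Unrolled by four with independent accumulators: the four additions
 * in the loop body do not depend on each other, so a VLIW compiler
 * can pack them into the same instruction word. */
double sum_unrolled(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)        /* handle any leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that reassociating floating-point additions like this can change rounding, which is one reason real compilers schedule conservatively unless explicitly permitted to reorder.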
Limitations of Memory System Performance
- The memory system, and not processor speed, is often the bottleneck for many applications.
- Memory system performance is largely captured by two parameters, latency and bandwidth.
- Latency is the time from the issue of a memory request to the time the data is available at the processor.
- Bandwidth is the rate at which data can be pumped to the processor by the memory system.

Memory System Performance: Bandwidth and Latency
- It is very important to understand the difference between latency and bandwidth.
- Consider the example of a fire hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds.
- Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second.
- If you want an immediate response from the hydrant, it is important to reduce latency.
- If you want to fight big fires, you want high bandwidth.

Memory Latency: An Example
Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns.
The following observations follow:
- The peak processor rating is 4 GFLOPS.
- Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.

Memory Latency: An Example (continued)
- On the above architecture, consider the problem of computing a dot product of two vectors.
- A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch.
- It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!

Improving Effective Memory Latency Using Caches
- Caches are small and fast memory elements between the processor and DRAM.
- This memory acts as low-latency, high-bandwidth storage.
- If a piece of data is repeatedly used, the effective latency of this memory system can be reduced by the cache.
- The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system.
- The cache hit ratio achieved by a code on a memory system often determines its performance.

Impact of Caches: Example
Consider the architecture from the previous example. In this case, we introduce a cache of size 32 KB with a latency of 1 ns or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 x 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.
Impact of Caches: Example (continued)
The following observations can be made about the problem:
- Fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 us.
- Multiplying two n x n matrices takes 2n^3 operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 us) at four instructions per cycle.
- The total time for the computation is therefore approximately the sum of the time for load/store operations and the time for the computation itself, i.e., 200 + 16 us.
- This corresponds to a peak computation rate of 64K FLOPs / 216 us, or 303 MFLOPS.

Impact of Caches
- Repeated references to the same data item correspond to temporal locality.
- In our example, we had O(n^2) data accesses and O(n^3) computation.
- This asymptotic difference makes the above example particularly desirable for caches.
- Data reuse is critical for cache performance.

Impact of Memory Bandwidth
- Memory bandwidth is determined by the bandwidth of the memory bus as well as the memory units.
- Memory bandwidth can be improved by increasing the size of memory blocks.
- The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).

Impact of Memory Bandwidth: Example
Consider the same setup as before, except in this case, the block size is 4 words instead of 1 word.
We repeat the dot-product computation in this scenario:
- Assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.
- This is because a single memory access fetches four consecutive words in the vector.
- Therefore, two accesses can fetch four elements of each of the vectors. This corresponds to one FLOP every 25 ns, for a peak speed of 40 MFLOPS.

Impact of Memory Bandwidth
- It is important to note that increasing block size does not change the latency of the system.
- Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks.
- In practice, such wide buses are expensive to construct.
- In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.

- The above examples clearly illustrate how increased bandwidth results in higher peak computation rates.
- The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions (spatial locality of reference).
- If we take a data-layout centric view, computations must be reordered to enhance spatial locality of reference.

Impact of Memory Bandwidth: Example
Consider the following code fragment:
for(i=0;i