Transcript of Lecture 02: Parallel Architecture
Lecture 02: Parallel Architecture. ILP: Instruction Level Parallelism, TLP: Thread Level Parallelism, and DLP: Data Level Parallelism
CSCE 790: Parallel Programming Models for Multicore and Manycore Processors
Department of Computer Science and Engineering
http://cse.sc.edu/~yanyh
Flynn's Taxonomy of Parallel Architectures
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

SISD: Single Instruction Single Data
• At one time, one instruction operates on one data item
• Based on the traditional von Neumann uniprocessor architecture
– Instructions are executed sequentially, or serially, one step after the next
• Until recently, most computers were of the SISD type
SIMD: Single Instruction Multiple Data
• Also known as array processors from early on
• A single instruction stream is broadcast to multiple processors, each having its own data stream
– Still used in some graphics cards today
(Diagram: a control unit issues one instruction stream to multiple processors, each operating on its own data stream.)
MIMD: Multiple Instruction Multiple Data
• Each processor has its own instruction stream and input data
• Very general case – every other scenario can be mapped to MIMD
• Further breakdown of MIMD is usually based on the memory organization
– Shared memory systems
– Distributed memory systems
Parallelism in Hardware Architecture
• SISD: inherently sequential
– Instruction Level Parallelism: overlapping execution of instructions through pipelining, since we can split an instruction's execution into multiple stages
– Out-of-order execution
– Speculation
– Superscalar
• SIMD: inherently parallel, with constraints
– Data Level Parallelism: one instruction stream, multiple data
• MIMD: inherently parallel
– Thread Level Parallelism: multiple instruction streams executing independently
Abstraction: Levels of Representation/Interpretation
High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
        |  Compiler
        v
Assembly Language Program (e.g., MIPS):
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
        |  Assembler
        v
Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
        |  Machine Interpretation
        v
Hardware Architecture Description (e.g., block diagrams)
        |  Architecture Implementation
        v
Logic Circuit Description (Circuit Schematic Diagrams)

Anything can be represented as a number, i.e., data or instructions.
Instruction Level Parallelism
• Instruction execution can be divided into multiple stages – 5 stages in RISC:
– Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4 to the PC.
– Instruction decode/register fetch cycle (ID): decode the instruction and read the registers corresponding to the register source specifiers from the register file.
– Execution/effective address cycle (EX): perform the memory address calculation for load/store instructions, or the ALU operation for register-register and register-immediate ALU instructions.
– Memory access (MEM): perform memory access for load/store instructions.
– Write-back cycle (WB): write results back to the destination operands for register-register ALU instructions or load instructions.
Pipelined Instruction Execution
(Diagram, instruction order vs. time in clock cycles 1–7: each instruction flows through Ifetch, Reg, ALU, DMem, Reg; each successive instruction starts one cycle after the previous one, so the stages of four instructions overlap.)
Pipelining: It's Natural!
• Laundry example: Ann, Brian, Cathy and Dave each have one load of clothes to wash, dry, and fold
– Washer takes 30 minutes
– Dryer takes 40 minutes
– "Folder" takes 20 minutes
• One load: 90 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
(Timeline diagram, 6 PM to midnight: loads A–D run one after another in task order, each taking 30 min wash + 40 min dry + 20 min fold.)
Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
(Timeline diagram, 6 PM to midnight: load A starts at 6 PM and each later load starts as soon as the washer is free; the 40-minute dryer stage sets the pace, so the stage occupancy is 30, 40, 40, 40, 40, 20 minutes and all four loads finish in 3.5 hours, versus 6 hours sequentially.)
Classic 5-Stage Pipeline for a RISC
• Each cycle the hardware will initiate a new instruction and will be executing some part of five different instructions.
– One cycle per instruction vs. 5 cycles per instruction
Clock number:
Instruction number   1    2    3    4    5    6    7    8    9
Instruction i        IF   ID   EX   MEM  WB
Instruction i+1           IF   ID   EX   MEM  WB
Instruction i+2                IF   ID   EX   MEM  WB
Instruction i+3                     IF   ID   EX   MEM  WB
Instruction i+4                          IF   ID   EX   MEM  WB
Pipeline and Superscalar
Advanced ILP
• Dynamic Scheduling → Out-of-order Execution
• Speculation → In-order Commit
• Superscalar → Multiple Issue
Techniques          Goals                   Implementation                                   Addressing                                  Approaches
Dynamic Scheduling  Out-of-order execution  Reservation stations, load/store buffer and CDB  Data hazards (RAW, WAW, WAR)                Register renaming
Speculation         In-order commit         Branch prediction (BHT/BTB) and reorder buffer   Control hazards (branch, func, exception)   Prediction and misprediction recovery
Superscalar/VLIW    Multiple issue          Software and hardware                            To decrease CPI below 1 (increase IPC)      By compiler or hardware
Problems of traditional ILP scaling
• Fundamental circuit limitations [1]
– delays grow as issue queues and multi-ported register files grow
– increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
– inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
ILP impacts
Simulations of 8-issue Superscalar
Power/heat density limits frequency
• Some fundamental physical limits are being reached

We will have this…
Revolution is happening now
• Chip density is continuing to increase ~2x every 2 years
– Clock speed is not
– Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
– No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Current Trends in Architecture
• Cannot continue to leverage Instruction-Level Parallelism (ILP)
– Single processor performance improvement ended in 2003
• Recent models for performance:
– Exploit Data-Level Parallelism (DLP) via SIMD architectures and GPUs
– Exploit Thread-Level Parallelism (TLP) via MIMD
– Others
SIMD: Single Instruction, Multiple Data (Data Level Parallelism)
• SIMD architectures can exploit significant data-level parallelism for:
– matrix-oriented scientific computing
– media-oriented image and sound processing
• SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data operation processing multiple data elements
– Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially
(Diagram: a control unit issues one instruction stream to multiple processors, each operating on its own data stream.)
SIMD Parallelism
• Three variations
– Vector architectures (early age)
– SIMD extensions
– Graphics Processing Units (GPUs) (dedicated weeks for GPUs)
• For x86 processors:
– Expect two additional cores per chip per year (MIMD)
– SIMD width to double every four years
– Potential speedup from SIMD to be twice that from MIMD!
Vector Architectures
• Vector processors abstract operations on vectors, e.g. replace the following loop

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }

  by

    a = b + c;        (ADDV.D V10, V8, V6)

• Some languages offer high-level support for these operations (e.g. Fortran 90 or newer)
Vector Programming Model
(Diagram: scalar registers r0–r15 next to vector registers v0–v15, each vector register holding elements [0], [1], ..., [VLRMAX-1], with the Vector Length Register (VLR) giving the active length. A vector arithmetic instruction such as ADDV v3, v1, v2 adds elements [0]..[VLR-1] of v1 and v2 pairwise into v3; a vector load/store instruction such as LV v1, (r1, r2) transfers a vector between memory and v1 with the base address in r1 and the stride in r2.)
Vector was Supercomputers
• Epitome: Cray-1, 1976
• Scalar Unit
– Load/Store Architecture
• Vector Extension
– Vector Registers
– Vector Instructions
• Implementation
– Hardwired Control
– Highly Pipelined Functional Units
– Interleaved Memory System
– No Data Caches
– No Virtual Memory
AXPY (64 elements) (Y = a*X + Y) in MIPS and VMIPS
• # instructions: 6 vs ~600
• Pipeline stalls: 64x higher for MIPS
• Vector chaining (forwarding): V1, V2, V3 and V4

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];

The starting addresses of X and Y are in Rx and Ry, respectively.
SIMD Instructions
• Originally developed for multimedia applications
• Same operation executed on multiple data items
• Uses a fixed-length register and partitions the carry chain to allow utilizing the same functional unit for multiple operations
– E.g. a 64-bit adder can be utilized for two 32-bit add operations simultaneously
SIMD Instructions
• MMX (Multi-Media Extension) – 1996
– Existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
– Successor to MMX instructions
– Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 – 2007
– Added support for double precision operations
• AVX (Advanced Vector Extensions) – 2010
– 256-bit registers added
AXPY
• 256-bit SIMD extensions: 4 double FP operations at a time
• MIPS: 578 instructions
• SIMD MIPS: 149 instructions – 4x reduction
• VMIPS: 6 instructions – 100x reduction

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
State of the Art: Intel Xeon Phi Manycore Vector Capability
• Intel Xeon Phi Knights Corner, 2012, ~60 cores, 4-way SMT
• Intel Xeon Phi Knights Landing, 2016, ~60 cores, 4-way SMT and HBM
– http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf
http://primeurmagazine.com/repository/PrimeurMagazine-AE-PR-12-14-32.pdf
State of the Art: ARM Scalable Vector Extension (SVE)
• Announced in August 2016
– https://community.arm.com/groups/processors/blog/2016/08/22/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
– http://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.131-ARMv8-vector-Stephens-Yoshida-ARM-v8-23_51-v11.pdf
• Beyond the vector architecture we learned:
– Vector loops, predication and speculation
– Vector Length Agnostic (VLA) programming
– Check the slides
Limitations of optimizing a single instruction stream
• Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously, due to
– data dependencies
– limitations of speculative execution across multiple branches
– difficulties in detecting memory dependencies among instructions (alias analysis)
• Consequence: a significant number of functional units are idling at any given time
• Question: can we execute instructions from another instruction stream?
– Another thread?
– Another process?
Thread-level parallelism
• Problems for executing instructions from multiple threads at the same time
– The instructions in each thread might use the same register names
– Each thread has its own program counter
• Virtual memory management allows for the execution of multiple threads and sharing of the main memory
• When to switch between different threads:
– Fine-grain multithreading: switches on every instruction
– Coarse-grain multithreading: switches only on costly stalls (e.g. level-2 cache misses)
Convert Thread-level parallelism to instruction-level parallelism
(Diagram, issue slots over processor cycles for Superscalar, Fine-Grained, Coarse-Grained and Simultaneous Multithreading; slots are colored by Threads 1–5 or marked as idle. SMT fills otherwise idle slots by issuing instructions from several threads in the same cycle.)
ILP to Do TLP: e.g. Simultaneous Multi-Threading (SMT)
• Works well if
– The number of compute-intensive threads does not exceed the number of threads supported in SMT
– Threads have highly different characteristics (e.g. one thread doing mostly integer operations, another mainly doing floating point operations)
• Does not work well if
– Threads try to utilize the same functional units
• e.g. a dual-processor system, each processor supporting 2 threads simultaneously (the OS thinks there are 4 processors)
• 2 compute-intensive application processes might end up on the same processor instead of different processors (the OS does not see the difference between SMT and real processors!)
Power, Frequency and ILP
Note: even Moore's Law is ending around 2021:
http://spectrum.ieee.org/semiconductors/devices/transistors-could-stop-shrinking-in-2021
https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
http://www.forbes.com/sites/timworstall/2016/07/26/economics-is-important-the-end-of-moores-law

CPU frequency increase flattened around 2000-2005. Two main reasons:
1. Limited ILP, and
2. Power consumption and heat dissipation
History – Past (2000) and Today
Flynn's Taxonomy
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Examples of MIMD Machines
• Symmetric Shared-Memory Multiprocessor (SMP)
– Multiple processors in a box with shared memory communication
– Current multicore chips are like this
– Every processor runs a copy of the OS
• Distributed/Non-uniform Shared-Memory Multiprocessor
– Multiple processors
• Each with local memory
• General scalable network
– Extremely light "OS" on each node provides simple services
• Scheduling/synchronization
– Network-accessible host for I/O
• Cluster
– Many independent machines connected with a general network
– Communication through messages
(Diagrams: an SMP with processors P sharing a bus to a single memory; a distributed-memory machine as a grid of processor/memory (P/M) nodes connected by a network, with a host attached for I/O.)
Symmetric (Shared-Memory) Multiprocessors (SMP)
• Small numbers of cores
– Typically eight or fewer, and no more than 32 in most cases
• Share a single centralized memory that all processors have equal access to
– Hence the term symmetric
• All existing multicores are SMPs.
• Also called uniform memory access (UMA) multiprocessors
– All processors have a uniform latency
Centralized Shared Memory System (I)
• Multi-core processors
– Typically connected over a cache
– Previous SMP systems were typically connected over the main memory
• Intel X7350 quad-core (Tigerton)
– Private L1 cache: 32 KB instruction, 32 KB data
– Shared L2 cache: 4 MB unified cache
(Diagram: four cores, each with a private L1; each pair of cores shares an L2; the chip connects to a 1066 MHz front-side bus.)
Centralized Shared Memory System (SMP) (II)
• Intel X7350 quad-core (Tigerton) multi-processor configuration
(Diagram: four sockets, Socket 0–3, holding cores C0–C15; each pair of cores shares an L2 cache; all sockets connect through the Memory Controller Hub (MCH) to four memory modules over 8 GB/s links.)
Distributed Shared-Memory Multiprocessor
• Large processor count
– 64 to 1000s
• Distributed memory
– Remote vs local memory
– Long vs short latency
– Low vs high bandwidth
• Interconnection network
– Bandwidth, topology, etc.
• Nonuniform memory access (NUMA)
• Each processor may have local I/O
Distributed Shared-Memory Multiprocessor (NUMA)
• Reduces the memory bottleneck compared to SMPs
• More difficult to program efficiently
– E.g. first-touch policy: a data item will be located in the memory of the processor which uses the data item first
• To reduce the effects of non-uniform memory access, caches are often used
– ccNUMA: cache-coherent non-uniform memory access architectures
• Largest example as of today: SGI Origin with 512 processors
Shared-Memory Multiprocessor
• SMP and DSM are both shared-memory multiprocessors
– UMA or NUMA
• Multicores are SMP shared memory
• Most multi-CPU machines are DSM
– NUMA
• Shared Address Space (Virtual Address Space)
– Not always shared memory
Current Trends in Computer Architecture
• Cannot continue to leverage ILP
– Single processor performance improvement ended in 2003
• Current models for performance:
– Exploit Data-Level Parallelism (DLP) via SIMD architectures (vector, SIMD extensions and GPUs)
– Exploit Thread-Level Parallelism (TLP) via MIMD
– Heterogeneity: integrate multiple and different architectures together at the chip/system level
• Emerging architectures
– Domain-specific architectures: Deep Learning PUs (e.g. TPU, etc.)
– E.g. Machine Learning Pulls Processor Architectures onto New Path
• https://www.top500.org/news/machine-learning-pulls-processor-architectures-onto-new-path/
These require explicit restructuring of the application ← Parallel Programming
The "Future" of Moore's Law
• The chips are down for Moore's law
– http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
• Special Report: 50 Years of Moore's Law
– http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
• Moore's law really is dead this time
– http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
• Rebooting the IT Revolution: A Call to Action (SIA/SRC, 2015)
– https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf