Lecture 02: Parallel Architecture


Transcript of Lecture 02: Parallel Architecture

Page 1: Lecture 02: Parallel Architecture

Lecture 02: Parallel Architecture
ILP: Instruction Level Parallelism, TLP: Thread Level Parallelism, and DLP: Data Level Parallelism

CSCE 790: Parallel Programming Models for Multicore and Manycore Processors
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh

Page 2: Lecture 02: Parallel Architecture

Flynn's Taxonomy of Parallel Architectures

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

Page 3: Lecture 02: Parallel Architecture

SISD: Single Instruction Single Data

•  At one time, one instruction operates on one data item.
•  Based on the traditional von Neumann uniprocessor architecture
   –  Instructions are executed sequentially or serially, one step after the next.
•  Until fairly recently, most computers were of the SISD type.

Page 4: Lecture 02: Parallel Architecture

SIMD: Single Instruction Multiple Data

•  Also known as array processors from early on.
•  A single instruction stream is broadcast to multiple processors, each having its own data stream.
   –  Still used in some graphics cards today.

[Figure: a control unit broadcasts one instruction stream to four processors, each operating on its own data.]

Page 5: Lecture 02: Parallel Architecture

MIMD: Multiple Instructions Multiple Data

•  Each processor has its own instruction stream and input data.
•  Very general case
   –  Every other scenario can be mapped to MIMD.
•  Further breakdown of MIMD is usually based on the memory organization
   –  Shared memory systems
   –  Distributed memory systems

Page 6: Lecture 02: Parallel Architecture

Parallelism in Hardware Architecture

•  SISD: inherently sequential
   –  Instruction Level Parallelism: overlapping execution of instructions through pipelining, since we can split an instruction's execution into multiple stages
   –  Out-of-order execution
   –  Speculation
   –  Superscalar
•  SIMD: inherently parallel, with constraints
   –  Data Level Parallelism: one instruction stream, multiple data
•  MIMD: inherently parallel
   –  Thread Level Parallelism: multiple instruction streams executing independently

Page 7: Lecture 02: Parallel Architecture

Abstraction: Levels of Representation/Interpretation

High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

  -- Compiler -->

Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)

  -- Assembler -->

Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

  -- Machine Interpretation -->

Hardware Architecture Description (e.g., block diagrams)

  -- Architecture Implementation -->

Logic Circuit Description (Circuit Schematic Diagrams)

Anything can be represented as a number, i.e., data or instructions.

Page 8: Lecture 02: Parallel Architecture

Instruction Level Parallelism

•  Instruction execution can be divided into multiple stages: 5 stages in RISC.
   –  Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4 to the PC.
   –  Instruction decode/register fetch cycle (ID): decode the instruction and read the registers named by the register source specifiers from the register file.
   –  Execution/effective address cycle (EX): perform the memory-address calculation for load/store, or the ALU operation for register-register and register-immediate ALU instructions.
   –  Memory access (MEM): perform the memory access for load/store instructions.
   –  Write-back cycle (WB): write results back to the destination operands for register-register ALU instructions or loads.

Page 9: Lecture 02: Parallel Architecture

Pipelined Instruction Execution

[Figure: four instructions overlapped in time across Cycle 1 through Cycle 7; each instruction passes through Ifetch, Reg (decode/read), ALU, DMem, and Reg (write-back) in successive clock cycles, with a new instruction entering the pipeline every cycle.]

Page 10: Lecture 02: Parallel Architecture

Pipelining: It's Natural!

•  Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
   –  Washer takes 30 minutes
   –  Dryer takes 40 minutes
   –  "Folder" takes 20 minutes
•  One load: 90 minutes

Page 11: Lecture 02: Parallel Architecture

Sequential Laundry

•  Sequential laundry takes 6 hours for 4 loads.
•  If they learned pipelining, how long would laundry take?

[Figure: timeline from 6 PM to midnight; loads A, B, C, and D run back to back in task order, each occupying 30 + 40 + 20 minutes.]

Page 12: Lecture 02: Parallel Architecture

Pipelined Laundry: Start Work ASAP

•  Pipelined laundry takes 3.5 hours for 4 loads, versus 6 hours for sequential laundry (the arithmetic is sketched below).

[Figure: timeline from 6 PM to midnight; loads A, B, C, and D overlap, with the 40-minute dryer stage as the bottleneck: 30 + 4x40 + 20 minutes.]
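The timing on the two laundry slides can be checked with a short calculation. This is a minimal sketch added for illustration (not from the slides); it uses the 30/40/20-minute washer, dryer, and folder times given above.

    #include <stdio.h>

    /* Sequential: each load takes the full 90 minutes before the next starts.
     * Pipelined: after the first wash, the 40-minute dryer (slowest stage)
     * paces the line, and the last fold adds 20 minutes at the end. */
    int main(void) {
        int loads = 4;
        int sequential = loads * (30 + 40 + 20);   /* 360 min = 6 hours   */
        int pipelined  = 30 + loads * 40 + 20;     /* 210 min = 3.5 hours */
        printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
        return 0;
    }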

Page 13: Lecture 02: Parallel Architecture

Classic 5-Stage Pipeline for a RISC

•  Each cycle the hardware initiates a new instruction and is executing some part of five different instructions.
   –  One cycle per instruction vs. 5 cycles per instruction (see the cycle-count sketch below).

Instruction \ Clock  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9
Instruction i        | IF  | ID  | EX  | MEM | WB  |     |     |     |
Instruction i+1      |     | IF  | ID  | EX  | MEM | WB  |     |     |
Instruction i+2      |     |     | IF  | ID  | EX  | MEM | WB  |     |
Instruction i+3      |     |     |     | IF  | ID  | EX  | MEM | WB  |
Instruction i+4      |     |     |     |     | IF  | ID  | EX  | MEM | WB
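A small worked example, added for illustration (not from the slide): on an ideal k-stage pipeline with no stalls, n instructions finish in k + (n - 1) cycles, versus k * n cycles without pipelining; the table above is the k = 5, n = 5 case (9 cycles vs. 25).

    #include <stdio.h>

    int main(void) {
        int k = 5;                                   /* stages: IF ID EX MEM WB */
        for (int n = 1; n <= 5; n++)
            printf("n=%d  unpipelined=%2d cycles  pipelined=%d cycles\n",
                   n, k * n, k + n - 1);
        return 0;
    }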

Page 14: Lecture 02: Parallel Architecture

Pipeline and Superscalar

Page 15: Lecture 02: Parallel Architecture

Advanced ILP

•  Dynamic Scheduling → Out-of-order Execution
•  Speculation → In-order Commit
•  Superscalar → Multiple Issue

Techniques         | Goals                  | Implementation                                   | Addressing                                     | Approaches
Dynamic Scheduling | Out-of-order execution | Reservation stations, load/store buffer, and CDB | Data hazards (RAW, WAW, WAR)                   | Register renaming
Speculation        | In-order commit        | Branch prediction (BHT/BTB) and reorder buffer   | Control hazards (branch, function, exception)  | Prediction and misprediction recovery
Superscalar/VLIW   | Multiple issue         | Software and hardware                            | To increase IPC (CPI below 1)                  | By compiler or hardware

Page 16: Lecture 02: Parallel Architecture

Problems of Traditional ILP Scaling

•  Fundamental circuit limitations [1]
   –  Delays grow as issue queues and multi-ported register files grow.
   –  Increasing delays limit performance returns from wider issue.
•  Limited amount of instruction-level parallelism [1]
   –  Inefficient for codes with difficult-to-predict branches.
•  Power and heat stall clock frequencies.

[1] K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The case for a single-chip multiprocessor," ASPLOS-VII, 1996.

Page 17: Lecture 02: Parallel Architecture

ILP Impacts

Page 18: Lecture 02: Parallel Architecture

Simulations of 8-issue Superscalar

Page 19: Lecture 02: Parallel Architecture

Power/Heat Density Limits Frequency

•  Some fundamental physical limits are being reached.

Page 20: Lecture 02: Parallel Architecture

We will have this…

Page 21: Lecture 02: Parallel Architecture

Revolution Is Happening Now

•  Chip density is continuing to increase (~2x every 2 years)
   –  Clock speed is not.
   –  The number of processor cores may double instead.
•  There is little or no hidden parallelism (ILP) to be found.
•  Parallelism must be exposed to and managed by software.
   –  No free lunch.

Source: Intel, Microsoft (Sutter), and Stanford (Olukotun, Hammond)

Page 22: Lecture 02: Parallel Architecture

Current Trends in Architecture

•  Cannot continue to leverage Instruction-Level Parallelism (ILP)
   –  Single-processor performance improvement ended in 2003.
•  Recent models for performance:
   –  Exploit Data-Level Parallelism (DLP) via SIMD architectures and GPUs
   –  Exploit Thread-Level Parallelism (TLP) via MIMD
   –  Others

Page 23: Lecture 02: Parallel Architecture

SIMD: Single Instruction, Multiple Data (Data Level Parallelism)

•  SIMD architectures can exploit significant data-level parallelism for:
   –  matrix-oriented scientific computing
   –  media-oriented image and sound processors
•  SIMD is more energy efficient than MIMD
   –  Only needs to fetch one instruction per data operation processing multiple data elements.
   –  Makes SIMD attractive for personal mobile devices.
•  SIMD allows the programmer to continue to think sequentially.

[Figure: a control unit broadcasts one instruction stream to four processors, each operating on its own data.]

Page 24: Lecture 02: Parallel Architecture

SIMD Parallelism

•  Three variations
   –  Vector architectures (early age)
   –  SIMD extensions
   –  Graphics Processing Units (GPUs) (dedicated weeks for GPUs)
•  For x86 processors:
   –  Expect two additional cores per chip per year (MIMD)
   –  SIMD width to double every four years
   –  Potential speedup from SIMD to be twice that from MIMD!

Page 25: Lecture 02: Parallel Architecture

Vector Architectures

•  Vector processors abstract operations on vectors, e.g., replace the following loop

       for (i=0; i<n; i++) {
           a[i] = b[i] + c[i];
       }

   by

       a = b + c;        (as a single vector instruction: ADDV.D V10, V8, V6)

•  Some languages offer high-level support for these operations (e.g., Fortran 90 or newer); a compiler-vectorization sketch in C follows below.
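On commodity hardware the same abstraction is usually reached by letting the compiler vectorize the loop. A minimal sketch, added for illustration and assuming a C compiler with OpenMP SIMD support (e.g., gcc -fopenmp-simd -O3); the function name vadd and the restrict qualifiers (asserting the arrays do not overlap) are mine, not the slide's.

    /* Element-wise vector add; the pragma tells the compiler the loop is
     * safe to vectorize, so it can emit SIMD/vector instructions. */
    void vadd(double *restrict a, const double *restrict b,
              const double *restrict c, int n) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }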

Page 26: Lecture 02: Parallel Architecture

Vector Programming Model

•  Scalar registers: r0 ... r15.
•  Vector registers: v0 ... v15, each holding elements [0], [1], ..., [VLRMAX-1]; the Vector Length Register (VLR) gives the number of active elements.
•  Vector arithmetic instructions, e.g., ADDV v3, v1, v2: element-wise addition over elements [0] ... [VLR-1].
•  Vector load and store instructions, e.g., LV v1, (r1, r2): load vector register v1 from memory, with the base address in r1 and the stride in r2 (a scalar sketch of this follows below).
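What the strided load LV v1, (r1, r2) does can be written out in scalar C. A minimal sketch, added for illustration; the function and parameter names are mine, and the stride is expressed in elements rather than bytes.

    #include <stdio.h>

    /* Gather vlr elements from memory into a vector-register image:
     * v1[i] = base[i * stride], i = 0 .. vlr-1. */
    void vector_load_strided(double *v1, const double *base, long stride, int vlr) {
        for (int i = 0; i < vlr; i++)
            v1[i] = base[i * stride];
    }

    int main(void) {
        double mem[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        double v1[4];
        vector_load_strided(v1, mem, 2, 4);    /* loads elements 0, 2, 4, 6 */
        printf("%g %g %g %g\n", v1[0], v1[1], v1[2], v1[3]);
        return 0;
    }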

Page 27: Lecture 02: Parallel Architecture

Vectors Were the Supercomputers

•  Epitome: Cray-1, 1976
•  Scalar unit
   –  Load/store architecture
•  Vector extension
   –  Vector registers
   –  Vector instructions
•  Implementation
   –  Hardwired control
   –  Highly pipelined functional units
   –  Interleaved memory system
   –  No data caches
   –  No virtual memory

Page 28: Lecture 02: Parallel Architecture

AXPY (64 elements) (Y = a*X + Y) in MIPS and VMIPS

    for (i=0; i<64; i++)
        Y[i] = a * X[i] + Y[i];

•  Number of instructions: 6 (VMIPS) vs. ~600 (MIPS)
•  Pipeline stalls: 64x higher for MIPS
•  Vector chaining (forwarding) among V1, V2, V3, and V4

The starting addresses of X and Y are in Rx and Ry, respectively.

Page 29: Lecture 02: Parallel Architecture

SIMD Instructions

•  Originally developed for multimedia applications.
•  The same operation is executed on multiple data items.
•  Uses a fixed-length register and partitions the carry chain to allow utilizing the same functional unit for multiple operations.
   –  E.g., a 64-bit adder can be utilized for two 32-bit add operations simultaneously (sketched below).
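The partitioned-carry-chain idea can be mimicked in plain C. A minimal, hypothetical sketch (not from the slide): two 32-bit lanes are packed into one 64-bit word and both sums are produced by a single 64-bit add. Hardware SIMD cuts the carry chain at the lane boundary; in software we simply choose operands whose lane sums do not overflow 32 bits, so no carry crosses the boundary.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t a0 = 100, a1 = 7, b0 = 23, b1 = 5;

        uint64_t a = ((uint64_t)a1 << 32) | a0;    /* pack two 32-bit lanes */
        uint64_t b = ((uint64_t)b1 << 32) | b0;

        uint64_t sum = a + b;                      /* one add, two results  */

        printf("lane0 = %u, lane1 = %u\n",
               (uint32_t)sum, (uint32_t)(sum >> 32));   /* 123 and 12 */
        return 0;
    }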

Page 30: Lecture 02: Parallel Architecture

SIMD Instructions

•  MMX (Multi-Media Extension), 1996
   –  The existing 64-bit floating-point registers could be used for eight 8-bit operations or four 16-bit operations.
•  SSE (Streaming SIMD Extension), 1999
   –  Successor to the MMX instructions.
   –  Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations.
•  SSE2 (2001), SSE3 (2004), SSE4 (2007)
   –  Added support for double-precision operations.
•  AVX (Advanced Vector Extensions), 2010
   –  256-bit registers added.

Page 31: Lecture 02: Parallel Architecture

AXPY

    for (i=0; i<64; i++)
        Y[i] = a * X[i] + Y[i];

•  256-bit SIMD extensions: 4 double-precision FP operations per instruction (an intrinsics sketch follows below)
•  MIPS: 578 instructions
•  SIMD MIPS: 149 instructions (about a 4x reduction)
•  VMIPS: 6 instructions (about a 100x reduction)
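A minimal AVX intrinsics sketch of the same AXPY loop, added for illustration and assuming an AVX-capable x86 CPU and a compiler flag such as -mavx; the function name axpy_avx is mine. Each iteration processes 4 doubles in a 256-bit register, matching the slide's 4-wide double-precision case.

    #include <immintrin.h>   /* AVX intrinsics */

    #define N 64

    void axpy_avx(double a, const double *X, double *Y) {
        __m256d va = _mm256_set1_pd(a);            /* broadcast scalar a  */
        for (int i = 0; i < N; i += 4) {
            __m256d vx = _mm256_loadu_pd(&X[i]);   /* load 4 doubles of X */
            __m256d vy = _mm256_loadu_pd(&Y[i]);   /* load 4 doubles of Y */
            vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
            _mm256_storeu_pd(&Y[i], vy);           /* store 4 results     */
        }
    }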

Page 32: Lecture 02: Parallel Architecture

State of the Art: Intel Xeon Phi Manycore Vector Capability

•  Intel Xeon Phi Knights Corner, 2012, ~60 cores, 4-way SMT
•  Intel Xeon Phi Knights Landing, 2016, ~60 cores, 4-way SMT and HBM
   –  http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf

http://primeurmagazine.com/repository/PrimeurMagazine-AE-PR-12-14-32.pdf

Page 33: Lecture 02: Parallel Architecture

State of the Art: ARM Scalable Vector Extension (SVE)

•  Announced in August 2016
   –  https://community.arm.com/groups/processors/blog/2016/08/22/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
   –  http://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.131-ARMv8-vector-Stephens-Yoshida-ARM-v8-23_51-v11.pdf
•  Goes beyond the vector architecture we learned:
   –  Vector loops, predication, and speculation
   –  Vector Length Agnostic (VLA) programming
   –  Check the slides

Page 34: Lecture 02: Parallel Architecture

Limitations of Optimizing a Single Instruction Stream

•  Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously, due to
   –  data dependencies
   –  limitations of speculative execution across multiple branches
   –  difficulties in detecting memory dependencies among instructions (alias analysis)
•  Consequence: a significant number of functional units are idling at any given time.
•  Question: can we perhaps execute instructions from another instruction stream?
   –  Another thread?
   –  Another process?

Page 35: Lecture 02: Parallel Architecture

Thread-Level Parallelism

•  Problems for executing instructions from multiple threads at the same time
   –  The instructions in each thread might use the same register names.
   –  Each thread has its own program counter.
•  Virtual memory management allows for the execution of multiple threads and sharing of the main memory.
•  When to switch between different threads:
   –  Fine-grain multithreading: switches between threads on every instruction.
   –  Coarse-grain multithreading: switches only on costly stalls (e.g., level-2 cache misses).

(A multi-threaded version of the AXPY loop is sketched below.)
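A minimal sketch of thread-level parallelism in source code, added for illustration and assuming OpenMP support (e.g., gcc -fopenmp); using OpenMP here is my choice, not the slide's. Each thread executes its own instruction stream, with its own program counter and register state, over a disjoint chunk of the iterations.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000
    static double X[N], Y[N];

    int main(void) {
        double a = 2.0;
        for (int i = 0; i < N; i++) { X[i] = i; Y[i] = 1.0; }

        /* The runtime splits the iteration space across threads (TLP). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            Y[i] = a * X[i] + Y[i];

        printf("Y[10] = %f (threads available: %d)\n", Y[10], omp_get_max_threads());
        return 0;
    }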

Page 36: Lecture 02: Parallel Architecture

Convert Thread-Level Parallelism to Instruction-Level Parallelism

[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, and Simultaneous Multithreading execution; slots are filled by instructions from Thread 1 through Thread 5 or left idle.]

Page 37: Lecture 02: Parallel Architecture

ILP to Do TLP: e.g., Simultaneous Multi-Threading (SMT)

•  Works well if
   –  The number of compute-intensive threads does not exceed the number of threads supported by SMT.
   –  The threads have highly different characteristics (e.g., one thread doing mostly integer operations, another mainly doing floating-point operations).
•  Does not work well if
   –  Threads try to utilize the same functional units.
   –  E.g., on a dual-processor system where each processor supports 2 threads simultaneously (the OS thinks there are 4 processors), 2 compute-intensive application processes might end up on the same processor instead of on different processors (the OS does not see the difference between SMT contexts and real processors!).

Page 38: Lecture 02: Parallel Architecture

Power, Frequency, and ILP

•  CPU frequency increase flattened around 2000-2005, for two main reasons:
   1.  Limited ILP
   2.  Power consumption and heat dissipation

Note: even Moore's Law is ending around 2021:
   http://spectrum.ieee.org/semiconductors/devices/transistors-could-stop-shrinking-in-2021
   https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
   http://www.forbes.com/sites/timworstall/2016/07/26/economics-is-important-the-end-of-moores-law

Page 39: Lecture 02: Parallel Architecture

History: Past (2000) and Today

Page 40: Lecture 02: Parallel Architecture

Flynn's Taxonomy

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

[Figure: Flynn's taxonomy table, with checkmarks on the two categories discussed.]

Page 41: Lecture 02: Parallel Architecture

Examples of MIMD Machines

•  Symmetric Shared-Memory Multiprocessor (SMP)
   –  Multiple processors in a box with shared-memory communication
   –  Current multicore chips are like this
   –  Every processor runs a copy of the OS
•  Distributed/Non-uniform Shared-Memory Multiprocessor
   –  Multiple processors, each with local memory, connected by a general scalable network
   –  Extremely light "OS" on each node provides simple services (scheduling/synchronization)
   –  Network-accessible host for I/O
•  Cluster
   –  Many independent machines connected with a general network
   –  Communication through messages

[Figures: an SMP with processors sharing memory over a bus; a grid of processor/memory (P/M) nodes connected by a network, with a host for I/O.]

Page 42: Lecture 02: Parallel Architecture

Symmetric (Shared-Memory) Multiprocessors (SMP)

•  Small numbers of cores
   –  Typically eight or fewer, and no more than 32 in most cases
•  Share a single centralized memory that all processors have equal access to
   –  Hence the term symmetric.
•  All existing multicores are SMPs.
•  Also called uniform memory access (UMA) multiprocessors
   –  All processors have a uniform latency to memory.

Page 43: Lecture 02: Parallel Architecture

Centralized Shared-Memory System (I)

•  Multi-core processors
   –  Typically connected over a shared cache.
   –  Previous SMP systems were typically connected over the main memory.
•  Intel Xeon X7350 quad-core (Tigerton)
   –  Private L1 cache: 32 KB instruction, 32 KB data
   –  Shared L2 cache: 4 MB unified cache

[Figure: two pairs of cores, each core with a private L1 cache and each pair sharing an L2 cache, connected to a 1066 MHz front-side bus.]

Page 44: Lecture 02: Parallel Architecture

Centralized Shared-Memory System (SMP) (II)

•  Intel Xeon X7350 quad-core (Tigerton) multi-processor configuration

[Figure: four sockets (Socket 0 through Socket 3) holding cores C0 through C15, four cores per socket organized as two pairs that each share an L2 cache; all sockets connect over 8 GB/s links to a Memory Controller Hub (MCH), which connects to four memory banks.]

Page 45: Lecture 02: Parallel Architecture

Distributed Shared-Memory Multiprocessor

•  Large processor count
   –  64 to 1000s
•  Distributed memory
   –  Remote vs. local memory
   –  Long vs. short latency
   –  High vs. low latency
•  Interconnection network
   –  Bandwidth, topology, etc.
•  Non-uniform memory access (NUMA)
•  Each processor may have local I/O

Page 46: Lecture 02: Parallel Architecture

Distributed Shared-Memory Multiprocessor (NUMA)

•  Reduces the memory bottleneck compared to SMPs.
•  More difficult to program efficiently
   –  E.g., first-touch policy: a data item is placed in the memory of the processor that touches it first (see the sketch below).
•  To reduce the effects of non-uniform memory access, caches are often used.
   –  ccNUMA: cache-coherent non-uniform memory access architectures.
•  Largest example as of today: SGI Origin with 512 processors.

Page 47: Lecture 02: Parallel Architecture

Shared-Memory Multiprocessor

•  SMP and DSM are both shared-memory multiprocessors
   –  UMA or NUMA
•  Multicores are SMP shared memory.
•  Most multi-CPU machines are DSM
   –  NUMA
•  Shared address space (virtual address space)
   –  Not always shared memory

Page 48: Lecture 02: Parallel Architecture

Current Trends in Computer Architecture

•  Cannot continue to leverage ILP
   –  Single-processor performance improvement ended in 2003.
•  Current models for performance:
   –  Exploit Data-Level Parallelism (DLP) via SIMD architectures (vector, SIMD extensions, and GPUs)
   –  Exploit Thread-Level Parallelism (TLP) via MIMD
   –  Heterogeneity: integrate multiple and different architectures together at the chip/system level
•  Emerging architectures
   –  Domain-specific architectures: deep-learning processing units (e.g., TPU)
   –  E.g., "Machine Learning Pulls Processor Architectures onto New Path"
      https://www.top500.org/news/machine-learning-pulls-processor-architectures-onto-new-path/

These require explicit restructuring of the application ← Parallel Programming

Page 49: Lecture 02: Parallel Architecture

The "Future" of Moore's Law

•  The chips are down for Moore's law
   –  http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
•  Special Report: 50 Years of Moore's Law
   –  http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
•  Moore's law really is dead this time
   –  http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
•  Rebooting the IT Revolution: A Call to Action (SIA/SRC, 2015)
   –  https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf