CS 61C: Great Ideas in Computer Architecture Lecture 18...
Transcript of CS 61C: Great Ideas in Computer Architecture Lecture 18...
![Page 1: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/1.jpg)
CS61C:GreatIdeasinComputerArchitecture
Lecture18:ParallelProcessing– SIMD
KrsteAsanović &RandyKatz
http://inst.eecs.berkeley.edu/~cs61c
![Page 2: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/2.jpg)
61CSurvey
Itwouldbenicetohaveareviewlectureeveryonceinawhile,actuallyshowingus
howthingsfitinthebiggerpicture
CS61c Lecture18:ParallelProcessing- SIMD 2
![Page 3: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/3.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 3
![Page 4: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/4.jpg)
61CTopicssofar…• Whatwelearned:
1. Binarynumbers2. C3. Pointers4. Assemblylanguage5. Datapath architecture6. Pipelining7. Caches8. Performanceevaluation9. Floatingpoint
• Whatdoesthisbuyus?− Promise:executionspeed− Let’scheck!
CS61c Lecture18:ParallelProcessing- SIMD 4
![Page 5: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/5.jpg)
ReferenceProblem•Matrixmultiplication
−Basicoperationinmanyengineering,data,andimagingprocessingtasks
− Imagefiltering,noisereduction,…−Manycloselyrelatedoperations
§ E.g.stereovision(project4)•dgemm
−double-precisionfloating-pointmatrixmultiplication
CS61c Lecture18:ParallelProcessing- SIMD 5
![Page 6: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/6.jpg)
ApplicationExample:DeepLearning• Imageclassification(cats…)•Pick“best”vacationphotos•Machinetranslation•Cleanupaccent•Fingerprintverification•Automaticgameplaying
CS61c Lecture18:ParallelProcessing- SIMD 6
![Page 7: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/7.jpg)
Matrices
CS61c Lecture18:ParallelProcessing- SIMD 7
𝑖
𝑗N-1
N-1
0
0
• SquarematrixofdimensionNxN
![Page 8: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/8.jpg)
MatrixMultiplication
CS61c 8
𝑖
𝑗
𝑘
𝑘C=A*BCij =Σk (Aik *Bkj)
A
B
C
![Page 9: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/9.jpg)
Reference:Python• MatrixmultiplicationinPython
CS61c Lecture18:ParallelProcessing- SIMD 9
N Python[Mflops]32 5.4160 5.5480 5.4960 5.3
• 1MFLOP=1Millionfloating-pointoperationspersecond(fadd,fmul)
• dgemm(N…)takes2*N3 flops
![Page 10: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/10.jpg)
C• c=a*b• a,b,careNxNmatrices
CS61c Lecture18:ParallelProcessing- SIMD 10
![Page 11: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/11.jpg)
TimingProgramExecution
CS61c Lecture18:ParallelProcessing- SIMD 11
![Page 12: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/12.jpg)
CversusPython
CS61c Lecture18:ParallelProcessing- SIMD 12
N C[GFLOPS] Python[GFLOPS]32 1.30 0.0054160 1.30 0.0055480 1.32 0.0054960 0.91 0.0053
Whichclassgivesyouthiskindofpower?Wecouldstophere…butwhy?Let’sdobetter!
240x!
![Page 13: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/13.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 13
![Page 14: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/14.jpg)
WhyParallelProcessing?• CPUClockRatesarenolongerincreasing
−Technical&economicchallenges§ Advancedcoolingtechnologytooexpensiveorimpracticalformostapplications
§ Energycostsareprohibitive
• Parallelprocessingisonlypathtohigherspeed−Compareairlines:
§ Maximumspeedlimitedbyspeedofsoundandeconomics§ Usemoreandlargerairplanestoincreasethroughput§ Andsmallerseats…
CS61c Lecture18:ParallelProcessing- SIMD 14
![Page 15: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/15.jpg)
UsingParallelismforPerformance• Twobasicways:
−Multiprogramming§ runmultipleindependentprogramsinparallel§ “Easy”
−Parallelcomputing§ runoneprogramfaster§ “Hard”
•We’llfocusonparallelcomputinginthenextfewlectures
15CS61c Lecture18:ParallelProcessing- SIMD
![Page 16: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/16.jpg)
New-SchoolMachineStructures(It’sabitmorecomplicated!)
• ParallelRequestsAssignedtocomputere.g.,Search“Katz”• ParallelThreadsAssignedtocoree.g.,Lookup,Ads• ParallelInstructions>[email protected].,5pipelinedinstructions• ParallelData>[email protected].,Addof4pairsofwords• HardwaredescriptionsAllgates@onetime• ProgrammingLanguages
16
SmartPhone
WarehouseScale
Computer
SoftwareHardware
HarnessParallelism&AchieveHighPerformance
LogicGates
Core Core…Memory(Cache)Input/Output
Computer
CacheMemory
CoreInstructionUnit(s) Functional
Unit(s)A3+B3A2+B2A1+B1A0+B0
Today’sLecture
![Page 17: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/17.jpg)
Single-Instruction/Single-DataStream(SISD)
• Sequentialcomputerthatexploitsnoparallelismineithertheinstructionordatastreams.ExamplesofSISDarchitecturearetraditionaluniprocessormachines
E.g.ourtrustedRISC-Vpipeline
17
ProcessingUnit
CS61c Lecture18:ParallelProcessing- SIMD
Thisiswhatwediduptonowin61C
![Page 18: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/18.jpg)
Single-Instruction/Multiple-DataStream(SIMDor“sim-dee”)
• SIMDcomputerexploitsmultipledatastreamsagainstasingleinstructionstreamtooperationsthatmaybenaturallyparallelized,e.g.,IntelSIMDinstructionextensionsorNVIDIAGraphicsProcessingUnit(GPU)
18CS61c Lecture18:ParallelProcessing- SIMD
Today’stopic.
![Page 19: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/19.jpg)
Multiple-Instruction/Multiple-DataStreams(MIMDor“mim-dee”)
• Multipleautonomousprocessorssimultaneouslyexecutingdifferentinstructionsondifferentdata.• MIMDarchitecturesincludemulticoreandWarehouse-ScaleComputers
19
InstructionPool
PU
PU
PU
PU
DataPoo
l
CS61c Lecture18:ParallelProcessing- SIMD
TopicofLecture19andbeyond.
![Page 20: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/20.jpg)
Multiple-Instruction/Single-DataStream(MISD)
• Multiple-Instruction,Single-Datastreamcomputerthatexploitsmultipleinstructionstreamsagainstasingledatastream.• Historicalsignificance
20CS61c Lecture18:ParallelProcessing- SIMD
Thishasfewapplications.Notcoveredin61C.
![Page 21: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/21.jpg)
Flynn*Taxonomy,1966
• SIMDandMIMDarecurrentlythemostcommonparallelisminarchitectures– usuallybothinsamesystem!
• Mostcommonparallelprocessingprogrammingstyle:SingleProgramMultipleData(“SPMD”)− SingleprogramthatrunsonallprocessorsofaMIMD− Cross-processorexecutioncoordinationusingsynchronizationprimitives
21CS61c Lecture18:ParallelProcessing- SIMD
![Page 22: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/22.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 22
![Page 23: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/23.jpg)
SIMD– “SingleInstructionMultipleData”
23CS61c Lecture18:ParallelProcessing- SIMD
![Page 24: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/24.jpg)
SIMDApplications&Implementations• Applications
− Scientificcomputing§ Matlab,NumPy
− Graphicsandvideoprocessing§ Photoshop,…
− BigData§ Deeplearning
− Gaming− …
• Implementations− x86− ARM− RISC-Vvectorextensions
CS61c Lecture18:ParallelProcessing- SIMD 24
![Page 25: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/25.jpg)
25
FirstSIMDExtensions:MITLincolnLabsTX-2,1957
CS61c
![Page 26: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/26.jpg)
x86SIMDEvolution
CS61c Lecture18:ParallelProcessing- SIMD 26
http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf
• Newinstructions• New,wider,moreregisters• Moreparallelism
![Page 27: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/27.jpg)
LaptopCPUSpecs$ sysctl -a | grep cpuhw.physicalcpu: 2hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS
CS61c Lecture18:ParallelProcessing- SIMD 27
![Page 28: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/28.jpg)
SIMDRegisters
CS61c Lecture18:ParallelProcessing- SIMD 28
![Page 29: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/29.jpg)
SIMDDataTypes
CS61c Lecture18:ParallelProcessing- SIMD 29
(NowalsoAVX-512available)
![Page 30: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/30.jpg)
SIMDVectorMode
CS61c Lecture18:ParallelProcessing- SIMD 30
![Page 31: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/31.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 31
![Page 32: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/32.jpg)
Problem• Today’scompilers(largely)donotgenerateSIMDcode• Backtoassembly…• x86(notusingRISC-Vasnovectorhardwareyet)
−Over1000instructionstolearn…−GreenBook
• Canweusethecompilertogenerateallnon-SIMDinstructions?
CS61c Lecture18:ParallelProcessing- SIMD 32
![Page 33: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/33.jpg)
x86IntrinsicsAVXDataTypes
CS61c Lecture18:ParallelProcessing- SIMD 33
Intrinsics: Directaccesstoregisters&assemblyfromCRegister
![Page 34: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/34.jpg)
IntrinsicsAVXCodeNomenclature
CS61c Lecture18:ParallelProcessing- SIMD 34
![Page 35: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/35.jpg)
https://software.intel.com/sites/landingpage/IntrinsicsGuide/CS61c Lecture18:ParallelProcessing- SIMD 35
4parallelmultiplies
2instructionsperclockcycle(CPI=0.5)
assemblyinstruction
x86SIMD“Intrinsics”
![Page 36: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/36.jpg)
RawDouble-PrecisionThroughput
Characteristic Value
CPU i7-5557U
Clockrate(sustained) 3.1GHz
Instructions perclock(mul_pd) 2
Parallel multipliesperinstruction 4
PeakdoubleFLOPS 24.8GFLOPS
CS61c Lecture18:ParallelProcessing- SIMD 36
Actualperformanceislowerbecauseofoverheadhttps://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
![Page 37: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/37.jpg)
VectorizedMatrixMultiplication
CS61c 37
𝑖
𝑗
𝑘
𝑘InnerLoop:
fori …;i+=4forj...
i+=4
![Page 38: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/38.jpg)
“Vectorized”dgemm
CS61c Lecture18:ParallelProcessing- SIMD 38
![Page 39: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/39.jpg)
Performance
NGflops
scalar avx32 1.30 4.56160 1.30 5.47480 1.32 5.27960 0.91 3.64
CS61c Lecture18:ParallelProcessing- SIMD 39
• 4xfaster• Butstill<<theoretical25GFLOPS!
![Page 40: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/40.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 40
![Page 41: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/41.jpg)
AtriptoLA
Get toSFO&check-in SFOà LAX Getto destination3hours 1hour 3 hours
CS61c Lecture18:ParallelProcessing- SIMD 41
Commercialairline:
Supersonicaircraft:
Get toSFO&check-in SFOà LAX Getto destination3hours 6min 3 hours
Totaltime:7hours
Totaltime:6.1hoursSpeedup:
Flyingtime Sflight =60/6=10xTriptime Strip =7/6.1=1.15x
![Page 42: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/42.jpg)
Amdahl’sLaw• GetenhancementE foryournewPC− E.g.floating-pointrocketbooster• E− Speedsupsometask(e.g.arithmetic)byfactorSE− F isfractionofprogramthatusesthis”task”
CS61c Lecture18:ParallelProcessing- SIMD 42
1-F F
1-F F/ SE
ExecutionTime:
Speedup:
T0 (noE)TE (withE)
nospeedup speedupsection
Speedup=TO /TE =1/((1-F)+F/SE)
![Page 43: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/43.jpg)
BigIdea:Amdahl’sLaw
43
Partnotspedup Partspedup
Example: Theexecutiontimeofhalf ofaprogramcanbeacceleratedbyafactorof2.Whatistheprogramspeed-upoverall?
CS61c Lecture18:ParallelProcessing- SIMD
Speedup=TO /TE =1/((1-F)+F/SE)
TO /TE =1/((1-F)+F/SE)=1/((1- 0.5)+0.5/2)=1.33<<2
![Page 44: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/44.jpg)
Maximum“Achievable”Speed-Up
44
Question: Whatisareasonable#ofparallelprocessorstospeedupanalgorithmwithF=95%?(i.e.19/20th canbespedup)
a)Maximumspeedup: Smax =1/(1-0.95)=20butneedsSE=∞!
b) Reasonable“engineering”compromise:Equaltimeinsequentialandparallelcode:(1-F)=F/SE
SE =F/(1-F)=0.95/0.05=20S=1/(0.05+0.05)=10
CS61c
Speedup=1/((1-F)+F/SE)(SE->∞)=1/(1-F)
![Page 45: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/45.jpg)
45
Iftheportionoftheprogramthatcanbeparallelizedissmall,thenthespeedupislimited
Inthisregion,thesequentialportionlimitstheperformance
500processorsfor19x
20processorsfor10x
CS61c
![Page 46: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/46.jpg)
StrongandWeakScaling• Togetgoodspeeduponaparallelprocessorwhilekeepingtheproblemsizefixedisharderthangettinggoodspeedupbyincreasingthesizeoftheproblem.− Strongscaling:whenspeedupcanbeachievedonaparallelprocessorwithoutincreasingthesizeoftheproblem
− Weakscaling:whenspeedupisachievedonaparallelprocessorbyincreasingthesizeoftheproblemproportionallytotheincreaseinthenumberofprocessors
• Loadbalancingisanotherimportantfactor:everyprocessordoingsameamountofwork− Justoneunitwithtwicetheloadofotherscutsspeedupalmostinhalf
46CS61c Lecture18:ParallelProcessing- SIMD
![Page 47: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/47.jpg)
PeerInstruction
47
Supposeaprogramspends80%ofitstimeinasquare-rootroutine.Howmuchmustyouspeedupsquareroottomaketheprogramrun5timesfaster?
CS61c Lecture18:ParallelProcessing- SIMD
Speedup=1/((1-F)+F/SE)RED: 5
GREEN: 20ORANGE: 500
Noneofabove
![Page 48: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/48.jpg)
Administrivia• Midterm2isTuesdayduringclasstime!
− ReviewSession:Saturday10-12pm@HearstAnnexA1−Onedouble-sidedcribsheet;we’llprovidetheRISCVcard− BringyourID!We’llcheckit− RoomassignmentspostedonPiazza
• Project2pre-labdueFriday• HW4due11/3,butadvisabletofinishbythemidterm• Project2-2due11/6,butalsogoodtofinishbeforeMT2
48CS61c Lecture19:ThreadLevalParallelProcessing
![Page 49: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/49.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 49
![Page 50: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/50.jpg)
Amdahl’sLawappliedtodgemm• Measureddgemm performance
− Peak 5.5GFLOPS− Largematrices 3.6GFLOPS− Processor 24.8GFLOPS
• Whyarewenotgetting(closeto)25GFLOPS?− Somethingelse(notfloating-pointALU)islimitingperformance!− Butwhat?Possibleculprits:
§ Cache§ Hazards§ Let’slookatboth!
CS61c Lecture18:ParallelProcessing- SIMD 50
![Page 51: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/51.jpg)
PipelineHazards– dgemm
CS61c Lecture18:ParallelProcessing- SIMD 51
![Page 52: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/52.jpg)
LoopUnrolling
CS61c Lecture18:ParallelProcessing- SIMD 52
Compilerdoestheunrolling
Howdoyouverifythatthegeneratedcodeisactuallyunrolled?
4registers
![Page 53: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/53.jpg)
Performance
NGflops
scalar avx unroll32 1.30 4.56 12.95160 1.30 5.47 19.70480 1.32 5.27 14.50960 0.91 3.64 6.91
CS61c Lecture18:ParallelProcessing- SIMD 53
![Page 54: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/54.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 54
![Page 55: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/55.jpg)
FPUversusMemoryAccess• Howmanyfloating-pointoperationsdoesmatrixmultiplytake?− F=2xN3 (N3 multiplies,N3 adds)
• Howmanymemoryload/stores?− M=3xN2 (forA,B,C)
• Manymorefloating-pointoperationsthanmemoryaccesses− q=F/M=2/3*N− Good,sincearithmeticisfasterthanmemoryaccess− Let’scheckthecode…
CS61c Lecture18:ParallelProcessing- SIMD 55
![Page 56: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/56.jpg)
Butmemoryisaccessedrepeatedly
• q=F/M=1!(2loadsand2floating-pointoperations)
CS61c Lecture18:ParallelProcessing- SIMD 56
Innerloop:
![Page 57: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/57.jpg)
CS61c Lecture18:ParallelProcessing- SIMD 57
Second-LevelCache(SRAM)
TypicalMemoryHierarchyControl
Datapath
SecondaryMemory(Disk
OrFlash)
On-ChipComponents
RegFile
MainMemory(DRAM)Data
CacheInstrCache
Speed(cycles):½’s 1’s 10’s 100’s-10001,000,000’sSize(bytes): 100’s 10K’sM’sG’sT’s
• Wherearetheoperands(A,B,C)stored?• WhathappensasNincreases?• Idea:arrangethatmostaccessesaretofastcache!
Cost/bit:highest lowest
Third-LevelCache(SRAM)
![Page 58: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/58.jpg)
Sub-MatrixMultiplicationor:BeatingAmdahl’sLaw
CS61c 58
![Page 59: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/59.jpg)
Blocking• Idea:
− Rearrangecodetousevaluesloadedincachemanytimes−Only“few”accessestoslowmainmemory(DRAM)perfloatingpointoperation
−à throughputlimitedbyFPhardwareandcache,notslowDRAM
− P&H,RISC-Veditionp.465
CS61c Lecture18:ParallelProcessing- SIMD 59
![Page 60: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/60.jpg)
MemoryAccessBlocking
CS61c Lecture18:ParallelProcessing- SIMD 60
![Page 61: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/61.jpg)
Performance
NGflops
scalar avx unroll blocking32 1.30 4.56 12.95 13.80160 1.30 5.47 19.70 21.79480 1.32 5.27 14.50 20.17960 0.91 3.64 6.91 15.82
CS61c Lecture18:ParallelProcessing- SIMD 61
![Page 62: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/62.jpg)
Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 62
![Page 63: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel](https://reader033.fdocuments.in/reader033/viewer/2022042205/5ea7aa4c9924e161f81b8a55/html5/thumbnails/63.jpg)
AndinConclusion,…• ApproachestoParallelism
− SISD,SIMD,MIMD(nextlecture)• SIMD
− Oneinstructionoperatesonmultipleoperandssimultaneously• Example:matrixmultiplication
− Floatingpointheavyà exploitMoore’slawtomakefast• Amdahl’sLaw:
− Serialsectionslimitspeedup− Cache
§ Blocking− Hazards
§ Loopunrolling
63CS61c Lecture18:ParallelProcessing- SIMD