GPGPU in HPC - UBC Blogs · blogs.ubc.ca/karthik/files/2016/11/SC_Talk-Justin.pdf · 2016-11-15

GPGPU in HPC
GPGPU
Scientific Applications

Soft Errors
[Figure: two bit patterns, = 0001 and = 0101, that differ in a single bit]
Bauman et al. [TDMR, 2005]

Traditional Method: DMR
Dual Modular Redundancy (DMR)
• Run 2 copies
• Compare for divergence
Too much energy consumption!
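The two DMR steps above can be sketched on the host side as follows. This is only an illustration in Python, not code from the talk: `run_with_dmr`, `DivergenceError`, and `scale` are hypothetical names, and the "computation" is a stand-in for a real kernel.

```python
from typing import Callable, List, Sequence


class DivergenceError(RuntimeError):
    """Raised when the two redundant runs disagree (hypothetical name)."""


def run_with_dmr(compute: Callable[[Sequence[float]], List[float]],
                 data: Sequence[float]) -> List[float]:
    """Run `compute` twice on the same input and compare the results."""
    first = compute(data)    # copy 1
    second = compute(data)   # copy 2
    if first != second:      # compare for divergence
        raise DivergenceError("redundant runs diverged")
    return first


def scale(values: Sequence[float]) -> List[float]:
    """Stand-in computation used only for this example."""
    return [2.0 * v for v in values]


result = run_with_dmr(scale, [1.0, 2.0])
```

Every input is processed twice here, which is the source of the energy cost the slide objects to.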

Software Solutions
Advantages
• No hardware modification
• Errors can be masked
• Allow selective protection
[Figure: system layers (Application Level, Operating System Level, Architectural Level, Device/Circuit Level) with an axis labeled Impact]

Challenges for GPGPU Resilience
• Different architecture and programming model from CPUs
• No scalable fault injection tools for HPC GPGPU applications

Our Contributions
LLFI-GPU: Scalable Fault Injector
Characterization of Error Propagation
Implications on Error Mitigations

Existing Publicly Available GPU Fault Injectors
• Hauberk [IPDPS, 2011]
  • Source code level fault injection
  • Not representative of hardware errors
• GPU-Qin [ISPASS, 2014]
  • Debugger-based
  • Execution is slow
• GPGPU-Sim based fault injector [DSN, 2015]
  • Not full system simulation

Goals of LLFI-GPU
• Native Speed
  • Program-level fault injection
  • Compile to binary
• Full system simulation
  • Execute on real hardware
  • Able to simulate different failure outcomes
• Representativeness
  • LLVM IR level fault injection
  • Close to assembly, yet preserves high-level program symbols

LLVM (Low Level Virtual Machine)
LLFI for CPU: https://github.com/DependableSystemsLab/LLFI

LLFI-GPU: Overview
[Figure: NVCC compilation pipeline (.cu files → LLVM IR → PTX Assembly → SASS Assembly), with the LLFI-GPU instrumentation passes]
…
R0 = add R1, R2
R0 = injectFault(R0)
R4 = mul R0, R3
…
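As a rough illustration of what the instrumented sequence above does, the sketch below replays the three instructions in Python with an optional single-bit flip on R0. It does not reflect LLFI-GPU's actual runtime: the `inject_fault` signature, the explicit `bit` argument, and the 32-bit register width are all assumptions made for this example.

```python
from typing import Optional

WIDTH = 32                   # assumed register width
MASK = (1 << WIDTH) - 1


def inject_fault(value: int, bit: int) -> int:
    """Flip one bit of a WIDTH-bit value (hypothetical helper)."""
    if not 0 <= bit < WIDTH:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & MASK


def run(r1: int, r2: int, r3: int, fault_bit: Optional[int] = None) -> int:
    """Replay R0 = add R1, R2; R0 = injectFault(R0); R4 = mul R0, R3."""
    r0 = (r1 + r2) & MASK
    if fault_bit is not None:      # the golden run skips the injection
        r0 = inject_fault(r0, fault_bit)
    r4 = (r0 * r3) & MASK
    return r4


golden = run(1, 0, 3)                # R0 = 0b0001, R4 = 3
faulty = run(1, 0, 3, fault_bit=2)   # R0 = 0b0101, R4 = 15
```

Because the injected value feeds the following `mul`, the corruption in R0 carries into R4, which is the kind of propagation the rest of the talk measures.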

Advantages of LLFI-GPU
• Compile on large GPGPU programs
• 1000x faster compared to GPU-Qin (MatrixMul)
• Representative simulation
• Full system simulation of soft errors
• Open-source
• https://github.com/DependableSystemsLab/LLFI-GPU

Experiment Setup: Nvidia K20
• 12 Benchmarks
  • Rodinia & Parboil suites
  • Lulesh (LLNL), Barnes-Hut (Texas State Univ.), Fiber (Northeastern Univ.), Circuit Solver (Rice Univ.) and NMF (UC Berkeley)
• Fault Injection
  • 10,000 fault injections per application (Error bar: 0.22%-2.99%, 95% confidence level)
• Fault Model
  • Single bit-flip
  • Transient faults in execution units

Failure Outcomes
• Silent Data Corruption (SDC)
  • Mismatch between the program outputs of the golden run and the fault injection run
• Crash
  • CUDA exceptions (e.g., illegal memory address)
  • Cause kernel execution to halt
• Benign
  • No effect on program output
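The three outcomes above amount to a small decision procedure over the golden run and the fault injection run. A minimal sketch, assuming the harness already knows whether the run crashed and has both outputs in memory; `Outcome` and `classify` are hypothetical names, not part of LLFI-GPU:

```python
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    SDC = "silent data corruption"
    CRASH = "crash"
    BENIGN = "benign"


def classify(golden_output: Sequence[float],
             faulty_output: Optional[Sequence[float]],
             crashed: bool) -> Outcome:
    """Classify one fault injection run against the golden run."""
    # A crash halts the kernel, so there may be no output to compare.
    if crashed or faulty_output is None:
        return Outcome.CRASH
    # The run finished but its output differs from the golden output.
    if list(faulty_output) != list(golden_output):
        return Outcome.SDC
    # The run finished and the output is unchanged.
    return Outcome.BENIGN
```

The crash check comes first on purpose: comparing outputs only makes sense for runs that completed.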

Our Contributions
LLFI-GPU: Scalable Fault Injector
Characterization of Error Propagation
Implications on Error Mitigations

Research Question 1
What is the percentage of SDCs in different memory states?

Memory State
Total Memory (TM)
Result Memory (RM)
Output Memory (OM)
…
cudaMalloc(M1)
cudaMalloc(M2)
cudaMemcpy(M1, …)
cudaMemcpy(M2, …)
…
Kernel<<<>>>, …
…
cudaMemcpy(…, M2)
…
Foreach(M2): if (ele > 0) { print(ele) }
…
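One way to read the pseudocode above, offered as an assumption rather than the authors' definition: TM is everything allocated on the device (M1 and M2), RM is the part copied back to the host (M2), and OM is the part of RM that the program actually prints (elements greater than 0). Under that reading, the three states can be sketched in Python as follows; `memory_states` is a hypothetical helper:

```python
from typing import List, Sequence, Tuple


def memory_states(m1: Sequence[float],
                  m2: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Return (TM, RM, OM) under the assumed reading of the slide."""
    total_memory = list(m1) + list(m2)   # all device allocations
    result_memory = list(m2)             # buffer copied back to the host
    output_memory = [ele for ele in result_memory if ele > 0]  # printed values
    return total_memory, result_memory, output_memory


tm, rm, om = memory_states([1.0, 2.0], [-1.0, 3.0])
```

In this reading each state is a subset of the previous one, so a fault can corrupt TM without ever reaching RM or OM.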

SDC in Different Memory States

Size of States

| | bfs | barneshut | nmf |
|---|---|---|---|
| RM | 14.29% | 37.50% | 0.03% |
| TM | 100% | 100% | 100% |

Average size of RM in all benchmarks: 13.56%

SDC of States

| | bfs | barneshut | nmf |
|---|---|---|---|
| SDC(TM) - SDC(RM) | 0.00% | 0.20% | 0.00% |

Average of SDC(TM - RM) in all benchmarks: 0.09%

Checking RM reduces ~86% overhead while retaining coverage
Most of the faults in TM propagate to RM
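The ~86% figure follows from the average RM size if checking cost is assumed to grow in proportion to the amount of state checked. A short worked version of that arithmetic; `overhead_reduction` is a hypothetical helper written for this example:

```python
def overhead_reduction(rm_fraction_of_tm: float) -> float:
    """Fraction of checking work saved by checking RM instead of all of TM.

    Assumes the cost of a checker is proportional to the state it covers.
    """
    if not 0.0 <= rm_fraction_of_tm <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return 1.0 - rm_fraction_of_tm


# Average RM size across the benchmarks, as a fraction of TM (from the slide).
saving = overhead_reduction(0.1356)
print(f"{saving:.2%}")  # prints 86.44%
```

The price of that saving is the SDC coverage that only TM checking would provide, which the slide reports as 0.09% on average.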

Example of Checkers
• Check value range of particular states
  • Calculating angle: if (angle > 60 or angle < 0) { error detected }
• Overhead is directly proportional to the number of states checked
  • Checking RM reduces ~86% overhead
  • Small loss of coverage
Pattabiraman et al. [TDSC, 2011]; Hari et al. [DSN, 2012]
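The angle example above can be written as a small range checker. This sketch keeps the slide's condition and bounds (0 and 60); `is_error` and `find_violations` are hypothetical names, and scanning a plain list stands in for scanning a region of memory:

```python
from typing import List, Sequence


def is_error(angle: float, low: float = 0.0, high: float = 60.0) -> bool:
    """Same test as the slide: error if angle > 60 or angle < 0."""
    return angle > high or angle < low


def find_violations(angles: Sequence[float]) -> List[int]:
    """Return the indices of all values that fail the range check."""
    return [i for i, angle in enumerate(angles) if is_error(angle)]


violations = find_violations([30.0, 75.0, -5.0, 60.0, 0.0])
```

The checker does one comparison pass over whatever it is given, which is why its cost scales with the number of states checked.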

Research Question 2
How long do errors take to propagate to the RM?

Metrics: Kernel Call
… to measure propagation time of error
[Figure: timeline of GPU Execution and CPU Execution showing Kernel 1, Kernel 2, Kernel 3, three Error Detectors, and markers for Error Occurred and Error Detected]
Error detection latency is 2 (measured in kernel calls)

Tracking Error Propagation
…
Kernel1<<<>>>
DumpToDisk(TM)
DumpToDisk(RM)
DumpToDisk(OM)
Kernel2<<<>>>
…
Compared with golden copy for any data corruptions
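The comparison step above can be sketched as a diff between per-kernel snapshots of a golden run and a faulty run. The sketch keeps snapshots in memory instead of dumping them to disk, and it assumes latency is counted as the number of kernel calls between the injection and the first corrupted snapshot; all names and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

# One snapshot per kernel call, keyed by memory state name ("TM", "RM", ...).
Snapshot = Dict[str, Sequence[float]]


def first_corrupted_kernel(golden: List[Snapshot], faulty: List[Snapshot],
                           region: str) -> Optional[int]:
    """0-based index of the first kernel call after which `region` differs."""
    for k, (g, f) in enumerate(zip(golden, faulty)):
        if list(g[region]) != list(f[region]):
            return k
    return None  # the region never diverged from the golden copy


def propagation_latency(golden: List[Snapshot], faulty: List[Snapshot],
                        injected_kernel: int, region: str = "RM") -> Optional[int]:
    """Kernel calls between the injection and the first corruption of `region`."""
    k = first_corrupted_kernel(golden, faulty, region)
    return None if k is None else k - injected_kernel


# Made-up data: the fault hits TM in kernel 0 and reaches RM in kernel 2.
golden = [{"TM": [1, 2], "RM": [0]}, {"TM": [1, 2], "RM": [3]}, {"TM": [1, 2], "RM": [5]}]
faulty = [{"TM": [1, 9], "RM": [0]}, {"TM": [1, 9], "RM": [3]}, {"TM": [1, 9], "RM": [7]}]
latency = propagation_latency(golden, faulty, injected_kernel=0)
```

Running the same comparison with `region="TM"` shows when the corruption first becomes visible anywhere in device memory.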

Propagation Latency to RM
Checking RM provides short detection latency

Implications
• RM is a narrow tunnel through which faults frequently propagate
  • Checking RM for SDC is a better trade-off
• Crash-causing faults rarely propagate across kernel calls
  • Deploying high frequency checkpoints for GPGPU can avoid checkpoint corruptions
• Studied on 2 GPGPU platforms (Nvidia GTX 960 & Nvidia K20)
  • Results are statistically indistinguishable
• Also investigated error spread, masking, etc.
  • … more interesting findings can be found in the paper!

Summary
• Designed a scalable fault injector for GPGPUs: LLFI-GPU
• Characterized error propagation patterns in GPGPU applications
• Discussed their implications on error mitigation techniques
• Name: Guanpeng (Justin) Li
• Website: ece.ubc.ca/~gpli
• LLFI-GPU:
  • https://github.com/DependableSystemsLab/LLFI-GPU
• Results:
  • https://www.dropbox.com/s/xrvojidskkcrj4y/FI_data.xlsx?dl=0

Acknowledgements