Post on 23-Jul-2020
GPGPU in HPC
GPGPU
Scientific Applications
Soft Errors
0001 → 0101 (one flipped bit)
Baumann et al. [TDMR, 2005]
Traditional Method: DMR
Dual Modular Redundancy (DMR)
• Run 2 copies
• Compare for divergence
Too much energy consumption!
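The two DMR steps above (run 2 copies, compare for divergence) can be sketched on the host side as follows. This is only an illustration: `run_with_dmr` and `saxpy` are hypothetical names, and a plain Python function stands in for a GPU kernel.

```python
def run_with_dmr(compute, *args):
    """Run `compute` twice on the same inputs and compare the results."""
    first = compute(*args)
    second = compute(*args)  # the redundant copy: roughly doubles the work
    if first != second:
        raise RuntimeError("DMR mismatch: the two copies diverged")
    return first


def saxpy(a, xs, ys):
    # Stand-in for a GPU kernel: a * x + y, element-wise.
    return [a * x + y for x, y in zip(xs, ys)]


result = run_with_dmr(saxpy, 2, [1, 2, 3], [10, 20, 30])
```

Every protected computation executes twice, which is where the energy cost of DMR comes from.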
Software Solutions
Advantages
• No hardware modification
• Errors can be masked
• Allow selective protection
Application Level
Operating System Level
Architectural Level
Device/Circuit Level
Impact
Challenges for GPGPU Resilience
• Different architecture and programming model from CPUs
• No scalable fault injection tools for HPC GPGPU applications
Our Contributions
• LLFI-GPU: Scalable Fault Injector
• Characterization of Error Propagation
• Implications on Error Mitigations
Existing Publicly Available GPU Fault Injectors
• Hauberk [IPDPS, 2011]
  • Source code level fault injection
  • Not representative for hardware errors
• GPU-Qin [ISPASS, 2014]
  • Debugger-based
  • Execution is slow
• GPGPU-Sim based fault injector [DSN, 2015]
  • Not full system simulation
Goals of LLFI-GPU
• Native Speed
  • Program-level fault injection
  • Compile to binary
• Full system simulation
  • Execute on real hardware
  • Able to simulate different failure outcomes
• Representativeness
  • LLVM IR level fault injection
  • Close to assembly, yet preserve high-level program symbols
LLVM (Low Level Virtual Machine)
LLFI for CPU: https://github.com/DependableSystemsLab/LLFI
LLFI-GPU: Overview
NVCC compilation (figure): .cu files → LLVM IR → PTX Assembly → SASS Assembly
LLFI-GPU adds its instrumentation passes at the LLVM IR stage:
…
R0 = add R1, R2
R0 = injectFault(R0)
R4 = mul R0, R3
…
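The effect of such an instrumentation pass can be imitated on a toy instruction list. This is a sketch, not LLFI-GPU's actual pass: the tuple representation and the helper names `instrument` and `render` are assumptions made for illustration.

```python
def instrument(instructions, target_index):
    """Copy `instructions`, inserting an injectFault call on the destination
    register right after the instruction at `target_index`."""
    out = []
    for i, (dest, op, operands) in enumerate(instructions):
        out.append((dest, op, operands))
        if i == target_index:
            out.append((dest, "injectFault", (dest,)))
    return out


def render(instructions):
    """Format the toy instructions in the notation used on the slide."""
    lines = []
    for dest, op, operands in instructions:
        if op == "injectFault":
            lines.append(f"{dest} = injectFault({', '.join(operands)})")
        else:
            lines.append(f"{dest} = {op} {', '.join(operands)}")
    return lines


program = [("R0", "add", ("R1", "R2")), ("R4", "mul", ("R0", "R3"))]
instrumented = render(instrument(program, 0))
```

Because the call overwrites the same register, every later use of `R0` (here the `mul`) sees the possibly corrupted value.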
Advantages of LLFI-GPU
• Compile on large GPGPU programs
• 1000x faster compared to GPU-Qin (MatrixMul)
• Representative simulation
• Full system simulation of soft errors
• Open-source
• https://github.com/DependableSystemsLab/LLFI-GPU
Experiment Setup: Nvidia K20
• 12 Benchmarks
  • Rodinia & Parboil suites
  • Lulesh (LLNL), Barnes-Hut (Texas State Univ.), Fiber (Northeastern Univ.), Circuit Solver (Rice Univ.) and NMF (UC Berkeley)
• Fault Injection
  • 10,000 per application (Error bar: 0.22%-2.99%, 95% confidence level)
• Fault Model
  • Single bit-flip
  • Transient faults in execution units
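A single bit-flip on a 32-bit value can be sketched with an XOR mask. The function names are hypothetical and this is not LLFI-GPU's own injection routine; it only shows what the fault model does to a value.

```python
import struct


def flip_bit_int32(value, bit):
    """Flip one bit (0 = least significant) of a 32-bit unsigned integer."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    return (value ^ (1 << bit)) & 0xFFFFFFFF


def flip_bit_float32(value, bit):
    """Flip one bit in the IEEE 754 single-precision encoding of `value`."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    flipped = flip_bit_int32(as_int, bit)
    (as_float,) = struct.unpack("<f", struct.pack("<I", flipped))
    return as_float
```

For example, `flip_bit_int32(0b0001, 2)` gives `0b0101`, and flipping the sign bit of a float (`flip_bit_float32(1.0, 31)`) gives `-1.0`. The impact of a flip depends strongly on which bit is hit.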
Failure Outcomes
• Silent Data Corruption (SDC)
  • Mismatch in program outputs from golden run and fault injection run
• Crash
  • CUDA exceptions (e.g., illegal memory address)
  • Cause kernel execution to halt
• Benign
  • No effect on program output
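The three outcome categories can be expressed as a small classifier over one fault-injection run. This is a minimal sketch: the function names and the boolean `crashed` flag are assumptions, not part of the tool.

```python
def classify_outcome(golden_output, faulty_output, crashed):
    """Label one fault-injection run as Crash, SDC, or Benign."""
    if crashed:
        return "Crash"  # e.g. a CUDA exception halted the kernel
    if faulty_output != golden_output:
        return "SDC"    # run finished, but output differs from the golden run
    return "Benign"     # no effect on program output


def outcome_rates(outcomes):
    """Fraction of runs per category for a non-empty list of labels."""
    total = len(outcomes)
    return {label: outcomes.count(label) / total
            for label in ("SDC", "Crash", "Benign")}
```

The crash check comes first because a crashed run has no complete output to compare against the golden run.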
Research Question 1
What is the percentage of SDCs in different memory states?
Memory State
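Total Memory (TM): all memory allocated on the GPU (M1 and M2 in the example)
Result Memory (RM): memory copied back from the GPU to the CPU (M2)
Output Memory (OM): the part of the result memory that reaches program output (the printed elements)
…
cudaMalloc(M1)
cudaMalloc(M2)
cudaMemcpy(M1, …)
cudaMemcpy(M2, …)
…
Kernel<<<>>>, …
…
cudaMemcpy(…, M2)
…
Foreach(M2):
  if (ele > 0) { print(ele) }
…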
SDC in Different Memory States
Size of States (TM = 100%)
      bfs      barneshut  nmf
RM    14.29%   37.50%     0.03%
TM    100%     100%       100%
Average size of RM in all benchmarks: 13.56%

SDC of States
                   bfs     barneshut  nmf
SDC(TM) - SDC(RM)  0.00%   0.20%      0.00%
Average of SDC(TM-RM) in all benchmarks: 0.09%

Checking RM reduces ~86% overhead while retaining coverage
Most of the faults in TM propagate to RM
Example of Checkers
• Check value range of particular states
  • Calculating angle: if (angle > 60 or angle < 0) { error detected }
• Overhead is directly proportional to the number of states checked
  • Checking RM reduces ~86% overhead
  • Small loss of coverage
Pattabiraman et al. [TDSC, 2011] Hari et al. [DSN, 2012]
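The angle range check from the slide can be written as a small detector applied to a set of checked values. This is a sketch with hypothetical names; the `[0, 60]` bounds come from the slide's example.

```python
def check_angle(angle):
    """Return True when the angle lies in the expected range [0, 60]."""
    return 0 <= angle <= 60


def find_violations(values, checker):
    """Return the indices of all values rejected by `checker`.

    One check runs per value, so the cost grows linearly with the
    number of states that are checked.
    """
    return [i for i, v in enumerate(values) if not checker(v)]


violations = find_violations([10, 75, -3, 60], check_angle)
```

Here `violations` holds the positions of `75` and `-3`. Restricting the checked values to the result memory shrinks the list passed in, which is how the overhead saving arises.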
Research Question 2
How long do errors take to propagate to the RM?
Metrics: Kernel Call
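… to measure propagation time of error
Figure: GPU execution runs Kernel1, Kernel2 and Kernel3; on the CPU side, an error detector runs after each kernel call. The figure marks where the error occurred and where it was detected.
In the pictured example, the error detection latency is 2 (kernel calls).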
Tracking Error Propagation
…
Kernel1<<<>>>
DumpToDisk(TM)
DumpToDisk(RM)
DumpToDisk(OM)
Kernel2<<<>>>
…
Compared with golden copy for any data corruptions
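The comparison against the golden copy, and a latency measured in kernel calls, can be sketched as follows. Two assumptions are made here: the dumps are plain lists taken after each kernel call, and latency is the difference between the kernel-call index where corruption first appears and the index where the fault was injected. The function names are hypothetical.

```python
def first_corrupted_call(golden_dumps, faulty_dumps):
    """Return the 1-based kernel call after which the faulty dump first
    differs from the golden dump, or None if the dumps never differ."""
    pairs = zip(golden_dumps, faulty_dumps)
    for call, (golden, faulty) in enumerate(pairs, start=1):
        if golden != faulty:
            return call
    return None


def propagation_latency(injected_call, golden_dumps, faulty_dumps):
    """Kernel calls between fault injection and first visible corruption."""
    detected = first_corrupted_call(golden_dumps, faulty_dumps)
    if detected is None:
        return None  # the fault never reached the dumped state
    return detected - injected_call


# Dumps of one memory state (e.g. RM) after Kernel1, Kernel2, Kernel3.
golden = [[1, 2], [3, 4], [5, 6]]
faulty = [[1, 2], [3, 4], [5, 9]]
latency = propagation_latency(1, golden, faulty)
```

With a fault injected during the first kernel and corruption first visible after the third, `latency` is 2 under this counting convention.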
Propagation Latency to RM
Checking RM provides short detection latency
Implications
• RM is a narrow tunnel where faults frequently propagate through
  • Checking RM for SDC is a better trade-off
• Crash-causing faults rarely propagate across kernel calls
  • Deploying high frequency checkpoints for GPGPU can avoid checkpoint corruptions
• Studied on 2 GPGPU platforms (Nvidia GTX960 & Nvidia K20)
  • Results are statistically indistinguishable
• Investigated error spread & masking, etc.
  • … more interesting findings can be found in the paper!
Summary
• Designed a scalable fault injector for GPGPUs: LLFI-GPU
• Characterized error propagation patterns in GPGPU applications
• Discussed their implications on error mitigation techniques
• Name: Guanpeng (Justin) Li (gpli@ece.ubc.ca)
• Website: ece.ubc.ca/~gpli
• LLFI-GPU:
  • https://github.com/DependableSystemsLab/LLFI-GPU
• Results:
  • https://www.dropbox.com/s/xrvojidskkcrj4y/FI_data.xlsx?dl=0
Acknowledgements