Tolerating Hardware Faults in Commodity Software: Problems, Solutions, and a Roadmap

Karthik Pattabiraman (http://blogs.ubc.ca/karthik)
Electrical and Computer Engineering, University of British Columbia (UBC)
Motivation: Hardware Errors
• Errors are becoming more common in processors
  – Soft errors and device variations (timing errors)
  – Processors experience wear-out and thermal hotspots

Source: Shekhar Borkar (Intel), Stanford talk in 2005
Hardware Errors: Traditional Solutions
• Guard-banding: wastes power, as the gap between the average and worst case widens due to variations
• Duplication: hardware duplication (DMR) can result in 2X slowdown and/or energy consumption
An alternative approach
System stack: User → Application → Operating System → Architecture → Devices/Circuits. Software spans the application and operating system; hardware spans the architecture and devices/circuits. The user interacts with the application.
Key idea: allow errors to cross the hardware-software boundary, but make sure the user does not perceive them. Errors may propagate from the device/circuit level, through the architectural and operating-system levels, up to the application level.
Why do software techniques work?
Only a subset of errors is impactful. Software techniques leverage the properties of the application to provide targeted protection, only for the errors that matter to it. Targeted protection mechanisms can be applied at the device/circuit, architectural, operating-system, and application levels.
Outline
• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
EDCs: Soft Computing Applications
Ø Applications in machine learning, multimedia processing
Ø Expected to dominate future workloads [Dubey'07]

Original image (left) versus faulty image: JPEG decoder
EDCs: Egregious Data Corruptions
Ø Large or unacceptable deviation in output

EDC image (PSNR 11.37) vs. non-EDC image (PSNR 44.79)
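PSNR is the metric used above to separate egregious from negligible output deviations. A minimal sketch of the standard PSNR computation for 8-bit pixel data (this helper is illustrative, not the authors' implementation):

```python
import math

def psnr(ref, out, max_val=255):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)
```

A low PSNR (such as the 11.37 dB above) indicates a large, likely egregious, deviation; a high PSNR (44.79 dB) indicates a visually negligible one.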
EDCs: Goal
Ø Selectively detect EDC-causing faults, but not others

A fault during application execution leads to one of three outcomes: EDC, non-EDC, or benign; the detector should flag only the EDC outcomes.
EDCs: Fault model
• Transient hardware faults
  – Caused by particle strikes, supply noise
• Our fault model
  – Assume one fault per application execution
  – Processor registers and execution units
  – Memory and cache protected with ECC
  – Control logic protected with other methods
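Under this model, a fault is a single bit flip in a register or execution-unit value during one run. A toy injector in that spirit (this is not LLFI; the function and names are illustrative):

```python
import random

def flip_one_bit(value, width=64):
    """Inject a transient single-bit fault into a register-sized integer."""
    bit = random.randrange(width)  # uniformly choose a bit position
    return value ^ (1 << bit)      # flip exactly that bit
```

Injecting exactly one such flip per application execution and then classifying the run's outcome is the basis of the fault-injection campaigns described next.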
EDCs: Main Idea
Application data contains a small subset of critical data; it is this data that, when corrupted by hardware faults, produces EDCs.
Our prior work: EDCs are caused by corruption of a small fraction of program data [Flikker, ASPLOS'11]
This work: critical data can be identified using static and dynamic analysis, without any programmer annotations
EDCs: Initial Study (roadmap: Initial Study → Heuristic → Algorithm)
Ø Instrument code to monitor control/pointer data
Ø Fault injection
Ø Correlation between program data use & fault outcome

Performed using the LLFI fault injector [DSN'14], at the LLVM IR code level
EDCs: Initial Study
(Pie chart of fault-injection outcome breakdown: 6%, 43%, 23%, 28%.)
EDCs: Example Heuristic

void conv422to444(char *src, char *dst, int height, int width, int offset) {
  for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {
      im1 = (i < 1) ? 0 : i - 1;
      ...
    }
    if (j + 1 < offset) {
      src += w;
      dst += width;
    }
  }
}

High EDC likelihood: Ø fault in offset Ø flip of the branch it guards, which controls a large amount of data
Low EDC likelihood: faults in branches whose bodies touch little data

Heuristic: Faults affecting branches with large amounts of data within their bodies have a higher likelihood of resulting in EDC outcomes.
EDCs: Algorithm
The compiler takes the application source code (as LLVM IR), representative inputs, and a performance overhead target. An EDC ranking algorithm ranks the program's data, and a selection algorithm chooses the data variables or locations to protect within the overhead bound. Protection is applied via backward slice replication.
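The selection step can be viewed as a budgeted ranking problem: given per-variable EDC scores from the ranking algorithm and estimated protection costs from profiling, greedily pick the best score-per-cost variables until the overhead bound is hit. A sketch under that assumption (the paper's actual selection algorithm may differ):

```python
def select_to_protect(candidates, overhead_budget):
    """Greedy budgeted selection of data variables to protect.

    candidates: (name, edc_score, overhead_cost) tuples; the scores and
    costs are assumed to come from the ranking algorithm and profiling.
    """
    chosen, spent = [], 0.0
    # Consider the highest EDC score per unit of overhead first
    for name, score, cost in sorted(candidates, key=lambda c: c[1] / c[2],
                                    reverse=True):
        if spent + cost <= overhead_budget:
            chosen.append(name)
            spent += cost
    return chosen, spent
```

For example, with hypothetical candidates [("offset", 0.9, 2.0), ("dst", 0.6, 3.0), ("im1", 0.1, 5.0)] and a budget of 5.0, the sketch protects offset and dst and skips im1.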
EDCs: Detection Coverage
Average EDC coverage of 82% at 10% performance overhead (higher is better).

EDC Coverage = (Number of Detected EDCs) / (Total Number of EDCs)
EDCs: Selectivity
Average benign and non-EDC coverage of 10–15% for overheads from 10–25% (lower is better).
SDCTune: Silent Data Corruptions (SDCs)
When a fault occurs, the error is either masked (benign) or activated. An activated error can crash or hang the program (results lost), or the program finishes, producing either correct output or SDC output — a Silent Data Corruption.
SDCTune: Goals
• Protecting critical data in soft-computing applications from EDCs
  – Can we extend this to Silent Data Corruptions (SDCs) in general-purpose applications?
• Challenge:
  – Not feasible to identify SDCs based on the amount of data affected by the fault, as was the case with EDCs
  – Need a comprehensive model for predicting SDCs based on static and dynamic program features
SDCTune: Main Idea
• Start from Store and Cmp instructions and go backward through the program's data dependencies
• Use machine learning (CART) to predict the SDC proneness of Store and Cmp instructions
  – Extract the related features by static/dynamic analysis
  – Quantify the effects by classification and regression
  – Estimate SDC rates of different Store and Cmp instructions
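CART builds a tree of feature splits with a regression model at each leaf. A hand-rolled single split ("stump") shows the core decision; the real SDCTune model uses many features and deeper trees, and the feature used below is an illustrative assumption:

```python
def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(samples):
    """Pick the threshold on one feature that minimizes squared error.

    samples: (feature_value, observed_sdc_rate) pairs; the feature could
    be, say, the fan-out of a Store/Cmp instruction (assumed, for
    illustration).
    """
    best_t, best_err = None, float("inf")
    for t in sorted({x for x, _ in samples})[1:]:
        left = [y for x, y in samples if x < t]
        right = [y for x, y in samples if x >= t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

A full CART model recurses on each side of the chosen split and fits a regression (here, simply the leaf mean) at the leaves.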
SDCTune: Example Model
(Model fragment: values not used in masking operations are routed to a linear regression for SDC-proneness.)
SDCTune: Benchmarks
Figure 5: The shaded portion of (a) shows the instructions that need protection. (b) shows the duplicated instructions (the shaded nodes) and the detector inserted at the end of the two dependency chains. (c) shows one added instruction (node e') that concatenates the two dependency chains and saves one checker. Panels: (a) data dependency of detector-free code; (b) basic detector instrumented; (c) concatenated duplicated instructions.
(a) Detector-free code:

for (i = 0; ; i++) {
    // loop body
    flag = i < n ? 1 : 0;  // decompose exit predication to
                           // simulate instruction-level behaviour
    if (flag == 1)
        break;
}

(b) Basic detector instrumented:

i = 0;
dup_i = 0;          // duplication of i
for (;;) {
    // loop body (advances i and dup_i)
    flag = i < n ? 1 : 0;
    dup_flag = dup_i < n ? 1 : 0;
    if (flag != dup_flag)
        Assert();   // inconsistent
    if (flag == 1)
        break;
}

(c) Lazy checking applied:

i = 0;
dup_i = 0;          // duplication of i
for (;;) {
    // loop body (advances i and dup_i)
    flag = i < n ? 1 : 0;
    dup_flag = dup_i < n ? 1 : 0;
    if (flag == 1)
        break;
}
if (flag != dup_flag)
    Assert();       // inconsistent

Figure 6: (b) shows how the loop index i in the original code (a) is protected with the added check. (c) shows how we move the check out of the loop body. (Identifiers lost in extraction are reconstructed; n denotes the loop bound.)
5. Experimental Setup
In this section, we empirically evaluate SDCTune for configurable SDC protection through fault-injection experiments. All the experiments and evaluations are conducted on an Intel i7 4-core machine with 8 GB memory running Debian Linux. Section 5.1 presents the details of the benchmarks, and Section 5.2 presents our evaluation metrics. Section 5.3 presents our methodology and workflow for performing the experiments.

5.1 Benchmarks
We choose a total of 12 applications from a wide variety of domains for training and testing. They are from the SPEC benchmark suite [12], SPLASH2 benchmark suite [30], NAS parallel benchmark suite [1], PARSEC benchmark suite [2], and Parboil benchmark suite [27]. We divide the 12 applications into two groups of 6 applications each, one for training and the other for testing. The four benchmarks studied in Section 2.3 are incorporated in the training group. The details of these training and testing benchmarks are shown in Table 5 and Table 6, respectively. All the applications are compiled and linked into native executables with -O2 optimization flags and run in single-threaded mode.

5.2 Evaluation Metrics
Table 5: Training programs

Program     Description                    Benchmark suite   Input       Stores   Comparisons
IS          Integer sorting                NAS               default     21       20
LU          Linear algebra                 SPLASH2           test        41       110
Bzip2       Compression                    SPEC              test        681      646
Swaptions   Price portfolio of swaptions   PARSEC            Sim-large   36       101
Water       Molecular dynamics             SPLASH2           test        187      224
CG          Conjugate gradient method      NAS               default     32       97

Table 6: Testing programs

Program     Description                    Benchmark suite   Input       Stores   Comparisons
Lbm         Fluid dynamics                 Parboil           short       71       34
Gzip        Compression                    SPEC              test        251      399
Ocean       Large-scale ocean movements    SPLASH            test        322      813
Bfs         Breadth-first search           Parboil           1M          36       57
Mcf         Combinatorial optimization     SPEC              test        87       158
Libquantum  Quantum computing              SPEC              test        39       136

To gauge the accuracy of SDCTune, we use it for estimating the overall SDC rate of an application, as well as the SDC coverage
for different performance overhead bounds. The former is used for comparing the resilience of different applications, while the latter is used to insert detectors for configurable protection.

Estimation of overall SDC rates: We perform a random fault-injection experiment to determine the overall SDC rate of the application. We then compare the SDC rate obtained with SDCTune with that obtained from the fault-injection experiment. We also consider the relative SDC rate compared to other applications (i.e., its rank). We use the same experimental setup for fault injection as described in Section 2.3.

SDC coverage for different performance overhead bounds: We use SDCTune to predict the SDC coverage for different instructions to satisfy the performance overhead bounds provided by the user. We start with the most SDC-prone instructions and iteratively expand the set of instructions until the performance overhead bounds are met. We perform fault-injection experiments on the program instrumented with our detectors for these instructions, and measure the percentage of SDCs detected. We then compare our results with those of full duplication, i.e., when every instruction in the program is duplicated, and hot-path duplication, i.e., when the top 10% most-executed instructions are duplicated.

SDC detection efficiency: Similar to the efficiency defined in prior work [25], we define the SDC detection efficiency as the ratio between SDC coverage and performance overhead for a detection technique. We calculate the efficiency of each benchmark under a given performance overhead bound, and compare it with the efficiencies of full duplication and hot-path duplication. The SDC coverage of full duplication is assumed to be a hundred percent [23].

5.3 Work Flow and Implementation
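The efficiency metric is just a ratio; a tiny helper makes the comparison concrete (the numbers below are reported figures from this work, while the function itself is only a sketch):

```python
def sdc_detection_efficiency(sdc_coverage, perf_overhead):
    """SDC detection efficiency = SDC coverage / performance overhead."""
    return sdc_coverage / perf_overhead

# SDCTune on the training programs: 44.8% coverage at 10% overhead
tuned = sdc_detection_efficiency(0.448, 0.10)
# Full duplication: ~100% coverage assumed, at 53.7% overhead (lower bound)
full = sdc_detection_efficiency(1.0, 0.537)
```

Here tuned is roughly 4.5 versus about 1.9 for full duplication, which is the kind of gap the normalized-efficiency results quantify.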
Figure 7 shows the workflow for estimating the overall SDC rates and providing configurable protection using SDCTune.
SDCTune: Evaluation Method
Training phase: features extracted (based on heuristic knowledge) from the training programs, together with the P(SDC|I) measured per instruction on the training programs, are fed to CART training to build a P(SDC|I) predictor.
Evaluation phase: features extracted from the testing programs are used to estimate the SDC proneness of different program instructions; the estimates are validated against the actual SDC coverage from random fault-injection results on the testing programs.
Usage phase: find the set of instructions to protect for a given overhead bound (∑P(I)).
SDCTune: Model Validation

                    Training programs   Testing programs
Rank correlation*   0.9714              0.8286
P-value**           0.00694             0.0125

(Scatter plot: rank of overall SDC rates by estimation versus rank of overall SDC rates by fault-injection experiment, for training and testing programs.)
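The rank correlation reported above is Spearman's; a self-contained version of the computation (assuming no tied ranks, for brevity):

```python
def _ranks(xs):
    """1-based rank of each value (no ties assumed)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value of 0.9714 over the six training programs means the estimated SDC-rate ranking almost exactly matches the ranking measured by fault injection.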
SDCTune: SDC Coverage

Training programs:
Overhead   Coverage
10%        44.8%
20%        78.6%
30%        86.8%

Testing programs:
Overhead   Coverage
10%        39%
20%        63.7%
30%        74.9%
SDCTune: Full Duplication and Hot-Path Duplication Overheads
Full duplication overhead: 53.7% to 73.6%
Hot-path duplication overhead: 43.5% to 57.6%

Normalized detection efficiency   10% overhead   20% overhead   30% overhead
Training programs                 2.38           2.09           1.54
Testing programs                  2.87           2.34           1.84
EDCs and SDCTune: Summary
• Software-level techniques for tunable and selective protection from EDCs and SDCs [DSN'13][DSN'14][CASES'14][TECS1][TECS2]
• Completely automated – no programmer intervention or annotations are needed
• Significant efficiency gain over full duplication
Outline
• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
History of S/W techniques: Pre-2000
• Long history of software techniques for high-reliability systems, going back to IBM MVS and Tandem Guardian
  – Relied on architectural support from the hardware
  – Assumed software was written in a transactional style
• Algorithm-Based Fault Tolerance – 1984 [Huang and Abraham]: specialized applications in linear algebra
• Many control-flow checking techniques from the 1980s
  – Only protected the program's control-flow instructions
History of S/W techniques: 2000–2005
• Soft error problem [Sun Server – Baumann 2000]
• ARGOS project from Stanford (McCluskey, 2001)
  – EDDI – software-based instruction duplication
  – CFCSS – lightweight control-flow checking
• Reliability and Security Engine (RSE) from UIUC (2004)
  – Targeted checking of application properties at runtime
• SWIFT from Princeton (2005)
  – Low-overhead checking through compiler optimizations
• First SELSE workshop launched (2005)
  – Focus on entire systems spanning software and hardware
SELSE Papers (2009–2017)
(Chart: papers at SELSE 2009–2017 categorized as Software, Hardware, or Both; source: SELSE website.)
Data unavailable for years 2005 to 2008. Based on titles and abstracts only.
History of S/W techniques: 2010–today
• Cross-layer resilience becomes a buzzword: many groups, including ours, are working on this problem
• Multiple domains: HPC systems, embedded systems
• Calls from different funding agencies (DoE, NSF, etc.) for cross-layer resilience techniques – white papers
• Conjoined twin: approximate computing takes off
  – Papers at top PL/architecture conferences
Outline
• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
What about Software Researchers?
• Papers in the top software engineering/testing/reliability conferences about hardware faults and errors over the last 10 years (2006 onwards):
  – ICSE: 5 papers (IEEE DL)
  – FSE: 6 papers (ACM DL)
  – ASE: 7 papers (ACM DL)
  – ISSTA: 3 papers (ACM DL)
  – ICST: 2 papers (IEEE DL)
  – ISSRE: 10 papers (IEEE DL)
  – Total: 33 out of over 3000 papers (about 1%)
Example conversations with Software Developers in Industry
• Developer 1 (large s/w company you've heard of)
  – Me: How do you handle hardware faults?
  – D1: Do these even occur in the real world?
  – Me: (showing him data gathered by his own company on h/w faults)
  – D1: Hmmm… sounds like a problem for QA folks. We don't deal with faults.
• Tester 1 (large s/w-h/w company you've heard of)
  – Me: How do you handle hardware faults?
  – T1: Our hardware folks put in various mechanisms such as ECC memory to mask these
  – H/w guy: Not really, we don't handle everything
  – T1: Oh, well – that's not part of our requirements doc. Maybe if we meet our bug targets...
Software Developers
• Most software developers (and testers) ignore hardware faults, or assume faults will be handled by hardware (e.g., ECC memory)
• Even if they recognize the importance of the problem, many think it's not their problem
  – QA or testing people should take care of it
  – Not part of the requirements/specification document
Should we care about developers?
Ultimately, developers are the ones who drive adoption and assimilation within the broader software ecosystem.
Barriers to Adoption: Possible Reasons
• Reason 1: Software developers don't care about anything to do with hardware
• Reason 2: Too much time and effort – many other priorities in software development
• Reason 3: Lack of high-level abstractions
• Reason 4: No easy-to-use tools that integrate with the software development workflow
Barriers to Adoption: Reason 1
• Software engineers don't care about anything to do with hardware
• Not true. Many counter-examples:
  – Parallelism, both coarse- and fine-grained
  – Cache-conscious data structures and algorithms
  – Energy efficiency and energy-aware programming
  – Determinism, memory models, etc.
Barriers to Adoption: Reason 2
• Too time- and effort-consuming: many other priorities in software development
• Partially true, but not always
  – Many other time-consuming activities are already used, e.g., continuous testing, static analysis
  – Techniques for tolerating hardware faults don't need to be time-consuming or effort-intensive
Barriers to Adoption: Reason 3
• Lack of high-level abstractions
• My experience: mostly true
  – Developers want to reason with quantities they're familiar with (e.g., runtime, defect rates, etc.), not necessarily things like FIT rates, or even coverage
  – Need to be able to reason about cost-benefit tradeoffs of different techniques at the abstract level without going into details
Barriers to Adoption: Reason 4
• No easy-to-use tools that integrate with the software development workflow
• My experience: one of the main reasons
  – Need to understand software developers' workflow and integrate with it – no deviation
  – Must be able to handle legacy code, weird setups, and multiple languages and libraries
  – S/W maintenance accounts for 60% of costs
What we got right (in my opinion)
• Automated workflow, so little to no effort on the part of the programmer was needed
• Abstraction in terms that ordinary programmers can understand (e.g., performance, coverage)
  – Can do better on this front though
• Use of a popular open-source tool (LLVM), which is easy to integrate with the workflow (in theory)
What we got wrong (in my opinion)
• Our abstraction was still too low-level – many programmers didn't understand coverage
• Legacy code: LLVM can't compile old code, inline assembly, customized build systems
• Did not have an easy path to QA testing or software maintenance in our long-term plan
Outline
• Motivation
• Techniques developed by my group [DSN'13][CASES'14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
Open Challenges
• Need to build software-based techniques that ordinary programmers can reason about
• Need software techniques that can integrate seamlessly with the overall software workflow
  – Should not impede the QA and testing process
  – Software maintenance should be considered
  – Legacy code and build systems (if relevant)
The Opportunity: Everett Rogers's Model of Disruptive Innovation
We are still in the innovator/early-adopter stage – we need to cross the chasm and move to the "early majority".
Potential Research Roadmap
1. Understand the issues facing software developers and testers in adopting software techniques for tolerating hardware faults
2. Build techniques to address these issues in a common research framework for the community, to avoid duplication of effort
3. Have developers use the framework in actual practice, and identify the issues that come up during its use
Conclusions
• Software techniques for tolerating hardware faults in commodity systems have mushroomed
  – Can be tuned based on the needs of the application
  – Can offer significant efficiency over full duplication
• Unfortunately, the techniques have seen limited adoption in industry & by the software community
  – We, in the SELSE community, all need to address this
  – Emphasis should be on the complete software life-cycle
  – Take our message to the software engineering venues
Acknowledgements
My graduate students (7 current, 10 graduated): Anna Thomas, Qining Lu, Layali Rashid, Majid Dadashi, Bo Fang, Jiesheng Wei, Frolin Ocariza, Kartik Bajaj, Shabnam Mirshokraie, Xin Chen, Farid Tabrizi, Saba Alimadadi, Sheldon Sequira, Nithya Murthy, Abraham Chan, Maryam Raiyat, Justin Li

Collaborators and funding agencies (selected ones below)

http://blogs.ubc.ca/karthik/