Automated and Interactive Debugging of Big Data Analytics
Muhammad Ali Gulzar, University of California, Los Angeles
Big Data Debugging in the Dark
Develop locally. Hope it works. Run in cloud. Bug! Guesswork.
[Figure: Map and Reduce stages]
• Interactive Debugging Primitives for Big Data Processing in Spark (ICSE ’16)
• Automated Debugging in Data Intensive Scalable Computing Systems (SoCC ’17)
• White-box Testing of Data Intensive Scalable Computing Applications with User-Defined Functions (ongoing)
Why Is Traditional Interactive Debugging Hard for Apache Spark?
Enabling interactive debugging requires us to rethink the features of a traditional debugger such as GDB.
• Pausing the entire computation on the cloud could reduce throughput.
• It is clearly infeasible for a user to inspect billions of records through a regular watchpoint.
• Even launching remote JVM debuggers on individual worker nodes cannot scale for big data computing.
Interactive DISC Debug Primitives [ICSE ’16, FSE ’16 Demo, SIGMOD ’17 Demo]
1. Simulated Breakpoint
2. On-Demand Guarded Watchpoint
3. Crash Culprit Identification
4. Backward and Forward Tracing
Our Insights for Interactive DISC Debugging
• We do not pause program execution but simulate a breakpoint through on-demand state regeneration.
• We deliver selected program states to a user in a stream-processing fashion.
• We re-architect the underlying big data system runtime with native in-memory data provenance support.
What is the performance overhead of debugging primitives?

Program     Dataset size (GB)   Max     Max w/o Latency Alert   Watchpoint   Crash Culprit   Tracing
WordCount   0.5 - 1000          2.5X    1.34X                   1.09X        1.18X           1.22X
Grep        20 - 90             1.76X   1.07X                   1.05X        1.04X           1.05X
PigMix-L1   1 - 200             1.38X   1.29X                   1.03X        1.19X           1.24X

Max: all the features of BigDebug are enabled.
BigDebug poses at most 2.5X overhead with the maximum instrumentation setting.
Motivating Example
• Alice writes a Spark program that identifies, for each state in the US, the delta between the minimum and the maximum snowfall reading for each day of any year and for any particular year.
• An input data record that measures 1 foot of snowfall on January 1st of year 1992, in the 99504 zip code (Anchorage, AK) area, appears as
99504,01/01/1992,1ft
Problem Definition
TextFile → FlatMap → GroupByKey → Map → Output

Input records (TextFile):
99504,01/01/1992,1ft
99504,03/01/1992,0.1ft
99504,01/01/1993,70in
99504,03/01/1993,145mm
99504,01/01/1994,245mm
99504,01/01/1993,85mm
90031,02/01/1991,0mm

After FlatMap:
AK,01/01,304.8
AK,1992,304.8
AK,03/01,30.5
AK,1992,30.5
AK,01/01,21336
AK,1993,21336
AK,03/01,145
AK,1993,145
AK,01/01,245
AK,1994,245
....

After GroupByKey:
AK,01/01,[304.8,21336,245,85]
AK,03/01,[30.5,145]
AK,1992,[304.8,30.5]
AK,1993,[21336,145,85]
AK,1994,[245]
CA,02/01,[0]
CA,1991,[0]

Output (after Map):
AK,01/01,21251
AK,03/01,114.5
AK,1992,274.3
AK,1993,21251
AK,1994,0
CA,02/01,0
CA,1991,0
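The dataflow in this example can be approximated with plain Scala collections. The sketch below is not the program from the talk. It assumes a hypothetical `stateOf` lookup for the two zip codes, and it assumes the fault is an inch reading scaled by the feet factor, since the figure shows 70in as 21336, which is 70 × 304.8. The figure rounds 0.1ft (30.48 mm) to 30.5.

```scala
object SnowfallSketch {
  // Hypothetical lookup covering only the two zip codes in the example.
  val stateOf: Map[String, String] = Map("99504" -> "AK", "90031" -> "CA")

  // Converts a reading such as "1ft" or "145mm" to millimetres.
  def toMm(reading: String): Double = {
    val value = reading.dropRight(2).toDouble
    reading.takeRight(2) match {
      case "ft" => value * 304.8
      case "in" => value * 304.8 // assumed fault: the correct factor is 25.4
      case _    => value         // "mm"
    }
  }

  // FlatMap step: one record yields a (state, day/month) and a (state, year) reading.
  def explode(line: String): Seq[((String, String), Double)] = {
    val fields = line.split(",").map(_.trim)
    val (state, date, mm) = (stateOf(fields(0)), fields(1), toMm(fields(2)))
    Seq((state, date.take(5)) -> mm, (state, date.drop(6)) -> mm)
  }

  // GroupByKey and Map steps: delta between the largest and smallest reading per key.
  def deltas(lines: Seq[String]): Map[(String, String), Double] =
    lines.flatMap(explode).groupBy(_._1).map { case (key, readings) =>
      val values = readings.map(_._2)
      key -> (values.max - values.min)
    }

  // Keys whose delta violates the threshold used by the test function (delta < 6000).
  def failing(lines: Seq[String]): Set[(String, String)] =
    deltas(lines).filter { case (_, delta) => delta >= 6000 }.keySet
}
```

Run on the seven input records of the example, `SnowfallSketch.failing` flags the (AK, 01/01) and (AK, 1993) groups, the two outputs shown as 21251.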
Given a test function, the goal is to identify a minimum subset of the input that is able to reproduce the same test failure.
def test(key: String, delta: Float): Boolean = {
  delta < 6000
}
• Using a test function, a user can specify incorrect results.
Existing Approach 1: Data Provenance for Spark [VLDB 2015]
It over-approximates the scope of failure-inducing inputs, i.e., records in the faulty key-group are all marked as faulty.
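To make the over-approximation concrete, here is a minimal sketch of a backward trace over the Alaska records of the running example. It is not the provenance system's API. `keysOf` and `backwardTrace` are hypothetical helpers, and the state column is left out because every record here maps to AK.

```scala
object ProvenanceSketch {
  // Alaska records from the running example.
  val records: Vector[String] = Vector(
    "99504,01/01/1992,1ft",
    "99504,03/01/1992,0.1ft",
    "99504,01/01/1993,70in",
    "99504,03/01/1993,145mm",
    "99504,01/01/1994,245mm",
    "99504,01/01/1993,85mm"
  )

  // Output keys a record contributes to: its day/month group and its year group.
  def keysOf(record: String): Set[String] = {
    val date = record.split(",")(1)
    Set(date.take(5), date.drop(6))
  }

  // Backward trace: every input record in the key-group of the given output.
  def backwardTrace(outputKey: String): Vector[String] =
    records.filter(r => keysOf(r).contains(outputKey))
}
```

Tracing the faulty 01/01 output returns all four records of that key-group, although only the 70in reading is suspicious.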
Existing Approach 2: Delta Debugging [Zeller 1999]
• Delta Debugging performs a systematic binary-search-like procedure on the input dataset using a test oracle function.
• It does not prune input records known to be irrelevant because of the lack of semantic understanding of data-flow operators.
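The binary-search-like procedure can be sketched as a simplified variant of delta debugging. This is an illustration, not Zeller's exact algorithm or the implementation used in the talk. The name `ddmin` and its signature are assumptions, and `fails` stands in for rerunning the job on a subset and applying the test oracle.

```scala
object DeltaDebuggingSketch {
  // Shrinks `input` to a smaller subset on which `fails` still holds.
  def ddmin[A](input: Vector[A])(fails: Vector[A] => Boolean): Vector[A] = {
    require(fails(input), "the full input must reproduce the failure")

    def loop(current: Vector[A], n: Int): Vector[A] =
      if (current.size < 2) current
      else {
        val chunkSize = math.ceil(current.size.toDouble / n).toInt
        val chunks = current.grouped(chunkSize).toVector

        // With two chunks, each complement is the other chunk, so skip them.
        def failingComplement: Option[Vector[A]] =
          if (chunks.size <= 2) None
          else chunks.indices
            .map(i => (chunks.take(i) ++ chunks.drop(i + 1)).flatten)
            .find(fails)

        chunks.find(fails) match {
          case Some(chunk) => loop(chunk, 2) // failure reproduced by one chunk
          case None =>
            failingComplement match {
              case Some(rest) => loop(rest, math.max(n - 1, 2)) // drop one chunk
              case None if n < current.size =>
                loop(current, math.min(2 * n, current.size)) // finer granularity
              case None => current // nothing more can be removed
            }
        }
      }

    loop(input, 2)
  }
}
```

Every call to `fails` corresponds to one run of the job, and the search treats the records as opaque, so it cannot skip records that the dataflow operators make irrelevant to the faulty output.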
Automated Debugging in DISC with BigSift [SoCC 2017]
Input: a Spark program, a test function
Output: minimum fault-inducing input records
Data Provenance + Delta Debugging
Optimizations:
• Test Predicate Pushdown
• Prioritizing Backward Traces
• Bitmap-based Test Memoization
Optimization 1: Test Predicate Pushdown
• Observation: During backward tracing, data provenance traces through all the partitions even though only a few partitions are faulty.
If applicable, BigSift pushes down the test function to test the output of combiners in order to isolate the faulty partitions.
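The idea can be sketched as follows. This is not BigSift's implementation. It assumes the final aggregation is a maximum, which is simpler than the snowfall program's delta, and it models each partition's combiner output as the partial maximum. `faultyPartitions` is a hypothetical helper.

```scala
object PushdownSketch {
  // Test function on an aggregated value, mirroring the threshold from the example.
  def test(value: Double): Boolean = value < 6000

  // Applies the test to each partition's combiner output (its partial maximum)
  // and returns the indices of the partitions that fail it.
  def faultyPartitions(partitions: Vector[Vector[Double]]): Vector[Int] =
    partitions.zipWithIndex.collect {
      case (part, i) if part.nonEmpty && !test(part.max) => i
    }
}
```

With a maximum, the final result fails the test exactly when some partial maximum fails it, so backward tracing only needs to follow the flagged partitions. Other aggregations may not allow this, which is one reading of the "if applicable" qualifier.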
Debugging Time
[Figure: SequenceCount. Y-axis: # of fault-inducing input records (log scale, 1 to 1E+09). X-axis: fault localization time (s), 0 to 14000. Series: Delta Debugging, BigSift, Test Driven Data Provenance, Data Provenance.]
On average, BigSift takes 62% less time to debug a single faulty output than the time taken for a single run on the entire data.
Testing Challenges of Big Data Analytics
• How can we select a minimal sample from a complete dataset to perform efficient testing of DISC applications?
• How can we generate data that exercises all execution paths in a DISC application to facilitate complete testing?
• Due to dataflow operators and complex user-defined functions in DISC applications, it is extremely hard to answer these two questions.
// String.split takes a regular expression, so the literal period is escaped.
sc.textFile("hdfs://..").flatMap(s => s.split("\\.")).map(s => (s, 1)).reduceByKey((a, b) => a + b)
Dataflow operators: textFile, flatMap, map, reduceByKey. User-defined functions: the lambdas passed to them.
Ongoing work: White-box Testing of DISC Applications
[Figure: a DISC application (map, filter, reduce across Stage 1 and Stage 2) with UDFs udf1, udf2, udf3. Java PathFinder produces a symbolic path and effect for each UDF, listed as path constraint and effect pairs such as "x > 5 & y = .." with "z = x * …" and "x < 3 & y > …" with "z = x / …". Z3 turns the constraints into test data.]
1. A DISC application is decomposed into UDFs and dataflow operators.
2. Each complex UDF is symbolically executed in isolation with bounded path exploration.
3. UDF path constraints are integrated w.r.t. the logical specifications of dataflow operators.
4. Path constraints are converted into SMT2, which a theorem solver uses to generate test data.
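Steps 3 and 4 can be illustrated with a small sketch that joins UDF paths across a map followed by a filter and renders each joint path as an SMT-LIB 2 query. This is not the tool's code. The two UDFs, their constraints, and the names `Path`, `mapPaths`, `filterPaths`, and `jointQueries` are hypothetical, and the sketch only builds the query text. Feeding it to a solver such as Z3 is left out.

```scala
object PathToSmtSketch {
  // One path through a UDF: a constraint on its input and its effect on the output,
  // both written as SMT-LIB terms.
  final case class Path(constraint: String, effect: String)

  // Hypothetical paths of a map UDF `x => if (x > 5) x * 2 else x + 1`.
  val mapPaths: Vector[Path] =
    Vector(Path("(> x 5)", "(* x 2)"), Path("(<= x 5)", "(+ x 1)"))

  // Hypothetical filter UDF `z => z < 20`: the record is either kept or dropped.
  val filterPaths: Vector[String] = Vector("(< z 20)", "(not (< z 20))")

  // Joint paths for map followed by filter: every map path is paired with every
  // filter outcome, and z is bound to the map's effect.
  def jointQueries: Vector[String] =
    for { m <- mapPaths; f <- filterPaths } yield
      s"""(declare-const x Int)
         |(declare-const z Int)
         |(assert ${m.constraint})
         |(assert (= z ${m.effect}))
         |(assert $f)
         |(check-sat)
         |(get-model)""".stripMargin
}
```

Each satisfiable query yields a model whose value of `x` is one input record that drives the application down that joint path.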
Test coverage from generated testing data
[Figure: Joint Dataflow and UDF (JDU) Path Coverage. Y-axis: % of JDU paths. Benchmarks: Income, Movie, Airport, Commute, PigMix, Grade, Word. Series: BigTest, Sedge, Original Dataset.]
BigTest outperforms the previously known technique and provides 100% JDU path coverage in almost every benchmark program.
Conclusion
• By synthesizing insights from software engineering and database systems, we can design scalable, interactive, and automated debugging algorithms for big data analytics.
• Demo: Debugging Big Data Analytics in Spark with BigDebug https://www.youtube.com/watch?v=aZ91EyC5-Yc
• Demo: Automated Debugging of Big Data Analytics with BigSift https://www.youtube.com/watch?v=_HR3VJ2dPbE
Optimization 2: Prioritizing Backward Traces
• Observation: The same faulty input record may contribute to multiple output records failing the test.
In case of multiple faulty outputs, BigSift overlaps two backward traces to minimize the scope of fault-inducing input records.
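A minimal sketch of the overlap step, using the two failing outputs of the snowfall example (the 01/01 group and the 1993 group). This is not BigSift's code. `overlap` is a hypothetical helper, and falling back to the smaller trace when two traces share no records is an assumption of this sketch.

```scala
object TraceOverlapSketch {
  // Backward traces of the two failing outputs, as sets of input records.
  val traceDay0101: Set[String] = Set(
    "99504,01/01/1992,1ft", "99504,01/01/1993,70in",
    "99504,01/01/1994,245mm", "99504,01/01/1993,85mm")
  val traceYear1993: Set[String] = Set(
    "99504,01/01/1993,70in", "99504,03/01/1993,145mm", "99504,01/01/1993,85mm")

  // Intersects the traces, smallest first. A trace that shares nothing with the
  // running intersection is skipped (assumed to stem from a different fault).
  def overlap(traces: Seq[Set[String]]): Set[String] =
    traces.sortBy(_.size).reduceOption { (acc, trace) =>
      val common = acc.intersect(trace)
      if (common.nonEmpty) common else acc
    }.getOrElse(Set.empty)
}
```

For these two traces the overlap keeps only the 70in and 85mm records from 01/01/1993, so any further search starts from two records instead of four.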
Test Data Size from BigTest
[Figure: Test Data Generated to Achieve 100% JDU Path Coverage. Y-axis: # of input records (log scale, 1.00E+00 to 1.00E+10). Benchmarks: Income, Movie, Airport, Commute, PigMix, Grade, Word. Series: BigTest, Input Dataset.]
BigTest generates testing data that is several orders of magnitude (10^6 to 10^10) smaller than the original input dataset.
BigDebug: Interactive Debugger [FSE 2016 Demo, SIGMOD 2017 Demo]
• BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/
Fault Localizability over Data Provenance
[Figure: Y-axis: # of fault-inducing input records (log scale, 1 to 100000000). Benchmarks: Movie Histogram, Inverted Index, Rating Histogram, Sequence Count, Rating Frequency, College Students, Weather Analysis. Series: Data Provenance, Test Driven Data Provenance, BigSift & DD.]
BigSift leverages DD after DP to continue fault isolation, achieving several orders of magnitude (10^3 to 10^7) better precision.