Seven Deadly Sins - Fall Technical Conference · Seven Deadly Sins of Big Data Richard De Veaux...
Transcript of Seven Deadly Sins - Fall Technical Conference · Seven Deadly Sins of Big Data Richard De Veaux...
SevenDeadlySinsofBigData
RichardDeVeauxWilliamsCollege
FTCOctober5,2017
1
TheOriginalSevenSins
• Lust• Gluttony• Greed• Sloth• Wrath• Envy• Pride
ThankstoEwan’sCorner,SourceUnknown
2
WhatisBigData?
3
HowBigIsBig?
4
WhatmakesBigdataBig?
5
OriginsofBigData
6
Whatisitreally?
7
Whatisitreally?
8
Whatisitreally?
So,WhataretheSevenDeadlySinsofBigData?
9
1.FailingtoDefinetheProblem
10
ProductionAnalysis
✦Statisticianstudiespropertiesfor74000recentsamplesfromproduction✦Viscosity✦Maxtemperaturetofailure✦Maxpressuretofailure✦Density✦Loadatfailure✦Stressatfailure
11
12
13
14
ThreeGroups!!!
“Doyouknowwemake3products?”15
NotFullyUnderstandingtheProblem
§Ingot cracking•3935 30,000 lb. Ingots•Up to 25% cracking rate•$30,000 per recast•90 potential explanatory variables
•Water composition (reduced)•Metal composition•Process variables•Other environmental variables
16
Ingots–FirstTree
CountMeanStd Dev
39350.23560.4244
All Rows
CountMeanStd Dev
30050.1590.3661
Alloy (6045,7348,8234,2345,3234)CountMeanStd Dev
9300.4817
0.4999
Alloy (5434,5894,2439)
We know that!! – some alloys are hard to make. That’s why we gave you the data in
the first place.
17
SecondTree
CountMeanStd Dev
3935All Rows
CountMeanStd Dev
30550.16730.3723
MG<3.9
CountMeanStd Dev
8800.47270.4999
MG>=3.9
What do you think is in those alloys?
0.42440.2356
18
OneMoreTime
Oh, that’s funny!-Issac Asimov
§Looks like Chrome (!?)
§Really?
§Did that “solve” the problem?•No, but... experimental design•Enabled us to focus on the important variables
19
MosaicPlot
20
ByHour…To
0.00
0.25
0.50
0.75
1.00
3578910 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2Hour
21
2.UnderestimatingDataPreparation
22
DataPreparation
60%to95%ofthetimeisspentpreparingthedata
87.1%ofallstatisticsaremadeup
23
WhyisThisHard?
ªPVAisaphilanthropicorganization,sanctionedbytheUSGovttorepresentthedisabledveterans
ªTheysendout4million“freegifts”,every6weeks- Andhopefordonations
ªDatawereusedfortheKDD1998cup- 200,000donors(100,000training,100,000test)- 479(!)demographicvariables
-Pastgiving,income,ageetcetcetc- Recentcampaign(onlyfortrainingset)
-Didtheygive?(TargetB)
ªWhoshouldgetthecurrentmailing?-What’sacosteffectivestrategy-Howmuchdidtheygive(TargetD)
24
What’sHardExactly?
25
T-Code
26
ZoomofBoxplot
27
Transformation?
28
MaybeCategories?
29
Whatisit,Anyway?T-Code
0 _ 16 DEAN 48 CORPORAL 109 LIC. 1 MR. 17 JUDGE 50 ELDER 111 SA.
1001 MESSRS. 17002 JUDGE & MRS. 56 MAYOR 114 DA. 1002 MR. & MRS. 18 MAJOR 59002 LIEUTENANT & MRS. 116 SR.
2 MRS. 18002 MAJOR & MRS. 62 LORD 117 SRA. 2002 MESDAMES 19 SENATOR 63 CARDINAL 118 SRTA.
3 MISS 20 GOVERNOR 64 FRIEND 120 YOUR MAJESTY 3003 MISSES 21002 SERGEANT & MRS. 65 FRIENDS 122 HIS HIGHNESS
4 DR. 22002 COLNEL & MRS. 68 ARCHDEACON 123 HER HIGHNESS 4002 DR. & MRS. 24 LIEUTENANT 69 CANON 124 COUNT 4004 DOCTORS 26 MONSIGNOR 70 BISHOP 125 LADY
5 MADAME 27 REVEREND 72002 REVEREND & MRS. 126 PRINCE 6 SERGEANT 28 MS. 73 PASTOR 127 PRINCESS 9 RABBI 28028 MSS. 75 ARCHBISHOP 128 CHIEF
10 PROFESSOR 29 BISHOP 85 SPECIALIST 129 BARON 10002 PROFESSOR & MRS. 31 AMBASSADOR 87 PRIVATE 130 SHEIK 10010 PROFESSORS 31002 AMBASSADOR & MRS. 89 SEAMAN 131 PRINCE AND PRINCESS
11 ADMIRAL 33 CANTOR 90 AIRMAN 132 YOUR IMPERIAL MAJESTY11002 ADMIRAL & MRS. 36 BROTHER 91 JUSTICE 135 M. ET MME.
12 GENERAL 37 SIR 92 MR. JUSTICE 210 PROF.12002 GENERAL & MRS. 38 COMMODORE 100 M.
13 COLONEL 40 FATHER 103 MLLE. 13002 COLONEL & MRS. 42 SISTER 104 CHANCELLOR
14 CAPTAIN 43 PRESIDENT 106 REPRESENTATIVE 14002 CAPTAIN & MRS. 44 MASTER 107 SECRETARY
15 COMMANDER 46 MOTHER 108 LT. GOVERNOR 15002 COMMANDER & MRS. 47 CHAPLAIN
Title
30
3.IgnoringWhat’sNotThere
31
DepressionClinicalTrialStudy
• Designedtostudyantidepressantefficacy•MeasuredviaHamiltonRatingScale
• Sideeffects• Sexualdysfunction•Miscsafetyandtolerabilityissues
• 428patients• Twoantidepressants+placebo
32
TheUsualSuspects
33
What’sMissing?
34
NowAutomatic
35
4.FallinginLoveWithYourModels
36
PredictingMalignancy• Breastcancerdatafrommammograms• Errorratesbytrainedradiologistsarenear25%forbothfalsepositivesandfalsenegatives
• Newerequipmentisprohibitivelyexpensiveforthedevelopingworld• Earlydetectionofbreastcanceriscrucial• CumulativetypeIerroroveradecadeisnear100%leadingto
needlessbiopsies• Treesdidevenworsethanradiologists
37
CombiningModels• Bagging(BootstrapAggregation)• Bootstrapadatasetrepeatedly• Takemanyversionsofsamemodel(e.g.tree)
• RandomForestVariation
• Formacommitteeofmodels• Takemajorityruleofpredictions
• Boosting• Createrepeatedsamplesofweighteddata• Weightsbasedonmisclassification• Combinebymajorityrule,orlinearcombinationofpredictions
38
RandomForestWins
False Positives False NegativesSimple Tree 32.20% 33.70%Neural Network 25.50% 31.70%Boosted Trees 24.90% 32.50%Boostrap Forest 19.30% 28.80%Radiologists 22.40% 35.80%
39
5.UsingBadData
40
WhereDoTheyComeFrom?
✦TheTimesofLondonreports63.2kgbat• Themysteriousmovingdecimal
✦40%ofdoctorsbornonVeteran’sDay1911• DataEntry
✦MarsLanderlost--$125M• Confusionofaccelerationunits(metric–newtons/secvs.Englishpound/sec)Everywhere
✦OvarianCancercurepublishedinLancet• Differencesduetolabpractice,nottreatment
41
DataQuality
42
DataQuality
43
DataQuality
44
WhatCanTheyDo?
✦StudybyLargeCreditCardIssuer
✦Entireeffectwasduetoonecustomerwhocharged$3M
45
6.ConfusingCorrelationandCausation
PCMI2016 7/1/201646
PoorGeorgeGeorgeBox
“Allmodelsarewrong,butsomeareuseful”
PeterNorvig “Allmodelsarewrong,andincreasinglyyoucansucceedwithoutthem.”
Really?
ChrisAnderson “Withenoughdata,thenumbersspeakforthemselves”
47
TheEndofScience
48
Didyouknow?
• Shorter men have greater risk of heart attack
49
Didyouknow?• Who is more likely to have a heart attack in 10 yrs?
50
Didyouknow?• High popsicle sales are strongly correlated with shark
51
• People who believe in alien abductions prefer Pepsi to Coke
Didyouknow?
52
7.NotTakingYourAnti-Hubristines
53
DataScience
54
Or?
55
GooglePredictsFluBeforeCDC!!
56
Or…DidThey?
57
TheSevenVirtues
✦DefinetheProblem✦PreparetheData• UseDomainKnowledge
✦BeOpentoNewMethodsandModels✦BeAwareofMissingData✦EnsureDataQualityandEthicalUseofData✦UseModels,notjustAssociations✦WorkinTeams• AcknowledgelimitstoBigDataAnalysis
58
Thankyou!!
59