Seven Deadly Sins - Fall Technical Conference · Seven Deadly Sins of Big Data Richard De Veaux...

Post on 26-Aug-2018

224 views 1 download

Transcript of Seven Deadly Sins - Fall Technical Conference · Seven Deadly Sins of Big Data Richard De Veaux...

SevenDeadlySinsofBigData

RichardDeVeauxWilliamsCollege

FTCOctober5,2017

1

TheOriginalSevenSins

• Lust• Gluttony• Greed• Sloth• Wrath• Envy• Pride

ThankstoEwan’sCorner,SourceUnknown

2

WhatisBigData?

3

HowBigIsBig?

4

WhatmakesBigdataBig?

5

OriginsofBigData

6

Whatisitreally?

7

Whatisitreally?

8

Whatisitreally?

So,WhataretheSevenDeadlySinsofBigData?

9

1.FailingtoDefinetheProblem

10

ProductionAnalysis

✦Statisticianstudiespropertiesfor74000recentsamplesfromproduction✦Viscosity✦Maxtemperaturetofailure✦Maxpressuretofailure✦Density✦Loadatfailure✦Stressatfailure

11

12

13

14

ThreeGroups!!!

“Doyouknowwemake3products?”15

NotFullyUnderstandingtheProblem

§Ingot cracking•3935 30,000 lb. Ingots•Up to 25% cracking rate•$30,000 per recast•90 potential explanatory variables

•Water composition (reduced)•Metal composition•Process variables•Other environmental variables

16

Ingots–FirstTree

CountMeanStd Dev

39350.23560.4244

All Rows

CountMeanStd Dev

30050.1590.3661

Alloy (6045,7348,8234,2345,3234)CountMeanStd Dev

9300.4817

0.4999

Alloy (5434,5894,2439)

We know that!! – some alloys are hard to make. That’s why we gave you the data in

the first place.

17

SecondTree

CountMeanStd Dev

3935All Rows

CountMeanStd Dev

30550.16730.3723

MG<3.9

CountMeanStd Dev

8800.47270.4999

MG>=3.9

What do you think is in those alloys?

0.42440.2356

18

OneMoreTime

Oh, that’s funny!-Issac Asimov

§Looks like Chrome (!?)

§Really?

§Did that “solve” the problem?•No, but... experimental design•Enabled us to focus on the important variables

19

MosaicPlot

20

ByHour…To

0.00

0.25

0.50

0.75

1.00

3578910 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2Hour

21

2.UnderestimatingDataPreparation

22

DataPreparation

60%to95%ofthetimeisspentpreparingthedata

87.1%ofallstatisticsaremadeup

23

WhyisThisHard?

ªPVAisaphilanthropicorganization,sanctionedbytheUSGovttorepresentthedisabledveterans

ªTheysendout4million“freegifts”,every6weeks- Andhopefordonations

ªDatawereusedfortheKDD1998cup- 200,000donors(100,000training,100,000test)- 479(!)demographicvariables

-Pastgiving,income,ageetcetcetc- Recentcampaign(onlyfortrainingset)

-Didtheygive?(TargetB)

ªWhoshouldgetthecurrentmailing?-What’sacosteffectivestrategy-Howmuchdidtheygive(TargetD)

24

What’sHardExactly?

25

T-Code

26

ZoomofBoxplot

27

Transformation?

28

MaybeCategories?

29

Whatisit,Anyway?T-Code

0 _ 16 DEAN 48 CORPORAL 109 LIC. 1 MR. 17 JUDGE 50 ELDER 111 SA.

1001 MESSRS. 17002 JUDGE & MRS. 56 MAYOR 114 DA. 1002 MR. & MRS. 18 MAJOR 59002 LIEUTENANT & MRS. 116 SR.

2 MRS. 18002 MAJOR & MRS. 62 LORD 117 SRA. 2002 MESDAMES 19 SENATOR 63 CARDINAL 118 SRTA.

3 MISS 20 GOVERNOR 64 FRIEND 120 YOUR MAJESTY 3003 MISSES 21002 SERGEANT & MRS. 65 FRIENDS 122 HIS HIGHNESS

4 DR. 22002 COLNEL & MRS. 68 ARCHDEACON 123 HER HIGHNESS 4002 DR. & MRS. 24 LIEUTENANT 69 CANON 124 COUNT 4004 DOCTORS 26 MONSIGNOR 70 BISHOP 125 LADY

5 MADAME 27 REVEREND 72002 REVEREND & MRS. 126 PRINCE 6 SERGEANT 28 MS. 73 PASTOR 127 PRINCESS 9 RABBI 28028 MSS. 75 ARCHBISHOP 128 CHIEF

10 PROFESSOR 29 BISHOP 85 SPECIALIST 129 BARON 10002 PROFESSOR & MRS. 31 AMBASSADOR 87 PRIVATE 130 SHEIK 10010 PROFESSORS 31002 AMBASSADOR & MRS. 89 SEAMAN 131 PRINCE AND PRINCESS

11 ADMIRAL 33 CANTOR 90 AIRMAN 132 YOUR IMPERIAL MAJESTY11002 ADMIRAL & MRS. 36 BROTHER 91 JUSTICE 135 M. ET MME.

12 GENERAL 37 SIR 92 MR. JUSTICE 210 PROF.12002 GENERAL & MRS. 38 COMMODORE 100 M.

13 COLONEL 40 FATHER 103 MLLE. 13002 COLONEL & MRS. 42 SISTER 104 CHANCELLOR

14 CAPTAIN 43 PRESIDENT 106 REPRESENTATIVE 14002 CAPTAIN & MRS. 44 MASTER 107 SECRETARY

15 COMMANDER 46 MOTHER 108 LT. GOVERNOR 15002 COMMANDER & MRS. 47 CHAPLAIN

Title

30

3.IgnoringWhat’sNotThere

31

DepressionClinicalTrialStudy

• Designedtostudyantidepressantefficacy•MeasuredviaHamiltonRatingScale

• Sideeffects• Sexualdysfunction•Miscsafetyandtolerabilityissues

• 428patients• Twoantidepressants+placebo

32

TheUsualSuspects

33

What’sMissing?

34

NowAutomatic

35

4.FallinginLoveWithYourModels

36

PredictingMalignancy• Breastcancerdatafrommammograms• Errorratesbytrainedradiologistsarenear25%forbothfalsepositivesandfalsenegatives

• Newerequipmentisprohibitivelyexpensiveforthedevelopingworld• Earlydetectionofbreastcanceriscrucial• CumulativetypeIerroroveradecadeisnear100%leadingto

needlessbiopsies• Treesdidevenworsethanradiologists

37

CombiningModels• Bagging(BootstrapAggregation)• Bootstrapadatasetrepeatedly• Takemanyversionsofsamemodel(e.g.tree)

• RandomForestVariation

• Formacommitteeofmodels• Takemajorityruleofpredictions

• Boosting• Createrepeatedsamplesofweighteddata• Weightsbasedonmisclassification• Combinebymajorityrule,orlinearcombinationofpredictions

38

RandomForestWins

False Positives False NegativesSimple Tree 32.20% 33.70%Neural Network 25.50% 31.70%Boosted Trees 24.90% 32.50%Boostrap Forest 19.30% 28.80%Radiologists 22.40% 35.80%

39

5.UsingBadData

40

WhereDoTheyComeFrom?

✦TheTimesofLondonreports63.2kgbat• Themysteriousmovingdecimal

✦40%ofdoctorsbornonVeteran’sDay1911• DataEntry

✦MarsLanderlost--$125M• Confusionofaccelerationunits(metric–newtons/secvs.Englishpound/sec)Everywhere

✦OvarianCancercurepublishedinLancet• Differencesduetolabpractice,nottreatment

41

DataQuality

42

DataQuality

43

DataQuality

44

WhatCanTheyDo?

✦StudybyLargeCreditCardIssuer

✦Entireeffectwasduetoonecustomerwhocharged$3M

45

6.ConfusingCorrelationandCausation

PCMI2016 7/1/201646

PoorGeorgeGeorgeBox

“Allmodelsarewrong,butsomeareuseful”

PeterNorvig “Allmodelsarewrong,andincreasinglyyoucansucceedwithoutthem.”

Really?

ChrisAnderson “Withenoughdata,thenumbersspeakforthemselves”

47

TheEndofScience

48

Didyouknow?

• Shorter men have greater risk of heart attack

49

Didyouknow?• Who is more likely to have a heart attack in 10 yrs?

50

Didyouknow?• High popsicle sales are strongly correlated with shark

51

• People who believe in alien abductions prefer Pepsi to Coke

Didyouknow?

52

7.NotTakingYourAnti-Hubristines

53

DataScience

54

Or?

55

GooglePredictsFluBeforeCDC!!

56

Or…DidThey?

57

TheSevenVirtues

✦DefinetheProblem✦PreparetheData• UseDomainKnowledge

✦BeOpentoNewMethodsandModels✦BeAwareofMissingData✦EnsureDataQualityandEthicalUseofData✦UseModels,notjustAssociations✦WorkinTeams• AcknowledgelimitstoBigDataAnalysis

58

Thankyou!!

59