Machine Learning in the Wild Dealing with Messy Data Rajmonda S. Caceres SDS 293 – Smith College October 30, 2017


Page 1

Machine Learning in the Wild

Dealing with Messy Data

Rajmonda S. Caceres

SDS 293 – Smith College, October 30, 2017

Page 2

Analytical Chain: From Data to Actions

What is this lecture about?

• What are some of the data quality issues in real applications?
• Why should we care about data quality?
• How can we leverage statistical techniques to improve the quality of real data?

Diagram: Data Collection → Data Cleaning/Preparation → Analysis → Visual Representation → Decision Making

Page 3

Data Quality Issues in Real Applications

• Missing data
• Outliers
• Noisy data
• Data mismatch
• Curse of dimensionality

• Data quality greatly affects the analysis and the insights we draw from data
  • Introduces bias
  • Causes information loss

Page 4

Effects of Missing Data on Regression

• Which regression model should we pick?

Figure panels: sufficient data points to perform regression; insufficient data points to perform regression.

Figures are from Chris Bishop's book "Pattern Recognition and Machine Learning".

Page 5

Effects of Outliers & Missing Data on PCA

• Poor-quality data greatly affects dimensionality reduction methods

Figure panels: Case 1, Case 2
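To make the PCA claim concrete, here is a minimal sketch of how one extreme point can rotate the first principal component. It uses only the Python standard library and the closed-form principal-axis angle for 2-D data. The data points and the helper name `principal_axis_angle` are illustrative assumptions, not taken from the slides.

```python
import math

def principal_axis_angle(points):
    """Angle in degrees of the first principal component of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed form for the leading eigenvector direction of a 2x2 covariance matrix.
    return math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy))

# Hypothetical data lying exactly on the line y = x.
clean = [(float(i), float(i)) for i in range(10)]
# The same data plus one extreme point far off the line.
with_outlier = clean + [(0.0, 40.0)]

angle_clean = principal_axis_angle(clean)
angle_outlier = principal_axis_angle(with_outlier)
print(angle_clean, angle_outlier)
```

On this toy data the principal axis moves from the diagonal to a nearly vertical direction once the single outlier is added.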

Page 6

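Missing Data

• How much missing data is too much?
  • Rough guideline: if less than 5-10% is missing, it is likely inconsequential; if more, we need to impute

Table: samples 1-5 with features X1, X2, X3, X4 and response Y. Samples 1 and 2 each contain a missing entry ("?"), sample 3 is complete, and samples 4 and 5 are missing every value.

• Feature level: individual entries are missing within an observed sample (observed subpopulation)
• Sample level: entire samples are missing (unobserved subpopulation)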
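The rough guideline above can be checked with a few lines of Python. This is only a sketch: the table values, the positions of the missing entries in samples 1 and 2, and the choice of 10% as the cut-off are all assumptions made for illustration.

```python
# Hypothetical table: None marks a missing entry. Columns are X1..X4 and Y.
rows = [
    [1.2, None, 0.5, 3.1, 1.0],
    [0.7, 2.2, None, 2.9, 0.0],
    [1.0, 2.0, 0.4, 3.0, 1.0],
    [None, None, None, None, None],
    [None, None, None, None, None],
]
columns = ["X1", "X2", "X3", "X4", "Y"]

def missing_fraction_by_feature(rows, columns):
    """Feature-level view: fraction of samples missing each column."""
    n = len(rows)
    return {name: sum(r[j] is None for r in rows) / n for j, name in enumerate(columns)}

def fully_missing_samples(rows):
    """Sample-level view: 1-based ids of samples with no observed values."""
    return [i + 1 for i, r in enumerate(rows) if all(v is None for v in r)]

feature_missing = missing_fraction_by_feature(rows, columns)
empty_samples = fully_missing_samples(rows)
# Apply the 5-10% guideline, using the upper end (10%) as an assumed threshold.
needs_imputation = [c for c, frac in feature_missing.items() if frac > 0.10]
print(feature_missing, empty_samples, needs_imputation)
```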

Page 7

Mechanisms of Missing Data (Rubin 1976)

• Missing Completely At Random (MCAR): missingness does not depend on observed or unobserved data
  • P(missing | complete data) = P(missing)

• Missing At Random (MAR): missingness depends only on observed data
  • P(missing | complete data) = P(missing | observed data)

• Missing Not At Random (MNAR): missingness also depends on unobserved data
  • P(missing | complete data) ≠ P(missing | observed data)
  • Harder to address
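The three mechanisms can be simulated to see how they differ. In the sketch below, `x` is always observed and `y` may go missing. The data-generating model, the missingness probabilities, and the helper name `observed_mean_y` are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)
n = 20000
# Hypothetical data: x is always observed, y may go missing. True mean of y is 0.
data = []
for _ in range(n):
    x = rng.gauss(0, 1)
    y = x + rng.gauss(0, 1)
    data.append((x, y))

def observed_mean_y(data, p_missing):
    """Mean of y over rows that survive the missingness rule p_missing(x, y)."""
    kept = [y for x, y in data if rng.random() >= p_missing(x, y)]
    return sum(kept) / len(kept)

# MCAR: the chance of losing y is the same for every row.
mcar = observed_mean_y(data, lambda x, y: 0.3)
# MAR: the chance of losing y depends only on the observed x.
mar = observed_mean_y(data, lambda x, y: 0.8 if x > 0 else 0.0)
# MNAR: the chance of losing y depends on the unobserved y itself.
mnar = observed_mean_y(data, lambda x, y: 0.8 if y > 0 else 0.0)
print(mcar, mar, mnar)
```

Under MCAR the mean of the surviving values stays close to the true mean. Under MAR and MNAR the naive mean is biased; with MAR the observed `x` carries the information needed to correct it, while with MNAR it does not, which is why MNAR is harder to address.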

Page 8

Mean Imputation

• Assume the distribution of the missing data is the same as that of the observed data
• Replace missing values with the mean, median, or another point estimate
• Advantages:
  • Works well if the data are MCAR
  • Convenient, easy to implement
• Disadvantages:
  • Introduces bias by smoothing out the variance
  • Changes the magnitude of correlations between variables

Figure label: Imputed Data Point

Figure taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
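A minimal sketch of mean imputation, showing the variance shrinkage listed under the disadvantages. The feature values are made up for illustration.

```python
import statistics

# Hypothetical feature with two missing entries (None).
values = [2.0, 4.0, None, 6.0, None, 8.0]
observed = [v for v in values if v is not None]

mean = statistics.mean(observed)  # 5.0
imputed = [mean if v is None else v for v in values]

# Every imputed value sits exactly at the mean, so the spread shrinks.
var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)
print(var_observed, var_imputed)
```

Swapping `statistics.mean` for `statistics.median` gives median imputation with the same structure.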

Page 9

Regression Imputation

• Better imputation: leverage attribute relationships
• Monotone missing patterns
• Replace missing values with predicted scores from a regression equation: y_i = a x_i + b
• Stochastic regression: add an error term
• Advantage:
  • Uses information from the observed data
• Disadvantage:
  • Overestimates model fit, weakens variance

Figure label: Imputed Data Point

Figure taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/
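Both variants can be sketched with an ordinary least-squares fit of y_i = a x_i + b on the complete cases. The data pairs are hypothetical, and drawing the error term from a Gaussian with the residual standard deviation is one common choice assumed here, not something the slides specify.

```python
import random
import statistics

# Hypothetical (x, y) pairs; y is None where it is missing.
pairs = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, None), (6.0, None)]
complete = [(x, y) for x, y in pairs if y is not None]

xs = [x for x, _ in complete]
ys = [y for _, y in complete]
mx, my = statistics.mean(xs), statistics.mean(ys)
# Least-squares slope and intercept of y = a * x + b on the complete cases.
a = sum((x - mx) * (y - my) for x, y in complete) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

residuals = [y - (a * x + b) for x, y in complete]
# Residual standard deviation (n - 2 degrees of freedom for a line fit).
sigma = (sum(r * r for r in residuals) / (len(complete) - 2)) ** 0.5

rng = random.Random(0)
# Deterministic regression imputation: plug in the fitted value.
deterministic = [(x, a * x + b) if y is None else (x, y) for x, y in pairs]
# Stochastic regression imputation: fitted value plus a random error term.
stochastic = [(x, a * x + b + rng.gauss(0, sigma)) if y is None else (x, y) for x, y in pairs]
print(a, b, deterministic[-2:], stochastic[-2:])
```

The deterministic version places every imputed point exactly on the fitted line, which is what inflates the apparent model fit. The added noise in the stochastic version restores some of the lost variance.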

Page 10

k-Nearest Neighbor (kNN) Imputation

• Leverage similarities among different samples in the data
• Advantages:
  • Doesn't require a model to predict the missing values
  • Simple to implement
  • Can capture the variation in data due to its locality
• Disadvantages:
  • Sensitive to how we define what "similar" means

Figure: kNN imputation of a missing value ("?"), k = 3
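A minimal sketch of kNN imputation with k = 3. The samples, the use of Euclidean distance on the observed features, and averaging the neighbors' values are assumptions made for illustration; a different notion of "similar" would change the result.

```python
import math

# Hypothetical samples (x1, x2, y); y is None for the sample to impute.
samples = [
    (1.0, 1.0, 10.0),
    (1.2, 0.9, 11.0),
    (0.8, 1.1, 12.0),
    (5.0, 5.0, 50.0),
    (5.2, 4.9, 52.0),
    (1.1, 1.0, None),
]

def knn_impute(samples, k=3):
    """Fill a missing y with the mean y of the k nearest complete samples."""
    donors = [s for s in samples if s[2] is not None]
    filled = []
    for x1, x2, y in samples:
        if y is None:
            # Euclidean distance on the observed features defines "similar".
            nearest = sorted(donors, key=lambda d: math.dist((x1, x2), d[:2]))[:k]
            y = sum(d[2] for d in nearest) / len(nearest)
        filled.append((x1, x2, y))
    return filled

filled = knn_impute(samples, k=3)
print(filled[-1])
```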

Page 11

Multiple Imputation

Diagram: Incomplete Data → Imputed Data (Set 1, Set 2, ..., Set m) → Analysis Results (R1, R2, ..., Rm) → Pooled Results

• Treat the missing value as a random variable and impute it m times
• Minimize the bias introduced by data imputation through averaging
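The impute, analyze, pool pipeline can be sketched as follows. This is a simplified illustration: the data are hypothetical, missing values are drawn from a Gaussian fitted to the observed values (an assumed imputation model), the "analysis" is just the sample mean, and the variance is pooled with Rubin's rules.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical feature with missing entries (None).
values = [2.0, 4.0, None, 6.0, None, 8.0, 5.0, None]
observed = [v for v in values if v is not None]
mu, sd = statistics.mean(observed), statistics.stdev(observed)

m = 20
estimates = []         # R1..Rm: the analysis result on each imputed set
within_variances = []  # sampling variance of each estimate
for _ in range(m):
    # Draw each missing value at random instead of plugging in a single number.
    completed = [rng.gauss(mu, sd) if v is None else v for v in values]
    estimates.append(statistics.mean(completed))
    within_variances.append(statistics.variance(completed) / len(completed))

# Pooling: average the m results.
pooled_estimate = statistics.mean(estimates)
within = statistics.mean(within_variances)
between = statistics.variance(estimates)
# Rubin's rules: total variance = within + (1 + 1/m) * between.
total_variance = within + (1 + 1 / m) * between
print(pooled_estimate, total_variance)
```

The between-imputation term is what a single imputation cannot provide: it reflects the extra uncertainty that comes from not knowing the missing values.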

Page 12

Outliers

• Outlier: "Observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism", Hawkins (1980)
• Outliers vs. noise
• Outlier detection vs. novelty detection

Page 13

Outlier Detection Techniques

• Make assumptions about normal data and outliers
  • Statistical
  • Proximity-based
  • Clustering-based methods

• Have access to labels of outlier examples
  • Supervised
  • Semi-supervised
  • Unsupervised

Page 14

Types of Outliers

• Universal vs. contextual
  • O1, O2: local outliers (relative to cluster C1)
  • O3: global outlier

• Singular vs. collective

Page 15

Statistical Methods

• Assume "normal" observations follow some statistical model
  • Example: assume the normal points come from a Gaussian distribution
• Learn the parameters of the model from the data
• Data points not following the model are considered outliers
  • Example 1: throw away points that fall at the tails
  • Example 2: throw away low-probability points
• Disadvantage: the method is dependent on the model assumption
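Both examples can be sketched under the Gaussian assumption. The measurements, the 3-standard-deviation tail cut-off, and the density cut-off of 0.001 are illustrative assumptions.

```python
import statistics

# Hypothetical measurements; the last value is far from the rest.
normal_part = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 10.0]
data = normal_part * 2 + [25.0]

# Learn the Gaussian parameters from the data.
mu = statistics.mean(data)
sigma = statistics.stdev(data)

def is_tail_point(x, threshold=3.0):
    """Example 1: more than `threshold` standard deviations from the mean."""
    return abs(x - mu) / sigma > threshold

def is_low_probability(x, min_density=1e-3):
    """Example 2: fitted Gaussian density below an assumed cut-off."""
    return statistics.NormalDist(mu, sigma).pdf(x) < min_density

tail_outliers = [x for x in data if is_tail_point(x)]
low_prob_outliers = [x for x in data if is_low_probability(x)]
print(tail_outliers, low_prob_outliers)
```

Note that the outlier itself inflates the estimated mean and standard deviation, so with very few points an extreme value can mask itself. This is one face of the dependence on the model assumption.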


Page 16

Classification-Based Method

• If labels of outlier data points exist, we could treat the problem as a classification problem
  • Train a classifier to separate the two classes, the "normal" class and the outlier class
  • Usually heavily biased toward the normal class: an "unbalanced classification problem"
  • Cannot detect "unseen" outliers
  • Often unrealistic to assume we have labels for outlier points

• One-class classification: learn the boundary for the normal class. Points outside the boundary are considered outliers
  • Can detect new outliers
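The one-class idea can be illustrated with a deliberately crude boundary: a sphere around the centroid of the normal training points. This is a stand-in chosen for brevity, not the method used in the lecture; the training points, the `margin` parameter, and the function names are assumptions.

```python
import math

# Hypothetical training set containing only "normal" points.
normal_train = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (-0.2, -0.2), (0.0, -0.1)]

def fit_boundary(points, margin=1.5):
    """Boundary = centroid plus a radius (largest training distance times a margin)."""
    dims = len(points[0])
    center = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    radius = margin * max(math.dist(center, p) for p in points)
    return center, radius

def is_outlier(point, center, radius):
    """A point outside the learned boundary is treated as an outlier."""
    return math.dist(center, point) > radius

center, radius = fit_boundary(normal_train)
# No outlier labels were needed, yet a previously unseen outlier is flagged.
flags = [is_outlier(p, center, radius) for p in [(0.1, 0.0), (3.0, 3.0)]]
print(flags)
```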

Page 17

Proximity-Based Methods

• If the nearby points are far away, consider the data point an outlier
• No assumption on labels or models of the "normal" distribution
• No free lunch though: we rely on the robustness of the proximity measure
• Distance-based methods:
  • An observation is an outlier if its neighborhood does not have enough other observations
• Density-based methods:
  • An observation is an outlier if its density is relatively much lower than that of its neighbors
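The distance-based rule above can be sketched as a neighbor count inside a fixed radius. The points and the values of `radius` and `min_neighbors` are hypothetical and would need tuning on real data.

```python
import math

def distance_based_outliers(points, radius, min_neighbors):
    """Return points with fewer than `min_neighbors` other points within `radius`."""
    flagged = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            flagged.append(p)
    return flagged

# Hypothetical data: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (2.0, 2.0)]
flagged = distance_based_outliers(points, radius=0.5, min_neighbors=2)
print(flagged)
```

A single global radius works poorly when clusters have different densities, which is the motivation for density-based methods.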

Page 18

Density-based Outlier Detection

• Local Outlier Factor (LOF) algorithm (Breunig et al. 2000)
• For each point i, compute its k nearest neighbors N(i)
• Compute the point density:

\[ f(i) = \frac{k}{\sum_{j \in N(i)} d(i, j)} \]

• Compute the local outlier score:

\[ \mathrm{LOF}(i) = \frac{\frac{1}{k} \sum_{j \in N(i)} f(j)}{f(i)} \]

• As the name suggests, LOF is robust in detecting local outliers

Figure taken from: https://commons.wikimedia.org/wiki/File:LOF-idea.svg
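A minimal sketch of the score as defined on this slide: density is k divided by the summed distance to the k nearest neighbors, and the score is the neighbors' average density divided by the point's own density. The original LOF paper uses reachability distances, which this simplified version omits. The points and function names are assumptions, and duplicate points (zero distances) are not handled.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors N(i) of point i."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def lof_scores(points, k):
    neighbors = [knn_indices(points, i, k) for i in range(len(points))]
    # f(i) = k / (sum of distances from i to its k nearest neighbors)
    density = [
        k / sum(math.dist(points[i], points[j]) for j in neighbors[i])
        for i in range(len(points))
    ]
    # LOF(i) = (average density of i's neighbors) / (density of i)
    return [
        sum(density[j] for j in neighbors[i]) / k / density[i]
        for i in range(len(points))
    ]

# Hypothetical data: four corners of a small square plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.0, 1.0)]
scores = lof_scores(points, k=2)
print(scores)
```

Scores near 1 mean a point is about as dense as its neighbors. A score well above 1 marks a point that is much sparser than its neighborhood.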

Page 19

Main Takeaways

• Explore and understand as much as possible the quality of your data
• Identify appropriate techniques that can mitigate some of the issues of data quality
  • Some of the same techniques you are learning in this course can be leveraged to improve the quality of the training data
• Know when you need more data
• Understand biases introduced by data imputation techniques
• Approach data science as an iterative process: all the components are connected

Your analysis is only as good as your data

Page 20

References

• Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3): 581-592.
• Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York, J. Wiley & Sons.
• Little, R. J. and D. B. Rubin (2002). Statistical Analysis with Missing Data. Hoboken, NJ, John Wiley & Sons.
• Breunig, M. M., Kriegel, H., Ng, R. T., Sander, J. (2000). LOF: Identifying Density-Based Local Outliers