Statistical Thinking - Computer Science and Engineering

Post on 30-Nov-2021

Statistical Thinking

Based on C. J. Wild and M. Pfannkuch (1999). Statistical Thinking in Empirical Enquiry, International Statistical Review, 67(3):223-265.

+ Professor Matt Waite’s notes

Basic Ideas

• Thought processes involved in statistical problem solving
  • From problem formulation to conclusions
• A four-dimensional framework for statistical thinking in empirical enquiry
  • Investigative cycle
  • Interrogative cycle
  • Types of thinking
  • Dispositions
• Central element: “variation”

Four-Dimensional Framework

Dimension 1: The Investigative Cycle

• Concerned with abstracting and solving a statistical problem grounded in a larger “real” problem
• Based on the PPDAC model (Problem, Plan, Data, Analysis, Conclusions)

Dimension 2: Types of Thinking

• Variation
  • Thinking which is statistical is concerned with learning and decision making under uncertainty
  • For the purposes of explanation, prediction, or control

Dimension 2: More on Variation | Sources

Dimension 2: More on Variation | Prediction, Explain, Control

Dimension 2: Summary on Variation

• Special-cause vs. common-cause variation
  • Useful when looking for causes
• Explained vs. unexplained variation
  • Useful when exploring data & building a model for them
• Suppositions
  • Variation is an observable reality
  • Some variation can be explained; other variation cannot be explained on current knowledge
  • Random variation is the way in which statisticians model unexplained variation
  • This unexplained variation may in part or in whole be produced by the process of observation through random sampling
  • Randomness is a convenient human construct which is used to deal with variation in which patterns cannot be detected

Correlation is NOT causation

Dimension 3: The Interrogative Cycle

• Applies at macro levels
• Applies also at very detailed levels of thinking
  • Recursive
  • Subcycles are initiated within major cycles

Dimension 4: Dispositions

• When authors become intensely interested in a problem or area, a heightened sensitivity and awareness develops towards information on the peripheries of our experience that might be related to the problem
  • People are most observant in areas they find most interesting
• Engagement intensifies each dispositional element

Types of Analytics

• Descriptive
  • Describing characteristics or properties in the data
• Predictive
  • Predicting the types of outcomes given new sets of data, usually based on a classifier trained using labelled, existing datasets
• Prescriptive
  • Deciding on the best route or option or decision to make given data

Types of Data

• Categorical (cf. Wikipedia)
  • Variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property
  • The blood type of a person: A, B, AB or O
  • The state that a person lives in
  • The political party that a voter might vote for
  • The type of a rock: igneous, sedimentary or metamorphic
  • Ordinal data?
• Numerical
  • Can be subdivided into discrete data (things that can be counted) and continuous data (any value within a range).
  • # of children, age, scores, temperatures, etc.

Descriptive Statistics

• There are three main groups of descriptives
  • The distribution
    • Works well with categorical data. How many of each thing are there?
  • The central tendency
    • Only works with numerical data. What are the mean, median and mode?
  • The dispersion
    • Only works with numerical data. How spread out are the data?

Descriptive Statistics: Distribution

• Grouping and counting by categorical data – group and count by town, or zip code or something like that
  • Often called a frequency distribution
  • Histogram
• With numerical data, minimum and maximum values are useful
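
As a minimal sketch of the grouping-and-counting idea above, the snippet below builds a frequency distribution with Python's standard `collections.Counter`. The town names and ages are made-up illustration data, not taken from the slides.

```python
from collections import Counter

# Hypothetical records: the town each survey respondent lives in.
towns = ["Lincoln", "Omaha", "Lincoln", "Kearney", "Omaha", "Lincoln"]

# Frequency distribution: group by category and count each group.
frequency = Counter(towns)
for town, count in frequency.most_common():
    print(f"{town}: {count}")

# For numerical data, the minimum and maximum give a first look at the values.
ages = [23, 35, 31, 47, 19, 52]  # hypothetical ages
print("min:", min(ages), "max:", max(ages))
```

Plotting the counts as bars would give the histogram-style view the slide mentions.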

Descriptive Statistics: Central Tendency

• Mean
  • Average or norm: add up all values to find a total, and then divide the total by the number of values
• Median
  • Middle value: sort all values into order, and the median is the middle value; if there are 2 values in the middle, find the mean of these two
• Mode
  • Most frequent value: count how many times each value appears; the mode is the value that appears the most
  • Can have more than one mode
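
The three measures above can be sketched with Python's standard `statistics` module. The `scores` list is hypothetical data chosen so that there are two modes.

```python
import statistics

scores = [2, 3, 3, 5, 7, 7, 8]  # hypothetical data

mean = statistics.mean(scores)        # total divided by the number of values
median = statistics.median(scores)    # middle value of the sorted data
modes = statistics.multimode(scores)  # every value tied for most frequent

# With an even number of values, the median is the mean of the two middle ones.
even_median = statistics.median([1, 2, 3, 4])

print(mean, median, modes, even_median)
```

`multimode` (Python 3.8+) is used rather than `mode` because, as the slide notes, a data set can have more than one mode.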

Descriptive Statistics: Dispersion

• Range
  • Difference between the lowest and highest values
  • Subject to extremes (e.g., outliers)
• Standard deviation
  • Measures how far a set of scores typically lies from the mean
  • Subject to skewness in distribution
• For a Gaussian/normal distribution
  • About 68% of all values will be within 1 standard deviation of the mean
  • About 95% will be within 2 standard deviations
  • About 99.7% will be within 3 standard deviations
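
A rough way to check these dispersion measures is to simulate normally distributed data and count how much of it falls within 1, 2, and 3 standard deviations of the mean. The sketch below assumes a hypothetical normal population with mean 100 and standard deviation 15; the seed and sample size are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical sample drawn from a normal distribution (mean 100, sd 15).
values = [random.gauss(100, 15) for _ in range(10_000)]

value_range = max(values) - min(values)  # driven entirely by the two extremes
mean = statistics.mean(values)
sd = statistics.pstdev(values)  # standard deviation, treating the data as the population

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

print(f"range: {value_range:.1f}, sd: {sd:.1f}")
for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.3f}")
```

The printed shares should land close to 0.68, 0.95, and 0.997, though any single simulated sample will differ slightly.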

Dirty Data

• Missing data
  • Blanks in the database or spreadsheet.
  • Data missing from a period of time.
  • Missing states, counties, zip codes.
• Wrong data
  • Wrong type – numbers where they should be text and vice versa
  • Sharp curves – trends that continue normally and then suddenly jump in one year
  • Conflicting data within a dataset or across datasets (race, percentages, etc.)
• Unusable data
  • Non-standardized data
  • Inconsistent data
  • Abbreviations
  • Unit consistency
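
A few of these problems can be caught with simple automated checks. The sketch below is one possible approach, not a complete cleaning routine: the rows, the field names `state` and `population`, and the helper `find_problems` are all hypothetical, and the rule that states must be two-letter abbreviations is an assumed convention for this example.

```python
# Hypothetical rows, as they might come out of a spreadsheet export.
rows = [
    {"state": "NE", "population": "1200"},
    {"state": "", "population": "850"},                # missing state
    {"state": "Kansas", "population": "2.9 million"},  # non-standard name, wrong type
]

def find_problems(row):
    """Return a list of human-readable issues found in one row."""
    problems = []
    if not row["state"].strip():
        problems.append("missing state")
    elif len(row["state"]) != 2:
        # Assumed convention: states are stored as two-letter abbreviations.
        problems.append("state is not a two-letter abbreviation")
    if not row["population"].isdigit():
        problems.append("population is not a whole number")
    return problems

report = {index: find_problems(row) for index, row in enumerate(rows)}
for index, problems in report.items():
    print(index, problems or "ok")
```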

Correlation

• Pearson correlation coefficients (or Pearson product-moment correlation coefficient)
  • It is a measure of how LINEARLY related two entities are.
  • How often is a change in A related to a change in B? And is that positive or negative?

Correlation: For a population

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Standard deviation of X; standard deviation of Y
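
The formula itself did not survive in this transcript. The labels above presumably annotate the denominator of the standard population definition, which is the one given on the linked Wikipedia page. Here $\mu_X$ and $\mu_Y$ are the population means and $\sigma_X$ and $\sigma_Y$ the population standard deviations:

```latex
\rho_{X,Y}
  = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
  = \frac{\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y}
```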

Correlation: For a sample
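
The sample formula is also missing from the transcript. The usual sample version, stated here as the standard textbook definition rather than a recovery of the original slide, replaces the population quantities with estimates from $n$ paired observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$:

```latex
r_{xy}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```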

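Correlation: What it means?

• It ranges from -1 to 1.
• 1 = perfect positive correlation
  • When A goes up, B goes up by a fixed proportional amount (e.g., A goes up 1, B goes up 1)
  • In the real world, almost never happens outside of a mistake
• 0 = no linear correlation at all
  • Exactly 0 rarely happens
  • NEAR zero happens all the time
• -1 = perfect negative correlation
  • When A goes up, B goes down by a fixed proportional amount (e.g., A goes up 1, B goes down 1)
  • It is just like 1: rare, probably a mistake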
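
To make the three reference points concrete, here is a small sketch that computes the sample Pearson coefficient from scratch. The function name `pearson_r` and all the data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

a = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(a, [10, 20, 30, 40, 50])  # B rises 10 per unit of A
perfect_negative = pearson_r(a, [10, 8, 6, 4, 2])      # B falls 2 per unit of A
weak = pearson_r(a, [2, 5, 1, 5, 3])                   # no clear linear pattern

print(perfect_positive, perfect_negative, round(weak, 3))
```

The first two cases show that the coefficient reflects how consistently B moves with A, not the size of the step.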

Significance: t-test

• The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.
• A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known
• When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution
• The t-test can be used, for example, to determine if two sets of data are significantly different from each other
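
As one concrete instance of comparing two sets of data, the sketch below computes the test statistic for Welch's unequal-variance two-sample t-test. This is only one of several t-test variants and is chosen here as an assumption, since the slide does not name one. The function name `welch_t` and the two groups are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 denominator): the estimated "scaling term".
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    se2_a, se2_b = var_a / n_a, var_b / n_b  # squared standard error of each mean
    difference = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = difference / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

group_a = [5.1, 4.9, 5.6, 5.8, 5.3]  # hypothetical measurements
group_b = [4.2, 4.4, 4.8, 4.1, 4.5]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning `t_stat` into a p-value requires the t-distribution's cumulative distribution function, which the standard library does not provide. SciPy's `scipy.stats.ttest_ind` with `equal_var=False` runs the same test and reports the p-value as well.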

https://en.wikipedia.org/wiki/Student%27s_t-test

Significance: p-value & null hypothesis

• In the context of null hypothesis testing, the p-value is used to quantify the idea of statistical significance of evidence
  • In essence, a claim is assumed valid if its counter-claim is improbable
• The only hypothesis that needs to be specified in this test, and which embodies the counter-claim, is referred to as the null hypothesis
  • i.e., the hypothesis to be nullified
• A result is said to be statistically significant if it allows us to reject the null hypothesis
  • A statistically significant result should be highly improbable if the null hypothesis is assumed to be true
• The rejection of the null hypothesis implies that the correct hypothesis lies in the logical complement of the null hypothesis
• Caveat: Unless there is a single alternative to the null hypothesis, the rejection of the null hypothesis does not tell us which of the alternatives might be the correct one
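
The logic of "improbable if the null hypothesis is true" can be illustrated without any distributional formulas by a permutation test. This is a different technique from the t-test and is used here purely as an illustration. Under the null hypothesis that group labels do not matter, shuffling the labels should often produce differences as large as the observed one. The function name `permutation_p_value`, the data, and the number of shuffles are all assumptions of this sketch.

```python
import random

def permutation_p_value(sample_a, sample_b, n_permutations=5_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by label shuffling."""
    rng = random.Random(seed)
    n_a = len(sample_a)

    def mean_gap(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(sample_a) + list(sample_b)
    observed = mean_gap(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random (the null hypothesis)
        # Small tolerance guards against floating-point rounding in the sums.
        if mean_gap(pooled) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

group_a = [5.1, 4.9, 5.6, 5.8, 5.3]  # hypothetical measurements
group_b = [4.2, 4.4, 4.8, 4.1, 4.5]
p = permutation_p_value(group_a, group_b)
print(f"estimated p-value: {p:.4f}")
```

A small estimated p-value says the observed gap would be unusual if the labels were arbitrary. As the caveat above notes, it does not say which alternative explanation is the right one.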

https://en.wikipedia.org/wiki/Student%27s_t-test