day1 descriptive and summary - GitHub Pages · Achieved via statistical analysis and data science...
Descriptive and Summary Statistics
BIO5312 FALL 2017
STEPHANIE J. SPIELMAN, PHD
Logistics
All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017
Submit assignments via Canvas: https://templeu.instructure.com
Please bring your laptop to class!!!
Office: SERC 643
◦ Weekly office hours: Friday 1-3, ground floor of SERC ← vote?
Course goals
The primary goal is to analyze, interpret, and visualize data in the biological sciences
Achieved via statistical analysis and data science techniques in R
This is not a course in statistical theory.
Course topics
Descriptive and Summary Statistics
Data visualization
Fundamentals in probability, distributions
Statistical inference: hypothesis testing and confidence intervals
Linear modeling
Multiple testing
Binary classification
Clustering methods
Special topics in current biological data analysis
But first, what are we doing here?
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
We use statistics to make inferences about phenomena using samples and to quantify the uncertainty of data
Biostatistics is (surprisingly!) a branch of applied statistics geared towards medical and biological problems
Populations and samples
Populations are the entire collection of individuals/units/etc. a researcher is interested in
◦ Generally we can never know the true composition of a population
◦ Populations are described with parameters
Samples are subsets of individuals/units from populations
◦ We use hypothesis testing to (try to) draw population-level conclusions from samples
◦ Samples are described with estimates
Parameters and estimates use different notations, as we will see
What makes a good sample?
In an ideal world, a sample is unbiased and features low sampling error
◦ Bias is a systematic discrepancy between estimate and parameter
Samples should be randomly chosen
◦ Each population unit should have an equal and independent chance of being chosen for a given sample
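A simple random sample, where every unit has an equal chance of being chosen, can be sketched in a few lines. The course itself works in R; this sketch uses Python purely for illustration, and the community size (500) and the seed are made-up values.

```python
import random

# Hypothetical community: residents numbered 1..500 (made-up size).
residents = list(range(1, 501))

# A fixed seed is used only so the example is reproducible.
rng = random.Random(42)

# Draw 26 residents without replacement; each resident is equally likely to be picked.
sample = rng.sample(residents, 26)

print(sorted(sample))
```

Taking the first volunteers who sign up, or filtering by first initial, skips this random draw, so the resulting sample may differ systematically from the population.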
[Figure: bias and sampling error, shown as a grid of Precise vs. Imprecise and Accurate vs. Inaccurate; the goal is low bias and low sampling error]
Pop quiz: Is it random?
A researcher selects the first 58 student volunteers that sign up for a study
A computer program numbers all residents in a community, and then uses a random-number generator to select 26 residents
A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall out of the box.
A researcher selects all study participants whose first name starts with an A, B, K, M, or O.
Descriptive and Summary Statistics
Tools to concisely describe data, numerically and visually
Generally the first step in data exploration and statistical analysis
◦ Identify missing values, outliers, etc.
◦ Check assumptions required to fit models or perform statistical tests
◦ Identify trends that merit further study
Types of data
Quantitative data
◦ Continuous
◦ Discrete (includes count data)
Categorical data
◦ Nominal
◦ Ordinal
◦ Binary*
How you analyze and visualize data depends on the type of data you have
Quantitative data
Continuous
◦ Any real-number value within some range
Discrete
◦ Values are in indivisible units, i.e. whole or counting numbers
◦ Includes count data (number of cups of coffee per day, number of amino acids in a protein…)
Categorical data
Nominal
◦ Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO).
Ordinal – categories with a natural ordering
◦ Bad, fair, good, excellent
◦ A, B, C, D
Binary
◦ Yes/No
◦ True/False
Bonus: names of sex genotypes?
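Measures of Location
Continuous
Mean
◦ $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
Median
◦ For odd n, the $\frac{n+1}{2}$th observation of the sorted data
◦ For even n, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations of the sorted data
Discrete
Mode
◦ The most frequently appearing observation in the distribution (commonly used for discrete data)
◦ 1, 2, 2, 2, 3, 4, 4, 5, 6 → 2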
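As a quick check of these definitions, here is a minimal Python sketch using the standard-library `statistics` module on the example values 1, 2, 2, 2, 3, 4, 4, 5, 6. Python is used here only for illustration.

```python
import statistics

values = [1, 2, 2, 2, 3, 4, 4, 5, 6]  # example data from the mode bullet

mean = statistics.mean(values)      # sum of the values divided by n
median = statistics.median(values)  # middle value of the sorted data (n = 9 is odd)
mode = statistics.mode(values)      # most frequently appearing value

print(mean, median, mode)
```

With n = 9, the median is the (9 + 1)/2 = 5th sorted observation, which is 3, and the mode is 2.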
Measures of location in distributions
http://i.imgur.com/YSEYhha.jpg
Measures of spread
Range
Standard deviation and variance
Interquartile range
Range
Difference between largest and smallest value in a distribution
◦ 1, 2, 3, 7, 9 → 8
◦ 1, 2, 3, 7, 9, 500 → 499
Range is very sensitive to extreme observations and becomes very unwieldy very quickly.
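The two range examples can be reproduced with a small helper. This is an illustrative Python sketch, and `value_range` is a hypothetical name chosen for this example.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

small = value_range([1, 2, 3, 7, 9])       # 9 - 1 = 8
large = value_range([1, 2, 3, 7, 9, 500])  # 500 - 1 = 499

print(small, large)
```

A single extreme observation (500) is enough to move the range from 8 to 499.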
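Standard deviation and variance
Generally discussed in the context of the mean
Deviation describes how each data point $Y_i$ deviates from the mean $\bar{Y}$:
◦ $Y_1 - \bar{Y},\ Y_2 - \bar{Y},\ Y_3 - \bar{Y},\ \ldots,\ Y_n - \bar{Y}$
Standard deviation of a sample
◦ $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
Variance
◦ $s^2$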
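To make the formula concrete, the sketch below computes the sample variance and standard deviation step by step and compares them with Python's `statistics` module. The data reuse the values 1, 2, 3, 7, 9 from the range example. Python is used only for illustration.

```python
import math
import statistics

values = [1, 2, 3, 7, 9]
n = len(values)
mean = sum(values) / n

# Deviation of each observation from the sample mean.
deviations = [y - mean for y in values]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(variance, sd)
print(statistics.variance(values), statistics.stdev(values))
```

The deviations themselves always sum to zero, which is why they are squared before being averaged.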
Interquartile range
Generally discussed in the context of the median
Quartiles divide the data into four equal parts (“quar”!)
Interquartile range (IQR) is the difference between the third and first quartile
◦ How much of the data does the IQR encompass?
Five number summary: min, Q1, median, Q3, max
[Figure: sorted data with the first quartile, median, third quartile, and interquartile range marked]
1.25 1.64 1.91 2.31 2.37 2.38 2.84 2.87 2.93 2.94 2.98 3.00 3.09 3.22 3.41 3.55
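A five-number summary for the 16 values above can be sketched with Python's `statistics.quantiles`. Quartile conventions differ between tools, so the exact Q1 and Q3 depend on the method; this sketch uses the function's default ("exclusive") method, which is an assumption and not necessarily the convention used on the slide.

```python
import statistics

values = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
          2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

# n=4 returns the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4)

iqr = q3 - q1  # spread of the middle half of the data
five_number = (min(values), q1, q2, q3, max(values))

print(five_number)
print(iqr)
```

Because a quarter of the data lies below Q1 and a quarter above Q3, the IQR spans roughly the middle 50% of the observations.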
Mean or median?
The median is much more robust to outliers compared to the mean.
[Figure: distributions with the mean marked]
Which would you choose for a symmetric distribution and why?
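The robustness of the median can be seen by adding one extreme value to a small dataset. The sketch below is in Python, for illustration only, and reuses the values from the range example.

```python
import statistics

without_outlier = [1, 2, 3, 7, 9]
with_outlier = without_outlier + [500]

mean_before = statistics.mean(without_outlier)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(without_outlier)
median_after = statistics.median(with_outlier)

# The mean jumps from 4.4 to 87, while the median only moves from 3 to 5.
print(mean_before, mean_after)
print(median_before, median_after)
```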
Measures of variability
Coefficient of variation is the standard deviation of a sample expressed as a percentage of the sample mean (aka normalized)
◦ $COV = \frac{s}{\bar{Y}} \times 100\%$
◦ Useful measure for comparing variability between two differently-scaled datasets
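A minimal Python sketch of the coefficient of variation, for illustration only. The heights are made-up data and `coefficient_of_variation` is a hypothetical helper name.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

heights_cm = [150, 160, 170, 180, 190]     # made-up measurements in centimeters
heights_m = [h / 100 for h in heights_cm]  # the same measurements in meters

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(cv_cm, cv_m)
```

The standard deviation changes by a factor of 100 between the two scales, but the coefficient of variation is the same, which is what makes it suitable for comparing differently-scaled datasets.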
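Sample vs population notation
| Measurement | Sample estimate | Population parameter |
|---|---|---|
| Mean | $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ | $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Standard deviation | $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$ | $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$ |
| Variance | $s^2$ | $\sigma^2$ |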
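The practical difference between the two columns is the divisor: n - 1 for the sample estimate and n for the population parameter. Python's `statistics` module exposes both, as this illustrative sketch shows, reusing the values 1, 2, 3, 7, 9 from the range example.

```python
import statistics

values = [1, 2, 3, 7, 9]

s2 = statistics.variance(values)       # sample variance, divides by n - 1
sigma2 = statistics.pvariance(values)  # population variance, divides by n

s = statistics.stdev(values)       # sample standard deviation
sigma = statistics.pstdev(values)  # population standard deviation

print(s2, sigma2)
print(s, sigma)
```

Treating the same five numbers as a whole population gives a smaller variance (47.2 / 5) than treating them as a sample (47.2 / 4).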
Visualizing data
Different types of plots are used to represent different types of data
Continuous data
◦ Histogram
◦ Density plot
◦ Boxplot
◦ Violin plot
Discrete data
◦ Barplot
Comparing two continuous variables
◦ Scatterplot
Trend over time
◦ Line plot
Histogram
[Figure: histogram; x-axis: Value, y-axis: Count]
Using histograms to describe distributions
[Figure: four histograms with panels Uniform, Bell-shaped, Asymmetric (skewed), Bimodal]
Density plots smoothen histograms
[Figure: a density plot (x-axis: Value, y-axis: Density), then a histogram (x-axis: x, y-axis: count) next to a density plot (x-axis: x, y-axis: density)]
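Boxplot
Graphical representation of a five-number summary
“Whiskers” extend to the most extreme data points within 1.5 × IQR below Q1 and above Q3; points beyond the whiskers are drawn as outliers
[Figure: annotated boxplot (y-axis: Value) labeling Q1, median, Q3, IQR, “whiskers”, and outliers]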
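The 1.5 × IQR rule behind the whiskers can be sketched as follows. This is an illustrative Python sketch: the data are made up, `boxplot_summary` is a hypothetical helper name, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the convention of a given plotting tool.

```python
import statistics

def boxplot_summary(values):
    """Compute quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

data = list(range(1, 21)) + [500]  # made-up data: 1..20 plus one extreme value
summary = boxplot_summary(data)
print(summary)
```

Here the value 500 lies far above Q3 + 1.5 × IQR, so it is reported as an outlier and the upper whisker stops at 20.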
Boxplots: The plot thickens*
*Pun intended.
[Figure: boxplots of two distributions (x-axis: Distributions, y-axis: Value) and their histograms (x-axis: Value, y-axis: Count), labeled Bimodal and Unimodal]
What can we say about this distribution based on its boxplot?
[Figure: boxplot; y-axis: Value]
Symmetry? Asymmetric
Skewness? Right-skewed
Modality? Unclear
Violin plot: Density meets boxplot
[Figure: violin plot (x-axis: x, y-axis: value), followed by a grid comparing the violin plot, density plot, and boxplot for three distributions: N(5,4), N(2,1), N(4,0.09)]
Barplot
[Figure: barplot; x-axis: Flowers in garden (orange, pink, red, white), y-axis: Count, legend: Flower color]
Cautionary tale in barplots
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128
Scatterplot
[Figure: two scatterplots; x-axis: Variable 1 (explanatory/independent variable), y-axis: Variable 2 (response/dependent variable)]
Time series data
[Figure: line plot with Year on the x-axis and Value on the y-axis, and a second plot with Value on the x-axis and Year (1990-2003) on the y-axis]
BREAK