day1 descriptive and summary - GitHub Pages · Achieved via statistical analysis and data science...

Descriptive and Summary Statistics BIO5312 FALL2017 STEPHANIE J. SPIELMAN, PHD

Page 1

Descriptive and Summary Statistics BIO5312 FALL 2017

STEPHANIE J. SPIELMAN, PHD

Page 2

Logistics

All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017

Submit assignments via Canvas: https://templeu.instructure.com

Please bring your laptop to class!!!

Office: SERC 643
◦ Weekly office hours: Friday 1-3, ground floor of SERC ← vote?

Page 3

Course goals

The primary goal is to analyze, interpret, and visualize data in the biological sciences

Achieved via statistical analysis and data science techniques in R

This is not a course in statistical theory.

Page 4

Course topics

Descriptive and Summary Statistics

Data visualization

Fundamentals in probability, distributions

Statistical inference: hypothesis testing and confidence intervals

Linear modeling

Multiple testing

Binary classification

Clustering methods

Special topics in current biological data analysis

Page 6

But first, what are we doing here?

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

We use statistics to make inferences about phenomena using samples and to quantify the uncertainty in data

Biostatistics is (surprisingly!) a branch of applied statistics geared towards medical and biological problems

Page 7

Populations and samples

Populations are the entire collection of individuals/units/etc. a researcher is interested in
◦ Generally we can never know the true composition of a population
◦ Populations are described with parameters

Samples are subsets of individuals/units from populations
◦ We use hypothesis testing to (try to) draw population-level conclusions from samples
◦ Samples are described with estimates

Parameters and estimates use different notations, as we will see
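The distinction between a parameter and an estimate can be made concrete with a small simulation. This is only a sketch in Python (the course itself works in R), and the population below is entirely hypothetical: it is simulated so that the parameter, which is normally unknown, can be compared with an estimate from one random sample.

```python
import random
import statistics

random.seed(5312)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights (cm). Real populations are
# almost never fully observed; this one is simulated only for illustration.
population = [random.gauss(170, 10) for _ in range(10_000)]
mu = statistics.fmean(population)      # population parameter

# One random sample of 100 units, and the estimate computed from it.
sample = random.sample(population, 100)
y_bar = statistics.fmean(sample)       # sample estimate

print(f"parameter (population mean): {mu:.2f}")
print(f"estimate  (sample mean):     {y_bar:.2f}")
```

The estimate is close to the parameter but not identical to it; drawing a different sample would give a slightly different estimate.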

Page 8

What makes a good sample?

In an ideal world, a sample is unbiased and features low sampling error
◦ Bias is a systematic discrepancy between estimate and parameter

Samples should be randomly chosen
◦ Each population unit should have an equal and independent chance of being chosen for a given sample

[Figure: grid contrasting precise vs. imprecise and accurate vs. inaccurate estimates; labels: Bias, Sampling error, Low bias and low sampling error]
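As a rough illustration of why random selection matters, the Python sketch below compares a convenience sample with a simple random sample. The numbered population and the choice of "first 50 units on the list" are assumptions made up for this example, not part of the original slides.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: units numbered 1..1000, listed in increasing order.
population = list(range(1, 1001))
parameter = statistics.fmean(population)          # 500.5

# Convenience sample: the first 50 units on the list (not random).
convenience = population[:50]

# Simple random sample: every unit has the same chance of being chosen.
random_sample = random.sample(population, 50)

print("parameter:           ", parameter)
print("convenience estimate:", statistics.fmean(convenience))
print("random estimate:     ", statistics.fmean(random_sample))
```

The convenience estimate is far below the parameter no matter how often the procedure is repeated, which is bias. The random estimate also misses the parameter, but only by sampling error, which has no systematic direction.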

Page 9

Pop quiz: Is it random?

A researcher selects the first 58 student volunteers that sign up for a study

A computer program numbers all residents in a community, and then uses a random-number generator to select 26 residents

A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall out of the box.

A researcher selects all study participants whose first name starts with an A, B, K, M, or O.

Page 11

Descriptive and Summary Statistics

Tools to concisely describe data, numerically and visually

Generally the first step in data exploration and statistical analysis
◦ Identify missing values, outliers, etc.
◦ Check assumptions required to fit models or perform statistical tests
◦ Identify trends that merit further study

Page 12

Types of data

Quantitative data
◦ Continuous
◦ Discrete (includes count data)

Categorical data
◦ Nominal
◦ Ordinal
◦ Binary*

How you analyze and visualize data depends on the type of data you have

Page 13

Quantitative data

Continuous
◦ Any real-number value within some range

Discrete
◦ Values are in indivisible units, i.e. whole or counting numbers
◦ Includes count data (number of cups of coffee per day, number of amino acids in a protein…)

Page 14

Categorical data

Nominal
◦ Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO).

Ordinal: categories with a natural ordering
◦ Bad, fair, good, excellent
◦ A, B, C, D

Binary
◦ Yes/No
◦ True/False

Bonus: names of sex genotypes?

Page 15

Measures of Location

Continuous

Mean
◦ $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$

Median
◦ For odd $n$, the $\frac{n+1}{2}$th observation
◦ For even $n$, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations

Discrete

Mode
◦ The most frequently appearing observation in the distribution (commonly used for discrete data)
◦ 1, 2, 2, 2, 3, 4, 4, 5, 6 → 2
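The three measures of location can be checked on the mode example above. This is a minimal Python sketch using the standard `statistics` module (the course itself uses R); nothing here is specific to the course materials.

```python
import statistics

data = [1, 2, 2, 2, 3, 4, 4, 5, 6]   # the mode example from the slide

mean = statistics.mean(data)      # sum of the observations divided by n
median = statistics.median(data)  # middle value of the sorted data (n = 9 is odd)
mode = statistics.mode(data)      # most frequently appearing observation

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

With n = 9 the median is the 5th sorted observation, which is 3, and the mode is 2, matching the slide.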

Page 16

Measures of location in distributions

http://i.imgur.com/YSEYhha.jpg

Page 17

Measures of spread

Range

Standard deviation and variance

Interquartile range

Page 18

Range

Difference between largest and smallest value in a distribution
◦ 1, 2, 3, 7, 9 → 8
◦ 1, 2, 3, 7, 9, 500 → 499

Range is very sensitive to extreme observations and becomes very unwieldy very quickly.
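The two range examples can be reproduced in a couple of lines. This is a small Python sketch; the helper name `value_range` is made up for illustration.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

print(value_range([1, 2, 3, 7, 9]))       # 8
print(value_range([1, 2, 3, 7, 9, 500]))  # 499
```

Adding the single observation 500 changes the range from 8 to 499, which is the sensitivity to extreme observations described above.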

Page 19

Standard deviation and variance

Generally discussed in the context of the mean

Deviance describes how each data point $Y_i$ deviates from the mean $\bar{Y}$:
◦ $Y_1 - \bar{Y},\ Y_2 - \bar{Y},\ Y_3 - \bar{Y},\ \dots,\ Y_n - \bar{Y}$

Standard deviation of a sample
◦ $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

Variance
◦ $s^2$
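To make the sample formulas concrete, the Python sketch below computes the variance and standard deviation directly from the deviations and compares them with the standard library. The data values and function names are hypothetical, chosen only for illustration.

```python
import math
import statistics

def sample_variance(values):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """s: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [1, 2, 3, 7, 9]  # hypothetical example values

# The hand-written versions should agree with the library functions.
print(sample_variance(data), statistics.variance(data))
print(sample_sd(data), statistics.stdev(data))
```

For these values the mean is 4.4, the squared deviations sum to 47.2, and dividing by n - 1 = 4 gives a variance of 11.8.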

Page 20

Interquartile range

Generally discussed in the context of the median

Quartiles divide the data into four equal parts (“quar”!)

Interquartile range (IQR) is the difference between the third and first quartile
◦ How much of the data does the IQR encompass?

Five number summary: min, Q1, median, Q3, max

[Figure: the sorted data below, annotated with the first quartile, median, third quartile, and interquartile range]

1.25 1.64 1.91 2.31 2.37 2.38 2.84 2.87 2.93 2.94 2.98 3.00 3.09 3.22 3.41 3.55
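The five-number summary and IQR for the 16 values above can be computed as follows. This Python sketch assumes the "median of each half" convention for quartiles; other conventions exist (statistical software, including R, offers several) and can give slightly different Q1 and Q3 values. The function name is made up for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # lower half (median excluded when n is odd)
    upper = ordered[(n + 1) // 2 :]    # upper half (median excluded when n is odd)
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1
inside = [y for y in data if q1 <= y <= q3]   # values between Q1 and Q3

print(f"min={minimum} Q1={q1:.3f} median={median:.3f} Q3={q3:.3f} max={maximum}")
print(f"IQR={iqr:.3f}")
print(f"{len(inside)} of {len(data)} values lie between Q1 and Q3")
```

Under this convention 8 of the 16 values fall between Q1 and Q3, which speaks to the question above: the IQR spans the middle half of the data.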

Page 21

Mean or median?

The median is much more robust to outliers compared to the mean.

[Figure: distributions annotated with the location of the mean]

Which would you choose for a symmetric distribution and why?
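The robustness of the median can be seen by adding one extreme value to a small dataset. This Python sketch reuses the values from the range example; it is an illustration, not part of the original slides.

```python
import statistics

without_outlier = [1, 2, 3, 7, 9]
with_outlier = without_outlier + [500]

for label, values in [("without 500", without_outlier), ("with 500", with_outlier)]:
    print(label, "| mean =", statistics.mean(values),
          "| median =", statistics.median(values))
```

The single extreme observation moves the mean from 4.4 to 87, while the median only moves from 3 to 5.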

Page 22

Measures of variability

Coefficient of variation is the standard deviation of a sample expressed as a percentage of the sample mean (aka normalized)
◦ $COV = \frac{s}{\bar{Y}} \times 100\%$
◦ Useful measure for comparing variability between two differently-scaled datasets
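A short Python sketch shows why the coefficient of variation helps with differently-scaled data. The measurements are hypothetical: the same five lengths expressed in centimeters and in millimeters.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements of the same five items in two units.
lengths_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
lengths_mm = [y * 10 for y in lengths_cm]

# The standard deviations differ by the factor of 10 between the units...
print(statistics.stdev(lengths_cm), statistics.stdev(lengths_mm))
# ...but the coefficient of variation is the same on both scales.
print(coefficient_of_variation(lengths_cm), coefficient_of_variation(lengths_mm))
```

Because both the standard deviation and the mean scale with the unit, their ratio does not, so the two datasets can be compared directly.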

Page 23

Sample vs population notation

Measurement | Sample estimate | Population parameter
Mean | $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ | $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Standard deviation | $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$ | $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
Variance | $s^2$ | $\sigma^2$
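The practical difference between the sample and population formulas is the divisor: n - 1 for a sample, n for a population. The Python sketch below applies both to the same hypothetical values using the standard `statistics` module.

```python
import statistics

data = [1, 2, 3, 7, 9]  # hypothetical values

# Treating the values as a sample: divide by n - 1.
s = statistics.stdev(data)
s2 = statistics.variance(data)

# Treating the same values as an entire population: divide by n.
sigma = statistics.pstdev(data)
sigma2 = statistics.pvariance(data)

print(f"s = {s:.3f}, s^2 = {s2:.3f}")
print(f"sigma = {sigma:.3f}, sigma^2 = {sigma2:.3f}")
```

Here the sum of squared deviations is 47.2, so the sample variance is 47.2 / 4 = 11.8 and the population variance is 47.2 / 5 = 9.44.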

Page 24

Visualizing data

Different types of plots are used to represent different types of data

Continuous data
◦ Histogram
◦ Density plot
◦ Boxplot
◦ Violin plot

Discrete data
◦ Barplot

Comparing two continuous variables
◦ Scatterplot

Trend over time
◦ Line plot

Page 25

Histogram

[Figure: histogram; x-axis: Value, y-axis: Count]

Page 26

Using histograms to describe distributions

[Figure: four example histograms: Uniform, Bell-shaped, Asymmetric (skewed), Bimodal]

Page 27

Density plots smooth histograms

[Figure: three panels: density plot (x-axis: Value, y-axis: Density), histogram (x-axis: x, y-axis: count), and density plot (x-axis: x, y-axis: density)]

Page 28

Boxplot

Graphical representation of a five-number summary

“Whiskers” extend to the most extreme data within 1.5 × IQR of the box edges (Q1 and Q3)

[Figure: boxplot (y-axis: Value) annotated with Q1, Median, Q3, IQR, “whiskers”, and outliers]
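The 1.5 × IQR rule can be sketched in Python as follows. This assumes the common convention in which points more than 1.5 × IQR below Q1 or above Q3 are drawn as outliers, and it uses the median-of-halves convention for the quartiles; plotting software may compute quartiles slightly differently. The data and function name are hypothetical.

```python
import statistics

def boxplot_fences(values):
    """Return (lower fence, upper fence) at 1.5 * IQR beyond Q1 and Q3."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = statistics.median(ordered[: n // 2])         # median of the lower half
    q3 = statistics.median(ordered[(n + 1) // 2 :])   # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 7, 9, 500]  # hypothetical values with one extreme observation
low, high = boxplot_fences(data)

# Points outside the fences are plotted individually as outliers.
outliers = [y for y in data if y < low or y > high]
# Whiskers end at the most extreme observations still inside the fences.
whiskers = (min(y for y in data if y >= low), max(y for y in data if y <= high))

print("fences:", low, high)
print("whisker ends:", whiskers)
print("outliers:", outliers)
```

For these values Q1 = 2 and Q3 = 9, so the fences sit at -8.5 and 19.5; the whiskers end at 1 and 9, and 500 is flagged as an outlier.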

Page 29

Boxplots: The plot thickens*

*Pun intended.

[Figure: boxplots (x-axis: Distributions, y-axis: Value) of a bimodal and a unimodal distribution, with panels showing Value against Count for each]

Page 30

What can we say about this distribution based on its boxplot?

[Figure: boxplot; y-axis: Value]

Symmetry? Asymmetric
Skewness? Right-skewed
Modality? Unclear

Page 31

Violin plot: Density meets boxplot

[Figure: violin plots, density plots, and boxplots for three distributions labeled N(5,4), N(2,1), and N(4,0.09); axes: x and value for the violin plots and boxplots, value and density for the density plots]

Page 32

Barplot

[Figure: barplot of flower counts by color (orange, pink, red, white); x-axis: Flowers in garden, y-axis: Count; legend: Flower color]

Page 33

Cautionary tale in barplots

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128

Page 34

Scatterplot

[Figure: two scatterplots; x-axis: Variable 1, y-axis: Variable 2; annotations: explanatory/independent variable, response/dependent variable]

Page 35

Time series data

[Figure: two plots of yearly values, 1990 to 2003: a line plot with Year on the x-axis and Value on the y-axis, and a plot with Value on the x-axis and Year on the y-axis]

Page 36

BREAK