Module 2: Descriptive Statistics (and a bit about R)faculty.nps.edu/rdfricke/OA3102/Descriptive...

Post on 24-Mar-2018

Revision: 1-12 1 1 1

Module 2: Descriptive Statistics (and a bit about R)

Statistics (OA3102)

Professor Ron Fricker
Naval Postgraduate School
Monterey, California

Reading assignment: WM&S chapter 1

Why Care About Descriptive Statistics?

• Data sets continue to grow ever bigger
  – The human mind cannot assimilate and make sense of volumes of raw data
• Descriptive statistics are useful for data reduction
  – Numeric summaries
  – Graphical plots
• Good descriptive statistics help analysts and decision makers understand what the raw data means

Goals for this Module

• Define types of data and types of variables
• Learn how to appropriately summarize data using descriptive statistics
  – Numerical descriptive statistics
    • Measures of location: mean, median, mode
    • Measures of spread: variance, standard deviation, range, inter-quartile range, etc.
  – Graphical descriptive statistics
    • Continuous variables: histogram, boxplot
    • Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R

Variables

• A characteristic that is being studied in a statistical problem is called a variable
• Types of variables:
  – Continuous: Can divide by any number and the result still makes sense
    • Examples: flight time, failure rate, detection distance
  – Categorical:
    • Ordinal: ordered categories
      – Examples: rank, magazine capacity, shirt size
    • Nominal: unordered categories
      – Examples: gender, service branch, ship type
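
As a rough illustration of how these variable types might be represented in R: a numeric vector can hold a continuous variable, factor() a nominal one, and factor(..., ordered = TRUE) an ordinal one. The vectors and their values below are invented for this sketch and are not course data.

```r
# Hypothetical values, invented for illustration only
flight_time <- c(1.5, 2.25, 0.75)                        # continuous
branch <- factor(c("Navy", "Army", "Navy", "USMC"))      # nominal (unordered)
shirt <- factor(c("M", "S", "L", "M"),
                levels = c("S", "M", "L"),
                ordered = TRUE)                          # ordinal (ordered)

is.numeric(flight_time)   # TRUE
is.ordered(branch)        # FALSE
is.ordered(shirt)         # TRUE
levels(shirt)             # "S" "M" "L", in the stated order
```

Declaring the levels explicitly matters for ordinal data: without the levels argument, R would order the categories alphabetically.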

Types of Data

[Diagram: data divide into qualitative and quantitative; quantitative data divide into discrete and continuous. The diagram also carries the labels (nominal), (ordinal), and (continuous).]

Some Descriptive Statistics

• Numerical:
  – Location: Mean, median, mode
  – Spread: Standard deviation, variance, range, quantiles, IQR
  – Correlation
• Graphical:
  – Histograms, bar charts, dot charts, boxplots, scatter plots, etc.
• Good descriptive statistics lead to good decision making

Sample Mean ($\bar{x}$)

• Sample average or sample mean
  – Sample consists of n observations, $x_1, \ldots, x_n$
  – Often denoted by $\bar{x}$ (spoken “x-bar”)
• To calculate: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  – R: use mean() function
  – Excel: =AVERAGE(cell reference)

Sample Median ($\tilde{x}$)

• The median is the halfway point in the ordered data
• Steps to calculate the median:
  – Order the data from smallest to largest
  – If the number of observations is odd, the middle observation is the median. E.g.,
      1 3 5 6 12 12 99 (median = 6)
  – If the number is even, then the average of the two middle observations is the median. E.g.,
      1 3 5 6 12 12 (median = 5.5)

Using More Formal Notation…

• Let $x_{(i)}$ denote the ith order statistic from a sample $x_1, x_2, \ldots, x_n$
  – E.g., for $x_1 = 5, x_2 = 12, x_3 = 2$, we have $x_{(1)} = 2, x_{(2)} = 5, x_{(3)} = 12$
• Then the sample median can be defined as
    n odd: $\tilde{x} = x_{((n+1)/2)}$
    n even: $\tilde{x} = \frac{1}{2}\left( x_{(n/2)} + x_{(n/2+1)} \right)$
  – Equations apply to samples and populations
• To calculate
  – R: use median() function
  – Excel: =MEDIAN(cell reference)

Mean vs. Median

• Both are measures of location or “central tendency”
  – But, median less affected by outliers
• Example:
  – Imagine a sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 2
    • Median=mean=1
  – Another sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 83
    • Median still equals 1, but mean=10!
• Which to use? Depends on whether you are:
  – characterizing a “typical” observation (the median)
  – or describing the average value (the mean)

Exercise

• Calculate “by hand” the mean and median for the data: {6,1,3,7,3,6,7,4,8}

Exercise (continued)

• Now do the same for {6,1,3,7,3,6,7,4,8,100}

Now, in R:

• For {6,1,3,7,3,6,7,4,8}:
• For {6,1,3,7,3,6,7,4,8,100}:
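
The R session for these two data sets is not reproduced in this transcript. A minimal sketch of what it might look like, using only mean() and median() (the variable names x1 and x2 are arbitrary choices for this sketch):

```r
x1 <- c(6, 1, 3, 7, 3, 6, 7, 4, 8)
x2 <- c(x1, 100)   # the same data plus one extreme observation

mean(x1)     # 5
median(x1)   # 6
mean(x2)     # 14.5
median(x2)   # 6
```

The single extreme value nearly triples the mean, while the median does not move.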

Common Measures of “Spread”

• Measures of location tell you where the “center” of the data is
• Measures of spread tell you how variable the data is around the center
• Typical measures of spread:
  – Sample variance: essentially, the average squared deviation around the mean, $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  – Standard deviation: the square root of the variance, $s = \sqrt{s^2}$
    • The standard deviation is in the same units as the mean

Exercise

• Calculate “by hand” the sample variance and standard deviation for the data: {1,2,3,4,5}
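
One way to check the hand calculation is to code the sample variance formula directly and compare it with R's built-in var() and sd(). This is a sketch; the names s2 and s are chosen here to mirror the notation $s^2$ and $s$.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - mean(x))^2) / (n - 1)   # 2.5
s <- sqrt(s2)                          # about 1.58

all.equal(s2, var(x))   # TRUE: var() uses the same n - 1 divisor
all.equal(s, sd(x))     # TRUE
```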

Pictorially

[Four slides of figures; the images are not reproduced in this transcript]

Ignore Variability at Your Peril

• Often analyses only focus on the average
• But it’s possible to be right on average and be way off in every case
  – The average high temperature in Washington DC in June is 83 degrees
    • “Oh, how balmy!”
    • No...it’s either 75° or it’s 90+ degrees!

From Flaws and Fallacies in Statistical Thinking by Stephen K. Campbell.

The Range (R)

• Range is another measure of spread
• In words, it is the largest observation in the sample minus the smallest observation
  – Example: A sample of students’ ages in the class
    • Data: 21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40
    • Note that they are already ordered!
    • R = 40 - 21 = 19
  – Using previous notation: $R = x_{(n)} - x_{(1)}$
• In R: use the code diff(range())
  – range() function gives x(1) and x(n)
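
Applied to the ages example, the diff(range()) idiom might look like the following sketch (the vector name ages is an arbitrary choice):

```r
ages <- c(21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40)

range(ages)             # 21 40, i.e. x(1) and x(n)
diff(range(ages))       # 19
max(ages) - min(ages)   # 19, the same calculation spelled out
```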

Other Measures of Spread: Quantiles and Percentiles

• Percentiles
  – For data, the pth percentile, $0 \le p \le 100$, is the value of x such that p% of the data is less than or equal to x
• Quantiles are the same as percentiles except for scale
  – Percentiles are on a 0 to 100 scale
  – Quantiles are on a 0 to 1 scale
  – The pth quantile equals the (p × 100)th percentile

Special Percentiles and Quantiles

• Special percentiles:

– Minimum: 0th percentile (or 0 quantile)

– Median: 50th percentile (or 0.5 quantile)

– Maximum: 100th percentile (or 1.0 quantile)

• Quartiles: 25th and 75th percentiles

– Devore: “lower fourth” and “upper fourth”

• Interquartile Range (IQR):

IQR = 75th percentile - 25th percentile

– Devore calls the IQR the “fourth spread”

– In R: IQR()

Calculating Quantiles

• R function: quantile(data, probs)
  – data is a numeric vector of data
  – probs is a numeric vector of probabilities
    • Default: 0, 0.25, 0.5, 0.75 and 1.0 quantiles
• In R, the pth quantile is $x_{(p(n-1)+1)}$
  – If $p(n-1)+1$ is not an integer, interpolate between the two closest values

– E.g.,
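
The original example output is not reproduced in this transcript. As a stand-in, the sketch below codes the interpolation rule by hand and compares it with quantile() at its default settings. The helper name quantile_by_hand is hypothetical, and the data are borrowed from the five number summary exercise in this module.

```r
x <- c(12, 2, 7, 5, 15, 4, 9, 18, 6)   # sorted: 2 4 5 6 7 9 12 15 18

# Hypothetical helper implementing the rule: position h = p(n-1) + 1
quantile_by_hand <- function(data, p) {
  xs <- sort(data)
  h <- p * (length(xs) - 1) + 1            # position in the ordered data
  lo <- floor(h)
  hi <- ceiling(h)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])    # linear interpolation
}

quantile_by_hand(x, 0.25)   # 5: position 0.25 * 8 + 1 = 3, and x(3) = 5
quantile_by_hand(x, 0.10)   # 3.6: position 1.8, between x(1) = 2 and x(2) = 4
quantile(x, c(0.10, 0.25))  # built-in function gives the same values
```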

Hinges

• Hinges are an alternative to quartiles
  – They’re the $x_{(j)}$ and $x_{(n-j+1)}$ order statistics, for $j = \frac{\lfloor (n+1)/2 \rfloor + 1}{2}$; if j is not an integer, interpolate
• Easier way to compute:
  – If n is even, they’re the median values of the upper and lower halves of the sorted data
  – If n is odd, they’re the median values of the upper and lower halves of the sorted data, where each half includes the median data point

Exercise

• “By hand,” calculate the five number summary for {12,2,7,5,15,4,9,18,6}
  – The five number summary is the minimum, lower hinge, median, upper hinge, maximum

Exercise (continued)

• “By hand,” calculate the five number summary for {12,2,7,5,15,4,9,18,6,10}

Results in R
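
The R output for these exercises is not reproduced in this transcript. A sketch of how the answers might be checked uses fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum. The variable names a and b are arbitrary.

```r
a <- c(12, 2, 7, 5, 15, 4, 9, 18, 6)
b <- c(a, 10)

fivenum(a)   # 2 5 7 12 18
fivenum(b)   # 2 5 8 12 18

# Hinges are not always equal to the quartiles from quantile()
quantile(b, c(0.25, 0.75))   # 5.25 and 11.5, versus hinges of 5 and 12
```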

The Empirical Rule

[Figure: standard normal density curve plotted against Z from -4 to 4, with the central 68%, 95%, and 99.7% regions marked]

• If the distribution of measurements is approximately normal, then:
  – 68% of the data is within μ ± 1σ
  – 95% within μ ± 2σ
  – 99.7% (“almost all”) within μ ± 3σ
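
These percentages can be recovered from the standard normal distribution function. A short sketch using pnorm(), where the name coverage is an arbitrary choice:

```r
k <- 1:3                           # number of standard deviations
coverage <- pnorm(k) - pnorm(-k)   # P(-k < Z < k) for a standard normal Z

round(coverage, 4)   # 0.6827 0.9545 0.9973
```

The rule's 68%, 95%, and 99.7% are these values rounded.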

Remember Notation Conventions

• Summation:
  – Σ notation and subscripts
• Size:
  – n denotes size of sample
  – N denotes size of population
• Knowns vs. unknowns:
  – Lowercase letters (e.g., “x”) mean the quantity is known
  – Capital letters (e.g., “X”) mean the quantity is unknown (i.e., it’s a random variable)

Graphically Depicting Data

• Many different types of plots and charts
• Whatever you do, don’t fall into the trap of just using Excel plots because they’re easy
  – R is much more powerful and flexible
  – Excel does not do some important/useful plot types

[Figure: example chart with a “Count” axis and a horizontal axis running from 80 to 125 (thousands)]

A Classic Good Graphic

Some Types of Graphical and Tabular Summaries of Data

• Univariate discrete data: tables, barplots, dot charts, pie charts
• Univariate continuous data: stem-and-leaf plots, strip charts, histograms, boxplots
• Bivariate discrete data: two-way contingency tables
• Bivariate continuous data: scatterplots, QQ plots

Tabular Summaries of Data

• Categorical data: counts and/or percentages by category
• Continuous data: counts and/or percentages within “bins”
  – Bins: sequential intervals over the range of data
    • Generally intervals are of equal width
• Must decide how to count a data point that falls on the boundary between two bins
  – Either count them all in the left bins, or in the right bins
  – Doesn’t matter which, just be consistent

Example: Tabular Summary of Univariate Categorical Data

Manufacturer      Frequency   Relative Frequency (fraction)
Honda                    41   0.34
Yamaha                   27   0.23
Kawasaki                 20   0.17
Harley-Davidson          18   0.15
BMW                       3   0.03
Other                    11   0.08
Total                   120   1.00

• In R, use the table() function

• For the example:
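
The R output for this example is not reproduced in this transcript. The sketch below rebuilds a vector consistent with the frequency table and tabulates it. The vector name manufac matches the name used in the barplot code in this module, but how the original vector was created is not shown, so the construction with rep() is an assumption.

```r
# Rebuild a raw data vector from the counts in the table (assumed construction)
counts <- c(Honda = 41, Yamaha = 27, Kawasaki = 20,
            "Harley-Davidson" = 18, BMW = 3, Other = 11)
manufac <- rep(names(counts), times = counts)

tab <- table(manufac)              # counts per category, in alphabetical order
tab
round(tab / length(manufac), 2)    # relative frequencies
```

R's rounded fractions may differ in the last digit from the hand-rounded column in the table, which was evidently adjusted to sum to 1.00.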

Barplots

• Barplots are also known as bar charts and bar graphs
• Plot one bar for each category
  – Bars show counts or percentage of observations in each category
• Can plot bars vertically or horizontally
• In R: barplot()
  – Option horiz=TRUE plots bars horizontally (default is FALSE)

In R

barplot(table(manufac),xlab="Manufacturer",ylab="Count")
barplot(table(manufac),ylab="Manufacturer",xlab="Count",horiz=TRUE)

Plotting Fractions

barplot(table(manufac)/length(manufac),xlab="Manufacturer",ylab="Fraction")
barplot(table(manufac)/length(manufac),ylab="Manufacturer",xlab="Fraction",horiz=TRUE)

Histograms

• A histogram is a graph of the observed frequencies in a sample or population
• Histograms show the distribution of the data
• Reading a histogram:

[Figure: example histogram with counts from 0 to 12 on the vertical axis and bins from 170 to 260 on the horizontal axis. Annotation: there are 10 observations greater than 215 but less than or equal to 225.]

Histograms Depict the Empirical Distribution

• Histograms help answer:
  – Where is the mean of the data (roughly) located?
  – How variable is the data?
  – What is the overall shape of the data?
    • Is the distribution symmetric? Is it skewed? If so, in what direction?
  – Are there any unusual observations?
• In R: hist() function
  – Options:
    • breaks option allows user to vary number of bars
    • freq=TRUE (default) gives counts
    • freq=FALSE gives density histogram (area sums to one)

Frequency Histogram of Challenger Data

84 49 61 40 83 67 45 66 70 69 80 58
68 60 67 72 73 70 57 63 70 78 52 67
53 67 75 61 70 81 76 79 75 76 58 31

> challenger<-c(84,49,61,40,83,67,45,66,70,69,80,58,
    68,60,67,72,73,70,57,63,70,78,52,67,
    53,67,75,61,70,81,76,79,75,76,58,31)
> hist(challenger)

Density Histogram of Challenger Data

hist(challenger,freq=FALSE)

Dos and Don’ts for Histograms

• Do try alternate numbers of bars
  – Find best depiction of the shape (distribution) of data
  – Start with number of classes = $\sqrt{n}$ (i.e., breaks = $\sqrt{n} - 1$)
• Don’t use unequal bin widths; keep the bar widths all the same
• Don’t plot histograms by hand; use software

hist(challenger,breaks=2)
hist(challenger,breaks=5)
hist(challenger,breaks=9)
hist(challenger,breaks=25)
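
A sketch of the square-root starting point for the Challenger data follows. It uses plot = FALSE so that the bins can be inspected without drawing. Note that when breaks is a single number, hist() treats it as a suggestion and chooses "pretty" break points, so the number of bars actually drawn can differ from the request.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

n <- length(challenger)   # 36 observations
k <- sqrt(n)              # 6 classes as a starting point

# Compute the histogram without plotting, using the slide's breaks value
h <- hist(challenger, breaks = k - 1, plot = FALSE)
h$breaks   # bin boundaries R actually chose
h$counts   # observations per bin
```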

Extremes in Histograms

[Figures: histograms of the temperature data, with Temperature (F) on the horizontal axis and Frequency (count) on the vertical axis]

• One extreme: A single bar for all the data (30-89), but that just shows the total, with no information about the shape of the data
• Another extreme: One bar for each temperature, but that’s just a bar chart. It’s hard to see the shape
• $\sqrt{n}$ classes seems to be about right to show the distribution of the data

Differences Between Barplots and Histograms

• Barplots:
  – For categorical data
  – Often most easily read with bars plotted horizontally
  – Adjacent bars are separated from each other
• Histograms:
  – For continuous data
  – Convention to plot bars vertically (to look like a pdf)
  – Adjacent (nonzero) bars touch (since base of each bar denotes the “bin” for that bar)

Boxplots

• Boxplots show distribution in one dimension
  – Only useful for continuous variables
  – Good for comparing distributions of a continuous variable between categorical groups
  – Will not show multiple modes
• Illustration (of one variant):

[Figure: boxplot with the median, hinges, whiskers, and outliers labeled]

Exercise

• Given the following summary statistics for the Challenger data, (roughly) draw the boxplot over the “strip chart”

Exercise: Result from R

• Boxplot
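
The boxplot image is not reproduced in this transcript. As a sketch, the numbers behind it can be obtained from fivenum() and from boxplot() with plot = FALSE. This assumes R's default whisker rule of 1.5 times the hinge spread; drop plot = FALSE to draw the plot itself.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

fivenum(challenger)   # 31 59 67.5 75 84

# Boxplot statistics without drawing
bp <- boxplot(challenger, plot = FALSE)
bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # 31, plotted as an outlier beyond the lower whisker
```

With hinges at 59 and 75, the lower fence sits at 59 - 1.5 × 16 = 35, so the lower whisker stops at 40 and the observation at 31 is flagged.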

Histograms vs. Boxplots

• Histogram shows distribution of the data in two dimensions; the boxplot is in one dimension
  – Histogram shows frequency of observations within ranges
  – Boxplot only shows summary statistics

We’ll Use Software To Do Most Calculations and Plots…

• …generally R
• Benefits of R include:
  – It’s free
  – More importantly, it’s powerful, flexible, extensible, and cutting-edge
  – In terms of extensibility, there are now thousands of libraries (aka packages) available to do custom calculations, plots, etc.

Some R Paradigms

• Command line interface

• Object-oriented programming

• Types of objects, particularly data frames

• Vector-based calculations

Command Line Interface

• Command line allows scripting/programming, which gives flexibility and extensibility
  – Point and click paradigm limits user to what has been programmed into the interface
  – Trade-off is “user friendliness,” meaning command line users must learn the underlying language and syntax
• Good news: Once you gain a working familiarity, you have access to a very powerful computing tool

All the Std Graphics Plus…

Examples #1 to #5: Flexible Graphics

[Example plots not reproduced in this transcript]

Object-oriented Programming

• R is an object-oriented programming language
  – Wikipedia: “Object-oriented programming (OOP) is a programming paradigm that uses "objects" … to design applications and computer programs.”
• Everything in R is an object of some type
  – Each type of object has particular properties
  – Properties control what objects can and cannot do, as well as how other objects interact with them

Types of Objects

• Important types of objects in R:
  – Vector: a one-dimensional list of numbers
  – Matrix: a two-dimensional list of numbers
  – Array: a multi-dimensional list of numbers
  – Data frame (data.frame): a two-dimensional list that can contain any type of data (numeric, string, logical, etc.)
  – Function: small programs that usually take input as arguments and after running produce output
• The function class(obj) will tell you what type of object “obj” is
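
A brief sketch of class() applied to each object type in the list. The objects and their contents are invented for illustration.

```r
v <- c(1.5, 2, 3)                      # vector
m <- matrix(1:6, nrow = 2)             # matrix: 2 rows, 3 columns
a <- array(1:24, dim = c(2, 3, 4))     # three-dimensional array
df <- data.frame(id = 1:3,             # data frame mixing column types
                 name = c("a", "b", "c"),
                 ok = c(TRUE, FALSE, TRUE))

class(v)      # "numeric"
class(m)      # includes "matrix" (recent R versions also report "array")
class(a)      # "array"
class(df)     # "data.frame"
class(mean)   # "function"
```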

More on Data Frames

• Think of them like tables
  – Columns correspond to variables (and the data within a column must all be of the same type)
  – Rows correspond to observations

More on Functions

• Function calls always end with parentheses
  – If there are arguments, they go here
  – Some functions don’t have or need arguments
    • Example: ls()
  – The function’s code is printed when the parentheses are left off
• Can run functions of functions
  – Example: mean(seq(1:9))
• Lots of built-in functions and you can write your own
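
A short sketch of these points. The function sample_range is a hypothetical user-written example, not something from the course materials.

```r
ls()                # no arguments needed: lists objects in the workspace

mean(seq(1, 9))     # 5: seq() builds 1, 2, ..., 9 and mean() averages it

sd                  # no parentheses: R prints the function's code

# Writing your own function (hypothetical name)
sample_range <- function(x) {
  max(x) - min(x)   # largest observation minus the smallest
}
sample_range(c(21, 40, 33))   # 19
```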

Vector-based Calculations

• R is very efficient (i.e., fast) working with vectors, much less so with loops
• Key idea: In data frames, instead of writing code that operates on the rows of a data frame (i.e., observation by observation) you write code that operates on the variables (i.e., the columns, which are the variables!)
• Takes a while to get used to thinking in terms of vectors rather than individual observations

Simple Example

• Data frame with data on various types of travel for a set of individuals:
• Easy way to calc total days deployed in R:

Simple Example, continued

• Even fancier:
• The hard way:
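
The data frame and the R code from these two slides are not reproduced in this transcript. The sketch below shows the three approaches on a made-up data frame: the name travel, its column names, and all values are hypothetical and are not the course data.

```r
# Hypothetical data frame; names and values invented for illustration
travel <- data.frame(
  person        = c("A", "B", "C"),
  deployed_fy10 = c(120, 0, 45),
  deployed_fy11 = c(30, 90, 0)
)

# Easy way: add the columns, which are vectors, in one step
total_easy <- travel$deployed_fy10 + travel$deployed_fy11

# Fancier: let rowSums() add across the selected columns
total_fancy <- rowSums(travel[, c("deployed_fy10", "deployed_fy11")])

# The hard way: loop over the rows one observation at a time
total_loop <- numeric(nrow(travel))
for (i in seq_len(nrow(travel))) {
  total_loop[i] <- travel$deployed_fy10[i] + travel$deployed_fy11[i]
}

total_easy   # 150 90 45
```

All three give the same totals; the vectorized versions are shorter and scale better to many rows.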

What We Covered in this Module

• Defined types of data and types of variables
• Learned how to appropriately summarize data using descriptive statistics
  – Numerical descriptive statistics
    • Measures of location: mean, median, mode
    • Measures of spread: variance, standard deviation, range, inter-quartile range, etc.
  – Graphical descriptive statistics
    • Continuous variables: histogram, boxplot
    • Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R

Homework

• WM&S chapter 1
  – Required exercises 2, 9, 13, 17, 22, 25
  – Extra credit: 11
• Hints and instructions:
  – Do exercises 2, 13, and 25 in R as much as possible
    o The data sets are in Sakai in CSV format; read them in using the instructions from Lab #1
    o Exercise 2: Just construct a frequency histogram in R with the Mt. Washington observation left out
    o Exercises 13 and 25: The sort() function in R could be useful for counting the number that fall in each interval
  – Exercise 9: Use either Table 4 in WM&S or R to calculate. If you use R, the pnorm() function will be helpful
  – Exercise 17: Only do the approximation for Exercise 1.2