Module 2: Descriptive Statistics (and a bit about R)
faculty.nps.edu/rdfricke/OA3102/Descriptive...
Revision: 1-12 1 1 1
Module 2: Descriptive Statistics
(and a bit about R)
Statistics (OA3102)
Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Reading assignment:
WM&S chapter 1
Why Care About
Descriptive Statistics?
• Data sets continue to grow ever bigger
– The human mind cannot assimilate and make
sense of volumes of raw data
• Descriptive statistics are a useful form of data reduction
– Numeric summaries
– Graphical plots
• Good descriptive statistics help analysts and
decision makers understand what the raw
data means
Goals for this Module
• Define types of data and types of variables
• Learn how to appropriately summarize data
using descriptive statistics
– Numerical descriptive statistics
• Measures of location: mean, median, mode
• Measures of spread: variance, standard
deviation, range, inter-quartile range, etc.
– Graphical descriptive statistics
• Continuous variables: histogram, boxplot
• Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R
Variables
• A characteristic that is being studied in a statistical problem is called a variable
• Types of variables:
– Continuous: can be divided by any number and the result still makes sense
• Examples: flight time, failure rate, detection distance
– Categorical:
• Ordinal: ordered categories
– Examples: rank, magazine capacity, shirt size
• Nominal: unordered categories
– Examples: gender, service branch, ship type
Types of Data
[Diagram: Data splits into Qualitative and Quantitative; Quantitative splits into Discrete and Continuous. Diagram labels: (nominal), (ordinal), (continuous)]
Some Descriptive Statistics
• Numerical:
– Location: Mean, median, mode
– Spread: Standard deviation, variance, range, quantiles, IQR
– Correlation
• Graphical:
– Histograms, bar charts, dot charts, boxplots, scatter plots, etc.
• Good descriptive statistics lead to good decision making
Sample Mean ($\bar{x}$)
• Sample average or sample mean
– Sample consists of n observations, $x_1, \ldots, x_n$
– Often denoted by $\bar{x}$ (spoken “x-bar”)
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
• To calculate
– R: use mean() function
– Excel: =AVERAGE(cell reference)
Sample Median ($\tilde{x}$)
• The median is the halfway point in the ordered data
• Steps to calculate the median:
– Order the data from smallest to largest
– If the number of observations is odd, the middle observation is the median. E.g.,
1 3 5 6 12 12 99 (median = 6)
– If the number is even, then the average of the two middle observations is the median. E.g.,
1 3 5 6 12 12 (median = 5.5)
Using More Formal Notation…
• Let $x_{(i)}$ denote the ith order statistic from a sample $x_1, x_2, \ldots, x_n$
– E.g., for $x_1 = 5, x_2 = 12, x_3 = 2$, we have $x_{(1)} = 2, x_{(2)} = 5, x_{(3)} = 12$
• Then the sample median can be defined as
– n odd: $\tilde{x} = x_{((n+1)/2)}$
– n even: $\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$
– Equations apply to samples and populations
• To calculate
– R: use median() function
– Excel: =MEDIAN(cell reference)
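The order-statistic definition of the median can be sketched directly in R. This is only an illustration: sample_median is a hypothetical helper name (not part of base R), and in practice median() does the same job.

```r
# Sample median from the order statistics x_(1) <= ... <= x_(n).
# sample_median is a hypothetical helper, written only to mirror the definition.
sample_median <- function(x) {
  xs <- sort(x)                        # xs[i] is the ith order statistic
  n <- length(xs)
  if (n %% 2 == 1) {
    xs[(n + 1) / 2]                    # n odd: the middle observation
  } else {
    (xs[n / 2] + xs[n / 2 + 1]) / 2    # n even: average of the two middle observations
  }
}

sample_median(c(5, 12, 2))             # 5
sample_median(c(1, 3, 5, 6, 12, 12))   # 5.5
median(c(1, 3, 5, 6, 12, 12))          # built-in function gives the same value
```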
Mean vs. Median
• Both are measures of location or “central tendency”
– But the median is less affected by outliers
• Example:
– Imagine a sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 2
• Median=mean=1
– Another sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 83
• Median still equals 1, but mean=10!
• Which to use? Depends on whether you are:
– characterizing a “typical” observation (the median)
– or describing the average value (the mean)
Exercise
• Calculate “by hand” the mean and median for the
data: {6,1,3,7,3,6,7,4,8}
Exercise (continued)
• Now do the same for {6,1,3,7,3,6,7,4,8,100}
Now, in R:
• For {6,1,3,7,3,6,7,4,8}:
• For {6,1,3,7,3,6,7,4,8,100}:
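The R session shown on this slide was not captured in the transcript. A minimal sketch of what such a session might look like, using only mean() and median() (the variable names x1 and x2 are chosen here for illustration):

```r
# The two exercise data sets
x1 <- c(6, 1, 3, 7, 3, 6, 7, 4, 8)
x2 <- c(x1, 100)   # same data with one large observation appended

mean(x1)     # 5
median(x1)   # 6

mean(x2)     # 14.5
median(x2)   # 6
```

The single observation of 100 moves the mean from 5 to 14.5 while the median stays at 6.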
Common Measures of “Spread”
• Measures of location tell you where the “center” of
the data is
• Measures of spread tell you how variable the data is
around the center
• Typical measures of spread:
– Sample variance: essentially, the average squared deviation around the mean,
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
– Standard deviation: the square root of the variance,
$$s = \sqrt{s^2}$$
• The standard deviation is in the same units as the mean
Exercise
• Calculate “by hand” the sample variance and
standard deviation for the data: {1,2,3,4,5}
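One way to check the hand calculation is to follow the definition step by step in R and then compare against the built-in var() and sd(). This is a sketch; the intermediate names (xbar, dev2, s2, s) are chosen here for illustration.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

xbar <- mean(x)              # 3
dev2 <- (x - xbar)^2         # squared deviations: 4 1 0 1 4
s2   <- sum(dev2) / (n - 1)  # 10 / 4 = 2.5
s    <- sqrt(s2)             # about 1.58

var(x)   # built-in sample variance, same as s2
sd(x)    # built-in standard deviation, same as s
```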
Pictorially
[Figures: four slides of illustrations, not captured in this transcript]
Ignore Variability at Your Peril
• Often analyses only focus on the average
• But it’s possible to be right on average and be
way off in every case
– The average high temperature
in Washington DC in June is
83 degrees
• “Oh, how balmy!”
• No...it’s either 75°
or it’s 90+ degrees!
From Flaws and Fallacies in Statistical Thinking
by Stephen K. Campbell.
The Range (R)
• Range is another measure of spread
• In words, it is the largest observation in the sample minus the smallest observation
– Example: A sample of students’ ages in the class
• Data: 21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40
• Note that they are already ordered!
• R = 40 - 21 = 19
– Using previous notation: $R = x_{(n)} - x_{(1)}$
• In R: use the code diff(range())
– range() function gives $x_{(1)}$ and $x_{(n)}$
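As a quick check of the ages example, the range can be computed in R as sketched below (the vector name ages is chosen here for illustration):

```r
ages <- c(21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40)

range(ages)             # 21 40, i.e., the smallest and largest observations
diff(range(ages))       # 19
max(ages) - min(ages)   # same value, written out explicitly
```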
Other Measures of Spread:
Quantiles and Percentiles
• Percentiles
– For data, the pth percentile, $0 \le p \le 100$, is the value of x such that p% of the data is less than or equal to x
• Quantiles are the same as percentiles except for scale
– Percentiles are on a 0 to 100 scale
– Quantiles are on a 0 to 1 scale
– The pth quantile equals the (p × 100)th percentile
Special Percentiles and Quantiles
• Special percentiles:
– Minimum: 0th percentile (or 0 quantile)
– Median: 50th percentile (or 0.5 quantile)
– Maximum: 100th percentile (or 1.0 quantile)
• Quartiles: 25th and 75th percentiles
– Devore: “lower fourth” and “upper fourth”
• Interquartile Range (IQR):
IQR = 75th percentile - 25th percentile
– Devore calls the IQR the “fourth spread”
– In R: IQR()
Calculating Quantiles
• R function: quantile(data, probs)
– data is a numeric vector of data
– probs is a numeric vector of probabilities
• Default: 0, 0.25, 0.5, 0.75 and 1.0 quantiles
• In R, the pth quantile is (by default) $x_{(p(n-1)+1)}$
– If $p(n-1)+1$ is not an integer, interpolate between the two closest values
– E.g., for n = 9 and p = 0.25, $p(n-1)+1 = 3$, so the 0.25 quantile is $x_{(3)}$
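The example on this slide was not captured, so the following is a sketch of the quantile rule with illustrative data (the nine values from the five number summary exercise later in this module). It relies on the default settings of quantile() and IQR().

```r
x <- c(12, 2, 7, 5, 15, 4, 9, 18, 6)   # n = 9
sort(x)                                # 2 4 5 6 7 9 12 15 18

quantile(x)                  # default probs: 0, 0.25, 0.5, 0.75, 1
quantile(x, probs = 0.25)    # index 0.25 * (9 - 1) + 1 = 3, so x_(3) = 5

y <- c(x, 10)                # n = 10, sorted: 2 4 5 6 7 9 10 12 15 18
quantile(y, probs = 0.25)    # index 3.25: 5 + 0.25 * (6 - 5) = 5.25

IQR(x)                       # 75th minus 25th percentile: 12 - 5 = 7
```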
Hinges
• Hinges are an alternative to quartiles
– They’re the $x_{(j)}$ and $x_{(n-j+1)}$ order statistics, for $j = \frac{\lfloor (n+1)/2 \rfloor + 1}{2}$, where if j is not an integer, interpolate
• Easier way to compute:
– If n is even, they’re the median values of the upper and lower halves of the sorted data
– If n is odd, they’re the median values of the upper and lower halves of the sorted data, where each half includes the median data point
Exercise
• “By hand,” calculate the five number summary for
{12,2,7,5,15,4,9,18,6}
– The five number summary is the minimum, lower hinge, median, upper hinge, maximum
Exercise (continued)
• “By hand,” calculate the five number summary for
{12,2,7,5,15,4,9,18,6,10}
Results in R
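The R output on this slide was not captured in the transcript. A sketch of how the two exercises could be checked, assuming the base R function fivenum() (which returns the minimum, lower hinge, median, upper hinge, and maximum); the vector names a and b are chosen here for illustration:

```r
a <- c(12, 2, 7, 5, 15, 4, 9, 18, 6)   # n = 9
b <- c(a, 10)                          # n = 10

fivenum(a)   # 2 5 7 12 18
fivenum(b)   # 2 5 8 12 18

# Hinges need not equal the quartiles from quantile():
quantile(b, probs = c(0.25, 0.75))   # 5.25 and 11.5, versus hinges 5 and 12
```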
The Empirical Rule
[Figure: standard normal density curve, Z from -4 to 4, with the 68%, 95%, and 99.7% regions marked]
• If the distribution of measurements is approximately normal, then:
– 68% of the data is within μ ± 1σ
– 95% within μ ± 2σ
– 99.7% (“almost all”) within μ ± 3σ
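The three percentages in the empirical rule can be recovered from the standard normal distribution function. A short sketch using pnorm() (the names k and coverage are chosen here for illustration):

```r
# Probability that a normal observation falls within k standard deviations
# of the mean, for k = 1, 2, 3
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 3)   # 0.683 0.954 0.997
```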
Remember Notation Conventions
• Summation:
– Σ notation and subscripts
• Size:
– n denotes size of sample
– N denotes size of population
• Knowns vs. unknowns:
– Small letters (e.g., “x”) mean quantity is known
– Capital letters (e.g., “X”) mean quantity is unknown
(i.e., it’s a random variable)
Graphically Depicting Data
• Many different types of plots and charts
• Whatever you do, don’t fall into the trap of just using Excel plots because they’re easy
– R is much more powerful and flexible
– Excel does not do some important/useful plot types
[Figure: example plot with a count axis from 5 to 15 and a horizontal axis from 80 to 125 (thousands)]
A Classic Good Graphic
Some Types of Graphical and
Tabular Summaries of Data
• Univariate discrete data: tables, barplots, dot
charts, pie charts
• Univariate continuous data: stem-and-leaf
plots, strip charts, histograms, boxplots
• Bivariate discrete data: two-way contingency
tables
• Bivariate continuous data: scatterplots, QQ
plots
Tabular Summaries of Data
• Categorical data: counts and/or percentages
by category
• Continuous data: counts and/or percentages
within “bins”
– Bins: sequential intervals over the range of data
• Generally intervals are of equal width
• Must decide how to count a data point that falls on the boundary between two bins
– Either count them all in the left bins, or all in the right bins
– Doesn’t matter which, just be consistent
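The boundary rule can be illustrated in R with cut() and table(). This is a sketch, using the temperature data that appears later in this module; the bin edges (30, 40, ..., 90) are an assumption made here for illustration.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

bins <- seq(30, 90, by = 10)   # assumed bin edges of equal width

# Boundary points counted in the left bin, e.g., 70 falls in (60,70]
left_counts <- table(cut(challenger, breaks = bins, right = TRUE))
left_counts    # 2 2 6 14 9 3

# Boundary points counted in the right bin, e.g., 70 falls in [70,80)
right_counts <- table(cut(challenger, breaks = bins, right = FALSE))
right_counts   # 1 3 5 11 12 4
```

Both tables account for all 36 observations; only the treatment of values such as 40, 60, 70, and 80 differs.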
Example: Tabular Summary
of Univariate Categorical Data
Manufacturer       Frequency   Relative Frequency (fraction)
Honda              41          0.34
Yamaha             27          0.23
Kawasaki           20          0.17
Harley-Davidson    18          0.15
BMW                3           0.03
Other              11          0.08
Total              120         1.00
• In R, use the table() function
• For the example:
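The R code for this example was not captured in the transcript. A sketch of how it might look follows; the raw data vector is rebuilt here from the frequencies in the table, which is an assumption about how manufac was stored.

```r
# Rebuild a vector with one entry per motorcycle from the table's counts
manufac <- rep(c("Honda", "Yamaha", "Kawasaki", "Harley-Davidson", "BMW", "Other"),
               times = c(41, 27, 20, 18, 3, 11))

counts <- table(manufac)   # frequency by category (sorted alphabetically)
counts

fractions <- counts / length(manufac)   # relative frequencies
fractions
```

Note that 11/120 is about 0.092, so R's fraction for "Other" will differ slightly from the 0.08 shown in the table.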
Barplots
• Barplots are also known as bar charts and bar graphs
• Plot one bar for each category
– Bars show counts or percentage of observations in
each category
• Can plot bars vertically or horizontally
• In R: barplot()
– Option horiz=TRUE plots bars horizontally
(default is FALSE)
In R
barplot(table(manufac),xlab="Manufacturer",ylab="Count")
barplot(table(manufac),ylab="Manufacturer",xlab="Count",horiz=TRUE)
Plotting Fractions
barplot(table(manufac)/length(manufac),
xlab="Manufacturer",ylab="Fraction")
barplot(table(manufac)/length(manufac),
ylab="Manufacturer",xlab="Fraction",horiz=TRUE)
Histograms
• A histogram is a graph of the observed
frequencies in a sample or population
• Histograms show the distribution of the data
• Reading a histogram:
[Figure: histogram with counts from 0 to 12 on the vertical axis and values from 170 to 260 on the horizontal axis. Callout: there are 10 observations greater than 215 but less than or equal to 225]
Histograms Depict
the Empirical Distribution
• Histograms help answer:
– Where is the mean of the data (roughly) located?
– How variable is the data?
– What is the overall shape of the data?
• Is the distribution symmetric? Is it skewed? If so, in what direction?
– Are there any unusual observations?
• In R: hist() function
– Options:
• breaks option allows user to vary number of bars
• freq=TRUE (default) gives counts
• freq=FALSE gives density histogram (area sums to one)
Frequency Histogram
of Challenger Data
84 49 61 40 83 67 45 66 70 69 80 58
68 60 67 72 73 70 57 63 70 78 52 67
53 67 75 61 70 81 76 79 75 76 58 31
> challenger<-c(84,49,61,40,
83,67,45,66,70,69,80,
58,68,60,67,72,73,70,
57,63,70,78,52,67,53,
67,75,61,70,81,76,79,
75,76,58,31)
> hist(challenger)
Density Histogram
of Challenger Data
hist(challenger,freq=FALSE)
Dos and Don’ts for Histograms
• Do try alternate numbers of bars
– Find best depiction of the shape (distribution) of data
– Start with number of classes = $\sqrt{n}$ (i.e., breaks = $\sqrt{n} - 1$)
• Don’t use unequal bin widths: keep the bar widths all the same
• Don’t plot histograms by hand: use software
hist(challenger,breaks=2)
hist(challenger,breaks=5)
hist(challenger,breaks=9)
hist(challenger,breaks=25)
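The square-root starting point can be computed rather than typed in by hand. The sketch below assumes the Challenger data from the earlier slide; note that when breaks is a single number, hist() treats it as a suggestion and picks "pretty" breakpoints, so the number of bars drawn may not match exactly.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

n <- length(challenger)   # 36 observations
k <- sqrt(n)              # 6 classes as a starting point

# Starting point from the slide: breaks = sqrt(n) - 1
h <- hist(challenger, breaks = k - 1)

length(h$counts)   # number of bars R actually drew
h$breaks           # the breakpoints R chose
```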
Extremes in Histograms
[Figure: histograms of the temperature data; horizontal axis: Temperature (F), vertical axis: Frequency (count)]
• One extreme: A single bar for all the data (30-89), but that just shows the total, with no information about the shape of the data
• Another extreme: One bar for each temperature, but that’s just a bar chart. It’s hard to see the shape
• $\sqrt{n}$ classes seems to be about right to show the distribution of the data
Differences Between
Barplots and Histograms
• Barplots:
– For categorical data
– Often most easily read with bars plotted horizontally
– Adjacent bars are separated from each other
• Histograms:
– For continuous data
– Convention to plot bars vertically (to look like a pdf)
– Adjacent (nonzero) bars touch (since base of each
bar denotes the “bin” for that bar)
Boxplots
• Boxplots show distribution in one dimension
– Only useful for continuous variables
– Good for comparing distributions of a continuous
variable between categorical groups
– Will not show multiple modes
• Illustration (of one variant):
[Figure: boxplot with the median, hinges, whiskers, and outliers labeled]
Exercise
• Given the following summary statistics for the Challenger data, (roughly) draw the boxplot over the “strip chart”
Exercise: Result from R
• Boxplot
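The summary statistics and plot for this exercise were not captured in the transcript. A sketch of how they could be reproduced follows, assuming the Challenger data from the histogram slides; the plotting options (horizontal layout, axis label) are choices made here for illustration.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

fivenum(challenger)   # min, lower hinge, median, upper hinge, max: 31 59 67.5 75 84
summary(challenger)   # quartile-based summary, plus the mean

# Strip chart of the raw data, then the boxplot of the same data
stripchart(challenger, method = "stack", xlab = "Temperature (F)")
boxplot(challenger, horizontal = TRUE, xlab = "Temperature (F)")
```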
Histograms vs. Boxplots
• Histogram shows distribution of the data in two dimensions; the boxplot is in one dimension
– Histogram shows frequency of observations within ranges
– Boxplot only shows summary statistics
We’ll Use Software To Do Most
Calculations and Plots…
• …generally R
• Benefits of R include:
– It’s free
– More importantly, it’s powerful, flexible, extensible,
and cutting-edge
– In terms of extensibility, there are now thousands of
libraries (aka packages) available to do custom
calculations, plots, etc.
Some R Paradigms
• Command line interface
• Object-oriented programming
• Types of objects, particularly data frames
• Vector-based calculations
Command Line Interface
• Command line allows scripting/programming,
which gives flexibility and extensibility
– Point and click paradigm limits user to what has
been programmed into the interface
– Trade-off is “user friendliness,” meaning command
line users must learn the underlying language and
syntax
• Good news: Once you gain a working familiarity, you have access to a very powerful computing tool
All the Standard Graphics Plus…
Examples #1 to #5: Flexible Graphics
[Figures: five example plots, not captured in this transcript]
Object-oriented Programming
• R is an object-oriented programming language
– Wikipedia: “Object-oriented programming (OOP) is a programming paradigm that uses ‘objects’ … to design applications and computer programs.”
• Everything in R is an object of some type
– Each type of object has particular properties
– Properties control what objects can and cannot
do, as well as how other objects interact with them
Types of Objects
• Important types of objects in R:
– Vector: a one-dimensional list of numbers
– Matrix: a two-dimensional list of numbers
– Array: a multi-dimensional list of numbers
– Data.frame: a two-dimensional list that can contain
any type of data (numeric, string, logical, etc)
– Function: small programs that usually take input
as arguments and after running produce output
• The function class(obj) will tell you what
type of object “obj” is
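A short sketch of class() applied to several of these object types. The object names (v, m, df) and their contents are made up here for illustration.

```r
v  <- c(1.5, 2, 3)                                   # vector
m  <- matrix(1:6, nrow = 2)                          # matrix with 2 rows, 3 columns
df <- data.frame(id = 1:3, name = c("a", "b", "c"))  # data frame with mixed column types

class(v)      # "numeric"
class(m)      # includes "matrix"
class(df)     # "data.frame"
class(mean)   # "function"
```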
More on Data Frames
• Think of them like tables
– Columns correspond to variables (and data in
columns must all be of the same type)
– Rows correspond to observations
More on Functions
• Function calls always end with parentheses
– If there are arguments, they go here
– Some functions don’t have or need arguments
• Example: ls()
– The function’s code is printed when the parentheses are left off
• Can run functions of functions
– Example: mean(seq(1:9))
• Lots of built-in functions and you can write
your own
Vector-based Calculations
• R is very efficient (i.e., fast) working with vectors, much less so with loops
• Key idea: In data frames, instead of writing code that operates on the rows of a data frame (i.e., observation by observation) you write code that operates on the columns (i.e., the variables)
• Takes a while to get used to thinking in terms
of vectors rather than individual observations
Simple Example
• Data frame with data on various types of travel for a set of individuals:
• Easy way to calculate total days deployed in R:
Simple Example, continued
• Even fancier:
• The hard way:
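The data frame and code from these two slides were not captured in the transcript. The sketch below uses a hypothetical data frame (the name travel, its column names, and all values are made up for illustration) to contrast the column-based and row-by-row approaches; it is not the code from the original slides.

```r
# Hypothetical data: names and values are invented for this illustration
travel <- data.frame(
  name          = c("A", "B", "C", "D"),
  days_deployed = c(120, 0, 45, 200),
  days_tdy      = c(10, 30, 5, 0),
  days_leave    = c(14, 21, 30, 7)
)

# Easy way: operate on the whole column at once
easy_total <- sum(travel$days_deployed)
easy_total                # 365

# Fancier: totals for every numeric column in one call
col_totals <- colSums(travel[, -1])
col_totals

# The hard way: loop over the rows, one observation at a time
total <- 0
for (i in seq_len(nrow(travel))) {
  total <- total + travel$days_deployed[i]
}
total                     # same answer as the easy way
```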
What We Covered in this Module
• Defined types of data and types of variables
• Learned how to appropriately summarize
data using descriptive statistics
– Numerical descriptive statistics
• Measures of location: mean, median, mode
• Measures of spread: variance, standard
deviation, range, inter-quartile range, etc.
– Graphical descriptive statistics
• Continuous variables: histogram, boxplot
• Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R
Homework
• WM&S chapter 1
– Required exercises 2, 9, 13, 17, 22, 25
– Extra credit: 11
• Hints and instructions:
– Do exercises 2, 13, and 25 in R as much as possible
o The data sets are in Sakai in CSV format; read them in using the instructions from Lab #1
o Exercise 2: Just construct a frequency histogram in R with the Mt. Washington observation left out
o Exercises 13 and 25: The sort() function in R could be useful for counting the number that fall in each interval
– Exercise 9: Use either Table 4 in WM&S or R to calculate. If you use R, the pnorm() function will be helpful
– Exercise 17: Only do the approximation for Exercise 1.2