Post on 28-Nov-2014
Module 2 - Exploratory Data Analysis (EDA)
Central Tendency and Variability
Text: Field, A. (2009), 2nd edition
– Chapter 1: 1.7
– Chapter 2: 2.1 – 2.5
– Chapter 4: 4.1 – 4.9
Describing a Population/Sample
• Statistics is the study of data which has some element of random variation - random variable.
• This variation in the variable under study can be conceptualised as a frequency or probability distribution.
• An example - Distribution of a normal random variable (x)
• The properties of this distribution can be described in several ways - Central tendency, Position, Variability
Describing a Population/Sample
• Central Tendency or “Average”
– Mode
– Median
– Mean
• Position
– Quantiles
– Quartiles
– Percentiles
• Variability or Dispersion
– Range, Interquartile Range (IQR)
– Variance, Standard Deviation
– Standard Error of the Sample Mean
[Figure: histogram of height. x-axis: height (16 to 32); y-axis: Frequency (0 to 15). Mean = 23.03, Std. Dev. = 2.7412, N = 50]
Working With an Example
Note that for the following definitions, we will be working with the following data set (n=23) of individual weights (kg)
73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5
Central Tendency - Mode
• The mode is the most common value
• It has the highest frequency in the dataset
• You can see that the example dataset has two modes:
65.5kg and 73kg both have a frequency of 2
• This dataset is bimodal
Value   Frequency
39      1
52.5    1
61      1
61.5    1
65.5    2
68.5    1
69.5    1
71.5    1
73      2
74.5    1
75.6    1
76      1
78.5    1
80      1
80.5    1
83      1
86.5    1
87      1
93      1
98      1
101     1
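The frequency count above can be reproduced with a short sketch. This is only an illustration in Python (the course itself uses SPSS); the variable names are chosen here for the example.

```python
from collections import Counter

# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# Count how often each value occurs
counts = Counter(weights)

# The mode(s) are the value(s) with the highest frequency
highest = max(counts.values())
modes = sorted(value for value, freq in counts.items() if freq == highest)

print("Highest frequency:", highest)
print("Mode(s):", modes)
```

Collecting every value that shares the highest frequency, rather than stopping at the first, is what lets the sketch report a bimodal dataset.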
Central Tendency - Median
• The median is the middle value in an ordered list of n numbers
• 50% of the data lie on either side of this value
• It is also represented as Q2 (2nd Quartile)
• The position of Q2 can be calculated by using the following:
$\frac{(n + 1)}{2}$
Calculating the Median
In our example the dataset contains 23 numbers:
$\frac{(n + 1)}{2} = \frac{(23 + 1)}{2} = 12$, i.e. the 12th number

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83
19      86.5
20      87
21      93
22      98
23      101
Therefore the 12th number in the ascending data set will be the median (Q2 = 74.5kg)
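As a rough check of the position rule, the median can be located in Python as sketched below. The handling of an even n (averaging the two middle values) is a common convention assumed here; the slides only cover the odd case.

```python
# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

ordered = sorted(weights)      # ascending order
n = len(ordered)
position = (n + 1) / 2         # 1-based position of Q2

if n % 2 == 1:
    # Odd n: the position is a whole number, so read the value directly
    median = ordered[int(position) - 1]
else:
    # Even n (assumed convention): average the two middle values
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print("Position:", position)
print("Median (Q2):", median, "kg")
```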
Central Tendency - Mean
• Sample mean
– Represented by $\bar{x}$
• Population mean
– Represented by $\mu$
• Note that $\sum_{i=1}^{n}$ means
– sum all values from 1 to n
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Calculating the Mean
• The summation of all of our data values: $\sum_{i=1}^{23} x_i = 1714.1$ kg.
• Divided by the number of values (n = 23)
• So the mean is
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1714.1}{23} = 74.5$ kg
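The same arithmetic can be sketched in Python as a quick check. The rounding to one decimal place mirrors the slide and is a presentation choice.

```python
# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

total = sum(weights)   # sum of x_i for i = 1..n
n = len(weights)       # number of values
mean = total / n       # sample mean

print(f"Sum = {total:.1f} kg, n = {n}, mean = {mean:.1f} kg")
```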
Position
• Quantiles
– General name for measures of position that divide the distribution (or ranked data) into equal groups. For example, quarters, tenths, hundredths, etc.
• Quartiles
– Measures of position that divide the distribution (or ranked data) into quarters.
• Percentiles
– Measures of position that divide the distribution (or ranked data) into 100 equal subsets.
Central Tendency vs. Variability
• The mean, median, and mode all tell us about the central tendency of a distribution.
• They cannot tell us about the spread of the distribution (variability).
Variability - Range
• The Range of the distribution of data is given by the difference between the maximum value and the minimum value
Range = Max - Min
• A measurement of variability that usually accompanies the Median.
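For the weights data set used in the worked example, the range can be sketched in Python as follows (an illustration only; the names are chosen for this example).

```python
# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# Range = Max - Min
data_range = max(weights) - min(weights)

print("Max:", max(weights), "Min:", min(weights), "Range:", data_range, "kg")
```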
Variability - Interquartile Range
• Quartiles are the three points (Q1, Q2, Q3) in the distribution defining four equal quarters.
• The quartiles cut the data distribution into four sections each containing 25% of the data.
[Figure: distribution divided at Q1, Q2 and Q3 into four sections, each containing 25% of the data]
Variability - Interquartile Range
• The Interquartile Range (IQR) is represented by the difference between the lower quartile (Q1) and the upper quartile (Q3)
• These quartile positions can be calculated via
$\frac{(n + 1)}{4}$ for Q1
$\frac{3(n + 1)}{4}$ for Q3
• The IQR can then be calculated using the value at these positions
• A measurement of variability that usually accompanies the Median.
Calculating the Interquartile Range
For Q1: $\frac{(n + 1)}{4} = \frac{(23 + 1)}{4} = 6$, i.e. the 6th number
For Q3: $\frac{3(n + 1)}{4} = \frac{3(23 + 1)}{4} = 18$, i.e. the 18th number

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5   (Q1)
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5   (Q2 or Median)
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83     (Q3)
19      86.5
20      87
21      93
22      98
23      101

Q1 is therefore 65.5kg.
Q3 is therefore 83.0kg.
Calculating the Interquartile Range
Q1 = 65.5kg
Q3 = 83.0kg
IQR = Q3 - Q1 = 83.0 - 65.5 = 17.5 kg
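The quartile positions and the IQR can be sketched in Python as below. The helper `value_at_position` is a hypothetical name introduced for this example. Its linear interpolation for fractional positions is an assumption: the slides only cover the case where the positions are whole numbers, and statistical packages differ in how they treat the fractional case.

```python
# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]


def value_at_position(ordered, position):
    """Return the value at a 1-based position in ascending data.

    Whole-number positions are read directly. Fractional positions are
    interpolated linearly between neighbours (an assumed convention).
    """
    lower_index = int(position)
    fraction = position - lower_index
    lower_value = ordered[lower_index - 1]
    if fraction == 0:
        return lower_value
    upper_value = ordered[lower_index]
    return lower_value + fraction * (upper_value - lower_value)


ordered = sorted(weights)
n = len(ordered)

q1_position = (n + 1) / 4        # 6th number when n = 23
q3_position = 3 * (n + 1) / 4    # 18th number when n = 23

q1 = value_at_position(ordered, q1_position)
q3 = value_at_position(ordered, q3_position)
iqr = q3 - q1

print(f"Q1 = {q1} kg, Q3 = {q3} kg, IQR = {iqr} kg")
```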
Variability Around the Mean
[Figure: sample data points plotted around a horizontal line marking the Mean (y-axis 30 to 80)]
Variation around the mean can be described as the difference (or distance) between the data point and the mean:
$x - \bar{x}$
Variability Around the Mean
We cannot simply add up the differences between each number and the mean, because the sum of these differences will be zero: the positive differences cancel out the negative differences.

Number   Mean       Number - Mean
73       74.52609   -1.526086957
93       74.52609   18.47391304
68.5     74.52609   -6.026086957
101      74.52609   26.47391304
65.5     74.52609   -9.026086957
78.5     74.52609   3.973913043
83       74.52609   8.473913043
80       74.52609   5.473913043
80.5     74.52609   5.973913043
87       74.52609   12.47391304
73       74.52609   -1.526086957
75.6     74.52609   1.073913043
61       74.52609   -13.52608696
86.5     74.52609   11.97391304
61.5     74.52609   -13.02608696
65.5     74.52609   -9.026086957
39       74.52609   -35.52608696
98       74.52609   23.47391304
69.5     74.52609   -5.026086957
52.5     74.52609   -22.02608696
71.5     74.52609   -3.026086957
76       74.52609   1.473913043
74.5     74.52609   -0.026086957
Total               0
• If we square the differences then every squared value is positive (or zero), so they no longer cancel out
– the total of the squared differences is known as the sum of squares (SS)
– this can be represented by the following equation
$SS = \sum (x - \bar{x})^2$
– Where;
$\bar{x}$ represents the mean
$x$ represents each individual number

Number   Mean       Number - Mean   Difference Squared
73       74.52609   -1.526086957    2.328941
93       74.52609   18.47391304     341.2855
68.5     74.52609   -6.026086957    36.31372
101      74.52609   26.47391304     700.8681
65.5     74.52609   -9.026086957    81.47025
78.5     74.52609   3.973913043     15.79198
83       74.52609   8.473913043     71.8072
80       74.52609   5.473913043     29.96372
80.5     74.52609   5.973913043     35.68764
87       74.52609   12.47391304     155.5985
73       74.52609   -1.526086957    2.328941
75.6     74.52609   1.073913043     1.153289
61       74.52609   -13.52608696    182.955
86.5     74.52609   11.97391304     143.3746
61.5     74.52609   -13.02608696    169.6789
65.5     74.52609   -9.026086957    81.47025
39       74.52609   -35.52608696    1262.103
98       74.52609   23.47391304     551.0246
69.5     74.52609   -5.026086957    25.26155
52.5     74.52609   -22.02608696    485.1485
71.5     74.52609   -3.026086957    9.157202
76       74.52609   1.473913043     2.17242
74.5     74.52609   -0.026086957    0.000681
Total               0               4386.944
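The two totals in the table can be checked with a small Python sketch. Floating-point arithmetic means the sum of the raw differences comes out as a value extremely close to zero rather than exactly zero.

```python
# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

mean = sum(weights) / len(weights)

# Difference between each number and the mean
deviations = [x - mean for x in weights]

# Positive and negative differences cancel, so this is (numerically) zero
sum_of_deviations = sum(deviations)

# Squaring removes the sign, giving the sum of squares (SS)
sum_of_squares = sum(d ** 2 for d in deviations)

print(f"Sum of differences = {sum_of_deviations:.6f}")
print(f"Sum of squares (SS) = {sum_of_squares:.3f}")
```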
Variability Around the Mean
• Although useful in some calculations, the sum of squares does not take into account the number of observations (is dependent on sample size).
• There are some important ways that the spread of the data around the mean can be represented (based on sum of squares).
– The Variance ($s^2$).
– The Standard Deviation ($s$).
– The Standard Error of the Sample Mean (S.E. or $s_{\bar{x}}$).
Variability - Sample Variance
• The Variance uses the Sum of Squares adjusted for the number of “independent” observations in the sample: an “average” variation
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• We can use the Sum of Squares calculated in the previous slide:
$s^2 = \frac{4386.944}{23 - 1} = 199.4$ kg$^2$
Notice that we are in squared units
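A minimal Python sketch of the same calculation follows. It computes the variance by hand from the sum of squares and compares it with `statistics.variance` from the standard library, which also divides by n - 1.

```python
import math
import statistics

# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)
mean = sum(weights) / n
sum_of_squares = sum((x - mean) ** 2 for x in weights)

# Sample variance: sum of squares divided by (n - 1)
variance = sum_of_squares / (n - 1)

# Cross-check against the standard library's sample variance
library_variance = statistics.variance(weights)

print(f"s^2 = {variance:.1f} kg^2")
```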
Variability - Sample Standard Deviation
• The sample’s Standard Deviation is the square root of the Variance:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{199.4} = 14.12$ kg
Notice that we are now back in our original units
The Standard Error of the Sample Mean
• The Std. Dev. divided by the square root of n is called the Standard Error of the sample mean - we will encounter this measure later on in the course.
$s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{199.4}{23}} = \frac{14.12}{\sqrt{23}} = 2.94$ kg
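The standard deviation and the standard error for the weights example can be sketched together in Python. `statistics.stdev` is the standard library's sample standard deviation (n - 1 in the denominator); the variable names are chosen for this example.

```python
import math
import statistics

# The 23 individual weights (kg) from the worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)

# Sample standard deviation: square root of the sample variance
std_dev = statistics.stdev(weights)

# Standard error of the sample mean: s divided by the square root of n
std_error = std_dev / math.sqrt(n)

print(f"s = {std_dev:.2f} kg")
print(f"S.E. = {std_error:.2f} kg")
```

Both results are back in the original units (kg), unlike the variance.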
Sample VS Population
Sample                          Population
$\bar{x}$ = sample mean         $\mu$ = population mean
$s$ = sample std dev            $\sigma$ = population std dev
$s^2$ = sample variance         $\sigma^2$ = population var.
$n$ = sample size               $N$ = population size

Sample Only
Standard error of the sample mean (S.E.): $s_{\bar{x}}$
Module 2 - Exploratory Data Analysis (EDA)
Graphical Methods
Text: Field, A. (2009), 2nd edition
– Chapter 1: 1.7
– Chapter 2: 2.1 – 2.5
– Chapter 4: 4.1 – 4.9
Graphical Methods & SPSS
• Graphical methods are a good way of summarising information and are useful to visualise patterns within your data.
• Various methods can be used depending on the measurement scale of the variables.
• SPSS is the statistical package that you will be using this semester and has a similar spreadsheet format to Microsoft Excel.
• Generally, when entering data into SPSS, each column contains a different variable.
Graphs for Discrete Variables
• Measurement scale - nominal or ordinal
– Other terms - categorical, binned, class, qualitative
– Examples - gender, age group, trap type
• Common graphical methods are:
– Pie charts for proportions, percentages, or values that sum to a fixed value
– Bar charts for most other discrete variables
• Data can be entered into SPSS in two forms
– Each case (row) represents a single observation
– Each case (row) represents the count, percentage, or proportion of each level of the discrete variable
Data Entry for Discrete Variables
Data entry type 1: charts can be created directly from this type of data
Data entry type 2: first tell SPSS that each discrete level has been counted (weight the cases by the count variable)
An Example - Mass (%) of Each Element Within a Star
• The data is entered into SPSS as in data entry type 2
• You must then tell SPSS to weight each observation (case) by the variable “mass”
• You will need to do this for a pie chart and for a bar graph
Making a Pie Chart in SPSS
The Pie Chart
[Figure: pie chart, cases weighted by MASS. Slices: Hydrogen, Helium, Other]
Making a Bar Chart in SPSS
Simple Bar Chart
[Figure: simple bar chart, cases weighted by MASS. x-axis: Element (Other, Helium, Hydrogen); y-axis: Count (0 to 80). One variable with three categories]
Clustered Bar Chart
[Figure: clustered bar chart, cases weighted by freq. x-axis: Smoking Status (Smoker, Non Smoker); y-axis: Count (0 to 500); clusters: Cancer Status (Cancer, No Cancer). Two variables with two categories each]
Graphs for Continuous Variables
• Measurement scale - Scale
– Other terms - quantitative
– Examples - Length, Temperature, Species Richness
• Common graphical methods are:
– For a single sample - Histograms, Box and Whisker plots, Error Bar plots, Q-Q plots.
– For 2 or more samples - Clustered Box and Whisker plots, Clustered Error Bar plots.
– For 2 scale variables - Scatter plots.
An Example - Plant Heights
We will be using the following data set of plant heights (cm) to construct a histogram.
21    24.5  20    23.5  24.5
20    26    21    24    25
21.5  23.5  21    20    28
23    24.5  22.5  21    28
21    25    21.5  22    26
21.5  26.5  22.5  21.5  25
24    21.5  23    16.5  29
25.5  23    25    19    31
20.5  22.5  23    19    21.5
24    23.5  23    19.5  22.5
Histogram
To create a histogram by hand, we need to create a series of “bins” or categories.
– The data ranges from 16.5 to 31.0.
– We can use the following groups to classify the data.
You can see that the ‘bins’ have been organised so that each datum belongs to a unique group
Bin Tally Frequency
16 – 17.9
18 – 19.9
20 – 21.9
22 – 23.9
24 – 25.9
26 – 27.9
28 – 29.9
30 – 31.9
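The tallying step can be sketched in Python as below, assuming the eight bins above each start at an even number and are 2 cm wide (16 up to but not including 18, and so on). The names `bin_start`, `bin_width` and `frequencies` are chosen for this example.

```python
# The 50 plant heights (cm) from the example
heights = [21, 24.5, 20, 23.5, 24.5, 20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28, 23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26, 21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29, 25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5, 24, 23.5, 23, 19.5, 22.5]

bin_start = 16   # lower edge of the first bin
bin_width = 2    # each bin spans 2 cm, e.g. 16 – 17.9
n_bins = 8       # last bin is 30 – 31.9

# Each datum falls in exactly one bin
frequencies = [0] * n_bins
for h in heights:
    index = int((h - bin_start) // bin_width)
    frequencies[index] += 1

# Print a text tally: one bar per observation
for i, freq in enumerate(frequencies):
    low = bin_start + i * bin_width
    print(f"{low} – {low + 1.9:.1f}  {'|' * freq}  {freq}")
```

The bar heights of the histogram are simply these frequencies.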
Histogram: Here’s One We Prepared Earlier
[Figure: Histogram of Plant height. x-axis: Height Categories (or Bins), from 16 – 17.9 to 30 – 31.9; y-axis: Frequency (0 to 16)]
Histogram Using SPSS
• SPSS will create the bins, work out frequencies and create the histogram for you
• The data needs to be entered in a single column
Histogram Using SPSS
[Figure: SPSS histogram, single sample, variable height (8 bins). x-axis: height (16 to 32); y-axis: Frequency (0 to 15). Mean = 23.03, Std. Dev. = 2.7412, N = 50]
Histogram Using SPSS
[Figure: SPSS histogram, single sample, variable height (16 bins). x-axis: height (16 to 32); y-axis: Frequency (0 to 12). Mean = 23.03, Std. Dev. = 2.7412, N = 50]
Q-Q Plot
• For a single sample
• Plots the quantiles of a variable's distribution (observed - unknown distribution) against the quantiles of a test distribution (expected - e.g. Normal Dist.).
• The test distribution (expected values) has the same mean and standard deviation as the observed data.
• Available test distributions include Beta, Chi-square, Exponential, Gamma, Logistic, Lognormal, Normal, Student’s t, and Uniform.
Q-Q Plot
• Probability plots are generally used to determine whether the distribution of a variable (observed - unknown distribution) matches a given distribution (expected - e.g. Normal Dist.).
• If the selected variable matches the test distribution, the points line up on a 45° line (observed = expected).
• Note, if using a sample from a population the sample size needs to be reasonably large.
• An alternative is the P-P plot (percentile plot)
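To make the idea concrete, the pairs of points behind a normal Q-Q plot of the plant heights can be sketched in Python. The plotting positions (i - 0.5) / n used here are an assumption made for illustration; SPSS offers several proportion-estimation formulas, so its expected values may differ slightly.

```python
from statistics import NormalDist, mean, stdev

# The 50 plant heights (cm) from the example
heights = [21, 24.5, 20, 23.5, 24.5, 20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28, 23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26, 21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29, 25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5, 24, 23.5, 23, 19.5, 22.5]

# Observed quantiles: the data in ascending order
observed = sorted(heights)
n = len(observed)

# Test distribution: Normal with the same mean and std. dev. as the data
reference = NormalDist(mu=mean(observed), sigma=stdev(observed))

# Expected quantiles at assumed plotting positions (i - 0.5) / n
expected = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Points near the line observed = expected suggest a good match
for obs, exp in list(zip(observed, expected))[:5]:
    print(f"observed = {obs:5.1f}   expected = {exp:5.2f}")
```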
Q-Q Plot
[Figure: Normal Q-Q Plot of HEIGHT. x-axis: Observed Value (16 to 32), the observed quantiles from our sample of plant heights; y-axis: Expected Normal Value (16 to 30), the expected quantiles for a normal distribution with the same mean and standard deviation as the observed distribution]
Box and Whisker Plots
• The Box includes
– The Median
– Q1 and Q3 as the edges of the box
• The Whiskers
– either (method 1) – “5 number summary”
• Max and the Min are the ends of the whiskers
– or (method 2) – default method used in SPSS
• Q3 + 1.5 IQR and Q1 - 1.5 IQR are the limits for the ends of the whiskers (or the Max and Min, if these lie within the limits)
• Q3 + 3.0 IQR and Q1 - 3.0 IQR are the borders between outliers and extreme outliers
• Symbols used for outliers (O) and extreme outliers (*)
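Box and Whisker Plot Method 1 - 5 Number Summary
[Figure: box and whisker plot labelled, from top to bottom, Max, Q3, Q2 (Median), Q1, Min. The box spans the IQR and the whiskers span the Range]
This type of Box and Whisker Plot is the simplest.
It is based on a five number summary:
Max, Q3, Q2, Q1, Min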
Box and Whisker Plot Method 2 - SPSS (Boxplot)
[Figure: SPSS boxplot labelled, from top to bottom: Extreme Outlier (*), Q3 + 3 IQR, Outliers (o), Q3 + 1.5 IQR (or max), Q3, Q2 (Median), Q1, Q1 - 1.5 IQR (or min), Outlier (o), Q1 - 3 IQR]
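The method 2 limits can be sketched in Python using the weights data set from the earlier worked example. Q1 and Q3 are taken from that example (65.5 kg and 83.0 kg); SPSS may compute its quartiles slightly differently, so this is an illustration of the rule rather than a reproduction of SPSS output. Ending each whisker at the most extreme value inside the 1.5 IQR limits is assumed here.

```python
# The 23 individual weights (kg) from the earlier worked example
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

q1, q3 = 65.5, 83.0   # quartiles from the worked example
iqr = q3 - q1

# Limits for the whiskers
inner_low = q1 - 1.5 * iqr
inner_high = q3 + 1.5 * iqr

# Borders between outliers and extreme outliers
outer_low = q1 - 3.0 * iqr
outer_high = q3 + 3.0 * iqr

extreme_outliers = [x for x in weights if x < outer_low or x > outer_high]
outliers = [x for x in weights
            if (x < inner_low or x > inner_high) and x not in extreme_outliers]

# Whiskers end at the most extreme values inside the inner limits (assumed)
inside = [x for x in weights if inner_low <= x <= inner_high]
whisker_low, whisker_high = min(inside), max(inside)

print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers (o):", outliers)
print("Extreme outliers (*):", extreme_outliers)
```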
Making a Boxplot in SPSS
SPSS Clustered Boxplot
[Figure: SPSS clustered boxplot of GALLS (y-axis, -10 to 70) by SITES (x-axis, sites 1 to 5, N = 8 per site). Several samples. Note: outlier present in second site (sample)]
Error Bar Plot
The Error Bar plot is used to represent
• The mean
• Plus a measure of variation around the mean
– Confidence Interval of the Sample Mean
– The Standard Error of the Sample Mean
– The Standard Deviation of the sample
• The most common form of the Error Bar Plot
– Is the Standard Error Plot
– Mean ± 1 Standard Error of the Sample Mean
Error Bar Plot in SPSS
Make sure you select the correct measure of variability
The default multiplier is 2 so make sure that you always change it to 1
SPSS Clustered Error Bar Plot
[Figure: clustered error bar plot of Mean ± 1 SE GALLS (y-axis, 0 to 40) by SITES (x-axis, sites 1 to 5, N = 8 per site). Several samples. Note: Mean ± 1 S.E.]
Scatter Plot
Two scale variables
[Figure: scatter plot of Oxygen Concentration (y-axis, 2.00 to 5.00) against Temperature (x-axis, -20.00 to 20.00), shown first without and then with a line of best fit or linear regression model (R Sq Linear = 0.979)]
Scatter Plot
Three scale variables
[Figure: three-variable scatter plot. One axis is labelled Turbidity (10.0 to 20.0); the other two axes run from 20.0 to 40.0 and from 4.0 to 11.0]