EDA stats 2010

Module 2 - Exploratory Data Analysis (EDA)

Central Tendency and Variability

Text: Field, A. (2009), 2nd edition. Chapter 1: 1.7; Chapter 2: 2.1 – 2.5; Chapter 4: 4.1 – 4.9

Describing a Population/Sample

• Statistics is the study of data that contain some element of random variation. A quantity that varies in this way is called a random variable.

• This variation in the variable under study can be conceptualised as a frequency or probability distribution.

• An example - Distribution of a normal random variable (x)

• The properties of this distribution can be described in several ways - Central tendency, Position, Variability

Describing a Population/Sample

• Central Tendency or “Average”
  – Mode
  – Median
  – Mean

• Position
  – Quantiles
  – Quartiles
  – Percentiles

• Variability or Dispersion
  – Range, Interquartile Range (IQR)
  – Variance, Standard Deviation
  – Standard Error of the Sample Mean

[Figure: histogram of height, x-axis from 16 to 32. Mean = 23.03, Std. Dev. = 2.7412, N = 50]

Working With an Example

Note that for the following definitions, we will be working with the following data set (n = 23) of individual weights (kg):

73    93    68.5  101   65.5  78.5
83    80    80.5  87    73    75.6
61    86.5  61.5  65.5  39    98
69.5  52.5  71.5  76    74.5

Central Tendency - Mode

• The mode is the most common value

• It has the highest frequency in the dataset

• You can see that the example dataset has two modes:

65.5kg and 73kg both have a frequency of 2

• This dataset is bimodal

Value   Frequency
39      1
52.5    1
61      1
61.5    1
65.5    2
68.5    1
69.5    1
71.5    1
73      2
74.5    1
75.6    1
76      1
78.5    1
80      1
80.5    1
83      1
86.5    1
87      1
93      1
98      1
101     1
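The frequency table and the two modes can be reproduced with a few lines of Python. This is only a sketch using the standard library. The variable names are illustrative, and the weights are the 23 values from the worked example.

```python
from collections import Counter

# Weights (kg) from the worked example, n = 23
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# Frequency table: value -> number of times it occurs
frequency = Counter(weights)

# The mode(s) are the values that share the highest frequency
highest = max(frequency.values())
modes = sorted(value for value, count in frequency.items() if count == highest)

for value in sorted(frequency):
    print(value, frequency[value])
print("Highest frequency:", highest)
print("Modes:", modes)  # more than one mode means the data set is multimodal
```

Because two values share the highest frequency, the list of modes has two entries, which is what makes the data set bimodal.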

Central Tendency - Median

• The median is the middle value in an ordered list of n numbers

• 50% of the data lie on either side of this value

• It is also represented as Q2 (2nd Quartile)

• The position of Q2 in the ordered list can be calculated by using the following: $\frac{1}{2}(n + 1)$

Calculating the Median

In our example the dataset contains 23 numbers:

$\frac{1}{2}(23 + 1) = 12$, so the median is the 12th number

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83
19      86.5
20      87
21      93
22      98
23      101

Therefore the 12th number in the ascending data set will be the median (Q2 = 74.5kg)
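The same calculation can be sketched in Python. The handling of an even n (averaging the two middle values) is a common convention added here as an assumption. The slides only cover the odd case.

```python
# Weights (kg) from the worked example, n = 23
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

ordered = sorted(weights)
n = len(ordered)

# Position of Q2 in the ordered list, counting from 1
position = (n + 1) / 2

if position.is_integer():
    # Odd n: the median is the value at that position
    median = ordered[int(position) - 1]
else:
    # Even n (assumed convention): average the two values either side
    median = (ordered[int(position) - 1] + ordered[int(position)]) / 2

print("Position:", position)
print("Median (Q2):", median)
```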

Central Tendency - Mean

• Sample mean
  – Represented by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

• Population mean
  – Represented by $\mu$

• Note that $\sum_{i=1}^{n}$ means
  – sum all values from 1 to n

Calculating the Mean

• The summation of all of our data values = 1714.1 kg.

• Divided by the number of values (n = 23)

• So the mean is $\bar{x} = \frac{1714.1}{23} = 74.5$ kg.
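As a quick check of the arithmetic, the sum and the mean can be computed directly. This is a minimal sketch. Results are rounded for display because floating-point sums carry tiny rounding errors.

```python
# Weights (kg) from the worked example, n = 23
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

total = sum(weights)   # sum of all values
n = len(weights)       # number of values
mean = total / n       # sample mean

print("Sum:", round(total, 1), "kg")
print("n:", n)
print("Mean:", round(mean, 1), "kg")
```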

Position

• Quantiles
  – General name for measures of position that divide the distribution (or ranked data) into equal groups. For example, quarters, tenths, hundredths, etc.

• Quartiles
  – Measures of position that divide the distribution (or ranked data) into quarters.

• Percentiles
  – Measures of position that divide the distribution (or ranked data) into 100 equal subsets

Central Tendency vs. Variability

• The mean, median, and mode all tell us about the central tendency of a distribution.

• They cannot tell us about the spread of the distribution (variability).

Variability - Range

• The Range of the distribution of data is given by the difference between the maximum value and the minimum value

Range = Max - Min

• A measurement of variability that usually accompanies the Median.

Variability - Interquartile Range

• Quartiles are the three points (Q1, Q2, Q3) in the distribution defining four equal quarters.

• The quartiles cut the data distribution into four sections each containing 25% of the data.

[Figure: a distribution divided at Q1, Q2 and Q3, with 25% of the data in each section]

Variability - Interquartile Range

• The Interquartile Range (IQR) is represented by the difference between the lower quartile (Q1) and the upper quartile (Q3)

• These quartile positions can be calculated via $\frac{1}{4}(n + 1)$ for Q1 and $\frac{3}{4}(n + 1)$ for Q3

• The IQR can then be calculated using the value at these positions

• A measurement of variability that usually accompanies the Median.

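Calculating the Interquartile Range

Position of Q1: $\frac{1}{4}(23 + 1) = 6$, so Q1 is the 6th number

Position of Q3: $\frac{3}{4}(23 + 1) = 18$, so Q3 is the 18th number

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5    (Q1)
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5    (Q2 or Median)
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83      (Q3)
19      86.5
20      87
21      93
22      98
23      101

Q1 is therefore 65.5kg.

Q3 is therefore 83.0kg.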

Calculating the Interquartile Range

Q1 = 65.5kg

Q3 = 83.0kg

IQR = Q3 - Q1 = 83.0 - 65.5 = 17.5kg
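The range and the interquartile range can be computed with the same (n + 1) position rule. The sketch below uses a hypothetical helper, `value_at_position`, that interpolates linearly when a position is not a whole number. That interpolation is an assumption, since the worked example only produces whole positions, and other textbooks and packages use slightly different quartile rules.

```python
def value_at_position(ordered, position):
    """Return the value at a 1-based position in an ordered list.

    If the position is fractional, interpolate linearly between the two
    neighbouring values (an assumed convention, not taken from the slides).
    """
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


# Weights (kg) from the worked example, n = 23
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

ordered = sorted(weights)
n = len(ordered)

data_range = ordered[-1] - ordered[0]              # Range = Max - Min
q1 = value_at_position(ordered, (n + 1) / 4)       # 6th number
q3 = value_at_position(ordered, 3 * (n + 1) / 4)   # 18th number
iqr = q3 - q1

print("Range:", data_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```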

Variability Around the Mean

Variation around the mean can be described as the difference (or distance) between each data point and the mean.

We cannot simply add up these differences, because their sum will always be zero: the positive differences cancel out the negative differences.

Number   Mean       Number - Mean
73       74.52609   -1.526086957
93       74.52609   18.47391304
68.5     74.52609   -6.026086957
101      74.52609   26.47391304
65.5     74.52609   -9.026086957
78.5     74.52609   3.973913043
83       74.52609   8.473913043
80       74.52609   5.473913043
80.5     74.52609   5.973913043
87       74.52609   12.47391304
73       74.52609   -1.526086957
75.6     74.52609   1.073913043
61       74.52609   -13.52608696
86.5     74.52609   11.97391304
61.5     74.52609   -13.02608696
65.5     74.52609   -9.026086957
39       74.52609   -35.52608696
98       74.52609   23.47391304
69.5     74.52609   -5.026086957
52.5     74.52609   -22.02608696
71.5     74.52609   -3.026086957
76       74.52609   1.473913043
74.5     74.52609   -0.026086957
Total               0

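• If we square the differences then the results can never be negative, so they no longer cancel out
  – the sum of these squared differences is known as the sum of squares (SS)
  – this can be represented by the following equation

$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$

  – Where:
    $\bar{x}$ represents the mean
    $x_i$ represents each individual number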

Number   Mean       Number - Mean   Difference Squared
73       74.52609   -1.526086957    2.328941
93       74.52609   18.47391304     341.2855
68.5     74.52609   -6.026086957    36.31372
101      74.52609   26.47391304     700.8681
65.5     74.52609   -9.026086957    81.47025
78.5     74.52609   3.973913043     15.79198
83       74.52609   8.473913043     71.8072
80       74.52609   5.473913043     29.96372
80.5     74.52609   5.973913043     35.68764
87       74.52609   12.47391304     155.5985
73       74.52609   -1.526086957    2.328941
75.6     74.52609   1.073913043     1.153289
61       74.52609   -13.52608696    182.955
86.5     74.52609   11.97391304     143.3746
61.5     74.52609   -13.02608696    169.6789
65.5     74.52609   -9.026086957    81.47025
39       74.52609   -35.52608696    1262.103
98       74.52609   23.47391304     551.0246
69.5     74.52609   -5.026086957    25.26155
52.5     74.52609   -22.02608696    485.1485
71.5     74.52609   -3.026086957    9.157202
76       74.52609   1.473913043     2.17242
74.5     74.52609   -0.026086957    0.000681
Total               0               4386.944

$SS = \sum (x_i - \bar{x})^2 = 4386.944$
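The two totals in the table can be checked with a short script. This is a sketch with illustrative variable names. Because of floating-point rounding, the sum of the raw differences comes out as a number extremely close to zero rather than exactly zero.

```python
# Weights (kg) from the worked example, n = 23
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

mean = sum(weights) / len(weights)

# Difference between each number and the mean
deviations = [x - mean for x in weights]

# Squaring removes the sign, so the values no longer cancel out
squared_deviations = [d ** 2 for d in deviations]

sum_of_deviations = sum(deviations)        # effectively zero
sum_of_squares = sum(squared_deviations)   # SS

print("Sum of differences:", round(sum_of_deviations, 6))
print("Sum of squares (SS):", round(sum_of_squares, 3))
```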

• Although useful in some calculations, the sum of squares does not take into account the number of observations (it depends on sample size).

• There are three important ways that the spread of the data around the mean can be represented (based on the sum of squares).
  – The Variance ($s^2$).
  – The Standard Deviation ($s$).
  – The Standard Error of the Sample Mean (S.E. or $s_{\bar{x}}$).

Variability - Sample Variance

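• The Variance uses the Sum of Squares adjusted for the number of “independent” observations in the sample ($n - 1$), giving an “average” variation:

$s^2 = \frac{SS}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

• We can use the Sum of Squares calculated in the previous slide:

$s^2 = \frac{4386.944}{23 - 1} = 199.4$ kg$^2$

Notice that we are in squared units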

Variability - Sample Standard Deviation

• The sample’s Standard Deviation is the square root of the Variance:

$s = \sqrt{s^2} = \sqrt{199.4} = 14.12$ kg

Notice that we are now back in our original units

The Standard Error of the Sample Mean

• The Std. Dev. divided by the square root of n is called the Standard Error of the sample mean - we will encounter this measure later on in the course.

$S.E. = \frac{s}{\sqrt{n}} = \frac{\sqrt{199.4}}{\sqrt{23}} = \frac{14.12}{\sqrt{23}} = 2.94$ kg
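The three measures can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) form. This is a minimal sketch. The standard error is computed by hand because the module has no dedicated function for it.

```python
import math
import statistics

# Weights (kg) from the worked example, n = 23
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)

variance = statistics.variance(weights)   # SS / (n - 1), in kg squared
std_dev = statistics.stdev(weights)       # square root of the variance, in kg
std_error = std_dev / math.sqrt(n)        # standard error of the sample mean, in kg

print("Variance:", round(variance, 1))
print("Standard deviation:", round(std_dev, 2))
print("Standard error:", round(std_error, 2))
```

For a whole population rather than a sample, the same module offers `pvariance` and `pstdev`, which divide by N instead of n - 1.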

Sample VS Population

Sample                          Population
$\bar{x}$ = sample mean         $\mu$ = population mean
$s$ = sample std dev            $\sigma$ = population std dev
$s^2$ = sample variance         $\sigma^2$ = population var.
$n$ = sample size               $N$ = population size

Sample Only

Standard error of the sample mean (S.E.)

Module 2 - Exploratory Data Analysis (EDA)

Graphical Methods

Text: Field, A. (2009), 2nd edition. Chapter 1: 1.7; Chapter 2: 2.1 – 2.5; Chapter 4: 4.1 – 4.9

Graphical Methods & SPSS

• Graphical methods are a good way of summarising information and are useful to visualise patterns within your data.

• Various methods can be used depending on the measurement scale of the variables.

• SPSS is the statistical package that you will be using this semester and has a similar spreadsheet format to Microsoft Excel.

• Generally, when entering data into SPSS, each column contains a different variable.

Graphs for Discrete Variables

• Measurement scale - nominal or ordinal
  – Other terms - categorical, binned, class, qualitative
  – Examples - gender, age group, trap type

• Common graphical methods are:
  – Pie charts for proportions, percentages, or values that sum to a fixed value
  – Bar charts for most other discrete variables

• Data can be entered into SPSS in two forms
  – Each case (row) represents a single observation
  – Each case (row) represents the count, percentage, or proportion of each level of the discrete variable
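The relationship between the two data entry forms can be illustrated outside SPSS. The sketch below uses a small, entirely hypothetical set of trap-type records (the labels and counts are invented for illustration). It collapses one-row-per-observation data into counts, then expands the counts back, which is conceptually what weighting cases by a count variable achieves.

```python
from collections import Counter

# Form 1: each row is a single observation (hypothetical trap-type records)
observations = ["pitfall", "light", "pitfall", "malaise", "light", "pitfall"]

# Form 2: each row is one level of the variable together with its count
counts = Counter(observations)

# The same information expressed as percentages of the total
total = sum(counts.values())
percentages = {level: 100 * count / total for level, count in counts.items()}

# Expanding the counts again gives back the individual observations
expanded = sorted(counts.elements())

print(dict(counts))
print(percentages)
print(expanded)
```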

Data Entry for Discrete Variables

Data entry type 1: charts can be created directly using this type of data

Data entry type 2: first tell SPSS that each discrete level has been counted

An Example - Mass (%) of Each Element Within a Star

• The data is entered into SPSS as in data entry type 2

• You must then tell SPSS to weight each observation (case) by the variable “mass”

• You will need to do this for a pie chart and for a bar graph

Making a Pie Chart in SPSS

[Figure: The Pie Chart. Labelled slices: Hydrogen, Helium. Cases weighted by MASS]

Making a Bar Chart in SPSS

[Figure: Simple Bar Chart of Element (Hydrogen, Helium, Other). Cases weighted by MASS. One variable with three categories]

[Figure: Clustered Bar Chart of Smoking Status (Smoker, Non Smoker), clustered by Cancer Status (Cancer, No Cancer). Cases weighted by freq. Two variables with two categories each]

Graphs for Continuous Variables

• Measurement scale - Scale
  – Other terms - quantitative
  – Examples - Length, Temperature, Species Richness

• Common graphical methods are:
  – For a single sample - Histograms, Box and Whisker plots, Error Bar plots, Q-Q plots.
  – For 2 or more samples - Clustered Box and Whisker plots, Clustered Error Bar plots.
  – For 2 scale variables - Scatter plots.

An Example - Plant Heights

We will be using the following data set of plant heights (cm) to construct a histogram.

21    24.5  20    23.5  24.5
20    26    21    24    25
21.5  23.5  21    20    28
23    24.5  22.5  21    28
21    25    21.5  22    26
21.5  26.5  22.5  21.5  25
24    21.5  23    16.5  29
25.5  23    25    19    31
20.5  22.5  23    19    21.5
24    23.5  23    19.5  22.5

Histogram

To create a histogram by hand, we need to create a series of “bins” or categories.
  – The data range from 16.5 to 31.0.
  – We can use the following groups to classify the data.

You can see that the ‘bins’ have been organised so that each datum belongs to a unique group.

Bin Tally Frequency

16 – 17.9

18 – 19.9

20 – 21.9

22 – 23.9

24 – 25.9

26 – 27.9

28 – 29.9

30 – 31.9
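The tally can also be done programmatically, which is a useful check on a hand count. The sketch below assumes bins 2 cm wide starting at 16, matching the table above. The variable names are illustrative.

```python
# Plant heights (cm), n = 50
heights = [21, 24.5, 20, 23.5, 24.5,
           20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28,
           23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26,
           21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29,
           25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5,
           24, 23.5, 23, 19.5, 22.5]

first_edge = 16   # lower limit of the first bin
bin_width = 2     # each bin spans 2 cm, e.g. 16 up to (but not including) 18
n_bins = 8

# Count how many heights fall in each bin
counts = [0] * n_bins
for height in heights:
    index = int((height - first_edge) // bin_width)
    counts[index] += 1

# Print a text version of the tally table
for i, count in enumerate(counts):
    low = first_edge + i * bin_width
    high = low + bin_width - 0.1
    print(f"{low} - {high:.1f}  {'|' * count}  {count}")
```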

Histogram

[Figure: Histogram of Plant height. x-axis: Height Categories (or Bins), labelled 16 – 17.9, 18 – 19.9, 20 – 21.9, 22 – 23.9, 24 – 25.9, 26 – 27.9, 28 – 29.9, 30 – 31.9]

Histogram: Here’s One We Prepared Earlier

[Figure: Histogram of Plant height. x-axis: Height Categories (or Bins)]

Histogram Using SPSS

• SPSS will create the bins, work out frequencies and create the histogram for you

• The data needs to be entered in a single column

[Figure: SPSS histogram of a single sample, variable height, with 8 bins. Mean = 23.03, Std. Dev. = 2.7412, N = 50]

[Figure: SPSS histogram of a single sample, variable height, with 16 bins. Mean = 23.03, Std. Dev. = 2.7412, N = 50]

Q-Q Plot

• For a single sample

• Plots the quantiles of a variable's distribution (observed - unknown distribution) against the quantiles of a test distribution (expected - e.g. Normal Dist.).

• The test distribution (expected values) has the same mean and standard deviation as the observed data.

• Available test distributions include Beta, Chi-square, Exponential, Gamma, Logistic, Lognormal, Normal, Student’s t, and Uniform.

Q-Q Plot

• Probability plots are generally used to determine whether the distribution of a variable (observed - unknown distribution) matches a given distribution (expected - e.g. Normal Dist.).

• If the selected variable matches the test distribution, the points line up on a 45° line (observed = expected).

• Note, if using a sample from a population the sample size needs to be reasonably large.

• An alternative is the P-P plot (percentile plot)
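The pairs of values behind a normal Q-Q plot can be sketched with Python's standard library. Each ordered observation is matched with the quantile that a normal distribution with the same mean and standard deviation would give at the corresponding cumulative proportion. The proportion used here, (rank - 3/8) / (n + 1/4), is Blom's formula. That is an assumed choice, since statistical packages offer several such formulas and results differ slightly between them.

```python
import statistics
from statistics import NormalDist

# Plant heights (cm), n = 50
heights = [21, 24.5, 20, 23.5, 24.5, 20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28, 23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26, 21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29, 25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5, 24, 23.5, 23, 19.5, 22.5]

observed = sorted(heights)   # observed quantiles
n = len(observed)
mean = statistics.mean(observed)
std_dev = statistics.stdev(observed)

# Test distribution: normal with the same mean and standard deviation
test_dist = NormalDist(mu=mean, sigma=std_dev)

# Expected quantile for each rank, using Blom's proportion (an assumption)
expected = [test_dist.inv_cdf((rank - 0.375) / (n + 0.25))
            for rank in range(1, n + 1)]

# Points near the line observed = expected suggest approximate normality
for obs, exp in list(zip(observed, expected))[:5]:
    print(f"observed {obs:5.1f}   expected {exp:6.2f}")
```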

Q-Q Plot

[Figure: Normal Q-Q Plot of HEIGHT. One axis, labelled Observed Value (16 to 32), shows the observed quantiles from our sample of plant heights. The other axis shows the expected quantiles for a normal distribution with the same mean and standard deviation as the observed distribution]

Box and Whisker Plots

• The Box includes
  – The Median
  – Q1 and Q3 as the edges of the box

• The Whiskers
  – either (method 1) – “5 number summary”
    • Max and the Min are the ends of the whiskers
  – or (method 2) – default method used in SPSS
    • Q3+1.5 IQR and Q1-1.5 IQR are the ends of the whiskers
    • Q3+3.0 IQR and Q1-3.0 IQR border between outliers and extreme outliers
    • symbols used for outliers (O) and extreme outliers (*)
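The limits used in method 2 are easy to compute once Q1 and Q3 are known. The sketch below applies them to the 23 weights (kg) from the first half of the module, using `statistics.quantiles`, whose default method places the quartiles at the (n + 1) positions. Statistical packages may calculate the quartiles for a boxplot in a slightly different way, so treat this as an illustration of the rule rather than a reproduction of SPSS output.

```python
import statistics

# Weights (kg) from the central tendency example, n = 23
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# Quartiles at positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1

# Limits at 1.5 IQR and 3.0 IQR beyond the edges of the box
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

# Extreme outliers lie beyond the 3.0 IQR limits
extreme_outliers = sorted(x for x in weights
                          if x < outer_low or x > outer_high)

# Outliers lie beyond the 1.5 IQR limits but inside the 3.0 IQR limits
outliers = sorted(x for x in weights
                  if (x < inner_low or x > inner_high)
                  and x not in extreme_outliers)

print("Q1, Q2, Q3:", q1, q2, q3)
print("1.5 IQR limits:", inner_low, inner_high)
print("3.0 IQR limits:", outer_low, outer_high)
print("Outliers:", outliers)
print("Extreme outliers:", extreme_outliers)
```

With these data the lowest weight falls just below the lower 1.5 IQR limit, so it would be drawn as an outlier rather than as the end of a whisker.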

Box and Whisker Plot Method 1 - 5 Number Summary

[Figure: box and whisker plot with Q2 (Median), the IQR and the Range labelled]

This type of Box and Whisker Plot is the simplest.

It is based on a five number summary:-

Max, Q3, Q2, Q1, Min

Box and Whisker Plot Method 2 - SPSS (Boxplot)

[Figure: SPSS boxplot. Labelled levels, from top to bottom: Q3 + 3 IQR; Q3 + 1.5 IQR (or max); Q2 (Median); Q1 - 1.5 IQR (or min); Q1 - 3 IQR. Labelled points: Extreme Outlier, Outlier, Outliers]

Making a Boxplot in SPSS

[Figure: SPSS Clustered Boxplot of several samples (N = 8 in each of five samples). Note: outlier present in second site (sample)]

Error Bar Plot

The Error Bar plot is used to represent

• The mean

• Plus a measure of variation around the mean
  – Confidence Interval of the Sample Mean
  – The Standard Error of the Sample Mean
  – The Standard Deviation of the sample

• The most common form of the Error Bar Plot
  – Is the Standard Error Plot
  – Mean ± 1 Standard Error of the Sample Mean
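The ends of a standard error bar are simply the mean minus and plus a multiple of the standard error. A minimal sketch for the plant height sample follows. The `multiplier` name is illustrative, and the value 1 gives the Mean ± 1 S.E. form described above.

```python
import math
import statistics

# Plant heights (cm), n = 50
heights = [21, 24.5, 20, 23.5, 24.5, 20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28, 23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26, 21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29, 25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5, 24, 23.5, 23, 19.5, 22.5]

n = len(heights)
mean = statistics.mean(heights)
std_dev = statistics.stdev(heights)    # sample standard deviation
std_error = std_dev / math.sqrt(n)     # standard error of the sample mean

multiplier = 1                         # number of standard errors per bar
lower = mean - multiplier * std_error  # bottom of the error bar
upper = mean + multiplier * std_error  # top of the error bar

print("Mean:", round(mean, 2))
print("Std. Dev.:", round(std_dev, 4))
print("Error bar:", round(lower, 2), "to", round(upper, 2))
```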

Error Bar Plot in SPSS

Make sure you select the correct measure of variability

The default multiplier is 2 so make sure that you always change it to 1

[Figure: SPSS Clustered Error Bar Plot of several samples (N = 8 in each of five samples). Note: Mean ± 1 S.E.]

Scatter Plot

[Figure: scatter plot of two scale variables. x-axis: Temperature (-20.00 to 20.00)]

[Figure: scatter plot of two scale variables with a line of best fit or linear regression model. x-axis: Temperature (-20.00 to 20.00). R Sq Linear = 0.979]

Scatter Plot

[Figure: scatter plot of three scale variables. Axis ranges: 20.0 to 40.0 and 4.0 to 11.0]
