STA 291 Spring 2010

STA 291 Spring 2010 Lecture 5 Dustin Lueker


Transcript of STA 291 Spring 2010

Page 1: STA 291 Spring 2010

STA 291 Spring 2010

Lecture 5
Dustin Lueker

Page 2: STA 291 Spring 2010

Measures of Central Tendency

Mode - Most frequent value

Notation: Subscripted variables
◦ n = # of units in the sample
◦ N = # of units in the population
◦ x = Variable to be measured
◦ x_i = Measurement of the ith unit

Mean - Arithmetic Average
◦ Mean of a Sample - x̄ (x-bar)
◦ Mean of a Population - μ (mu)

Median - Midpoint of the observations when they are arranged in increasing order
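As a minimal sketch of the mean and median defined above, the following uses Python's standard `statistics` module. The data set is the one used in the percentile examples of this lecture; the variable names are illustrative only.

```python
from statistics import mean, median

data = [3, 7, 12, 13, 15, 19, 24]

x_bar = mean(data)   # arithmetic average: sum of the x_i divided by n
mid = median(data)   # midpoint of the observations in increasing order

print(f"n = {len(data)}, mean = {x_bar:.2f}, median = {mid}")
```

With an odd number of observations the median is the single middle value; with an even number, `median` averages the two middle values.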

STA 291 Spring 2010 Lecture 5

Page 3: STA 291 Spring 2010

Symbols

◦ μ (mu) - population mean
◦ σ (sigma) - population standard deviation
◦ σ² (sigma-squared) - population variance
◦ x_i (x-i) - observation
◦ x̄ (x-bar) - sample mean
◦ s - sample standard deviation
◦ s² - sample variance
◦ Σ - summation symbol

Page 4: STA 291 Spring 2010

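Variance and Standard Deviation

Sample
◦ Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
◦ Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Population
◦ Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
◦ Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$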

Page 5: STA 291 Spring 2010

Variance Step By Step

1. Calculate the mean
2. For each observation, calculate the deviation
3. For each observation, calculate the squared deviation
4. Add up all the squared deviations
5. Divide the result by (n-1)
◦ Or N if you are finding the population variance

(To get the standard deviation, take the square root of the result)
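The five steps above can be sketched in Python as follows. The function name `variance` and the data set are hypothetical, chosen so that the mean is a whole number; this is an illustration, not a replacement for a statistics library.

```python
import math

def variance(data, population=False):
    """Follow the five steps; divide by N instead of n-1 when population=True."""
    n = len(data)
    mean = sum(data) / n                       # 1. calculate the mean
    deviations = [x - mean for x in data]      # 2. deviation of each observation
    squared = [d ** 2 for d in deviations]     # 3. squared deviation of each observation
    total = sum(squared)                       # 4. add up all the squared deviations
    return total / (n if population else n - 1)   # 5. divide by N or by (n-1)

# Hypothetical data: the mean is 5 and the squared deviations sum to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = variance(data)                      # sample variance, 32 / 7
sigma2 = variance(data, population=True) # population variance, 32 / 8

print("sample:    ", s2, math.sqrt(s2))
print("population:", sigma2, math.sqrt(sigma2))
```

The standard deviation is the square root of whichever variance was computed.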

Page 6: STA 291 Spring 2010

Empirical Rule

If the data is approximately symmetric and bell-shaped then
◦ About 68% of the observations are within one standard deviation from the mean
◦ About 95% of the observations are within two standard deviations from the mean
◦ About 99.7% of the observations are within three standard deviations from the mean
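To see what the rule means for a particular distribution, the sketch below turns a mean and a standard deviation into the three intervals. The function name and the values 100 and 15 are assumptions for illustration; the percentages only apply when the data are roughly bell-shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Return mean ± k*sd for k = 1, 2, 3 with the approximate share of observations."""
    coverage = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}
    return {k: (mean - k * sd, mean + k * sd, label) for k, label in coverage.items()}

# Hypothetical bell-shaped data with mean 100 and standard deviation 15
for k, (low, high, label) in empirical_rule_intervals(100, 15).items():
    print(f"{label} of the observations lie between {low} and {high}")
```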

Page 7: STA 291 Spring 2010

Empirical Rule


Page 8: STA 291 Spring 2010

Percentiles

The pth percentile (Xp) is a number such that p% of the observations take values below it, and (100-p)% take values above it
◦ 50th percentile = median
◦ 25th percentile = lower quartile
◦ 75th percentile = upper quartile

The index of the pth percentile in the ordered data, Lp
◦ Lp = (n+1)p/100

Page 9: STA 291 Spring 2010

Quartiles

25th percentile
◦ lower quartile
◦ Q1
◦ (approximately) median of the observations below the median

75th percentile
◦ upper quartile
◦ Q3
◦ (approximately) median of the observations above the median

Page 10: STA 291 Spring 2010

Example

Find the 25th percentile of this data set
◦ {3, 7, 12, 13, 15, 19, 24}
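One way to work this example, assuming the index formula (n+1)p/100 from the Percentiles slide, with n = 7 and p = 25:

$$L_{25} = \frac{(n+1)\,p}{100} = \frac{(7+1)(25)}{100} = 2$$

The index is a whole number, so no interpolation is needed: the 25th percentile is the 2nd value in the ordered data, $X_{25} = 7$.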

Page 11: STA 291 Spring 2010

Interpolation

Use when the index is not a whole number

Start at the closest whole index below the number found, then go the distance of the decimal part towards the next value

If the index is found to be 5.4 you want to go to the 5th value then add 0.4 of the distance between the 5th value and the 6th value
◦ In essence we are going to the 5.4th value

Page 12: STA 291 Spring 2010

Example

Find the 40th percentile of the same data set
◦ {3, 7, 12, 13, 15, 19, 24}

Must use interpolation
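The index rule and the interpolation step can be combined in a short function. This is a sketch under the lecture's convention (index = (n+1)p/100); the function name `percentile` and the clamping at the ends of the data are assumptions, and other textbooks and software use slightly different percentile definitions.

```python
import math

def percentile(data, p):
    """pth percentile using the index (n+1)p/100 with linear interpolation."""
    values = sorted(data)
    n = len(values)
    index = (n + 1) * p / 100      # 1-based position in the ordered data
    lower = math.floor(index)      # closest whole index at or below the position
    fraction = index - lower       # the decimal part of the index
    # Assumption: use the smallest or largest value when the index falls outside 1..n
    if lower < 1:
        return values[0]
    if lower >= n:
        return values[-1]
    below, above = values[lower - 1], values[lower]
    return below + fraction * (above - below)

data = [3, 7, 12, 13, 15, 19, 24]
print(percentile(data, 25))            # index 2: the 2nd value
print(round(percentile(data, 40), 2))  # index 3.2: 0.2 of the way from the 3rd to the 4th value
```

For the 40th percentile the index is (7+1)(40)/100 = 3.2, so the result is 12 + 0.2(13 - 12) = 12.2.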

Page 13: STA 291 Spring 2010

Data Summary

Five Number Summary
◦ Minimum
◦ Lower Quartile
◦ Median
◦ Upper Quartile
◦ Maximum

Example
◦ minimum=4
◦ Q1=256
◦ median=530
◦ Q3=1105
◦ maximum=320,000

What does this suggest about the shape of the distribution?
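A five number summary can be computed with Python's standard library (3.8 or later). The helper name `five_number_summary` is hypothetical. The sketch relies on the `"exclusive"` method of `statistics.quantiles`, which, as far as the documentation describes it, places quartiles at positions (n+1)p/100, the same index used in this lecture.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters
    q1, med, q3 = quantiles(data, n=4, method="exclusive")
    return min(data), q1, med, q3, max(data)

# Data set from the percentile examples in this lecture
print(five_number_summary([3, 7, 12, 13, 15, 19, 24]))
```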

Page 14: STA 291 Spring 2010

Interquartile Range (IQR)

The Interquartile Range (IQR) is the difference between upper and lower quartile
◦ IQR = Q3 - Q1
◦ IQR = Range of values that contains the middle 50% of the data
◦ IQR increases as variability increases

Murder Rate Data
◦ Q1 = 3.9
◦ Q3 = 10.3
◦ IQR =

Page 15: STA 291 Spring 2010

Box Plot

Displays the five number summary (and more) graphically

Consists of
◦ A box that contains the central 50% of the distribution (from lower quartile to upper quartile)
◦ A line within the box that marks the median
◦ Whiskers that extend to the maximum and minimum values

This is assuming there are no outliers in the data set

Page 16: STA 291 Spring 2010

Outliers

An observation is an outlier if it falls
◦ more than 1.5 IQR above the upper quartile
or
◦ more than 1.5 IQR below the lower quartile
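The 1.5 IQR rule amounts to two cut-off values, often called fences. The sketch below computes them from the murder rate quartiles given in this lecture (Q1 = 3.9, Q3 = 10.3); the function names and the test observation 25.0 are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence); observations outside them are outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(x, q1, q3):
    low, high = outlier_fences(q1, q3)
    return x < low or x > high

q1, q3 = 3.9, 10.3                 # murder rate quartiles from the lecture
low, high = outlier_fences(q1, q3)
print(f"IQR = {q3 - q1:.1f}, fences = ({low:.1f}, {high:.1f})")
print(is_outlier(25.0, q1, q3))    # 25.0 is a hypothetical observation
```

A negative lower fence, as here, simply means that no small observation can be flagged as an outlier when the data cannot go below zero.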

Page 17: STA 291 Spring 2010

Box Plot

Whiskers only extend to the most extreme observations within 1.5 IQR beyond the quartiles

If an observation is an outlier, it is marked by an x, +, or some other identifier

Page 18: STA 291 Spring 2010

Example

Values
◦ Min = 148
◦ Q1 = 158
◦ Median = Q2 = 162
◦ Q3 = 182
◦ Max = 204

Create a box plot

Page 19: STA 291 Spring 2010

5 Number Summary/Box Plot

On right-skewed distributions, minimum, Q1, and median will be “bunched up”, while Q3 and the maximum will be farther away.

For left-skewed distributions, the “mirror” is true: the maximum, Q3, and the median will be relatively close compared to the corresponding distances to Q1 and the minimum.

Symmetric distributions?
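The "bunched up" idea can be turned into a rough check by comparing the distances on each side of the median. This is a crude heuristic written for illustration (the function name and the exact comparison are assumptions), not a formal test of skewness. It is applied to the two five number summaries given in this lecture.

```python
def skew_from_summary(minimum, q1, median, q3, maximum):
    """Guess the shape from how far the summary values sit from the median."""
    lower_spread = (median - q1) + (median - minimum)
    upper_spread = (q3 - median) + (maximum - median)
    if upper_spread > lower_spread:
        return "right-skewed"
    if upper_spread < lower_spread:
        return "left-skewed"
    return "roughly symmetric"

print(skew_from_summary(4, 256, 530, 1105, 320000))   # Data Summary example
print(skew_from_summary(148, 158, 162, 182, 204))     # box plot example
```

In both examples the values above the median are spread out much more than those below it, which is the pattern described for right-skewed distributions.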

Page 20: STA 291 Spring 2010

Mode

Value that occurs most frequently
◦ Does not need to be near the center of the distribution; not really a measure of central tendency
◦ Can be used for all types of data (nominal, ordinal, interval)

Special Cases
◦ Data Set {2, 2, 4, 5, 5, 6, 10, 11} Mode =
◦ Data Set {2, 6, 7, 10, 13} Mode =
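The two special cases can be checked with a small frequency count. The function name `modes` is hypothetical, and the choice to report no mode when every value occurs exactly once is an assumption; some texts instead say that every value is a mode in that case.

```python
from collections import Counter

def modes(data):
    """Return every value tied for the highest frequency, or [] if no value repeats."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Assumption: when every value occurs once, report that there is no mode
        return []
    return sorted(value for value, count in counts.items() if count == top)

print(modes([2, 2, 4, 5, 5, 6, 10, 11]))   # two values tie for most frequent
print(modes([2, 6, 7, 10, 13]))            # no value repeats
```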

Page 21: STA 291 Spring 2010

Mean vs. Median vs. Mode

Mean
◦ Interval data with an approximately symmetric distribution

Median
◦ Interval or ordinal data

Mode
◦ All types of data

Page 22: STA 291 Spring 2010

Mean vs. Median vs. Mode

Mean is sensitive to outliers
◦ Median and mode are not. Why?

In general, the median is more appropriate for skewed data than the mean
◦ Why?

In some situations, the median may be too insensitive to changes in the data

The mode may not be unique
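A small experiment illustrates the sensitivity of the mean. The first data set is the one from the percentile examples; the second replaces its largest value with 2400, a hypothetical misrecorded observation.

```python
from statistics import mean, median

data = [3, 7, 12, 13, 15, 19, 24]
with_outlier = [3, 7, 12, 13, 15, 19, 2400]   # hypothetical misrecorded largest value

print(f"original:     mean = {mean(data):.1f}, median = {median(data)}")
print(f"with outlier: mean = {mean(with_outlier):.1f}, median = {median(with_outlier)}")
```

The mean uses the size of every observation, so one extreme value pulls it far away; the median depends only on which value sits in the middle of the ordered data, so it does not move.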

Page 23: STA 291 Spring 2010

Example

“How often do you read the newspaper?”

Response                 Frequency
every day                969
a few times a week       452
once a week              261
less than once a week    196
Never                    76
TOTAL                    1954

• Identify the mode
• Identify the median response
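For ordinal data in a frequency table, the mode is the response with the largest frequency and the median is found by accumulating frequencies until the middle position is reached. The sketch below applies this to the table above; the function names are hypothetical, and the responses must be listed in their natural order.

```python
# Frequency table from the slide, in the order of the ordinal scale
table = [
    ("every day", 969),
    ("a few times a week", 452),
    ("once a week", 261),
    ("less than once a week", 196),
    ("Never", 76),
]

def mode_response(rows):
    """Response with the largest frequency."""
    return max(rows, key=lambda row: row[1])[0]

def response_at(rows, position):
    """Response of the observation at a 1-based position in the ordered data."""
    cumulative = 0
    for label, freq in rows:
        cumulative += freq
        if cumulative >= position:
            return label
    raise ValueError("position is larger than the number of observations")

def median_response(rows):
    total = sum(freq for _, freq in rows)
    # With an even total there are two middle positions; with an odd total they coincide
    first, second = (total + 1) // 2, total // 2 + 1
    low, high = response_at(rows, first), response_at(rows, second)
    return low if low == high else (low, high)

print("mode:  ", mode_response(table))
print("median:", median_response(table))
```

With 1954 responses the middle positions are 977 and 978; the first category covers only positions 1 to 969, so both fall in the second category.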

Page 24: STA 291 Spring 2010

Measures of Variation

Statistics that describe variability
◦ Two distributions may have the same mean and/or median but different variability
◦ Mean and Median only describe a typical value, but not the spread of the data
◦ Range
◦ Variance
◦ Standard Deviation
◦ Interquartile Range

All of these can be computed for the sample or population

Page 25: STA 291 Spring 2010

Range

Difference between the largest and smallest observation
◦ Very much affected by outliers
◦ A misrecorded observation may lead to an outlier, and affect the range

The range does not always reveal different variation about the mean

Page 26: STA 291 Spring 2010

Example

Sample 1
◦ Smallest Observation: 112
◦ Largest Observation: 797
◦ Range =

Sample 2
◦ Smallest Observation: 15033
◦ Largest Observation: 16125
◦ Range =
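The two ranges follow directly from the definition. A minimal sketch, with a hypothetical helper name:

```python
def value_range(smallest, largest):
    """Range = largest observation minus smallest observation."""
    return largest - smallest

range_1 = value_range(112, 797)       # Sample 1
range_2 = value_range(15033, 16125)   # Sample 2
print(range_1, range_2)               # 685 1092
```

Sample 2 has the larger range even though its observations are all of similar magnitude, which shows that the range measures spread, not the size of the values themselves.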