1.2: Describing Distributions with Numbers

Post on 01-Jan-2016


A. Measures of Location (Center)
   1. Mean
   2. Median

B. Measures of Spread (Variability)
   1. Quartiles (Quantiles)
   2. Variance and Standard Deviation

Measures of Location

How to find the mean (average):

1) Add the values together

2) Divide the total by the number of observations

• Example: Test Scores : 56, 65, 54, 55, 57, 54, 61, 62, 60, 55, 57, 56, 57, 61, 62, 60, 49, 66, 59, 80

Step 1 : 56 + 65 + 54 + …… + 59 + 80 = 1186

Step 2 : 1186 / 20 = 59.3
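The two steps can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
# Test scores from the example above.
scores = [56, 65, 54, 55, 57, 54, 61, 62, 60, 55,
          57, 56, 57, 61, 62, 60, 49, 66, 59, 80]

total = sum(scores)          # Step 1: add the values together
mean = total / len(scores)   # Step 2: divide by the number of observations

print(total, mean)
```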

1. Mean (Average)

To find the mean x̄ of a set of observations, add their values and divide by the number of observations. If the n observations are x_1, x_2, x_3, ..., x_n, their mean is:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

Or, in more compact notation:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

2. Median

How to find the median M:

1) Arrange the observations in order from smallest to largest.

2) If the number of observations is odd, then the median is located at the center of the list. So, if there are n observations, the median is located in spot (n + 1) / 2.

3) If the number of observations is even, then the median is the average of the two terms in the middle spots. These are located in spots (n / 2) and (n / 2) + 1.

Example of finding a median (odd number of observations):

List 1: 2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1

Step 1: Order the list:

1, 2, 2, 3, 4, 5, 6, 6, 8, 10, 11

Step 2: Find the middle term: (n + 1) / 2 = (11 + 1) / 2 = 6

The sixth term in the ordered list is 5, so the median is 5.

Example of finding a median (even number of observations):

List 2: 2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1, 12

Step 1: Order the list:

1, 2, 2, 3, 4, 5, 6, 6, 8, 10, 11, 12

Step 2: Find the two middle terms:

n / 2 = 12 / 2 = 6 and (n / 2) + 1 = (12 / 2) + 1 = 7

Step 3: Average the sixth and seventh terms:

Median = (5 + 6) / 2 = 5.5
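The odd and even cases can be written as one short function. The sketch below follows the three-step rule; the function and list names are illustrative, not part of the original slides.

```python
def median(values):
    """Median by the three-step rule: sort, then take the middle term(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the median sits in spot (n + 1) / 2, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the terms in spots n / 2 and (n / 2) + 1, counting from 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


list_odd = [2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1]
list_even = list_odd + [12]

print(median(list_odd), median(list_even))
```

Python lists are indexed from 0, which is why each 1-based spot from the rule is shifted down by one.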

In The Presence Of Outliers

Q: Do outliers affect the Mean and Median?

Consider the list of numbers from 1 through 9:

1, 2, 3, 4, 5, 6, 7, 8, 9

The mean is 5 and the median is 5.

What if we put the number 100 at the end of the list?

1, 2, 3, 4, 5, 6, 7, 8, 9, 100

The mean is 14.5 and the median is 5.5.

A: Outliers affect the mean much more than the median!
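The same comparison can be reproduced with Python's standard `statistics` module. This is a small check of the numbers above, not a general test for outliers.

```python
from statistics import mean, median

base = list(range(1, 10))        # 1 through 9
with_outlier = base + [100]      # same list with 100 appended

# Shifts caused by the single outlier.
mean_shift = mean(with_outlier) - mean(base)
median_shift = median(with_outlier) - median(base)

print(mean(base), median(base))
print(mean(with_outlier), median(with_outlier))
print(mean_shift, median_shift)
```

The mean moves by 9.5 while the median moves by only 0.5.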

Distributions

The mean is the point at which a histogram balances. For symmetric distributions the mean and median will be nearly the same.

However, since the mean is influenced by outliers, for skewed distributions the mean will be pulled in the direction of the long tail while the median will be resistant to the outliers and remain in nearly the same place.

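[Figure: in a right-skewed distribution the median M lies to the left of the mean x̄; in a left-skewed distribution the mean x̄ lies to the left of the median M.]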

Describing Spread

The Five Number Summary:

1) The Median

2) First Quartile : 25% of the observations lie below the First Quartile

3) Third Quartile : 75% of the observations lie below the third quartile

4) Lowest Individual Observation (Minimum)

5) Highest Individual Observation (Maximum)

Quartiles

Calculating the quartiles:

1) Arrange the observations in increasing order and locate the median M in the ordered list of observations.

2) The First Quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.

3) The Third Quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

Example of calculating the first quartile:

List of quiz scores: 10, 8, 9, 4, 6, 6, 8, 9, 2, 7

1) Order the list: 2, 4, 6, 6, 7, 8, 8, 9, 9, 10

Find the median: (7 + 8) / 2 = 7.5

2) Find all the observations whose position in the list is to the left of the median: 2, 4, 6, 6, 7

Find the median of these values: Q1 = 6

Example of calculating the third quartile:

List of quiz scores: 10, 8, 9, 4, 6, 6, 8, 9, 2, 7, 11

1) Order the list: 2, 4, 6, 6, 7, 8, 8, 9, 9, 10, 11

Find the median: 8

2) Find all the observations whose position in the list is to the right of the median: 8, 9, 9, 10, 11

Find the median of these values: Q3 = 9
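The quartile rule above (the median of each half, leaving out the overall median when n is odd) can be sketched as follows. The function names are illustrative. Statistical software often uses other quartile conventions, such as interpolation, so library results may differ slightly from this hand method.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # positions to the left of the median
    upper = ordered[half + (n % 2):]   # positions to the right of the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


quiz_10 = [10, 8, 9, 4, 6, 6, 8, 9, 2, 7]
quiz_11 = quiz_10 + [11]

print(five_number_summary(quiz_10))
print(five_number_summary(quiz_11))
```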

Interquartile Range

The interquartile range, IQR, is the distance between the first quartile and the third quartile: IQR = Q3 - Q1.

Determining Outliers

Call an observation a suspected outlier if it falls more than 1.5 * IQR above the third quartile or below the first quartile.

Example: Imagine we have a bunch of test scores with Q1 = 50 and Q3 = 80.

The IQR = 80 - 50 = 30. So, 1.5 * IQR = 1.5 * 30 = 45.

This means that any scores above Q3 + 45 = 125 or below Q1 - 45 = 5 are suspected outliers.
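The 1.5 * IQR rule is easy to express as a pair of cutoffs. In the sketch below, the function name and the list of scores are hypothetical and only illustrate the rule; Q1 = 50 and Q3 = 80 come from the example.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs beyond which a value is a suspected outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = outlier_fences(50, 80)

# Hypothetical scores, made up for illustration only.
scores = [3, 50, 64, 80, 99, 130]
suspected = [s for s in scores if s < low or s > high]

print(low, high, suspected)
```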

Boxplot

• A boxplot is a graph of the five number summary. A central box spans the quartiles, with a line marking the median. Whiskers extend out from the box to the extremes.

Example: Low = 47, High = 98, Median = 77, Q1 = 65, Q3 = 85

[Figure: boxplot of the example, with the box spanning Q1 (65) to Q3 (85), a line at the median (77), and whiskers reaching the lowest observation (47) and the highest observation (98).]

Describing Spread

2. The Standard Deviation

• Variance: The variance of a set of observations is an “average” of the squared deviations of the observations from the mean.

• Standard Deviation: The SD is the square root of the variance.

• Note: When averaging the squared deviations, you divide by (n - 1) instead of n.

The Standard Deviation: Example

Test Scores: 65, 77, 83, 80, 95

1) Find the average: 80

2) Find the deviations from the mean, and their squares:

Obs   Deviation from Mean   Deviation Squared
65          -15                   225
77           -3                     9
83            3                     9
80            0                     0
95           15                   225

3) Determine the variance by dividing the sum of the squared deviations by n - 1:

(225 + 9 + 9 + 0 + 225) / (5 - 1) = 468 / 4 = 117

4) Determine the standard deviation:

√117 ≈ 10.8

More Fancy Notation

The variance s² of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations x_1, x_2, ..., x_n is:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

or, in more compact notation:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The standard deviation s is the square root of the variance s²:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Another Example of Standard Deviation

Consider the following years in our past:

1792, 1666, 1362, 1614, 1460, 1867, 1439

Find the standard deviation of these years.

The mean is x̄ = 1600.

x_i     x_i - x̄    (x_i - x̄)²
1792      192        36864
1666       66         4356
1362     -238        56644
1614       14          196
1460     -140        19600
1867      267        71289
1439     -161        25921

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{6}(214870) \approx 35811.67$$

$$s = \sqrt{35811.67} \approx 189.2$$
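Both worked examples can be recomputed directly from the definition. The sketch below divides by n - 1, as the slides do; the function name is illustrative. Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 convention.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


scores = [65, 77, 83, 80, 95]
years = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

for data in (scores, years):
    var = sample_variance(data)
    print(round(var, 2), round(sqrt(var), 1))
```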

Why Do We Square The Deviations?

1) The sum of the squared deviations of any set of observations from their mean is the smallest that the sum of squared deviations from any number can possibly be.

2) The standard deviation is the measure of spread for an important class of symmetric unimodal distributions called the normal distributions.

Why use the Standard Deviation and not the Variance?

1) The standard deviation is used by the normal distribution.

2) The variance uses squared deviations, so it is measured in the square of the original units, while the standard deviation is in the same units as the original data.

Why use n - 1?

1) The sum of the deviations is *always* zero. So, if we know n - 1 of the deviations, then the last deviation can be calculated. So, only n - 1 of the deviations can vary freely. The number n - 1 is called the degrees of freedom.
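The degrees-of-freedom argument can be seen numerically. The sketch below reuses the five test scores from the standard deviation example and recovers the last deviation from the other four; the variable names are illustrative.

```python
scores = [65, 77, 83, 80, 95]
mean = sum(scores) / len(scores)

# Deviations of each observation from the mean.
deviations = [x - mean for x in scores]

# Because the deviations sum to zero, the last one is determined by the rest.
last_from_others = -sum(deviations[:-1])

print(deviations, sum(deviations), last_from_others)
```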

Properties of Standard Deviations

1) The standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of center.

2) s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations get more spread out from the mean, s gets larger.

3) s, like the mean, is not resistant. A few outliers can make s very large.

Which Measure To Use ?

Q: When is the mean better than median? When is the five number summary better than the standard deviation?

Rules Of Thumb

A1: If outliers appear, or if your distribution is skewed, then the mean could be affected, so use the median and the five number summary.

A2: If the distribution is reasonably symmetric and is free of outliers, then the mean and standard deviation should be used.

Changing Units

Consider the following values: 30, 40, 50, 60, 70

The mean is 50 and the standard deviation is 15.8.

What happens to these if we take every score, multiply it by 2 and add 10?

We get these values: 70, 90, 110, 130, 150

The mean is 110 and the standard deviation is 31.6.

[Figure: dot plots of the old values (30 to 70) and the new values (70 to 150) on a common axis running from 30 to 150.]

Linear Transformations

A linear transformation changes the original variable x into the new variable x_new given by an equation of the form:

$$x_{new} = bx + a$$

Note: The constant a shifts all values of x either up or down by the value a. The constant b changes the size of the unit of measurement.

Effects of Linear Transformations

1) To get the new spread, multiply the old spread by |b|.

2) To get the new mean, multiply the old mean by b and add the constant a.
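Both effects can be checked on the Changing Units example, where b = 2 and a = 10. This sketch uses the standard `statistics` module, whose `stdev` divides by n - 1.

```python
from statistics import mean, stdev

old = [30, 40, 50, 60, 70]
a, b = 10, 2                      # x_new = b * x + a
new = [b * x + a for x in old]

print(new)
print(mean(old), round(stdev(old), 1))
print(mean(new), round(stdev(new), 1))
```

The shift a changes the mean but leaves the spread alone; only b rescales the standard deviation.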

Density Curves

A density curve is a curve that:

1) is always on or above the horizontal axis, and

2) has area exactly 1 underneath it.

A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the relative frequency of all observations that fall in that range.

1.3: The Normal Distributions

[Figure: density curves for a normal and a skewed distribution, with the median and mean marked.]

Why are Normal Distributions important in stats?

1) Normal distributions are good descriptions for some distributions of real data.

2) Normal distributions are good approximations to the results of many kinds of chance outcomes.

3) Many statistical inference procedures based on normal distributions work well for other roughly symmetric distributions.

The 68 - 95 - 99.7 Rule

In the normal distribution with mean μ and standard deviation σ, approximately:

• 68% of the observations fall within σ of the mean

• 95% of the observations fall within 2σ of the mean

• 99.7% of the observations fall within 3σ of the mean

Normal Curve Example

John collected data on the heights of women ages 18 to 24. He found that the distribution was roughly normal, with a mean of 64.5 inches and a standard deviation of 2.5 inches.
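Applying the 68 - 95 - 99.7 rule to the height data gives three intervals. The sketch below only computes the interval endpoints; the function name is illustrative.

```python
def within_k_sigma(mu, sigma, k):
    """Interval of values lying within k standard deviations of the mean."""
    return mu - k * sigma, mu + k * sigma


mu, sigma = 64.5, 2.5   # heights in inches, from the example

for k, percent in [(1, 68), (2, 95), (3, 99.7)]:
    low, high = within_k_sigma(mu, sigma, k)
    print(f"about {percent}% of heights fall between {low} and {high} inches")
```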

Standardizing Observations

If x is an observation from a roughly symmetric distribution that has mean μ and standard deviation σ, then the standardized value of x is:

$$z = \frac{x - \mu}{\sigma}$$

Note: A standardized score is often called a z-score.

Example: Women’s IQs have a symmetric distribution with a mean of 97 and a standard deviation of 6.

What is the standard score for a woman with an IQ of 106?

$$z = \frac{106 - 97}{6} = \frac{9}{6} = 1.5$$

Example: Men’s IQs have a roughly symmetric distribution with a mean of 72 and a standard deviation of 8.

What is the standard score for a man with an IQ of 66?

$$z = \frac{66 - 72}{8} = \frac{-6}{8} = -0.75$$

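Standardizing is a one-line computation. The sketch below reproduces both IQ examples; the function name is illustrative.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


print(z_score(106, 97, 6))   # women's IQ example
print(z_score(66, 72, 8))    # men's IQ example
```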

The Standard Normal Distribution

Q: What percentage of people have a score below 66 ?

The Standard Normal Table

Table A is a table of areas under the standard normal curve. The table entry for each z value is the area under the curve to the left of z.

Example: Imagine we have done an experiment, and we want to find what percentage of people fell under a certain score x.

We find that the z-score for the value x is -1.10. The table entry for z = -1.10 is .1357, so about 13.57% of people fell under x.
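A table lookup can be approximated in code. The sketch below computes the area to the left of z from the error function in Python's `math` module; the function name `phi` is an illustrative choice, and the result is a numerical stand-in for Table A rather than the table itself.

```python
from math import erf, sqrt


def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


print(round(phi(-1.10), 4))   # z-score from the example above
print(round(phi(-0.75), 4))   # z-score for the man with an IQ of 66
```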

The Standard Normal Table

Example: The Graduate Record Examinations (GRE) are widely used to help predict the performance of applicants to graduate schools. The range of possible scores on a GRE is 200 to 900. The psychology department at a university finds the scores of its applicants on the quantitative GRE are approximately normal with mean μ = 544 and standard deviation σ = 103. Answer the following:

1) Find the percentage of people who scored 700 or higher on the test.

2) Find the percentage of people who scored below 500 on the test.

3) Find the percentage of people who scored between 500 and 800 on the test.

Solution to 1): the percentage of people who scored 700 or higher on the test.

Find the percentage to the right of the 700 marker.

Find the z-score:

$$z = \frac{700 - 544}{103} = \frac{156}{103} \approx 1.51$$

$$P(X > 700) = P(Z > 1.51) = 1 - P(Z < 1.51) = 1 - .9345 = .0655$$

Solution to 2): the percentage of people who scored below 500 on the test.

Find the percentage to the left of 500.

Find the z-score:

$$z = \frac{500 - 544}{103} = \frac{-44}{103} \approx -0.43$$

$$P(X < 500) = P(Z < -0.43) = 0.3336$$

Answer: 0.3336

Solution to 3): the percentage of people who scored between 500 and 800 on the test.

Find the percentage between 500 and 800.

Find the first z-score:

$$z = \frac{500 - 544}{103} = \frac{-44}{103} \approx -0.43$$

Find the second z-score:

$$z = \frac{800 - 544}{103} = \frac{256}{103} \approx 2.49$$

$$\text{Area} = P(Z < 2.49) - P(Z < -0.43) = .9936 - .3336 = 0.66$$
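The three GRE answers can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). Because the code does not round z to two decimals the way a table lookup does, its results differ from the table answers in the third decimal place.

```python
from statistics import NormalDist

# Quantitative GRE scores, assumed normal with mean 544 and SD 103.
gre = NormalDist(mu=544, sigma=103)

p_700_or_higher = 1 - gre.cdf(700)
p_below_500 = gre.cdf(500)
p_500_to_800 = gre.cdf(800) - gre.cdf(500)

print(round(p_700_or_higher, 3))
print(round(p_below_500, 3))
print(round(p_500_to_800, 3))
```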

Example: The Soup Nazi charges, on average, $4.50 for a cup of soup (and, if you’re lucky, some bread), with a standard deviation of $0.45. Assume the charges are approximately normally distributed.

What is the probability that our check will be more than $5.00?

$$Z = \frac{5.00 - 4.50}{0.45} \approx 1.11$$

$$P(X > 5) = P(Z > 1.11) = 1 - 0.8665 = 0.1335$$

So the probability is about 13.35%.

“Backward” Normal Calculations

• We can find the observed value x for a given proportion in N(μ, σ) by unstandardizing the z-score.

1) State the problem

2) Draw a picture

3) Use the normal table to find the proportion closest to the one you need

4) Read off the z-value

5) Unstandardize: x = μ + zσ

Example

Find the value of z such that the probability of being less than z is .10.

1. z: P(Z < z) = .10

2. Draw a picture. [Figure: standard normal curve centered at 0 with area .10 shaded to the left of z.]

3. In the body of the normal table, find the closest value to .10. Once found, determine the z value.

The closest table entry is .1003: P(Z < -1.28) = .1003. So z = -1.28.

Example

Find the value of z such that the probability of being greater than z is .33.

1. z: P(Z > z) = .33

2. z: P(Z < z) = 1 - .33 = .67

[Figure: standard normal curve centered at 0 with area .67 to the left of z and area .33 to the right.]

3. In the body of the normal table, find the closest value to .67. Once found, determine the z value.

The table contains .6700 at z = .44, so P(Z < .44) = .67 and P(Z > .44) = .33. So z = .44.
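Reading the table backward corresponds to the inverse of the cumulative distribution. A minimal sketch using `statistics.NormalDist.inv_cdf` (Python 3.8 or later) reproduces both z values to two decimals; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# z with area .10 to its left.
z_lower_10 = std_normal.inv_cdf(0.10)

# z with area .33 to its right, so area 1 - .33 = .67 to its left.
z_upper_33 = std_normal.inv_cdf(1 - 0.33)

print(round(z_lower_10, 2), round(z_upper_33, 2))
```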

Example

X = time Americans spend stirring sugar into their iced tea, X ~ N(12.3, 3.1) seconds

(1) Find the percent of Americans who spend between 20 and 22 seconds stirring sugar into their iced tea, i.e. P(20 < X < 22).

$$P(20 < X < 22) = P\left(\frac{20 - 12.3}{3.1} < Z < \frac{22 - 12.3}{3.1}\right)$$

= P(2.48 < Z < 3.13)

= P(Z < 3.13) - P(Z < 2.48)

= .9991 - .9934

= .0057, or about 0.57%

Example

X = time Americans spend stirring sugar into their iced tea, X ~ N(12.3, 3.1)

(2) About 18.4% of Americans spend more than how many seconds stirring sugar into their iced tea? i.e. Find the value x such that the probability of being greater than this value is .184.

(1) z: P(Z > z) = .184

(2) z: P(Z < z) = 1 - .184 = .816

(3) From the normal table, z = 0.90

(4) So x = μ + zσ = 12.3 + 0.90(3.1) = 12.3 + 2.79 = 15.09

A person would have to stir for more than 15.09 seconds.

Example

X = IQ scores, X ~ N(112, 9)

Find the IQ score that places you in the top 2% of all scores.

1. z: P(Z > z) = .02

2. z: P(Z < z) = 1 - .02 = .98

3. From the normal table, z = 2.05

4. x = μ + zσ = 112 + 2.05(9) = 130.45
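The backward calculation x = μ + zσ can be wrapped in a small helper. The function name below is hypothetical, and because `inv_cdf` returns an unrounded z, the results differ slightly from the table-based answers (for example, about 130.48 instead of 130.45).

```python
from statistics import NormalDist


def cutoff_for_upper_tail(mu, sigma, upper_tail):
    """Value x such that P(X > x) = upper_tail when X ~ N(mu, sigma)."""
    z = NormalDist().inv_cdf(1 - upper_tail)   # area to the left of z
    return mu + z * sigma                      # unstandardize


stir_cutoff = cutoff_for_upper_tail(12.3, 3.1, 0.184)   # iced tea example
iq_cutoff = cutoff_for_upper_tail(112, 9, 0.02)         # top 2% of IQ scores

print(round(stir_cutoff, 2), round(iq_cutoff, 2))
```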

Exercise

The distribution of SAT Math scores is approximately normal with mean 500 and standard deviation 100.

1. In what range do the middle 95% of all SAT Math scores lie?

2. What proportion of SAT Math scores are between 450 and 650?

3. If high school students having SAT Math scores in the top 10% of all scores are eligible for a certain scholarship, what is the lowest score a person eligible for the scholarship can have?