Chapter 3 - Describing Comparing Data

Probability & Statistics – Chapter 2 – Describing, Exploring & Comparing Data

SECTION 2.4: Measures of Center

Three main statistical measures attempt to locate the center of a data set. The main objectives

of this section are to present these measures of center and to show how to compute them.

Definition:

A measure of center is a value at the center or middle of a data set.

The measures of center that we will work with are the mean (arithmetic mean), the median, and the mode.

The arithmetic mean

For a sample from a larger population, the mean is denoted by $\bar{x}$.

If all the values of the population are used, then the mean is denoted by $\mu$.

Notation:

$\sum$ denotes the addition of a set of values

x is the variable usually used to represent the individual data values

n represents the number of values in a sample.

N represents the number of values in a population.

$\bar{x} = \frac{\sum x}{n}$ is the mean of a set of sample values

$\mu = \frac{\sum x}{N}$ denotes the mean of all values in a population

For

(a) Raw Data

Example: Find the mean of the set of numbers: 63, 65, 67, 68, 69, 70, 71, 72, 74, 75

Solution: n = 10

$\sum x = 694$

$\bar{x} = \frac{\sum x}{n} = \frac{694}{10} = 69.4$

(b) Ungrouped frequency distribution

For a frequency distribution,

$\bar{x} = \frac{\sum fx}{\sum f}$

Example: The 30 members of an orchestra were asked how many instruments each could play. The results

are set out in the frequency distribution. Calculate the mean number of instruments played:

Number of instruments, x 1 2 3 4 5

Frequency, f 11 10 5 3 1


  x      f     fx
  1     11     11
  2     10     20
  3      5     15
  4      3     12
  5      1      5

$\sum f = 30$     $\sum fx = 63$

$\bar{x} = \frac{\sum fx}{\sum f} = \frac{63}{30} = 2.1$

The mean number of instruments played is 2.1.
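The frequency-distribution calculation can also be checked with a short program. The following is a minimal Python sketch; the function name `frequency_mean` and the variable names are illustrative choices, not part of these notes.

```python
def frequency_mean(values, freqs):
    """Return sum(f * x) / sum(f) for an ungrouped frequency distribution."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    total_f = sum(freqs)
    return total_fx / total_f


# Orchestra example: number of instruments (x) and number of members (f).
instruments = [1, 2, 3, 4, 5]
members = [11, 10, 5, 3, 1]

orchestra_mean = frequency_mean(instruments, members)
print(round(orchestra_mean, 1))
```

The program forms the same two totals as the table (63 and 30) before dividing.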

(c) Grouped frequency distribution

When data has been grouped into intervals, the midpoint, x, of the interval is taken to represent the

interval.

Example: The lengths of 40 bean pods were measured to the nearest cm and grouped as shown.

Find the mean length, giving the answer to 1 d.p.

Length (cm)    Midpoint, x     f      fx
 4 – 8              6          2      12
 9 – 13            11          4      44
14 – 18            16          7     112
19 – 23            21         14     294
24 – 28            26          8     208
29 – 33            31          5     155

$\sum f = 40$     $\sum fx = 825$

$\bar{x} = \frac{\sum fx}{\sum f} = \frac{825}{40} = 20.6$ (1 d.p.)

The mean length of the bean pods is 20.6 cm (1 d.p.).
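As a rough check of the grouped calculation, the sketch below computes each class midpoint from the class limits and then applies the same formula. It is an illustrative Python sketch; `grouped_mean` is a hypothetical helper name.

```python
def grouped_mean(class_limits, freqs):
    """Estimate the mean of grouped data, using each class midpoint as x."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    total_fx = sum(f * x for x, f in zip(midpoints, freqs))
    return total_fx / sum(freqs)


# Bean pod example: class limits in cm and class frequencies.
bean_classes = [(4, 8), (9, 13), (14, 18), (19, 23), (24, 28), (29, 33)]
bean_freqs = [2, 4, 7, 14, 8, 5]

bean_mean = grouped_mean(bean_classes, bean_freqs)
print(round(bean_mean, 1))
```

Because the raw lengths are replaced by midpoints, the result is an estimate of the mean rather than its exact value.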

Weighted mean

In some situations, the values vary in their degree of importance, so we may want to compute a weighted

mean, which is a mean computed with the different scores assigned different weights. In such cases, we can

calculate the weighted mean by assigning different weights to different values, as shown in the formula

below:

Weighted mean, $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$

Example: Suppose we need a mean of three test scores (85, 90, 75), but the first test counts for 20%, the

second test counts for 30%, and the third test counts for 50% of the final grade. We can assign

weights of 20, 30, and 50 to the test scores, as follows:


$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$

$= \frac{(20 \times 85) + (30 \times 90) + (50 \times 75)}{20 + 30 + 50}$

$= \frac{8150}{100} = 81.5$

The weighted mean formula is used to calculate grade-point average.
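The weighted mean formula translates directly into code. This is a small Python sketch under the assumption that the values and their weights are supplied as two lists of equal length; the name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    total_wx = sum(w * x for x, w in zip(values, weights))
    return total_wx / sum(weights)


# Test score example: scores 85, 90, 75 with weights 20, 30, 50.
test_scores = [85, 90, 75]
test_weights = [20, 30, 50]

final_grade = weighted_mean(test_scores, test_weights)
print(final_grade)
```

Only the relative sizes of the weights matter: weights of 0.2, 0.3, and 0.5 would give the same result as 20, 30, and 50.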

Trimmed mean

An important advantage of the mean is that it takes every value into account, but an important disadvantage

is that it is sometimes dramatically affected by a few extreme values (outliers). Because the mean is very

sensitive to extreme values, we say that it is not a resistant measure of center.

To overcome this disadvantage, a trimmed mean can be used.

To find the 10% trimmed mean for a data set, first arrange the data in order, then delete the bottom 10% of

the values and the top 10% of the values, and calculate the mean of the remaining values.
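The trimming procedure can be sketched in Python as follows. The function name `trimmed_mean`, the small data set, and the choice to round the number of deleted values down when 10% of n is not a whole number are all assumptions made for illustration; the notes do not specify a rounding convention.

```python
def trimmed_mean(data, proportion):
    """Mean after deleting `proportion` of the values from EACH end of the sorted data."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(data)
    # Number of values deleted from each end (assumption: round down).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value (100).
sample = [2, 4, 5, 5, 6, 6, 7, 7, 8, 100]

plain_mean = sum(sample) / len(sample)
ten_percent_trimmed = trimmed_mean(sample, 0.10)
print(plain_mean, ten_percent_trimmed)
```

In this made-up sample the single extreme value pulls the ordinary mean well above the typical values, while the 10% trimmed mean stays close to them.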

Exercise:

Determine the arithmetic mean for the given set of data.

Then determine the trimmed mean for the same data set, and compare both results:

Weights of anesthetized bears:

80 344 416 348 166 220 262 360 204 144 332 34 140 180 105

166 204 26 120 436 125 132 90 40 220 46 154 116 182 150

65 356 316 94 86 150 270 202 365 79 148 446 62 236 212

60 64 114 76 48 29 514 140

The median

The median of a data set is the middle value when the original data values are arranged in order of

increasing (or decreasing) magnitude.

If there are n numbers the median is the ½ (n + 1)th value.

Procedure for finding the median:

Sort the data (arrange in increasing order).

Is the number of values odd or even?

Odd: the median is the value in the exact middle.

Even: the median is the mean of the two middle numbers (add the middle numbers, then divide by 2).


Determining the median from

(a) Raw Data:

Example: Find the median of each of the sets.

(i) 7, 7, 2, 3, 4, 2, 7, 9, 31

Solution:

In order of magnitude:

2, 2, 3, 4, 7, 7, 7, 9, 31

n = 9. The median is the ½(9 + 1)th value, i.e. the 5th value.

So median = 7

(ii) 36, 41, 27, 32, 29, 38, 39, 43

Solution:

In order of magnitude:

27, 29, 32, 36, 38, 39, 41, 43

n = 8 and the median is the ½(8 + 1)th value, i.e. the 4½th value.

This does not exist, so we consider the 4th and 5th values.

Median = ½(36 + 38)

= 37
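The odd/even procedure for raw data can be written as a short function. This is a minimal Python sketch; the name `median` is an illustrative choice (Python's standard library also offers `statistics.median`).

```python
def median(data):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd: the ½(n + 1)th value, which is index n // 2 when counting from 0.
        return ordered[middle]
    # Even: mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


set_i = [7, 7, 2, 3, 4, 2, 7, 9, 31]
set_ii = [36, 41, 27, 32, 29, 38, 39, 43]

print(median(set_i), median(set_ii))
```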

(b) Ungrouped frequency distribution:

The median can be found directly from the cumulative frequency distribution.

Example: The table below shows the number of children in the family for 35 families in a certain area. Find

the median number of children per family.

Number of children    Frequency    Cumulative frequency, cf
        0                 3                  3
        1                 5                  8
        2                12                 20
        3                 9                 29
        4                 4                 33
        5                 2                 35

The median is the 18th value: ½(35 + 1) = 18

We could have written out all the values in order from the frequency table, thus 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, ...

However, we can see from the cumulative frequency table that the 18th value is 2, as the first 8 values are 0

or 1 and the first 20 values are 0 or 1 or 2.

(c) Grouped frequency distribution:

Once the information has been grouped and the raw data lost we can only estimate a value for the median.

We will consider two methods to determine an approximate value for the median:

(i) by calculation

(ii) from a cumulative frequency curve


Example: The masses, measured to the nearest kg, of 49 boys are noted and the distribution formed.

Estimate the median mass.

Mass (kg)     f        Mass (kg)     cf
   – 59       0         < 59.5        0
60 – 64       2         < 64.5        2
65 – 69       6         < 69.5        8
70 – 74      12         < 74.5       20
75 – 79      14         < 79.5       34
80 – 84      10         < 84.5       44
85 – 89       5         < 89.5       49

The median is the ½(49 + 1)th value, i.e. the 25th value.

Method (a): by calculation:

The 25th value lies in the class 74.5 – 79.5.

The first 20 values lie below 74.5, so the median is the 5th of the 14 items in this class. The median is therefore 5/14 of the way along the interval of 5 kg from 74.5 to 79.5.

Estimate of the median mass = 74.5 + (5/14)(5)

= 76.3 kg (1 d.p.)
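The interpolation used in the calculation method can be sketched as below. This is an illustrative Python sketch that follows the ½(n + 1) position convention of these notes; the function name `grouped_median` and its parameters are hypothetical, and all classes are assumed to have the same width.

```python
def grouped_median(lower_boundaries, width, freqs):
    """Estimate the median of grouped data by linear interpolation in the median class."""
    n = sum(freqs)
    position = (n + 1) / 2  # the ½(n + 1)th value, as in the notes
    cumulative = 0          # number of values below the current class
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= position:
            # Fraction of the way through this class at which the median lies.
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("no data")


# Masses of 49 boys: lower class boundaries (kg) and class frequencies.
boundaries = [59.5, 64.5, 69.5, 74.5, 79.5, 84.5]
boy_freqs = [2, 6, 12, 14, 10, 5]

median_mass = grouped_median(boundaries, 5, boy_freqs)
print(round(median_mass, 1))
```

Here `cumulative` plays the role of the cumulative frequency column: it reaches 20 before the class 74.5 – 79.5, leaving 5 of that class's 14 items to count through.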

Method (b): from the cumulative frequency curve

Draw the ogive (cf curve) and read off the value corresponding to a cumulative frequency of 25.

[Figure: Ogive showing the masses of 49 boys. Horizontal axis: mass (kg), from 59.5 to 89.5; vertical axis: cumulative frequency, from 0 to 60.]

From the graph, the value corresponding to the cumulative frequency of 25 is 76.3 kg.

The mode:

The mode is the value that occurs most often (has the highest frequency). For a given data set, more than

one mode can exist.

Two modes: bimodal

More than two modes: multimodal


Determining the mode from

(a) Raw data:

Example: Find the mode(s) of each of the following sets:

(i) 4, 5, 5, 1, 2, 9, 5, 6, 4, 5, 7, 5, 5

Solution: mode = 5

(ii) 2, 2, 3, 5, 8, 2, 5, 6, 6, 5

Solution: modes = 2, 5 (The distribution is bimodal.)

(b) Grouped data:

When data has been grouped into classes, the class which has the largest standard frequency is called the

modal class. An estimate of the mode can be obtained from the modal class.

Example:

Estimate the mode of the following frequency distribution which shows the marks of 330 candidates in an

examination.

Marks        11–20   21–30   31–40   41–50   51–60   61–70   71–80   81–90   91–100
Frequency      20      40      80     100      50      20      10      10       0

Solution: First a histogram or bar chart is constructed.

[Figure: Histogram to show examination marks. Horizontal axis: marks, from 10 to 100; vertical axis: frequency, from 0 to 120.]

The modal class is 41 – 50.

The modal class contains 20 more than the class below and 50 more than the class above. So the mode is

likely to divide the modal class in the ratio 20 : 50 = 2 : 5.

An estimate of the mode can be found from the histogram by drawing lines as shown in the diagram. This

gives a value of 43 marks.

By calculation:

An estimate of the mode is 20/(20 + 50) of the way along the interval of 10 marks from 40 to 50.

Estimate of mode = 40 + (2/7)(10)

= 42.9
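The calculation for the mode of grouped data can be expressed as a small function. This Python sketch uses hypothetical names (`estimate_mode` and its parameters) and, like the worked example, takes 40 as the lower end of the modal class; it assumes the modal class is strictly larger than at least one of its neighbours.

```python
def estimate_mode(lower, width, f_below, f_modal, f_above):
    """Estimate the mode by dividing the modal class in the ratio of the frequency differences."""
    d_below = f_modal - f_below  # excess over the class below
    d_above = f_modal - f_above  # excess over the class above
    return lower + d_below / (d_below + d_above) * width


# Examination marks: modal class 41 - 50 (frequency 100), neighbours 80 and 50.
exam_mode = estimate_mode(lower=40, width=10, f_below=80, f_modal=100, f_above=50)
print(round(exam_mode, 1))
```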


Exercise:

1] If the mean of the following numbers is 17, find the value of c: 12, 18, c, 13

2] The mean of 10 numbers is 8. If an eleventh number is now included in the results, the mean

becomes 9. What is the value of the eleventh number?

3] The mean of 4 numbers is 5, and the mean of 3 different numbers is 12. What is the mean of the 7

numbers together?

4] A bag contained five balls each bearing one of the numbers 1, 2, 3, 4, 5. A ball was drawn from the

bag, its number noted, and then replaced. This was repeated 50 times and the table below shows the

resulting frequency distribution.

Number 1 2 3 4 5

Frequency x 11 y 8 9

If the mean is 2.7,

(i) determine the value of x and y.

(ii) state the mode and median of this distribution.

5] On a certain day the number of books on 40 shelves in a library was noted and grouped as shown.

Find the mean number of books on a shelf. Give your answer to 2 significant figures.

Number of books 31 - 35 36 - 40 41 - 45 46 - 50 51 - 55 56 – 60

Number of shelves 4 6 10 13 5 2

Skewness

A comparison of the mean, median, and mode can reveal information about the characteristic of skewness,

defined and illustrated below:

A distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half.

A distribution of data is skewed if it is not symmetric and if it extends more to one side than the other.

Lopsided to the right = skewed to left = negatively skewed

Lopsided to the left = skewed to right = positively skewed

Data not lopsided = symmetric = zero skewness


COMPARISON OF MEAN, MEDIAN, AND MODE:

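Mean

Definition: $\bar{x} = \frac{\sum x}{n}$

How common? Most familiar 'average'

Existence: Always exists

Takes every value into account? Yes

Affected by extreme values? Yes

Advantages and disadvantages: Works with many statistical methods

Median

Definition: Middle score

How common? Commonly used

Existence: Always exists

Takes every value into account? No

Affected by extreme values? No

Advantages and disadvantages: Often a good choice if there are some extreme values

Mode

Definition: Most frequent score

How common? Sometimes used

Existence: Might not exist; may be more than one mode

Takes every value into account? No

Affected by extreme values? No

Advantages and disadvantages: Appropriate for data at the nominal level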

General comments:

For a data collection that is approximately symmetric with one mode, the mean, median, and mode tend to

be about the same.

For a data collection that is obviously asymmetric, it would be good to report both the mean and median.

The mean is relatively reliable. That is, when samples are drawn from the same population, the sample

means tend to be more consistent than the other averages (consistent in the sense that the means of samples

drawn from the same population don't vary as much as the other averages).

Using Technology

Exercise:

Microsoft Excel can calculate all measures of central tendency.

Example 1: Cambridge Power and Light Company selected 20 residential customers. Following are the

amounts, to the nearest dollar, the customers were charged for electrical services last month.

54 48 58 50 25 47 75 46 60 70

67 68 39 35 56 66 33 62 65 67

What are the mean, median, and mode of these amounts?

On a new worksheet, key your data in column A. Give A1 a title. Type the data in cells A2 to A21.

To calculate:

the mean, key

=AVERAGE(A2:A21) in cell A23. You may type "Mean" in a cell beside it.

the median, key

=MEDIAN(A2:A21) in cell A24.

the mode, key

=MODE(A2:A21) in cell A25.

If there is no mode, Excel will display #N/A in that cell. If there is more than one mode, Excel will display

the one that occurs first in the data.

Example 2: The weighted mean

Carter Construction Company pays its hourly employees either $6.50, $7.50, or $8.50 per hour. There are 26

hourly employees. 14 are paid at the $6.50 rate, 10 at the $7.50 rate, and 2 at the $8.50 rate. What is the

weighted mean hourly rate paid to the 26 employees?

Key Employee in F1, Rate in G1, Product in H1

Key 14, 10, 2 in F2 to F4 respectively.

Key 6.5, 7.5, 8.5 in G2 to G4 respectively.

In H2, key =F2*G2

Make H2 your active cell. Place your cursor on the bottom right handle; it will change to a thick black plus

sign. Click and drag down to H4 to copy the formula into H3:H4.

Highlight F2:F4. From the toolbar choose the AutoSum button. The total appears in F5.

Highlight H2:H4. From the toolbar choose the AutoSum button. The total appears in H5.

In J7, key =H5/F5

You may type in “weighted mean” in an adjacent cell to this result.

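Example 3: To find the 10% trimmed mean of the data "Weights of anesthetized bears" (previous

exercise), key in the data as usual in column form, labeling the data in the first cell.

Key

=TRIMMEAN( : , 20%) in a new cell.

Excel's percent argument is the total fraction of values to exclude, split equally between the two ends, so 20% removes the bottom 10% and the top 10% of the values.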

SECTION 2.5: Measures of Variation

Because variation is so important in statistics, this is one of the most important sections.

The following key concepts are discussed in detail:

(1) Variation refers to the amount that values vary among themselves, and it can be measured with

specific numbers;

(2) Values that are relatively close together have lower measures of variation, and values that are

spread farther apart have measures of variation that are larger;

(3) The standard deviation, which is a particularly important measure of variation, can be

computed;

(4) The values of standard deviation must be interpreted correctly.

Example:

Waiting times of customers (in minutes) at the JV Bank (where all customers enter a single waiting line) and

the Bank of P (where customers wait in individual lines at three different teller windows):

JV: 6.5 6.6 6.7 6.8 7.1 7.3 7.4 7.7 7.7 7.7

P: 4.2 5.4 5.8 6.2 6.7 7.7 7.7 8.5 9.3 10.0

(a) Determine the mean, median, and mode for each data set.

(b) Interpret the results by determining whether there is a difference between the two data sets that is

not apparent from a comparison of the measures of center. If so, how are the data sets different?

We will now develop some specific ways to measure variation:


Range

The range of a data set is the difference between the highest value and the lowest value.

Range = (highest value) – (lowest value)

For JV: Range = 7.7 – 6.5 = 1.2 min

For P: Range = 10.0 – 4.2 = 5.8 min

Standard deviation of a sample

The standard deviation of a set of sample values is a measure of variation of values from the mean.

Formula

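$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$     sample standard deviation

alternative formula,

$s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$

(this form divides by n rather than n – 1, so it gives a slightly smaller value than the first formula)

There is also a shortcut formula for the standard deviation.

$s = \sqrt{\frac{n(\sum x^2) - (\sum x)^2}{n(n - 1)}}$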

Example: Use the first standard deviation formula to find the standard deviation of the JV Bank customer

waiting times. Those times (in minutes) have been listed in the first column of the table below:

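  x       $x - \bar{x}$     $(x - \bar{x})^2$
 6.5        -0.65              0.4225
 6.6        -0.55              0.3025
 6.7        -0.45              0.2025
 6.8        -0.35              0.1225
 7.1        -0.05              0.0025
 7.3         0.15              0.0225
 7.4         0.25              0.0625
 7.7         0.55              0.3025
 7.7         0.55              0.3025
 7.7         0.55              0.3025

$\sum x = 71.5$     $\sum (x - \bar{x})^2 = 2.0450$

Here $\bar{x} = 71.5/10 = 7.15$ min, so

$s = \sqrt{\frac{2.0450}{10 - 1}} = 0.48$ min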

Example: Here is the same example, but the 'shortcut formula' is used.

Find n, $\sum x$, and $\sum x^2$.

n = 10 (sample size = 10)

$\sum x = 71.5$ (sum of the 10 sample values)

$\sum x^2 = 513.27$ ($= 6.5^2 + 6.6^2 + 6.7^2 + \dots + 7.7^2$)

$s = \sqrt{\frac{10(513.27) - (71.5)^2}{10(10 - 1)}} = \sqrt{\frac{20.45}{90}}$

= 0.48 min
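Both routes to the sample standard deviation can be compared with a short program. The following Python sketch uses illustrative function names; it implements the definition formula and the shortcut formula and applies them to the JV Bank waiting times.

```python
import math


def sample_sd_definition(data):
    """s = sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))


def sample_sd_shortcut(data):
    """s = sqrt( (n * sum(x^2) - (sum(x))^2) / (n * (n - 1)) )."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


jv_times = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]

s_definition = sample_sd_definition(jv_times)
s_shortcut = sample_sd_shortcut(jv_times)
print(round(s_definition, 2), round(s_shortcut, 2))
```

The two functions agree apart from floating-point rounding, since the shortcut formula is an algebraic rearrangement of the definition.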


Standard deviation of a Population:

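Here is the formula for the standard deviation for a population, denoted by $\sigma$.

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

where $\mu$ is the population mean, and

N is the population size. Note that the population formula divides by N, not by N – 1.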

Variance of a Sample and Population

The variance of a set of values is a measure of variation equal to the square of the standard deviation.

Sample variance: square of the sample standard deviation s

Denoted by $s^2$

Example:

From one of our previous examples (JV Bank customer waiting times), s = 0.48 min, more precisely s =

0.4767 min. So,

$s^2 = (0.4767 \text{ min})^2 = 0.23 \text{ min}^2$

Population variance: square of the population standard deviation $\sigma$

Denoted by $\sigma^2$

Round off rule:

Carry one more decimal place than is present in the original set of values.

Finding Standard Deviation from a Frequency Table

Sometimes it is necessary to compute the standard deviation of a data set that is summarized in the form of a

frequency table. If the original list of sample values is available, use those values with the previous standard

deviation formulae to get more exact results. If the original data are not available, use the formula:

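$s = \sqrt{\frac{n[\sum (f \cdot x^2)] - [\sum (f \cdot x)]^2}{n(n - 1)}}$

where x is the class midpoint, f is the class frequency, and $n = \sum f$.

Example: Use the following table to calculate the standard deviation.

Word rating     f      x      fx     $fx^2$
  0 – 2        20      1      20       20
  3 – 5        14      4      56      224
  6 – 8        15      7     105      735
  9 – 11        2     10      20      200
 12 – 14        1     13      13      169

$\sum f = 52$     $\sum (f \cdot x) = 214$     $\sum (f \cdot x^2) = 1348$

$s = \sqrt{\frac{52(1348) - (214)^2}{52(52 - 1)}} = \sqrt{\frac{24300}{2652}}$

s = 3.0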
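The frequency-table formula can be checked in the same way. This Python sketch assumes the class midpoints and frequencies are supplied as two lists; the name `frequency_table_sd` is illustrative.

```python
import math


def frequency_table_sd(midpoints, freqs):
    """Sample standard deviation estimated from class midpoints x and frequencies f."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))


# Word rating example: midpoints of the classes 0-2, 3-5, ..., 12-14.
rating_midpoints = [1, 4, 7, 10, 13]
rating_freqs = [20, 14, 15, 2, 1]

rating_sd = frequency_table_sd(rating_midpoints, rating_freqs)
print(round(rating_sd, 1))
```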


Interpreting and understanding Standard Deviation

The standard deviation measures the variation among values. Values close together will yield a small

standard deviation, whereas values spread farther apart will yield a larger standard deviation.

Three different ways of developing a sense for values of standard deviation:

(1) Range rule of thumb

Based on the principle that for many data sets, the vast majority (such as 95%) of sample values lie within 2

standard deviations of the mean:

$\bar{x} \pm 2s$

For estimation:

To obtain a rough estimate of the standard deviation s, use the equation

$s \approx \frac{\text{range}}{4}$

where range = (highest value) – (lowest value)

Example:

Previous results from the National Health Survey show that the heights of men have a mean of 69.0 inches

and a standard deviation of 2.8 inches. Use the range rule of thumb to find the minimum and maximum

'usual' heights.

$\bar{x} \pm 2s = 69.0 \pm 2(2.8)$, which gives a minimum of 63.4 inches and a maximum of 74.6 inches.

Based on these results, we expect that typical men will range in height between 63.4 inches and 74.6 inches.

(2) Empirical Rule for data with a bell-shaped distribution

For data sets having a distribution that is approximately bell-shaped, the following properties apply:

about 68% of all values fall within 1 standard deviation of the mean ($\bar{x} \pm s$)

about 95% of all values fall within 2 standard deviations of the mean ($\bar{x} \pm 2s$)

about 99.7% of all values fall within 3 standard deviations of the mean ($\bar{x} \pm 3s$)

Example:

The heights of men have a bell-shaped distribution with a mean of 69.0 inches and a standard deviation of

2.8 inches (based on data from the National Health Survey). What percentage of men have heights between

60.6 inches and 77.4 inches?

Solution:

60.6 inches and 77.4 inches are each exactly 3 standard deviations away from the mean of 69.0 inches.

$69.0 \pm 3(2.8)$

According to the empirical rule, 99.7% of all men's heights are between 60.6 inches and 77.4 inches.

(3) Chebyshev’s theorem

Chebyshev's theorem applies to any data set, unlike the empirical rule, but its results are very approximate.

It says:

The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at

least $1 - 1/K^2$, where K is any positive number greater than 1. For K = 2 and K = 3, the following results are

obtained:


at least ¾ (or 75%) of all values lie within 2 standard deviations of the mean.

at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean.

Example:

Heights of men have a mean of 69.0 inches and a standard deviation of 2.8 inches. What can we conclude

from Chebyshev's theorem?

Within 2 standard deviations

At least ¾ (or 75%) of all men have heights within 2 standard deviations of the mean (63.4 in. – 74.6 in.).

Within 3 standard deviations

At least 8/9 (or 89%) of all men have heights within 3 standard deviations of the mean (60.6 in. – 77.4 in.).
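The intervals and bounds used in the three approaches above can be computed with two small helpers. This is an illustrative Python sketch; the function names `interval_within` and `chebyshev_lower_bound` are hypothetical.

```python
def interval_within(mean, sd, k):
    """Return (mean - k*sd, mean + k*sd), the values within k standard deviations of the mean."""
    return (mean - k * sd, mean + k * sd)


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires K greater than 1")
    return 1 - 1 / k ** 2


# Heights of men: mean 69.0 inches, standard deviation 2.8 inches.
for k in (2, 3):
    low, high = interval_within(69.0, 2.8, k)
    bound = chebyshev_lower_bound(k)
    print(k, round(low, 1), round(high, 1), round(bound, 2))
```

For K = 2 the interval is the same one given by the range rule of thumb; Chebyshev's theorem guarantees at least 75% of values inside it for any data set, whereas the empirical rule's figure of about 95% applies only to bell-shaped data.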

Exercise:

1] Find the range, variance, and standard deviation for each of the two samples, then compare the two

sets of results.

Maximum breadth of samples of male Egyptian skulls from 4000 BC and 150 AD

4000 BC: 131 119 138 125 129 126 131 132 126 128 128 131

150 AD: 136 130 126 126 139 141 137 138 133 131 134 129

(Based on data from Ancient Races of the Thebaid by Thomson and Randall-Maciver.)

2] Find the standard deviation of the data summarized in the given frequency table.

Samples of student cars and faculty/staff cars were obtained at a certain college, and their ages (in

years) are summarized in the frequency table.

Age Students Faculty/staff

0 -2 23 30

3 - 5 33 47

6 - 8 63 36

9 - 11 68 30

12 - 14 19 8

15 - 17 10 0

18 - 20 1 0

21 - 23 0 1

3] Two different sections of a statistics class take the same quiz and the scores are recorded below.

Do a double stem-and-leaf plot for both data sets. Discuss the variation of the data in each of the sets

and compare.

Find the range and standard deviation for each section.

What do the range values lead you to conclude about the variation in the two sections?

Why is the range misleading in this case?

What do the standard deviation values lead you to conclude about the variation in the two sections?

Section 1: 1 20 20 20 20 20 20 20 20 20 20

Section 2: 2 3 4 5 6 14 15 16 17 18 19


4] Let a population consist of the values 1, 2, and 3. Assume that samples of two values are

randomly selected with replacement.

(a) Find the variance $\sigma^2$ of the population {1, 2, 3}.

(b) List the nine different possible samples and find the sample variance $s^2$ for each of them. If you

repeatedly select two items, what is the mean value of the sample variances $s^2$?

(c) For each of the nine samples, find the variance by treating each sample as if it is a population.

(Be sure to use the formula for population variance.) If you repeatedly select two items,

what is the mean value of the population variances?

(d) Which approach results in values that are better estimates of $\sigma^2$: part (b) or part (c)?

Why?

When computing variances of samples, should you use division by n or n – 1?

(e) The preceding parts show that $s^2$ is an unbiased estimator of $\sigma^2$. Is s an unbiased estimator of $\sigma$?

Using technology

Here are the standard Excel commands for calculating:

(i) Range

=MAX( : ) - MIN( : )

(ii) Sample variance, $s^2$

=VAR( : )

Population variance, $\sigma^2$

=VARP( : )

(iii) Sample standard deviation, s

=STDEV( : )

Population standard deviation, $\sigma$

=STDEVP( : )

Use example 1 above to calculate the measures of variation (dispersion) using the appropriate commands

from above.

NOTE:

Microsoft Excel can create a summary of all calculations for descriptive statistics.

Under Tools, locate Data Analysis. Locate Descriptive Statistics. Follow the instructions in the dialog

box. Click on Input range and key in the range of cells where the data has been placed (example:

A2:A40). Leave the Columns button selected. Click on Summary statistics. Then click OK.

SECTION 2.6: Measures of Position

This section introduces measures that can be used to compare values from different data sets or to compare

values within the same data set. The basic tools are z scores, quartiles, and percentiles.

z Scores

A standard score, or z score, is the number of standard deviations that a given value x is above or below the

mean. It is found using the expressions

Sample: $z = \frac{x - \bar{x}}{s}$

Population: $z = \frac{x - \mu}{\sigma}$


Example:

Former NBA superstar Michael Jordan is 78 inches tall, and WNBA basketball player Rebecca Lobo is 76

inches tall. Jordan is obviously taller by 2 inches, but which player is relatively taller? Does Jordan's height

among men exceed Lobo's height among women?

Men have heights with a mean of 69.0 inches and a standard deviation of 2.8 inches; women have heights

with a mean of 63.6 inches and a standard deviation of 2.5 inches.

Solution:

              For Jordan      For Lobo
x             78 in.          76 in.
$\sigma$      2.8 in.         2.5 in.
$\mu$         69.0 in.        63.6 in.

Using $z = \frac{x - \mu}{\sigma}$ for both data sets:

For Jordan: $z = \frac{78 - 69.0}{2.8} = 3.21$

For Lobo: $z = \frac{76 - 63.6}{2.5} = 4.96$

Interpretation:

Jordan's height is 3.21 standard deviations above the mean, but Lobo's height is 4.96 standard deviations

above the mean. Lobo's height among women is relatively greater than Jordan's height among men.

Consider another player, Mugsy Bogues, who is only 63 inches tall. When his height is converted to a z

score,

$z = \frac{63 - 69.0}{2.8} = -2.14$

his height is 2.14 standard deviations below the mean (because the z score is negative). He is a relatively

short person amongst the population of men.
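The z score comparisons above can be reproduced with a one-line function. This is a minimal Python sketch; the name `z_score` and the variable names are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Heights in inches; men: mean 69.0, sd 2.8; women: mean 63.6, sd 2.5.
z_jordan = z_score(78, 69.0, 2.8)
z_lobo = z_score(76, 63.6, 2.5)
z_bogues = z_score(63, 69.0, 2.8)

for name, z in [("Jordan", z_jordan), ("Lobo", z_lobo), ("Bogues", z_bogues)]:
    print(name, round(z, 2))
```

Because each height is measured against its own population, the z scores can be compared directly even though the raw heights come from different distributions.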

NOTE:

Ordinary values fall within 2 standard deviations of the mean: $\bar{x} \pm 2$ standard deviations.

$-2 \le z \text{ score} \le 2$

Unusual values fall more than 2 standard deviations below or above the mean.

z score < -2 or z score > 2

Quartiles and Percentiles

Quartiles (Q1, Q2, Q3) divide the sorted values into four equal parts.

For Q1,

At least 25% of the sorted values will be less than or equal to Q1.

Q1 = ¼ (n + 1)th value

For Q2,

At least 50 % of the sorted values will be less than or equal to Q2.


Q2 = median

Q2 = ½(n + 1)th value

For Q3,

At least 75% of the sorted values will be less than or equal to Q3.

Q3 = ¾(n + 1)th value

Percentiles (denoted by Pn) are the 99 values which split a distribution into 100 equal portions.

For example, the 10th percentile, P10, is

P10 = (10/100)(n + 1)th value

and the 90th percentile, P90, is

P90 = (90/100)(n + 1)th value

Note:

Q1 = P25

Q2 = P50

Q3 = P75

Example:

The table lists 36 weights (in pounds) of the contents of 36 cans of regular Coke.

(a) Find the quartiles

(b) Find the 10th and 90th percentiles

(c) Find the percentile corresponding to the weight of 0.8143 lbs.

0.7901 0.8044 0.8062 0.8073 0.8079 0.8110

0.8126 0.8128 0.8143 0.8150 0.8150 0.8152

0.8152 0.8161 0.8161 0.8163 0.8165 0.8170

0.8172 0.8176 0.8181 0.8189 0.8192 0.8192

0.8194 0.8194 0.8207 0.8211 0.8229 0.8244

0.8244 0.8247 0.8251 0.8264 0.8284 0.8295

Solution:

The data must be sorted in order of size, if it isn't already. This set has already been sorted.

(a) Quartiles

Q1 = ¼(n + 1)th value

= ¼(36 + 1)th value

= 9.25th value

The 9th value is 0.8143 and the 10th value is 0.8150.

The difference is 0.8150 – 0.8143 = 0.0007

So

Q1 = 0.8143 + ¼(0.0007)

= 0.814475

Q2 = ½(n + 1)th value

= ½(36 + 1)th value

= 18.5th value

Q2 is the average of the two middle entries, the 18th and 19th values. (Remember: Q2 = median)

$Q2 = \frac{0.8170 + 0.8172}{2}$

= 0.8171


Q3 = ¾(n + 1)th value

= ¾(36 + 1)th value

= 27¾th value

The 27th value is 0.8207 and the 28th value is 0.8211.

The difference is 0.8211 – 0.8207 = 0.0004

So

Q3 = 0.8207 + ¾(0.0004)

= 0.8210

(b) P10 and P90

P10 = (10/100)(n + 1)th value

= (10/100)(36 + 1)th value

= 3.7th value

P10 =

P90 = (90/100)(n + 1)th value

= (90/100)(36 + 1)th value

= 33.3rd value

P90 =

(c) To determine the percentile which corresponds to a certain value from the data set, use

percentile of value x = $\frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$

Percentile of 0.8143 lb = $\frac{8}{36} \cdot 100$

= 22 (rounded)

The weight of 0.8143 lb is the 22nd percentile.
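The quartile and percentile calculations for the Coke data can be sketched in Python as follows. The helper names are hypothetical, and the code follows the (n + 1) position rule with linear interpolation used in these notes; spreadsheet and statistics packages may use slightly different conventions and give slightly different answers.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    if not 1 <= position <= len(sorted_data):
        raise ValueError("position lies outside the data")
    whole = int(position)
    fraction = position - whole
    lower = sorted_data[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_data[whole]
    return lower + fraction * (upper - lower)


def percentile_value(sorted_data, p):
    """P_p, located at the (p/100)(n + 1)th value."""
    return value_at_position(sorted_data, p / 100 * (len(sorted_data) + 1))


def percentile_of(sorted_data, x):
    """Percentile of x: (number of values less than x) / (total number) * 100, rounded."""
    below = sum(1 for v in sorted_data if v < x)
    return round(below / len(sorted_data) * 100)


coke = [
    0.7901, 0.8044, 0.8062, 0.8073, 0.8079, 0.8110,
    0.8126, 0.8128, 0.8143, 0.8150, 0.8150, 0.8152,
    0.8152, 0.8161, 0.8161, 0.8163, 0.8165, 0.8170,
    0.8172, 0.8176, 0.8181, 0.8189, 0.8192, 0.8192,
    0.8194, 0.8194, 0.8207, 0.8211, 0.8229, 0.8244,
    0.8244, 0.8247, 0.8251, 0.8264, 0.8284, 0.8295,
]

q1, q2, q3 = (percentile_value(coke, p) for p in (25, 50, 75))
print(round(q1, 6), round(q2, 4), round(q3, 4))
print(percentile_of(coke, 0.8143))
```

The same `percentile_value` function can be called with p = 10 and p = 90 to check the answers to part (b).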

Using Technology

Excel's Rank and Percentile analysis tool produces a table showing the rank order and percentile for each

value in a data set. As an alternative to the analysis tool, these results could also be obtained using Excel's

Data Sort and Edit Fill commands.

Exercise:

Use Excel to determine the information required in the exercise on 36 weights (in pounds) of 36 cans of

regular Coke (previous exercise). Compare with your manual calculations.

Exercise: Use Microsoft Excel wherever possible.

Express all z scores with two decimal places. Consider a value to be unusual if its z score is less than –2.00

or greater than 2.00.

1. Human body temperatures have a mean of 98.20º and a standard deviation of 0.62º. An

emergency room patient is found to have a temperature of 101º. Convert 101º to a z score. Is that

temperature unusually high? What does it suggest?


2. Scores on a history test have a mean of 80 and a standard deviation of 12. Scores on a

psychology test have a mean of 30 and a standard deviation of 8. Which is relatively better: a

score of 75 on the history test or a score of 27 on the psychology test?

3. Refer to the data set for the sample of 36 weights of regular Coke. Convert the weight of 0.7901

to a z score. Is 0.7901 an unusual weight for regular Coke?

4. Refer to the data set for weights of anesthetized bears and find the indicated percentile or quartile.

(i) P85 (ii) P35 (iii) Q1 (iv) Q3 (v) P50

5. The first several terms of the famous Fibonacci Sequence are 1, 1, 2, 3, 5, 8, 13.

(a) Find the mean $\bar{x}$ and standard deviation s, then convert each value to a z score. Don't round

the z scores; carry as many places as your calculator can handle.

(b) Find the mean and standard deviation of the z scores found in part (a).

(c) If you use any other data set, will you get the same results obtained in part (b)?

SECTION 2.7: Exploratory Data Analysis EDA

Exploratory data analysis is the process of using statistical tools (such as graphs, measures of center,

measures of variation) to investigate data sets in order to understand their important characteristics.

When exploring a data set, we usually want to calculate the mean and the standard deviation and to generate

a graph, usually a histogram/bar chart. It is also important to further examine the data set to identify any

notable features, especially those that could have a strong effect on results and conclusions (for example,

outliers).

Outliers

An outlier is a value that is located very far away from almost all of the other values. Relative to the other

data, an outlier is an extreme value.

Effects of an outlier

(1) An outlier can have a dramatic effect on the mean, and hence, in some instances, a trimmed mean is

calculated.

(2) An outlier can have a dramatic effect on the standard deviation.

(3) An outlier can have a dramatic effect on the scale of the histogram/bar chart so that the true nature of the

distribution is totally obscured.

An easy way to find outliers is to examine a sorted list of the data. Look at the minimum and maximum

sample values and determine whether they are very far away from the other typical values. When an outlier

occurs because of a nonsampling error, it should either be corrected or deleted. However, some data sets

include outliers that are correct values.

To study the effects of outliers, we can construct graphs and calculate statistics with and without the outliers

included.

Boxplots

Boxplots are useful for revealing the center of the data, the spread of the data, the distribution of the data,

and the presence of outliers.

A boxplot is a graph that consists of a line extending from the minimum value to the maximum value, and a box with

lines drawn at the first quartile Q1, the median, and the third quartile Q3.


Example:

Comparing Ages of Oscar Winners.

In "Ages of Oscar-winning Best Actors and Actresses" (Mathematics Teacher magazine) by Richard

Brown and Gretchen Davis, stem-and-leaf plots are used to compare the ages of actors and actresses at

the time they won Oscars. Here are the results for recent winners from each category:

Actors:

32 37 36 32 51 53 33 61 35 45 55 39

76 37 42 40 32 60 38 56 48 48 40 43

62 43 42 44 41 56 39 46 31 47 45 60

Actresses:

50 44 35 80 26 28 41 21 61 38 49 33

74 30 33 41 31 35 41 42 37 26 34 34

35 26 61 60 34 24 30 37 31 27 39 34

Use boxplots to compare the two data sets.

Exploring

We now have the following tools to explore data sets:

measures of center: mean, median, mode

measures of variation: standard deviation and range

measures of spread and relative location: minimum value, maximum value, and quartiles

unusual values: outliers

distribution: histograms, stem-and-leaf plots, and box plots

Rather than simply producing statistics and graphs, try to identify those that are particularly interesting and

important. As a first step, investigate outliers and consider their effects by finding measures and graphs with

and without the outliers included.

Exercise: PROJECT

The traditional typewriter keyboard configuration is called a Qwerty keyboard because of the position of the

letters QWERTY in the top row of letters. Developed in 1872, the Qwerty configuration was supposed to

force typists to slow down so that their machines would be less likely to jam. The Dvorak keyboard,

developed in 1936, positioned the keys most frequently used in the middle (or 'home') row, a move intended

to improve efficiency. Both keyboard configurations are shown in the accompanying illustration.


An article in the magazine Discover suggests that you can measure the ease of typing by using this point rating

system: count each letter on the home row as 0, each letter on the top row as 1, and each letter on the bottom

row as 2 (see 'Typecasting' by Scott, Kim, Discover). For example, the word statistics would have a rating

of 7 on the Qwerty keyboard and a rating of 1 on the Dvorak keyboard:

            s  t  a  t  i  s  t  i  c  s
Qwerty:     0  1  0  1  1  0  1  1  2  0    (sum = 7)
Dvorak:     0  0  0  0  0  0  0  0  1  0    (sum = 1)

This rating system was used with each of the 52 words in a certain document and the rating values are shown

below:

Table 1: Qwerty keyboard word ratings

2 2 5 1 2 6 3 3 4 2

4 0 5 7 7 5 6 6 8 10

7 2 2 10 5 8 2 5 4 2

6 2 6 1 7 2 7 2 3 8

1 5 2 5 2 14 2 2 6 3

1 7

Table 2: Dvorak keyboard word ratings

2 0 3 1 0 0 0 0 2 0

4 0 3 4 0 3 3 1 3 5

4 2 0 5 1 4 0 3 5 0

2 0 4 1 5 0 4 0 1 3

0 1 0 3 0 1 2 0 0 0

1 4

Exploratory Data Analysis

(a) Organize each data set in a frequency table, using 5 classes with the first being 0 – 2.

(b) Construct the appropriate histogram/bar chart for each data set from the frequency tables in part

(a)

(c) Construct the frequency polygons on the same axes for Qwerty and Dvorak keyboards, using the

frequency tables in part (a).

(d) From the graphs above, describe the type of distribution (skewed negatively or positively,

symmetric) displayed by each data set.

(e) Calculate all measures of center and summarize them in the table below:

Qwerty Dvorak

Mean

Median

Mode

(f) Calculate all measures of variation and summarize them in the table below:

Qwerty Dvorak

Range

Standard deviation

Variance


(g) On the same axes, construct the boxplot (using the 5-number summary) for each of the given data

sets. Identify any outliers. Which of these sets of ratings (Qwerty or Dvorak) appears to have more

spread? (Use information from the boxplots.)

(h) Analysis: Using the graphs and measures you have determined above, comment on the level of

typing difficulty using each of the keyboards. Do both keyboards appear to have the same level

of difficulty or is one easier to use than the other?

So far we have treated the two data sets as if they were separate and independent, but the word ratings came

from the same 52 words in the same document. We should therefore explore the differences between the

pairs of ratings corresponding to each of the 52 words.

For example: let us say the first word in the document was the word 'we'.

            w  e
Qwerty:     1  1    (sum = 2)
Dvorak:     2  0    (sum = 2)

The difference is 2 – 2 = 0.

(i) Find the difference of each pair of data from both data sets:

(j) Determine the mean, standard deviation, 5-number summary boxplot, any outliers for the set of

data consisting of the 52 differences.

(k) Construct an appropriate graph to discuss the type of distribution.

(l) If both keyboards were to have the same level of difficulty, we would expect the differences

between word ratings to average around 0. What do the mean difference and the graph tell you

about the level of difficulty of one keyboard compared to the other? Does this support your earlier

analysis in part (h)?

(m) Would removing any outliers affect the conclusion(s) you have drawn? Show any necessary

calculations to support your statement.
