Chapter 2. Describing Data Ir.Muhril Ardiansyah,M.Sc.,Ph.D.1 Chapter 2. Describing Data
Chapter 3 - Describing Comparing Data
Probability & Statistics – Chapter 2 – Describing, Exploring & Comparing Data 1
SECTION 2.4: Measures of Center
Three main statistical measures attempt to locate the center of a data set. The main objectives
of this section are to present these measures of center and to show how to compute them.
Definition:
A measure of center is a value at the center or middle of a data set.
The measures of center that we will work with are the mean (arithmetic mean), the mode, and the median.
The arithmetic mean
For a sample from a larger population, the mean is denoted by $\bar{x}$.
If all the values of the population are used, then the mean is denoted by $\mu$.
Notation:
$\sum$ denotes the addition of a set of values
x is the variable usually used to represent the individual data values
n represents the number of values in a sample.
N represents the number of values in a population.
$\bar{x} = \frac{\sum x}{n}$ is the mean of a set of sample values
$\mu = \frac{\sum x}{N}$ denotes the mean of all values in a population
For
(a) Raw Data
Example: Find the mean of the set of numbers: 63, 65, 67, 68, 69, 70, 71, 72, 74, 75
Solution: n = 10
$\sum x = 694$
$\bar{x} = 694/10$
$\bar{x} = 69.4$
(b) Ungrouped frequency distribution
For a frequency distribution
$\bar{x} = \frac{\sum fx}{\sum f}$
Example: The 30 members of an orchestra were asked how many instruments each could play. The results
are set out in the frequency distribution. Calculate the mean number of instruments played:
Number of instruments, x 1 2 3 4 5
Frequency, f 11 10 5 3 1
x      f      fx
1      11     11
2      10     20
3      5      15
4      3      12
5      1      5
$\sum f = 30$      $\sum fx = 63$
$\bar{x} = \frac{\sum fx}{\sum f}$
= 63/30
= 2.1
The mean number of instruments played is 2.1.
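The frequency-table calculation above can be sketched in Python. This is only an illustration; the function name frequency_mean is an assumption and does not come from the notes.

```python
def frequency_mean(values, frequencies):
    """Mean of an ungrouped frequency distribution: sum(f * x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, frequencies))
    total_f = sum(frequencies)
    return total_fx / total_f


# Orchestra example: number of instruments played and how many members play that many
instruments = [1, 2, 3, 4, 5]
frequency = [11, 10, 5, 3, 1]

mean_instruments = frequency_mean(instruments, frequency)
print(mean_instruments)
```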
(c) Grouped frequency distribution
When data has been grouped into intervals, the midpoint, x, of the interval is taken to represent the
interval.
Example: The lengths of 40 bean pods were measured to the nearest cm and grouped as shown.
Find the mean length, giving the answer to 1 d.p.
Length (cm)   Midpoint, x   f     fx
4 – 8         6             2     12
9 – 13        11            4     44
14 – 18       16            7     112
19 – 23       21            14    294
24 – 28       26            8     208
29 – 33       31            5     155
$\sum f = 40$      $\sum fx = 825$
$\bar{x} = \frac{\sum fx}{\sum f}$
= 825/40
$\bar{x} = 20.6$ (1 d.p.)
The mean length of the bean pods is 20.6 cm (1d.p.)
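For grouped data the same idea applies once each class is replaced by its midpoint. The sketch below recomputes the bean pod example; the variable names are illustrative assumptions.

```python
# Class limits (cm) and frequencies from the bean pod table
classes = [(4, 8), (9, 13), (14, 18), (19, 23), (24, 28), (29, 33)]
frequencies = [2, 4, 7, 14, 8, 5]

# The midpoint of each class stands in for every value in that class
midpoints = [(low + high) / 2 for low, high in classes]

total_fx = sum(f * x for x, f in zip(midpoints, frequencies))
grouped_mean = total_fx / sum(frequencies)
print(round(grouped_mean, 1))
```

Because the raw lengths are lost once the data are grouped, the result is an estimate of the mean rather than its exact value.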
Weighted mean
In some situations, the values vary in their degree of importance, so we may want to compute a weighted
mean, which is a mean computed with the different scores assigned different weights. In such cases, we can
calculate the weighted mean by assigning different weights to different values, as shown in the formula
below:
Weighted mean, $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
Example: Suppose we need a mean of three test scores (85, 90, 75), but the first test counts for 20%, the
second test counts for 30%, and the third test counts for 50% of the final grade. We can assign
weights of 20, 30, and 50 to the test scores, as follows:
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
= [(20 × 85) + (30 × 90) + (50 × 75)] / (20 + 30 + 50)
= 81.5
The weighted mean formula is used to calculate grade-point average.
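A minimal Python sketch of the weighted mean formula, applied to the three test scores above. The function name weighted_mean is an assumption made for illustration.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


scores = [85, 90, 75]
weights = [20, 30, 50]  # percentage of the final grade carried by each test

final_grade = weighted_mean(scores, weights)
print(final_grade)
```

Only the relative sizes of the weights matter, so weights of 0.2, 0.3 and 0.5 would lead to the same result.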
Trimmed mean
An important advantage of the mean is that it takes every value into account, but an important disadvantage
is that it is sometimes dramatically affected by a few extreme values (outliers). Because the mean is very
sensitive to extreme values, we say that it is not a resistant measure of center.
To overcome this disadvantage, a trimmed mean can be used.
To find the 10% trimmed mean for a data set, first arrange the data in order, then delete the bottom 10% of
the values and the top 10% of the values, and calculate the mean of the remaining values.
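The trimming procedure can be sketched as follows. The data set is hypothetical (it is not taken from the notes), and rounding the number of deleted values down when 10% of n is not a whole number is an assumption; other conventions exist.

```python
def trimmed_mean(values, fraction):
    """Mean after deleting `fraction` of the values from each end of the sorted data."""
    ordered = sorted(values)
    # Assumption: when fraction * n is not a whole number, round the cut count down
    cut = int(len(ordered) * fraction)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value
sample = [2, 4, 5, 5, 6, 6, 7, 7, 8, 50]

plain_mean = sum(sample) / len(sample)
ten_percent_trimmed = trimmed_mean(sample, 0.10)
print(plain_mean, ten_percent_trimmed)
```

Comparing the two results shows how strongly a single outlier can pull the ordinary mean.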
Exercise:
Determine the arithmetic mean for the given set of data.
Then determine the trimmed mean for the same data set, and compare both results:
Weights of anesthetized bears:
80 344 416 348 166 220 262 360 204 144 332 34 140 180 105
166 204 26 120 436 125 132 90 40 220 46 154 116 182 150
65 356 316 94 86 150 270 202 365 79 148 446 62 236 212
60 64 114 76 48 29 514 140
The median
The median of a data set is the middle value when the original data values are arranged in order of
increasing (or decreasing) magnitude.
If there are n numbers the median is the ½ (n + 1)th value.
Procedure for finding the median:
Sort the data (arrange in increasing order).
Is the number of values odd or even?
Odd: the median is the value in the exact middle.
Even: the median is the mean of the two middle numbers (add the middle numbers, then divide by 2).
Determining the median from
(a) Raw Data:
Example: Find the median of each of the sets.
(i) 7, 7, 2, 3, 4, 2, 7, 9, 31
Solution:
In order of magnitude:
2, 2, 3, 4, 7, 7, 7, 9, 31
n = 9. The median is the ½(9 + 1)th value, i.e. the 5th value.
So median = 7
(ii) 36, 41, 27, 32, 29, 38, 39, 43
Solution:
In order of magnitude:
27, 29, 32, 36, 38, 39, 41, 43
n = 8 and the median is the ½(8 + 1)th value, i.e. the 4½th value.
This does not exist, so we consider the 4th and 5th values.
Median = ½(36 + 38)
= 37
(b) Ungrouped frequency distribution:
The median can be found directly from the cumulative frequency distribution.
Example: The table below shows the number of children in the family for 35 families in a certain area. Find
the median number of children per family.
Number of children   Frequency   Cumulative frequency, cf
0                    3           3
1                    5           8
2                    12          20
3                    9           29
4                    4           33
5                    2           35
The median is the 18th value: ½(35 + 1) = 18
We could have written out all the values in order from the frequency table, thus 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, ….
However, we can see from the cumulative frequency table that the 18th value is 2, as the first 8 values are 0
or 1 and the first 20 values are 0 or 1 or 2.
So median = 2
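Reading the median off the cumulative frequencies can be sketched in Python. The function name is an assumption, and the sketch assumes the values are listed in increasing order.

```python
from itertools import accumulate


def median_from_frequency_table(values, frequencies):
    """Median of an ungrouped frequency distribution, using cumulative frequencies.

    Assumes `values` are listed in increasing order.
    """
    n = sum(frequencies)
    cumulative = list(accumulate(frequencies))

    def value_at(position):
        # Return the value occupying the given 1-based position in the sorted data
        for value, cf in zip(values, cumulative):
            if position <= cf:
                return value

    if n % 2 == 1:
        return value_at((n + 1) // 2)
    # Even n: average the two middle values
    return (value_at(n // 2) + value_at(n // 2 + 1)) / 2


children = [0, 1, 2, 3, 4, 5]
families = [3, 5, 12, 9, 4, 2]

median_children = median_from_frequency_table(children, families)
print(median_children)
```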
(c) Grouped frequency distribution:
Once the information has been grouped and the raw data lost we can only estimate a value for the median.
We will consider two methods to determine an approximate value for the median:
(i) by calculation
(ii) from a cumulative frequency curve
Example: The masses, measured to the nearest kg, of 49 boys are noted and the distribution formed.
Estimate the median mass.
Mass (kg)   f      Mass (kg)   cf
– 59        0      < 59.5      0
60 – 64     2      < 64.5      2
65 – 69     6      < 69.5      8
70 – 74     12     < 74.5      20
75 – 79     14     < 79.5      34
80 – 84     10     < 84.5      44
85 – 89     5      < 89.5      49
The median is the ½(49 + 1)th value, i.e. the 25th value.
Method (i): by calculation
The 25th value lies in the class 74.5 – 79.5.
There are 20 values below 74.5 and 14 items in this class, so the median is the 5th of these 14 items. It is
therefore estimated to lie 5/14 of the way along the interval of 5 kg from 74.5 to 79.5.
Estimate of the median mass = 74.5 + (5/14)(5)
= 76.3 kg (1 d.p.)
Method (ii): from the cumulative frequency curve
Draw the ogive (cf curve) and read off the value corresponding to a cumulative frequency of 25.
[Figure: Ogive showing the masses of 49 boys. Horizontal axis: mass (kg), 59.5 to 89.5; vertical axis: cumulative frequency, 0 to 60.]
From the graph, the value corresponding to the cumulative frequency of 25 is 76.3 kg.
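The calculation method amounts to linear interpolation inside the class that contains the median. A sketch under that assumption, with illustrative names, using the class boundaries and frequencies from the table:

```python
def grouped_median(boundaries, frequencies):
    """Estimate the median of grouped data by linear interpolation.

    boundaries[i] and boundaries[i + 1] are the lower and upper
    boundaries of class i; frequencies[i] is its frequency.
    """
    n = sum(frequencies)
    target = (n + 1) / 2  # the ½(n + 1)th value, as used in the notes
    below = 0             # cumulative frequency before the current class
    for i, f in enumerate(frequencies):
        if f > 0 and below + f >= target:
            lower, upper = boundaries[i], boundaries[i + 1]
            return lower + (target - below) / f * (upper - lower)
        below += f


boundaries = [59.5, 64.5, 69.5, 74.5, 79.5, 84.5, 89.5]
frequencies = [2, 6, 12, 14, 10, 5]

median_mass = grouped_median(boundaries, frequencies)
print(round(median_mass, 1))
```

This is the same straight-line assumption that reading the value off an ogive drawn with straight segments makes.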
The mode:
The mode is the value that occurs most often (has the highest frequency). For a given data set, more than
one mode can exist.
Two modes: bimodal
More than two modes: multimodal
Determining the mode from
(a) Raw data:
Example: Find the mode(s) of each of the following sets:
(i) 4, 5, 5, 1, 2, 9, 5, 6, 4, 5, 7, 5, 5
Solution: mode = 5
(ii) 2, 2, 3, 5, 8, 2, 5, 6, 6, 5
Solution: modes = 2, 5 (The distribution is bimodal.)
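Counting frequencies makes the mode easy to find by program. In this sketch the function name modes is an assumption, and returning an empty list when no value repeats is one possible convention for "no mode".

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Convention assumed here: if no value repeats, report no mode
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([4, 5, 5, 1, 2, 9, 5, 6, 4, 5, 7, 5, 5]))
print(modes([2, 2, 3, 5, 8, 2, 5, 6, 6, 5]))
```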
(b) Grouped data:
When data has been grouped into classes, the class which has the largest frequency is called the
modal class. An estimate of the mode can be obtained from the modal class.
Example:
Estimate the mode of the following frequency distribution which shows the marks of 330 candidates in an
examination.
marks 11- 20 21- 30 31 - 40 41 - 50 51 - 60 61 - 70 71 - 80 81 - 90 91 – 100
frequency 20 40 80 100 50 20 10 10 0
Solution: First a histogram or bar chart is constructed.
[Figure: Histogram to show examination marks. Horizontal axis: marks, 10 to 100; vertical axis: frequency, 0 to 120.]
The modal class is 41 – 50.
The modal class contains 20 more than the class below and 50 more than the class above. So the mode is
likely to divide the modal class in the ratio 20:50 = 2:5.
An estimate of the mode can be found from the histogram by drawing lines as shown in the diagram. This
gives a value of 43 marks.
By calculation:
An estimate of the mode lies 20/(20 + 50) of the way along the interval of 10 marks from 40 to 50.
Estimate of mode = 40 + (2/7)(10)
= 42.9
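The calculation above divides the modal class in the ratio of the two frequency differences. A small sketch of that rule; the function and parameter names are assumptions.

```python
def grouped_mode(lower, width, f_modal, f_below, f_above):
    """Estimate the mode from the modal class and its two neighbouring classes."""
    d_below = f_modal - f_below  # excess over the class below
    d_above = f_modal - f_above  # excess over the class above
    return lower + d_below / (d_below + d_above) * width


# Modal class 41 – 50 (frequency 100), with neighbours of frequency 80 and 50
mode_estimate = grouped_mode(lower=40, width=10, f_modal=100, f_below=80, f_above=50)
print(round(mode_estimate, 1))
```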
Exercise:
1] If the mean of the following numbers is 17, find the value of c: 12, 18, c, 13
2] The mean of 10 numbers is 8. If an eleventh number is now included in the results, the mean
becomes 9. What is the value of the eleventh number?
3] The mean of 4 numbers is 5, and the mean of 3 different numbers is 12. What is the mean of the 7
numbers together?
4] A bag contained five balls each bearing one of the numbers 1, 2, 3, 4, 5. A ball was drawn from the
bag, its number noted, and then replaced. This was repeated 50 times and the table below shows the
resulting frequency distribution.
Number 1 2 3 4 5
Frequency x 11 y 8 9
If the mean is 2.7,
(i) determine the value of x and y.
(ii) state the mode and median of this distribution.
5] On a certain day the number of books on 40 shelves in a library was noted and grouped as shown.
Find the mean number of books on a shelf. Give your answer to 2 significant figures.
Number of books 31 - 35 36 - 40 41 - 45 46 - 50 51 - 55 56 – 60
Number of shelves 4 6 10 13 5 2
Skewness
A comparison of the mean, median, and mode can reveal information about the characteristic of skewness,
defined and illustrated below:
A distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half.
A distribution of data is skewed if it is not symmetric and if it extends more to one side than the other.
Lopsided to the right = skewed to left = negatively skewed
Lopsided to the left = skewed to right = positively skewed
Data not lopsided = symmetric = zero skewness
COMPARISON OF MEAN, MEDIAN, AND MODE:
Average | Definition | How common? | Existence | Takes every value into account? | Affected by extreme values? | Advantages and disadvantages
Mean | $\bar{x} = \frac{\sum x}{n}$ | Most familiar 'average' | Always exists | yes | yes | Works with many statistical methods
Median | Middle score | Commonly used | Always exists | no | no | Often a good choice if there are some extreme values
Mode | Most frequent score | Sometimes used | Might not exist; may be more than one mode | no | no | Appropriate for data at the nominal level
General comments:
For a data collection that is approximately symmetric with one mode, the mean, median, and mode tend to
be about the same.
For a data collection that is obviously asymmetric, it would be good to report both the mean and median.
The mean is relatively reliable. That is, when samples are drawn from the same population, the sample
means tend to be more consistent than the other averages (consistent in the sense that the means of samples
drawn from the same population don't vary as much as the other averages).
Using Technology
Exercise:
Microsoft Excel can calculate all measures of central tendency.
Example 1: Cambridge Power and Light Company selected 20 residential customers. Following are the
amounts, to the nearest dollar, the customers were charged for electrical services last month.
54 48 58 50 25 47 75 46 60 70
67 68 39 35 56 66 33 62 65 67
What are the mean, median, and mode of these amounts?
On a new worksheet, key your data in column A. Give A1 a title. Type the data in cells A2 to A21.
To calculate:
the mean, key =AVERAGE(A2:A21) in cell A23. You may type “Mean” in a cell beside it.
the median, key =MEDIAN(A2:A21) in cell A24.
the mode, key =MODE(A2:A21) in cell A25.
If there is no mode, Excel will display #N/A in that cell. If there is more than one mode Excel will display
the one that occurs first in the string of data.
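For readers working outside Excel, the same three measures can be obtained with Python's standard statistics module. This is offered as an alternative sketch, not as part of the Excel exercise; the variable names are assumptions.

```python
import statistics

# Electricity charges (dollars) for the 20 residential customers
charges = [54, 48, 58, 50, 25, 47, 75, 46, 60, 70,
           67, 68, 39, 35, 56, 66, 33, 62, 65, 67]

mean_charge = statistics.mean(charges)        # counterpart of =AVERAGE(A2:A21)
median_charge = statistics.median(charges)    # counterpart of =MEDIAN(A2:A21)
modes_charge = statistics.multimode(charges)  # list of every most-frequent value

print(mean_charge, median_charge, modes_charge)
```

Unlike the Excel behaviour described above, multimode returns all tied modes, and if no value repeats it returns every value rather than an error.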
Example 2: The weighted mean
Carter Construction Company pays its hourly employees either $6.50, $7.50, or $8.50 per hour. There are 26
hourly employees: 14 are paid at the $6.50 rate, 10 at the $7.50 rate, and 2 at the $8.50 rate. What is the
weighted mean hourly rate paid to the 26 employees?
Key Employee in F1, Rate in G1, Product in H1
Key 14, 10, 2 in F2 to F4 respectively.
Key 6.5, 7.5, 8.5 in G2 to G4 respectively.
In H2, key = F2*G2
Make H2 your active cell. Place your cursor on the bottom right handle. You will have a thick black plus
sign. Click and drag to H3:H4.
Highlight F2:F4. From the toolbar choose the AutoSum button. The total appears in F5.
Highlight H2:H4. From the toolbar choose the AutoSum button. The total appears in H5.
In J7, key =H5/F5
You may type in “weighted mean” in an adjacent cell to this result.
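Example 3: To find the 10% trimmed mean of the data “Weights of Anesthetized Bears” (previous
exercise), key in the data as usual in column form, labeling the data in the first cell.
Key
= TRIMMEAN( : , 20%) in a new cell.
Note that the percentage given to TRIMMEAN is the total fraction of values excluded, split evenly between
the two ends, so 20% removes the bottom 10% and the top 10% of the values.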
SECTION 2.5: Measures of Variation
Because variation is so important in statistics, this is one of the most important sections.
The following key concepts are discussed in detail:
(1) Variation refers to the amount that values vary among themselves, and it can be measured with
specific numbers;
(2) Values that are relatively close together have lower measures of variation, and values that are
spread farther apart have measures of variation that are larger;
(3) The standard deviation, which is a particularly important measure of variation, can be
computed;
(4) The values of standard deviation must be interpreted correctly.
Example:
Waiting times of customers (in minutes) at the JV Bank (where all customers enter a single waiting line) and
the Bank of P (where customers wait in individual lines at three different teller windows):
JV: 6.5 6.6 6.7 6.8 7.1 7.3 7.4 7.7 7.7 7.7
P: 4.2 5.4 5.8 6.2 6.7 7.7 7.7 8.5 9.3 10.0
(a) Determine the mean, median, and mode for each data set.
(b) Interpret the results by determining whether there is a difference between the two data sets that is
not apparent from a comparison of the measures of center. If so, how are the data sets different?
We will now develop some specific ways to measure variation:
Range
The range of a data set is the difference between the highest value and the lowest value.
Range = (highest value) – (lowest value)
For JV: Range = 7.7 – 6.5 = 1.2 min
For P: Range = 10.0 – 4.2 = 5.8 min
Standard deviation of a sample
The standard deviation of a set of sample values is a measure of variation of values from the mean.
Formula
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$   sample standard deviation
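alternative formula,
$s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$
(this form divides by n rather than n – 1, so it gives a slightly smaller value than the formula above)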
There is also a shortcut formula for the standard deviation.
$s = \sqrt{\frac{n(\sum x^2) - (\sum x)^2}{n(n - 1)}}$
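Example: Use the first standard deviation formula to find the standard deviation of the JV Bank customer
waiting times. Those times (in minutes) have been listed in the first column of the table below:
x        $x - \bar{x}$      $(x - \bar{x})^2$
6.5      -0.65              0.4225
6.6      -0.55              0.3025
6.7      -0.45              0.2025
6.8      -0.35              0.1225
7.1      -0.05              0.0025
7.3      0.15               0.0225
7.4      0.25               0.0625
7.7      0.55               0.3025
7.7      0.55               0.3025
7.7      0.55               0.3025
$\sum x = 71.5$      $\sum (x - \bar{x})^2 = 2.0450$
$\bar{x} = 71.5/10 = 7.15$, so $s = \sqrt{2.0450/(10 - 1)} = 0.48$ min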
Example: Here is the same example, but the 'shortcut formula' is used.
Find n, $\sum x$, and $\sum x^2$.
n = 10 (sample size = 10)
$\sum x = 71.5$ (sum of the 10 sample values)
$\sum x^2 = 513.27$ (= 6.5² + 6.6² + 6.7² + … + 7.7²)
$s = \sqrt{\frac{10(513.27) - (71.5)^2}{10(10 - 1)}}$
= 0.48 min
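Both routes to the sample standard deviation can be checked against each other in a few lines of Python. The function names below are assumptions chosen for illustration.

```python
import math


def sample_std_definition(values):
    """Sample standard deviation from the defining formula (divide by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sample_std_shortcut(values):
    """Sample standard deviation from the shortcut formula."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


jv_times = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]

s_definition = sample_std_definition(jv_times)
s_shortcut = sample_std_shortcut(jv_times)
print(round(s_definition, 2), round(s_shortcut, 2))
```

The two formulas are algebraically equivalent, so any difference between the results comes only from floating-point rounding.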
Standard deviation of a Population:
Here is the formula for the standard deviation for a population, denoted by $\sigma$. Note that it divides by N, not N – 1.
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
where $\mu$ is the population mean, and
N is the population size.
Variance of a Sample and Population
The variance of a set of values is a measure of variation equal to the square of the standard deviation.
Sample variance: square of the standard deviation s
Denoted by $s^2$
Example:
From one of our previous examples (JV bank customer waiting times), s = 0.48 min, more precisely s =
0.4767 min. So,
$s^2 = (0.4767 \text{ min})^2 = 0.23 \text{ min}^2$
Population variance: square of the population standard deviation $\sigma$
Denoted by $\sigma^2$
Round off rule:
Carry one more decimal place than is present in the original set of values.
Finding Standard Deviation from a Frequency Table
Sometimes it is necessary to compute the standard deviation of a data set that is summarized in the form of a
frequency table. If the original list of sample values is available, use those values with the previous standard
deviation formulae to get more exact results. If the original data are not available, use the formula:
$s = \sqrt{\frac{n[\sum (f \cdot x^2)] - [\sum (f \cdot x)]^2}{n(n - 1)}}$
where x is the class midpoint, f is the class frequency, and $n = \sum f$.
Example: Use the following table to calculate the standard deviation.
Word rating   f     Midpoint, x   fx     fx²
0 – 2         20    1             20     20
3 – 5         14    4             56     224
6 – 8         15    7             105    735
9 – 11        2     10            20     200
12 – 14       1     13            13     169
$\sum f = 52$      $\sum (f \cdot x) = 214$      $\sum (f \cdot x^2) = 1348$
$s = \sqrt{\frac{52(1348) - (214)^2}{52(52 - 1)}}$
s = 3.0
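The frequency-table formula translates directly into code. A sketch using the word rating table, with an assumed function name:

```python
import math


def frequency_table_std(midpoints, frequencies):
    """Sample standard deviation estimated from class midpoints and frequencies."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, frequencies))
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))


midpoints = [1, 4, 7, 10, 13]
frequencies = [20, 14, 15, 2, 1]

s_ratings = frequency_table_std(midpoints, frequencies)
print(round(s_ratings, 1))
```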
Interpreting and understanding Standard Deviation
The standard deviation measures the variation among values. Values close together will yield a small
standard deviation, whereas values spread farther apart will yield a larger standard deviation.
Three different ways of developing a sense for values of standard deviation:
(1) Range rule of thumb
Based on the principle that for many data sets, the vast majority (such as 95%) of sample values lie within 2
standard deviations of the mean.
$\bar{x} \pm 2s$
For estimation:
To obtain a rough estimate of the standard deviation s, use the equation
$s \approx \frac{\text{range}}{4}$
where range = (highest value) – (lowest value)
Example:
Previous results from the National Health Survey show that the heights of men have a mean of 69.0 inches
and a standard deviation of 2.8 inches. Use the range rule of thumb to find the minimum and maximum
'usual' heights.
$\bar{x} \pm 2s = 69.0 \pm 2(2.8)$, which gives 69.0 – 5.6 = 63.4 and 69.0 + 5.6 = 74.6
Based on these results, we expect that typical men will range in height between 63.4 inches and 74.6 inches.
(2) Empirical Rule for data with a bell-shaped distribution
For data sets having a distribution that is approximately bell-shaped, the following properties apply:
about 68% of all values fall within 1 standard deviation of the mean
$\bar{x} \pm s$
about 95% of all values fall within 2 standard deviations of the mean
$\bar{x} \pm 2s$
about 99.7% of all values fall within 3 standard deviations of the mean
$\bar{x} \pm 3s$
Example:
The heights of men have a bell-shaped distribution with a mean of 69.0 inches and a standard deviation of
2.8 inches (based on data from the National Health Survey). What percentage of men have heights between
60.6 inches and 77.4 inches?
Solution:
60.6 inches and 77.4 inches are each exactly 3 standard deviations away from the mean of 69.0 inches.
$69.0 \pm 3(2.8)$
According to the empirical rule, 99.7% of all men's heights are between 60.6 inches and 77.4 inches.
(3) Chebyshev’s theorem
Chebyshev's theorem applies to any data set, unlike the empirical rule, but its results are very approximate.
It says:
The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at
least $1 - 1/K^2$, where K is any positive number greater than 1. For K = 2 and K = 3, the following results are
obtained:
at least ¾ (or 75%) of all values lie within 2 standard deviations of the mean.
at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean.
Example:
Heights of men have a mean of 69.0 inches and a standard deviation of 2.8 inches. What can we conclude
from Chebyshev's theorem?
Within 2 standard deviations
At least ¾ (or 75%) of all men have heights within 2 standard deviations of the mean (63.4 in. – 74.6 in.)
Within 3 standard deviations
At least 8/9 (or 89%) of all men have heights within 3 standard deviations of the mean (60.6in. – 77.4 in.)
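The bound in Chebyshev's theorem and the matching interval around the mean are simple to compute. A sketch with assumed function names, using the heights example:

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


def interval(mean, std, k):
    """Limits lying k standard deviations either side of the mean."""
    return (mean - k * std, mean + k * std)


for k in (2, 3):
    low, high = interval(69.0, 2.8, k)
    print(k, round(chebyshev_bound(k), 2), round(low, 1), round(high, 1))
```

The function accepts any k greater than 1, including non-integers such as 1.5, which the theorem also covers.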
Exercise:
1] Find the range, variance, and standard deviation for each of the two samples, then compare the two
sets of results.
Maximum breadth of samples of male Egyptian skulls from 4000 BC and 150 AD
4000 BC: 131 119 138 125 129 126 131 132 126 128 128 131
150 AD: 136 130 126 126 139 141 137 138 133 131 134 129
(Based on data from Ancient Races of the Thebaid by Thomson and Randall-Maciver.)
2] Find the standard deviation of the data summarized in the given frequency table.
Samples of student cars and faculty/staff cars were obtained at a certain college, and their ages (in
years) are summarized in the frequency table.
Age Students Faculty/staff
0 -2 23 30
3 - 5 33 47
6 - 8 63 36
9 - 11 68 30
12 - 14 19 8
15 - 17 10 0
18 - 20 1 0
21 - 23 0 1
3] Two different sections of a statistics class take the same quiz and the scores are recorded below.
Do a double stem-and-leaf plot for both data sets. Discuss the variation of the data in each of the sets
and compare.
Find the range and standard deviation for each section.
What do the range values lead you to conclude about the variation in the two sections?
Why is the range misleading in this case?
What do the standard deviation values lead you to conclude about the variation in the two sections?
Section 1: 1 20 20 20 20 20 20 20 20 20 20
Section 2: 2 3 4 5 6 14 15 16 17 18 19
4] Let a population consist of the values 1, 2, and 3. Assume that samples of two different values are
randomly selected with replacement.
(a) Find the variance $\sigma^2$ of the population {1, 2, 3}.
(b) List the nine different possible samples and find the sample variance $s^2$ for each of them. If you
repeatedly select two different items, what is the mean value of the sample variances $s^2$?
(c) For each of the nine samples, find the variance by treating each sample as if it is a population.
(Be sure to use the formula for population variance.) If you repeatedly select two different items,
what is the mean value of the population variances?
(d) Which approach results in values that are better estimates of $\sigma^2$: part (b) or part (c)?
Why?
When computing variances of samples, should you use division by n or n – 1?
(e) The preceding parts show that $s^2$ is an unbiased estimator of $\sigma^2$. Is s an unbiased estimator of $\sigma$?
Using technology
Here are the standard commands for calculating
(i) Range
= MAX( : ) – MIN( : )
(ii) Sample variance, $s^2$
= VAR( : )
Population variance, $\sigma^2$
= VARP( : )
(iii) Sample standard deviation, s
= STDEV( : )
Population standard deviation, $\sigma$
= STDEVP( : )
Use example 1 above to calculate the measures of variation (dispersion) using the appropriate commands
from above.
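As a point of comparison with the Excel commands, Python's standard statistics module offers counterparts for each of them. The sketch below uses the JV Bank waiting times from this section as sample data; that choice of data is an assumption made for illustration.

```python
import statistics

jv_times = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]

data_range = max(jv_times) - min(jv_times)            # like =MAX( : ) - MIN( : )
sample_variance = statistics.variance(jv_times)       # divides by n - 1, like VAR
population_variance = statistics.pvariance(jv_times)  # divides by n, like VARP
sample_std = statistics.stdev(jv_times)               # like STDEV
population_std = statistics.pstdev(jv_times)          # like STDEVP

print(round(data_range, 1), round(sample_variance, 2), round(sample_std, 2))
```

The sample versions are always at least as large as the population versions for the same data, because they divide by n – 1 instead of n.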
NOTE:
Microsoft Excel can create a summary of all calculations for descriptive statistics.
Under Tools, locate Data Analysis. Locate Descriptive Statistics. Follow the instructions in the dialog
box. Click on Input range and key in the range of cells where the data has been placed (example:
A2:A40). Leave the columns button clicked. Click on summary statistics. Then OK.
SECTION 2.6: Measures of Position
This section introduces measures that can be used to compare values from different data sets or to compare
values within the same data set. The basic tools are z scores, quartiles, and percentiles.
z Scores
A standard score, or z score, is the number of standard deviations that a given value x is above or below the
mean. It is found using the expressions
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
Example:
Former NBA superstar Michael Jordan is 78 inches tall, and WNBA basketball player Rebecca Lobo is 76
inches tall. Jordan is obviously taller by 2 inches, but which player is relatively taller? Does Jordan's height
among men exceed Lobo's height among women?
Men have heights with a mean of 69.0 inches and a standard deviation of 2.8 inches; women have heights
with a mean of 63.6 inches and a standard deviation of 2.5 inches.
Solution:
For Jordan                  For Lobo
x = 78 in.                  x = 76 in.
$\sigma$ = 2.8 in.          $\sigma$ = 2.5 in.
$\mu$ = 69.0 in.            $\mu$ = 63.6 in.
Using the population formula for both data sets,
$z = \frac{x - \mu}{\sigma}$
For Jordan                  For Lobo
z = (78 – 69.0)/2.8         z = (76 – 63.6)/2.5
z = 3.21                    z = 4.96
Interpretation:
Jordan's height is 3.21 standard deviations above the mean, but Lobo's height is 4.96 standard deviations
above the mean. Relative to her own population, Lobo's height among women is greater than Jordan's height among men.
Consider another player, Mugsy Bogues, who is only 63 inches tall. When his height is converted to a z
score,
z = (63 – 69.0)/2.8
= -2.14
his height is 2.14 standard deviations below the mean (because the z score is negative). He is a relatively
short person amongst the population of men.
NOTE:
Ordinary values fall within 2 standard deviations of the mean: $\bar{x} \pm 2$ std dev.
$-2 \le z \text{ score} \le 2$
Unusual values fall more than 2 standard deviations below or above the mean.
z score < -2 or z score > 2
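The z score calculation and the "unusual value" rule can be sketched in Python as follows. The function names are assumptions.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


def is_unusual(z):
    """Rule used in the notes: unusual if the z score is below -2 or above 2."""
    return abs(z) > 2


jordan = z_score(78, 69.0, 2.8)  # compared with men's heights
lobo = z_score(76, 63.6, 2.5)    # compared with women's heights
bogues = z_score(63, 69.0, 2.8)  # compared with men's heights

for name, z in [("Jordan", jordan), ("Lobo", lobo), ("Bogues", bogues)]:
    print(name, round(z, 2), is_unusual(z))
```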
Quartiles and Percentiles
Quartiles (Q1, Q2, Q3) divide the sorted values into four equal parts.
For Q1,
At least 25% of the sorted values will be less than or equal to Q1.
Q1 = ¼ (n + 1)th value
For Q2,
At least 50 % of the sorted values will be less than or equal to Q2.
Q2 = median
Q2 = ½(n + 1)th value
For Q3,
At least 75% of the sorted values will be less than or equal to Q3.
Q3 = ¾(n + 1)th value
Percentiles (denoted by Pk, for k = 1 to 99) are the 99 values which split a distribution into 100 equal portions.
For example, the 10th percentile, P10:
P10 = 10/100(n + 1)th value
Or
P90 = 90/100(n + 1)th value
Note:
Q1 = P25
Q2 = P50
Q3 = P75
Example:
The table lists 36 weights (in pounds) of the contents of 36 cans of regular Coke.
(a) Find the quartiles
(b) Find the 10th and 90th percentiles
(c) Find the percentile corresponding to the weight of 0.8143 lbs.
0.7901 0.8044 0.8062 0.8073 0.8079 0.8110
0.8126 0.8128 0.8143 0.8150 0.8150 0.8152
0.8152 0.8161 0.8161 0.8163 0.8165 0.8170
0.8172 0.8176 0.8181 0.8189 0.8192 0.8192
0.8194 0.8194 0.8207 0.8211 0.8229 0.8244
0.8244 0.8247 0.8251 0.8264 0.8284 0.8295
Solution:
The data must be sorted in order of size, if it isn't already. This set has already been sorted.
(a) Quartiles
Q1 = ¼(n + 1)th value
= ¼(36 + 1)th value
= 9.25th value
The 9th value is 0.8143.
The 10th value is 0.8150.
The difference is 0.8150 – 0.8143 = 0.0007
So
Q1 = 0.8143 + ¼(0.0007)
= 0.814475
Q2 = ½(n + 1)th value
= ½(36 + 1)th value
= 18.5th value
Q2 is the average of the two middle entries. (Remember: Q2 = median)
Q2 = (0.8170 + 0.8172)/2
= 0.8171
Q3 = ¾(n + 1) th value
= ¾(36 + 1)th value
= 27¾th value
The 27th value is 0.8207 and the 28th value is 0.8211.
The difference is 0.8211 – 0.8207 = 0.0004
So
Q3 = 0.8207 + ¾(0.0004)
= 0.8210
(b) P10 and P90
P10 = 10/100(n + 1)th value
= 10/100(36 + 1)th value
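= 3.7th value
The 3rd value is 0.8062 and the 4th value is 0.8073, so
P10 = 0.8062 + 0.7(0.0011) = 0.80697
P90 = 90/100(n + 1)th value
= 90/100(36 + 1)th value
= 33.3rd value
The 33rd value is 0.8251 and the 34th value is 0.8264, so
P90 = 0.8251 + 0.3(0.0013) = 0.82549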
(c) To determine the percentile which corresponds to a certain value from the data set, use
percentile of value x = (number of values less than x / total number of values) · 100
Percentile of 0.8143 lbs = (8/36) · 100
= 22 (rounded)
The weight of 0.8143 lbs is the 22nd percentile.
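The (n + 1) position rule with interpolation between neighbouring values, as used in this example, can be sketched in Python. The function names are assumptions, and clamping positions that fall outside the data to the smallest or largest value is an added convention, not something stated in the notes.

```python
def value_at_fraction(sorted_values, fraction):
    """Value at the fraction * (n + 1)th position, interpolating between neighbours."""
    n = len(sorted_values)
    position = fraction * (n + 1)  # 1-based and possibly fractional
    lower = int(position)
    if lower < 1:                  # assumed convention: clamp to the ends
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + (position - lower) * (above - below)


def percentile_of(sorted_values, x):
    """Percentile of x: share of values strictly less than x, as a rounded percentage."""
    less = sum(1 for v in sorted_values if v < x)
    return round(less / len(sorted_values) * 100)


coke = [0.7901, 0.8044, 0.8062, 0.8073, 0.8079, 0.8110,
        0.8126, 0.8128, 0.8143, 0.8150, 0.8150, 0.8152,
        0.8152, 0.8161, 0.8161, 0.8163, 0.8165, 0.8170,
        0.8172, 0.8176, 0.8181, 0.8189, 0.8192, 0.8192,
        0.8194, 0.8194, 0.8207, 0.8211, 0.8229, 0.8244,
        0.8244, 0.8247, 0.8251, 0.8264, 0.8284, 0.8295]

q1 = value_at_fraction(coke, 0.25)
q2 = value_at_fraction(coke, 0.50)
q3 = value_at_fraction(coke, 0.75)
p10 = value_at_fraction(coke, 0.10)
p90 = value_at_fraction(coke, 0.90)
print(round(q1, 6), round(q2, 4), round(q3, 4), percentile_of(coke, 0.8143))
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ in the last decimal places.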
Using Technology
Excel's Rank and Percentile analysis tool produces a table showing the rank order and percentile for each
value in a data set. As an alternative to the analysis tool, these results could also be obtained using Excel's
Data Sort and Edit Fill commands.
Exercise:
Use Excel to determine the information required in the exercise on 36 weights (in pounds) of 36 cans of
regular Coke (previous exercise). Compare with your manual calculations.
Exercise: Use Microsoft Excel wherever possible.
Express all z scores with two decimal places. Consider a value to be unusual if its z score is less than –2.00
or greater than 2.00.
1. Human body temperatures have a mean of 98.20º and a standard deviation of 0.62º. An
emergency room patient is found to have a temperature of 101º. Convert 101º to a z score. Is that
temperature unusually high? What does it suggest?
2. Scores on a history test have a mean of 80 and a standard deviation of 12. Scores on a
psychology test have a mean of 30 and a standard deviation of 8. Which is relatively better: a
score of 75 on the history test or a score of 27 on the psychology test?
3. Refer to the data set for the sample of 36 weights of regular Coke. Convert the weight of 0.7901
to a z score. Is 0.7901 an unusual weight for regular Coke?
4. Refer to the data set for weights of anesthetized bears and find the indicated percentile or quartile.
(i) P85 (ii) P35 (iii) Q1 (iv) Q3 (v) P50
5. The first several terms of the famous Fibonacci Sequence are 1, 1, 2, 3, 5, 8, 13.
(a) Find the mean $\bar{x}$ and standard deviation s, then convert each value to a z score. Don't round
the z scores; carry as many places as your calculator can handle.
(b) Find the mean and standard deviation of the z scores found in part (a).
(c) If you use any other data set, will you get the same results obtained in part (b)?
SECTION 2.7: Exploratory Data Analysis (EDA)
Exploratory data analysis is the process of using statistical tools (such as graphs, measures of center,
measures of variation) to investigate data sets in order to understand their important characteristics.
When exploring a data set, we usually want to calculate the mean and the standard deviation and to generate
a graph, usually a histogram/bar chart. It is also important to further examine the data set to identify any
notable features, especially those that could have a strong effect on results and conclusions (for example,
outliers).
Outliers
An outlier is a value that is located very far away from almost all of the other values. Relative to the other
data, an outlier is an extreme value.
Effects of an outlier
(1) Can have a dramatic effect on the mean, and, hence, in some instances, a trimmed mean is
calculated.
(2) Can have a dramatic effect on the standard deviation.
(3) Can have a dramatic effect on the scale of the histogram/bar chart so that the true nature of the
distribution is totally obscured.
An easy way to find outliers is to examine a sorted list of the data. Look at the minimum and maximum
sample values and determine whether they are very far away from the other typical values. When an outlier
occurs because of a nonsampling error, it should either be corrected or deleted. However, some data sets
include outliers that are correct values.
To study the effects of outliers, we can construct graphs and calculate statistics with and without the outliers
included.
Boxplots
Boxplots are useful for revealing the center of the data, the spread of the data, the distribution of the data,
and the presence of outliers.
It is a graph that consists of a line extending from the minimum value to the maximum value, and a box with
lines drawn at the first quartile Q1; the median; and the third quartile Q3.
Example:
Comparing Ages of Oscar Winners.
In “Ages of Oscar-winning Best Actors and Actresses” (Mathematics Teacher magazine) by Richard
Brown and Gretchen Davis, stem-and-leaf plots are used to compare the ages of actors and actresses at
the time they won Oscars. Here are the results for recent winners from each category:
Actors:
32 37 36 32 51 53 33 61 35 45 55 39
76 37 42 40 32 60 38 56 48 48 40 43
62 43 42 44 41 56 39 46 31 47 45 60
Actresses:
50 44 35 80 26 28 41 21 61 38 49 33
74 30 33 41 31 35 41 42 37 26 34 34
35 26 61 60 34 24 30 37 31 27 39 34
Use boxplots to compare the two data sets.
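A boxplot is drawn from the 5-number summary (minimum, Q1, median, Q3, maximum), so one possible first step is to compute that summary for each data set. The sketch below is only one way to do it. It assumes Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data; other quartile conventions exist and can give slightly different values. The function name five_number_summary is illustrative and not part of the original notes.

```python
from statistics import median

actors = [32, 37, 36, 32, 51, 53, 33, 61, 35, 45, 55, 39,
          76, 37, 42, 40, 32, 60, 38, 56, 48, 48, 40, 43,
          62, 43, 42, 44, 41, 56, 39, 46, 31, 47, 45, 60]
actresses = [50, 44, 35, 80, 26, 28, 41, 21, 61, 38, 49, 33,
             74, 30, 33, 41, 31, 35, 41, 42, 37, 26, 34, 34,
             35, 26, 61, 60, 34, 24, 30, 37, 31, 27, 39, 34]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data (the middle value is excluded when n is odd).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

for name, ages in (("Actors", actors), ("Actresses", actresses)):
    print(name, five_number_summary(ages))
```

Plotting both summaries on the same scale then allows the centers and spreads of the two groups to be compared directly.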
Exploring
We now have the following tools to explore data sets:
measures of center: mean, median, mode
measures of variation: standard deviation and range
measures of spread and relative location: minimum value, maximum value, and quartiles
unusual values: outliers
distribution: histograms, stem-and-leaf plots, and box plots
Rather than simply producing statistics and graphs, try to identify those that are particularly interesting and
important. As a first step, investigate outliers and consider their effects by finding measures and graphs with
and without the outliers included.
Exercise: PROJECT
The traditional typewriter keyboard configuration is called a Qwerty keyboard because of the position of the
letters QWERTY in the top row of letters. Developed in 1872, the Qwerty configuration was supposed to
force typists to slow down so that their machines would be less likely to jam. The Dvorak keyboard,
developed in 1936, positioned the keys most frequently used in the middle (or ‘home’) row, a move intended
to improve efficiency. Both keyboard configurations are shown in the accompanying illustration.
An article in the magazine Discover suggests that you can measure the ease of typing by using this point rating
system: count each letter on the home row as 0, each letter on the top row as 1, and each letter on the bottom
row as 2 (see ‘Typecasting’ by Scott Kim, Discover). For example, the word statistics would have a rating
of 7 on the Qwerty keyboard and a rating of 1 on the Dvorak keyboard:
         s  t  a  t  i  s  t  i  c  s
Qwerty:  0  1  0  1  1  0  1  1  2  0   (sum = 7)
Dvorak:  0  0  0  0  0  0  0  0  1  0   (sum = 1)
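The rating rule can be sketched in a few lines of Python. The row layouts below are assumptions based on the standard Qwerty and Dvorak letter arrangements, since the illustration is not reproduced here. Only letters are scored, and the names QWERTY_ROWS, DVORAK_ROWS and word_rating are illustrative.

```python
# Letter rows of each layout (assumed standard arrangements; letters only).
QWERTY_ROWS = {"top": "qwertyuiop", "home": "asdfghjkl", "bottom": "zxcvbnm"}
DVORAK_ROWS = {"top": "pyfgcrl", "home": "aoeuidhtns", "bottom": "qjkxbmwvz"}

# Point system from the text: home row = 0, top row = 1, bottom row = 2.
ROW_POINTS = {"home": 0, "top": 1, "bottom": 2}

def word_rating(word, rows):
    """Sum the row points of every letter in the word."""
    points = {letter: ROW_POINTS[row]
              for row, letters in rows.items()
              for letter in letters}
    # Characters that are not letters (for example, apostrophes) are skipped.
    return sum(points[ch] for ch in word.lower() if ch in points)

print(word_rating("statistics", QWERTY_ROWS))  # 7, as in the example above
print(word_rating("statistics", DVORAK_ROWS))  # 1, as in the example above
```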
This rating system was used with each of the 52 words in a certain document and the rating values are shown
below:
Table 1: Qwerty keyboard word ratings
2 2 5 1 2 6 3 3 4 2
4 0 5 7 7 5 6 6 8 10
7 2 2 10 5 8 2 5 4 2
6 2 6 1 7 2 7 2 3 8
1 5 2 5 2 14 2 2 6 3
1 7
Table 2: Dvorak keyboard word ratings
2 0 3 1 0 0 0 0 2 0
4 0 3 4 0 3 3 1 3 5
4 2 0 5 1 4 0 3 5 0
2 0 4 1 5 0 4 0 1 3
0 1 0 3 0 1 2 0 0 0
1 4
Exploratory Data Analysis
(a) Organize each data set in a frequency table, using 5 classes with the first being 0 – 2.
(b) Construct the appropriate histogram/bar chart for each data set from the frequency tables in part
(a)
(c) Construct the frequency polygons on the same axes for Qwerty and Dvorak keyboards, using the
frequency tables in part (a).
(d) From the graphs above, describe the type of distribution (skewed negatively or positively, or
symmetric) displayed by each data set.
(e) Calculate all measures of center and summarize them in the table below:
Qwerty Dvorak
Mean
Median
Mode
(f) Calculate all measures of variation and summarize them in the table below:
Qwerty Dvorak
Range
Standard deviation
Variance
(g) On the same axes, construct a boxplot (using the 5-number summary) for each of the given data
sets. Identify any outliers. Which set of ratings (Qwerty or Dvorak) appears to have more
spread? (Use information from the boxplots.)
(h) Analysis: Using the graphs and measures you have determined above, comment on the level of
typing difficulty using each of the keyboards. Do both keyboards appear to have the same level
of difficulty, or is one easier to use than the other?
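For readers who want to check their hand calculations for parts (a), (e) and (f), the sketch below shows one possible way to tabulate and summarize the ratings in Python. It assumes a class width of 3 (so the five classes are 0-2, 3-5, 6-8, 9-11 and 12-14) and the sample versions of the standard deviation and variance (dividing by n - 1). The helper names frequency_table and summary are illustrative.

```python
from statistics import mean, median, mode, stdev, variance

# Ratings copied from Table 1 and Table 2, read row by row.
qwerty = [2, 2, 5, 1, 2, 6, 3, 3, 4, 2,
          4, 0, 5, 7, 7, 5, 6, 6, 8, 10,
          7, 2, 2, 10, 5, 8, 2, 5, 4, 2,
          6, 2, 6, 1, 7, 2, 7, 2, 3, 8,
          1, 5, 2, 5, 2, 14, 2, 2, 6, 3,
          1, 7]
dvorak = [2, 0, 3, 1, 0, 0, 0, 0, 2, 0,
          4, 0, 3, 4, 0, 3, 3, 1, 3, 5,
          4, 2, 0, 5, 1, 4, 0, 3, 5, 0,
          2, 0, 4, 1, 5, 0, 4, 0, 1, 3,
          0, 1, 0, 3, 0, 1, 2, 0, 0, 0,
          1, 4]

def frequency_table(values, width=3, classes=5):
    """Count the values falling in classes 0-2, 3-5, ... (width 3 assumed)."""
    table = {}
    for k in range(classes):
        low, high = k * width, k * width + width - 1
        table[f"{low}-{high}"] = sum(low <= v <= high for v in values)
    return table

def summary(values):
    """Measures of center and variation (sample stdev and variance)."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "standard deviation": stdev(values),
        "variance": variance(values),
    }

for name, data in (("Qwerty", qwerty), ("Dvorak", dvorak)):
    print(name, frequency_table(data))
    print(name, summary(data))
```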
So far we have treated the two data sets as if they were separate and independent, but the word ratings came
from the same 52 words in the same document. We should therefore explore the differences between the
pairs of ratings corresponding to each of the 52 words.
For example, suppose the first word in the document was the word ‘we’.
         w  e
Qwerty:  1  1   (sum = 2)
Dvorak:  2  0   (sum = 2)
The difference is 2 – 2 = 0.
(i) Find the difference for each pair of ratings from the two data sets.
(j) Determine the mean, standard deviation and 5-number summary, construct a boxplot, and identify
any outliers for the set of data consisting of the 52 differences.
(k) Construct an appropriate graph to discuss the type of distribution.
(l) If both keyboards were to have the same level of difficulty, we would expect the differences
between word ratings to average around 0. What do the mean difference and the graph tell you
about the level of difficulty of one keyboard compared to the other? Does this support your earlier
analysis in part (h)?
(m) Would removing any outliers affect the conclusion(s) you have drawn? Show any necessary
calculations to support your statement.
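The paired differences in part (i) can be formed by matching the two tables entry by entry. The sketch below assumes that both tables list the 52 words in the same order, read row by row, and that each difference is taken as Qwerty minus Dvorak, as in the ‘we’ example. The sample standard deviation (dividing by n - 1) is also an assumption.

```python
from statistics import mean, stdev

# Ratings copied from Table 1 and Table 2, read row by row.
qwerty = [2, 2, 5, 1, 2, 6, 3, 3, 4, 2,
          4, 0, 5, 7, 7, 5, 6, 6, 8, 10,
          7, 2, 2, 10, 5, 8, 2, 5, 4, 2,
          6, 2, 6, 1, 7, 2, 7, 2, 3, 8,
          1, 5, 2, 5, 2, 14, 2, 2, 6, 3,
          1, 7]
dvorak = [2, 0, 3, 1, 0, 0, 0, 0, 2, 0,
          4, 0, 3, 4, 0, 3, 3, 1, 3, 5,
          4, 2, 0, 5, 1, 4, 0, 3, 5, 0,
          2, 0, 4, 1, 5, 0, 4, 0, 1, 3,
          0, 1, 0, 3, 0, 1, 2, 0, 0, 0,
          1, 4]

# Pair the ratings word by word: difference = Qwerty rating - Dvorak rating.
differences = [q - d for q, d in zip(qwerty, dvorak)]

print("number of differences:", len(differences))
print("mean difference:", round(mean(differences), 2))
print("sample standard deviation:", round(stdev(differences), 2))
```

A mean difference well above 0 would point to higher (harder) ratings on the Qwerty keyboard, while a mean near 0 would suggest similar difficulty.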