Post on 26-Dec-2015
Descriptive Statistics:
Numerical Methods
Chapter 4
Introduction
In this chapter we use numerical measures to describe data sets that represent populations or samples.
Usually, we focus our attention on two types of measures when describing population characteristics:
• Measures of central location
• Measures of dispersion
Why are both the central location and the variability used to describe a set of numbers?
Observe the following example.
Think of a sample portfolio composed of three stocks:
• 100 shares, ARR = 10%
• 200 shares, ARR = 15%
• 100 shares, ARR = 20%
A central measure of this portfolio's ARR is 15%.
Now observe the following portfolio:
• 100 shares, ARR = 5%
• 200 shares, ARR = 15%
• 100 shares, ARR = 25%
A central measure of this portfolio's ARR is 15% too.
Considering the average ARR only, the two portfolios are equal. But are they really?
Is the dispersion of ARR the same for the two portfolios?
The dispersion (variability) is an important property when describing a set of numbers, at least as important as the central location.
We’ll have more detailed discussions on these two important measures later.
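The portfolio comparison can be checked numerically. The following is a minimal sketch, assuming the "central measure" is the share-weighted mean ARR and using the range of the ARR values as a rough stand-in for dispersion. The names portfolio_1, portfolio_2, weighted_mean, and arr_range are illustrative, not from the slides.

```python
# Each holding is (number of shares, ARR in percent), as listed on the slides.
portfolio_1 = [(100, 10.0), (200, 15.0), (100, 20.0)]
portfolio_2 = [(100, 5.0), (200, 15.0), (100, 25.0)]


def weighted_mean(holdings):
    """Share-weighted mean ARR (an assumed reading of 'central measure')."""
    total_shares = sum(shares for shares, _ in holdings)
    return sum(shares * arr for shares, arr in holdings) / total_shares


def arr_range(holdings):
    """Largest ARR minus smallest ARR, a crude measure of spread."""
    rates = [arr for _, arr in holdings]
    return max(rates) - min(rates)


for name, holdings in (("Portfolio 1", portfolio_1), ("Portfolio 2", portfolio_2)):
    print(name, "mean ARR:", weighted_mean(holdings), "range:", arr_range(holdings))
```

Both portfolios have the same mean ARR, but the second one spans a wider interval of returns.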
4.1 Measures of Central Location
With one data point, clearly the central location is at the point itself.
The central data point reflects the locations of all the actual data points. How?
With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
If the third data point appears in the center, the measure of central location will remain in the center.
But if the third data point appears on the left-hand side of the midrange, it should "pull" the central location to the left.
As more and more data points are added, the central location moves (left and right) as required in order to reflect the effects of all the points.
The Arithmetic Mean
This is the most popular and useful measure of central location.
$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$
Sample mean (where $n$ is the sample size):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean (where $N$ is the population size):
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Example 1
Find the mean rate of return for a portfolio equally invested in five stocks having the following annual rates of return: 11.2%, 8.07%, 5.55%, 13.7%, 21%.
Solution
$$\bar{x} = \frac{11.2 + 8.07 + 5.55 + 13.7 + 21}{5} = \frac{59.52}{5} = 11.904\%$$
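As a quick check of the arithmetic mean formula, here is a minimal sketch that applies it to the five rates of return listed in Example 1. The function name arithmetic_mean is illustrative.

```python
def arithmetic_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


# Annual rates of return (percent) as listed in Example 1.
returns = [11.2, 8.07, 5.55, 13.7, 21]
mean_return = arithmetic_mean(returns)
print(round(mean_return, 3))
```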
The Median
The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.
When determining the median, pay attention to the number of observations (k):
• If k is odd, the median is the number at the (k+1)/2th location of the ordered array.
• If k is even, the median is the average of the two numbers in the middle (the numbers at the (k/2)th and the [(k/2)+1]th locations of the ordered array).
Example 2
The salaries of seven employees were recorded (in $1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
Odd number of observations
Ordered array: 26, 26, 28, 29, 30, 32, 60
There are seven salaries (k = 7). The (k+1)/2th salary of the ordered array is the number at the (7+1)/2th = 4th location. The median is 29.
Suppose an additional salary of $31,000 is added to the group of salaries recorded before. Find the median salary.
Even number of observations
Ordered array: 26, 26, 28, 29, 30, 31, 32, 60
There are eight salaries (k = 8). The two salaries in the middle are 29 (in the (k/2)th = 4th location) and 30 (in the [(k/2)+1]th = 5th location). The median is their average, 29.5.
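The odd/even location rule can be sketched in a few lines. This is an illustrative implementation (the function name median_by_location is an assumption), applied to the salary data of Example 2.

```python
def median_by_location(values):
    """Median via the (k+1)/2 rule for odd k and the middle-pair average for even k."""
    ordered = sorted(values)
    k = len(ordered)
    if k % 2 == 1:
        # Location (k+1)/2 is 1-based; subtract 1 for Python's 0-based index.
        return ordered[(k + 1) // 2 - 1]
    # Locations k/2 and k/2 + 1 (1-based) are indexes k/2 - 1 and k/2.
    return (ordered[k // 2 - 1] + ordered[k // 2]) / 2


salaries = [28, 60, 26, 32, 30, 26, 29]          # in $1000s
print(median_by_location(salaries))              # seven salaries
print(median_by_location(salaries + [31]))       # eight salaries
```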
The Mode
The mode of a set of measurements is the value that occurs most frequently.
A set of data may have one mode (or modal class), or two or more modes.
The modal class: for large data sets the modal class is much more relevant than a single-value mode.
Example 3
The manager of a men's clothing store observes the waist size (in inches) of trousers sold last week: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
The mode of this data set is 34 in.
This information seems to be valuable (for example, for the design of a new display in the store), much more than "the median is 33.5 in."
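The mode and median of the waist-size data can be verified with Python's standard statistics module. This is only a sketch; statistics.multimode returns every most-frequent value, which also covers data sets with two or more modes.

```python
import statistics

# Waist sizes (inches) of trousers sold last week, from Example 3.
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

modes = statistics.multimode(waist_sizes)      # list of all most-frequent values
median_size = statistics.median(waist_sizes)   # average of the two middle values
print(modes, median_size)
```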
Relationship among Mean, Median, and Mode
If a distribution is symmetrical, the mean, median and mode coincide.
If a distribution is not symmetrical, and is skewed to the left or to the right, the three measures differ.
[Figure: a positively skewed distribution ("skewed to the right") with the mean, median, and mode marked]
[Figure: a negatively skewed distribution ("skewed to the left") with the mean, median, and mode marked]
Using the Mean, Median, and Mode
When to use (not use) each measure of central location:
• The mean is very sensitive to extreme values, and thus should not be used when a few extreme values, residing away from most of the observations, are present. The mean is used in most statistical analyses.
• The median is not affected by extreme values and therefore can be used in their presence. Yet the median does not reflect all the values included in the data set, but rather the location of the observation in the middle.
• The mode should be used mainly for categorical data.
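The sensitivity of the mean to an extreme value can be illustrated with a small sketch. It reuses the seven salaries of Example 2 (in $1000s), where 60 sits far from the rest. Dropping that value is a hypothetical step added here for comparison only.

```python
import statistics

salaries = [28, 60, 26, 32, 30, 26, 29]            # in $1000s; 60 is the extreme value
without_extreme = [s for s in salaries if s != 60]  # hypothetical comparison set

mean_with = statistics.mean(salaries)
mean_without = statistics.mean(without_extreme)
median_with = statistics.median(salaries)
median_without = statistics.median(without_extreme)

print("mean:", mean_with, "->", mean_without)
print("median:", median_with, "->", median_without)
```

The mean shifts by several thousand dollars when the extreme salary is removed, while the median barely moves.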
Summary Examples
Example 4
A professor of statistics wants to report the results of a midterm exam taken by 100 students.
• The mean of the test marks is 73.90
• The median of the test marks is 81
• The mode of the test marks is 84
Describe the information each one provides.
The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.
The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his/her mark relative to other students in the class.
The mode must be used when data is qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.
Example 5
The following sample represents the lateness of arriving flights at a certain domestic airport (in minutes): 22, 12, 4, -3, … (the data is found in Lateness.xls)
(a) Find the mean, median, and mode of this sample. Do these data form a skewed distribution? If so, is the skew negative or positive?
(b) Which measure should not be used? Change the largest lateness to 34 minutes (rather than 67). Which central location measures are affected?
(c) A person is waiting for the arrival of a certain flight. He is told the flight will probably be no more than 10 minutes late. Should he believe this is a reliable estimate? Use the distribution of data requested in part (b).
Example 5 - solution
We run the data on Excel using the 'Descriptive Statistics' tool.

Lateness
Mean                 10.8709677
Standard Error       2.6436135
Median               6
Mode                 4
Standard Deviation   14.719017
Sample Variance      216.649462
Kurtosis             6.39059859
Skewness             2.17922953
Range                75
Minimum              -8
Maximum              67
Sum                  337
Count                31

The distribution of these data shows a positive skewness:
[Figure: histogram of Lateness]
Do not use the mean, because an 'outlier' of 67 minutes lateness affects (increases) the mean value to almost 11 minutes.
Example 5 - solution
When changing the largest observation from 67 to 34, the mean reduces to 9.81 minutes, but the median and mode do not change.

Lateness
Mean                 9.806451613
Standard Error       2.034339265
Median               6
Mode                 4
Standard Deviation   11.32672166
Sample Variance      128.2946237
Kurtosis             0.919374432
Skewness             1.051857781
Range                48
Minimum              -8
Maximum              40
Sum                  304
Count                31

[Figure: histogram of Lateness (frequency) with the cumulative percentage curve (ogive)]
• It is reasonable to believe that the lateness will not exceed 10 minutes. From the ogive we see that about 60% of the flights arrive within 10 minutes of the scheduled arrival time.
Problems
P4-1: Consider the following sample of measurements: 27, 32, 30, 28, 31, 32, 35, 28, 28, 29. Compute the mean, median, and mode. Does it appear that the mode is a good measure of central location for this set of numbers?
P4-2: The manager at a local supermarket (facing tough competition) tries to improve service to customers waiting to pay by adding a second cashier. The goal is to have customers wait at most 4.5 minutes before leaving the cashier area. From the data presented in P4-02.xls, was the manager successful in achieving this goal? Use Excel and numerical descriptive measures.
4.2 Measures of Variability
Measures of central location fail to tell the whole story about the distribution.
A question of interest still remains unanswered:
How much are the values of a given set spread out around the mean value?
Why do we need measures of variability?
Observe two hypothetical data sets:
Set 1: Small variability. The mean provides a good representation of the values in the data set.
Set 2: Larger variability. The mean is the same as before but no longer represents the set values as well as before.
The Range
The range of a set of measurements is the difference between the largest and smallest measurements.
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.
[Figure: the range spans the interval from the smallest measurement to the largest measurement]
But how do all the measurements spread out? The range cannot assist in answering this question.
The Variance
This measure reflects the dispersion of all the measurement values.
The variance of a population of N measurements $x_1, x_2, \ldots, x_N$ having a mean $\mu$ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
The variance of a sample of n measurements $x_1, x_2, \ldots, x_n$ having a mean $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B are more dispersed than those in A.
A measure of dispersion should agree with this observation.
Can the sum of deviations from the mean be a good measure of dispersion?
Deviations in A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2. Sum = 0
Deviations in B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6. Sum = 0
The sum of deviations is zero for both populations; therefore it is not a good measure of dispersion, since clearly their dispersions are not equal.
Let us calculate the variance of the two populations:
$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead?
After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!
Which data set has a larger dispersion?
Data set A: ten measurements, five equal to 1 and five equal to 3 (mean 2)
Data set B: two measurements, 1 and 5 (mean 3)
Data set B is more dispersed around the mean.
Let us calculate the sum of squared deviations for both data sets:
Sum_A = (1-2)^2 + … + (1-2)^2 + (3-2)^2 + … + (3-2)^2 = 10
Sum_B = (1-3)^2 + (5-3)^2 = 8
Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.
However, when calculated on a "per observation" basis (variance), the dispersions are properly ordered:
$\sigma_A^2$ = Sum_A/N = 10/10 = 1
$\sigma_B^2$ = Sum_B/N = 8/2 = 4
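The comparison between the sum of squared deviations and the variance can be reproduced with a short sketch. It assumes data set A consists of five 1s and five 3s (which matches Sum_A = 10 with N = 10) and that data set B is {1, 5}. The function names are illustrative.

```python
def sum_squared_deviations(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def population_variance(values):
    """Average squared deviation: the sum divided by the number of observations N."""
    return sum_squared_deviations(values) / len(values)


set_a = [1] * 5 + [3] * 5   # assumed composition of data set A
set_b = [1, 5]

print(sum_squared_deviations(set_a), sum_squared_deviations(set_b))
print(population_variance(set_a), population_variance(set_b))
```

The raw sum ranks A above B only because A has more observations; dividing by N removes that effect.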
The Variance
Example 6
Find the variance of the following set of numbers, representing annual rates of return for a group of mutual funds. Assume the set is (i) a sample, (ii) a population: -2, 4, 5, 6.9, 10
Solution
$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{-2 + 4 + 5 + 6.9 + 10}{5} = \frac{23.9}{5} = 4.78$$
Assuming a sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{5-1}\left[(-2-4.78)^2 + (4-4.78)^2 + \cdots + (10-4.78)^2\right] = 19.59\ \text{percent}^2$$
Assuming a population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{1}{5}\left[(-2-4.78)^2 + (4-4.78)^2 + \cdots + (10-4.78)^2\right] = 15.6736\ \text{percent}^2$$
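Both versions of the variance in Example 6 can be checked with Python's standard statistics module, where variance divides by n-1 (sample) and pvariance divides by N (population). This is a sketch for verification only.

```python
import statistics

# Annual rates of return (percent) from Example 6.
returns = [-2, 4, 5, 6.9, 10]

sample_var = statistics.variance(returns)       # divides by n - 1
population_var = statistics.pvariance(returns)  # divides by N
print(round(sample_var, 2), round(population_var, 4))
```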
Standard Deviation
The standard deviation of a set of measurements is the square root of the set variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Standard Deviation
Example 7
The daily percentages of defective items in two weeks of production (10 working days) were calculated for two production lines. Which line provides good items more consistently?
Line 1: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05
Line 2: 12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4
Example 7, Solution
Let us use the Excel printout obtained from the "Descriptive Statistics" sub-menu.

Statistic            Line 1    Line 2
Mean                 16        12
Standard Error       5.295     3.152
Median               14.6      11.75
Mode                 #N/A      #N/A
Standard Deviation   16.74     9.969
Sample Variance      280.3     99.37
Kurtosis             -1.34     -0.46
Skewness             0.217     0.107
Range                49.1      30.6
Minimum              -6.2      -2.8
Maximum              42.9      27.8
Sum                  160       120
Count                10        10

Line 1 should be considered less consistent because the standard deviation of its defective proportion is larger (and therefore the standard deviation of its good-item proportion is also larger).
Interpreting the Standard Deviation
The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
When describing the shape of a distribution we refer to
• a distribution with any shape
• a mound-shaped distribution
The Empirical Rule – Describing a Mound Shaped Data Set
If a sample of measurements has a mound-shaped distribution, the interval
• $(\bar{x} - s,\ \bar{x} + s)$ contains approximately 68% of the measurements
• $(\bar{x} - 2s,\ \bar{x} + 2s)$ contains approximately 95% of the measurements
• $(\bar{x} - 3s,\ \bar{x} + 3s)$ contains approximately 99.7% of the measurements
The Empirical Rule
Example 10
Describe the set of data provided in Data 10 using numerical descriptive measures.
[Figure: histogram of the measurements (frequency by measurement value, classes from 17 to 18.6)]
Solution
From the histogram it appears that the distribution is approximately mound shaped. We'll use the empirical rule to describe the data.
The Empirical Rule – Interpreting the Standard Deviation
Example 10 – solution continued
Running the Descriptive Statistics tool in Excel we have
Mean = 17.959
Standard deviation (sample) = 0.556
From the empirical rule we get:
• Approximately 68% of the data lie between 17.403 and 18.515 [17.959 - 1(.556), 17.959 + 1(.556)]. Actual count: 19 (73%)
• Approximately 95% of the data lie between 16.847 and 19.071 [17.959 - 2(.556), 17.959 + 2(.556)]. Actual count: 25 (96%)
• Approximately 99.7% of the data lie between 16.291 and 19.627 [17.959 - 3(.556), 17.959 + 3(.556)]. Actual count: 26 (100%)
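The three empirical-rule intervals follow directly from the mean and the sample standard deviation. Here is a minimal sketch using the Example 10 values. The function name empirical_rule_intervals is illustrative.

```python
def empirical_rule_intervals(mean, s):
    """Return {k: (mean - k*s, mean + k*s)} for k = 1, 2, 3."""
    return {k: (mean - k * s, mean + k * s) for k in (1, 2, 3)}


# Mean and sample standard deviation reported for Example 10.
intervals = empirical_rule_intervals(17.959, 0.556)
for k, (low, high) in intervals.items():
    print(k, round(low, 3), round(high, 3))
```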
The Chebyshev Theorem - Describing Any Data Set
The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for any k > 1.
This theorem is valid for any set of measurements (sample, population) of any shape!

k    Interval                                  Minimum %
2    $(\bar{x} - 2s,\ \bar{x} + 2s)$    at least 75%  $(1 - 1/2^2)$
3    $(\bar{x} - 3s,\ \bar{x} + 3s)$    at least 89%  $(1 - 1/3^2)$
4    $(\bar{x} - 4s,\ \bar{x} + 4s)$    at least 94%  $(1 - 1/4^2)$
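The Chebyshev lower bound is a one-line formula. The sketch below evaluates it for k = 2, 3, 4. The function name chebyshev_lower_bound is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(k, f"at least {chebyshev_lower_bound(k):.1%}")
```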
The Chebyshev Theorem
Example 9
Employee salaries were recorded and a histogram was created. Describe this data using the correct numerical measures.
[Figure: histogram of salaries (frequency by salary, classes from 155 to 425)]
Solution
Creating the histogram we realize that the distribution is positively skewed. The Chebyshev Theorem needs to be used to describe the data.
Example 9 – solution continued
From Excel we have:
Mean = 243.2
Standard deviation = 58.354
Applying the Chebyshev Theorem:
• At least 75% of the salaries lie within [243.2-2(58.354), 243.2+2(58.354)] = [126.492, 359.908]. Actual count: 39 (97.5%)
• At least 88.9% of the salaries lie within [243.2-3(58.354), 243.2+3(58.354)] = [68.138, 418.262]. Actual count: all (100%)
The Coefficient of Variation
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Population coefficient of variation: $CV = \sigma / \mu$
Sample coefficient of variation: $cv = s / \bar{x}$
This coefficient provides a proportionate measure of variation.
A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
4.3 Measures of Relative Location and Box Plots
Additional information on the general shape of a data set can be obtained by describing the relative location of 5 values within the data set.
We use percentiles to describe these 5 relative locations. What is a percentile?
Percentile
The pth percentile of a set of measurements is the value for which
• At most p% of the measurements are less than that value
• At most (100-p)% of all the measurements are greater than that value.
Example
Suppose your score is the 60th percentile of a SAT test. Then 60% of all the scores lie below your score and 40% lie above it.
Here are two possible approaches commonly used to describe a set of values.
The five number summary:
• Smallest value
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• Largest value
- OR -
• The first decile (the 10th percentile)
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• The ninth decile (the 90th percentile)
A demonstration of commonly used percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Median = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile
[Figure: ordered data with each percentile marked: 10% of the measurements lie below the lower decile and 90% above it; 25% lie below the lower quartile and 75% above it; 50% lie below the median and 50% above it; and so on]
Determining Percentiles and their Location
There are two general cases to consider:
• The percentile is a member of the data set
• The percentile is not a member of the data set; it falls in between two values of the data set.
Let us demonstrate the two cases with two examples.
Example 11
Find the quartiles for the data set of flight lateness presented in example 4.5.
Data: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05
Example 11 - Solution
Sort the measurements (10 measurements):
2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
The first quartile:
• At most (.25)(10) = 2.5 measurements should appear below the first quartile. Check the smallest 2 measurements on the left-hand side.
• At most (.75)(10) = 7.5 measurements should appear above the first quartile. Check the largest 7 measurements on the right-hand side.
The third measurement, 5.2, has 2 measurements below it and 7 above it, so Q1 = 5.2.
Example 11 – solution continued
The second quartile (median):
• At most (.5)(10) = 5 numbers lie below and above Q2
• 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
Q2 = (8.3 + 20.9)/2 = 14.6
The third quartile:
• At most (.75)10 = 7.5 numbers lie below Q3
• At most (.25)10 = 2.5 numbers lie above Q3
• 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
The eighth measurement, 30.05, has 7 numbers below it and 2 above it, so Q3 = 30.05.
Example 12
Find the 20th percentile for the data set of flight lateness presented in example 11.
Solution
Following the procedure applied to the previous example,
• At most (.20)10 = 2 numbers should fall below the 20th percentile.
• At most (.80)10 = 8 numbers should fall above the 20th percentile.
• The sorted data set is: 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
• From the sorted data set we see that every number greater than 3.1 and smaller than 5.2 meets these two conditions.
• We show next how to determine the location and value of a percentile whose value is not one of the data set points.
Determining Percentiles and their Location
Find the location of any percentile using the formula
$$L_P = (n+1)\frac{P}{100}$$
where $L_P$ is the location of the $P^{\text{th}}$ percentile.
Example 12 – solution continued
Finding the location of the 20th percentile:
$$L_{20} = (n+1)\frac{P}{100} = (10+1)\frac{20}{100} = 2.2$$
Finding the value of the 20th percentile:
2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
The 20th percentile is located at location 2.2, that is, at .2 of the distance from 3.1 (location 2) to 5.2 (location 3). Therefore,
P20 = 3.1 + .2(5.2 – 3.1) = 3.52
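The location formula combined with linear interpolation between neighboring ordered values can be sketched as follows. The function name percentile_by_location and its handling of locations outside 1..n are assumptions made for illustration. For n = 10 and P = 20 the formula gives location (10+1)(20/100) = 2.2.

```python
def percentile_by_location(values, p):
    """Percentile at location L = (n + 1) * p / 100, interpolating linearly."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100   # 1-based, possibly fractional
    lower = int(location)          # integer part of the location
    fraction = location - lower    # distance toward the next location
    if lower < 1:                  # assumed clamp: location before the first value
        return ordered[0]
    if lower >= n:                 # assumed clamp: location at or beyond the last value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
p20 = percentile_by_location(data, 20)
print(round(p20, 3))
```

Note that different software packages use slightly different location conventions, so their percentile values may not match this formula exactly.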
Quartiles and Variability
Quartiles can provide an idea about the shape of a histogram.
[Figure: positively skewed histogram with Q1, Q2, and Q3 marked]
[Figure: negatively skewed histogram with Q1, Q2, and Q3 marked]
Interquartile Range
Interquartile range = Q3 – Q1
This is a measure of the spread of the middle 50% of the observations.
A large value indicates a large spread of the observations.
Box Plot
A box plot is a pictorial display that provides the main descriptive measures of the measurement set:
• L - the largest measurement
• Q3 - the upper quartile
• Q2 - the median
• Q1 - the lower quartile
• S - the smallest measurement
[Figure: box plot with S, Q1, Q2, Q3, and L marked; each whisker extends at most 1.5(Q3 – Q1) beyond the box]
An outlier is defined as any value that is more than 1.5(Q3 – Q1) away from the box.
Box Plot
Example 13
Create a box plot for the data regarding the GMAT scores of 200 applicants (see Data13.xls).
GMAT: 512, 531, 461, 515, ...
Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675)
[Figure: box plot of the GMAT scores, with 512 - 1.5(IQR) = 417.5 and 575 + 1.5(IQR) = 669.5 marking the outlier limits; the smallest score is 449 and the largest is 788]
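The outlier limits used in the box plot can be computed from the two quartiles alone. This is a minimal sketch with the Example 13 quartiles. The function name outlier_limits is illustrative, and the flagged list contains only the outliers printed on the slide, not the full data file.

```python
def outlier_limits(q1, q3):
    """Return (lower, upper) limits located 1.5 * IQR away from the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


lower, upper = outlier_limits(512, 575)
print(lower, upper)

# Outliers listed on the slide; each should fall outside the limits.
listed = [788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675]
flagged = [score for score in listed if score < lower or score > upper]
print(len(flagged) == len(listed))
```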
Box Plot
Example 13 - continued
Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lie below 512 and a quarter above 575.
• The data set is positively skewed.
[Figure: box plot marking 449, Q1 = 512, Q2 = 537, Q3 = 575, and 669.5, with 25%, 50%, and 25% of the scores in the three regions]
4.4 Measures of Linear Relationship
The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
• The covariance answers the question: Is there any pattern to the way two variables move together?
• The correlation coefficient answers the question: How strong is the linear relationship between two variables?
Covariance
Population covariance:
$$COV(X,Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.
Sample covariance:
$$cov(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). n is the sample size.
If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
[Figure: scatter diagram of Y against X with an upward pattern]
If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
[Figure: scatter diagram of Y against X with a downward pattern]
If the two variables are unrelated, the covariance will be close to zero.
[Figure: scatter diagram of Y against X with no pattern]
The coefficient of correlation
The coefficient of correlation measures the strength of the linear relationship between two variables.
Population coefficient of correlation:
$$\rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$$
Sample coefficient of correlation:
$$r = \frac{cov(X,Y)}{s_x s_y}$$
If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
A weak linear relationship is indicated by a coefficient close to zero.
Also, a non-linear relationship translates to a weak linear relationship.
The coefficient of correlation and the covariance
Example 14
Compute the covariance and the coefficient of correlation to measure how car speed (miles per hour) and gas consumption (miles per gallon) are related to one another (see data next).
Solution
We believe speed affects gas consumption. Thus
• Speed is labeled X
• Miles per gallon is labeled Y
Example 14 – solution continued

Car     x      y       x^2      y^2       xy
1       15     7.1     225      50.41     106.5
2       35     15.5    1225     240.25    542.5
3       35     18.5    1225     342.25    647.5
4       40     19.7    1600     388.09    788
5       40     22.4    1600     501.76    896
6       45     21.3    2025     453.69    958.5
7       45     22.8    2025     519.84    1026
8       45     23.1    2025     533.61    1039.5
9       50     22.8    2500     519.84    1140
10      50     21.3    2500     453.69    1065
Total   400    194.5   16950    4003.43   8209.5

Shortcut formulas:
$$cov(x,y) = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$$
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$
Using the shortcut formula:
$$cov(x,y) = \frac{1}{10-1}\left[8209.5 - \frac{(400)(194.5)}{10}\right] = 47.7$$
To calculate the coefficient of correlation we first compute the standard deviations of X and Y. From the shortcut formula we have:
$$s_x = \sqrt{\frac{1}{10-1}\left[16950 - \frac{400^2}{10}\right]} = 10.27$$
$$s_y = \sqrt{\frac{1}{10-1}\left[4003.43 - \frac{194.5^2}{10}\right]} = 4.948$$
$$r = \frac{cov(X,Y)}{s_x s_y} = \frac{47.7}{(10.27)(4.948)} \approx .939$$
Interpretation: Speed and mileage per gallon are strongly positively linearly related for the speed range of 15 to 50 miles per hour.
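The Example 14 calculation can be reproduced from the raw data with the shortcut formulas. This is a sketch: the function names shortcut_cov and shortcut_std are illustrative, and the standard deviation is obtained by applying the covariance shortcut to a variable with itself, which yields the shortcut variance.

```python
import math

# Speed (x, miles per hour) and mileage (y, miles per gallon) for the ten cars.
speed = [15, 35, 35, 40, 40, 45, 45, 45, 50, 50]
mpg = [7.1, 15.5, 18.5, 19.7, 22.4, 21.3, 22.8, 23.1, 22.8, 21.3]


def shortcut_cov(x, y):
    """Sample covariance: (sum(xy) - sum(x) * sum(y) / n) / (n - 1)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


def shortcut_std(x):
    """Sample standard deviation: square root of the shortcut variance cov(x, x)."""
    return math.sqrt(shortcut_cov(x, x))


cov_xy = shortcut_cov(speed, mpg)
s_x = shortcut_std(speed)
s_y = shortcut_std(mpg)
r = cov_xy / (s_x * s_y)
print(round(cov_xy, 1), round(s_x, 2), round(s_y, 3), round(r, 3))
```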