Linear Correlations
Transcript of Linear Correlations
Linear Correlations
Vignon S. Oussa
December 7, 2011
Vignon S. Oussa () Linear Correlations December 7, 2011 1 / 8
Content
Introduction
Measure of dispersion
Concept of linear correlation of variables
Computations in Mathematica
Misc
questions
Vignon S. Oussa () Linear Correlations December 7, 2011 2 / 8
Content
Introduction
Measure of dispersion
Concept of linear correlation of variables
Computations in Mathematica
Misc
questions
Vignon S. Oussa () Linear Correlations December 7, 2011 2 / 8
Content
Introduction
Measure of dispersion
Concept of linear correlation of variables
Computations in Mathematica
Misc
questions
Vignon S. Oussa () Linear Correlations December 7, 2011 2 / 8
Content
Introduction
Measure of dispersion
Concept of linear correlation of variables
Computations in Mathematica
Misc
questions
Vignon S. Oussa () Linear Correlations December 7, 2011 2 / 8
Content
Introduction
Measure of dispersion
Concept of linear correlation of variables
Computations in Mathematica
Misc
questions
Vignon S. Oussa () Linear Correlations December 7, 2011 2 / 8
Content
Introduction
Measure of dispersion
Concept of linear correlation of variables
Computations in Mathematica
Misc
questions
Vignon S. Oussa () Linear Correlations December 7, 2011 2 / 8
What does it mean when we say smoking causes lung cancer?
Does it mean you will get cancer if you smoke a single cigarette?
People smoke for several years but never had lung cancer
Statistical meaning (You are much likely to get lung cancer if yousmoke more than if you don�t smoke)
Studies showed a clear correlation between smoking and lung cancer.
These studies are bases on statistics
.
Vignon S. Oussa () Linear Correlations December 7, 2011 3 / 8
What does it mean when we say smoking causes lung cancer?
Does it mean you will get cancer if you smoke a single cigarette?
People smoke for several years but never had lung cancer
Statistical meaning (You are much likely to get lung cancer if yousmoke more than if you don�t smoke)
Studies showed a clear correlation between smoking and lung cancer.
These studies are bases on statistics
.
Vignon S. Oussa () Linear Correlations December 7, 2011 3 / 8
What does it mean when we say smoking causes lung cancer?
Does it mean you will get cancer if you smoke a single cigarette?
People smoke for several years but never had lung cancer
Statistical meaning (You are much likely to get lung cancer if yousmoke more than if you don�t smoke)
Studies showed a clear correlation between smoking and lung cancer.
These studies are bases on statistics
.
Vignon S. Oussa () Linear Correlations December 7, 2011 3 / 8
What does it mean when we say smoking causes lung cancer?
Does it mean you will get cancer if you smoke a single cigarette?
People smoke for several years but never had lung cancer
Statistical meaning (You are much likely to get lung cancer if yousmoke more than if you don�t smoke)
Studies showed a clear correlation between smoking and lung cancer.
These studies are bases on statistics
.
Vignon S. Oussa () Linear Correlations December 7, 2011 3 / 8
What does it mean when we say smoking causes lung cancer?
Does it mean you will get cancer if you smoke a single cigarette?
People smoke for several years but never had lung cancer
Statistical meaning (You are much likely to get lung cancer if yousmoke more than if you don�t smoke)
Studies showed a clear correlation between smoking and lung cancer.
These studies are bases on statistics
.
Vignon S. Oussa () Linear Correlations December 7, 2011 3 / 8
What does it mean when we say smoking causes lung cancer?
Does it mean you will get cancer if you smoke a single cigarette?
People smoke for several years but never had lung cancer
Statistical meaning (You are much likely to get lung cancer if yousmoke more than if you don�t smoke)
Studies showed a clear correlation between smoking and lung cancer.
These studies are bases on statistics
.
Vignon S. Oussa () Linear Correlations December 7, 2011 3 / 8
De�nitionA correlation exists between two variables when higher values of onevariable consistently go with higher values of another, or when highervalues of one variable consistently go with lower values of another.
ExampleThere is a correlation between the variables height and weight for people.Taller people trend to weight more than shorter people.
ExampleThere is a correlation between the variables demand for apples and price ofapples; demands tend to be lower when the price is higher.
ExampleThere is a correlation between practice time and skill among piano players;those who practice more tend to be more skilled
Vignon S. Oussa () Linear Correlations December 7, 2011 4 / 8
Given x1, � � � , xN be N measurement of observations in population of Nindividuals with average µ.
Population mean µ =x1 + � � �+ xN
N= average.
Population Variance σ2 =∑Ni=1 (xi � µ)2
N
Pop Stand Deviation σ =p
σ2 =
s∑Ni=1 (xi � µ)2
N.
Vignon S. Oussa () Linear Correlations December 7, 2011 5 / 8
Now let x1, � � � , xn be n observations in the sample of the originalpopulation with mean x
Sample mean x =x1 + � � �+ xn
n
Sample Variance s2 =∑ni=1 (xi � x)
2
n� 1
Sample Stand Deviation s =ps2 =
s∑ni=1 (xi � x)
2
n� 1 .
Vignon S. Oussa () Linear Correlations December 7, 2011 6 / 8
Some Sample data
Company Revenue in 1998GM 178.1Wal-Mart 119.3IBM 78.5Boeing 45.8
Sample mean = 105.425
Sample variance = 3251.42
Sample standard deviation = 57.02
Vignon S. Oussa () Linear Correlations December 7, 2011 7 / 8
De�nitionsBivariate data are data in which 2 variables are measured on anindividual.
We are interested in the fundamental question
ProblemHow can the value of one variable be used to predict the value of the othervariable?
Vignon S. Oussa () Linear Correlations December 7, 2011 8 / 8
Scatter Diagrams and Correlation
4-2
Objectives fJ Draw and interpret scatter diagrams
S Describe the properties of the linear correlation coefficient
£J Compute and interpret the linear correlation coefficient
I!J Determine whether a linear relation exists between two variables
fa Explain the difference between correlation and causation
4-33
IJ Dra'W' a:n.d Interpret Se!atter Diagra:m.s
4-4
The reS()Onse variable is the variable whose value can be explained by the value of the eX()Ianatory or 1nedictor variable.
4-5
A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The explanatory variable is plotted on the h01izontal axis, and the response variable is plotted on the vertical axis.
EXAMPLE Drawing and Interpreting a Scatter Diagram
The data shown to the right are based on a study for drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So, depth at which drilling begins is the explanatory variable, x, and time (in minutes) to drill five feet is the response variable, y. Draw a scatter diagram of the data.Source: Penner, R., and Watts, D.G. “Mining Information.” The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.
4-6
4-7
8-
(1.)
E_ jZ:
6- • • I
50
Drilling Time versus Depth
•• • • •
• • •
• •
I I I
100 150 200
Depth
Various Types of Relations in a Scatter Diagram
Response
Response
•• ••• ••• • ••• •• • • • • •• Explanatory (a) Linear
••• -~···· •• •• •• • • •• •• Explanatory
(c) Nonlinear
Response
Response
•• • •• • • • •• • • • •• ••• • • Explanatory (b) Linear
•• • •• •••• •• • ••••• •• • • •• • • Explanatory
(d) Nonlinear
Response
• • • • • ••••• • • • •
Explanatory (e) No relation
4-10
fJ Des~ibe the Properties of the Linear Correlation Coeffi~ient
4-11
The linear correlation coeflicient or Pearson product moment correlation coefficient is a measure of the strength and direction of the linear relation between two quantitative va1iables. We use the Greek letter p (rho) to represent the population correlation coefficient and r to represent the sample conelation coefficient. We present only the formula for the sample correlation coefficient.
Samtlle Linear Correlation Coefficient':'
r= n -1
where x is the sample mean of the explanatory variable
Sx is the sample standard deviation of the explanatory vmiable
y is the sample mean of the response vmiable
Sy is the sample standard deviation of the response va1iable
n is the number of individuals in the sample
(1)
4-12
Properties of the Linear Correlation Coefficient
1. The linear correlation coefficient is always between -1 and 1, inclusive. That is, -1 < r < 1. 2. If r = + 1, then a perfect positive linear relation exists between the two vatiables. See Figure 4(a). 3. If r = -1, then a perfect negative linear relation exists between the two vatiables. See Figure 4( d). 4. The closer r is to + 1, the stronger is the evidence of positive association between the two variables. See Figures 4(b) and 4(c).
5. The closer r is to -1, the stronger is the evidence of negative association between the two variables. See Figures 4( e) and 4(f). 6. If r is close to 0, then little or no evidence exists of a linear relation between the two variables. Because the linear correlation coefficient is a measme of the strength of the linear relation, r close to 0 does not imt)ly no relation, just no linear relation. See Figures 4(g) and 4(h). 7. The linear correlation coefficient is a unitless measure of association. So the unit of measure for x and y plays no role in the interpretation of r. 8. The correlation coefficient is not resistant. Therefore, an observation that does not follow the overall pattern of the data could affect the value of the linear correlation coefficient.
4-13
• • • • • • • • • (;1) Perfect positive
linear relation, r = 1
I • • • •
•
• : I • • I • • • •
(e) Strong negative
•
linear relation, r = - 0.9
•
•
• •
I I •
! I ~ • • • I I
• • I I •
(b) Strong positive linear relation, r = 0.9
I • • • • • • •
• • • • • • I • • • • • • (I) Moderate negative
• I
linear relation, r = - 0.4
• I
• • I • • • • • • • •
• • I • •••• I • • • • •
(c) Moderate positive linear relation, r = 0.4
I • I •
• • I • • • • • • • • • •
(g) No linear relation, r close to 0.
I •
I •
• • I
• •
• • • • • • • • • • • (d) Perfect negative
linear relation, r = - 1
• • I • • • I • • • • • I I
(h) No linear relation, r close to 0.
•
I I
4-14
fJ Co:tnpute and Interpret the Linear Correlation Coeffie!ient
EXAMPLE Determining the Linear Correlation Coefficient
Determine the linear correlation coefficient of the drilling data.
4-15
4-16
Depth, x Time, y
35 5.88 -1.74712 -1.41633 2.474501
50 5.99 -1.45992 -1.27544 1.862051
75 6.74 -0.98126 -0.31486 0.308958
95 6.1 -0.59833 -1.13456 0.678839
120 7.47 -0.11967 0.620111 -0.07421
130 6.93 0.0718 -0.07151 -0.00513
145 6.42 0.358998 -0.72471 -0.26017
155 7.97 0.550463 1.260501 0.693859
160 7.92 0.646196 1.196462 0.773149
175 7.62 0.933394 0.812228 0.758129
185 6.89 1.12486 -0.12274 -0.13807
190 7.9 1.220592 1.170846 1.429126
126.25 6.985833 8.501037
sx = 52.2287 4 sv = 0. 78077 4
17
18.501037
12 10.773
i i
x y
x x y ys s
rn
− − =
−
=−
=
∑
4-17
184-18
fJ Deter:tnine Whether a Linear Relation Exists bet'W'een TW'o Variables
Does a Linear Relation Exist?
Since is fairly close to +1, we can say there is a positive linear relation between time to drill five feet and depth at which drilling begins.
194-19
204-20
fEJ Explain the Differen~e bet'Ween Correlation and Causation
21
According to data obtained from the Statistical Abstract of the United States, the correlation between the percentage of the female population with a bachelor’s degree and the percentage of births to unmarried mothers since 1990 is 0.940.
Does this mean that a higher percentage of females with bachelor’s degrees causes a higher percentage of births to unmarried mothers?
Certainly not! The correlation exists only because both percentages have been increasing since 1990. It is this relation that causes the high correlation. In general, time series data (data collected over time) will have high correlations because each variable is moving in a specific direction over time (both going up or down over time; one increasing, while the other is decreasing over time).
When data are observational, we cannot claim a causal relation exists between two variables. We can only claim causality when the data are collected through a designed experiment.
4-21
Another way that two variables can be related even though there is not a causal relation is through a lurking variable.
A lurking variable is related to both the explanatory and response variable.
For example, ice cream sales and crime rates have a very high correlation. Does this mean that local governments should shut down all ice cream shops? No! The lurking variable is temperature. As air temperatures rise, both ice cream sales and crime rates rise.
4-22