Multiple regression
Uploaded by jamalia-mcmillan
Regression
Problem: to draw a straight line through the points that best explains the variance
[Figure: scatter plot of data points with a fitted straight line]
Regression
Test with F, just like ANOVA:
F = (Variance explained by x-variable / df) / (Variance still unexplained / df)
[Figure: two scatter-plot panels. Variance explained (change in line lengths²); variance unexplained (residual line lengths²)]
In regression, each x-variable will normally have 1 df
Essentially a cost-benefit analysis:
Is the benefit in variance explained worth the cost in using up degrees of freedom?
Regression
Also have R²: the proportion of total variance explained by the variable
R² = Variance explained by x-variable / Total variance
(Total variance = Variance explained by x-variable + Variance still unexplained)
[Figure: scatter plot with labels "Variance explained by x-variable" and "Unexplained variance"]
Regression example
Total variance for 32 data points is 300 units.
An x-variable is then regressed against the data, accounting for 150 units of variance.
1. What is the R²?
2. What is the F ratio?
R² = 150/300 = 0.5
F(1,30) = (150/1) / (150/30) = 150/5 = 30
Why is df error = 30?
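The arithmetic in this example can be checked in a few lines of Python. This is only a sketch; the variable names are illustrative and not from the slides. It also suggests an answer to the df question: 32 points give 31 total df, and the single x-variable uses 1 of them, leaving 30 for error.

```python
# Worked check of the regression example (variable names are illustrative).
n_points = 32
total_variance = 300.0   # total variance in y, in the slide's "units"
explained = 150.0        # accounted for by the x-variable

df_x = 1                          # each x-variable normally has 1 df
df_error = n_points - 1 - df_x    # 31 total df minus 1 for the x-variable
unexplained = total_variance - explained

r_squared = explained / total_variance
f_ratio = (explained / df_x) / (unexplained / df_error)

print(f"R2 = {r_squared}")                   # R2 = 0.5
print(f"F({df_x},{df_error}) = {f_ratio}")   # F(1,30) = 30.0
```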
Multiple regression
[Figure: herbivore damage vs. tree age, with higher nutrient trees and lower nutrient trees marked]
Damage = m1*age + b
[Figure: herbivore damage vs. tree age; residuals of herbivore damage vs. tree nutrient concentration]
Damage = m1*age + m2*nutrient + b
Damage = m1*age + m2*nutrient + m3*age*nutrient + b
[Figure: two panels plotting y. No interaction (additive); interaction (non-additive)]
Non-linear regression?
Just a special case of multiple regression!
Y = m1*x + m2*x² + b

X    X²    Y
1    1     1.1
2    4     2.0
3    9     3.6
4    16    3.1
5    25    5.2
6    36    6.7
7    49    11.3

Treat X as x1 and X² as x2:
Y = m1*x1 + m2*x2 + b
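A minimal sketch of this idea in Python, assuming NumPy is available. The data are the seven rows of the table; the variable names are illustrative. The fit treats x and x² as two ordinary x-variables, then compares the result against a dedicated quadratic fit to show they are the same model.

```python
import numpy as np

# Data from the table on the slide.
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([1.1, 2.0, 3.6, 3.1, 5.2, 6.7, 11.3])

# Treat x and x**2 as two separate x-variables, plus a column of ones for b.
x1 = x
x2 = x ** 2
design = np.column_stack([x1, x2, np.ones_like(x)])

# Ordinary least squares for Y = m1*x1 + m2*x2 + b.
(m1, m2, b), *_ = np.linalg.lstsq(design, y, rcond=None)

# The same model via a quadratic polynomial fit
# (coefficients are returned highest power first: m2, m1, b).
poly_m2, poly_m1, poly_b = np.polyfit(x, y, deg=2)

print(f"multiple regression: m1 = {m1:.3f}, m2 = {m2:.3f}, b = {b:.3f}")
print(f"quadratic fit:       m1 = {poly_m1:.3f}, m2 = {poly_m2:.3f}, b = {poly_b:.3f}")
```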
[Figure: Jump (ft) vs. Height (ft)]

X variable          parameter   SS     F(1,13)   p
Height of player    +0.943      9.96   112       <0.0001
[Figure: Jump (ft) vs. Weight (lbs)]

X variable          parameter   SS     F(1,13)   p
Weight of player    +0.040      7.92   32        <0.0001
An idea
Perhaps if we took two people of identical height, the lighter one might actually jump higher? Excess weight may reduce ability to jump high…
[Figure: Jump (ft) vs. Height (ft), with lighter and heavier players marked]

X variable   parameter   SS      F     p
Height       +2.133      9.956   803   <0.0001
Weight       -0.059      1.008   81    <0.0001
•Why did the parameter estimates change?
•Why did the F tests change?

Height and weight together:
X variable   parameter   SS      F     p
Height       +2.133      9.956   803   <0.0001
Weight       -0.059      1.008   81    <0.0001

Weight alone:
X variable          parameter   SS     F(1,13)   p
Weight of player    +0.040      7.92   32        <0.0001
Heavy people are often tall (and tall people are often heavy)
Tall people can jump higher
People light for their height can jump a bit more
[Diagram: Weight and Height linked (+); Height to Jump (+); Weight to Jump (-)]
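The sign change can be reproduced with made-up data. The sketch below assumes NumPy; the player numbers are synthetic (not the data behind the slides) and the helper `fit` is hypothetical. Weight is generated to rise with height, and jump to rise with height but fall slightly with weight.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic players (assumed numbers, not the data behind the slides).
height = rng.normal(6.5, 0.5, n)                             # ft
weight = 30.0 * height - 30.0 + rng.normal(0, 10.0, n)       # lbs; tall people are heavy
jump = 2.0 * height - 0.02 * weight + rng.normal(0, 0.1, n)  # ft

def fit(y, *xs):
    """Least-squares coefficients for y = m1*x1 + ... + b (b is last)."""
    design = np.column_stack([*xs, np.ones_like(y)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Weight on its own stands in for height, so its slope comes out positive.
weight_alone = fit(jump, weight)[0]

# With height already in the model, weight shows its own (negative) effect.
_, weight_given_height, _ = fit(jump, height, weight)

print(f"weight alone:        {weight_alone:+.3f}")
print(f"weight after height: {weight_given_height:+.3f}")
```

The weight coefficient is positive when weight is the only x-variable and negative once height is included, the same pattern as in the two tables.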
The problem:
The parameter estimate and significance of an x-variable are affected by the x-variables already in the model!
How do we know which variables are significant, and in what order to enter them into the model?
Solutions
1) Use a logical order. For example, it makes sense to test the interaction first.
2) Stepwise regression: “tries out” various orders of entering and removing variables.
Stepwise regression
Enters or removes variables in order of significance, and checks after each step whether the significance of the other variables has changed
Enters one by one: forward stepwise
Enters all, removes one by one: backward stepwise
Forward stepwise regression
• Enter the variable with the highest correlation with the y-variable first (p < p enter).
• Next, enter the variable that explains the most residual variation (p < p enter).
• Remove variables that become insignificant (p > p leave) because other variables were added. And so on…
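The steps above can be sketched in Python, assuming NumPy. This is a simplified illustration, not a reference implementation: the function names are hypothetical, the data are synthetic, and fixed F thresholds (`f_enter`, `f_leave`, illustrative values) stand in for p enter and p leave, since a larger F corresponds to a smaller p.

```python
import numpy as np

def rss(y, columns):
    """Residual sum of squares for y regressed on the columns plus an intercept."""
    design = np.column_stack([*columns, np.ones_like(y)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def partial_f(y, X, in_model, candidate):
    """F for the 1-df contribution of `candidate` given the other variables in the model."""
    without = [j for j in in_model if j != candidate]
    with_c = without + [candidate]
    rss_without = rss(y, [X[:, j] for j in without])
    rss_with = rss(y, [X[:, j] for j in with_c])
    df_error = len(y) - len(with_c) - 1
    return (rss_without - rss_with) / (rss_with / df_error)

def forward_stepwise(y, X, f_enter=4.0, f_leave=3.9):
    selected = []
    for _ in range(2 * X.shape[1]):          # step cap guards against cycling
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if not remaining:
            break
        # Enter the candidate that explains the most residual variation.
        best = max(remaining, key=lambda j: partial_f(y, X, selected, j))
        if partial_f(y, X, selected, best) < f_enter:
            break
        selected.append(best)
        # Remove earlier variables that became insignificant after this entry.
        for j in list(selected[:-1]):
            if partial_f(y, X, selected, j) < f_leave:
                selected.remove(j)
    return selected

# Synthetic demo: y depends on columns 0 and 2 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)
chosen = forward_stepwise(y, X)
print("variables entered (column indices):", chosen)
```

Keeping `f_leave` slightly below `f_enter` means a variable that has just been removed cannot immediately re-enter against the same set of variables.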
General words of caution!
•Can interpolate between points, but don’t extrapolate (Mark Twain effect)
In the space of 176 years the lower Mississippi has shortened itself 242 miles. That is an average of a trifle over 1 1/3 miles per year. Therefore, any calm person, who is not blind or idiotic, can see that in the old Oölithic Silurian Period, just a million years ago next November, the Lower Mississippi River was upwards of 1,300,000 miles long, and stuck out over the Gulf of Mexico like a fishing rod.
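The arithmetic behind the quote is a straight-line fit pushed far outside the range of the data. A small sketch in Python, using only the figures given in the quote (variable names are illustrative):

```python
# Figures from the quote: 242 miles lost over 176 years.
miles_shortened = 242
years_observed = 176
rate = miles_shortened / years_observed   # miles per year

# Extend the same straight line a million years back, far beyond the data.
years_back = 1_000_000
extra_miles = rate * years_back

print(f"rate = {rate} miles per year")                                  # 1.375
print(f"extra length a million years ago = {extra_miles:,.0f} miles")   # 1,375,000
```

The slope is fine within the 176 observed years; the absurd river length comes entirely from applying it a million years outside them.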