STATS 330: Lecture 5
Tutorials
These will cover computing details. They are held in the basement-floor tutorial lab, Building 303 South (303S B75).
• Tutorial 1: Wednesday 11-12
• Tutorial 2: Friday 10-11
• Tutorial 3: Friday 2-3
Tutorials start this Thursday.
My office hours
Tuesday 10:30 – 12:00
Thursday 10:30 – 12:00
Room 265, Building 303S
• Come up to the second floor in the Computer Science lifts, go through the doors and turn right.
Tutor office hours
Tuesday 1-4 pm, Thursday 12-2 and 3-4 pm, Friday 1-3 pm
All in Room 303.133 (first floor, Building 303)
Next week
Monday and Thursday lectures are by Arden Miller.
No lecture Tuesday.
Don't forget: the assignment is due Aug 4.
Today’s lecture: The multiple regression model
Aims of the lecture:
• To describe the type of data suitable for multiple regression
• To review the multiple regression model
• To describe some plots that check the suitability of the model
Suitable data
Suppose our data frame contains
• A numeric “response” variable Y
• One or more numeric “explanatory” variables X1, …, Xk (also called covariates or independent variables)
and we want to “explain” (model, predict) Y in terms of the explanatory variables, e.g.
• explain the volume of cherry trees in terms of height and diameter
• explain NOx in terms of C and E
The regression model
The model assumes:
• The responses are normally distributed with means μ_i (each response has a different mean) and constant variance σ²
• The mean response of a typical observation depends on the covariates through a linear relationship
   μ_i = β_0 + β_1 x_i1 + … + β_k x_ik
• The responses are independent
The regression plane
The relationship
   μ = β_0 + β_1 x_1 + … + β_k x_k
can be visualized as a plane. The plane is determined by the coefficients β_0, β_1, …, β_k,
e.g. for k = 2 (k = number of covariates):
[Figure: the regression plane for k = 2, plotted against axes X1, X2 and Y; the slope in the x1 direction is β_1, the slope in the x2 direction is β_2, and the intercept is β_0]
The regression model (cont.)
The data are scattered above and below the plane. The size of the “sticks” (the vertical deviations from the plane) is random, controlled by σ², and doesn’t depend on x1, x2.
[Figure: points scattered above and below the regression plane, joined to it by vertical “sticks”; axes X1, X2, Y]
Alternative form of the model
   Y = β_0 + β_1 x_1 + … + β_k x_k + ε
where ε is normally distributed with mean zero and variance σ².
Checking the data
How can we tell when this model is suitable? Suitable = data randomly scattered about a plane, “planar” for short.
• k = 1: draw a scatterplot of y vs x
• k = 2: use a spinner (rotating 3-D plot), look for an “edge”
• k > 2: use coplots; if the data are planar the panels should be parallel (a sketch with simulated data follows):
   coplot(y ~ x1 | x2 * x3)
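For concreteness, here is a minimal sketch of such a coplot check, using simulated data and made-up variable names (not the course data):

# Sketch: checking planarity with a coplot (simulated data)
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 2*x1 - x2 + 0.5*x3 + rnorm(n, sd = 0.2)   # a genuinely planar relationship
# If the planar model is adequate, every panel should show a roughly
# linear point cloud with about the same slope.
coplot(y ~ x1 | x2 * x3)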
[Figure: coplot of y vs x1 given x2 and x3; the panels show roughly parallel point clouds]
Interpretation of coefficients
The regression coefficient β_0 gives the average response when all covariates are zero.
The regression coefficient β_1:
• gives the slope of the plane in the x1 direction
• measures the average increase in the response for a unit increase in x1 when all other covariates are held constant
• is the slope of the plots of y versus x1 in a coplot
• is not necessarily the slope of y versus x1 in a scatter plot
Example
Consider data generated from the model
   y = 5 - 1*x1 + 4*x2 + N(0,1)
where x1 and x2 are highly correlated, r = 0.95.
The regression of y on x1 alone has slope 2.767:
[Figure: scatter plot of y vs x1; the fitted line has slope 2.767]
Coplot
[Figure: coplot of y vs x1 given x2; within each panel the slopes are all about -1]
Points to note
The regression coefficients β_1 and β_2 refer to the conditional mean of the response, given the covariates (in this case, conditional on both x1 and x2).
β_1 is the slope in the coplot, conditioning on x2.
The coefficient in the scatter plot of y versus x1 alone refers to a different conditional distribution (conditional on x1 only).
The two slopes are the same if and only if x1 and x2 are uncorrelated (see the simulation sketch below).
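A rough simulation in the spirit of the example above (made-up data; the exact marginal slope depends on the random seed, so it will not be exactly 2.767):

# Simulate y = 5 - x1 + 4*x2 + N(0,1) with x1 and x2 highly correlated
set.seed(330)
n  <- 100
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.1)      # forces a high correlation with x1
y  <- 5 - x1 + 4*x2 + rnorm(n)
cor(x1, x2)                        # large, close to 0.95
coef(lm(y ~ x1))                   # marginal slope of x1: positive, far from -1
coef(lm(y ~ x1 + x2))              # partial slope of x1: close to -1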
Estimation of the coefficients
We estimate the (unknown) regression plane by the “least squares plane” (best fitting plane)
Best fitting plane = plane that minimizes the sum of squared vertical deviations from the plane
That is, minimize the least squares criterion
   Σ_{i=1}^{n} (y_i − b_0 − b_1 x_i1 − … − b_k x_ik)²
Estimation of the coefficients (2)
The R command lm calculates the coefficients of the best fitting plane
This function solves the normal equations, a set of linear equations derived by differentiating the least squares criterion with respect to the coefficients
Math stuff (only if you like it)
Arrange the data on the response and the covariates into a vector y and a matrix X:
   y = (y_1, …, y_n)'
   X = the n × (k+1) matrix whose i-th row is (1, x_i1, …, x_ik)
(the first column of X is a column of 1s, corresponding to the intercept).
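As an illustration (a sketch with simulated data, not the course data), the least squares coefficients can be obtained by solving the normal equations (X'X)b = X'y in this notation, and they match what lm() returns:

# Solve the normal equations directly and compare with lm()
set.seed(1)
n  <- 50
x1 <- runif(n); x2 <- runif(n)
y  <- 5 - x1 + 4*x2 + rnorm(n)
X  <- cbind(1, x1, x2)                       # design matrix: column of 1s, then covariates
b  <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) b = X'y
drop(b)
coef(lm(y ~ x1 + x2))                        # same estimates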
Calculation of best fitting plane
> lm(y~x1+x2)
Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2
      4.988       -1.055        4.060

Fitted plane is y = 4.988 − 1.055 x1 + 4.060 x2
How well does the plane fit?
Judge this by examining the residuals and fitted values. Each observation has a fitted value:
• y_i: response for observation i; x_i1, x_i2: values of the explanatory variables for observation i
• The fitted plane is
   ŷ = β̂_0 + β̂_1 x_1 + β̂_2 x_2
• The fitted value for observation i is
   ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2
  (the height of the fitted plane at (x_i1, x_i2))
How well does the plane fit? (cont)
The residual is the difference between the response and the fitted value
   r_i = y_i − (β̂_0 + β̂_1 x_i1 + β̂_2 x_i2)
       = y_i − ŷ_i
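In R the fitted values and residuals can be extracted from a fitted lm object directly; a small sketch with simulated data (the cherry data appear later in the lecture):

# Extract fitted values and residuals from an lm fit
set.seed(1)
x1 <- runif(30); x2 <- runif(30)
y  <- 5 - x1 + 4*x2 + rnorm(30)
fit <- lm(y ~ x1 + x2)
head(fitted(fit))      # fitted values: heights of the fitted plane at each (x_i1, x_i2)
head(residuals(fit))   # residuals r_i = y_i - fitted value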
How well does the plane fit? (cont.)
For a good fit, the residuals must
• be small relative to the y’s
• have no pattern (not depend on the fitted values, the x’s, etc.)
For the model to be useful, we should have a strong relationship between the response and the explanatory variables.
Measuring goodness of fit
How can we measure the relative size of the residuals and the strength of the relationship between y and the x’s?
The “ANOVA identity” gives us a way of doing this.
Measuring goodness of fit (cont.)
Consider the “residual sum of squares”
   RSS = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} r_i²
This provides an overall measure of the size of the residuals.
Measuring goodness of fit (cont.)
Consider the “total sum of squares”
   TSS = Σ_{i=1}^{n} (y_i − ȳ)²
This provides an overall measure of the variability of the data.
Math stuff: the ANOVA identity
Math result (the ANOVA identity):
   Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)²
that is,
   TSS = RegSS + RSS
where RegSS is the regression sum of squares.
Since RegSS is non-negative, RSS must be less than or equal to TSS, so RSS/TSS is between 0 and 1.
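A quick numerical check of the identity on simulated data (a sketch, not the course data):

# Verify TSS = RegSS + RSS numerically
set.seed(1)
x1 <- runif(40); x2 <- runif(40)
y  <- 5 - x1 + 4*x2 + rnorm(40)
fit   <- lm(y ~ x1 + x2)
TSS   <- sum((y - mean(y))^2)
RegSS <- sum((fitted(fit) - mean(y))^2)
RSS   <- sum(residuals(fit)^2)
all.equal(TSS, RegSS + RSS)   # TRUE (up to floating-point error)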
Interpretation
The smaller RSS is compared to TSS, the better the fit.
If RSS is 0 and TSS is positive, we have a perfect fit.
If RegSS = 0, then RSS = TSS and the fitted plane is flat (all coefficients except the constant are zero), so the x’s don’t help predict y.
R²
Another way to express this is the coefficient of determination, R².
This is defined as R² = 1 − RSS/TSS = RegSS/TSS.
R² is always between 0 and 1:
• 1 means a perfect fit
• 0 means a flat plane
R² is the square of the correlation between the observations and the fitted values.
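These equivalent forms are easy to check in R (a sketch on simulated data, not the course data):

# Three equivalent computations of R^2
set.seed(1)
x1 <- runif(40); x2 <- runif(40)
y  <- 5 - x1 + 4*x2 + rnorm(40)
fit <- lm(y ~ x1 + x2)
RSS <- sum(residuals(fit)^2)
TSS <- sum((y - mean(y))^2)
1 - RSS/TSS                    # the definition
cor(y, fitted(fit))^2          # squared correlation with the fitted values
summary(fit)$r.squared         # as reported by summary()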
Estimate of σ²
Recall that σ² controls the scatter of the observations about the regression plane:
• the bigger σ², the more scatter
• the bigger σ², the smaller R²
σ² is estimated by
   s² = RSS/(n − k − 1)
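In R, s is the “residual standard error” reported by summary(); a sketch with simulated data showing the two agree:

# s^2 = RSS/(n - k - 1) is the square of the residual standard error
set.seed(1)
n  <- 40
x1 <- runif(n); x2 <- runif(n)
y  <- 5 - x1 + 4*x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
RSS <- sum(residuals(fit)^2)
k   <- 2                        # number of covariates
RSS / (n - k - 1)               # s^2
summary(fit)$sigma^2            # the same value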
Calculations
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
summary(cherry.lm)
The lm function produces an “lm object” that contains all the information from fitting the regression.
lm = “linear model”; it begins with a lowercase letter “ell”, not a capital I!
Call:
lm(formula = volume ~ diameter + height, data = cherry.df)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
diameter      4.7082     0.2643  17.816  < 2e-16 ***
height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-Squared: 0.948,  Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF,  p-value: < 2.2e-16
Reading the output for the cherry data: the Residuals block summarizes the residuals, the Coefficients table gives the estimated coefficients, Multiple R-Squared is R², and the residual standard error is the estimate of σ.