Post on 06-Feb-2016
description
Topic 12: Multiple Linear Regression
Outline
• Multiple Regression– Data and notation–Model – Inference
• Recall notes from Topic 3 for simple linear regression
Data for Multiple Regression
• Yi is the response variable• Xi1, Xi2, … , Xi,p-1 are p-1
explanatory (or predictor) variables • Cases denoted by i = 1 to n
Multiple Regression Model
• Yi is the value of the response variable for the ith case
• β0 is the intercept• β1, β2, … , βp-1 are the regression
coefficients for the explanatory variables
0 1 1 2 2 1 , 1...i i i p i p iY X X X
Multiple Regression Model
• Xi,k is the value of the kth explanatory variable for the ith case
• ei are independent Normally distributed random errors with mean 0 and variance σ2
0 1 1 2 2 1 , 1...i i i p i p iY X X X
Multiple Regression Parameters
• β0 is the intercept• β1, β2, … , βp-1 are the regression
coefficients for the explanatory variables
• σ2 the variance of the error term
Interesting special cases
• Yi = β0 + β1Xi + β2Xi2 +…+ βp-1Xi
p-1+ ei
(polynomial of order p-1)• X’s can be indicator or dummy variables
taking the values 0 and 1 (or any other two distinct numbers)
• Interactions between explanatory variables (represented as the product of explanatory variables)
Interesting special cases• Consider the model Yi= β0 + β1Xi1+ β2Xi2+β3X i1Xi2+ ei
• If X2 a dummy variable–Yi = β0 + β1Xi + ei (when X2=0)
–Yi = β0 + β1Xi1+β2+β3Xi1+ ei (when X2=1)
= (β0+β2) + (β1+β3)Xi1+ ei
–Modeling two different regression lines at same time
Model in Matrix Form
n 1 n p n 1p 1
2
n n
2
n n
~N(0, )
~ ( , )
β
Y X
I
Y N X I
Least Squares
YXXbX equations normal Obtain
)XbY()XbY(SSE minimize to b Find
Least Squares Solution
Fitted (predicted) values
YX)XX(b 1
HY YX)XX(XXbY 1
Residuals
Y)HI( HYY YYe
is symetric and idempotent ( )( ) ( )
I H
I H I H I H
Covariance Matrix of residuals
• Cov(e)=σ2(I-H)(I-H)΄= σ2(I-H)
• Var(ei)= σ2(1-hii)
• hii= X΄i(X΄X)-1Xi
• X΄i =(1,Xi1,…,Xi,p-1)
• Residuals are usually correlated
• Cov(ei,ej)= -σ2hij
Estimation of σ2
2
n( ) ( )
n
E
sp
pSSE MSEdf
s s Root MSE
Y Xb Y Xb
Distribution of b• b = (X΄X)-1X΄Y• Since Y~N(Xβ, σ2I)• E(b)=((X΄X)-1X΄)Xβ=β• Cov(b)=σ2 ((X΄X)-1X΄)((X΄X)-1X΄)΄
=σ2(X΄X)-1
• σ2 (X΄X)-1 is estimated by s2 (X΄X)-1
ANOVA Table• Sources of variation are –Model (SAS) or Regression (KNNL)–Error (SAS, KNNL) or Residual–Total
• SS and df add as before–SSM + SSE =SSTO–dfM + dfE = dfTotal
Sums of Squares
n 2
ii 1
n2
i ii 1
n2
ii 1
ˆSSM Y Y
ˆSSE (Y Y )
SSTO (Y Y)
Degrees of Freedom
1-n-n
1-
Total
E
M
dfpdf
Mean Squares
Total
E
M
SSTO/df MSTSSE/df MSESSM/dfMSM
Mean Squares
n 2
ii 1
n2
i ii 1
n2
ii 1
ˆMSM Y Y / ( 1)
ˆMSE (Y Y ) / (n )
MST (Y Y) / (n 1)
p
p
ANOVA Table
Source SS df MS F
Model SSM dfM MSM MSM/MSE
Error SSE dfE MSE Total SSTO dfTotal MST
ANOVA F test
• H0: β1 = β2 = … = βp-1 = 0
• Ha: βk ≠ 0, for at least one k=1,., p-1
• Under H0, F ~ F(p-1,n-p)
• Reject H0 if F is large, use P-value
P-value of F test• The P-value for the F significance test
tells us one of the following:– there is no evidence to conclude that
any of our explanatory variables can help us to model the response variable using this kind of model (P ≥ .05)–one or more of the explanatory
variables in our model is potentially useful for predicting the response variable in a linear model (P ≤ .05)
R2
• The squared multiple regression correlation (R2) gives the proportion of variation in the response variable explained by all the explanatory variables
• It is usually expressed as a percent• It is sometimes called the coefficient of
multiple determination (KNNL p 226)
R2
• R2 = SSM/SST– the proportion of variation explained
• R2 = 1 – (SSE/SST)– 1 – the proportion not explained
• Can express F test is terms of R2
F = [ (R2)/(p-1) ] / [ (1- R2)/(n-p) ]
Background Reading
• We went over KNNL 6.1 - 6.5