
MULTIPLE REGRESSION

Introduction

• Used when one wishes to determine the relationship between a single dependent variable and a set of independent variables.

• The dependent variable (Y) is typically continuous.

• The independent variables (X1, X2, X3 . . . XP) are typically continuous, but they can be fixed as well.

• When the total number of parameters (one dependent plus one independent variable) is 2 (P=2), then the resultant figure is a line.

• When the number of parameters is P=3 (one dependent plus two independent variables), then the resultant figure is a plane.

• The equation for linear multiple regression can be written as:

Y = b0 + b1X1 + b2X2 + . . . + bPXP

• Where b0 = Y intercept.

• b1 through bP = partial regression coefficients with respect to X1, X2, . . ., XP.

• b1 through bP also represent the slopes of the regression surface with respect to X1, X2, . . ., XP: a hyperplane when all X are fixed, and an ellipsoid when all X are continuous (variable).

• b1 is the rate of change of the mean of Y as a function of X1 when X2, . . ., XP are held constant.

• The linear model can be written as:

Yi = β0 + β1X1i + β2X2i + . . . + βPXPi + εi

Analysis of Variance in Multiple Regression

• The null hypothesis being tested is: H0: β1 = β2 = . . . = βP = 0

• In words, the null hypothesis is that there is no linear relationship between the dependent variable and the independent variables.  

Sources of variation    df       Sum of Squares    Mean square       F
Due to regression       P        Σ(Ŷi − Ȳ)²        SSReg/P           MSReg/MSRes
Residual                N-P-1    Σ(Yi − Ŷi)²       SSRes/(N-P-1)
Total                   N-1      Σ(Yi − Ȳ)²


• Where P = number of independent parameters and N = total number of observations.
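• As a quick arithmetic check, the F statistic can be recomputed from the mean squares. Below is a minimal SAS sketch using the values printed in the PROC REG output later in this handout (MSReg = 10.52848, MSRes = 0.28600, P = 2, N = 150); PROBF is the SAS F-distribution CDF:

DATA f_check;
   ms_reg = 10.52848;             /* MSReg from the output below       */
   ms_res = 0.28600;              /* MSRes from the output below       */
   f      = ms_reg / ms_res;      /* F = MSReg/MSRes = 36.81           */
   p      = 1 - PROBF(f, 2, 147); /* p-value with df = P and N-P-1     */
   PUT f= p=;                     /* writes the results to the SAS log */
RUN;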

• The hypothesis H0: βi = 0 can be tested for each independent variable using a t-test:

t = bi / SE(bi), with df = N − P − 1
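• For example, a sketch of this t-test using the fage estimates printed in the PROC REG output later in the handout (b = -0.02664, SE = 0.00637, df = 147); PROBT is the SAS t-distribution CDF:

DATA t_coef;
   b  = -0.02664;                     /* partial regression coefficient for fage */
   se =  0.00637;                     /* its standard error                      */
   t  = b / se;                       /* t = -4.18, matching the output          */
   p  = 2 * (1 - PROBT(ABS(t), 147)); /* two-sided p-value, df = N-P-1           */
   PUT t= p=;
RUN;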

• Coefficient of determination: R² = SSReg/SSTotal

• The R2 value generally overestimates the population correlation value; thus, an adjusted R2 value may be desired.

• The bias in R² occurs because, as the number of parameters in the model increases, the numerator (SSReg) can only increase while the denominator (SSTotal) stays the same. Therefore, each additional variable cannot decrease R²; it can only produce a similar or larger value.

• Adjusted R² = R² − P(1 − R²) / (N − P − 1)
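• A sketch verifying the adjustment with the values from the PROC REG output later in the handout (R² = 0.3337, P = 2, N = 150):

DATA adj_r2;
   r2  = 0.3337;                           /* R-square from the output        */
   p   = 2;                                /* number of independent variables */
   n   = 150;                              /* number of observations          */
   adj = r2 - p * (1 - r2) / (n - p - 1);  /* about 0.3246, matching the
                                              printed Adj R-Sq of 0.3247 up
                                              to rounding of the inputs       */
   PUT adj=;
RUN;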

Regression With Variable-X

• To characterize the joint distribution in the variable-X case, we need the means µ1, µ2, . . ., µP and µY; the standard deviations σ1, σ2, . . ., σP and σY; and the covariances of the X and Y variables.

• The variances and covariances from the analysis are displayed as a symmetric matrix, with the variances on the main diagonal.

        X1      X2      X3      Y
X1      σ1²     σ12     σ13     σ1Y
X2      σ12     σ2²     σ23     σ2Y
X3      σ13     σ23     σ3²     σ3Y
Y       σ1Y     σ2Y     σ3Y     σY²

• The estimates for the correlations also can be displayed as a symmetrical matrix, with the diagonals equal to 1 because the correlation of a value with itself is one.

        X1      X2      X3      Y
X1      1       r12     r13     r1Y
X2      r12     1       r23     r2Y
X3      r13     r23     1       r3Y
Y       r1Y     r2Y     r3Y     1

• If tests of significance are wanted on the correlation values (H0: ρ = 0), they can be calculated using the formula:

t = r√(N − 2) / √(1 − r²), with df = N − 2
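• A sketch applying this to the simple correlation between ffev1 and fage reported later in the handout (r = -0.30948, N = 150):

DATA t_corr;
   r = -0.30948;                          /* corr(ffev1, fage) from PROC CORR */
   n = 150;
   t = r * SQRT(n - 2) / SQRT(1 - r**2);  /* about -3.96                      */
   p = 2 * (1 - PROBT(ABS(t), n - 2));    /* two-sided p, about 0.0001,       */
   PUT t= p=;                             /* matching the PROC CORR output    */
RUN;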


• Standardized regression coefficients, i.e. the coefficients that would be obtained if X and Y were standardized before the analysis, can be determined using the formula:

Standardized βi = βi × (standard deviation of Xi) / (standard deviation of Y)
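• A sketch computing these from the parameter estimates and standard deviations printed later in the handout; note that PROC REG can also print standardized coefficients directly via the STB option on the MODEL statement:

DATA std_beta;
   b_fage     = -0.02664;   sd_fage    = 6.89000;   /* from Parameter Estimates */
   b_fheight  =  0.11440;   sd_fheight = 2.77919;   /* and Simple Statistics    */
   sd_ffev1   =  0.65075;
   stb_fage    = b_fage    * sd_fage    / sd_ffev1; /* about -0.28 */
   stb_fheight = b_fheight * sd_fheight / sd_ffev1; /* about  0.49 */
   PUT stb_fage= stb_fheight=;
RUN;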

Multiple Correlation

• In multiple correlation the statistic is denoted as R.

• This value represents the correlation between Y and the point on the regression plane for all possible combinations of X.

• Each individual in the population has a Y value and a corresponding point on the plane calculated as:

Y′ = β0 + β1X1 + β2X2 + . . . + βPXP

• The value for R is the population simple correlation between all Y and Y’ values.

• R also is the highest possible simple correlation between Y and any linear combination of X1 to XP.

• Thus, the minimum value of R is 0 and the maximum value is 1.0.

• When R approaches 0, the regression plane predicts Y no better than simply using the mean Ȳ.

• An R = 1.0 indicates a perfect fit of the plane with the points in the population.
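• This can be seen empirically by saving the predicted values from PROC REG and correlating them with Y. Below is a sketch, assuming the lung function data are in a SAS data set named lung (the simple correlation between ffev1 and yhat is R; here R = √0.3337 ≈ 0.578):

PROC REG DATA=lung;
   MODEL ffev1 = fage fheight;
   OUTPUT OUT=pred P=yhat;   /* save predicted values as yhat */
RUN;

PROC CORR DATA=pred;
   VAR ffev1 yhat;           /* this correlation equals R     */
RUN;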

Partial Correlation

• Simple correlation may not always allow us to clearly determine the relationship between two variables because other variables may be influencing the results.

• Partial correlation analysis involves studying the linear relationship between two variables while controlling for the effect of one or more other variables.

• This technique is often used in “causal” modeling of small numbers of variables.

• For example, assume you have the variables Y, X1, and X2 and you wish to determine if the correlation between Y and X1 is influenced by X2.

o You can determine if there is a causal relationship by calculating the partial correlation of Y with X1 while controlling for variable X2 (written as rY1.2).


§ In partial correlation analysis, the first step is to compare the partial correlation (e.g. rY1.2) with the original correlation (rY1).

§ If the partial correlation approaches 0, the inference is that the original correlation may be spurious and that there is no direct causal link between the two original variables.

o An example using the lung function dataset: we wish to determine the partial correlation between father's lung function (ffev1) and father's age (fage) while controlling for father's height (fheight). This partial correlation can be written as rffev1 fage.fheight. The number in parentheses is the probability of a greater |r| under H0: ρ = 0.

         Simple correlation with fage   Partial correlation with fage, controlling for fheight
ffev1    -0.30948 (0.0001)              -0.32613 (<.0001)

• Since the partial correlation value of -0.32613 is similar to the original correlation of -0.30948, we can conclude that the original correlation between ffev1 and fage was likely not affected by fheight.
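• With a single controlled variable, the partial correlation can also be computed directly from the three simple correlations using the standard first-order formula rY1.2 = (rY1 − rY2·r12) / √((1 − rY2²)(1 − r12²)). A sketch with the simple correlations from the PROC CORR output later in the handout reproduces the value above:

DATA partial_r;
   r_y1 = -0.30948;   /* corr(ffev1, fage)    */
   r_y2 =  0.50440;   /* corr(ffev1, fheight) */
   r_12 = -0.05615;   /* corr(fage, fheight)  */
   r_y1_2 = (r_y1 - r_y2 * r_12) / SQRT((1 - r_y2**2) * (1 - r_12**2));
   PUT r_y1_2=;       /* about -0.32613, matching the PARTIAL output */
RUN;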

Considerations When Conducting Multiple Regression and Partial Correlation

• Regression is much more sensitive to violations of the assumptions underlying the analysis and to problematic data such as outliers.

• If you are analyzing economic or time series data, multicollinearity (i.e. high correlation among independent variables) may be a problem.

• Because of the limited time for this course, we are unable to discuss running diagnostics on your data to identify potential problems.

• If you are going to use multiple regression, I encourage you to work with a statistician to learn more about running diagnostics on your data.

• Any observation having missing data will be excluded from analyses using SAS.

Examples of Analyses

• Using the lung function data.
• Dependent variable is father's fev1 (ffev1).
• Independent variables are father's age (fage) and father's height (fheight).
• SAS commands for multiple regression:

PROC REG;
   MODEL ffev1 = fage fheight;
RUN;
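• The handout does not show how the lung function data are read in; a minimal, hypothetical sketch (the file name lung.dat and its column layout are assumptions, not part of the original notes) would be:

DATA lung;
   INFILE 'lung.dat';          /* assumed path to the raw data */
   INPUT ffev1 fage fheight;   /* one record per father        */
RUN;

The PROC steps can then reference this data set with DATA=lung.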


• SAS commands for producing the variance-covariance matrix.

PROC CORR COVAR NOPROB;
   VAR ffev1 fage fheight;
RUN;

• SAS commands for producing the partial correlations of ffev1 and fage with fheight controlled.

PROC CORR;
   VAR ffev1 fage;
   PARTIAL fheight;
   TITLE 'Partial Correlation of ffev1 and fage with fheight controlled';
RUN;


Multiple regression of father age and father height on father fev1

The REG Procedure
Model: MODEL1
Dependent Variable: ffev1

Number of Observations Read 150

Number of Observations Used 150

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         21.05697      10.52848     36.81   <.0001
Error             147         42.04133       0.28600
Corrected Total   149         63.09830

Root MSE          0.53479    R-Square   0.3337
Dependent Mean    4.09327    Adj R-Sq   0.3247
Coeff Var        13.06500

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             -2.76075          1.13775     -2.43     0.0165
fage         1             -0.02664          0.00637     -4.18     <.0001
fheight      1              0.11440          0.01579      7.25     <.0001

Ŷ = -2.76075 - 0.02664(fage) + 0.11440(fheight)

Reject H0: βfage = 0 at the 95% and 99% levels of confidence.
Reject H0: βfheight = 0 at the 95% and 99% levels of confidence.
Both fage and fheight contribute significantly to explaining the variation in ffev1.
33.4% of the variation in ffev1 is explained collectively by fage and fheight.
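As an illustration of using the fitted equation, here is a sketch predicting ffev1 for a hypothetical father aged 40 with a height of 69 inches (values chosen near the sample means):

DATA predict_one;
   fage = 40;   fheight = 69;   /* hypothetical father */
   yhat = -2.76075 - 0.02664*fage + 0.11440*fheight;
   PUT yhat=;                   /* about 4.07, close to the dependent mean 4.09 */
RUN;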


Variance-Covariance matrix and Simple Linear Correlation

The CORR Procedure

3 Variables: ffev1 fage fheight

Covariance Matrix, DF = 149

            ffev1           fage            fheight
ffev1       0.42347852     -1.38761969      0.91223221
fage       -1.38761969     47.47203579     -1.07516779
fheight     0.91223221     -1.07516779      7.72389262

Simple Statistics

Variable    N      Mean       Std Dev    Sum          Minimum    Maximum
ffev1      150    4.09327     0.65075     613.99000    2.50000    5.85000
fage       150   40.13333     6.89000   6020          26.00000   59.00000
fheight    150   69.26000     2.77919  10389          61.00000   76.00000

Pearson Correlation Coefficients, N = 150
Prob > |r| under H0: Rho=0

            ffev1                fage                 fheight
ffev1       1.00000             -0.30948 (0.0001)     0.50440 (<.0001)
fage       -0.30948 (0.0001)     1.00000             -0.05615 (0.4949)
fheight     0.50440 (<.0001)    -0.05615 (0.4949)     1.00000


Variance-Covariance matrix and Simple Linear Correlation

The CORR Procedure

1 Partial Variables: fheight

2 Variables: ffev1 fage

Simple Statistics

Variable    N      Mean       Std Dev    Sum          Minimum    Maximum    Partial Variance   Partial Std Dev
fheight    150   69.26000     2.77919  10389          61.00000   76.00000
ffev1      150    4.09327     0.65075     613.99000    2.50000    5.85000    0.31787            0.56380
fage       150   40.13333     6.89000   6020          26.00000   59.00000   47.64212            6.90233

Pearson Partial Correlation Coefficients, N = 150
Prob > |r| under H0: Partial Rho=0

            ffev1                fage
ffev1       1.00000             -0.32613 (<.0001)
fage       -0.32613 (<.0001)     1.00000

Since the partial correlation value of -0.32613 is similar to the original correlation of -0.30948, we can conclude that the original correlation between ffev1 and fage was likely not affected by fheight.