
Multiple Regression Analysis

Whereas simple linear regression has 2 variables (1 dependent, 1 independent):

ŷ = a + bx

Multiple linear regression has more than 2 variables (1 dependent, many independent):

ŷ = a + b₁x₁ + b₂x₂ + … + bₙxₙ

The problems and solutions are the same as in bivariate regression, except there are more parameters to estimate.
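To make the two forms concrete, here is a minimal sketch (mine, not the lecture's, which uses SPSS) of fitting both equations by ordinary least squares in Python; the data arrays are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)                           # independent variable 1
x2 = rng.uniform(0, 10, 50)                           # independent variable 2
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1, 50)  # dependent variable

# Bivariate: y-hat = a + b*x
a, b = np.linalg.lstsq(np.column_stack([np.ones(50), x1]), y, rcond=None)[0]

# Multiple: y-hat = a + b1*x1 + b2*x2
a_m, b1, b2 = np.linalg.lstsq(np.column_stack([np.ones(50), x1, x2]), y, rcond=None)[0]

print(f"bivariate: y-hat = {a:.2f} + {b:.2f}x")
print(f"multiple:  y-hat = {a_m:.2f} + {b1:.2f}x1 + {b2:.2f}x2")
```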

In bivariate regression we fit a line through points plotted in 2-dimensional space:

In multiple regression with 3 variables we fit a plane through points plotted in 3-dimensional space:

Additional variables add additional dimensions to the variable space.

In addition to the assumptions of bivariate regression, multiple regression has the assumption of no multicollinearity among the independent variables.

Multicollinearity – when two or more of the independent variables are highly correlated, making it difficult to separate their effects on the dependent variable.

Example: Determine the strength of the relationship between Native American male standing height, average yearly minimum temperature, and annual temperature range.

Variables:

MHT         Male Standing Height (cm)   Dependent

AnnMinTemp  Annual Minimum Temp (ºF)    Independent

AnnRange    Annual Temp Range (ºF)      Independent

Model Summary (b)

Model 1: R = .654 (a), R Square = .428, Adjusted R Square = .416, Std. Error of the Estimate = 30.04066, Durbin-Watson = 1.683

a. Predictors: (Constant), AnnRange, AnnMinTemp
b. Dependent Variable: MHT

ANOVA (b)

Regression: Sum of Squares = 63546.875, df = 2, Mean Square = 31773.438, F = 35.208, Sig. = .000 (a)
Residual:   Sum of Squares = 84829.502, df = 94, Mean Square = 902.442
Total:      Sum of Squares = 148376.4, df = 96

a. Predictors: (Constant), AnnRange, AnnMinTemp
b. Dependent Variable: MHT

Coefficients (a)

(Constant):  B = 1665.620, Std. Error = 15.964, t = 104.334, Sig. = .000
AnnMinTemp:  B = 4.492, Std. Error = .603, Beta = .855, t = 7.446, Sig. = .000, Tolerance = .462, VIF = 2.166
AnnRange:    B = 1.565, Std. Error = .552, Beta = .325, t = 2.834, Sig. = .006, Tolerance = .462, VIF = 2.166

a. Dependent Variable: MHT

41.6% of the variance in height is explained by minimum temperature and temperature range (adjusted R Square = .416).

Model is significant.

Slopes are not zero. Some collinearity.

The regression equation is:

Male Standing Height = 1665.6 + 4.49(ºF min temp) + 1.57(ºF temp range)

It can be interpreted as follows:

Every 1ºF increase in minimum temperature adds 4.49 centimeters to male standing height, holding the temperature range constant.

Conversely, every 1ºF increase in the annual temperature range adds 1.57 centimeters to male standing height, holding the minimum temperature constant.

Normality of the residuals is one of the most important assumptions of linear regression. In this case the residuals are normally distributed.

The observed vs. predicted residuals do not display any systematic bias; such bias would indicate that the independent variables vary systematically with each other.

Tolerance is the amount of the variance in a given independent variable that cannot be explained by the other independent variables. In this case 46.2% of the variance in one cannot be explained by the other… meaning that 53.8% of the variance IS shared, or collinear.



This shared variance is why the standard error of the estimate is so large. The standard error of the estimate is the average error expressed in the original units (e.g., centimeters); 30 cm is a foot of error... in a person's height.

VIFs (variance inflation factors) higher than 2 are considered problematic (according to SPSS), and our VIFs are just over 2.1.
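As a check on these two columns, here is a minimal sketch (mine, not part of the lecture) of how Tolerance and VIF can be reproduced: regress each independent variable on the other(s), then Tolerance = 1 − R² and VIF = 1/Tolerance. With only two predictors, Tolerance reduces to 1 − r² of their Pearson correlation.

```python
import numpy as np

def tolerance_and_vif(x_target, x_others):
    """Tolerance and VIF of one predictor given the other predictor(s)."""
    X = np.column_stack([np.ones(len(x_target)), x_others])
    beta = np.linalg.lstsq(X, x_target, rcond=None)[0]
    resid = x_target - X @ beta
    r2 = 1 - resid.var() / x_target.var()   # R^2 of predictor-on-predictor fit
    return 1 - r2, 1 / (1 - r2)

# With two predictors this reduces to 1 - r^2; the lecture reports
# r = -.734 between AnnMinTemp and AnnRange:
tol = 1 - (-0.734) ** 2
print(round(tol, 3), round(1 / tol, 3))  # ~0.461 and ~2.168, matching .462 / 2.166 up to rounding
```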


The standardized beta values indicate the relative strength of the relationship between each independent variable and the dependent variable. Minimum temperature (beta = .855) is a much stronger predictor of height than annual range (beta = .325).


The question becomes: do these collinearity statistics rise to the level of indicating multicollinearity among the independent variables? In this example they do.

Correlations

AnnMinTemp vs. AnnRange: Pearson Correlation = -.734**, Sig. (2-tailed) = .000, N = 97

**. Correlation is significant at the 0.01 level (2-tailed).
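For completeness, a sketch (mine, not the lecture's) of computing the same statistic with numpy; the arrays below are hypothetical placeholders, not the lecture's 97-site dataset.

```python
import numpy as np

# Hypothetical placeholder values; the real dataset has N = 97 sites.
ann_min_temp = np.array([10.0, 5.0, -2.0, 18.0, 0.0, 7.0])
ann_range = np.array([40.0, 55.0, 62.0, 30.0, 58.0, 47.0])

r = np.corrcoef(ann_min_temp, ann_range)[0, 1]
print(f"Pearson r = {r:.3f}")   # the lecture's data gives r = -.734
```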

Misspecification – an error in the regression equation due to the exclusion of an independent variable that influences the dependent variable OR the inclusion of an independent variable that does not influence the dependent variable.

Misspecification errors are common since it is difficult to know a priori what factors influence the dependent variable.

Misspecification is a hypothesis issue, not a statistical one.

Data Transformation

Often the association between two variables is not linear. Data transformation (log, etc.) is perfectly acceptable. The type of transformation must be stated in your summary statement.

In this case, log transforming the population data created a linear relationship.

Converting to natural log is easy. For example, the mining town of Argentine has a population of 100; its natural log would be:

ln(pop) = ln(100) = 4.60517

Converting back to the original units is also easy:

pop = e^4.60517 = 100
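The same round trip in Python (a minimal sketch; the lecture itself does this on a calculator or in SPSS):

```python
import math

pop = 100                     # Argentine's population
ln_pop = math.log(pop)        # to natural log: 4.60517...
back = math.exp(ln_pop)       # back to original units: 100.0
print(round(ln_pop, 5), round(back))   # 4.60517 100
```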

Calculator transformations:

Converting to a log: use the ln key.

Converting from a log: use the e^x key.

SPSS transformations:

Converting to a log: Transform>Compute variable> Arithmetic>Ln

Converting from a log: Transform>Compute variable> Arithmetic>Exp

Population and Elevation in Colorado Mining Towns.

The model is significant. What is the standard error of the estimate telling us? What are the units?

Population = 46852.9 + (-4.238)(Elevation)

Ln(population) = 33.108 – 0.003(elevation)

Population and Elevation in Colorado Mining Towns: Log Transformation

The model is significant. What is the standard error of the estimate telling us? What are the units?

Town          Population  Elevation (ft)  ln(Population)  ln(Predicted)  ln(Residual)
Argentine     100         11161           4.61            4.90195        -.29678
Boreas        200         11535           5.30            3.95677        1.34155
Breckenridge  8000        9597            8.99            8.85453        .13267
Buckskin Joe  500         10860           6.21            5.66264        .55196
Chihuahua     200         10571           5.30            6.39301        -1.09469
Dudley        200         10400           5.30            6.82517        -1.52685
Fairplay      8000        9931            8.99            8.01043        .97676
Hamilton      3000        9997            8.01            7.84364        .16273
Horseshoe     800         10544           6.68            6.46125        .22337
Lamartine     500         10485           6.21            6.61035        -.39574
Lincoln       1500        10384           7.31            6.86560        .44762
Montezuma     800         10358           6.68            6.93131        -.24670
Mosquito      250         10720           5.52            6.01645        -.49499
Park City     300         10587           5.70            6.35258        -.64879
Parkville     10000       9944            9.21            7.97758        1.23276
Quartzville   200         11424           5.30            4.23729        1.06103
Rexford       50          11201           3.91            4.80086        -.88884
Sacramento    100         11398           4.61            4.30300        .30217
Saints John   200         10798           5.30            5.81933        -.52101
Silverheels   150         10771           5.01            5.88757        -.87693
Swandyke      200         11093           5.30            5.07380        .22452
Silver Plume  5500        9825            8.61            8.27832        .33418

Converting to Original Units from a Log Transformation

Town = Horseshoe
Population = 800
Elevation = 10,544 ft

Predicted ln(population) = 6.46125
Calculated ln(residual) = 0.22337

Converting to original units (people): population = e^6.46125 = 640

Converting the residual: residual = e^0.22337 = 1.25028 (A)

(A) This is the ratio of the actual to the predicted value.

Original population = (640)(1.25028) = 800.2

Residual in original units (people): difference = 800 – 640 = 160

That is, the equation under-predicted Horseshoe's population by 160 people.

Observed – Predicted = Residual

Example (Argentine): Population = 100, Elevation = 11,161 ft, ln(Population) = 4.61, ln(Predicted) = 4.90195, ln(Residual) = -0.29678

Observed Population = 100
ln(Predicted Population) = 4.90195
ln(Residual) = -0.29678

Predicted population = e^(ln predicted)

Residual = observed – predicted

What are the predicted population and residual values, in the original units?
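A sketch of the answer, applying the same steps as the Horseshoe example (my computation, not the lecture's worked solution):

```python
import math

# Argentine: observed = 100, ln(predicted) = 4.90195, ln(residual) = -0.29678
predicted = math.exp(4.90195)     # ~134.6 people
ratio = math.exp(-0.29678)        # ~0.743 = observed/predicted ratio
print(round(predicted * ratio))   # 100: recovers the observed population
print(round(100 - predicted, 1))  # residual in people: about -34.6 (over-predicted)
```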

Iterative Regression

If you are exploring a database for associations, one method is to use iterative regression.

Iterative Regression – an iterative procedure which either adds or removes variables from a regression model based on their significance.

IMPORTANT:

The SPSS stepwise procedure gives results that are inconsistent with the other methods. Due to this inconsistency it is recommended that the stepwise procedure not be used.

A better method of performing iterative regression is to use all variables with the enter procedure, then remove insignificant variables individually, OR use the backward or forward procedures (a sketch of the first approach follows the list below).

Types of Iterative Regression:

Enter – all variables are entered in a single step.

Stepwise – independent variables are entered based on the smallest F probability. Variables already in the equation are removed if their probability of F becomes too large.

Backward – all variables are entered into the equation and then sequentially removed based on the smallest partial correlation.

Forward – a stepwise variable selection procedure in which variables are sequentially entered into the model.
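Here is a minimal sketch of the recommended enter-then-remove approach (my illustration, assuming statsmodels is available; the column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, dependent, alpha=0.05):
    """Enter all predictors, then drop the least significant one at a time."""
    predictors = [c for c in df.columns if c != dependent]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[dependent], X).fit()
        pvals = model.pvalues.drop("const")
        if pvals.max() <= alpha:           # all slopes significant: done
            return model
        predictors.remove(pvals.idxmax())  # remove one variable, refit
    return None

# Tiny demonstration with synthetic data:
rng = np.random.default_rng(1)
df = pd.DataFrame({"sqft": rng.uniform(800, 3000, 200),
                   "noise": rng.normal(size=200)})   # irrelevant predictor
df["value"] = 50 * df["sqft"] + rng.normal(0, 5000, 200)
print(backward_eliminate(df, "value").model.exog_names)  # 'noise' is usually eliminated
```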

Harrisburg Housing Value (Iterative using the Enter procedure)

In the full model, two of the slope coefficients are not significant.

Predicted value ($) = -233435.212 + 19.515(Square Feet) + 143.475(Year Built) – 3848.55 (Bedrooms) + 10101.928(Half Baths) + 4.545(Parcel Size) – 12.126(Distance to Front St)

With the insignificant variables removed, little else changes, and all slopes are now significant.

Standardized Coefficients

Standardized or beta coefficients are slope values that have been standardized so that their variances are 1. They can be used to determine which of the independent variables have a greater effect on the dependent variable when the variables are measured in different units of measurement.

In this case, Square Feet and Distance to Front Street are having the greatest effect.

705 ½ South Front Street
Value = $133,900
Square Feet = 2380
Parcel Size = 2975
Distance to Front Street = 84
Year Built = 1900
Bedrooms = 3
Half Baths = 1

Predicted value ($) = -233435.212 + 19.515(2380) + 143.475(1900) – 3848.55 (3) + 10101.928(1) + 4.545(2975) – 12.126(84)

Predicted value ($) = -233435.212 + 46445.7 + 272602.5 – 11545.65 + 10101.928 + 13521.375 – 1018.584

Predicted value ($) = 96672.06

Residual ($) = 96672.06 – 133900 = -37227.94. This is not surprising considering that the r² was 0.591. Over 40% of the variation in housing value is not explained by this model.
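A quick check of this arithmetic in Python (my sketch; the coefficient names are shorthand):

```python
coefs = {"sqft": 19.515, "year_built": 143.475, "bedrooms": -3848.55,
         "half_baths": 10101.928, "parcel": 4.545, "dist_front": -12.126}
house = {"sqft": 2380, "year_built": 1900, "bedrooms": 3,
         "half_baths": 1, "parcel": 2975, "dist_front": 84}

predicted = -233435.212 + sum(coefs[k] * house[k] for k in coefs)
print(round(predicted, 2))           # 96672.06
print(round(predicted - 133900, 2))  # residual: -37227.94
```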

Mapping Regression Residuals

Temperature Recording Sites, Kyrgyzstan Region

Average yearly temperature is influenced by:

Elevation: 6.4˚C per 1000 m elevation change.

Latitude: 4.0˚C per 1000 km latitude change.

To what degree can we predict temperature based on both elevation and latitude?

(Scatterplots: average temperature vs. elevation and vs. latitude.)

Model Summary (b)

Model 1: R = .824 (a), R Square = .679, Adjusted R Square = .677, Std. Error of the Estimate = 3.48693

a. Predictors: (Constant), Elevation
b. Dependent Variable: Average Temperature

ANOVA (a)

Regression: Sum of Squares = 4936.797, df = 1, Mean Square = 4936.797, F = 406.031, Sig. = .000 (b)
Residual:   Sum of Squares = 2334.466, df = 192, Mean Square = 12.159
Total:      Sum of Squares = 7271.264, df = 193

a. Dependent Variable: Average Temperature
b. Predictors: (Constant), Elevation

Coefficients (a)

(Constant): B = 14.683, Std. Error = .390, t = 37.691, Sig. = .000
Elevation:  B = -.005, Std. Error = .000, Beta = -.824, t = -20.150, Sig. = .000

a. Dependent Variable: Average Temperature

Predicted Temperature = 14.683 – 0.005(Elevation)

Model: Elevation

The standard error of the estimate is about 3.5 ˚C, roughly half of the 6.4 ˚C change per 1000 m of elevation.

This model is not very accurate.


A missing explanatory variable is suspected; which one is unknown.

Model Summary (b)

Model 1: R = .254 (a), R Square = .065, Adjusted R Square = .060, Std. Error of the Estimate = 5.95185

a. Predictors: (Constant), Latitude
b. Dependent Variable: Average Temperature

ANOVA (a)

Regression: Sum of Squares = 469.747, df = 1, Mean Square = 469.747, F = 13.260, Sig. = .000 (b)
Residual:   Sum of Squares = 6801.516, df = 192, Mean Square = 35.425
Total:      Sum of Squares = 7271.264, df = 193

a. Dependent Variable: Average Temperature
b. Predictors: (Constant), Latitude

Coefficients (a)

(Constant): B = 30.470, Std. Error = 6.002, t = 5.077, Sig. = .000
Latitude:   B = -.531, Std. Error = .146, Beta = -.254, t = -3.641, Sig. = .000

a. Dependent Variable: Average Temperature

Model: Latitude

This R Square is very low (.065).

The standard error of the estimate is about 6 ˚C, nearly as large as the 6.4 ˚C change per 1000 m of elevation.

This model is also not very accurate. By itself, the variable latitude is not a good predictor of temperature.


This similarity in residual pattern suggests that elevation and latitude combined may produce a strong predictive model.

Model Summary (b)

Model 1: R = .952 (a), R Square = .907, Adjusted R Square = .906, Std. Error of the Estimate = 1.88058

a. Predictors: (Constant), Elevation, Latitude
b. Dependent Variable: Average Temperature

ANOVA (a)

Regression: Sum of Squares = 6595.775, df = 2, Mean Square = 3297.887, F = 932.505, Sig. = .000 (b)
Residual:   Sum of Squares = 675.489, df = 191, Mean Square = 3.537
Total:      Sum of Squares = 7271.264, df = 193

a. Dependent Variable: Average Temperature
b. Predictors: (Constant), Elevation, Latitude

Model: Elevation + Latitude

Coefficients (a)

(Constant): B = 57.936, Std. Error = 2.008, t = 28.852, Sig. = .000
Elevation:  B = -.005, Std. Error = .000, Beta = -.949, t = -41.620, Sig. = .000, Zero-order = -.824, Partial = -.949, Part = -.918, Tolerance = .936, VIF = 1.068
Latitude:   B = -1.032, Std. Error = .048, Beta = -.494, t = -21.658, Sig. = .000, Zero-order = -.254, Partial = -.843, Part = -.478, Tolerance = .936, VIF = 1.068

a. Dependent Variable: Average Temperature

The standard error of the estimate is less than 2 ˚C, making this by far the best of the three models.

This model is very accurate, explaining about 90% of the variance in temperature.
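A sketch of using the combined model for prediction (my illustration; note that the table's B for elevation is rounded to three decimals, so hand predictions are only approximate):

```python
def predict_temp(elevation_m, latitude_deg):
    # Combined model from the coefficients table (rounded B values).
    return 57.936 - 0.005 * elevation_m - 1.032 * latitude_deg

# Susamyr, from the under-prediction table below: elev = 2087 m, lat = 42.2
print(round(predict_temp(2087, 42.2), 2))  # ~3.95 degC with the rounded coefficients
```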


(Residual map: one possible outlier; significantly over- and under-predicted locations are flagged.)

There does not appear to be any spatial pattern to the distribution of residuals.

• The residuals appear to be spatially random.

• The number of large over/under predictions is about equal.

• It might be a good idea to examine large over/under predicted locations in greater detail.

Over-prediction:

Name        Lon    Lat    Elev  Temp   Resid
Humrogi     71.33  38.28  1737  12.17  3.10
Dzhergetal  73.1   41.57  1800  10.43  5.10
Gasan-kuli  39.22  52.22  23    16.06  12.13

Under-prediction:

Name     Lon    Lat    Elev  Temp   Resid
Kushka   62.35  35.28  57    15.23  -5.99
Susamyr  74     42.2   2087  -1.95  -5.06
Aksai    76.49  42.07  3135  -7.27  -4.86

An initial inspection does not show any locational influences, with the exception of Gasan-kuli, which is located far from the other sites.

(Map: locations of Gasan-kuli, Kushka, Susamyr, Humrogi, Aksai, and Dzhergetal.)

Key Points:

1. Let theory drive your selection of independent variables.
• Individual variable analyses (regressions) were misleading.

2. Use the tools available.
• Both statistics and graphs.

3. Map residuals and look for patterns.
• Patterns may be of interest.
• The absence of patterns is NOT a failure.