Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction.


Outline (for week)

• Simple bivariate regression, least-squares fit line

– The general linear model

– Residual plots

– Using SPSS

• Multiple regression

– Comparing models (Δr²)

– Using SPSS


Regression

• Last time: with correlation, we examined whether variables X & Y are related

• This time: with regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.


Regression

• Last time: “it doesn’t matter which variable goes on the X-axis or the Y-axis”

• For regression this is NOT the case

• The variable that you are predicting goes on the Y-axis (criterion variable, the predicted variable)

• The variable that you are making the prediction based on goes on the X-axis (predictor variable, the predicting variable)

[Figure: scatterplot with Quiz performance on the Y-axis (predicted variable) and Hours of study on the X-axis (predicting variable); both axes run from 1 to 6]


Regression

• Last time: “Imagine a line through the points”

• But there are lots of possible lines

• One line is the “best fitting line”

• Today: learn how to compute the equation corresponding to this “best fitting line”

[Figure: scatterplot of Quiz performance (Y) against Hours of study (X); both axes run from 1 to 6]


The equation for a line

• A brief review of geometry

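Y = (X)(slope) + (intercept)

• Intercept: the value of Y when X = 0. For the example line, intercept = 2.0

• Slope: (change in Y) / (change in X). For the example line, Y rises 1 unit for every 2 units of X, so slope = 0.5

• Putting the two together for the example line: Y = (X)(0.5) + 2.0

[Figure: the example line plotted on X and Y axes running from 0 to 6, with the intercept (2.0) and the slope (0.5) marked]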


Regression

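• A brief review of geometry

• Consider a perfect correlation

Y = (X)(0.5) + (2.0)

• Can make specific predictions about Y based on X

X = 5, Y = ?

Y = (5)(0.5) + (2.0)

Y = 2.5 + 2 = 4.5

[Figure: scatterplot (axes 1 to 6) for a perfect correlation, with the line Y = (X)(0.5) + (2.0) and the predicted value 4.5 at X = 5 marked]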
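Making a prediction from a line is just plugging X into its equation. A minimal Python sketch of that step, using the example slope and intercept; the function name `predict` is an illustrative assumption, not something from the lecture:

```python
def predict(x, slope=0.5, intercept=2.0):
    """Return the predicted Y for a given X on the line Y = (X)(slope) + (intercept)."""
    return x * slope + intercept

print(predict(5))  # 4.5
```

Calling `predict(0)` returns the intercept, 2.0, which matches the definition of the intercept as the value of Y when X = 0.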


Regression

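• Consider a less than perfect correlation

• The line still represents the predicted values of Y given X

Y = (X)(0.5) + (2.0)

X = 5, Y = ?

Y = (5)(0.5) + (2.0)

Y = 2.5 + 2 = 4.5

[Figure: scatterplot (axes 1 to 6) for a less than perfect correlation, with the line Y = (X)(0.5) + (2.0) and the predicted value 4.5 at X = 5 marked]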


Regression

[Figure: scatterplot of Y against X; both axes run from 1 to 6]

• The “best fitting line” is the one that minimizes the error between the predicted scores (the line) and the actual scores (the points). For the least-squares line, the quantity minimized is the sum of the squared differences.

• Rather than comparing the errors from different lines and picking the best, we will directly compute the equation for the best fitting line
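To make "comparing the errors from different lines" concrete, here is a small Python sketch that scores two candidate lines by their sum of squared errors. It uses the five (X, Y) pairs from this lecture's worked example, the line from the geometry review, and the line computed later in the lecture; the function and variable names are illustrative assumptions.

```python
# The five (X, Y) pairs used in this lecture's worked example
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]

def sum_squared_error(slope, intercept):
    """Sum of squared differences between each actual Y and the line's predicted Y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

sse_geometry_line = sum_squared_error(0.5, 2.0)    # line from the geometry review
sse_fitted_line = sum_squared_error(0.92, 0.688)   # line computed later in the lecture

print(round(sse_geometry_line, 2), round(sse_fitted_line, 2))
```

The fitted line's squared error (about 3.11) is roughly half that of the geometry-review line (6.0), which is the sense in which it fits better.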


Regression

• The linear model

Y = intercept + slope (X) + error

$Y = \beta_0 + \beta_1 X + \varepsilon$

Betas (β) are sometimes called parameters

Come in two types:

• standardized: $Z_Y = (\beta)(Z_X) + \varepsilon$

• unstandardized: $Y = \beta_0 + \beta_1 X + \varepsilon$

Now let’s go through an example computing these things


Scatterplot

• Using the dataset from our correlation lecture

X    Y
6    6
1    2
5    6
3    4
3    2

[Figure: scatterplot of the five (X, Y) points; both axes run from 1 to 6]


From the Computing Pearson’s r lecture

         X      Y      X − X̄     Y − Ȳ     (X − X̄)(Y − Ȳ)     (X − X̄)²     (Y − Ȳ)²
         6      6       2.4       2.0           4.8             5.76         4.0
         1      2      -2.6      -2.0           5.2             6.76         4.0
         5      6       1.4       2.0           2.8             1.96         4.0
         3      4      -0.6       0.0           0.0             0.36         0.0
         3      2      -0.6      -2.0           1.2             0.36         4.0
mean    3.6    4.0
sum                     0.0       0.0          14.0            15.20        16.0
                                               (SP)            (SSX)        (SSY)
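The three totals in this table (SP, SSX, SSY) can be reproduced in a few lines. The following is a minimal Python sketch with illustrative variable names, not part of the original lecture:

```python
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]

mean_x = sum(xs) / len(xs)   # 3.6
mean_y = sum(ys) / len(ys)   # 4.0

# Deviation of each score from its own mean
dev_x = [x - mean_x for x in xs]
dev_y = [y - mean_y for y in ys]

sp = sum(dx * dy for dx, dy in zip(dev_x, dev_y))   # sum of products
ss_x = sum(dx ** 2 for dx in dev_x)                 # sum of squares for X
ss_y = sum(dy ** 2 for dy in dev_y)                 # sum of squares for Y

print(round(sp, 2), round(ss_x, 2), round(ss_y, 2))  # 14.0 15.2 16.0
```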


Computing regression line (with raw scores)

From the Pearson's r table: SP = 14.0, SSX = 15.20, SSY = 16.0; mean of X = 3.6, mean of Y = 4.0

$\text{slope} = b = \dfrac{SP}{SS_X} = \dfrac{14}{15.2} = 0.92$

$\text{intercept} = a = \bar{Y} - b\bar{X} = 4.0 - (0.92)(3.6) = 0.688$
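The two formulas translate directly into code. This is a hedged Python sketch; the helper name `fit_line` is an assumption made for illustration. Note that it carries the unrounded slope into the intercept, so it gives an intercept of about 0.684, whereas the slide's 0.688 comes from using the rounded slope 0.92.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) using b = SP / SSX and a = mean(Y) - b * mean(X)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return b, a

b, a = fit_line([6, 1, 5, 3, 3], [6, 2, 6, 4, 2])
print(round(b, 2), round(a, 3))
```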


Computing regression line (with raw scores)

slope = b = 0.92

intercept = a = 0.688

$\hat{Y} = 0.92X + 0.688$

The two means will be on the line: the point (X̄, Ȳ) = (3.6, 4.0) falls on the regression line

[Figure: scatterplot of the five data points (axes 1 to 6) with the regression line drawn and the means of X and Y marked on the axes]
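One way to see why the two means always fall on the line is to substitute the mean of X into the fitted equation and use the intercept formula a = Ȳ − bX̄ from the raw-score computation. A short worked derivation:

```latex
\hat{Y}\,\big|_{X=\bar{X}} = b\bar{X} + a = b\bar{X} + (\bar{Y} - b\bar{X}) = \bar{Y},
\qquad \text{here: } (0.92)(3.6) + 0.688 = 3.312 + 0.688 = 4.0
```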


Computing regression line (standardized, using z-scores)

• Sometimes the regression equation is standardized.

– Computed based on z-scores rather than with raw scores

            X      Y      X − X̄     (X − X̄)²     Y − Ȳ     (Y − Ȳ)²      ZX       ZY
            6      6       2.4        5.76         2.0        4.0        1.38      1.1
            1      2      -2.6        6.76        -2.0        4.0       -1.49     -1.1
            5      6       1.4        1.96         2.0        4.0        0.8       1.1
            3      4      -0.6        0.36         0.0        0.0       -0.34      0.0
            3      2      -0.6        0.36        -2.0        4.0       -0.34     -1.1
Mean       3.6    4.0
Sum                        0.0       15.20         0.0       16.0
Std dev    1.74   1.79

Each standard deviation here is the square root of SS divided by N (N = 5).


Computing regression line (standardized, using z-scores)

• Prediction model

– Predicted Z score (on criterion variable) = standardized regression coefficient multiplied by Z score on predictor variable

– Formula: $\hat{Z}_Y = (\beta)(Z_X)$

– The standardized regression coefficient (β)

• In bivariate prediction, β = r


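Computing regression line (with z-scores)

$\hat{Z}_Y = (\beta)(Z_X)$

slope = β = r ≈ 0.90

intercept = 0.0 (z-scores have a mean of 0, and the line passes through the means)

[Figure: the standardized regression line plotted on ZX and ZY axes running from −2 to 2, passing through the origin]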
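The claim that the standardized slope equals r in the bivariate case can be checked numerically. The Python sketch below is an illustration under stated assumptions: it divides SS by N when computing the standard deviation (which reproduces the 1.74 and 1.79 in the lecture's table), and the helper name `z_scores` is hypothetical.

```python
import math

xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
n = len(xs)

def z_scores(values):
    """Convert raw scores to z-scores, using sqrt(SS / N) as the standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

zx, zy = z_scores(xs), z_scores(ys)

# Pearson's r as the mean product of paired z-scores
r = sum(a * b for a, b in zip(zx, zy)) / n

# Least-squares slope for predicting ZY from ZX (SP / SS of the predictor, on z-scores)
beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

print(round(r, 3), round(beta, 3))
```

Both values come out at about 0.898, because the sum of squared z-scores of the predictor equals N, so the two formulas reduce to the same expression.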


Regression

• Also need a measure of error

[Figure: two scatterplots (axes 1 to 6), each labeled Y = X(.5) + (2.0) + error]

• Same line, but different relationships (strength difference)

• The linear equation isn’t the whole thing

Y = intercept + slope (X) + error


Regression

• Error

– Actual score minus the predicted score

• Measures of error

– r² (r-squared)

– Proportionate reduction in error = $\dfrac{SS_{total} - SS_{error}}{SS_{total}}$

• Note:

– Total squared error when predicting from the mean = SStotal = SSY

– Squared error using prediction model = Sum of the squared residuals = SSresidual = SSerror


R-squared

• r² represents the proportion of variance in Y accounted for by X (usually reported as a percentage)

[Figure: two scatterplots; both axes run from 1 to 6]

First plot: r = 0.8, r² = 0.64, 64% variance explained

Second plot: r = 0.5, r² = 0.25, 25% variance explained


Computing Error around the line

• Compute the difference between the predicted values and the observed values (“residuals”)

• Square the differences

• Add up the squared differences

[Figure: scatterplot of the example data with the regression line; both axes run from 1 to 6]

• Sum of the squared residuals = SSresidual = SSerror
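The three steps above (difference, square, sum) can be sketched in Python as follows. The sketch assumes the lecture's five data points and its rounded line Ŷ = 0.92X + 0.688; the variable names are illustrative.

```python
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]

predicted = [0.92 * x + 0.688 for x in xs]           # points on the line
residuals = [y - p for y, p in zip(ys, predicted)]   # step 1: observed minus predicted
squared = [e ** 2 for e in residuals]                # step 2: square the differences
ss_error = sum(squared)                              # step 3: add up the squared differences

print([round(e, 2) for e in residuals], round(ss_error, 2))
```

Without intermediate rounding the sum is about 3.11. The lecture's hand computation reports 3.09 because it rounds each predicted value (6.2, 1.6, 5.3, 3.45) before subtracting.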



Predicted values of Y (points on the line): $\hat{Y} = 0.92X + 0.688$

         X      Y      Ŷ                               Y − Ŷ      (Y − Ŷ)²      (Y − Ȳ)²
         6      6      6.2  = (0.92)(6) + 0.688        -0.20        0.04          4.0
         1      2      1.6  = (0.92)(1) + 0.688         0.40        0.16          4.0
         5      6      5.3  = (0.92)(5) + 0.688         0.70        0.49          4.0
         3      4      3.45 = (0.92)(3) + 0.688         0.55        0.30          0.0
         3      2      3.45 = (0.92)(3) + 0.688        -1.45        2.10          4.0
mean    3.6    4.0
sum                                                     0.00        3.09         16.0
                                                                  (SSERROR)      (SSY)

Quick check of the residuals (Y − Ŷ): 6 − 6.2 = -0.20, 2 − 1.6 = 0.40, 6 − 5.3 = 0.70, 4 − 3.45 = 0.55, 2 − 3.45 = -1.45; they sum to 0.00

• Sum of the squared residuals = SSresidual = SSerror = 3.09

• Total squared error when predicting from the mean = SSY = 16.0

[Figure: scatterplot of the data (axes 1 to 6) with the line Ŷ = 0.92X + 0.688 and the predicted values 6.2, 1.6, 5.3, and 3.45 marked on it]

– Proportionate reduction in error = $\dfrac{SS_{total} - SS_{error}}{SS_{total}} = \dfrac{16.0 - 3.09}{16.0} = 0.81$

• Also (like r²) represents the proportion of variance in Y accounted for by X (here, 81%)

• In fact, it is mathematically identical to r²
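The identity between the proportionate reduction in error and r² can be verified numerically. The Python sketch below uses the unrounded least-squares line for the lecture's data, so it gives about 0.806 rather than the hand-rounded 0.81; the variable names are illustrative assumptions.

```python
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)

sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_total = sum((y - mean_y) ** 2 for y in ys)    # SSY: error when predicting from the mean

b = sp / ss_x                                    # unrounded slope
a = mean_y - b * mean_x                          # unrounded intercept
ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

pre = (ss_total - ss_error) / ss_total           # proportionate reduction in error
r_squared = sp ** 2 / (ss_x * ss_total)          # Pearson's r, squared

print(round(pre, 3), round(r_squared, 3))
```

The two printed values agree because, for the least-squares line, SSerror = SSY − SP²/SSX, which makes the ratio collapse to SP²/(SSX · SSY).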


Seeing patterns in the error

• Residual plots

• The sum of the residuals should always equal 0 (as should the mean).

– The least squares regression line balances the error: the positive residuals (points above the line) and the negative residuals (points below the line) cancel out exactly. This does not require half of the points to fall on each side; in the worked example, three points are above the line and two are below.

• In addition to summing to zero, we also want the residuals to be randomly distributed.

– That is, there should be no pattern to the residuals.

– If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables.

• Residual plots are very useful tools to examine the relationship even further.

– These are basically scatterplots of the residuals (Y − Ŷ) against the explanatory (X) variable (note: the examples actually plot the residuals that have been transformed into z-scores).
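The zero-sum property is easy to confirm on the lecture's five data points. This minimal Python sketch fits the unrounded least-squares line and then counts how many residuals fall on each side; all names are illustrative.

```python
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)

b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

residuals = [y - (b * x + a) for x, y in zip(xs, ys)]

total = sum(residuals)                        # zero, up to floating-point error
above = sum(1 for e in residuals if e > 0)    # points above the line
below = sum(1 for e in residuals if e < 0)    # points below the line

print(abs(total) < 1e-9, above, below)  # True 3 2
```

The residuals cancel even though the points split three above and two below, so the balance is in the size of the errors, not in the count of points.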



[Figure: scatter plot and residual plot]

• The scatterplot shows a nice linear relationship.

• The residual plot shows that the residuals fall randomly above and below the line. Critically, there doesn't seem to be a discernible pattern to the residuals.



[Figure: scatter plot and residual plot]

• The scatterplot also shows a nice linear relationship.

• The residual plot shows that the residuals get larger as X increases.

• This suggests that the variability around the line is not constant across values of X.

• This is referred to as a violation of homogeneity of variance.



[Figure: scatter plot and residual plot]

• The scatterplot shows what may be a linear relationship.

• The residual plot suggests that a non-linear relationship may be more appropriate (see how a curved pattern appears in the residual plot).
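A curved residual pattern of this kind is easy to produce. The Python sketch below uses hypothetical data that are not from the lecture (Y = X squared for X = 1 to 5), fits a straight line by least squares, and lists the residuals in order of X.

```python
# Hypothetical data (not from the lecture): Y = X squared, a deliberately curved relationship
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Residuals from the straight-line fit, in order of increasing X
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]
print([round(e, 2) for e in residuals])  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals still sum to zero, but they run positive, negative, positive across X. That U shape is the kind of pattern that signals a straight line is missing part of the relationship.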


Regression in SPSS

• Running the analysis is pretty easy

– Analyze: Regression: Linear

– Predictor variables go into the ‘independent variable’ field

– Predicted variable goes into the ‘dependent variable’ field

• You get a lot of output


Regression in SPSS

• The variables in the model

• r

• r²

• Unstandardized coefficients

– Slope (indep var name)

– Intercept (constant)

• Standardized coefficients

• We’ll get back to these numbers in a few weeks


Multiple Regression

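• Multiple regression prediction models

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$

– “fit”: $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$

– “residual”: $\varepsilon$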
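To give a feel for how the coefficients of a model with more than one predictor can be estimated, here is a rough Python sketch that solves the least-squares normal equations directly. Everything in it is an assumption made for illustration: the helper names `solve` and `fit_multiple`, the use of two predictors instead of three, and the made-up, error-free data generated from Y = 1 + 2·X1 + 3·X2. It is not how SPSS computes its output.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Return least-squares coefficients [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]         # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical, error-free data (not from the lecture): Y = 1 + 2*X1 + 3*X2
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefs = fit_multiple(rows, ys)
print([round(c, 6) for c in coefs])
```

Because the made-up data contain no error term, the recovered coefficients match the generating values (1, 2, 3) up to floating-point error; with real data the residual would be non-zero.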


Prediction in Research Articles

• Bivariate prediction models rarely reported

• Multiple regression results commonly reported
