Statistics for the Social Sciences
Psychology 340
Spring 2005
Prediction
Outline (for week)
• Simple bivariate regression, least-squares fit line
– The general linear model
– Residual plots
– Using SPSS
• Multiple regression
– Comparing models (?? Δr²)
– Using SPSS
Regression
• Last time: with correlation, we examined whether variables X & Y are related
• This time: with regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.
Regression
• Last time: “it doesn’t matter which variable goes on the X-axis or the Y-axis”
• For regression this is NOT the case
• The variable that you are predicting goes on the Y-axis (criterion variable)
• The variable that you are basing the prediction on goes on the X-axis (predictor variable)
[Figure: scatterplot with the predicted variable (quiz performance) on the Y-axis and the predicting variable (hours of study) on the X-axis, both axes running from 1 to 6]
Regression
• Last time: “Imagine a line through the points”
• But there are lots of possible lines
• One line is the “best fitting line”
• Today: learn how to compute the equation corresponding to this “best fitting line”
[Figure: scatterplot of quiz performance (Y) against hours of study (X), both axes running from 1 to 6]
The equation for a line
• A brief review of geometry
Y = (X)(slope) + (intercept)
• Intercept = the value of Y when X = 0 (here, 2.0)
• Slope = (change in Y) / (change in X) (here, 1 / 2 = 0.5)
Y = (X)(0.5) + 2.0
[Figure: a line plotted on X and Y axes running from 0 to 6, with the intercept (2.0) and the slope (0.5) marked]
Regression
• A brief review of geometry
• Consider a perfect correlation
Y = (X)(0.5) + (2.0)
• Can make specific predictions about Y based on X
X = 5, Y = ?
Y = (5)(0.5) + (2.0)
Y = 2.5 + 2 = 4.5
[Figure: scatterplot in which every point falls on the line Y = (X)(0.5) + 2.0; at X = 5 the line gives Y = 4.5]
Regression
• Consider a less than perfect correlation
• The line still represents the predicted values of Y given X
Y = (X)(0.5) + (2.0)
X = 5, Y = ?
Y = (5)(0.5) + (2.0)
Y = 2.5 + 2 = 4.5
[Figure: scatterplot with points scattered around the line Y = (X)(0.5) + 2.0; at X = 5 the line gives Y = 4.5]
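The prediction step above can be sketched in a few lines of Python. This is only an illustration: the function name `predict` is made up for this example, and the slope (0.5) and intercept (2.0) are the values from the example line on this slide.

```python
def predict(x, slope=0.5, intercept=2.0):
    """Return the predicted Y for a given X on the line Y = (X)(slope) + intercept."""
    return x * slope + intercept


# Prediction for X = 5, as on the slide
print(predict(5))  # 4.5
```

The same function gives the intercept back when X = 0, which matches the geometry review.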
Regression
• The “best fitting line” is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points)
• Rather than comparing the errors from different lines and picking the best, we will directly compute the equation for the best fitting line
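To make "minimizes the error" concrete, here is a rough Python sketch that measures the summed squared error of a few candidate lines and of the directly computed least-squares line. The data are the five (X, Y) pairs used in the worked example in these slides; the candidate lines and all function names are assumptions chosen only for illustration.

```python
X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]


def sum_squared_error(slope, intercept, xs, ys):
    """Sum of squared differences between actual Y and the line's predicted Y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Directly compute the slope and intercept of the best fitting line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sp / ss_x
    return slope, mean_y - slope * mean_x


# Hypothetical candidate lines (slope, intercept), picked by eye
candidates = [(0.5, 2.0), (1.0, 0.0), (0.8, 1.0)]
for slope, intercept in candidates:
    error = sum_squared_error(slope, intercept, X, Y)
    print(f"Y = {slope}X + {intercept}: squared error = {error:.2f}")

best_slope, best_intercept = least_squares_line(X, Y)
best_error = sum_squared_error(best_slope, best_intercept, X, Y)
print(f"least-squares line: squared error = {best_error:.2f}")
```

Every candidate line has a larger squared error than the computed line, which is why the direct computation replaces trial and error.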
Regression
• The linear model
Y = intercept + slope (X) + error
Y = β0 + β1X + ε
Betas (β) are sometimes called parameters
Come in two types:
• standardized: ZY = (β)(ZX) + ε
• unstandardized: Y = β0 + β1X + ε
Now let’s go through an example computing these things
Scatterplot
• Using the dataset from our correlation lecture
X    Y
6    6
1    2
5    6
3    4
3    2
[Figure: scatterplot of the five (X, Y) points, both axes running from 1 to 6]
From the Computing Pearson’s r lecture
       X      Y     (X − X̄)   (Y − Ȳ)   (X − X̄)(Y − Ȳ)   (X − X̄)²   (Y − Ȳ)²
       6      6       2.4       2.0          4.8           5.76       4.0
       1      2      -2.6      -2.0          5.2           6.76       4.0
       5      6       1.4       2.0          2.8           1.96       4.0
       3      4      -0.6       0.0          0.0           0.36       0.0
       3      2      -0.6      -2.0          1.2           0.36       4.0
mean   3.6    4.0
sum                   0.0       0.0         14.0          15.20      16.0
                                             SP            SSX        SSY
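The table can be reproduced with a short Python sketch. The variable names are chosen for this example only; the data are the five (X, Y) pairs from the slide.

```python
X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

# Deviations of each score from its mean
dev_x = [x - mean_x for x in X]
dev_y = [y - mean_y for y in Y]

# Sum of products and sums of squares
SP = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
SSX = sum(dx ** 2 for dx in dev_x)
SSY = sum(dy ** 2 for dy in dev_y)

print(round(mean_x, 2), round(mean_y, 2))            # 3.6 4.0
print(round(SP, 2), round(SSX, 2), round(SSY, 2))    # 14.0 15.2 16.0
```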
Computing regression line (with raw scores)
mean of X = 3.6, mean of Y = 4.0
SP = 14.0, SSX = 15.20, SSY = 16.0
slope = b = SP / SSX = 14.0 / 15.2 = 0.92
intercept = a = Ȳ − bX̄ = 4.0 − (0.92)(3.6) = 0.688
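As a check on the arithmetic, here is a small Python sketch of the slope and intercept computation, starting from the sums on this slide. The variable names are illustrative. It also shows that the intercept of 0.688 comes from using the slope already rounded to 0.92; carrying full precision gives a slightly different third decimal.

```python
# Values from the slide
SP, SSX = 14.0, 15.2
mean_x, mean_y = 3.6, 4.0

b = SP / SSX                    # slope
a = mean_y - b * mean_x         # intercept, full precision

# Intercept as computed on the slide, using the slope rounded to two decimals
a_from_rounded_slope = mean_y - round(b, 2) * mean_x

print(f"slope b = {b:.3f}")                                          # 0.921
print(f"intercept a = {a:.3f}")                                      # 0.684
print(f"intercept using b = 0.92: {a_from_rounded_slope:.3f}")       # 0.688
```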
Computing regression line (with raw scores)
slope = b = 0.92
intercept = a = 0.688
Ŷ = 0.92X + 0.688
• The two means will be on the line: the point (X̄, Ȳ) = (3.6, 4.0) falls on the regression line
[Figure: scatterplot of the data with the regression line Ŷ = 0.92X + 0.688 and the means of X and Y marked, both axes running from 1 to 6]
Computing regression line (standardized, using z-scores)
• Sometimes the regression equation is standardized.
– Computed based on z-scores rather than with raw scores
          X      Y     (X − X̄)   (X − X̄)²   (Y − Ȳ)   (Y − Ȳ)²    ZX      ZY
          6      6       2.4       5.76       2.0       4.0       1.38     1.1
          1      2      -2.6       6.76      -2.0       4.0      -1.49    -1.1
          5      6       1.4       1.96       2.0       4.0       0.8      1.1
          3      4      -0.6       0.36       0.0       0.0      -0.34     0.0
          3      2      -0.6       0.36      -2.0       4.0      -0.34    -1.1
Mean      3.6    4.0
Sum                      0.0      15.20       0.0      16.0       0.0      0.0
Std dev   1.74   1.79
Computing regression line (standardized, using z-scores)
• Prediction model
– Predicted Z score (on criterion variable) = standardized regression coefficient multiplied by Z score on predictor variable
– Formula: ẐY = (β)(ZX)
– The standardized regression coefficient (β)
• In bivariate prediction, β = r
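The claim that β = r in bivariate prediction can be checked numerically. The sketch below is illustrative only: the function name `z_scores` is made up, and it assumes the standard deviation is computed by dividing by N, which is what reproduces the 1.74 and 1.79 shown in the z-score table.

```python
import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]


def z_scores(values):
    """Convert raw scores to z-scores (standard deviation computed with N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


zx = z_scores(X)
zy = z_scores(Y)

# Pearson's r as the mean cross-product of z-scores
r = sum(a * b for a, b in zip(zx, zy)) / len(X)

# Least-squares slope for predicting ZY from ZX (both have mean 0)
beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

print([round(z, 2) for z in zx])
print([round(z, 2) for z in zy])
print(f"r = {r:.3f}, beta = {beta:.3f}")
```

Because the z-scores of X have a sum of squares equal to N, the slope formula reduces to the formula for r, so the two printed values agree. Without rounding, r is about 0.898.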
Computing regression line (with z-scores)
slope = β = r = 0.89
intercept = 0.0
ẐY = (β)(ZX)
[Figure: the standardized regression line plotted on ZX and ZY axes running from -2 to 2, passing through the origin]
Regression
• Also need a measure of error
Y = X(.5) + (2.0) + error
[Figure: two scatterplots, both with the line Y = X(.5) + (2.0) but with different amounts of scatter around it, both axes running from 1 to 6]
• Same line, but different relationships (strength difference)
• The linear equation isn’t the whole thing
Y = intercept + slope (X) + error
Regression
• Error
– Actual score minus the predicted score
• Measures of error
– r² (r-squared)
– Proportionate reduction in error = (SStotal − SSerror) / SStotal
• Note: Total squared error when predicting from the mean = SStotal = SSY
– Squared error using prediction model = Sum of the squared residuals = SSresidual = SSerror
R-squared
• r² represents the percent variance in Y accounted for by X
[Figure: two scatterplots, one with r = 0.8 and one with r = 0.5, both axes running from 1 to 6]
r = 0.8, r² = 0.64: 64% variance explained
r = 0.5, r² = 0.25: 25% variance explained
Computing Error around the line
• Compute the difference between the predicted values and the observed values (“residuals”)
• Square the differences
• Add up the squared differences
• Sum of the squared residuals = SSresidual = SSerror
[Figure: scatterplot of the data, both axes running from 1 to 6]
Computing Error around the line
Ŷ = 0.92X + 0.688 gives the predicted values of Y (points on the line)
        X      Y      Ŷ                             (Y − Ŷ)               (Y − Ŷ)²   (Y − Ȳ)²
        6      6      6.2  = (0.92)(6) + 0.688      6 − 6.2  = -0.20      0.04       4.0
        1      2      1.6  = (0.92)(1) + 0.688      2 − 1.6  =  0.40      0.16       4.0
        5      6      5.3  = (0.92)(5) + 0.688      6 − 5.3  =  0.70      0.49       4.0
        3      4      3.45 = (0.92)(3) + 0.688      4 − 3.45 =  0.55      0.30       0.0
        3      2      3.45 = (0.92)(3) + 0.688      2 − 3.45 = -1.45      2.10       4.0
mean    3.6    4.0
sum                                                             0.00      3.09      16.0
                                                                          SSERROR    SSY
• Sum of the squared residuals = SSresidual = SSerror
[Figure: scatterplot of the data with the regression line and the predicted values 6.2, 1.6, 5.3, and 3.45 marked, both axes running from 1 to 6]
Computing Error around the line
• Sum of the squared residuals = SSresidual = SSerror = 3.09
• SStotal = SSY = 16.0
– Proportionate reduction in error = (SStotal − SSerror) / SStotal = (16.0 − 3.09) / 16.0 = 0.81
• Also (like r²) represents the percent variance in Y accounted for by X
• In fact, it is mathematically identical to r²
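The whole error computation, and the claim that the proportionate reduction in error equals r², can be sketched in Python as follows. Variable names are illustrative. Note that the code keeps full precision, so SSerror comes out slightly higher than the 3.09 obtained from the rounded predicted values on the slides.

```python
import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
SP = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
SSX = sum((x - mean_x) ** 2 for x in X)
SSY = sum((y - mean_y) ** 2 for y in Y)     # total squared error around the mean

b = SP / SSX
a = mean_y - b * mean_x

predicted = [a + b * x for x in X]                     # points on the line
residuals = [y - y_hat for y, y_hat in zip(Y, predicted)]
ss_error = sum(e ** 2 for e in residuals)              # squared error around the line

pre = (SSY - ss_error) / SSY                           # proportionate reduction in error
r = SP / math.sqrt(SSX * SSY)                          # Pearson's r

print(f"SSerror = {ss_error:.2f}")   # 3.11
print(f"PRE     = {pre:.2f}")        # 0.81
print(f"r^2     = {r ** 2:.2f}")     # 0.81
```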
Seeing patterns in the error
• Residual plots
• The sum of the residuals should always equal 0 (as should the mean).
– For the least-squares regression line the errors balance: the total error above the line (positive residuals) equals the total error below the line (negative residuals).
• In addition to summing to zero, we also want the residuals to be randomly distributed.
– That is, there should be no pattern to the residuals.
– If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables.
• Residual plots are very useful tools to examine the relationship even further.
– These are basically scatterplots of the residuals (Y − Ŷ) against the explanatory (X) variable (note: the examples actually plot residuals that have been transformed into z-scores).
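A quick numerical check that the residuals of a least-squares line sum to zero, using the five-point data set from the worked example. The helper name `fit_line` is hypothetical. The (X, residual) pairs printed at the end are the raw material for a residual plot.

```python
X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sp / ss_x
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(X, Y)
residuals = [y - (slope * x + intercept) for x, y in zip(X, Y)]

residual_sum = sum(residuals)
residual_mean = residual_sum / len(residuals)
print("sum of residuals is zero:", abs(residual_sum) < 1e-9)
print("mean of residuals is zero:", abs(residual_mean) < 1e-9)

# Pairs to plot: X on the horizontal axis, residual on the vertical axis
for x, e in sorted(zip(X, residuals)):
    print(x, round(e, 2))
```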
Seeing patterns in the error
• The scatterplot shows a nice linear relationship.
• The residual plot shows that the residuals fall randomly above and below the line. Critically, there doesn't seem to be a discernible pattern to the residuals.
[Figure: scatter plot and residual plot]
Seeing patterns in the error
• The scatterplot also shows a nice linear relationship.
• The residual plot shows that the residuals get larger as X increases.
• This suggests that the variability around the line is not constant across values of X.
• This is referred to as a violation of homogeneity of variance.
[Figure: scatter plot and residual plot]
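One rough way to see this pattern without a plot is to compare the typical size of the residuals at low and high values of X. The sketch below uses made-up data, invented only for this illustration, in which the scatter around the line grows with X; the function name and the low/high split are also assumptions, not a formal test.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sp / ss_x
    return slope, mean_y - slope * mean_x


# Made-up data: Y = 2X plus a disturbance whose size grows with X
xs = list(range(1, 11))
ys = [2 * x + ((-1) ** x) * 0.3 * x for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Average absolute residual in the lower and upper halves of X
half = len(xs) // 2
spread_low = sum(abs(e) for e in residuals[:half]) / half
spread_high = sum(abs(e) for e in residuals[half:]) / (len(xs) - half)

print(f"mean |residual| for low X:  {spread_low:.2f}")
print(f"mean |residual| for high X: {spread_high:.2f}")
```

A clearly larger spread at high X than at low X is the numerical counterpart of the fan shape in the residual plot.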
Seeing patterns in the error
• The scatterplot shows what may be a linear relationship.
• The residual plot suggests that a non-linear relationship may be more appropriate (see how a curved pattern appears in the residual plot).
[Figure: scatter plot and residual plot]
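The curved pattern can be reproduced with a small sketch. The data below are made up for illustration (Y is exactly X squared), and the function name is hypothetical. Fitting a straight line to them leaves residuals that are positive at both ends and negative in the middle.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sp / ss_x
    return slope, mean_y - slope * mean_x


# Made-up curved data: Y = X squared
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(x, round(e, 2))
```

The residuals still sum to zero, so the sum alone would not reveal the problem; the U-shaped pattern against X does.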
Regression in SPSS
• Running the analysis is pretty easy
– Analyze: Regression: Linear
– Predictor variables go into the ‘independent variable’ field
– Predicted variable goes into the ‘dependent variable’ field
• You get a lot of output
Regression in SPSS
• The variables in the model
• r
• r²
• Unstandardized coefficients
• Slope (indep var name)
• Intercept (constant)
• Standardized coefficients
• We’ll get back to these numbers in a few weeks
Multiple Regression
• Multiple regression prediction models
Y = β0 + β1X1 + β2X2 + β3X3 + ε
“fit”: β0 + β1X1 + β2X2 + β3X3; “residual”: ε
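To give a feel for what fitting such a model involves, here is a minimal pure-Python sketch that estimates the coefficients by solving the least-squares normal equations. Everything in it is an assumption for illustration: the function names, the use of only two predictors instead of three, and the made-up data, which are generated exactly from Y = 1 + 2(X1) + 3(X2) so that the residuals are all zero. In practice SPSS (or any statistics package) does this for you.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                   # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_regression(predictors, ys):
    """Return [b0, b1, b2, ...] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in predictors]   # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from Y = 1 + 2*X1 + 3*X2 with no error
predictors = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

coefficients = fit_multiple_regression(predictors, ys)
print([round(c, 6) for c in coefficients])
```

The recovered coefficients match the intercept and slopes used to generate the data, which is the "fit" part of the model; with real data the leftover differences would be the "residual" part.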
Prediction in Research Articles
• Bivariate prediction models rarely reported
• Multiple regression results commonly reported