Basic Statistics: Linear Regression. Simple Linear Regression.
Uploaded by marianna-dickerson
Basic Statistics
Linear Regression
Simple Linear Regression
Predicting Y from X
• Recall that when we looked at scatter plots in our discussion of correlation, we showed in general terms how to estimate Y for a given value of X when the correlation was not perfect.
• We will now look at how to use our knowledge of the correlation to predict a value for Y when we know a value for X.
[Figure: Scatter Plot of Y and X. Horizontal axis: Variable X (low to high). Vertical axis: Variable Y (low to high). The GREEN line shows our prediction or regression line, which gives the estimated Y value for each X.]
Prediction Equation
• The green line in the previous slide showed us our prediction line.
• We will use the mathematical formula for a straight line as the method for predicting a value for Y when we know the value for X.
• The process is called “Linear Regression” because, in this class, we will only deal with relationships that can be fitted by a straight line.
• The general formula for a straight line is:
$$\hat{Y} = a_y + b_y X$$
The Prediction Equation
$$\hat{Y} = a_y + b_y X$$
• a_y = the intercept, or where the prediction line crosses the Y-axis (the value of Y when X = 0)
• b_y = the regression coefficient, which indicates the amount of change in Y when the value of X increases by one unit.
A Simple Example
• Suppose that a club charges a flat $25 to use their facilities.
• They also charge a $10 fee per hour for using the tennis courts.
• Now, assume that you want to play tennis for 2 hours at this club. How much would you have to pay?
Ŷ = $25 + (2)($10) = $25 + $20 = $45 for two hours of tennis
Linking the Simple Example to Regression
Ŷ = $25 + (2)($10) = $25 + $20 = $45 for two hours of tennis
• In our example:
  – $25 is a_y, the intercept. Even if we didn't play any tennis (X = 0), it would cost $25 to use the club.
  – $10 is b_y, the regression coefficient (it costs $10 for each hour of tennis played).
• In this case we predicted how much it would cost (Y) when we knew how long we wanted to play tennis (X).
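The tennis example can be written as a tiny function to make the roles of the intercept and the regression coefficient concrete. This is only an illustrative sketch: the function name `club_cost` and its parameter names are assumptions, not part of the slides.

```python
def club_cost(hours, flat_fee=25.0, hourly_fee=10.0):
    """Cost of using the club: intercept (flat fee) plus slope (hourly fee) times hours."""
    return flat_fee + hourly_fee * hours


two_hour_cost = club_cost(2)   # flat $25 plus 2 hours at $10 each
no_tennis_cost = club_cost(0)  # X = 0 leaves only the intercept

print(two_hour_cost)   # 45.0
print(no_tennis_cost)  # 25.0
```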
Formulae for Sums of Squares
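$$SS_x = \sum X^2 - \frac{(\sum X)^2}{n}$$
$$SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n}$$
$$SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$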
These were introduced in our discussion of correlation.
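As a rough sketch, the three computational formulas (SSx, SSy, SSxy) translate directly into code. The function name `sums_of_squares` and the three-point toy data are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def sums_of_squares(xs, ys):
    """Return (SSx, SSy, SSxy) using the computational (raw-score) formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return ss_x, ss_y, ss_xy


# Hypothetical toy data (not from the slides): Y is exactly 2 * X.
toy = sums_of_squares([1, 2, 3], [2, 4, 6])
print(toy)  # (2.0, 8.0, 4.0)
```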
Calculating the Regression Coefficient (b)
$$b = \frac{SS_{xy}}{SS_x}$$
or
$$b = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}$$
Calculating the Intercept (a)
$$a = \bar{Y} - b\,\bar{X}$$
You will notice that you must calculate the regression coefficient (b) before you can calculate the intercept (a), since the calculation of a uses b.
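The order of the two calculations can be sketched in code: b comes first, and a is then computed from b and the two means. The function name `fit_line` and the toy inputs are assumptions made for illustration.

```python
def fit_line(ss_x, ss_xy, mean_x, mean_y):
    """Return (a, b): b is computed first because a depends on it."""
    b = ss_xy / ss_x            # regression coefficient
    a = mean_y - b * mean_x     # intercept
    return a, b


# Hypothetical toy inputs (points (1, 2), (2, 4), (3, 6)): SSx = 2, SSxy = 4,
# mean of X = 2, mean of Y = 4.
a, b = fit_line(ss_x=2.0, ss_xy=4.0, mean_x=2.0, mean_y=4.0)
print(a, b)  # 0.0 2.0
```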
An Example
• From our earlier example, suppose that our college statistics professor is interested in predicting how many errors students might make on the mid-term examination based on how many hours they studied. Specifically, the professor wants to know how many errors a student might make if the student studied for 5 hours.
The Stats Professor’s Data
X = hours studied, Y = errors on the mid-term

Student    X     Y    X²     Y²    XY
1          4    15    16    225    60
2          4    12    16    144    48
3          5     9    25     81    45
4          6    10    36    100    60
5          7     8    49     64    56
6          7     4    49     16    28
7          7     6    49     36    42
8          9     2    81      4    18
9          9     4    81     16    36
10        12     3   144      9    36
Total   ΣX = 70   ΣY = 73   ΣX² = 546   ΣY² = 695   ΣXY = 429
The Resulting Sum of Squares
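Totals (n = 10): ΣX = 70, ΣY = 73, ΣX² = 546, ΣY² = 695, ΣXY = 429
$$SS_x = \sum X^2 - \frac{(\sum X)^2}{n} = 546 - \frac{70^2}{10} = 546 - 490 = 56$$
$$SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 695 - \frac{73^2}{10} = 695 - 532.9 = 162.1$$
$$SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 429 - \frac{(70)(73)}{10} = 429 - 511 = -82$$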
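The totals and sums of squares can be recomputed from the raw X and Y columns as a check on the hand arithmetic. This is a minimal sketch; the variable names `hours` and `errors` are labels chosen here for X and Y.

```python
# Raw data from the professor's table: X = hours studied, Y = mid-term errors.
hours = [4, 4, 5, 6, 7, 7, 7, 9, 9, 12]
errors = [15, 12, 9, 10, 8, 4, 6, 2, 4, 3]
n = len(hours)

# Column totals.
sum_x = sum(hours)
sum_y = sum(errors)
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in errors)
sum_xy = sum(x * y for x, y in zip(hours, errors))

# Sums of squares from the computational formulas.
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)           # 70 73 546 695 429
print(round(ss_x, 2), round(ss_y, 2), round(ss_xy, 2))  # 56.0 162.1 -82.0
```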
Calculating the Regression Coefficient (b)
$$b = \frac{SS_{xy}}{SS_x} = \frac{-82}{56} = -1.46$$
This can be interpreted as the change in the value of Y (in our case, errors made on the mid-term) for a one-unit change in X (for us, each additional hour studied). Thus, a student who studies one more hour is predicted to make 1.46 fewer mistakes, on average.
Calculating the Intercept (a)
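$$a = \bar{Y} - b\,\bar{X} = 7.3 - (-1.46)(7) = 7.3 + 10.25 = 17.55$$
Here Ȳ = 73/10 = 7.3 and X̄ = 70/10 = 7. The product 10.25 uses the unrounded coefficient, (82/56)(7) = 10.25; with b rounded to 1.46 it would be 10.22.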
Therefore, our prediction equation is Ŷ = 17.55 + (-1.46) (X)
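A short sketch of the same two steps in code, starting from the sums of squares and totals given in the slides (SSx = 56, SSxy = -82, ΣX = 70, ΣY = 73, n = 10). The variable names are assumptions; the point is that carrying b unrounded reproduces the intercept of 17.55.

```python
# Values taken from the worked example in the slides.
ss_x, ss_xy = 56.0, -82.0
n = 10
mean_x = 70 / n   # 7.0
mean_y = 73 / n   # 7.3

b = ss_xy / ss_x            # regression coefficient, kept unrounded
a = mean_y - b * mean_x     # intercept, computed from the unrounded b

print(round(b, 2))  # -1.46
print(round(a, 2))  # 17.55
```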
Using Our Prediction Equation
Ŷ = 17.55 + (-1.46) (X)
If the professor wanted to predict the number of errors a student might make if the student had studied for 5 hours, then we would substitute 5 for X in the above equation and obtain:
Ŷ = 17.55 + (-1.46) (5) = 17.55 + (-7.3) = 10.25
Thus, the professor would predict 10.25 errors for a student who had studied for 5 hours.
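The substitution step can be sketched as a one-line function. The name `predict_errors` is hypothetical, and the coefficients are the rounded values from the prediction equation in the slides.

```python
def predict_errors(hours_studied, a=17.55, b=-1.46):
    """Predicted mid-term errors from the rounded prediction equation."""
    return a + b * hours_studied


prediction = predict_errors(5)
print(round(prediction, 2))  # 10.25
```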
Measuring Prediction Errors: The Standard Error of the Estimate
Since we know that the estimate is not exact, as statisticians we must report how much error we believe is in our estimate. The formula is:
$$S_{y/x} = \sqrt{\frac{SS_y - \dfrac{\left(\sum XY - \frac{(\sum X)(\sum Y)}{n}\right)^2}{SS_x}}{n - 2}}$$
OR
$$S_{y/x} = \sqrt{\frac{SS_y\,(1 - r^2)}{n - 2}}$$
Calculating the Standard Error of the Estimate
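$$S_{y/x} = \sqrt{\frac{SS_y\,(1 - r^2)}{n - 2}} = \sqrt{\frac{(1 - .74)(162.1)}{8}} = 2.29$$
Here .74 is r², which equals SSxy² / (SSx · SSy) = (-82)² / (56 × 162.1), and n - 2 = 10 - 2 = 8. The result 2.29 is obtained by carrying r² unrounded.
Thus, when we estimated 10.25 errors, we would also report that the Standard Error of the Estimate is 2.29.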
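As a check, both forms of the standard error formula can be evaluated from the sums of squares in the worked example (SSx = 56, SSy = 162.1, SSxy = -82, n = 10). This is a sketch with assumed variable names; it treats .74 as r², computed here as SSxy² / (SSx · SSy).

```python
import math

# Sums of squares from the worked example in the slides.
ss_x, ss_y, ss_xy = 56.0, 162.1, -82.0
n = 10

# Squared correlation, from the sums of squares.
r_squared = ss_xy ** 2 / (ss_x * ss_y)

# Form 1: uses r squared.
se_from_r = math.sqrt(ss_y * (1 - r_squared) / (n - 2))

# Form 2: uses the sums of squares directly.
se_from_ss = math.sqrt((ss_y - ss_xy ** 2 / ss_x) / (n - 2))

print(round(r_squared, 2))   # 0.74
print(round(se_from_r, 2))   # 2.29
print(round(se_from_ss, 2))  # 2.29
```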
Summarizing Prediction Equations
• The existence of a relationship between two variables allows us to use that knowledge to make predictions.
• The prediction based on our equation will result in less prediction error than simply using the mean of the dependent variable as the prediction for every case.
• Two sums of squares (SSxy and SSx) are required to calculate the regression coefficient; the intercept then follows from b and the means of X and Y.
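The claim that the prediction equation gives less error than the mean can be illustrated on the professor's data by comparing the sum of squared prediction errors under each approach. This is a sketch under stated assumptions: the variable names are chosen here, and the sums of squares are computed in deviation form, which is algebraically equivalent to the computational formulas used earlier.

```python
from statistics import mean

# Raw data from the professor's table: X = hours studied, Y = mid-term errors.
hours = [4, 4, 5, 6, 7, 7, 7, 9, 9, 12]
errors = [15, 12, 9, 10, 8, 4, 6, 2, 4, 3]

mean_x, mean_y = mean(hours), mean(errors)

# Fit the prediction equation (deviation form of SSx and SSxy).
ss_x = sum((x - mean_x) ** 2 for x in hours)
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, errors))
b = ss_xy / ss_x
a = mean_y - b * mean_x

# Squared error when every student is predicted to make the mean number of errors.
sse_mean = sum((y - mean_y) ** 2 for y in errors)

# Squared error when each student is predicted from the regression line.
sse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, errors))

print(round(sse_mean, 2))  # 162.1
print(round(sse_line, 2))  # 42.03
```

The first figure is simply SSy; the second is the numerator of the standard error of the estimate before dividing by n - 2.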