Class 4
Simple Linear Regression
Regression Analysis
• Reality is thought to behave in a manner which may be simulated (predicted) to an acceptable degree of accuracy by a simplified mathematical model.
• Statistical models (which include regression) permit some degree of random error, because some variable of interest cannot be duplicated under seemingly identical conditions.
An Example
• We would like to predict test scores on an academic test. Ten such scores are shown below:
65 73 73 75 81 87 92 96 98 100
• A possible model of test scores: A test score, $y$, is obtained by taking the average test score, $\mu$, and adding a random value, $\varepsilon$, to it.
$y = \mu + \varepsilon$
Example (cont.)
• How might we estimate $\mu$?
• How do we tell if our model is useful?
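One natural answer to the first question, offered here as an assumption rather than as the slides' stated answer, is to estimate the average test score with the sample mean. The sketch below also computes the sample standard deviation of the ten scores, which gives a rough sense of how much variability the random term has to absorb.

```python
import math

# The ten test scores listed in the example.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]

n = len(scores)
mu_hat = sum(scores) / n  # sample mean, used as the estimate of the average score

# Residuals: the part of each score the mean-only model leaves unexplained.
residuals = [y - mu_hat for y in scores]

# Sample standard deviation (n - 1 in the denominator).
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 1))

print(f"estimate of the mean: {mu_hat:.2f}")
print(f"sample standard deviation: {s:.2f}")
```

A large standard deviation relative to the spread we care about suggests the mean-only model is not very useful on its own.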
Improving the Model
• We would have a more useful model if we could remove (explain) some of the variability that we see in the data.
• Perhaps there exist other factors that cause variability in the test scores. Can you think of some?
Improving the Model (cont.)
• Here is the data including the hours of study.
Hours of Study   Test Score
1                65
2                73
2                73
3                75
4                81
5                98
5                87
6                92
6                96
8                100
Improving the Model (cont.)
• We have the same problem:
• Select the best line that minimizes the (squared) distance of the data points to the line.
• This line is referred to as the least squares line.
• Our model now looks like $y = \beta_0 + \beta_1 x + \varepsilon$.
• Our estimated or fitted line will be called $\hat{y} = b_0 + b_1 x$.
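The least squares line can be computed directly from the data with the usual closed-form expressions for the slope and intercept. The sketch below is a minimal illustration; the function name `least_squares_line` is hypothetical, and the pairing of hours with scores follows the hours-of-study table.

```python
# Hours of study and the matching test scores from the table above.
hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]


def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the squared vertical distances."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = ss_xy / ss_x
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = least_squares_line(hours, scores)
print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} * x")
```

The slope says how many points the mean test score changes for each additional hour of study.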
Another view of the Model
• This (and all linear regression) model(s) can be expressed as $y = E(y) + \varepsilon$.
• So in our model, $E(y) = \beta_0 + \beta_1 x$; that is, the mean test score falls on a straight line as a function of hours of study.
• The random error term, $\varepsilon$, is assumed to have a normal distribution with mean 0 and variance $\sigma^2$.
• Our ability to effectively use the model depends on this variation.
Analysis of Variance
• It turns out that the variation displayed by the variable y, referred to as the total sum of squares (SST), can be broken into two pieces:
• The part caused by the variable x, called the regression sum of squares (SSR),
• The part left over (the distance from the data points to the regression line), called the sum of squared errors or residual sum of squares (SSE).
SST = SSR + SSE
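The decomposition can be checked numerically on the test score data. This is a sketch that fits the least squares line by hand and then computes each sum of squares from its definition; variable names are illustrative.

```python
hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Least squares slope and intercept.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / sum(
    (x - x_bar) ** 2 for x in hours
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

sst = sum((y - y_bar) ** 2 for y in scores)               # total variation in y
ssr = sum((f - y_bar) ** 2 for f in fitted)               # variation explained by x
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))   # variation left over

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```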
Getting it done with EXCEL
• Select tools/data analysis/regression.
Regression Statistics
Multiple R          0.9479385   r, correlation
R Square            0.8985875   $r^2 = SSR/SST$
Adjusted R Square   0.8859109   $r_a^2 = 1 - (1 - r^2)[(n-1)/(n-p-1)]$
Standard Error      4.1551753   s, the square root of $s^2$, our estimate of $\sigma^2$
Observations        10
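The summary statistics follow from the sums of squares. As a sketch, the function below (a hypothetical helper, not part of EXCEL) takes SSR, SSE, the number of observations n, and the number of independent variables p, and applies the formulas listed above. The inputs are the rounded sums of squares that EXCEL reports in its ANOVA output for this example, so the results agree with EXCEL only to a few decimal places.

```python
import math


def regression_statistics(ssr, sse, n, p):
    """Return (r, r_squared, adjusted_r_squared, s) from the sums of squares."""
    sst = ssr + sse
    r_squared = ssr / sst
    adjusted = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))
    s = math.sqrt(sse / (n - p - 1))  # square root of s^2 = SSE / (n - p - 1)
    # Multiple R is the non-negative square root of R Square.
    return math.sqrt(r_squared), r_squared, adjusted, s


# Rounded sums of squares from the EXCEL output for the test score example.
r, r2, r2_adj, s = regression_statistics(ssr=1223.88, sse=138.12, n=10, p=1)
print(f"Multiple R = {r:.4f}")
print(f"R Square = {r2:.4f}")
print(f"Adjusted R Square = {r2_adj:.4f}")
print(f"Standard Error = {s:.4f}")
```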
Getting it done with EXCEL
ANOVA
             df    SS        MS        F       Significance F
Regression   1     1223.88   1223.88   70.89   3.01752E-05
Residual     8     138.12    17.27
Total        9     1362.00

• SS: the actual sums of squares (SSR, SSE, and SST, from top to bottom).
• MS: sums of squares divided by degrees of freedom (MSR and MSE; MSE is also $s^2$).
• F: MSR/MSE.
• Significance F: the p-value.
• For example, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$.
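As a worked check of the MS and F columns, using only the numbers in the ANOVA output and the degrees of freedom 1 and 8 shown there:

```latex
\[
\begin{aligned}
MSR &= \frac{SSR}{1} = \frac{1223.88}{1} = 1223.88, \\
MSE &= \frac{SSE}{n-2} = \frac{138.12}{8} \approx 17.27, \\
F   &= \frac{MSR}{MSE} = \frac{1223.88}{138.12/8} \approx 70.89.
\end{aligned}
\]
```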
Getting it done with EXCEL
                 Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept        61.75          2.95             20.92    3E-08     54.94       68.55
Hours of Study   5.30           0.63             8.42     3E-05     3.85        6.75

• Coefficients: the least squares estimates, $b_0$ (Intercept) and $b_1$ (Hours of Study).
• Standard Error: the standard deviation of our estimate.
• t Stat: the t-test for the hypothesis that the coefficient ($\beta$) is 0.
• P-value: the important part, the p-value for the t-test.
• Lower 95% and Upper 95%: confidence intervals for our estimates.
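The columns of the coefficient table can be reproduced from the data with the standard formulas for simple linear regression. This sketch assumes the textbook expressions for the standard errors, and it hard-codes the critical value 2.306 (the upper 2.5% point of a t distribution with 8 degrees of freedom, taken from a t table) instead of computing it.

```python
import math

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
ss_x = sum((x - x_bar) ** 2 for x in hours)

# Least squares estimates.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / ss_x
b0 = y_bar - b1 * x_bar

# s estimates sigma, using n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s = math.sqrt(sse / (n - 2))

# Standard errors of the intercept and slope.
se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / ss_x)
se_b1 = s / math.sqrt(ss_x)

# t statistics for the hypotheses that each coefficient is 0.
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1

T_CRIT = 2.306  # assumption: t table value for 95% confidence, 8 df
ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

print(f"Intercept: {b0:.2f}, SE {se_b0:.2f}, t {t_b0:.2f}, CI ({ci_b0[0]:.2f}, {ci_b0[1]:.2f})")
print(f"Hours:     {b1:.2f}, SE {se_b1:.2f}, t {t_b1:.2f}, CI ({ci_b1[0]:.2f}, {ci_b1[1]:.2f})")
```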
Hypothesis Testing
• The F-test tests whether all of the coefficients of the independent variables are zero. For our model: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$.
• The t-test tests whether the coefficient of each individual independent variable is zero.
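With only one independent variable, the two tests address the same hypothesis, and the F statistic equals the square of the t statistic for the slope. Using the rounded values from the EXCEL output for this example:

```latex
\[
t^2 = (8.42)^2 \approx 70.9 \approx F = 70.89 .
\]
```

This is why the P-value for Hours of Study (3E-05) matches Significance F (3.01752E-05) in this model. With several independent variables the two tests no longer coincide.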
Using the Model
• The model has two basic purposes:
• (1) It can be used to provide partial confirmation of the theory that a particular factor is, indeed, influencing the response variable, y.
• (2) It can be used to estimate the mean, E(y), and predict an actual value of y.
• Under the function wizard, select forecast(new_x, known_y, known_x).
• Confidence intervals can be generated (see page 555 for more discussion). Let

$$SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n}\Big(\sum x_i\Big)^2 = \sum x_i^2 - n\bar{x}^2$$
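Outside EXCEL, the point forecast and the quantity SSx are easy to compute by hand. The sketch below defines a hypothetical `forecast` function that mirrors the argument order of the EXCEL function (new x first, then the known y values, then the known x values), and checks that the three algebraic forms of SSx agree on the hours-of-study data.

```python
hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]


def forecast(new_x, known_y, known_x):
    """Predict y at new_x from the least squares line through the known data."""
    n = len(known_x)
    x_bar = sum(known_x) / n
    y_bar = sum(known_y) / n
    ss_x = sum((x - x_bar) ** 2 for x in known_x)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_x, known_y)) / ss_x
    b0 = y_bar - b1 * x_bar
    return b0 + b1 * new_x


# Three equivalent ways to compute SSx.
n = len(hours)
x_bar = sum(hours) / n
ss_x_deviations = sum((x - x_bar) ** 2 for x in hours)
ss_x_sums = sum(x ** 2 for x in hours) - sum(hours) ** 2 / n
ss_x_mean = sum(x ** 2 for x in hours) - n * x_bar ** 2

predicted = forecast(4, scores, hours)  # predicted score after 4 hours of study
print(f"forecast at 4 hours: {predicted:.2f}")
print(f"SSx: {ss_x_deviations:.2f}, {ss_x_sums:.2f}, {ss_x_mean:.2f}")
```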
Using the Model
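Confidence interval for $E(y_p)$:

$$\hat{y}_p \pm t_{\alpha/2,\,df}\; s_{\hat{y}_p}$$

where

$$s_{\hat{y}_p} = s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}},$$

and $df = n - 2$ for this model, but in general, look at the denominator of $s^2$.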
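A minimal sketch of the confidence interval for the mean test score at a chosen number of study hours follows. The function names are hypothetical, and the critical value 2.306 (95% confidence, 8 degrees of freedom) is an assumption taken from a t table rather than computed.

```python
import math

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]


def fit(x, y):
    """Least squares fit; returns (b0, b1, s, x_bar, ss_x)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # n - 2 degrees of freedom for this model
    return b0, b1, s, x_bar, ss_x


def mean_confidence_interval(x_p, x, y, t_crit):
    """Confidence interval for E(y) at x_p."""
    b0, b1, s, x_bar, ss_x = fit(x, y)
    y_hat = b0 + b1 * x_p
    s_y_hat = s * math.sqrt(1 / len(x) + (x_p - x_bar) ** 2 / ss_x)
    return y_hat - t_crit * s_y_hat, y_hat + t_crit * s_y_hat


T_CRIT = 2.306  # assumption: t table value for 95% confidence, 8 df
lower, upper = mean_confidence_interval(4, hours, scores, T_CRIT)
print(f"95% CI for the mean score at 4 hours: ({lower:.2f}, {upper:.2f})")
```

The interval is narrowest when the chosen hours equal the mean of the observed hours and widens as the chosen value moves away from it.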
Using the Model
Prediction interval for $y_p$:

$$\hat{y}_p \pm t_{\alpha/2,\,df}\; s_{ind}$$

where

$$s_{ind} = s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}},$$

and $df = n - 2$ for this model, but in general, look at the denominator of $s^2$.
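The extra 1 under the square root is what separates predicting one student's score from estimating the mean score. The sketch below computes both half-widths at the same number of study hours so the difference is visible. As before, the critical value 2.306 is a t table value (95% confidence, 8 degrees of freedom) supplied as an assumption, and the function name is hypothetical.

```python
import math

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]
T_CRIT = 2.306  # assumption: t table value for 95% confidence, 8 df


def interval_half_widths(x_p, x, y, t_crit):
    """Return (y_hat, half-width for E(y), half-width for an individual y) at x_p."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    leverage = 1 / n + (x_p - x_bar) ** 2 / ss_x
    s_mean = s * math.sqrt(leverage)      # standard error for the mean E(y)
    s_ind = s * math.sqrt(1 + leverage)   # standard error for one new observation
    return b0 + b1 * x_p, t_crit * s_mean, t_crit * s_ind


y_hat, ci_half, pi_half = interval_half_widths(4, hours, scores, T_CRIT)
print(f"predicted score at 4 hours: {y_hat:.2f}")
print(f"confidence interval: +/- {ci_half:.2f}")
print(f"prediction interval: +/- {pi_half:.2f}")
```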
Using the Model
• EXCEL does not provide an automatic calculation for confidence and prediction intervals.
• The authors have included a macro in the spreadsheet called PredInt.xls on your data disk.
• Simply open the file and follow the instructions!
A Note on Correlation
• Many people prefer to perform a correlational analysis before they build regression models.
• In EXCEL this can be accomplished in two ways:
• Under the function wizard, use correl(array1,array2) to find the correlation between two variables.
• Use tools/data analysis/correlation to determine the correlation between several variables.
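For readers working outside EXCEL, the correlation between two variables can be computed from its definition. This sketch defines a hypothetical `correl` function with the same two-array shape as the EXCEL function and applies it to the hours-of-study data.

```python
import math

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]


def correl(array1, array2):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(array1)
    mean1 = sum(array1) / n
    mean2 = sum(array2) / n
    ss_1 = sum((a - mean1) ** 2 for a in array1)
    ss_2 = sum((b - mean2) ** 2 for b in array2)
    ss_12 = sum((a - mean1) * (b - mean2) for a, b in zip(array1, array2))
    return ss_12 / math.sqrt(ss_1 * ss_2)


r = correl(hours, scores)
print(f"correlation between hours and scores: {r:.4f}")
```

For this example the result matches the Multiple R value in the regression output, which is expected when there is a single independent variable with a positive slope.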
Correlation (cont.)
• What correlation does:
• Provides an easy measure to determine if two variables have a linear relationship.
• Positive correlation implies if one variable goes up, the other also tends to go up.
• Negative correlation implies if one variable goes up, the other tends to go down.
• What correlation does not do:
• There is no implication of cause and effect.
• There may exist some lurking factor that produces the behavior being witnessed.