Class 4 Simple Linear Regression. Regression Analysis Reality is thought to behave in a manner which...

Class 4

Simple Linear Regression

Regression Analysis

• Reality is thought to behave in a manner which may be simulated (predicted) to an acceptable degree of accuracy by a simplified mathematical model.

• Statistical models (which include regression) permit some degree of random error, because some variable of interest cannot be duplicated under seemingly identical conditions.

An Example

• We would like to predict test scores on an academic test. Ten such scores are shown below:

65 73 73 75 81 87 92 96 98 100

• A possible model of test scores: A test score, y, is obtained by taking the average test score, $\mu$, and adding a random value, $\varepsilon$, to it.

$y = \mu + \varepsilon$

Example (cont.)

• How might we estimate $\mu$?

• How do we tell if our model is useful?
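One natural answer to both questions is to estimate the average with the sample mean and then look at how far the scores fall from it. The sketch below assumes that choice; the variable names are illustrative and not part of the slides.

```python
from statistics import mean

# The ten test scores from the example.
scores = [65, 73, 73, 75, 81, 87, 92, 96, 98, 100]

# Assumed estimator: the sample mean stands in for the average test score.
mu_hat = mean(scores)

# Each residual plays the role of the random value added to the average.
residuals = [y - mu_hat for y in scores]

# Total squared deviation from the mean: the variability left unexplained.
sst = sum(e ** 2 for e in residuals)

print("estimated mean:", mu_hat)
print("total sum of squares:", sst)
```

The larger the total sum of squares relative to the precision needed, the less useful a mean-only model is for prediction.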

Improving the Model

• We would have a more useful model if we could remove (explain) some of the variability that we see in the data.

• Perhaps there exist other factors that cause variability in the test score. Can you think of some?

Improving the Model (cont.)

• Here is the data including the hours of study.

Hours of Study   Test Score
1                65
2                73
2                73
3                75
4                81
5                98
5                87
6                92
6                96
8                100

Improving the Model (cont.)

• We have the same problem:

• Select the best line that minimizes the (squared) distance of the data points to the line.

• This line is referred to as the least squares line.

• Our model now looks like $y = \beta_0 + \beta_1 x + \varepsilon$.

• Our estimated or fitted line will be called $\hat{y} = b_0 + b_1 x$.
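A minimal sketch of the least squares computation for the hours-of-study data, using the standard closed-form estimates for a single predictor. The names `ss_x` and `ss_xy` are illustrative.

```python
# Hours of study (x) and test scores (y) from the data table.
hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Sum of squared deviations of x, and sum of cross products of x and y.
ss_x = sum((x - x_bar) ** 2 for x in hours)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

# Closed-form least squares estimates of the slope and the intercept.
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} * x")
```

The slope and intercept agree with the coefficients in the Excel output shown later (about 5.30 and 61.75).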

Another view of the Model

• This (and all linear regression) model(s) can be expressed as $y = E(y) + \varepsilon$.

• So in our model, $E(y) = \beta_0 + \beta_1 x$, that is, the mean test score falls on a straight line as a function of hours of study.

• The random error term, $\varepsilon$, is assumed to have a normal distribution with mean 0 and variance $\sigma^2$.

• Our ability to effectively use the model depends on this variation.

Analysis of Variance

• It turns out that the variation displayed by the variable y, referred to as the total sum of squares (SST), can be broken into two pieces:

• The part caused by the variable x, called the regression sum of squares (SSR),

• The part left over (the distance from the data points to the regression line), called the sum of squared errors or residual sum of squares (SSE).

SST = SSR + SSE
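The decomposition can be checked numerically on the hours-of-study data. This is a sketch that refits the least squares line and then computes each sum of squares directly from its definition; variable names are illustrative.

```python
hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Least squares slope and intercept.
ss_x = sum((x - x_bar) ** 2 for x in hours)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / ss_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

# Total variation of y around its mean.
sst = sum((y - y_bar) ** 2 for y in scores)
# Variation of the fitted values around the mean (explained by x).
ssr = sum((f - y_bar) ** 2 for f in fitted)
# Variation of the data around the fitted line (left over).
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
```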

Getting it done with EXCEL

• Select tools/data analysis/regression.

Regression Statistics

Multiple R          0.9479385    r, correlation
R Square            0.8985875    $r^2 = SSR/SST$
Adjusted R Square   0.8859109    $r_a^2 = 1 - (1 - r^2)\,[(n-1)/(n-p-1)]$
Standard Error      4.1551753    s, the square root of $s^2$, our estimate of $\sigma^2$
Observations        10
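The regression statistics can be reproduced from the sums of squares. The sketch below plugs in the rounded values reported in the ANOVA output (SSR = 1223.88, SSE = 138.12, SST = 1362.00), so the results match Excel only to about four decimal places. Here p is the number of independent variables, which is 1 for this model.

```python
from math import sqrt

# Rounded sums of squares taken from the ANOVA output.
ssr, sse, sst = 1223.88, 138.12, 1362.00
n = 10  # observations
p = 1   # independent variables (hours of study)

r2 = ssr / sst                                  # R Square
r = sqrt(r2)                                    # Multiple R (slope is positive)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # Adjusted R Square
s = sqrt(sse / (n - p - 1))                     # Standard Error

print(f"r = {r:.4f}, r2 = {r2:.4f}, adjusted r2 = {r2_adj:.4f}, s = {s:.4f}")
```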


ANOVA

             df   SS        MS        F       Significance F
Regression   1    1223.88   1223.88   70.89   3.01752E-05
Residual     8    138.12    17.27
Total        9    1362.00

• SS: actual sums of squares (SSR, SSE, and SST, from top to bottom).

• MS: sums of squares divided by degrees of freedom (MSR, then MSE, which is also $s^2$).

• F: MSR/MSE.

• Significance F: the p-value.

For example, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$.
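The MS and F columns follow mechanically from the SS and df columns. A short sketch using the rounded values from the ANOVA output; the p-value is not computed here because it needs the F distribution, which is outside the Python standard library.

```python
# Sums of squares and degrees of freedom from the ANOVA output.
ssr, df_regression = 1223.88, 1
sse, df_residual = 138.12, 8

# Mean squares: each sum of squares divided by its degrees of freedom.
msr = ssr / df_regression
mse = sse / df_residual  # this is also s squared

# F statistic: ratio of the two mean squares.
f_stat = msr / mse

print(f"MSR = {msr:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
```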


                         Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept ($b_0$)        61.75          2.95             20.92    3E-08     54.94       68.55
Hours of Study ($b_1$)   5.30           0.63             8.42     3E-05     3.85        6.75

• Coefficients: least squares estimates ($b_0$ and $b_1$).

• Standard Error: standard deviation of our estimate.

• t Stat: t-test for the hypothesis that the coefficient ($\beta$) is 0.

• P-value: the important part, the p-value for the t-test.

• Lower 95% and Upper 95%: confidence intervals for our estimates.
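The coefficient table can be rebuilt from the raw data with the usual simple-regression formulas. This is a sketch: the critical value 2.306 is taken from a t-table for 8 degrees of freedom at the 95% level and is an assumption of this example. P-values are omitted because the standard library has no t distribution.

```python
from math import sqrt

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
ss_x = sum((x - x_bar) ** 2 for x in hours)

# Least squares estimates.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / ss_x
b0 = y_bar - b1 * x_bar

# s: square root of SSE divided by its degrees of freedom (n - 2).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s = sqrt(sse / (n - 2))

# Standard errors of the intercept and slope estimates.
se_b0 = s * sqrt(1 / n + x_bar ** 2 / ss_x)
se_b1 = s / sqrt(ss_x)

# t statistics for the hypothesis that each coefficient is 0.
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1

T_CRIT = 2.306  # assumed t-table value: upper 2.5% point, 8 degrees of freedom
ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

print(f"b0 = {b0:.2f}, se = {se_b0:.2f}, t = {t_b0:.2f}, "
      f"95% CI = ({ci_b0[0]:.2f}, {ci_b0[1]:.2f})")
print(f"b1 = {b1:.2f}, se = {se_b1:.2f}, t = {t_b1:.2f}, "
      f"95% CI = ({ci_b1[0]:.2f}, {ci_b1[1]:.2f})")
```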

Hypothesis Testing

• The F-test tests to see if all of the coefficients of the independent variables are zero. For our model: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$.

• The t-test tests to see if each coefficient of an independent variable is zero.

Using the Model

• The model has two basic purposes:

• (1) It can be used to provide partial confirmation of the theory that a particular factor is, indeed, influencing the response variable, y.

• (2) It can be used to estimate the mean, E(y), and predict an actual value of y.

• Under the function wizard, select forecast(new_x, known_y, known_x).

• A confidence interval can be generated (page 555 for more discussion). Let

$SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = \sum x_i^2 - n\bar{x}^2$
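The equivalent forms of the sum of squares for x can be checked on the hours-of-study data. A small sketch; the three variable names are illustrative.

```python
hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]

n = len(hours)
x_bar = sum(hours) / n

# Definition: squared deviations of x from its mean.
ss_x_definition = sum((x - x_bar) ** 2 for x in hours)

# Shortcut: sum of squares minus the squared sum divided by n.
ss_x_shortcut = sum(x ** 2 for x in hours) - sum(hours) ** 2 / n

# Same shortcut written with the mean of x.
ss_x_mean_form = sum(x ** 2 for x in hours) - n * x_bar ** 2

print(ss_x_definition, ss_x_shortcut, ss_x_mean_form)
```

All three agree up to floating-point rounding (about 43.6 for this data).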

Using the Model

Confidence Interval for $E(y_p)$:

$\hat{y}_p \pm t_{\alpha/2,\,df}\; s_{\hat{y}_p}$

where $df = n - 2$ for this model, but in general, look at the denominator of $s^2$, and

$s_{\hat{y}_p} = s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}}$
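A sketch of a 95% confidence interval for the mean test score at a given number of study hours. The value x_p = 4 is a hypothetical input chosen for illustration, and the critical value 2.306 is an assumed t-table entry for 8 degrees of freedom. The point estimate is the same quantity the forecast function returns.

```python
from math import sqrt

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
ss_x = sum((x - x_bar) ** 2 for x in hours)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / ss_x
b0 = y_bar - b1 * x_bar

# s: square root of SSE over its degrees of freedom.
df = n - 2
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s = sqrt(sse / df)

T_CRIT = 2.306  # assumed t-table value: upper 2.5% point, 8 degrees of freedom
x_p = 4         # hypothetical hours of study

# Point estimate of the mean response at x_p.
y_hat_p = b0 + b1 * x_p

# Standard error of the estimated mean at x_p.
s_mean = s * sqrt(1 / n + (x_p - x_bar) ** 2 / ss_x)

ci = (y_hat_p - T_CRIT * s_mean, y_hat_p + T_CRIT * s_mean)
print(f"y_hat = {y_hat_p:.2f}, 95% CI for E(y) = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval is narrowest when x_p equals the mean of x and widens as x_p moves away from it.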

Using the Model

Prediction Interval for $y_p$:

$\hat{y}_p \pm t_{\alpha/2,\,df}\; s_{ind}$

where $df = n - 2$ for this model, but in general, look at the denominator of $s^2$, and

$s_{ind} = s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}}$
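A sketch of a 95% prediction interval for one individual score, again at the hypothetical value x_p = 4 and with an assumed t-table critical value of 2.306 for 8 degrees of freedom. The helper `fit_line` is introduced here only for illustration.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (b0, b1, s, x_bar, ss_x) for a simple least squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))
    return b0, b1, s, x_bar, ss_x

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]
n = len(hours)

b0, b1, s, x_bar, ss_x = fit_line(hours, scores)

T_CRIT = 2.306  # assumed t-table value: upper 2.5% point, 8 degrees of freedom
x_p = 4         # hypothetical hours of study

y_hat_p = b0 + b1 * x_p

# The extra "1 +" accounts for the random error of a single new observation.
s_ind = s * sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ss_x)

pi = (y_hat_p - T_CRIT * s_ind, y_hat_p + T_CRIT * s_ind)
print(f"y_hat = {y_hat_p:.2f}, 95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

Because of the extra term, the prediction interval is always wider than the confidence interval for the mean at the same x_p.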

Using the Model

• EXCEL does not provide an automatic calculation for confidence and prediction intervals.

• The authors have included a macro in the spreadsheet called PredInt.xls on your data disk.

• Simply open the file and follow the instructions!

A Note on Correlation

• Many people prefer to perform a correlational analysis before they build regression models.

• In EXCEL this can be accomplished in two ways:

• Under the function wizard, use correl(array1,array2) to find the correlation between two variables.

• Use tools/data analysis/correlation to determine the correlation between several variables.
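For two variables, the correlation can also be computed directly from sums of squares and cross products. A minimal sketch on the hours-of-study data; it uses the standard Pearson formula, and the variable names are illustrative.

```python
from math import sqrt

hours = [1, 2, 2, 3, 4, 5, 5, 6, 6, 8]
scores = [65, 73, 73, 75, 81, 98, 87, 92, 96, 100]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Sums of squared deviations and of cross products.
ss_x = sum((x - x_bar) ** 2 for x in hours)
ss_y = sum((y - y_bar) ** 2 for y in scores)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

# Pearson correlation between hours of study and test score.
r = ss_xy / sqrt(ss_x * ss_y)

print(f"r = {r:.4f}")
```

For this data the result matches the Multiple R value in the regression output, since the slope is positive and there is a single predictor.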

Correlation (cont.)

• What correlation does:

• Provides an easy measure to determine if two variables have a linear relationship.

• Positive correlation implies if one variable goes up, the other also tends to go up.

• Negative correlation implies if one variable goes up, the other tends to go down.

• What correlation does not do:

• There is no implication of cause and effect.

• There may exist some lurking factor that produces the behavior being witnessed.
