Simple Linear Regression
Farideh Dehkordi-Vakil
Simple Regression
Simple regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x).
The dependent variable is the variable for which we want to make a prediction.
While various non-linear forms may be used, simple linear regression models are the most common.
Introduction
• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y.

Lot size   Man-hours
30         73
20         50
60         128
80         170
40         87
50         108
60         135
30         69
70         148
60         132
Introduction
The goal of the analyst who studies the data is to find a functional relation between the response variable y and the predictor variable x:
$$y = f(x)$$
[Figure: Statistical relation between Lot size and Man-Hour. Horizontal axis: Lot size; vertical axis: Man-Hour.]
Regression Function
The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines:
1. Regard Y as a random variable.
2. For each X, take f(x) to be the expected value (i.e., mean value) of Y.
3. Given that E(Y) denotes the expected value of Y, call the equation
$$E(Y) = f(x)$$
the regression function.
Historical Origin of Regression
Regression Analysis was first developed by Sir Francis Galton, who studied the relation between heights of sons and fathers.
Heights of sons of both tall and short fathers appeared to “revert” or “regress” to the mean of the group.
Basic Assumptions of a Regression Model
A regression model is based on the following assumptions:
1. There is a probability distribution of Y for each level of X.
2. Given that $\mu_Y = f(x)$ is the mean value of Y, the standard form of the model is
$$Y = f(x) + \varepsilon$$
where $\varepsilon$ is a random variable with a normal distribution.
[Figure: Statistical relation between lot size and number of man-hours, Westwood Company example.]
Pictorial Presentation of Linear Regression Model
Construction of Regression Models
Selection of independent variables
• Since reality must be reduced to manageable proportions whenever we construct models, only a limited number of independent or predictor variables can or should be included in a regression model. Therefore a central problem is that of choosing the most important predictor variables.
Functional form of regression relation
• Sometimes, relevant theory may indicate the appropriate functional form.
• More frequently, however, the functional form is not known in advance and must be decided once the data have been collected and analyzed.
Scope of model
• In formulating a regression model, we usually need to restrict the coverage of the model to some interval or region of values of the independent variables.
Uses of Regression Analysis
Regression analysis serves three major purposes.
1. Description
2. Control
3. Prediction
The several purposes of regression analysis frequently overlap in practice.
Formal Statement of the Model
General regression model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
1. $\beta_0$ and $\beta_1$ are parameters
2. X is a known constant
3. Deviations $\varepsilon$ are independent $N(0, \sigma^2)$
Meaning of Regression Coefficients
The values of the regression parameters $\beta_0$ and $\beta_1$ are not known. We estimate them from data.
$\beta_1$ indicates the change in the mean response per unit increase in X.
Regression Line
If the scatter plot of our sample data suggests a linear relationship between two variables, i.e.
$$y = \beta_0 + \beta_1 x$$
we can summarize the relationship by drawing a straight line on the plot.
The least squares method gives us the “best” estimated line for our set of sample data.
Regression Line
We will write an estimated regression line based on sample data as
$$\hat{y} = b_0 + b_1 x$$
The method of least squares chooses the values for $b_0$ and $b_1$ to minimize the sum of squared errors
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
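Regression Line
Using calculus, we obtain estimating formulas:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
or
$$b_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$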
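The calculus step can be sketched as follows. This is a standard derivation rather than one shown on the slides: SSE is a quadratic in $b_0$ and $b_1$, so setting both partial derivatives to zero gives the minimizing values.

```latex
\begin{aligned}
\frac{\partial\, SSE}{\partial b_0} &= -2\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  &&\Rightarrow\quad \sum y = n b_0 + b_1 \sum x \\
\frac{\partial\, SSE}{\partial b_1} &= -2\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0
  &&\Rightarrow\quad \sum xy = b_0 \sum x + b_1 \sum x^2 \\[4pt]
\text{First equation:}\quad b_0 &= \bar{y} - b_1 \bar{x} \\
\text{Substituting into the second:}\quad
b_1 &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}
\end{aligned}
```

The two equations on the right are usually called the normal equations; solving them reproduces both estimating formulas.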
Estimation of Mean Response
The fitted regression line can be used to estimate the mean value of y for a given value of x.
Example
The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

y      x
1250   41
1380   54
1425   63
1425   54
1450   48
1300   46
1400   62
1510   61
1575   64
1650   71
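Point Estimation of Mean Response
From the previous table we have:
$$n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755$$
The least squares estimates of the regression coefficients are:
$$b_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = \frac{85690}{7944} \approx 10.8$$
$$b_0 = \bar{y} - b_1 \bar{x} = 1436.5 - 10.787(56.4) \approx 828$$
The intercept is computed with the unrounded slope 10.787; using the rounded value 10.8 would give 827.4 instead.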
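As a check on the arithmetic, the computational formula can be evaluated directly. The following is a minimal sketch in plain Python; the function name `least_squares` is an illustrative choice and not part of the lecture.

```python
# Weekly advertising expenditure (x) and weekly sales (y) from the example table.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]


def least_squares(x, y):
    """Return (b0, b1) using the computational least squares formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n  # b0 = y-bar - b1 * x-bar
    return b0, b1


b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 4))  # prints 828.13 10.7868
```

Rounded to the precision used on the slides, these are the estimates 828 and 10.8.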
Point Estimation of Mean Response
The estimated regression function is:
$$\hat{y} = 828 + 10.8x$$
$$\text{Sales} = 828 + 10.8\,\text{Expenditure}$$
This means that if the weekly advertising expenditure is increased by $1 we would expect the weekly sales to increase by $10.8.
Point Estimation of Mean Response
Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.
For example, if the advertising expenditure is $50, then the estimated sales is:
$$\text{Sales} = 828 + 10.8(50) = 1368$$
This is called the point estimate (forecast) of the mean response (sales).
Example: Retail sales and floor space
It is customary in retail operations to assess the performance of stores partly in terms of their annual sales relative to their floor area (square feet). We might expect sales to increase linearly as stores get larger, with of course individual variation among stores of the same size. The regression model for a population of stores says that
$$\text{SALES} = \beta_0 + \beta_1\,\text{AREA} + \varepsilon$$
Example: Retail sales and floor space
The slope $\beta_1$ is as usual a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space.
The intercept $\beta_0$ is needed to describe the line but has no statistical importance because no stores have area close to zero.
Floor space does not completely determine sales. The term $\varepsilon$ in the model accounts for differences among individual stores with the same floor space. A store’s location, for example, is important.
Residual
The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$:
$$e_i = y_i - \hat{y}_i$$
Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
Example: weekly advertising expenditure

y      x    y-hat    Residual (e)
1250   41   1270.8   -20.8
1380   54   1411.2   -31.2
1425   63   1508.4   -83.4
1425   54   1411.2   13.8
1450   48   1346.4   103.6
1300   46   1324.8   -24.8
1400   62   1497.6   -97.6
1510   61   1486.8   23.2
1575   64   1519.2   55.8
1650   71   1594.8   55.2
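The fitted values and residuals in this table can be reproduced in a few lines. This sketch assumes the rounded coefficients used on the slides (828 and 10.8); the variable names are illustrative.

```python
# Data from the weekly advertising example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

# Rounded estimates as used on the slides (an assumption of this sketch).
b0, b1 = 828.0, 10.8

fitted = [b0 + b1 * xi for xi in x]                 # y-hat for each observation
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # e = y - y-hat

for yi, xi, fi, ei in zip(y, x, fitted, residuals):
    print(f"{yi:6d} {xi:4d} {fi:8.1f} {ei:8.1f}")
```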
Estimation of the variance of the error terms, $\sigma^2$
The variance $\sigma^2$ of the error terms $\varepsilon_i$ in the regression model needs to be estimated for a variety of purposes.
It gives an indication of the variability of the probability distributions of y.
It is needed for making inferences concerning the regression function and the prediction of y.
Regression Standard Error
To estimate $\sigma$ we work with the variance and take the square root to obtain the standard deviation.
For simple linear regression the estimate of $\sigma^2$ is the average squared residual, where the sum is divided by the $n - 2$ degrees of freedom:
$$s_{y.x}^2 = \frac{1}{n-2}\sum e_i^2 = \frac{1}{n-2}\sum (y_i - \hat{y}_i)^2$$
To estimate $\sigma$, use
$$s_{y.x} = \sqrt{s_{y.x}^2}$$
$s_{y.x}$ estimates the standard deviation $\sigma$ of the error term $\varepsilon$ in the statistical model for simple linear regression.
Regression Standard Error

y      x    y-hat    Residual (e)   square(e)
1250   41   1270.8   -20.8          432.64
1380   54   1411.2   -31.2          973.44
1425   63   1508.4   -83.4          6955.56
1425   54   1411.2   13.8           190.44
1450   48   1346.4   103.6          10732.96
1300   46   1324.8   -24.8          615.04
1400   62   1497.6   -97.6          9525.76
1510   61   1486.8   23.2           538.24
1575   64   1519.2   55.8           3113.64
1650   71   1594.8   55.2           3047.04

y-hat = 828 + 10.8X       total     36124.76
                          Sy.x      67.19818
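The total and the regression standard error in this table follow directly from the residual column. A minimal sketch, with the residuals copied from the table:

```python
import math

# Residuals from the table (based on y-hat = 828 + 10.8x).
residuals = [-20.8, -31.2, -83.4, 13.8, 103.6, -24.8, -97.6, 23.2, 55.8, 55.2]

n = len(residuals)
sse = sum(e * e for e in residuals)  # sum of squared residuals
s_squared = sse / (n - 2)            # estimate of sigma^2, with n - 2 degrees of freedom
s_yx = math.sqrt(s_squared)          # regression standard error

print(round(sse, 2), round(s_yx, 5))  # prints 36124.76 67.19818
```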
Analysis of Residual
Inference based on a regression model can be misleading if the assumptions are violated.
Assumptions for the simple linear regression model are:
• The underlying relation is linear.
• The errors are independent.
• The errors have constant variance.
• The errors are normally distributed.
Analysis of Residual
To examine whether the regression model is appropriate for the data being analyzed, we can check the residual plots.
Residual plots are:
• Plot a histogram of the residuals.
• Plot residuals against the fitted values.
• Plot residuals against the independent variable.
• Plot residuals over time if the data are chronological.
Analysis of Residual
A histogram of the residuals provides a check on the normality assumption. A Normal quantile plot of the residuals can also be used to check the Normality assumption.
Moderate departures from a bell-shaped curve do not impair the conclusions from tests or prediction intervals.
A plot of residuals against fitted values or the independent variable can be used to check the assumption of constant variance and the aptness of the model.
Analysis of Residual
A plot of residuals against time provides a check on the assumption that the error terms are independent.
The assumption of independence is the most critical one.
Variable transformations
If the residual plot suggests that the variance is not constant, a transformation can be used to stabilize the variance.
If the residual plot suggests a non-linear relationship between x and y, a transformation may reduce it to one that is approximately linear.
Common linearizing transformations are:
$$\frac{1}{x}, \quad \log(x)$$
Variance stabilizing transformations are:
$$\frac{1}{y}, \quad \log(y), \quad \sqrt{y}, \quad y^2$$
Inference in Regression Analysis
The simple linear regression model imposes several conditions. We should verify these conditions before proceeding to inference.
These conditions concern the population, but we can observe only our sample.
In doing inference we act as if:
• The sample is an SRS from the population.
• There is a linear relationship in the population.
• The standard deviation of the responses about the population line is the same for all values of the explanatory variable.
• The response varies Normally about the population regression line.
Inference in Regression Analysis
Plotting the residuals against the explanatory variable is helpful in checking these conditions because a residual plot magnifies patterns.
Confidence Intervals and Significance Tests
In our previous lectures we presented confidence intervals and significance tests for means and differences in means. In each case, inference rested on the standard error s of the estimates and on t or z distributions.
Inference for the slope and intercept in linear regression is similar in principle, although the recipes are more complicated.
All confidence intervals, for example, have the form
$$\text{estimate} \pm t^* \, SE_{\text{estimate}}$$
where $t^*$ is a critical value of a t distribution.
Confidence Intervals and Significance Tests
Confidence intervals and tests for the slope and intercept are based on the sampling distributions of the estimates $b_1$ and $b_0$.
Here are the facts:
• If the simple linear regression model is true, each of $b_0$ and $b_1$ has a Normal distribution.
• The mean of $b_0$ is $\beta_0$ and the mean of $b_1$ is $\beta_1$.
• The standard deviations of $b_0$ and $b_1$ are multiples of the model standard deviation $\sigma$.
$$SE_{b_1} = S(b_1) = \frac{S_{y.x}}{\sqrt{\sum (x - \bar{x})^2}}$$
Example: Weekly Advertising Expenditure
Let us return to the weekly advertising expenditure and weekly sales example. Management is interested in testing whether or not there is a linear association between advertising expenditure and weekly sales, using the regression model. Use $\alpha = .05$.
Example: Weekly Advertising Expenditure
Hypothesis:
$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \neq 0$$
Decision Rule:
Reject $H_0$ if
$$t > t(.025; 8) = 2.306$$
or
$$t < -t(.025; 8) = -2.306$$
Example: Weekly Advertising Expenditure
Test statistic:
$$t = \frac{b_1}{S(b_1)}$$
$$S(b_1) = \frac{S_{y.x}}{\sqrt{\sum (x - \bar{x})^2}} = \frac{67.2}{\sqrt{794.4}} = 2.38$$
$$b_1 = 10.8$$
$$t = \frac{10.8}{2.38} = 4.5$$
Example: Weekly Advertising Expenditure
Conclusion:
Since t = 4.5 > 2.306, we reject $H_0$.
There is a linear association between advertising expenditure and weekly sales.
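The test can be traced numerically as below. This is a sketch under the slides' rounded inputs (slope 10.8, standard error 67.2); the critical value 2.306 is the table value quoted above, not computed here.

```python
import math

# Advertising expenditure values from the example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]

b1 = 10.8        # rounded slope estimate from the slides
s_yx = 67.2      # rounded regression standard error from the slides
t_crit = 2.306   # t(.025; 8), table value quoted in the text

x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations of x
se_b1 = s_yx / math.sqrt(sxx)             # standard error of the slope
t = b1 / se_b1                            # test statistic for H0: beta1 = 0

reject_h0 = abs(t) > t_crit               # two-sided decision rule
print(round(sxx, 1), round(se_b1, 2), round(t, 1), reject_h0)
```

With these inputs the sum of squares is 794.4, the standard error of the slope is about 2.38, and the statistic is about 4.5, so the null hypothesis is rejected.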
Confidence interval for $\beta_1$
Now that our test showed that there is a linear association between advertising expenditure and weekly sales, the management wishes an estimate of $\beta_1$ with a 95% confidence coefficient.
$$b_1 \pm t(\alpha/2;\, n-2)\, S(b_1)$$
Confidence interval for $\beta_1$
For a 95 percent confidence coefficient, we require t(.025; 8). From table B in appendix III, we find t(.025; 8) = 2.306.
The 95% confidence interval is:
$$b_1 \pm t(\alpha/2;\, n-2)\, S(b_1) = 10.8 \pm 2.306(2.38) = 10.8 \pm 5.49 = (5.31, 16.3)$$
Example: Do wages rise with experience?
Many factors affect the wages of workers: the industry they work in, their type of job, their education and their experience, and changes in general levels of wages. We will look at a sample of 59 married women who hold customer service jobs in Indiana banks. The following table gives their weekly wages at a specific point in time and their length of service with their employer, in months. The size of the place of work is recorded simply as “large” (100 or more workers) or “small.” Because industry, job type, and the time of measurement are the same for all 59 subjects, we expect to see a clear relationship between wages and length of service.
Do wages rise with experience?
The hypotheses are:
$H_0: \beta_1 = 0$, $H_a: \beta_1 > 0$
The t statistic for the significance of regression is:
$$t = \frac{b_1}{SE_{b_1}} = \frac{0.5905}{0.20697} = 2.85$$
The P-value is: P(t > 2.85) < .005
The t distribution for this problem has n - 2 = 57 degrees of freedom.
Conclusion: Reject $H_0$. There is strong evidence that the mean wages increase as length of service increases.
Example: Do wages rise with experience?
A 95% confidence interval for the slope $\beta_1$ of the regression line in the population of all married female customer service workers in Indiana banks is
$$b_1 \pm t^* \, SE_{b_1} = 0.5905 \pm (2.00)(0.20697) = 0.5905 \pm 0.4139 = (0.177, 1.00)$$
The t distribution for this problem has n - 2 = 57 degrees of freedom.
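Both slope intervals so far have the same form, estimate ± t* × SE, so one small helper covers them. This is an illustrative sketch; the name `t_interval` is hypothetical, and the critical values are the table values quoted in the text rather than computed.

```python
def t_interval(estimate, t_star, standard_error):
    """Return (lower, upper) for the interval estimate ± t* × SE."""
    margin = t_star * standard_error
    return estimate - margin, estimate + margin


# Advertising example: b1 = 10.8, t(.025; 8) = 2.306, S(b1) = 2.38
ad_low, ad_high = t_interval(10.8, 2.306, 2.38)

# Wages example: b1 = 0.5905, t* = 2.00, SE = 0.20697
wage_low, wage_high = t_interval(0.5905, 2.00, 0.20697)

print(round(ad_low, 2), round(ad_high, 2))      # prints 5.31 16.29
print(round(wage_low, 3), round(wage_high, 3))  # prints 0.177 1.004
```

Neither interval contains zero, which agrees with the significance tests for the two examples.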
Inference about Correlation
The correlation between wages and length of service for the 59 bank workers is r = 0.3535. This appears in the Excel output, where it is labeled “Multiple R.”
We expect a positive correlation between length of service and wages in the population of all married female bank workers. Is the sample result convincing that this is true?
This question concerns a new population parameter, the population correlation. This is the correlation between length of service and wages when we measure these variables for every member of the population.
Inference about Correlation
We will call the population correlation $\rho$.
To assess the evidence that $\rho > 0$ in the bank worker population, we must test the hypotheses
$H_0: \rho = 0$
$H_a: \rho > 0$
It is natural to base the test on the sample correlation r.
There is a link between correlation and regression slope. The population correlation $\rho$ is zero, positive, or negative exactly when the slope $\beta_1$ of the population regression line is zero, positive, or negative.
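Correlation Coefficient
Recall that the algebraic expression for the correlation coefficient is
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$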
Example: Do wages rise with experience?
The sample correlation between wages and length of service is r = 0.3535 from a sample of n = 59.
To test
$H_0: \rho = 0$
$H_a: \rho > 0$
use the t statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.3535\sqrt{59-2}}{\sqrt{1-(0.3535)^2}} = 2.853$$
Example: Do wages rise with experience?
Compare t = 2.853 with critical values from the t table with n - 2 = 57 degrees of freedom.
Conclusion: P( t > 2.853) < .005, therefore we reject H0.
There is a positive correlation between wages and length of service.
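The t statistic for the correlation needs only r and n. A minimal sketch (the function name `correlation_t` is illustrative):

```python
import math


def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t = correlation_t(0.3535, 59)
print(round(t, 3))  # prints 2.853
```

This matches, up to rounding, the value 2.85 obtained earlier from the slope and its standard error, as expected from the link between correlation and regression slope.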
Prediction of a new response ($\hat{y}$)
We now consider the prediction of a new observation y corresponding to a given level x of the independent variable.
In our advertising expenditure and weekly sales example, the management wishes to predict the weekly sales corresponding to the advertising expenditure of x = $50.
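Interval Estimation of a new response ($\hat{y}$)
The following formula gives us the point estimator (forecast) for y:
$$\hat{y} = b_0 + b_1 x$$
A $(1-\alpha)100\%$ prediction interval for a new observation y is:
$$\hat{y} \pm t(\alpha/2;\, n-2)\, S_f$$
where
$$S_f = S_{y.x}\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$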
Example
In our advertising expenditure and weekly sales example, the management wishes to predict the weekly sales if the advertising expenditure is $50 with a 90% prediction interval.
$$\hat{y} = 828 + 10.8(50) = 1368$$
$$S_f = S_{y.x}\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 67.2\sqrt{1 + \frac{1}{10} + \frac{(50 - 56.4)^2}{794.4}} = 72.11$$
We require t(.05; 8) = 1.860.
Example
The 90% prediction interval is:
$$\hat{y} \pm t(.05; 8)\, S_f = 1368 \pm 1.860(72.11) = (1233.9, 1502.1)$$
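The prediction interval can be reproduced with a short function. This sketch uses the slides' rounded values (828, 10.8, 67.2) and the quoted table value t(.05; 8) = 1.860; the function name `prediction_interval` is illustrative.

```python
import math

# Advertising expenditure values from the example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]


def prediction_interval(x_new, x, b0, b1, s_yx, t_star):
    """Return (y_hat, s_f, lower, upper) for a new response at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    y_hat = b0 + b1 * x_new
    # Standard error of prediction: includes the extra "1 +" for a new observation.
    s_f = s_yx * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return y_hat, s_f, y_hat - t_star * s_f, y_hat + t_star * s_f


y_hat, s_f, lower, upper = prediction_interval(50, x, 828.0, 10.8, 67.2, 1.860)
print(round(y_hat), round(s_f, 2), round(lower, 1), round(upper, 1))
# prints 1368 72.11 1233.9 1502.1
```

The term `(x_new - x_bar) ** 2 / sxx` makes the interval wider as the new x moves away from the mean of the observed x values.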
Analysis of variance approach to Regression analysis
The analysis of variance approach is based on the partitioning of sums of squares and degrees of freedom associated with the response variable.
Consider the weekly advertising expenditure and the weekly sales example. There is variation in the amount ($) of weekly sales, as in all statistical data. The variation of the $y_i$ is conventionally measured in terms of the deviations:
$$y_i - \bar{y}$$
Analysis of variance approach to Regression analysis
The measure of total variation, denoted by SST, is the sum of the squared deviations:
$$SST = \sum (y_i - \bar{y})^2$$
If SST = 0, all observations are the same (no variability).
The greater SST is, the greater is the variation among the y values.
When we use the regression model, the measure of variation is that of the y observations' variability around the fitted line:
$$y_i - \hat{y}_i$$
Analysis of variance approach to Regression analysis
The measure of variation in the data around the fitted regression line is the sum of squared deviations (error), denoted SSE:
$$SSE = \sum (y_i - \hat{y}_i)^2$$
For our weekly expenditure example:
SSE = 36124.76
SST = 128552.5
What accounts for the substantial difference between these two sums of squares?
Analysis of variance approach to Regression analysis
The difference is another sum of squares:
$$SSR = \sum (\hat{y}_i - \bar{y})^2$$
SSR stands for regression sum of squares.
SSR may be considered as a measure of the variability of the $y_i$ that is associated with the regression line.
The larger SSR is relative to SST, the greater is the role of the regression line in explaining the total variability in the y observations.
Analysis of variance approach to Regression analysis
In our example:
$$SSR = SST - SSE = 128552.5 - 36124.76 = 92427.74$$
This indicates that most of the variability in weekly sales can be explained by the relation between the weekly advertising expenditure and the weekly sales.
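The three sums of squares can be recomputed from the data. The sketch below follows the slides in using the rounded line 828 + 10.8x and in obtaining SSR by subtraction; the variable names are illustrative.

```python
# Data from the weekly advertising example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

b0, b1 = 828.0, 10.8  # rounded estimates used on the slides
y_bar = sum(y) / len(y)
fitted = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # variation around the line
# SSR is taken as the difference, as on the slide. The identity
# SST = SSR + SSE holds exactly only for the unrounded least squares line.
ssr = sst - sse

print(round(sst, 2), round(sse, 2), round(ssr, 2))  # prints 128552.5 36124.76 92427.74
```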
Formal Development of the Partitioning
We can decompose the total variability in the observations $y_i$ as follows:
$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$
The total deviation $y_i - \bar{y}$ can be viewed as the sum of two components:
• The deviation of the fitted value $\hat{y}_i$ around the mean $\bar{y}$.
• The deviation of $y_i$ around the fitted regression line.
Formal Development of the Partitioning
The sums of these squared deviations have the same relationship:
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$
Breakdown of degrees of freedom:
$$n - 1 = 1 + (n - 2)$$
Mean squares
A sum of squares divided by its degrees of freedom is called a mean square (MS).
Regression mean square (MSR):
$$MSR = \frac{SSR}{1}$$
Error mean square (MSE):
$$MSE = \frac{SSE}{n-2}$$
Note: mean squares are not additive.
Mean squares
In our example:
$$MSR = \frac{SSR}{1} = \frac{92427.74}{1} = 92427.74$$
$$MSE = \frac{SSE}{n-2} = \frac{36124.76}{8} = 4515.6$$
Analysis of Variance Table
The breakdowns of the total sum of squares and associated degrees of freedom are displayed in a table called the analysis of variance table (ANOVA table).

Source of Variation   SS    df      MS                  F-Test
Regression            SSR   1       MSR = SSR/1         MSR/MSE
Error                 SSE   n - 2   MSE = SSE/(n - 2)
Total                 SST   n - 1
Analysis of Variance Table
In our weekly advertising expenditure and weekly sales example the ANOVA table is:

Source of variation   SS         df   MS
Regression            92427.74   1    92427.74
Error                 36124.76   8    4515.6
Total                 128552.5   9
F-Test for $\beta_1 = 0$ versus $\beta_1 \neq 0$
The general analysis of variance approach provides us with a battery of highly useful tests for regression models. For the simple linear regression case considered here, the analysis of variance provides us with a test for:
$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \neq 0$$
F-Test for $\beta_1 = 0$ versus $\beta_1 \neq 0$
Test statistic:
$$F = \frac{MSR}{MSE}$$
In order to be able to construct a statistical decision rule, we need to know the distribution of our test statistic F.
When $H_0$ is true, our test statistic, F, follows the F-distribution with 1 and n - 2 degrees of freedom.
Table C on page 622 of your text gives the critical values of the F-distribution at $\alpha$ = 0.10, 0.05 and 0.01.
F-Test for $\beta_1 = 0$ versus $\beta_1 \neq 0$
Construction of decision rule:
At the $\alpha$ = 5% level, reject $H_0$ if
$$F > F(\alpha;\, 1,\, n-2)$$
Large values of F support $H_a$ and values of F near 1 support $H_0$.
F-Test for $\beta_1 = 0$ versus $\beta_1 \neq 0$
Using our example again, let us repeat the earlier test on $\beta_1$. This time we will use the F-test. The null and alternative hypotheses are:
$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \neq 0$$
Let $\alpha$ = .05. Since n = 10, we require F(.05; 1, 8). From table 5-3 we find that F(.05; 1, 8) = 5.32. Therefore the decision rule is:
Reject $H_0$ if:
$$F > 5.32$$
F-Test for $\beta_1 = 0$ versus $\beta_1 \neq 0$
From the ANOVA table we have
MSR = 92427.74
MSE = 4515.6
Our test statistic F is:
$$F = \frac{92427.74}{4515.6} = 20.47$$
Decision: Since 20.47 > 5.32, we reject $H_0$; that is, there is a linear association between weekly advertising expenditure and weekly sales.
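The mean squares and the F statistic follow mechanically from the sums of squares and their degrees of freedom. A minimal sketch, with the critical value 5.32 taken from the text rather than computed; the function name `f_statistic` is illustrative.

```python
def f_statistic(ssr, sse, n):
    """Return (MSR, MSE, F) for simple linear regression with n observations."""
    msr = ssr / 1        # regression sum of squares has 1 degree of freedom
    mse = sse / (n - 2)  # error sum of squares has n - 2 degrees of freedom
    return msr, mse, msr / mse


msr, mse, f_stat = f_statistic(92427.74, 36124.76, 10)
f_crit = 5.32                # F(.05; 1, 8), table value quoted in the text
reject_h0 = f_stat > f_crit  # decision rule: reject H0 if F > F(alpha; 1, n - 2)

print(round(mse, 1), round(f_stat, 2), reject_h0)  # prints 4515.6 20.47 True
```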
F-Test for $\beta_1 = 0$ versus $\beta_1 \neq 0$
Equivalence of F Test and t Test:
For a given $\alpha$ level, the F test of $\beta_1 = 0$ versus $\beta_1 \neq 0$ is equivalent algebraically to the two-sided t-test.
Thus, at a given level, we can use either the t-test or the F-test for testing $\beta_1 = 0$ versus $\beta_1 \neq 0$.
The t-test is more flexible since it can be used for a one-sided test as well.
Analysis of Variance Table
The complete ANOVA table for our example is:

Source of Variation   SS         df   MS         F-Test
Regression            92427.74   1    92427.74   20.47
Error                 36124.76   8    4515.6
Total                 128552.5   9
Computer Output
The Excel output for our example is:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.847950033
R Square            0.719019259
Adjusted R Square   0.683896667
Standard Error      67.19447214
Observations        10

ANOVA
             df   SS            MS         F         Significance F
Regression   1    92431.72331   92431.72   20.4717   0.0019382
Residual     8    36120.77669   4515.097
Total        9    128552.5

               Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept      828.1268882    136.1285978      6.083416   0.000295   514.2135758   1142.0402
AD-Expen (X)   10.7867573     2.384042146      4.524567   0.001938   5.289142698   16.2843719
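The Excel figures differ slightly from the hand calculations (for example, SSE = 36120.78 here versus 36124.76 earlier) because Excel carries the unrounded coefficients instead of 828 and 10.8. The sketch below, in plain Python with illustrative variable names, recomputes the main entries from the raw data without rounding.

```python
import math

# Data from the weekly advertising example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Unrounded least squares estimates.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sums of squares for the ANOVA table.
fitted = [b0 + b1 * xi for xi in x]
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr = sst - sse

mse = sse / (n - 2)
f_stat = ssr / mse             # MSR / MSE, with MSR = SSR / 1
r_squared = ssr / sst          # "R Square"
se_b1 = math.sqrt(mse / sxx)   # standard error of the slope
t_stat = b1 / se_b1            # "t Stat" for the slope

print(round(ssr, 2), round(sse, 2), round(f_stat, 4))
print(round(r_squared, 6), round(se_b1, 4), round(t_stat, 4))
```

Squaring the slope's t statistic gives the F statistic (4.524567 squared is about 20.4717), which illustrates the algebraic equivalence of the two tests.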
Coefficient of Determination
Recall that SST measures the total variation in the $y_i$ when no account of the independent variable x is taken.
SSE measures the variation in the $y_i$ when a regression model with the independent variable x is used.
A natural measure of the effect of x in reducing the variation in y can be defined as:
$$R^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Coefficient of Determination
$R^2$ is called the coefficient of determination.
Since $0 \le SSE \le SST$, it follows that:
$$0 \le R^2 \le 1$$
We may interpret $R^2$ as the proportionate reduction of total variability in y associated with the use of the independent variable x.
The larger $R^2$ is, the more the total variation of y is reduced by including the variable x in the model.
Coefficient of Determination
If all the observations fall on the fitted regression line, SSE = 0 and $R^2 = 1$.
If the slope of the fitted regression line is $b_1 = 0$, so that $\hat{y}_i = \bar{y}$, then SSE = SST and $R^2 = 0$.
The closer $R^2$ is to 1, the greater is said to be the degree of linear association between x and y.
The square root of $R^2$ is called the coefficient of correlation:
$$r = \pm\sqrt{R^2}$$
where the sign of r matches the sign of the slope $b_1$.
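For the advertising example, the coefficient of determination and the correlation follow from the sums of squares computed by hand earlier. A minimal sketch, using those rounded-coefficient values (an assumption; unrounded values give results that differ in the fourth decimal place):

```python
import math

sst = 128552.5   # total sum of squares from the example
sse = 36124.76   # error sum of squares from the example (rounded coefficients)
b1 = 10.8        # slope estimate; its sign determines the sign of r

r_squared = 1 - sse / sst
# r takes the sign of the slope: positive here, since sales rise with expenditure.
r = math.copysign(math.sqrt(r_squared), b1)

print(round(r_squared, 3), round(r, 3))  # prints 0.719 0.848
```

About 72% of the variation in weekly sales is associated with advertising expenditure.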