Page 1:

Inferences in Regression and Correlation Analysis

Ayona Chatterjee

Spring 2008

Math 4803/5803

Page 2:

Topics to be covered

• Interval estimation for regression parameters.

• Tests for regression parameters.

• Prediction intervals.

• Analysis of variance.

• Correlation coefficients.

Page 3:

Remembering β1

• Here β1 is the slope of the regression line.

• The main interest is to test:
  – H0: β1 = 0
  – H1: β1 ≠ 0

• The main reason to test whether β1 = 0 is that, when β1 = 0, there is no linear association between Y and X.

Page 4:

Sampling Distribution for β1

• For the normal error regression model the sampling distribution for b1 is also normal with

• This is because b1 is a linear combination of the observations Yi and since Yi is normally distributed, so is b1.

• Remember we can use MSE to estimate σ².

E\{b_1\} = \beta_1, \qquad \sigma^2\{b_1\} = \frac{\sigma^2}{\sum_i (X_i - \bar{X})^2}

Page 5:

Properties of ki

• The coefficients k_i, where b_1 = \sum_i k_i Y_i and k_i = (X_i - \bar{X}) / \sum_j (X_j - \bar{X})^2, have the following properties:

\sum_i k_i = 0, \qquad \sum_i k_i X_i = 1, \qquad \sum_i k_i^2 = \frac{1}{\sum_i (X_i - \bar{X})^2}
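These three properties can be checked numerically. A minimal sketch, where the X values are made up purely for the illustration:

```python
# Verify the properties of the coefficients k_i = (X_i - Xbar) / sum((X_j - Xbar)^2)
# on illustrative data (any X values with at least two distinct levels work).
X = [1, 2, 3, 4, 5]

n = len(X)
x_bar = sum(X) / n
sxx = sum((x - x_bar) ** 2 for x in X)          # sum of squared deviations
k = [(x - x_bar) / sxx for x in X]              # the k_i coefficients

sum_k = sum(k)                                  # property 1: equals 0
sum_kx = sum(ki * xi for ki, xi in zip(k, X))   # property 2: equals 1
sum_k2 = sum(ki ** 2 for ki in k)               # property 3: equals 1/Sxx

print(sum_k, sum_kx, sum_k2, 1 / sxx)
```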

Page 6:

The t-distribution

• Note that

\frac{b_1 - \beta_1}{s\{b_1\}}

is distributed as t(n − 2) for the normal error regression model.

• The 1 − α confidence limits for β1 are:

b_1 \pm t(1 - \alpha/2;\, n - 2)\, s\{b_1\}

• Let's do an example.
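As a sketch of how these limits can be computed, here is a small example. The data are made up for the illustration, and `scipy.stats.t.ppf` is assumed to be available for the t quantile; neither is part of the course example.

```python
from scipy.stats import t  # t.ppf gives the t quantile t(1 - alpha/2; n - 2)

# Illustrative data (made up for the demonstration)
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b1 = sxy / sxx                      # slope estimate
b0 = y_bar - b1 * x_bar             # intercept estimate

# MSE estimates sigma^2; s^2{b1} = MSE / sum((X_i - Xbar)^2)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
mse = sse / (n - 2)
s_b1 = (mse / sxx) ** 0.5

alpha = 0.05
t_mult = t.ppf(1 - alpha / 2, n - 2)
ci = (b1 - t_mult * s_b1, b1 + t_mult * s_b1)
print(b1, s_b1, ci)
```

Since the interval excludes zero, for this made-up data we would conclude β1 ≠ 0.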

Page 7:

Interpretation

• If the confidence interval does not include zero, we can conclude that β1 ≠ 0; the association between Y and X is then sometimes described as a linear statistical association.

Page 8:

Sampling Distribution of β0

• For a normal error regression model, the sampling distribution of b0 is normal with mean and variance:

E\{b_0\} = \beta_0, \qquad \sigma^2\{b_0\} = \sigma^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2}\right]

Page 9:

Confidence Interval for β0

• Similar to the previous confidence interval, the CI for β0 is:

• Let us find the 90% confidence interval for β0 for the ampules data discussed in chapter 1.

b_0 \pm t(1 - \alpha/2;\, n - 2)\, s\{b_0\}

Page 10:

Interval Estimation of E{Yh}

• Let Xh denote the level of X for which we wish to estimate the mean response.

• Here E{Yh} denotes the mean response when we observe Xh.

• The point estimator \hat{Y}_h = b_0 + b_1 X_h is normally distributed with mean and variance:

E\{\hat{Y}_h\} = \beta_0 + \beta_1 X_h, \qquad \sigma^2\{\hat{Y}_h\} = \sigma^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}\right]

Page 11:

Confidence Interval for E{Yh}

• The 1 − α confidence limits for E{Yh} are:

\hat{Y}_h \pm t(1 - \alpha/2;\, n - 2)\, s\{\hat{Y}_h\}
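A sketch of these limits in code, again on made-up data and again assuming `scipy.stats.t.ppf` for the t quantile:

```python
from scipy.stats import t  # t.ppf gives the t quantile t(1 - alpha/2; n - 2)

# Illustrative data (made up for the demonstration)
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(X)
x_h = 3.0          # level of X at which the mean response is estimated

x_bar, y_bar = sum(X) / n, sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
mse = sse / (n - 2)

y_hat = b0 + b1 * x_h
# s^2{Yhat_h} = MSE * [1/n + (x_h - Xbar)^2 / Sxx]
s_yhat = (mse * (1 / n + (x_h - x_bar) ** 2 / sxx)) ** 0.5

t_mult = t.ppf(0.975, n - 2)
ci = (y_hat - t_mult * s_yhat, y_hat + t_mult * s_yhat)
print(y_hat, ci)
```

Note the interval is narrowest at x_h = X̄ and widens as x_h moves away from the mean.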

Page 12:

Prediction interval for Yh(new)

• We only look at predicting a new Yh when the parameters are known.

• We denote the level of X for the new trial as Xh and the new observation on Y as Yh(new).

• Assume the regression model applicable for the basic data is still appropriate.

• Note the distinction between predicting E{Yh} and Yh(new).

Page 13:

Prediction Interval : Example

• Suppose for the GPA example we know that β0 = 0.1 and β1 = 0.95.

• Thus E{Y} = 0.1 + 0.95X.

• It is known that σ = 0.12.

• The admission office is considering an applicant whose high school GPA is Xh = 3.5. Thus for this student

• E{Yh} = ?
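Plugging Xh = 3.5 into the known regression function gives the mean response:

```latex
E\{Y_h\} = 0.1 + 0.95(3.5) = 3.425
```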

Page 14:

Example continued

• The new observation Yh(new) has mean E{Yh} and standard deviation σ = 0.12.

• With the assumption that this is a normal error regression model, we find the prediction interval as:
  – E{Yh} ± 3σ

  – The probability is 0.997 that this prediction interval will give a correct prediction for the applicant with a high school GPA of 3.5.
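Using E{Yh} = 0.1 + 0.95(3.5) = 3.425 and σ = 0.12, the interval works out to:

```latex
E\{Y_h\} \pm 3\sigma = 3.425 \pm 3(0.12) = 3.425 \pm 0.36 \;\Rightarrow\; (3.065,\ 3.785)
```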

Page 15:

Modification

• As noted, the previous PI was quite wide.

• In general, when the regression parameters of the normal error regression model are known, the 1 − α prediction limits are:
  – E{Yh} ± z(1 − α/2)σ
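For example, a 95% interval in the GPA setting, with E{Yh} = 0.1 + 0.95(3.5) = 3.425, σ = 0.12, and z(0.975) ≈ 1.96:

```latex
3.425 \pm 1.96(0.12) = 3.425 \pm 0.235 \;\Rightarrow\; (3.190,\ 3.660)
```

This is noticeably narrower than the 3σ interval above.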

Page 16:

Comment

• Prediction intervals resemble confidence intervals but are conceptually different. A confidence interval is for a parameter and gives a range in which the parameter lies. A prediction interval, on the other hand, gives a range in which a new observation will lie.

Page 17:

Analysis of Variance

• Partitioning sums of squares: the total deviation can be partitioned as the deviation of the fitted regression values around the mean plus the deviation around the fitted regression line.

Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)

SSTO = SSR + SSE
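The partition can be checked numerically. A minimal sketch on made-up data (not the course data):

```python
# Check the partition SSTO = SSR + SSE on illustrative (made-up) data.
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in X]

ssto = sum((y - y_bar) ** 2 for y in Y)               # total deviation
ssr = sum((fi - y_bar) ** 2 for fi in fitted)         # fitted values around the mean
sse = sum((y - fi) ** 2 for y, fi in zip(Y, fitted))  # deviation around the fitted line

print(ssto, ssr + sse)
```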

Page 18:

ANOVA Table

Source of Variation    df       SS      MS
Regression             1        SSR     MSR = SSR/1
Error                  n - 2    SSE     MSE = SSE/(n - 2)
Total                  n - 1    SSTO

Page 19:

Modified ANOVA Table

• The total sums of squares can also be partitioned in two:
  – Total uncorrected sums of squares: SSTOU = \sum_i Y_i^2
  – Correction for mean sums of squares: n\bar{Y}^2

• SSTOU has n degrees of freedom; these split into 1 for the correction for the mean and n − 1 for SSTO.

Page 20:

F-Tests

• Using the ANOVA table we can test
  – H0: β1 = 0
  – H1: β1 ≠ 0

• We use F* = MSR/MSE as the test statistic.

• Output from Minitab gives a p-value; if the p-value is less than 0.05, we reject the null hypothesis and conclude that β1 ≠ 0, i.e., there is a significant linear relationship between X and Y.
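The F-test can be sketched in code. The data below are made up, and `scipy.stats.f.sf` is assumed to be available for the upper-tail p-value; this is an illustration, not Minitab output.

```python
from scipy.stats import f  # f.sf gives the upper-tail p-value of the F distribution

# Illustrative (made-up) data
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in X]

ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((y - fi) ** 2 for y, fi in zip(Y, fitted))

msr = ssr / 1               # regression mean square, 1 df
mse = sse / (n - 2)         # error mean square, n - 2 df
f_star = msr / mse
p_value = f.sf(f_star, 1, n - 2)   # P(F(1, n-2) > F*)

print(f_star, p_value)
```

For this data the p-value is far below 0.05, so H0: β1 = 0 would be rejected.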

Page 21:

Coefficient of Determination

• Descriptive measure to describe the linear association between X and Y.

• SSTO is the measure of uncertainty in predicting Y without using information on X.

• SSE measures the uncertainty remaining in Y when X is used.

• Thus SSTO − SSE = SSR is the reduction in uncertainty due to using X.

Page 22:

R2

• We define the coefficient of determination R2 as the proportionate reduction in total variation associated with using the predictor variable X.

• R2= SSR / SSTO

• The larger the R², the greater the reduction in the variation of Y achieved by using X.

Page 23:

Remember

• R² measures only the degree of linear relation between X and Y.

• An R² close to one does not imply a good fit.
  – The fitted line may not capture the true nature of the relation.

• An R² close to zero does not imply no relation between X and Y.
  – The relation can be curvilinear.

Page 24:

Coefficient of Correlation

• The measure of linear association between X and Y is given by

• The range is −1 ≤ r ≤ 1.

• The sign matches the slope of the regression line: if the slope is positive, r is positive.

r = \pm\sqrt{R^2}

Page 25:

Correlation Analysis

• When the X values may not be known constants.

• When we want to study the effect of X on Y and of Y on X.

• We use correlation analysis instead of regression analysis.

• Examples:
  – The relation between blood pressure and age in humans.
  – The height and weight of a person.

Page 26:

Bivariate Normal Distribution

• We will consider a correlation model between two variables Y1 and Y2 using the bivariate normal distribution.

• If Y1 and Y2 are jointly normally distributed, then the marginal distributions of Y1 and Y2 are also normal.

• The conditional distribution of Y1 given Y2 is also normal.

Page 27:

Bivariate normal parameters

• Note μ1 and μ2 are the means of Y1 and Y2 respectively.

• Similarly σ1 and σ2 are the standard deviations of Y1 and Y2 respectively.

• ρ12 is the correlation coefficient between Y1 and Y2.

Page 28:

Parameters of Interest

• The first parameter represents the intercept of the line of regression of Y1 on Y2:

\alpha_{1|2} = \mu_1 - \mu_2\, \rho_{12} \frac{\sigma_1}{\sigma_2}

• The second parameter is the slope of that regression line:

\beta_{12} = \rho_{12} \frac{\sigma_1}{\sigma_2}

• The conditional variance is:

\sigma^2_{1|2} = \sigma_1^2 (1 - \rho_{12}^2)

Page 29:

Comments

• Two distinct regression lines are of interest, one when we have Y1 regressed on Y2 and the other with Y2 regressed on Y1.

• In general the regression lines are not the same.

• The two regression lines coincide only when ρ12 = ±1, that is, when Y1 and Y2 are perfectly linearly related.

Page 30:

Point Estimator for ρ12

• The maximum likelihood estimator for ρ12 is denoted by r12 and is given by:

r_{12} = \frac{\sum_i (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2)}{\left[\sum_i (Y_{i1} - \bar{Y}_1)^2 \sum_i (Y_{i2} - \bar{Y}_2)^2\right]^{1/2}}, \qquad -1 \le r_{12} \le 1

Page 31:

To Test if ρ12 = 0

• Note that if ρ12 = 0, then Y1 and Y2 are independent (under the bivariate normal model).

• We apply a t-test with the test statistic:

t^* = \frac{r_{12}\sqrt{n - 2}}{\sqrt{1 - r_{12}^2}}

• Reject the null hypothesis if |t*| > t(1 − α/2; n − 2).
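A sketch of the test statistic on made-up data where r12 works out to 0.8 with n = 5 (the numbers are for illustration only):

```python
# Test H0: rho_12 = 0 using t* = r12 * sqrt(n - 2) / sqrt(1 - r12^2),
# on illustrative (made-up) data.
Y1 = [1.0, 2.0, 3.0, 4.0, 5.0]
Y2 = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(Y1)

y1_bar, y2_bar = sum(Y1) / n, sum(Y2) / n
num = sum((a - y1_bar) * (b - y2_bar) for a, b in zip(Y1, Y2))
den = (sum((a - y1_bar) ** 2 for a in Y1)
       * sum((b - y2_bar) ** 2 for b in Y2)) ** 0.5
r12 = num / den                     # sample correlation

t_star = r12 * (n - 2) ** 0.5 / (1 - r12 ** 2) ** 0.5
print(r12, t_star)
```

Here t* ≈ 2.31, which is below t(0.975; 3) ≈ 3.18, so with only five pairs H0 would not be rejected despite r12 = 0.8.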

Page 32:

Spearman Rank Correlation Coefficient

• Suppose that Y1 and Y2 are not Bivariate normal.

• A non-parametric rank correlation method is applied to make inferences about Y1 and Y2

• Define Ri1 as the rank of Yi1.

Page 33:

Spearman Rank Correlation Coefficient

• The rank correlation coefficient is defined as

• In case of ties among some data values, each of the tied values is given the average of the ranks involved.

r_s = \frac{\sum_i (R_{i1} - \bar{R}_1)(R_{i2} - \bar{R}_2)}{\left[\sum_i (R_{i1} - \bar{R}_1)^2 \sum_i (R_{i2} - \bar{R}_2)^2\right]^{1/2}}, \qquad -1 \le r_s \le 1
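A sketch of the rank correlation, including the average-rank handling of ties; the data and helper function names are made up for the illustration:

```python
# Spearman rank correlation: the Pearson correlation applied to the ranks,
# with tied values given the average of the ranks involved.

def average_ranks(values):
    """Rank from 1 (smallest); ties receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2          # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(y1, y2):
    r1, r2 = average_ranks(y1), average_ranks(y2)
    n = len(r1)
    m1, m2 = sum(r1) / n, sum(r2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(r1, r2))
    den = (sum((a - m1) ** 2 for a in r1)
           * sum((b - m2) ** 2 for b in r2)) ** 0.5
    return num / den

# Illustrative (made-up) data
Y1 = [1.0, 2.0, 3.0, 4.0, 5.0]
Y2 = [2.0, 1.0, 4.0, 3.0, 5.0]
print(spearman(Y1, Y2))
```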

Page 34:

Spearman Rank Correlation Coefficient

• H0: There is no association between Y1 and Y2.

• H1: There is association between Y1 and Y2.

• Similar t-test as before with n-2 degrees of freedom.

t^* = \frac{r_s\sqrt{n - 2}}{\sqrt{1 - r_s^2}}