
Inferences in Regression and Correlation Analysis

Ayona Chatterjee

Spring 2008

Math 4803/5803

Topics to be covered

• Interval estimation for regression parameters.

• Tests for regression parameters.

• Prediction intervals.

• Analysis of Variance.

• Correlation coefficients.

Remembering β1

• Here β1 is the slope of the regression line.

• The main interest is to check whether:
– H0: β1 = 0
– H1: β1 ≠ 0

• The main reason to test whether β1 = 0 is that, when β1 = 0, there is no linear association between Y and X.

Sampling Distribution for β1

• For the normal error regression model the sampling distribution for b1 is also normal with

• This is because b1 is a linear combination of the observations Yi and since Yi is normally distributed, so is b1.

• Remember we can use MSE to estimate σ².

$$E\{b_1\} = \beta_1, \qquad \sigma^2\{b_1\} = \frac{\sigma^2}{\sum_i (X_i - \bar{X})^2}$$

Properties of ki

• The coefficients ki in b1 = Σ ki Yi, where ki = (Xi − X̄)/Σ(Xj − X̄)², have the following properties:

$$\sum_i k_i = 0, \qquad \sum_i k_i X_i = 1, \qquad \sum_i k_i^2 = \frac{1}{\sum_i (X_i - \bar{X})^2}$$
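These identities are easy to verify numerically. Below is a minimal sketch on hypothetical data, assuming the ki defined above:

```python
import numpy as np

# Hypothetical predictor values, purely for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Xbar = X.mean()
Sxx = np.sum((X - Xbar) ** 2)

# k_i from the representation b1 = sum k_i Y_i
k = (X - Xbar) / Sxx

print(np.isclose(k.sum(), 0.0))               # sum k_i = 0
print(np.isclose((k * X).sum(), 1.0))         # sum k_i X_i = 1
print(np.isclose((k ** 2).sum(), 1.0 / Sxx))  # sum k_i^2 = 1 / sum (X_i - Xbar)^2
```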

The t-distribution

• Note:

$$\frac{b_1 - \beta_1}{s\{b_1\}}$$

is distributed as t(n − 2) for the normal error regression model.

• The 1 − α confidence limits for β1 are:

$$b_1 \pm t(1 - \alpha/2;\, n-2)\, s\{b_1\}$$

Let's do an example.
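As a sketch of the computation on hypothetical data; the percentile t(1 − α/2; n − 2) comes from scipy.stats.t.ppf:

```python
import numpy as np
from scipy import stats

# Hypothetical (X, Y) data for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = len(X)

Xbar, Ybar = X.mean(), Y.mean()
Sxx = np.sum((X - Xbar) ** 2)

b1 = np.sum((X - Xbar) * (Y - Ybar)) / Sxx  # least squares slope
b0 = Ybar - b1 * Xbar                       # least squares intercept

MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)  # estimates sigma^2
s_b1 = np.sqrt(MSE / Sxx)                       # estimated s{b1}

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, n - 2)    # t(1 - alpha/2; n - 2)
print(b1 - tcrit * s_b1, b1 + tcrit * s_b1)  # 95% CI for beta_1
```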

Interpretation

• If the confidence interval does not include zero, we can conclude that β1 ≠ 0; the association between Y and X is then described as a linear statistical association.

Sampling Distribution of β0

• For a normal error regression model, the sampling distribution of b0 is normal with mean and variance:

$$E\{b_0\} = \beta_0, \qquad \sigma^2\{b_0\} = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2} \right]$$

Confidence Interval for β0

• Similar to the previous confidence interval, CI for β0 is :

• Let us find the 90% confidence interval for β0 for the ampules data discussed in chapter 1.

$$b_0 \pm t(1 - \alpha/2;\, n-2)\, s\{b_0\}$$
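The ampules data are not reproduced here, so the sketch below uses hypothetical stand-in values; the only change from the β1 interval is the standard error formula:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data (not the actual ampules data)
X = np.array([0.0, 1.0, 2.0, 0.0, 3.0, 1.0, 0.0, 2.0])
Y = np.array([16.2, 10.1, 18.0, 11.5, 21.9, 13.2, 9.8, 17.5])
n = len(X)

Xbar, Ybar = X.mean(), Y.mean()
Sxx = np.sum((X - Xbar) ** 2)
b1 = np.sum((X - Xbar) * (Y - Ybar)) / Sxx
b0 = Ybar - b1 * Xbar

MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
s_b0 = np.sqrt(MSE * (1.0 / n + Xbar ** 2 / Sxx))  # s{b0} from the variance formula

tcrit = stats.t.ppf(0.95, n - 2)  # 90% CI, so 1 - alpha/2 = 0.95
print(b0 - tcrit * s_b0, b0 + tcrit * s_b0)
```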

Interval Estimation of E{Yh}

• Let Xh denote the level of X for which we wish to estimate the mean response.

• Here E{Yh} denotes the mean response when we observe Xh.

• The estimator Ŷh of E{Yh} is normally distributed with mean and variance:

$$E\{\hat{Y}_h\} = E\{Y_h\} = \beta_0 + \beta_1 X_h, \qquad \sigma^2\{\hat{Y}_h\} = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]$$

Confidence Interval for E{Yh}

• The 1 − α confidence limits for E{Yh} are:

$$\hat{Y}_h \pm t(1 - \alpha/2;\, n-2)\, s\{\hat{Y}_h\}$$
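A minimal sketch of this interval, again on hypothetical data, with Xh as the level at which the mean response is estimated:

```python
import numpy as np
from scipy import stats

# Hypothetical data; Xh is the level of X of interest
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
Xh = 3.5
n = len(X)

Xbar, Ybar = X.mean(), Y.mean()
Sxx = np.sum((X - Xbar) ** 2)
b1 = np.sum((X - Xbar) * (Y - Ybar)) / Sxx
b0 = Ybar - b1 * Xbar

Yh_hat = b0 + b1 * Xh  # point estimate of E{Yh}
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
s_Yh = np.sqrt(MSE * (1.0 / n + (Xh - Xbar) ** 2 / Sxx))  # s{Yh_hat}

tcrit = stats.t.ppf(0.975, n - 2)
print(Yh_hat - tcrit * s_Yh, Yh_hat + tcrit * s_Yh)  # 95% CI for E{Yh}
```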

Prediction interval for Yh(new)

• For now we only look at predicting a new Yh when the regression parameters are known.

• We denote the level of X for the new trial as Xh and the new observation on Y as Yh(new).

• Assume the regression model applicable for the basic data is still appropriate.

• Note the distinction between predicting E{Yh} and Yh(new).

Prediction Interval : Example

• Suppose for the GPA example we know that β0 = 0.1 and β1 = 0.95.

• Thus E{Y} = 0.1 + 0.95X.

• It is known that σ = 0.12.

• The admissions office is considering an applicant whose high school GPA is Xh = 3.5. Thus for this student:

• E{Yh} = 0.1 + 0.95(3.5) = 3.425

Example continued

• The new observation Yh(new) has standard deviation σ = 0.12 around its mean E{Yh}.

• With the assumption that this is a normal error regression model, we find the prediction interval as:
– E{Yh} ± 3σ

– The probability is 0.997 that this prediction interval will give a correct prediction for the applicant with a high school GPA of 3.5.

Modification

• As noted, the previous PI was quite wide.

• In general, when the regression parameters of the normal error regression model are known, the 1 − α prediction limits are:
– E{Yh} ± z(1 − α/2) σ
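A minimal sketch of these known-parameter limits, plugging in the GPA numbers from the example above; scipy.stats.norm.ppf supplies z(1 − α/2):

```python
from scipy import stats

# Known parameters from the GPA example on the slides
beta0, beta1, sigma = 0.1, 0.95, 0.12
Xh = 3.5

EYh = beta0 + beta1 * Xh  # E{Yh} = 0.1 + 0.95(3.5) = 3.425

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)        # z(1 - alpha/2) ~ 1.96
print(EYh - z * sigma, EYh + z * sigma)  # 95% prediction limits

# The +/- 3 sigma interval used earlier (coverage ~ 0.997)
print(EYh - 3 * sigma, EYh + 3 * sigma)  # (3.065, 3.785)
```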

Comment

• Prediction intervals resemble confidence intervals but are conceptually different. A confidence interval gives a range in which a parameter lies; a prediction interval gives a range in which a new observation will lie.

Analysis of Variance

• Partitioning sums of squares:

• The total deviation can be partitioned as the deviation of the fitted regression values around the mean plus the deviation around the fitted regression line:

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i), \qquad SSTO = SSR + SSE$$

ANOVA Table

Source of Variation    Degrees of Freedom (df)    Sums of Squares (SS)    Mean Sum of Squares (MS)
Regression             1                          SSR                     MSR = SSR/1
Error                  n − 2                      SSE                     MSE = SSE/(n − 2)
Total                  n − 1                      SSTO

Modified ANOVA Table

• The total sums of squares can also be partitioned in two:
– Total uncorrected sum of squares: SSTOU = Σ Yi²
– Correction for mean sum of squares: nȲ²

• This splits the degrees of freedom into 1 for the correction for the mean and n for SSTOU.

F-Tests

• Using the ANOVA table we can test:
– H0: β1 = 0
– H1: β1 ≠ 0

• We use F* = MSR/MSE as the test statistic.

• Output from Minitab gives a p-value; if the p-value is less than 0.05, we reject the null hypothesis and conclude that β1 ≠ 0, i.e., there is a significant linear relationship between X and Y.
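A sketch of the whole pipeline on hypothetical data, computing the partition, F* = MSR/MSE, its p-value from the F(1, n − 2) distribution, and R², which is used on the next slides:

```python
import numpy as np
from scipy import stats

# Hypothetical data for the F-test of H0: beta1 = 0
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y = np.array([2.3, 2.9, 4.8, 4.1, 6.2, 7.0, 7.9, 9.1])
n = len(X)

Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar
Yhat = b0 + b1 * X

SSR = np.sum((Yhat - Ybar) ** 2)    # regression sum of squares, df = 1
SSE = np.sum((Y - Yhat) ** 2)       # error sum of squares, df = n - 2
SSTO = np.sum((Y - Ybar) ** 2)      # total sum of squares, df = n - 1
assert np.isclose(SSTO, SSR + SSE)  # the partition SSTO = SSR + SSE

F_star = (SSR / 1) / (SSE / (n - 2))    # F* = MSR / MSE
p_value = stats.f.sf(F_star, 1, n - 2)  # upper-tail p-value
R2 = SSR / SSTO                         # coefficient of determination
print(F_star, p_value, R2)              # reject H0 if p_value < 0.05
```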

Coefficient of Determination

• A descriptive measure of the linear association between X and Y.

• SSTO is the measure of uncertainty in predicting Y without using information on X.

• SSE is the uncertainty remaining in Y when X is used.

• Thus SSTO − SSE = SSR is the reduction in uncertainty due to X.

R²

• We define the coefficient of determination R² as the proportionate reduction in total variation associated with using the predictor variable X.

• R² = SSR / SSTO

• The larger the R², the greater the reduction in variation in Y due to X.

Remember

• R² only gives the degree of linear relation between X and Y.

• An R² close to one does not imply a good fit.
– The line may not capture the true nature of the relation.

• An R² close to zero does not imply no relation between X and Y.
– The relation can be curvilinear.

Coefficient of Correlation

• The measure of linear association between X and Y is given by

• The range is −1 ≤ r ≤ 1.

• The sign depends upon the slope of the regression line, if the slope is positive r is positive.

$$r = \pm\sqrt{R^2}$$

Correlation Analysis

• When the X values may not be known constants.

• When we want to study the effect of X on Y and of Y on X.

• We use correlation analysis instead of regression analysis.

• Examples:

– The relation between blood pressure and age in humans.

– Height and weight of a person.

Bivariate Normal Distribution

• We will consider a correlation model between two variables Y1 and Y2 using the bivariate normal distribution.

• If Y1 and Y2 are jointly normally distributed, then the marginal distributions of Y1 and Y2 are also normal.

• The conditional distribution of Y1 given Y2 is also normal.

Bivariate normal parameters

• Note μ1 and μ2 are the means of Y1 and Y2 respectively.

• Similarly σ1 and σ2 are the standard deviations of Y1 and Y2 respectively.

• ρ12 is the correlation coefficient between Y1 and Y2.

Parameters of Interest

• The first parameter, α1|2, represents the intercept of the line of regression of Y1 on Y2.

• The second parameter, β12, is the slope of that regression line.

• The conditional distribution of Y1 given Y2 is normal with mean α1|2 + β12 Y2 and variance σ²1|2, where:

$$\alpha_{1|2} = \mu_1 - \mu_2 \rho_{12} \frac{\sigma_1}{\sigma_2}, \qquad \beta_{12} = \rho_{12} \frac{\sigma_1}{\sigma_2}, \qquad \sigma^2_{1|2} = \sigma_1^2 \left( 1 - \rho_{12}^2 \right)$$

Comments

• Two distinct regression lines are of interest, one when we have Y1 regressed on Y2 and the other with Y2 regressed on Y1.

• In general the regression lines are not the same.

• The two lines coincide only when the correlation is perfect, that is, ρ12 = ±1.

Point estimator for ρ12

• The maximum likelihood estimator for ρ12 is denoted by r12 and is given by:

$$r_{12} = \frac{\sum_i (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2)}{\left[ \sum_i (Y_{i1} - \bar{Y}_1)^2 \sum_i (Y_{i2} - \bar{Y}_2)^2 \right]^{1/2}}$$

To test if ρ12 = 0

• Note if ρ12 = 0, this implies that Y1 and Y2 are independent.

• We apply a t-test with the test statistic as given below.

• Reject the null hypothesis if |t*| > t(1 − α/2; n − 2).

$$t^* = \frac{r_{12} \sqrt{n-2}}{\sqrt{1 - r_{12}^2}}$$
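A minimal sketch of this test on hypothetical data; r12 is computed directly from the MLE formula above and compared against the t percentile:

```python
import numpy as np
from scipy import stats

# Hypothetical bivariate sample, purely for illustration
Y1 = np.array([2.1, 3.4, 2.9, 4.8, 5.5, 4.1, 6.0, 5.2])
Y2 = np.array([1.0, 2.2, 2.4, 3.5, 4.1, 3.0, 4.8, 4.0])
n = len(Y1)

d1, d2 = Y1 - Y1.mean(), Y2 - Y2.mean()
r12 = np.sum(d1 * d2) / np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2))

t_star = r12 * np.sqrt(n - 2) / np.sqrt(1 - r12 ** 2)  # test statistic
tcrit = stats.t.ppf(0.975, n - 2)        # t(1 - alpha/2; n - 2), alpha = 0.05
print(r12, t_star, abs(t_star) > tcrit)  # True -> reject H0: rho_12 = 0
```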

Spearman Rank Correlation Coefficient

• Suppose that Y1 and Y2 are not bivariate normal.

• A non-parametric rank correlation method is applied to make inferences about Y1 and Y2.

• Define Ri1 as the rank of Yi1 and Ri2 as the rank of Yi2.

Spearman Rank Correlation Coefficient

• The rank correlation coefficient is defined as

• In case of ties among some data values, each of the tied values is given the average of the ranks involved.

$$r_s = \frac{\sum_i (R_{i1} - \bar{R}_1)(R_{i2} - \bar{R}_2)}{\left[ \sum_i (R_{i1} - \bar{R}_1)^2 \sum_i (R_{i2} - \bar{R}_2)^2 \right]^{1/2}}$$

Spearman Rank Correlation Coefficient

• H0: There is no association between Y1 and Y2.

• H1: There is association between Y1 and Y2.

• A similar t-test as before, with n − 2 degrees of freedom:

$$t^* = \frac{r_s \sqrt{n-2}}{\sqrt{1 - r_s^2}}$$
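A sketch of the rank version on hypothetical data; scipy.stats.rankdata assigns average ranks to ties, matching the convention on the previous slide:

```python
import numpy as np
from scipy import stats

# Hypothetical sample that need not be bivariate normal
Y1 = np.array([2.1, 3.4, 2.9, 4.8, 5.5, 4.1, 6.0, 5.2])
Y2 = np.array([1.0, 2.2, 2.4, 3.5, 4.1, 3.0, 4.8, 4.0])
n = len(Y1)

# Ranks; rankdata gives tied values the average of the ranks involved
R1 = stats.rankdata(Y1)
R2 = stats.rankdata(Y2)

d1, d2 = R1 - R1.mean(), R2 - R2.mean()
rs = np.sum(d1 * d2) / np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2))

t_star = rs * np.sqrt(n - 2) / np.sqrt(1 - rs ** 2)  # same t-test, n - 2 df
tcrit = stats.t.ppf(0.975, n - 2)
print(rs, t_star, abs(t_star) > tcrit)  # True -> association between Y1 and Y2
```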