
Correlation and Regression

October 25, 2017

STAT 151 Class 9 Slide 1


Outline of Topics

1 Associations

2 Scatter plot

3 Correlation

4 Regression

5 Testing and estimation

6 Goodness-of-fit


Example

We are often interested in the association between two or more variables. Suppose the Midterm (X) and Final (Y) exam scores of a sample∗ of n = 8 students are recorded and we wish to study the association between X and Y in the population of students.

Midterm (X)   Final (Y)
55            45
60            75
80            85
77            62
35            50
75            72
92            78
65            53

We consider three approaches:

(1) a graphical summary – scatter plot (cf. Class 3)
(2) a numerical measure – correlation coefficient (cf. Class 3)
(3) a model – regression

∗ An SRS (simple random sample) of independent observations


Scatter plot (1): Example

Each observation (student) is represented by a symbol on the plot

A scatter plot is useful for giving an overall impression of the kind of relationship between the variables, e.g., linear, nonlinear or no apparent relationship

[Figure: scatter plot of Final against Midterm; legend: linear, nonlinear, none]


Scatter plot (2)

Outliers are observations that deviate from the general trend of the rest of the data

If we have a new observation (X, Y) = (99, 10), it will appear as the red open circle

The scatter plot shows the new observation is unusual

Scatter plots are generally not useful when there are more than two variables, e.g., Projects, Midterm, Final, etc.

[Figure: scatter plot of Final against Midterm with the new observation (99, 10) marked]


Pearson correlation (Karl Pearson, 1857-1936)

In Class 3, cov(X, Y) is used to measure association between X and Y:

[Figure: two scatter plots of Y (Final) against X (Midterm). With Midterm on the 0 to 100 scale, cov(X,Y) = 183.57; with Midterm on the 0 to 10 scale, cov(X,Y) = 18.357]

cov(X, Y) is not invariant to scale transformation, e.g., its value changes if Midterm is recorded on a (0, 10) scale instead of (0, 100)

The sign of cov(X, Y) (+ vs. -) tells us the direction of the association, but its magnitude depends on the units of measurement and so has no meaning on its own
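The scale dependence can be checked numerically. The following is a minimal sketch using the n = 8 scores from the example; the function name `sample_cov` is illustrative only, and the n − 1 denominator is assumed.

```python
# Midterm (X) and Final (Y) scores of the n = 8 students in the example.
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]


def sample_cov(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


cov_100 = sample_cov(midterm, final)                   # Midterm on the 0-100 scale
cov_10 = sample_cov([x / 10 for x in midterm], final)  # Midterm on the 0-10 scale

print(round(cov_100, 2))  # 183.57
print(round(cov_10, 3))   # 18.357
```

Dividing every Midterm score by 10 divides the covariance by 10, although the association between the two exams is unchanged.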


Pearson correlation (Karl Pearson, 1857-1936)

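A Pearson (product moment) correlation coefficient, r ≡ corr(X, Y), is a number that summarizes the linear relationship between X and Y

For X from a population with mean $\mu_X$ and variance $\sigma_X^2$, a Z-score:

$$Z_X = \frac{X - \mu_X}{\sigma_X}$$

tells us X relative to the rest of the population

$$r = \underbrace{E}_{\text{average}}(Z_X Z_Y) = E\left(\frac{X - \mu_X}{\sigma_X} \cdot \frac{Y - \mu_Y}{\sigma_Y}\right) = \frac{E(X - \mu_X)(Y - \mu_Y)}{\sigma_X \sigma_Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

measures, on average, whether X and Y are in tandem relative to their populations

Using n observations $(X_1, Y_1), \ldots, (X_n, Y_n)$

$$r = \frac{\dfrac{\sum (X_i - \bar X)(Y_i - \bar Y)}{n - 1}}{\sqrt{\dfrac{\sum (X_i - \bar X)^2}{n - 1}} \sqrt{\dfrac{\sum (Y_i - \bar Y)^2}{n - 1}}} = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2} \sqrt{\sum (Y_i - \bar Y)^2}}$$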


Correlation: Example

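For calculation, the equivalent formula is more convenient:

$$r = \frac{\dfrac{\sum X_i Y_i - n \bar X \bar Y}{n - 1}}{\sqrt{\dfrac{\sum X_i^2 - n \bar X^2}{n - 1}} \sqrt{\dfrac{\sum Y_i^2 - n \bar Y^2}{n - 1}}} = \frac{\sum X_i Y_i - n \bar X \bar Y}{\sqrt{\sum X_i^2 - n \bar X^2} \sqrt{\sum Y_i^2 - n \bar Y^2}}$$

X recorded as (0,100)

$\bar X = 67.375$, $\bar Y = 65$, $\sum_{i=1}^8 X_i Y_i = 36320$, $\sum_{i=1}^8 X_i^2 = 38493$, $\sum_{i=1}^8 Y_i^2 = 35296$

$$r = \frac{\dfrac{36320 - 8(67.375)(65)}{8 - 1}}{\sqrt{\dfrac{38493 - 8(67.375)^2}{8 - 1}} \sqrt{\dfrac{35296 - 8(65)^2}{8 - 1}}} = \frac{183.57}{17.63874 \times 14.61897} \approx 0.712$$

X recorded as (0,10)

$\bar X = 6.7375$, $\bar Y = 65$, $\sum_{i=1}^8 X_i Y_i = 3632$, $\sum_{i=1}^8 X_i^2 = 384.93$, $\sum_{i=1}^8 Y_i^2 = 35296$

$$r = \frac{\dfrac{3632 - 8(6.7375)(65)}{8 - 1}}{\sqrt{\dfrac{384.93 - 8(6.7375)^2}{8 - 1}} \sqrt{\dfrac{35296 - 8(65)^2}{8 - 1}}} = \frac{18.357}{1.763874 \times 14.61897} \approx 0.712$$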

On average, $Z_X Z_Y = 0.712 > 0$ ⇒ $Z_X$ and $Z_Y$ tend to be of the same sign (both + or both −) ⇒ X and Y tend to be either both big or both small relative to their own populations
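The calculation above can be reproduced with a few lines of code. This is a minimal sketch of the computational formula applied to the example data; `pearson_r` is an illustrative name, not something from the course material.

```python
import math

midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]


def pearson_r(xs, ys):
    """Pearson correlation via the computational (sums of products) formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)


r_100 = pearson_r(midterm, final)                    # Midterm on the 0-100 scale
r_10 = pearson_r([x / 10 for x in midterm], final)   # Midterm on the 0-10 scale

print(round(r_100, 3), round(r_10, 3))  # 0.712 0.712
```

Unlike the covariance, r is the same on both scales, because the factor of 10 cancels between numerator and denominator.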


Sample correlation under various relationships (Fig. 3)

−1 ≤ r ≤ 1

The magnitude of r measures the strength of the association. If |r| ≈ 1, the association is strong (B, C and D); if |r| ≈ 0, the association is weak (A) or non-linear

The sign of r measures the direction of the association. If r > 0, large X tends to be associated with large Y (B and C); if r < 0, large X tends to be associated with small Y (D)

[Figure 3: four scatter plots of Y against X: (A) r = −0.063, (B) r = 0.935, (C) r = 0.652, (D) r = −0.439]


Correlation measures linear relationships (Fig. 4)

r measures linear associations (A)

A non-linear relationship may distort the value of r (B)

Outliers may distort the value of r (C)

A restrictive range (open circles) in X or Y may lead to a smaller r (D)

[Figure 4: four scatter plots, panels A to D]


Prediction under a linear model (Fig. 5)

[Figure 5: scatter plot of Final against Midterm]

A regression analysis allows us to determine if Midterm score (X) can be used to predict Final score (Y). The scatter plot suggests there may be a linear relationship between X and Y (i.e., each additional point in the Midterm is associated with b extra points in the Final).

A regression analysis uses a sample of students to determine whether a linear relationship exists for the population of students.


Simple linear regression

We postulate that the relationship between Midterm score (X) and Final score (Y) in the population be represented by a straight line:

$$Y = a + bX$$

where a is the intercept and b is the slope. The variable X is called an independent or predictor variable and Y is called a dependent or outcome variable.

A simple linear regression is a regression with only one predictor and the relationship between the predictor and the outcome variable is assumed to be linear.

The intercept a gives the prediction of Y when X = 0 or b = 0. Often a is not of interest or may even be meaningless, e.g., if X represents the height of a person and Y represents the weight, then no person has a height (X) of zero.

The value of b is the change in Y for every unit difference in X.

Figure 5 shows that the observations do not fall on the straight line. In fact, there is no straight line that fits all observations. We assume

$$Y = a + bX + e, \quad e \sim N(0, \sigma^2)$$


Simple linear regression (2)

$$Y = \underbrace{a + bX}_{(A)} + \underbrace{e}_{(B)}, \quad e \sim N(0, \sigma^2).$$

(A) a + bX is the average value of Y for observations with a particular value of X

(B) Each observation Y differs from the average by an amount e, and $e \sim N(0, \sigma^2)$

(A)+(B) For each known value of X, the values of $Y \sim N(a + bX, \sigma^2)$. Therefore, in a regression, we assume we have known values of X at $X_1, \ldots, X_n$ and we investigate how Y changes at these values, which is captured by the regression model

We use maximum likelihood estimation (MLE), which is equivalent to a method called ordinary least squares (OLS) in this setting


Maximum Likelihood (1) – Data

Midterm (X)   Final (Y)
55            45
60            75
80            85
77            62
35            50
75            72
92            78
65            53

[Figure: distributions of Y centred at a + b(55), a + b(60), a + b(65), ...]


Maximum Likelihood (2)

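We have a sample $Y_1, \ldots, Y_n$ at $X_1, \ldots, X_n$, respectively. Assuming $Y_i \sim N(a + bX_i, \sigma^2)$, where $a, b, \sigma^2$ are unknown, we can find the MLE of these parameters. The MLEs are $\hat a, \hat b, \hat\sigma^2$ that jointly maximize the likelihood

$$L(a, b, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2}\right]$$

Taking (natural) logarithm of $L(a, b, \sigma^2)$ gives the log-likelihood

$$\ell(a, b, \sigma^2) = \log \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2}\right] = \sum_{i=1}^n \left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2} - \frac{1}{2}\log 2\pi - \log\sigma\right]$$

The MLEs are found by

$$\frac{\partial \ell(a, b, \sigma^2)}{\partial a} = 0, \quad \frac{\partial \ell(a, b, \sigma^2)}{\partial b} = 0, \quad \frac{\partial \ell(a, b, \sigma^2)}{\partial \sigma^2} = 0$$

$$\hat b = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2} = \frac{\sum_{i=1}^n X_i Y_i - n \bar X \bar Y}{\sum_{i=1}^n X_i^2 - n(\bar X)^2} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)},$$

$$\hat a = \bar Y - \hat b \bar X, \quad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \{Y_i - (\hat a + \hat b X_i)\}^2$$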
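As a sketch of how the three conditions lead to these estimators, the partial derivatives of the log-likelihood can be written out as below. This is the standard textbook derivation, added here for reference rather than taken from the slides.

```latex
\begin{align*}
\frac{\partial \ell}{\partial a}
  &= \frac{1}{\sigma^2} \sum_{i=1}^n \{Y_i - (a + bX_i)\} = 0
  &&\Rightarrow\quad \hat a = \bar Y - \hat b \bar X, \\
\frac{\partial \ell}{\partial b}
  &= \frac{1}{\sigma^2} \sum_{i=1}^n X_i \{Y_i - (a + bX_i)\} = 0
  &&\Rightarrow\quad \hat b = \frac{\sum_{i=1}^n X_i (Y_i - \bar Y)}{\sum_{i=1}^n X_i (X_i - \bar X)}
     = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}, \\
\frac{\partial \ell}{\partial \sigma^2}
  &= \sum_{i=1}^n \left[ \frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^4} - \frac{1}{2\sigma^2} \right] = 0
  &&\Rightarrow\quad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \{Y_i - (\hat a + \hat b X_i)\}^2 .
\end{align*}
```

The second line substitutes the intercept from the first line, and the last equality for the slope uses the fact that deviations from a mean sum to zero. In the third line, the term in log σ is differentiated as one half of log σ².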


Least squares

For any value of $\sigma^2$ in the log-likelihood

$$\ell(a, b, \sigma^2) = \sum_{i=1}^n \left[-\frac{\{Y_i - (a + bX_i)\}^2}{2\sigma^2} - \frac{1}{2}\log 2\pi - \log\sigma\right]$$

$\ell(a, b, \sigma^2)$ is maximized if

$$\sum_{i=1}^n \{Y_i - (a + bX_i)\}^2$$

is minimized (hence "least squares"). The best fitting line using MLE or OLS is the line that minimizes the sum of squared deviations of the observations from the line

[Figure: scatter plot of Final against Midterm]
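To illustrate the least squares idea, the sketch below compares the sum of squared deviations at the closed-form estimates with the same sum for a few nearby lines. The data are the n = 8 scores from the example; the function names and the particular shifts are hypothetical choices made only for illustration.

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]


def sse(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def ols(xs, ys):
    """Closed-form least squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


a_hat, b_hat = ols(midterm, final)
best = sse(a_hat, b_hat, midterm, final)

# Hypothetical nearby lines: shift the intercept and/or the slope a little.
shifts = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.05), (0.0, -0.05), (2.0, -0.03)]
others = [sse(a_hat + da, b_hat + db, midterm, final) for da, db in shifts]

print(round(best, 2))
print(all(other > best for other in others))  # True
```

Every shifted line gives a larger sum of squares than the least squares line.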


Example

Using our sample of n = 8 students, what is the predicted Final score for a student who scored 65 on the Midterm using the MLE (OLS) estimates?

$$\hat b = \frac{36320 - 8(67.375)(65)}{38493 - 8(67.375)^2} = 0.59, \quad \hat a = 65 - 0.59(67.375) = 25.247$$

The fitted regression line is

$$\widehat{\text{Final}} = 25.247 + 0.59 \times \text{Midterm}$$

For a student whose Midterm score is 65, her predicted Final score is

$$25.247 + 0.59 \times 65 = 63.597$$
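The same estimates and prediction can be obtained in code. This is a minimal sketch using the computational formulas; `fit_line` is an illustrative name, and because the slope is not rounded before it is used, later digits can differ slightly from the hand calculation.

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


a_hat, b_hat = fit_line(midterm, final)
predicted = a_hat + b_hat * 65  # predicted Final for a Midterm score of 65

print(round(b_hat, 2), round(a_hat, 3), round(predicted, 1))  # 0.59 25.247 63.6
```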


Quality of the regression - Residual plots

Under the regression model

$$Y_i = a + bX_i + e_i, \quad e_i \sim N(0, \sigma^2)$$

the residuals are

$$\hat e_i = Y_i - \hat Y_i = Y_i - (\hat a + \hat b X_i)$$

If the model is correct, the $\hat e_i$'s should resemble a set of random observations from a normal distribution with mean zero like panel (a)

[Figure: residuals plotted against X in four panels: (a) Random, (b) Non-linear, (c) Skewed distribution, (d) Non-constant variance]


Residual plot - Example

Based on the regression model

$$\hat Y = 25.247 + 0.59 \times X$$

$$\hat e_i = Y_i - \hat Y_i = Y_i - (25.247 + 0.59 \times X_i)$$

$Y_i$   $\hat Y_i$   $\hat e_i$
45      57.70        -12.70
75      60.65         14.35
85      72.45         12.55
62      70.68         -8.68
50      45.90          4.10
72      69.50          2.50
78      79.53         -1.53
53      63.60        -10.60

[Figure: residuals plotted against X, with a reference line at 0]
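The fitted values and residuals in the table can be recomputed as follows. This is a minimal sketch with illustrative variable names; it uses the unrounded estimates, so values may differ from the table in the last digit.

```python
midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n

# Least squares slope and intercept.
b_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))
         / sum((x - x_bar) ** 2 for x in midterm))
a_hat = y_bar - b_hat * x_bar

fitted = [a_hat + b_hat * x for x in midterm]
residuals = [y - f for y, f in zip(final, fitted)]

for y, f, e in zip(final, fitted, residuals):
    print(f"{y:3d} {f:7.2f} {e:7.2f}")

# Least squares residuals sum to zero (up to floating point error).
print(abs(sum(residuals)) < 1e-9)  # True
```

A residual plot is then just these residuals plotted against the Midterm scores.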


Notes about a regression analysis

A linear regression model makes 3 assumptions:

1. The relationship between X and Y is linear, i.e., Y = a + bX + e
2. The values of the $Y_i$'s are normally distributed about the regression line
3. The variances of the $Y_i$'s about the regression line are the same

The regression line is fitted by MLE (= OLS), which means the sum of the squared distances of the observations to the regression line is minimized

Prediction can only be made in the range of X used to obtain the regression line. In the example, the lowest and the highest Midterm scores among the 8 students are 35 and 92, so predictions can be made for other students whose Midterm scores are within this range. For someone whose Midterm score falls outside (35, 92), no reliable prediction is possible, because the data give no information on whether the relationship is still linear there. This restriction does not apply to the dependent variable, so the predicted Final score can be outside the range of Y values observed in the 8 students


Observed relationship – Fact or Fiction?

$$\widehat{\text{Final}} = \overbrace{25.247}^{\hat a} + \overbrace{0.59}^{\hat b} \times \text{Midterm}$$

shows each additional point in the Midterm is associated with an extra 0.59 point in the Final for the 8 students.

Our estimate $\hat b$ comes from a sample and hence there is sampling error, i.e., $\hat b - b$

Does the association generalise to the population of students?

Two approaches to answering this question:

(1) Test the hypotheses:

$H_0 : b = 0$ (no relationship) vs. $H_1 : b \neq 0$ (some relationship)

(2) Find an interval estimate:

$\hat b \pm$ margin of error of $\hat b$


Hypothesis testing

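For a sample of students such that midterm (X) and final (Y) are unrelated:

(1) $\hat b$ is expected to be zero
(2) sampling variation allows $\hat b \neq 0$ but it is unlikely to be "far from" 0

[Figure: distribution of the value of $\hat b$ centred at 0; values between the critical values are labelled "expected", values beyond them "unexpected" (5%)]

We use a test statistic to determine whether $\hat b$ for our sample is far from 0:

$$z^* = \frac{\overbrace{\hat b}^{\text{our sample}} - \overbrace{0}^{X \text{ and } Y \text{ unrelated}}}{\underbrace{\sqrt{\mathrm{var}(\hat b)}}_{\text{allowance for sampling variation}}} = \frac{0.59 - 0}{\sqrt{\mathrm{var}(\hat b)}}$$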


Hypothesis testing (2) — estimating var(b)

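$$\mathrm{var}(\hat b) = \mathrm{var}\left[\frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}\right] = \mathrm{var}\left[\frac{\sum_{i=1}^n (X_i - \bar X) Y_i}{\sum_{i=1}^n (X_i - \bar X)^2}\right]^{**}$$

Earlier, we learned

$$\mathrm{var}(\hat b) = \frac{\sum_{i=1}^n (X_i - \bar X)^2 \, \mathrm{var}(Y_i)}{\left[\sum_{i=1}^n (X_i - \bar X)^2\right]^2}^{\dagger} = \frac{\sum_{i=1}^n (X_i - \bar X)^2 \sigma^2}{\left[\sum_{i=1}^n (X_i - \bar X)^2\right]^2} = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}$$

where $\sigma^2$ can be estimated using the MLE

$$\hat\sigma^2 = \frac{\sum_{i=1}^n \{Y_i - (\hat a + \hat b X_i)\}^2}{n}^{\dagger\dagger} = \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{n}$$

** $\sum (X_i - \bar X)(Y_i - \bar Y) = \sum (X_i - \bar X) Y_i - \sum (X_i - \bar X) \bar Y = \sum (X_i - \bar X) Y_i - \bar Y \overbrace{\sum (X_i - \bar X)}^{=0}$

† $X_1, \ldots, X_n$ are assumed known and hence constants

†† Sometimes, the denominator of $\hat\sigma^2$ uses n − 2 to give an unbiased estimator for $\sigma^2$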


Hypothesis testing (3)

For large n, we find:

$$z^* = \frac{\hat b - 0}{\sqrt{\mathrm{var}(\hat b)}} = \frac{\hat b - 0}{\hat\sigma / \sqrt{\sum_{i=1}^n X_i^2 - n(\bar X)^2}} = \frac{0.59 - 0}{10.305 / \sqrt{38493 - 8(67.375)^2}} = 2.671 > 1.96$$

For small n, we replace the critical value of 1.96 by a new critical value that depends on the degrees of freedom (df), defined as df = n − 2. Critical values for selected dfs are given below:

df = n − 2       5      6      10     20     120    >120
critical value   2.571  2.447  2.228  2.086  1.98   1.96

In our study, df = 8 − 2 = 6, so the critical value is 2.447. Since |z∗| > 2.447, we arrive at the same conclusion of rejecting $H_0 : b = 0$.

We are rarely interested in a one-sided test of b.
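The test statistic can also be computed directly from the data. The sketch below is written under stated assumptions: it recomputes the estimate of σ from the residuals instead of using the quoted 10.305, and it shows both the n and the n − 2 denominators, so the resulting statistics differ somewhat from 2.671. The names `z_stat`, `z_mle` and `z_unbiased` are illustrative.

```python
import math

midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
s_xx = sum((x - x_bar) ** 2 for x in midterm)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))

b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(midterm, final))


def z_stat(denominator):
    """Slope divided by its estimated standard deviation."""
    sigma_hat = math.sqrt(sse / denominator)
    return b_hat / (sigma_hat / math.sqrt(s_xx))


z_mle = z_stat(n)           # sigma estimated with denominator n (MLE)
z_unbiased = z_stat(n - 2)  # sigma estimated with denominator n - 2

critical = 2.447  # critical value for df = 6 from the table above
print(round(z_mle, 2), round(z_unbiased, 2))
print(abs(z_mle) > critical, abs(z_unbiased) > critical)  # True True
```

With either denominator the statistic exceeds 2.447, which agrees with the conclusion of rejecting the null hypothesis of zero slope.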


95% Confidence and prediction intervals

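Parameter: slope $b$
MLE (OLS): $\hat b$
95% confidence interval:

$$\hat b \pm 1.96^{\S} \, SD(\hat b) = \hat b \pm 1.96 \, \hat\sigma \sqrt{\frac{1}{\sum_{i=1}^n X_i^2 - n(\bar X)^2}}$$

Parameter: average value of Y given X, $(a + bX)$
MLE (OLS): $\hat a + \hat b X$
95% confidence interval:

$$\hat a + \hat b X \pm 1.96 \, SD(\hat a + \hat b X) = \hat a + \hat b X \pm 1.96 \, \hat\sigma \sqrt{\frac{1}{n} + \frac{(X - \bar X)^2}{\sum_{i=1}^n X_i^2 - n(\bar X)^2}}^{\ \S\S}$$

Parameter: individual value of Y given X, $(a + bX + e)$
MLE (OLS): $\hat a + \hat b X + \overbrace{e}^{0}$
95% confidence interval‡:

$$\hat a + \hat b X \pm 1.96 \, SD(\hat a + \hat b X + e) = \hat a + \hat b X \pm 1.96 \, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(X - \bar X)^2}{\sum_{i=1}^n X_i^2 - n(\bar X)^2}}$$

§ For small values of n, 1.96 can be replaced by an appropriate value in the t-table

§§ $\hat a + \hat b X = (\bar Y - \hat b \bar X) + \hat b X = \bar Y + \hat b (X - \bar X)$

‡ Also called a prediction interval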
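The three intervals can be sketched in code for the example data. The following assumes the MLE of σ (denominator n) and the large-sample multiplier 1.96; with only 8 students a t-table value would normally replace 1.96, as the footnote says. The function names and the choice of a Midterm score of 65 are illustrative.

```python
import math

midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
s_xx = sum(x * x for x in midterm) - n * x_bar ** 2
s_xy = sum(x * y for x, y in zip(midterm, final)) - n * x_bar * y_bar

b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(midterm, final))
sigma_hat = math.sqrt(sse / n)  # assumption: MLE; n - 2 is a common alternative
Z = 1.96                        # assumption: large-sample multiplier


def slope_interval():
    half = Z * sigma_hat * math.sqrt(1 / s_xx)
    return b_hat - half, b_hat + half


def mean_interval(x0):
    """Interval for the average Y among observations with X = x0."""
    centre = a_hat + b_hat * x0
    half = Z * sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    return centre - half, centre + half


def prediction_interval(x0):
    """Interval for an individual Y with X = x0."""
    centre = a_hat + b_hat * x0
    half = Z * sigma_hat * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return centre - half, centre + half


print([round(v, 2) for v in slope_interval()])
print([round(v, 1) for v in mean_interval(65)])
print([round(v, 1) for v in prediction_interval(65)])
```

The prediction interval is always wider than the interval for the average, because it also allows for the variation of an individual Y about the line.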


Example

[Figure: scatter plot of Final against Midterm with prediction and confidence intervals; legend: Prediction, Confidence]


Goodness-of-fit: $R^2$

How well does the model "fit" the data? We answer this question using a Goodness-of-fit measure called the coefficient of determination $R^2$ ("R-square"). $R^2$ can be justified as follows.

Consider using n observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ of (X, Y) to predict the next observation, $Y_{n+1}$ of Y. Two possible estimates are:

(1) $\bar Y = \frac{1}{n} \sum_{i=1}^n Y_i$ and (2) $\hat Y_i = \hat a + \hat b X_i$

How do they compare? Since $Y_{n+1}$ is unknown, we cannot tell whether $\bar Y$ or $\hat Y_i$ is closer to $Y_{n+1}$. However, we can compare their performances in predicting the observed $Y_i$, i = 1, ..., n. For $Y_i$, the errors incurred by these estimates are:

$(Y_i - \bar Y)$ and $(Y_i - \hat Y_i)$

$R^2$ is then defined as

$$R^2 = \frac{\text{Total error using } \bar Y - \text{Total error using } \hat Y_i}{\text{Total error using } \bar Y} = \frac{\sum_{i=1}^n (Y_i - \bar Y)^2 - \sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}$$


$R^2$

$$R^2 = \frac{\overbrace{\sum_{i=1}^n (Y_i - \bar Y)^2}^{SST} - \overbrace{\sum_{i=1}^n (Y_i - \hat Y_i)^2}^{SSE}}{\sum_{i=1}^n (Y_i - \bar Y)^2}$$

[Figure: scatter plots of Final against Midterm showing the errors that make up SSE and SST]

SSE is defined as the sum of the squared errors about the fitted line, $\sum (Y_i - \hat Y_i)^2$, whereas SST is defined as the sum of the squared errors about the mean, $\sum (Y_i - \bar Y)^2$; SSE ≤ SST since SSE is the total error from the least squares line, which has the smallest sum of squared errors of any line, including the horizontal line at $\bar Y$


Example

For a simple linear regression model, a simple relationship exists between $R^2$ and r:‡‡

$$R^2 = \mathrm{corr}(X, Y)^2 = r^2 = 0.712^2 = 0.507$$

in our example between Midterm and Final score, so the error is reduced by about half compared to without the model.

Multiplying $R^2$ by 100% gives the percent variation explained

$$R^2 \times 100\% = 50.7\%,$$

which tells us that about 50.7% of the differences in Final score between students can be accounted for by their Midterm score; while the remaining differences, i.e., 49.3% are due to other (unknown) factors.

‡‡ When there is more than one predictor, r cannot be calculated; in that case, $R^2$ gives the "correlation" between the outcome and the predictors
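The relationship between the coefficient of determination and r can be verified numerically. This is a minimal sketch on the example data with illustrative variable names: it computes SST and SSE from their definitions and compares the result with the squared correlation.

```python
import math

midterm = [55, 60, 80, 77, 35, 75, 92, 65]
final = [45, 75, 85, 62, 50, 72, 78, 53]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
s_xx = sum((x - x_bar) ** 2 for x in midterm)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))

b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

# SST: squared errors when every Final score is predicted by the mean.
sst = sum((y - y_bar) ** 2 for y in final)
# SSE: squared errors when each Final score is predicted by the fitted line.
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(midterm, final))

r_squared = (sst - sse) / sst
r = s_xy / math.sqrt(s_xx * sst)

print(round(r_squared, 3), round(r ** 2, 3))  # 0.507 0.507
```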