Copyright (c) Bani K. Mallick

STAT 651

Lecture #18


Topics in Lecture #18: Regression and Scatterplots


Book Chapters in Lecture #18: Chapters 11.1, 11.2


Linear Regression and Correlation

Linear regression and correlation are aimed at understanding how two variables are related

The variables are called Y and X

Y is called the dependent variable

X is called the independent variable

We want to know how, and whether, X influences Y



The basic tool of regression is the scatterplot

This simply plots the data in a graph

X is along the horizontal (or X) axis

Y is along the vertical (or Y) axis


GPA and Height

[Figure: scatterplot. Horizontal axis: Height in inches (55 to 80); vertical axis: Grade Point Average (GPA) (1.5 to 4.5).]

Note how GPAs generally get lower as height increases: the data do not fall exactly on a line


Log(1+Aortic Valve Area) and Body Surface Area

Note how AVAs generally get larger as Body Surface Area increases: the data do not fall exactly on a line

Healthy Children, Stenosis Data

[Figure: scatterplot. Horizontal axis: Body Surface Area (0.0 to 2.5); vertical axis: Log(1 + Aortic Valve Area) (-0.2 to 1.6).]


Exam Grades in STAT 651

[Figure: scatterplot “Exam #1 and Exam #2, 2001”. Horizontal axis: Exam 1 Grade (70 to 100); vertical axis: Exam 2 Grade (50 to 100).]



Let Y = GPA, X = height

A linear prediction equation is a line, such as $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$

The intercept of the line = $\hat{\beta}_0$

The slope of the line = $\hat{\beta}_1$



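$\hat{\beta}_0$ = place where the line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ crosses the Y axis when X=0

[Figure: the line plotted for X from 0 to 5, with $\hat{\beta}_0$ and $\hat{\beta}_1$ marked.]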



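$\hat{\beta}_1$ = how much the line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ increases when X is increased by 1 unit

[Figure: the line plotted for X from 0 to 5, with $\hat{\beta}_0$ and $\hat{\beta}_1$ marked.]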
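The two meanings above can be checked numerically. The sketch below uses a hypothetical intercept and slope (the values 2.0 and 0.5 are assumptions chosen for illustration, not taken from the lecture) and confirms that the line equals the intercept at X = 0 and rises by the slope for each 1-unit increase in X.

```python
def predict(x, b0, b1):
    """Height of the line b0 + b1 * x at the point x."""
    return b0 + b1 * x


# Hypothetical intercept and slope, chosen only for illustration.
b0, b1 = 2.0, 0.5

# The intercept is the value of the line at X = 0.
value_at_zero = predict(0, b0, b1)

# The slope is the change in the line when X goes up by 1 unit.
rise_per_unit = predict(4, b0, b1) - predict(3, b0, b1)

print(value_at_zero, rise_per_unit)
```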



Every one of us will draw a slightly different line through the data

We need an algorithm to construct a line that in some sense “best fits the data”

The usual method, called least squares, chooses the line that makes the total squared distance between the line and the actual data as small as possible


GPA and Height

[Figure: scatterplot. Horizontal axis: Height in inches (55 to 80); vertical axis: Grade Point Average (GPA) (1.5 to 4.5).]

Try your hand at computing a line through these data:

Draw on the paper by eye, and compare it to my line drawn by eye on the next slide








The data are $(Y_i, X_i)$, for $i = 1, \ldots, n$

Any line is $\hat{\beta}_0 + \hat{\beta}_1 X_i$

Total squared distance is $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$

The slope & intercept are chosen to minimize this total squared distance
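As a rough illustration of the total squared distance, the sketch below scores two candidate lines on a small made-up data set. The (height, GPA) pairs and both candidate lines are hypothetical, not the lecture's data; the point is only that different lines give different totals, and least squares looks for the smallest one.

```python
def total_squared_distance(xs, ys, b0, b1):
    """Sum over the data of (Y_i - b0 - b1 * X_i) squared."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical (height, GPA) pairs, invented for illustration only.
heights = [62, 64, 66, 68, 70, 72]
gpas = [3.4, 3.1, 3.2, 2.9, 3.0, 2.7]

# Candidate 1: a flat line at GPA = 3.05.
d_flat = total_squared_distance(heights, gpas, 3.05, 0.0)

# Candidate 2: a downward-sloping line.
d_tilted = total_squared_distance(heights, gpas, 7.0, -0.059)

print("flat line:  ", d_flat)
print("tilted line:", d_tilted)
```

On these made-up numbers the tilted line has the smaller total, so it is the better of the two candidates in the least squares sense.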


GPA and Height

[Figure: scatterplot with a fitted line. Horizontal axis: Height in inches (55 to 80); vertical axis: Grade Point Average (GPA) (1.5 to 4.5).]

Distance of observation (in red) to the line (in blue)





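The slope that minimizes the total squared distance $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$ is

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$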





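The intercept that minimizes the total squared distance $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$ is

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

This is algebra! The estimates are called the least squares estimates

SPSS calculates these automatically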
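The slope and intercept formulas can be computed directly. The lecture uses SPSS for this; the Python sketch below is only an illustration of the same algebra, and the five data points are hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # Intercept: Y-bar minus slope times X-bar.
    intercept = y_bar - slope * x_bar
    return intercept, slope


def total_squared_distance(xs, ys, b0, b1):
    """Sum over the data of (Y_i - b0 - b1 * X_i) squared."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares(xs, ys)
best = total_squared_distance(xs, ys, intercept, slope)

print("intercept:", intercept)
print("slope:    ", slope)
print("total squared distance at the fit:", best)
```

Nudging either estimate away from the fitted value and recomputing `total_squared_distance` should never give a smaller total, which is what "least squares" means.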



Intercept & Slope are in the “B” column; Constant = intercept

Coefficients (a)

Model 1             B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)          5.529                .897                               6.165    .000
Height in inches    -3.72E-02            .013         -.273                 -2.807   .006

a. Dependent Variable: Grade Point Average (GPA)



Intercept = 5.529 & Slope = -0.0372 are in “B” column





This means the least squares line is

GPA = 5.529 – 0.0372 * Height

Interpretation #1: The slope of the line is negative, indicating a possible negative relationship between height and GPA






Interpretation #2: The Least Squares line suggests that for each additional inch of height, the predicted GPA decreases by 0.0372






Interpretation #3: The Least Squares line suggests that if someone is 64 inches tall, his/her GPA might be predicted by

5.529 – 0.0372 * 64 = 3.148
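The arithmetic in Interpretations #2 and #3 can be reproduced in a few lines. The intercept and slope are the SPSS values from the “B” column; the function name `predicted_gpa` is just an illustrative choice.

```python
# Least squares estimates reported in the SPSS "B" column.
INTERCEPT = 5.529
SLOPE = -0.0372


def predicted_gpa(height_inches):
    """Predicted GPA from the fitted line GPA = 5.529 - 0.0372 * Height."""
    return INTERCEPT + SLOPE * height_inches


# Interpretation #3: prediction for someone 64 inches tall.
print(round(predicted_gpa(64), 3))  # 3.148

# Interpretation #2: one more inch changes the prediction by the slope.
change_per_inch = predicted_gpa(65) - predicted_gpa(64)
print(round(change_per_inch, 4))  # -0.0372
```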






Interpretation #4: There is something fishy here. Why should height predict GPA? Makes no sense to me.



SPSS can draw a scatterplot and put the least squares line into it

Use “graphs”, “interactive”, “scatterplot”

Fix “title”

Under “fit”, click on “total”

You can remove the label for the line & R-Square

SPSS Demo on the exam scores for 2001: you need to change the status of the variable from categorical to scale



Depending on how you input the data, SPSS may insist that your numerical variable is actually categorical

When you do the interactive plot, right click then click on “scale” to convert it



You can manipulate the graph by double clicking on it and then moving things around

Note how the least squares line is part of the graph

Height and GPA (Head Circumference Data)

[Figure: SPSS scatterplot with the least squares line (“Linear Regression”). Horizontal axis: Height in inches (60 to 75); vertical axis: Grade Point Average (GPA) (2.00 to 4.00). Line label: Grade Point Average (GPA) = 5.53 + -0.04 * height; R-Square = 0.07.]



Exam #1 and Exam #2, 2001

[Figure: SPSS scatterplot with the least squares line (“Linear Regression”). Horizontal axis: Exam 1 Grade (70 to 100); vertical axis: Exam 2 Grade (50 to 100). Line label: Exam 2 Grade = 23.70 + 0.68 * e1; R-Square = 0.47.]




The population parameters $\beta_0$ and $\beta_1$ are simply the least squares estimates computed on all the members of the population, not just the sample

Population parameters: $\beta_0$ and $\beta_1$

Sample statistics: $\hat{\beta}_0$ and $\hat{\beta}_1$



Formally speaking, the linear regression model says that Y and X are related: $Y = \beta_0 + \beta_1 X + \epsilon$

Here $\epsilon$ is the error (or deviation from the line)

Also, $\beta_0$ and $\beta_1$ are the population intercept and slope




The meaning of the line $Y = \beta_0 + \beta_1 X + \epsilon$ is:

Take the (sub)population all of whom have independent variable value X

The mean of this (sub)population is $\beta_0 + \beta_1 X$
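A small simulation can make the (sub)population idea concrete. Everything numeric here is an assumption for illustration: the population intercept and slope, the error standard deviation, and the use of normally distributed errors. The sketch draws many Y values at a single fixed X and compares their average with the height of the line at that X.

```python
import random


def simulate_subpopulation_mean(x, beta0, beta1, sigma, n, seed=0):
    """Average of n simulated Y values, all sharing the same X.

    Each Y is beta0 + beta1 * x plus a normal error with mean 0 and
    standard deviation sigma (an assumed error distribution).
    """
    rng = random.Random(seed)
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n)]
    return sum(ys) / n


# Hypothetical population values, chosen only for illustration.
beta0, beta1, sigma = 5.0, -0.03, 0.5
x = 64

mean_y = simulate_subpopulation_mean(x, beta0, beta1, sigma, n=20000)
line_value = beta0 + beta1 * x

print("simulated mean of Y at X = 64:", mean_y)
print("value of the line at X = 64:  ", line_value)
```

With a large simulated (sub)population the two printed numbers should be close, since the errors average out to roughly zero.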




In order to make inference about the population slope and intercept, we need to make a few assumptions

$Y = \beta_0 + \beta_1 X + \epsilon$



Assumption #1: A straight line really fits the data (sort of by inspection)

There is no point fitting a straight line to this type of data:

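$Y = \beta_0 + \beta_1 X + \epsilon$

[Figure: scatter of points (shown as asterisks) that does not follow a straight line.]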



Assumption #2: The errors are at least vaguely normally distributed

This is important for inference, especially the normal ranges we will construct later

$Y = \beta_0 + \beta_1 X + \epsilon$



Assumption #3: The errors have somewhat the same variances

This is important for inference, especially the normal ranges we will construct later

It is also important for making inferences about the population slope $\beta_1$

$Y = \beta_0 + \beta_1 X + \epsilon$



Assumption #1: A straight line really fits the data

Assumption #2: The errors are at least vaguely normally distributed

Assumption #3: The errors have somewhat the same variances

We will build graphical ways to check these assumptions

$Y = \beta_0 + \beta_1 X + \epsilon$
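The lecture does not say here which graphs will be used. One common starting point, offered only as an assumption, is the residuals: each observed Y minus the value of the fitted line at its X. They are the sample stand-ins for the errors in the model, so their pattern against X speaks to Assumptions #1 and #3 and their distribution to Assumption #2. The sketch below computes them for hypothetical data; plotting is left out.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Observed Y minus fitted value, one residual per observation."""
    b0, b1 = least_squares(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

res = residuals(xs, ys)
for x, r in zip(xs, res):
    print(x, round(r, 3))
```

A clear curve in the residuals when plotted against X would cast doubt on the straight line, and a spread that grows or shrinks with X would cast doubt on equal variances.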
