Copyright (c) Bani K. Mallick, STAT 651, Lecture #18
(posted 19-Dec-2015)

Page 1:

STAT 651

Lecture #18

Page 2:

Topics in Lecture #18 Regression and Scatterplots

Page 3:

Book Chapters in Lecture #18 Chapters 11.1, 11.2

Page 4:

Linear Regression and Correlation

Linear regression and correlation are aimed at understanding how two variables are related

The variables are called Y and X

Y is called the dependent variable

X is called the independent variable

We want to know how, and whether, X influences Y

Page 5:

Linear Regression and Correlation

The basic tool of regression is the scatterplot

This simply plots the data in a graph

X is along the horizontal (or X) axis

Y is along the vertical (or Y) axis

Page 6:

GPA and Height

[Scatterplot: X axis = Height in inches (55 to 80), Y axis = Grade Point Average (GPA) (1.5 to 4.5)]

Note how GPA’s generally get lower as height increases: data do not fall exactly on a line

Page 7:

Log(1+Aortic Valve Area) and Body Surface Area

Note how AVA’s generally get larger as Body Surface Area increases: data do not fall exactly on a line

[Scatterplot, "Healthy Children, Stenosis Data": X axis = Body Surface Area (0.0 to 2.5), Y axis = Log(1 + Aortic Valve Area) (-0.2 to 1.6)]

Page 8:

Exam Grades in Stat651

[Scatterplot, "Exam #1 and Exam #2, 2001": X axis = Exam 1 Grade (70 to 100), Y axis = Exam 2 Grade (50 to 100)]

Page 9:

Linear Regression and Correlation

Let Y = GPA, X = height

A linear prediction equation is a line, such as

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$

The intercept of the line = $\hat{\beta}_0$

The slope of the line = $\hat{\beta}_1$

Page 10:

Linear Regression and Correlation

$\hat{\beta}_0$ = place where the line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ crosses the Y axis, i.e., the value of $\hat{Y}$ when X = 0

[Sketch: the fitted line crossing the Y axis; X axis marked 1 2 3 4 5]

Page 11:

Linear Regression and Correlation

$\hat{\beta}_1$ = how much the line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ increases when X is increased by 1 unit

[Sketch: the fitted line; X axis marked 1 2 3 4 5]

Page 12:

Linear Regression and Correlation

Every one of us will draw a slightly different line through the data

We need an algorithm to construct a line that in some sense “best fits the data”

The usual method, called least squares, tries to make the squared distance between the line and the actual data as small as possible

Page 13:

GPA and Height

[Scatterplot: X axis = Height in inches (55 to 80), Y axis = Grade Point Average (GPA) (1.5 to 4.5)]

Try your hand at computing a line through these data:

Draw on the paper by eye, and compare it to my line drawn by eye on the next slide

Page 14:

GPA and Height

[Scatterplot with a hand-drawn line: X axis = Height in inches (55 to 80), Y axis = Grade Point Average (GPA) (1.5 to 4.5)]

Try your hand at computing a line through these data:

Draw on the paper by eye, and compare it to my line drawn by eye on the next slide

Page 15:

Linear Regression and Correlation

Every one of us will draw a slightly different line through the data

We need an algorithm to construct a line that in some sense “best fits the data”

The usual method, called least squares, tries to make the squared distance between the line and the actual data as small as possible

Page 16:

Linear Regression and Correlation

The usual method, called least squares, tries to make the squared distance between the line and the actual data as small as possible

The data are $(Y_i, X_i)$, for $i = 1, \ldots, n$

Any line is $\hat{\beta}_0 + \hat{\beta}_1 X_i$

Total squared distance is $\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$

The slope & intercept are chosen to minimize this total squared distance
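The total squared distance above can be written as a short function. This is a minimal sketch in Python; the toy heights and GPAs are made up for illustration and are not the lecture's data:

```python
# Total squared distance between the data and a candidate line b0 + b1*x,
# i.e. the least squares criterion sum_i (Y_i - b0 - b1*X_i)^2.
def total_squared_distance(x, y, b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Toy data (hypothetical): heights in inches and GPAs
heights = [60, 62, 65, 68, 70, 72]
gpas = [3.6, 3.4, 3.2, 3.0, 2.9, 2.7]

# Score one eyeballed line; least squares picks b0, b1 to make this minimal
sse_eyeball = total_squared_distance(heights, gpas, 7.0, -0.06)
```

Different eyeballed lines give different totals; the least squares estimates are the pair (b0, b1) that makes this sum as small as possible.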

Page 17:

GPA and Height

[Scatterplot: X axis = Height in inches (55 to 80), Y axis = Grade Point Average (GPA) (1.5 to 4.5)]

Distance of observation (in red) to the line (in blue)

Page 18:

Linear Regression and Correlation

Total squared distance is $\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$

The slope & intercept are chosen to minimize this total squared distance

The slope is

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$

Page 19:

Linear Regression and Correlation

Total squared distance is $\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$

The slope & intercept are chosen to minimize this total squared distance

The intercept is

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

This is algebra! The estimates are called the least squares estimates

SPSS calculates these automatically
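The closed-form slope and intercept formulas can be coded directly. A minimal sketch in Python, reusing the same made-up toy heights and GPAs (not the lecture's data):

```python
# Least squares slope and intercept from the closed-form formulas:
# slope = sum (Y_i - Ybar)(X_i - Xbar) / sum (X_i - Xbar)^2
# intercept = Ybar - slope * Xbar
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Toy data (hypothetical)
heights = [60, 62, 65, 68, 70, 72]
gpas = [3.6, 3.4, 3.2, 3.0, 2.9, 2.7]
b0, b1 = least_squares(heights, gpas)
```

For this toy set the slope comes out negative, mirroring the height-and-GPA pattern in the slides; SPSS reports these same estimates in the "B" column.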

Page 20:

Linear Regression and Correlation

Intercept & Slope are in the "B" column; Constant = intercept

Coefficients(a)

Model 1            |      B     | Std. Error |  Beta  |    t   |  Sig.
(Constant)         |    5.529   |    .897    |        |  6.165 |  .000
Height in inches   |  -3.72E-02 |    .013    |  -.273 | -2.807 |  .006

(B, Std. Error: unstandardized coefficients; Beta: standardized coefficient)
a. Dependent Variable: Grade Point Average (GPA)

Page 21:

Linear Regression and Correlation

Intercept = 5.529 & Slope = -0.0372 are in the "B" column

Coefficients(a)

Model 1            |      B     | Std. Error |  Beta  |    t   |  Sig.
(Constant)         |    5.529   |    .897    |        |  6.165 |  .000
Height in inches   |  -3.72E-02 |    .013    |  -.273 | -2.807 |  .006

(B, Std. Error: unstandardized coefficients; Beta: standardized coefficient)
a. Dependent Variable: Grade Point Average (GPA)

Page 22:

Linear Regression and Correlation

Intercept = 5.529 & Slope = -0.0372 are in the "B" column

This means the least squares line is

GPA = 5.529 – 0.0372 * Height

Interpretation #1: The slope of the line is negative, indicating a possible negative relationship between height and GPA

Page 23:

Linear Regression and Correlation

Intercept = 5.529 & Slope = -0.0372 are in the "B" column

This means the least squares line is

GPA = 5.529 – 0.0372 * Height

Interpretation #2: The Least Squares line suggests that for every inch in height added, the GPA decreases by 0.0372

Page 24:

Linear Regression and Correlation

Intercept = 5.529 & Slope = -0.0372 are in the "B" column

This means the least squares line is

GPA = 5.529 – 0.0372 * Height

Interpretation #3: The Least Squares line suggests that if someone is 64 inches tall, his or her GPA might be predicted by

5.529 – 0.0372 * 64 = 3.148
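This prediction arithmetic is easy to check with a couple of lines of Python, plugging the SPSS intercept and slope from the slides into the fitted line:

```python
# Fitted least squares line from the SPSS output: GPA = 5.529 - 0.0372 * Height
intercept = 5.529
slope = -0.0372

def predict_gpa(height):
    # Predicted GPA for a person of the given height (inches)
    return intercept + slope * height

gpa_64 = predict_gpa(64)  # someone 64 inches tall
```

The result agrees with the slide's hand computation of 3.148 (to three decimals).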

Page 25:

Linear Regression and Correlation

Intercept = 5.529 & Slope = -0.0372 are in the "B" column

This means the least squares line is

GPA = 5.529 – 0.0372 * Height

Interpretation #4: There is something fishy here. Why should height predict GPA? Makes no sense to me.

Page 26:

Linear Regression and Correlation

SPSS can draw a scatterplot and put the least squares line into it

Use “graphs”, “interactive”, “scatterplot”

Fix “title”

Under “fit”, click on “total”

Can remove the label for the line & Rsquare

SPSS Demo on the exam scores for 2001: need to change the status of the variable from categorical to scale

Page 27:

Linear Regression and Correlation

Depending on how you input the data, SPSS may insist that your numerical variable is actually categorical

When you do the interactive plot, right click then click on “scale” to convert it

Page 28:

Linear Regression and Correlation

You can manipulate the graph by double clicking on it and then moving things around

Note how the least squares line is part of the graph

[Scatterplot with fitted line, "Height and GPA (Head Circumference Data)": X axis = Height in inches (60 to 75), Y axis = Grade Point Average (GPA) (2.0 to 4.0); Grade Point Average (GPA) = 5.53 + -0.04 * height, R-Square = 0.07]

Page 29:

Linear Regression and Correlation

[Scatterplot with fitted line, "Exam #1 and Exam #2, 2001": X axis = Exam 1 Grade (70 to 100), Y axis = Exam 2 Grade (50 to 100); Exam 2 Grade = 23.70 + 0.68 * e1, R-Square = 0.47]

You can manipulate the graph by double clicking on it and then moving things around

Note how the least squares line is part of the graph

Page 30:

Linear Regression and Correlation

The population parameters $\beta_0$ and $\beta_1$ are simply the least squares estimates computed on all the members of the population, not just the sample

Population parameters: $\beta_0$ and $\beta_1$

Sample statistics: $\hat{\beta}_0$ and $\hat{\beta}_1$

Page 31:

Linear Regression and Correlation

Formally speaking, the linear regression model says that Y and X are related:

$Y = \beta_0 + \beta_1 X + \epsilon$

Here $\epsilon$ is the error (or deviation from the line)

Also, $\beta_0$ and $\beta_1$ are the population intercept and slope

Page 32:

Linear Regression and Correlation

Formally speaking, the linear regression model says that Y and X are related:

$Y = \beta_0 + \beta_1 X + \epsilon$

The meaning of the line is:

Take the (sub)population all of whom have independent variable value X

The mean of this (sub)population is $\beta_0 + \beta_1 X$
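One way to see the "mean of the subpopulation" statement is a small simulation. This is an illustrative sketch; the population intercept and slope here are made up, not taken from the lecture:

```python
import random

random.seed(651)

beta0, beta1 = 2.0, 0.5   # hypothetical population intercept and slope
x = 3.0                   # fix the independent variable at X = 3

# Draw many members of the subpopulation having this X:
# each Y is beta0 + beta1*X plus a random error epsilon ~ N(0, 1)
ys = [beta0 + beta1 * x + random.gauss(0, 1) for _ in range(100_000)]

# The average of the subpopulation should be close to beta0 + beta1*x = 3.5
subpop_mean = sum(ys) / len(ys)
```

The individual Y values scatter around the line because of the errors, but their average sits essentially on the line, which is exactly what the model claims.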

Page 33:

Linear Regression and Correlation

Formally speaking, the linear regression model says that Y and X are related:

$Y = \beta_0 + \beta_1 X + \epsilon$

In order to make inference about the population slope and intercept, we need to make a few assumptions

Page 34:

Linear Regression and Correlation

Assumption #1: A straight line really fits the data (sort of by inspection)

$Y = \beta_0 + \beta_1 X + \epsilon$

There is no point fitting a straight line to this type of data:

[Sketch: scattered points following a clearly curved, nonlinear pattern]

Page 35:

Linear Regression and Correlation

Assumption #2: The errors are at least vaguely normally distributed

This is important for inference, especially the normal ranges we will construct later

$Y = \beta_0 + \beta_1 X + \epsilon$

Page 36:

Linear Regression and Correlation

Assumption #3: The errors have somewhat the same variances

This is important for inference, especially the normal ranges we will construct later

It is also important for making inferences about the population slope $\beta_1$

$Y = \beta_0 + \beta_1 X + \epsilon$

Page 37:

Linear Regression and Correlation

Assumption #1: A straight line really fits the data

Assumption #2: The errors are at least vaguely normally distributed

Assumption #3: The errors have somewhat the same variances

We will build graphical ways to check these assumptions

$Y = \beta_0 + \beta_1 X + \epsilon$
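The three assumption checks above can be sketched numerically: fit the line, form the residuals (the estimated errors), and look at their center and spread. This is a rough stand-in for the graphical checks to come; the data are the same hypothetical toy set used earlier, not the lecture's:

```python
# Rough numerical checks of the regression assumptions via residuals.
heights = [60, 62, 65, 68, 70, 72]   # toy data (hypothetical)
gpas = [3.6, 3.4, 3.2, 3.0, 2.9, 2.7]

# Least squares fit from the closed-form formulas
n = len(heights)
xbar, ybar = sum(heights) / n, sum(gpas) / n
b1 = sum((y - ybar) * (x - xbar) for x, y in zip(heights, gpas)) / \
     sum((x - xbar) ** 2 for x in heights)
b0 = ybar - b1 * xbar

# Residuals = data minus fitted line; least squares forces their mean to 0
residuals = [y - (b0 + b1 * x) for x, y in zip(heights, gpas)]
mean_resid = sum(residuals) / n

# Constant-variance check: compare residual spread in low-X vs high-X halves
low_var = sum(r * r for r in residuals[: n // 2]) / (n // 2)
high_var = sum(r * r for r in residuals[n // 2 :]) / (n - n // 2)
```

If the two halves' spreads differ wildly, Assumption #3 is suspect; a histogram or normal plot of the residuals would speak to Assumption #2, and a scatterplot of residuals against X to Assumption #1.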