Copyright (c) Bani K. Mallick 1
STAT 651
Lecture #18
Topics in Lecture #18
Regression and Scatterplots
Book Chapters in Lecture #18
Chapters 11.1, 11.2
Linear Regression and Correlation
Linear regression and correlation are aimed at understanding how two variables are related
The variables are called Y and X
Y is called the dependent variable
X is called the independent variable
We want to know how, and whether, X influences Y
Linear Regression and Correlation
The basic tool of regression is the scatterplot
This simply plots the data in a graph
X is along the horizontal (or X) axis
Y is along the vertical (or Y) axis
GPA and Height
[Scatterplot: Grade Point Average (GPA) on the vertical axis against height in inches on the horizontal axis]
Note how GPAs generally get lower as height increases; the data do not fall exactly on a line
Log(1+Aortic Valve Area) and Body Surface Area
Note how AVAs generally get larger as Body Surface Area increases; the data do not fall exactly on a line
[Scatterplot: Healthy Children, Stenosis Data. Log(1 + Aortic Valve Area) on the vertical axis against Body Surface Area on the horizontal axis]
Exam Grades in Stat651
[Scatterplot: Exam #1 and Exam #2, 2001. Exam 2 Grade on the vertical axis against Exam 1 Grade on the horizontal axis]
Linear Regression and Correlation
Let Y = GPA, X = height
A linear prediction equation is a line, such as $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
The intercept of the line = $\hat{\beta}_0$
The slope of the line = $\hat{\beta}_1$
Linear Regression and Correlation
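$\hat{\beta}_0$ = place where the line crosses the Y axis, when X = 0
[Figure: the line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ plotted for X from 0 to 5, with $\hat{\beta}_0$ marked where the line crosses the Y axis]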
Linear Regression and Correlation
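$\hat{\beta}_1$ = how much the line increases when X is increased by 1 unit
[Figure: the line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ plotted for X from 0 to 5, with $\hat{\beta}_1$ marked as the rise for a 1-unit increase in X]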
Linear Regression and Correlation
Every one of us will draw a slightly different line through the data
We need an algorithm to construct a line that in some sense “best fits the data”
The usual method, called least squares, chooses the line that makes the total squared vertical distance between the line and the actual data as small as possible
GPA and Height
[Scatterplot: Grade Point Average (GPA) on the vertical axis against height in inches on the horizontal axis]
Try your hand at computing a line through these data:
Draw on the paper by eye, and compare it to my line drawn by eye on the next slide
GPA and Height
[Scatterplot: Grade Point Average (GPA) on the vertical axis against height in inches on the horizontal axis]
Linear Regression and Correlation
The usual method, called least squares, chooses the line that makes the total squared vertical distance between the line and the actual data as small as possible
The data are $(Y_i, X_i)$, for $i = 1, \ldots, n$
Any line is $\hat{\beta}_0 + \hat{\beta}_1 X_i$
Total squared distance is $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$
The slope & intercept are chosen to minimize this total squared distance
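To make the criterion concrete, here is a minimal Python sketch of the total squared distance: for a candidate line with intercept b0 and slope b1, it sums (Y_i - b0 - b1 * X_i)^2 over the observations. The function name and the small data set are hypothetical, made up for illustration; they are not the GPA data from the lecture.

```python
def total_squared_distance(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line b0 + b1 * x."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical heights (inches) and GPAs, invented for this example only.
heights = [60, 62, 65, 68, 70, 72]
gpas = [3.4, 3.1, 3.3, 2.9, 3.0, 2.7]

# Two candidate lines "drawn by eye": least squares prefers the smaller total.
print(total_squared_distance(heights, gpas, 6.0, -0.045))
print(total_squared_distance(heights, gpas, 3.0, 0.0))
```

Two people who draw different lines by eye can compare them this way; the least squares line is the one that no other line beats on this total.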
GPA and Height
[Scatterplot: Grade Point Average (GPA) on the vertical axis against height in inches on the horizontal axis, with a line drawn through the data]
Distance of observation (in red) to the line (in blue)
Linear Regression and Correlation
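Total squared distance is $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$
The slope & intercept are chosen to minimize this total squared distance
The slope is $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$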
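The minimization can be carried out with calculus. The following sketch is not on the slides; the symbols $b_0$ and $b_1$ (a candidate intercept and slope) and $S$ (the total squared distance) are introduced here only for illustration.

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0
  \quad\Longrightarrow\quad b_0 = \bar{Y} - b_1 \bar{X} \\
\frac{\partial S}{\partial b_1} &= -2 \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0 \\
\text{substituting } b_0: \quad
0 &= \sum_{i=1}^{n} X_i \bigl[ (Y_i - \bar{Y}) - b_1 (X_i - \bar{X}) \bigr] \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}
  = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{align*}
```

The last equality uses the facts that $\sum_i (Y_i - \bar{Y}) = 0$ and $\sum_i (X_i - \bar{X}) = 0$, so subtracting $\bar{X}$ from each $X_i$ changes neither sum. The formula requires that the $X_i$ are not all equal.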
Linear Regression and Correlation
Total squared distance is $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$
The slope & intercept are chosen to minimize this total squared distance
The intercept is $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
This is algebra! The estimates are called the least squares estimates
SPSS calculates these automatically
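For readers without SPSS, the same estimates can be computed by hand. The sketch below assumes the standard least squares formulas: the slope is the sum of (X_i - mean X)(Y_i - mean Y) divided by the sum of (X_i - mean X)^2, and the intercept is mean Y minus slope times mean X. The function name and the data are hypothetical, not the lecture's GPA data.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical heights (inches) and GPAs, invented for this example only.
heights = [60, 62, 65, 68, 70, 72]
gpas = [3.4, 3.1, 3.3, 2.9, 3.0, 2.7]

b0, b1 = least_squares_line(heights, gpas)
print(f"fitted line: GPA = {b0:.3f} + {b1:.4f} * Height")
```

On real data the result should agree with the "B" column of the SPSS coefficients table up to rounding.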
Linear Regression and Correlation
Intercept & Slope are in the “B” column; “(Constant)” = intercept
Coefficients(a)

                      Unstandardized Coefficients   Standardized Coefficients
Model 1               B            Std. Error       Beta                        t         Sig.
(Constant)            5.529        .897                                         6.165     .000
Height in inches      -3.72E-02    .013             -.273                       -2.807    .006

a. Dependent Variable: Grade Point Average (GPA)
Linear Regression and Correlation
Intercept = 5.529 & Slope = -0.0372 are in “B” column
Linear Regression and Correlation
Intercept = 5.529 & Slope = -0.0372 are in “B” column
This means the least squares line is
GPA = 5.529 – 0.0372 * Height
Interpretation #1: The slope of the line is negative, indicating a possible negative relationship between height and GPA
Interpretation #2: The least squares line suggests that for every additional inch of height, the predicted GPA decreases by 0.0372
Interpretation #3: The least squares line suggests that if someone is 64 inches tall, his/her GPA might be predicted by
5.529 – 0.0372 * 64 = 3.148
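The same arithmetic can be written as a small prediction function. This is only a sketch: the coefficients are the ones reported in the "B" column above, while the names `INTERCEPT`, `SLOPE`, and `predict_gpa` are hypothetical.

```python
INTERCEPT = 5.529   # "(Constant)" row of the B column
SLOPE = -0.0372     # "Height in inches" row of the B column (-3.72E-02)


def predict_gpa(height_inches):
    """Predicted GPA from the fitted line GPA = 5.529 - 0.0372 * Height."""
    return INTERCEPT + SLOPE * height_inches


print(round(predict_gpa(64), 3))  # 3.148, matching the calculation above
```

Predictions one inch apart, such as `predict_gpa(65) - predict_gpa(64)`, differ by the slope, which is the content of Interpretation #2.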
Interpretation #4: There is something fishy here. Why should height predict GPA? Makes no sense to me.
Linear Regression and Correlation
SPSS can draw a scatterplot and put the least squares line into it
Use “graphs”, “interactive”, “scatterplot”
Fix “title”
Under “fit”, click on “total”
Can remove the label for the line & Rsquare
SPSS Demo on the exam scores for 2001: need to change the status of the variable from categorical to scale
Linear Regression and Correlation
Depending on how you input the data, SPSS may insist that your numerical variable is actually categorical
When you do the interactive plot, right click then click on “scale” to convert it
Linear Regression and Correlation
You can manipulate the graph by double clicking on it and then moving things around
Note how the least squares line is part of the graph
[Scatterplot with least squares line: Height and GPA (Head Circumference Data). Grade Point Average (GPA) on the vertical axis against height in inches on the horizontal axis. Fitted line: Grade Point Average (GPA) = 5.53 + -0.04 * height; R-Square = 0.07]
Linear Regression and Correlation
[Scatterplot with least squares line: Exam #1 and Exam #2, 2001. Exam 2 Grade on the vertical axis against Exam 1 Grade on the horizontal axis. Fitted line: Exam 2 Grade = 23.70 + 0.68 * e1; R-Square = 0.47]
You can manipulate the graph by double clicking on it and then moving things around
Note how the least squares line is part of the graph
Linear Regression and Correlation
The population parameters $\beta_0$ and $\beta_1$ are simply the least squares estimates computed on all the members of the population, not just the sample
Population parameters: $\beta_0$ and $\beta_1$
Sample statistics: $\hat{\beta}_0$ and $\hat{\beta}_1$
Linear Regression and Correlation
Formally speaking, the linear regression model says that Y and X are related: $Y = \beta_0 + \beta_1 X + \epsilon$
Here $\epsilon$ is the error (or deviation from the line)
Also, $\beta_0$ and $\beta_1$ are the population intercept and slope
Linear Regression and Correlation
Formally speaking, the linear regression model says that Y and X are related: $Y = \beta_0 + \beta_1 X + \epsilon$
The meaning of the line is:
Take the (sub)population all of whom have independent variable value X
The mean of this (sub)population is $\beta_0 + \beta_1 X$
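A small simulation can illustrate what "the mean of this (sub)population" means. The sketch below draws many Y values at one fixed X from the model Y = beta0 + beta1 * X + error and compares their average with the value of the line at that X. The parameter values, the normally distributed errors, and the function name are all assumptions made for illustration.

```python
import random


def simulate_subpopulation(beta0, beta1, sigma, x, n, seed=0):
    """Draw n values of Y = beta0 + beta1 * x + error at one fixed x.

    The errors are assumed normal with mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n)]


# Hypothetical population intercept, slope, and error standard deviation.
beta0, beta1, sigma = 5.5, -0.04, 0.5
x = 64

ys = simulate_subpopulation(beta0, beta1, sigma, x, n=10_000)
print("value of the line at X:", beta0 + beta1 * x)
print("average of simulated Y:", sum(ys) / len(ys))
```

Individual Y values scatter around the line, but their average at a given X settles close to beta0 + beta1 * X as the subpopulation gets large.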
Linear Regression and Correlation
Formally speaking, the linear regression model says that Y and X are related: $Y = \beta_0 + \beta_1 X + \epsilon$
In order to make inference about the population slope and intercept, we need to make a few assumptions
Linear Regression and Correlation
Assumption #1: A straight line really fits the data (sort of by inspection)
There is no point fitting a straight line to this type of data:
[Figure: a scatter of points that does not follow a straight line]
$Y = \beta_0 + \beta_1 X + \epsilon$
Linear Regression and Correlation
Assumption #2: The errors are at least vaguely normally distributed
This is important for inference, especially the normal ranges we will construct later
$Y = \beta_0 + \beta_1 X + \epsilon$
Linear Regression and Correlation
Assumption #3: The errors have somewhat the same variances
This is important for inference, especially the normal ranges we will construct later
It is also important for making inferences about the population slope $\beta_1$
$Y = \beta_0 + \beta_1 X + \epsilon$
Linear Regression and Correlation
Assumption #1: A straight line really fits the data
Assumption #2: The errors are at least vaguely normally distributed
Assumption #3: The errors have somewhat the same variances
We will build graphical ways to check these assumptions
$Y = \beta_0 + \beta_1 X + \epsilon$
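The graphical checks themselves are not shown in this lecture. A plausible starting point, assumed here rather than taken from the slides, is the residuals: the observed Y minus the value of the fitted line, which estimate the errors in the model. The sketch below computes them for hypothetical data; the function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Least squares (intercept, slope) for the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Observed Y minus fitted value for each observation."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical heights (inches) and GPAs, invented for this example only.
heights = [60, 62, 65, 68, 70, 72]
gpas = [3.4, 3.1, 3.3, 2.9, 3.0, 2.7]

for x, r in zip(heights, residuals(heights, gpas)):
    print(f"height {x}: residual {r:+.3f}")
```

Least squares residuals from a line with an intercept always sum to zero, so what matters for the assumptions is their pattern: a curve in the residuals against X would cast doubt on Assumption #1, a skewed or heavy-tailed spread on Assumption #2, and a spread that changes with X on Assumption #3.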