04/21/23 330 lecture 11 1
STATS 330: Lecture 11
Outliers and high-leverage points
An outlier is a point that has a larger or smaller y value than the model would suggest
• Can be due to a genuine large error
• Can be caused by typographical errors in recording the data
A high-leverage point is a point with extreme values of the explanatory variables
Outliers
The effect of an outlier depends on whether it is also a high-leverage point.
A “high leverage” outlier:
• Can attract the fitted plane, distorting the fit, sometimes extremely
• In extreme cases may not have a big residual
• In extreme cases can increase R²
A “low leverage” outlier:
• Does not distort the fit to the same extent
• Usually has a big residual
• Inflates standard errors, decreases R²
[Figure: four scatterplots of y versus x illustrating the cases: no outliers and no high-leverage points; a low-leverage outlier with a big residual; a high-leverage point that is not an outlier; and a high-leverage outlier.]
Example: the education data (ignoring urban)
[Figure: scatterplot of educ against percap and under18, with one high-leverage point flagged.]
An outlier also?
[Figure: percap versus under18 with the 50 cases labelled; point 50 is extreme in the x-space. A plot of residuals(educ.lm) versus predict(educ.lm) shows the residual for point 50 is somewhat extreme.]
Measuring leverage
It can be shown (see e.g. STATS 310) that the fitted value of case i is related to the response data y_1, …, y_n by the equation

ŷ_i = h_i1 y_1 + h_i2 y_2 + … + h_in y_n,

or in matrix form ŷ = Hy, where H = X(X′X)⁻¹X′.

The h_ij depend on the explanatory variables. The quantities h_ii are called “hat matrix diagonals” (HMDs) and measure the influence y_i has on the i-th fitted value. They can also be interpreted as the distance between the x-data for the i-th case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are.
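The hat matrix and its diagonals can be computed directly from the design matrix. A minimal numpy sketch (Python for illustration only; the lecture's own code is R) with made-up data, where the last case is given an extreme x-value:

```python
import numpy as np

# Hypothetical data: n = 20 cases, simple regression, one extreme x-value.
rng = np.random.default_rng(0)
n = 20
x = np.append(rng.normal(size=n - 1), 8.0)   # last case has an extreme x
X = np.column_stack([np.ones(n), x])          # design matrix (intercept + slope)
p = X.shape[1]                                # p = k + 1 regression coefficients

# Hat matrix H = X (X'X)^{-1} X'; its diagonal entries h_ii are the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(np.all((h > 0) & (h <= 1)))     # each HMD lies between 0 and 1
print(np.isclose(h.sum(), p))         # the HMDs sum to p, so the average is p/n
print(np.where(h > 3 * p / n)[0])     # the 3p/n rule flags the extreme case
```

Because H is a projection matrix, its trace equals p, which is why the average HMD is p/n.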
Interpreting the HMDs
• Each HMD lies between 0 and 1
• The average HMD is p/n (p = number of regression coefficients, p = k + 1)
• An HMD more than 3p/n is considered extreme
Example: the education data

educ.lm <- lm(educ ~ percap + under18, data = educ.df)
hatvalues(educ.lm)[50]
       50
0.3428523

> 9/50
[1] 0.18

With n = 50 and p = 3, the cut-off is 3p/n = 9/50 = 0.18, so point 50 is clearly extreme.
Studentized residuals
How can we recognize a big residual? How big is big?
The actual size depends on the units in which the y-variable is measured, so we need to standardize the residuals by dividing by their standard deviations. The variance of a typical residual e is

var(e) = (1 − h)σ²,

where h is the hat matrix diagonal for the point.
Studentized residuals (2)

“Internally studentised” (called “standardised” in R):

r_i = e_i / √((1 − h_ii) s²)

“Externally studentised” (called “studentised” in R):

t_i = e_i / √((1 − h_ii) s²_(i))

Here s² is the usual estimate of σ², and s²_(i) is the estimate of σ² after deleting the i-th data point.
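Both kinds of studentised residual can be computed from their definitions. A minimal numpy sketch (illustrative made-up data with one planted outlier; the course itself uses R), where the external version refits the model with each case deleted:

```python
import numpy as np

# Hypothetical data: straight-line fit with one inflated response (n = 10).
rng = np.random.default_rng(1)
n = 10
x = np.linspace(0, 9, n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 2 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[4] += 3.0                                   # plant an outlier at case 4

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                                 # ordinary residuals
s2 = e @ e / (n - p)                          # usual estimate of sigma^2

# Internally studentised: every case shares the same variance estimate s^2.
r_int = e / np.sqrt((1 - h) * s2)

# Externally studentised: case i uses s_(i)^2, the estimate of sigma^2
# after deleting case i (computed here by explicit refitting).
r_ext = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_i = y[keep] - X[keep] @ b_i
    s2_i = resid_i @ resid_i / (n - 1 - p)
    r_ext[i] = e[i] / np.sqrt((1 - h[i]) * s2_i)

print(np.argmax(np.abs(r_ext)))    # the planted outlier has the biggest value
```

Deleting the outlier shrinks the variance estimate, which is why the externally studentised residual flags it much more sharply than the internal one.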
Studentized residuals (3)
How big is big? Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers (in fact the externally studentised one has a t-distribution). Thus, studentised residuals should be between −2 and 2 with approximately 95% probability.
Studentized residuals (4)
Calculating in R:

library(MASS)      # load the MASS library
stdres(educ.lm)    # internally studentised (standardised in R)
studres(educ.lm)   # externally studentised (studentised in R)

> stdres(educ.lm)[50]
      50
3.275808
> studres(educ.lm)[50]
      50
3.700221
What does studentised mean?
Recognizing outliers
If a point is a low-leverage outlier, the residual will usually be large, so a large residual combined with a low HMD indicates an outlier.
If a point is a high-leverage outlier, a large error will usually cause a large residual. However, in extreme cases a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we can’t always tell whether the point is an outlier or not.
High-leverage outlier

[Figure: plot of y versus x (Example 5) showing the fitted line pulled towards the outlier, away from the true line; the residuals-versus-fitted plot shows the outlier has a small residual.]

[Figure: “Residuals vs Leverage” plot for lm(educ ~ percap + under18): standardised residuals versus leverage, with Cook’s distance contours at 0.5 and 1; points 50, 7 and 45 are labelled.]
Leverage-residual plot
Can plot standardised residuals versus leverage (HMDs): the leverage-residual plot (LR plot).
Point 50 is high leverage with a big residual, so it is an outlier.

plot(educ.lm, which=5)
Interpreting LR plots

[Schematic: standardised residual (−2 to 2) versus leverage, with the 3p/n cut-off marked. Regions: low-leverage outlier (large residual, small leverage); OK (small residual, small leverage); possible high-leverage outlier (small residual, large leverage); high-leverage outlier (large residual, large leverage).]
[Figure: plot of y versus x (Example 1), with fitted and true lines almost coinciding, and the corresponding “Residuals vs Leverage” plot; points 1, 9 and 23 are labelled, none extreme.]
Residuals and HMDs
No big studentized residuals, no big HMDs (3p/n = 0.2 for this example).
[Figure: plot of y versus x (Example 2), with point 24 well above the line, and the corresponding “Residuals vs Leverage” plot; point 24 has a large standardised residual but small leverage.]
Residuals and HMDs (2)
One big studentized residual (point 24), no big HMDs (3p/n = 0.2 for this example). The line moves a bit.
[Figure: plot of y versus x (Example 3), with point 1 at an extreme x-value but close to the line, and the corresponding “Residuals vs Leverage” plot; point 1 has a large leverage but a small standardised residual.]
Residuals and HMDs (3)
No big studentized residual, one big HMD (point 1); 3p/n = 0.2 for this example. The line hardly moves: point 1 is high leverage but not influential.
[Figure: plot of y versus x (Example 4), with point 1 at an extreme x-value and well off the line, and the corresponding “Residuals vs Leverage” plot; point 1 has both a large leverage and a large standardised residual.]
Residuals and HMDs (4)
One big studentized residual and one big HMD, both at point 1; 3p/n = 0.2 for this example. The line moves but the residual is still large: point 1 is influential.
[Figure: plot of y versus x (Example 5), with point 1 at an extreme x-value pulling the fitted line away from the true line, and the corresponding “Residuals vs Leverage” plot; point 1 has a very large leverage but only a modest standardised residual.]
Residuals and HMDs (5)
No big studentized residuals, one big HMD (point 1); 3p/n = 0.2. Point 1 is high leverage and influential.
Influential points
How can we tell if a high-leverage point or outlier is affecting the regression? By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression. Such points are called influential points. We don’t want the analysis to be driven by one or two points.
“Leave-one-out” measures
We can calculate a variety of measures by leaving out each data point in turn and looking at the change in key regression quantities:
• Coefficients
• Fitted values
• Standard errors
We discuss each in turn.
Example: education data

          With point 50   Without point 50
Const        -557             -278
percap        0.071            0.059
under18       1.555            0.932
Standardized difference in coefficients: DFBETAS
Formula:

DFBETAS_ij = (β̂_j − β̂_j(i)) / s.e.(β̂_j)

where β̂_j(i) is the estimate of β_j after deleting the i-th data point.
Problem when: greater than 1 in absolute value. This is the criterion coded into R.
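The leave-one-out computation behind DFBETAS can be sketched directly. A minimal numpy illustration with made-up data; note that standard implementations (the Belsley–Kuh–Welsch convention, as in R's dfbetas()) scale by the deleted-case estimate s_(i) times √((X′X)⁻¹_jj):

```python
import numpy as np

# Hypothetical data: simple regression with one high-leverage outlier.
rng = np.random.default_rng(2)
n = 12
x = np.append(np.linspace(0, 1, n - 1), 5.0)   # last case has an extreme x
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 1 + 2 * x + rng.normal(scale=0.1, size=n)
y[-1] += 4.0                                    # ...and an outlying y

b = np.linalg.lstsq(X, y, rcond=None)[0]        # full-data coefficients
XtX_inv = np.linalg.inv(X.T @ X)

# DFBETAS: change in each coefficient when case i is deleted, scaled by the
# deleted-case standard error s_(i) * sqrt((X'X)^{-1}_jj).
dfbetas = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_i = y[keep] - X[keep] @ b_i
    s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))
    dfbetas[i] = (b - b_i) / (s_i * np.sqrt(np.diag(XtX_inv)))

flagged = np.where(np.abs(dfbetas).max(axis=1) > 1)[0]   # |DFBETAS| > 1 rule
print(flagged)    # the high-leverage outlier is flagged
```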
Standardized difference in fitted values: DFFITS
Formula:

DFFITS_i = (ŷ_i − ŷ_i(i)) / s.e.(ŷ_i)

Problem when: greater than 3√(p/(n − p)) in absolute value (p = number of regression coefficients).
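DFFITS can likewise be computed by explicit deletion. A minimal numpy sketch (made-up data; the standard error of ŷ_i is estimated here, as in the usual Belsley–Kuh–Welsch definition, by s_(i)√h_i):

```python
import numpy as np

# Hypothetical data: simple regression with one high-leverage outlier.
rng = np.random.default_rng(3)
n = 12
x = np.append(np.linspace(0, 1, n - 1), 5.0)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 1 + 2 * x + rng.normal(scale=0.1, size=n)
y[-1] += 4.0

b = np.linalg.lstsq(X, y, rcond=None)[0]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# DFFITS: change in the i-th fitted value when case i is deleted,
# scaled by its estimated standard error s_(i) * sqrt(h_i).
dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_i = y[keep] - X[keep] @ b_i
    s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))
    dffits[i] = (X[i] @ (b - b_i)) / (s_i * np.sqrt(h[i]))

cutoff = 3 * np.sqrt(p / (n - p))      # the usual 3*sqrt(p/(n-p)) cut-off
print(np.where(np.abs(dffits) > cutoff)[0])
```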
COVRATIO & Cook’s D
• COVRATIO: measures the change in the standard errors of the estimated coefficients. Problem indicated when COVRATIO is more than 1 + 3p/n or less than 1 − 3p/n.
• Cook’s D: measures the overall change in the coefficients. Problem when more than qf(0.50, p, n−p) (the lower 50% point of the F distribution), roughly 1 in most cases.
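Cook's D admits a closed form in the residuals and leverages, equivalent to the coefficient-change definition. A minimal numpy sketch with made-up data, checking the two against each other:

```python
import numpy as np

# Hypothetical data: simple regression with one high-leverage outlier.
rng = np.random.default_rng(4)
n = 12
x = np.append(np.linspace(0, 1, n - 1), 5.0)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 1 + 2 * x + rng.normal(scale=0.1, size=n)
y[-1] += 4.0

b = np.linalg.lstsq(X, y, rcond=None)[0]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

# Closed form: D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2)
D_closed = e**2 * h / (p * s2 * (1 - h) ** 2)

# Definition via deletion: D_i = (b - b_(i))' X'X (b - b_(i)) / (p s^2)
D_delete = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    d = b - b_i
    D_delete[i] = d @ (X.T @ X) @ d / (p * s2)

print(np.allclose(D_closed, D_delete))   # the two computations agree
```

The agreement follows from the identity b − b_(i) = (X′X)⁻¹ x_i e_i / (1 − h_i), which is also why Cook's D never requires n separate refits in practice.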
Calculations

> influence.measures(educ.lm)
Influence measures of
lm(formula = educ ~ under18 + percap, data = educ.df)

    dfb.1_  dfb.un18 dfb.prcp   dffit cov.r   cook.d    hat inf
10  0.06381 -0.02222 -0.16792 -0.3631 0.803 4.05e-02 0.0257   *
44  0.02289 -0.02948  0.00298 -0.0340 1.283 3.94e-04 0.1690   *
50 -2.36876  2.23393  1.50181  2.4733 0.821 1.66e+00 0.3429   *

Here p = 3, n = 50, so 3p/n = 0.18, 3√(p/(n−p)) = 0.758, and qf(0.5, 3, 47) = 0.8002294.
Plotting influence

# set up plot window with a 2 x 4 array of plots
par(mfrow=c(2,4))
# plot dfbetas, dffits, cov ratio, Cook's D, HMDs
influenceplots(educ.lm)
[Figure: index plots of the influence measures against observation number: dfb.1_, dfb.prcp, dfb.un18, DFFITS, ABS(COV RATIO − 1), Cook’s D and the hat values. Point 50 stands out in every panel except the COV RATIO plot, where points 10 and 44 are flagged.]
Remedies for outliers
• Correct typographical errors in the data
• Delete a small number of points and refit (we don’t want the fitted regression to be determined by one or two influential points)
• Report the existence of outliers separately: they are often of scientific interest
• Don’t delete too many points (1 or 2 at most)
Summary: Doing it in R

LR plot:
plot(educ.lm, which=5)

Full diagnostic display:
plot(educ.lm)

Influence measures:
influence.measures(educ.lm)

Plots of influence measures:
par(mfrow=c(2,4))
influenceplots(educ.lm)
HMD Summary
Hat matrix diagonals:
• Measure the effect of a point on its fitted value
• Measure how outlying the x-values are (how “high-leverage” a point is)
• Are always between 0 and 1, with bigger values indicating higher leverage
• Points with HMDs more than 3p/n are considered “high leverage”