04/21/23 330 lecture 11 1
STATS 330: Lecture 11
Outliers and high-leverage points
An outlier is a point that has a larger or smaller y value than the model would suggest
• Can be due to a genuine large error
• Can be caused by typographical errors in recording the data
A high-leverage point is a point with extreme values of the explanatory variables
Outliers
The effect of an outlier depends on whether it is also a high-leverage point.
A “high leverage” outlier:
• Can attract the fitted plane, distorting the fit, sometimes extremely
• In extreme cases may not have a big residual
• In extreme cases can increase R²
A “low leverage” outlier:
• Does not distort the fit to the same extent
• Usually has a big residual
• Inflates standard errors, decreases R²
[Figure: four scatterplots of y versus x illustrating the cases: no outliers and no high-leverage points; a low-leverage outlier with a big residual; a high-leverage point that is not an outlier; and a high-leverage outlier.]
Example: the education data (ignoring urban)
[Figure: scatterplot of educ against percap and under18, with one high-leverage point flagged.]
An outlier also?
[Figure: percap versus under18 with the 50 cases labelled; point 50 is extreme in the x-space. A plot of residuals(educ.lm) versus predict(educ.lm) shows the residual for point 50 is somewhat extreme.]
Measuring leverage
It can be shown (see e.g. STATS 310) that the fitted value of case i is related to the response data y_1, …, y_n by the equation

ŷ_i = h_i1 y_1 + h_i2 y_2 + … + h_in y_n,

or in matrix form ŷ = Hy, where H = X(X′X)⁻¹X′.

The h_ij depend on the explanatory variables. The quantities h_ii are called “hat matrix diagonals” (HMDs) and measure the influence y_i has on the i-th fitted value. They can also be interpreted as the distance between the x-data for the i-th case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are.
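The hat matrix and its diagonals can be computed directly from the design matrix. A minimal numpy sketch (Python for illustration only; the lecture's own code is R) with made-up data, where the last case is given an extreme x-value:

```python
import numpy as np

# Hypothetical data: n = 20 cases, simple regression, one extreme x-value.
rng = np.random.default_rng(0)
n = 20
x = np.append(rng.normal(size=n - 1), 8.0)   # last case has an extreme x
X = np.column_stack([np.ones(n), x])          # design matrix (intercept + slope)
p = X.shape[1]                                # p = k + 1 regression coefficients

# Hat matrix H = X (X'X)^{-1} X'; its diagonal entries h_ii are the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(np.all((h > 0) & (h <= 1)))     # each HMD lies between 0 and 1
print(np.isclose(h.sum(), p))         # the HMDs sum to p, so the average is p/n
print(np.where(h > 3 * p / n)[0])     # the 3p/n rule flags the extreme case
```

Because H is a projection matrix, its trace equals p, which is why the average HMD is p/n.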
Interpreting the HMDs
• Each HMD lies between 0 and 1
• The average HMD is p/n (p = number of regression coefficients, p = k + 1)
• An HMD more than 3p/n is considered extreme
Example: the education data

educ.lm <- lm(educ ~ percap + under18, data = educ.df)
hatvalues(educ.lm)[50]
       50
0.3428523

> 9/50
[1] 0.18

With n = 50 and p = 3, the cut-off is 3p/n = 9/50 = 0.18, so point 50 is clearly extreme.
Studentized residuals
How can we recognize a big residual? How big is big?
The actual size depends on the units in which the y-variable is measured, so we need to standardize the residuals by dividing by their standard deviations. The variance of a typical residual e is

var(e) = (1 − h)σ²,

where h is the hat matrix diagonal for the point.
Studentized residuals (2)

“Internally studentised” (called “standardised” in R):

r_i = e_i / √((1 − h_ii) s²)

“Externally studentised” (called “studentised” in R):

t_i = e_i / √((1 − h_ii) s²_(i))

Here s² is the usual estimate of σ², and s²_(i) is the estimate of σ² after deleting the i-th data point.
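Both kinds of studentised residual can be computed from their definitions. A minimal numpy sketch (illustrative made-up data with one planted outlier; the course itself uses R), where the external version refits the model with each case deleted:

```python
import numpy as np

# Hypothetical data: straight-line fit with one inflated response (n = 10).
rng = np.random.default_rng(1)
n = 10
x = np.linspace(0, 9, n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 2 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[4] += 3.0                                   # plant an outlier at case 4

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                                 # ordinary residuals
s2 = e @ e / (n - p)                          # usual estimate of sigma^2

# Internally studentised: every case shares the same variance estimate s^2.
r_int = e / np.sqrt((1 - h) * s2)

# Externally studentised: case i uses s_(i)^2, the estimate of sigma^2
# after deleting case i (computed here by explicit refitting).
r_ext = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_i = y[keep] - X[keep] @ b_i
    s2_i = resid_i @ resid_i / (n - 1 - p)
    r_ext[i] = e[i] / np.sqrt((1 - h[i]) * s2_i)

print(np.argmax(np.abs(r_ext)))    # the planted outlier has the biggest value
```

Deleting the outlier shrinks the variance estimate, which is why the externally studentised residual flags it much more sharply than the internal one.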
Studentized residuals (3)
How big is big? Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers (in fact the externally studentised one has a t-distribution). Thus, studentised residuals should be between −2 and 2 with approximately 95% probability.
Studentized residuals (4)
Calculating in R:

library(MASS)      # load the MASS library
stdres(educ.lm)    # internally studentised (standardised in R)
studres(educ.lm)   # externally studentised (studentised in R)

> stdres(educ.lm)[50]
      50
3.275808
> studres(educ.lm)[50]
      50
3.700221
What does studentised mean?
Recognizing outliers
If a point is a low-leverage outlier, the residual will usually be large, so a large residual combined with a low HMD indicates an outlier.
If a point is a high-leverage outlier, a large error will usually cause a large residual. However, in extreme cases a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we can’t always tell whether the point is an outlier or not.
High-leverage outlier

[Figure: plot of y versus x (Example 5) showing the fitted line pulled towards the outlier, away from the true line; the residuals-versus-fitted plot shows the outlier has a small residual.]

[Figure: “Residuals vs Leverage” plot for lm(educ ~ percap + under18): standardised residuals versus leverage, with Cook’s distance contours at 0.5 and 1; points 50, 7 and 45 are labelled.]
Leverage-residual plot
Can plot standardised residuals versus leverage (HMDs): the leverage-residual plot (LR plot).
Point 50 is high leverage with a big residual, so it is an outlier.

plot(educ.lm, which=5)
Interpreting LR plots

[Schematic: standardised residual (−2 to 2) versus leverage, with the 3p/n cut-off marked. Regions: low-leverage outlier (large residual, small leverage); OK (small residual, small leverage); possible high-leverage outlier (small residual, large leverage); high-leverage outlier (large residual, large leverage).]
[Figure: plot of y versus x (Example 1), with fitted and true lines almost coinciding, and the corresponding “Residuals vs Leverage” plot; points 1, 9 and 23 are labelled, none extreme.]
Residuals and HMDs
No big studentized residuals, no big HMDs (3p/n = 0.2 for this example).
[Figure: plot of y versus x (Example 2), with point 24 well above the line, and the corresponding “Residuals vs Leverage” plot; point 24 has a large standardised residual but small leverage.]
Residuals and HMDs (2)
One big studentized residual (point 24), no big HMDs (3p/n = 0.2 for this example). The line moves a bit.
[Figure: plot of y versus x (Example 3), with point 1 at an extreme x-value but close to the line, and the corresponding “Residuals vs Leverage” plot; point 1 has a large leverage but a small standardised residual.]
Residuals and HMDs (3)
No big studentized residual, one big HMD (point 1); 3p/n = 0.2 for this example. The line hardly moves: point 1 is high leverage but not influential.
[Figure: plot of y versus x (Example 4), with point 1 at an extreme x-value and well off the line, and the corresponding “Residuals vs Leverage” plot; point 1 has both a large leverage and a large standardised residual.]
Residuals and HMDs (4)
One big studentized residual and one big HMD, both at point 1; 3p/n = 0.2 for this example. The line moves but the residual is still large: point 1 is influential.
[Figure: plot of y versus x (Example 5), with point 1 at an extreme x-value pulling the fitted line away from the true line, and the corresponding “Residuals vs Leverage” plot; point 1 has a very large leverage but only a modest standardised residual.]
Residuals and HMDs (5)
No big studentized residuals, one big HMD (point 1); 3p/n = 0.2. Point 1 is high leverage and influential.
Influential points
How can we tell if a high-leverage point or outlier is affecting the regression? By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression. Such points are called influential points. We don’t want the analysis to be driven by one or two points.
“Leave-one-out” measures
We can calculate a variety of measures by leaving out each data point in turn and looking at the change in key regression quantities:
• Coefficients
• Fitted values
• Standard errors
We discuss each in turn.
Example: education data

          With point 50   Without point 50
Const        -557             -278
percap        0.071            0.059
under18       1.555            0.932
Standardized difference in coefficients: DFBETAS
Formula:

DFBETAS_ij = (β̂_j − β̂_j(i)) / s.e.(β̂_j)

where β̂_j(i) is the estimate of β_j after deleting the i-th data point.
Problem when: greater than 1 in absolute value. This is the criterion coded into R.
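The leave-one-out computation behind DFBETAS can be sketched directly. A minimal numpy illustration with made-up data; note that standard implementations (the Belsley–Kuh–Welsch convention, as in R's dfbetas()) scale by the deleted-case estimate s_(i) times √((X′X)⁻¹_jj):

```python
import numpy as np

# Hypothetical data: simple regression with one high-leverage outlier.
rng = np.random.default_rng(2)
n = 12
x = np.append(np.linspace(0, 1, n - 1), 5.0)   # last case has an extreme x
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 1 + 2 * x + rng.normal(scale=0.1, size=n)
y[-1] += 4.0                                    # ...and an outlying y

b = np.linalg.lstsq(X, y, rcond=None)[0]        # full-data coefficients
XtX_inv = np.linalg.inv(X.T @ X)

# DFBETAS: change in each coefficient when case i is deleted, scaled by the
# deleted-case standard error s_(i) * sqrt((X'X)^{-1}_jj).
dfbetas = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_i = y[keep] - X[keep] @ b_i
    s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))
    dfbetas[i] = (b - b_i) / (s_i * np.sqrt(np.diag(XtX_inv)))

flagged = np.where(np.abs(dfbetas).max(axis=1) > 1)[0]   # |DFBETAS| > 1 rule
print(flagged)    # the high-leverage outlier is flagged
```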
Standardized difference in fitted values: DFFITS
Formula:

DFFITS_i = (ŷ_i − ŷ_i(i)) / s.e.(ŷ_i)

Problem when: greater than 3√(p/(n − p)) in absolute value (p = number of regression coefficients).
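DFFITS can likewise be computed by explicit deletion. A minimal numpy sketch (made-up data; the standard error of ŷ_i is estimated here, as in the usual Belsley–Kuh–Welsch definition, by s_(i)√h_i):

```python
import numpy as np

# Hypothetical data: simple regression with one high-leverage outlier.
rng = np.random.default_rng(3)
n = 12
x = np.append(np.linspace(0, 1, n - 1), 5.0)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 1 + 2 * x + rng.normal(scale=0.1, size=n)
y[-1] += 4.0

b = np.linalg.lstsq(X, y, rcond=None)[0]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# DFFITS: change in the i-th fitted value when case i is deleted,
# scaled by its estimated standard error s_(i) * sqrt(h_i).
dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_i = y[keep] - X[keep] @ b_i
    s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))
    dffits[i] = (X[i] @ (b - b_i)) / (s_i * np.sqrt(h[i]))

cutoff = 3 * np.sqrt(p / (n - p))      # the usual 3*sqrt(p/(n-p)) cut-off
print(np.where(np.abs(dffits) > cutoff)[0])
```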
COVRATIO & Cook’s D
• COVRATIO: measures the change in the standard errors of the estimated coefficients. Problem indicated when COVRATIO is more than 1 + 3p/n or less than 1 − 3p/n.
• Cook’s D: measures the overall change in the coefficients. Problem when more than qf(0.50, p, n−p) (the lower 50% point of the F distribution), roughly 1 in most cases.
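Cook's D admits a closed form in the residuals and leverages, equivalent to the coefficient-change definition. A minimal numpy sketch with made-up data, checking the two against each other:

```python
import numpy as np

# Hypothetical data: simple regression with one high-leverage outlier.
rng = np.random.default_rng(4)
n = 12
x = np.append(np.linspace(0, 1, n - 1), 5.0)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
y = 1 + 2 * x + rng.normal(scale=0.1, size=n)
y[-1] += 4.0

b = np.linalg.lstsq(X, y, rcond=None)[0]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

# Closed form: D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2)
D_closed = e**2 * h / (p * s2 * (1 - h) ** 2)

# Definition via deletion: D_i = (b - b_(i))' X'X (b - b_(i)) / (p s^2)
D_delete = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    d = b - b_i
    D_delete[i] = d @ (X.T @ X) @ d / (p * s2)

print(np.allclose(D_closed, D_delete))   # the two computations agree
```

The agreement follows from the identity b − b_(i) = (X′X)⁻¹ x_i e_i / (1 − h_i), which is also why Cook's D never requires n separate refits in practice.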
Calculations

> influence.measures(educ.lm)
Influence measures of
lm(formula = educ ~ under18 + percap, data = educ.df)

    dfb.1_  dfb.un18 dfb.prcp   dffit cov.r   cook.d    hat inf
10  0.06381 -0.02222 -0.16792 -0.3631 0.803 4.05e-02 0.0257   *
44  0.02289 -0.02948  0.00298 -0.0340 1.283 3.94e-04 0.1690   *
50 -2.36876  2.23393  1.50181  2.4733 0.821 1.66e+00 0.3429   *

Here p = 3, n = 50, so 3p/n = 0.18, 3√(p/(n−p)) = 0.758, and qf(0.5, 3, 47) = 0.8002294.
Plotting influence

# set up plot window with a 2 x 4 array of plots
par(mfrow=c(2,4))
# plot dfbetas, dffits, cov ratio, Cook's D, HMDs
influenceplots(educ.lm)
[Figure: index plots of the influence measures against observation number: dfb.1_, dfb.prcp, dfb.un18, DFFITS, ABS(COV RATIO − 1), Cook’s D and the hat values. Point 50 stands out in every panel except the COV RATIO plot, where points 10 and 44 are flagged.]
Remedies for outliers
• Correct typographical errors in the data
• Delete a small number of points and refit (we don’t want the fitted regression to be determined by one or two influential points)
• Report the existence of outliers separately: they are often of scientific interest
• Don’t delete too many points (1 or 2 at most)
Summary: Doing it in R

LR plot:
plot(educ.lm, which=5)

Full diagnostic display:
plot(educ.lm)

Influence measures:
influence.measures(educ.lm)

Plots of influence measures:
par(mfrow=c(2,4))
influenceplots(educ.lm)
HMD Summary
Hat matrix diagonals:
• Measure the effect of a point on its fitted value
• Measure how outlying the x-values are (how “high-leverage” a point is)
• Are always between 0 and 1, with bigger values indicating higher leverage
• Points with HMDs more than 3p/n are considered “high leverage”