PE I: Multivariable Regression - Outliers (Chapter 4.9)

Andrius Buteikis, [email protected]
http://web.vu.lt/mif/a.buteikis/

Page 2

Multiple Regression: Model Assumptions

Much like in the case of the univariate regression with one independent variable, the multiple regression model has a number of required assumptions:

(MR.1): Linear Model. The Data Generating Process (DGP), or in other words, the population, is described by a linear (in terms of the coefficients) model:

Y = Xβ + ε (MR.1)

(MR.2): Strict Exogeneity. The conditional expectation of ε, given all observations of the explanatory variable matrix X, is zero:

E(ε|X) = 0 (MR.2)

This assumption also implies that E(ε) = E(E(ε|X)) = 0, E(εX) = 0 and Cov(ε, X) = 0. Furthermore, this property implies that E(Y|X) = Xβ.

Page 3

(MR.3): Conditional Homoskedasticity. The variance-covariance matrix of the error term, conditional on X, is constant:

Var(ε|X) =
⎡ Var(ε1)      Cov(ε1, ε2)  ...  Cov(ε1, εN) ⎤
⎢ Cov(ε2, ε1)  Var(ε2)      ...  Cov(ε2, εN) ⎥
⎢ ...          ...          ...  ...         ⎥
⎣ Cov(εN, ε1)  Cov(εN, ε2)  ...  Var(εN)     ⎦
= σ²_ε I (MR.3)

(MR.4): Conditionally Uncorrelated Errors. The covariance between different error term pairs, conditional on X, is zero:

Cov(εi, εj|X) = 0, i ≠ j (MR.4)

This assumption implies that all error pairs are uncorrelated. For cross-sectional data, this assumption implies that there is no spatial correlation between errors.

Page 4

(MR.5): There exists no exact linear relationship between the explanatory variables. This means that:

c1 Xi1 + c2 Xi2 + ... + ck Xik = 0, ∀i = 1, ..., N ⟺ c1 = c2 = ... = ck = 0 (MR.5)

This assumption is violated if there exists some cj ≠ 0. Alternatively, this requirement means that:

rank(X) = k + 1

or, alternatively, that:

det(X⊤X) ≠ 0

This assumption is important because an exact linear relationship between independent variables means that we cannot estimate the effects of changes in each variable separately.

(MR.6) (optional): The residuals are normally distributed:

ε|X ∼ N(0, σ²_ε I) (MR.6)
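Since (MR.5) is stated as a rank and determinant condition, it can be checked numerically. Below is a minimal R sketch (added here for illustration, not part of the original slides; the variable names are hypothetical) that verifies both conditions on a simulated design matrix:

set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
X <- cbind(1, x1, x2)          # design matrix with an intercept column (k = 2)
qr(X)$rank                     # equals k + 1 = 3, so (MR.5) holds
det(t(X) %*% X)                # non-zero determinant
X_bad <- cbind(1, x1, 2 * x1)  # an exact linear relationship between regressors
qr(X_bad)$rank                 # 2 < k + 1: the rank condition fails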

Page 5

Outliers

Page 6

An outlier is an observation which is significantly different from other values in a random sample from a population.

If we collect all of the various problems that can arise, we can rank them in terms of severity:

outliers > non-linearity > heteroskedasticity > non-normality

Outlier Causes
Outliers can be caused by:
- measurement errors;
- being from a different process, compared to the rest of the data;
- not having a representative sample (e.g. measuring a single observation from a different city, when the remaining observations are all from one city).

Page 7

Outlier Consequences
Outliers can lead to misleading results in parameter estimation and hypothesis testing. This means that a single outlier can make it seem like:
- a non-linear model may be better suited to the data sample, as opposed to a linear model;
- the residuals are heteroskedastic, when in fact only one residual has a larger variance, which is different from the rest;
- the distribution is skewed (i.e. non-normal), because of a single observation/residual, which is significantly different from the rest.

We simulate such a sample, replacing the last observation with an extreme value:

set.seed(123)
N <- 100
x <- rnorm(mean = 8, sd = 2, n = N)
y <- 4 + 5 * x + rnorm(mean = 0, sd = 0.5, n = N)
y[N] <- -max(y)

Page 8

Outlier Detection
The broad definition of outliers means that the decision whether an observation should be considered an outlier is left to the econometrician/statistician/data scientist. Nevertheless, there are a number of different methods which can be used to identify abnormal observations.

Specifically, for regression models, outliers are also detected by comparing the true and fitted values. Assume that our true model is the linear regression:

Y = Xβ + ε (1)

Then, assume that we estimate β via OLS. Consequently, we can write the fitted values as:

Ŷ = Xβ̂ = X(X⊤X)⁻¹X⊤Y = HY

where H = X(X⊤X)⁻¹X⊤ is called the hat matrix (or the projection matrix), which is the orthogonal projection that maps the vector of the response values, Y, to the vector of fitted/predicted values, Ŷ. It describes the influence that each response value has on each fitted value, which is why H is sometimes also referred to as the influence matrix.
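As a quick sanity check (added here, not in the original slides), we can construct H for the simulated sample above and confirm that HY reproduces the fitted values of lm():

X <- cbind(1, x)                          # design matrix for the simulated sample
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix H = X (X^T X)^(-1) X^T
all.equal(as.vector(H %*% y),
          unname(fitted(lm(y ~ 1 + x))))  # TRUE: H maps Y to the fitted values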

Page 9

To understand the projection matrix a bit better, do not treat the fitted values as something that is separate from the true values.
- Instead, assume that you have two sets of values: Y and Ŷ.
- Ideally, we would want Ŷ = Y.
- Assuming that the linear relationship, Y = Xβ + ε, holds, this will generally not be possible because of the random shocks ε.

However, the closest approximation would be the conditional expectation of Y, given a design matrix X, since we know that the conditional expectation is the best predictor, from the proof in Ch. 3.7.

Page 10

The Conditional Expectation is the Best Predictor (Ch. 3.7)
We begin by outlining the main properties of the conditional moments, which will be useful (assume that X and Y are random variables):

- Law of total expectation: E[E(h(Y)|X)] = E[h(Y)];
- Conditional variance: Var(Y|X) := E((Y − E[Y|X])²|X) = E(Y²|X) − (E[Y|X])²;
- Variance of the conditional expectation: Var(E[Y|X]) = E[(E[Y|X])²] − (E[E[Y|X]])² = E[(E[Y|X])²] − (E[Y])²;
- Expectation of the conditional variance: E[Var(Y|X)] = E[(Y − E[Y|X])²] = E[E(Y²|X)] − E[(E[Y|X])²] = E[Y²] − E[(E[Y|X])²];
- Adding the third and fourth properties together gives us: Var(Y) = E[Y²] − (E[Y])² = Var(E[Y|X]) + E[Var(Y|X)].

For simplicity, assume that we are interested in the prediction of Y via the conditional expectation:

Ŷ = E(Y|X)

We will show that, in general, the conditional expectation is the best predictor of Y.

Page 11

Assume that the best predictor of Y (a single value), given X, is some function g(·) which minimizes the expected squared error:

argmin_{g(X)} E[(Y − g(X))²].

Using the conditional moment properties, we can rewrite E[(Y − g(X))²] as:

E[(Y − g(X))²] = E[(Y + E[Y|X] − E[Y|X] − g(X))²]
  = E[(Y − E[Y|X])² + 2(Y − E[Y|X])(E[Y|X] − g(X)) + (E[Y|X] − g(X))²]
  = E[E((Y − E[Y|X])²|X)] + E[2(E[Y|X] − g(X)) E(Y − E[Y|X]|X) + E((E[Y|X] − g(X))²|X)]
  = E[Var(Y|X)] + E[(E[Y|X] − g(X))²],

where the cross term vanishes because E(Y − E[Y|X]|X) = E[Y|X] − E[Y|X] = 0.

Taking g(X) = E[Y|X] minimizes the above expression to the expectation of the conditional variance of Y given X:

E[(Y − E[Y|X])²] = E[Var(Y|X)].

Thus, g(X) = E[Y|X] is the best predictor of Y.
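To make this concrete, here is a small simulation (an illustration added here, not part of the slides) where E[Y|X] is known exactly, so its mean squared error can be compared against other candidate predictors g(X):

set.seed(1)
n <- 1e5
X <- rnorm(n)
Y <- 2 + 3 * X + rnorm(n)     # here E[Y|X] = 2 + 3X and Var(Y|X) = 1
mse <- function(pred) mean((Y - pred)^2)
mse(2 + 3 * X)                # about 1 = E[Var(Y|X)], the minimum
mse(2 + 2.5 * X)              # a different g(X) has a larger MSE
mse(rep(mean(Y), n))          # the unconditional mean does even worse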

Page 12

Going back to our projection matrix...

Using the OLS estimate of β, the best predictor (i.e. the conditional expectation) maps the values of Y to the values of Ŷ via the projection matrix H.

The projection matrix can be utilized when calculating leverage scores and Cook's distance, which are used to identify influential observations.

Page 13

Leverage Score of Observations
Leverage measures how far away an observation of a predictor variable, X, is from the mean of the predictor variable.

For the linear regression model, the leverage score of the i-th observation is defined as the i-th diagonal element of the projection matrix H = X(X⊤X)⁻¹X⊤, which is equivalent to taking the partial derivative of Ŷi with respect to Yi:

hii = ∂Ŷi/∂Yi = (H)ii

Defining the leverage score via the partial derivative allows us to interpret it as the observation's self-influence, which describes how the actual value, Yi, influences the fitted value, Ŷi.
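For our simulated data, the diagonal of H can be compared directly with R's built-in hatvalues() (a check added here, not in the slides):

X <- cbind(1, x)
h_manual <- diag(X %*% solve(t(X) %*% X) %*% t(X))
all.equal(h_manual, unname(hatvalues(lm(y ~ 1 + x))))  # TRUE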

Page 14

The leverage score hii is bounded:

0 ≤ hii ≤ 1

Proof. Noting that H is symmetric and the fact that it is an idempotent matrix:

H² = HH = X(X⊤X)⁻¹X⊤X(X⊤X)⁻¹X⊤ = XI(X⊤X)⁻¹X⊤ = H

we can examine the diagonal elements of the equality H² = H to get the following bounds for hii:

hii = hii² + Σ_{j≠i} hij² ≥ 0

hii ≥ hii² ⟹ hii ≤ 1
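Both the idempotency and the bounds are easy to verify numerically (an illustrative check, not from the slides):

X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H %*% H, H)  # TRUE: H is idempotent
range(diag(H))         # both values lie inside [0, 1]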

Page 15

We can also relate the residuals to the leverage score:

ε̂ = Y − Ŷ = (I − H)Y

Examining the variance-covariance matrix of the regression errors, we see that:

Var(ε̂) = Var((I − H)Y) = (I − H)Var(Y)(I − H)⊤ = σ²(I − H)(I − H)⊤ = σ²(I − H),

where we have used the fact that (I − H) is idempotent and Var(Y) = σ²I.

Since the diagonal elements of the variance-covariance matrix are the variances of each observation, we have that Var(ε̂i) = (1 − hii)σ².

Thus, we can see that a leverage score of hii ≈ 0 would indicate that the i-th observation has no influence on the error variance, which would mean that its variance is close to the true (unobserved) variance σ².

Observations with leverage score values larger than 2(k + 1)/N are considered to be potentially highly influential.
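In R, this rule-of-thumb can be applied directly to hatvalues(); a sketch (not from the slides), with k = 1 regressor in our simulated example:

h <- hatvalues(lm(y ~ 1 + x))
cutoff <- 2 * (1 + 1) / N  # 2(k + 1)/N with k = 1
which(h > cutoff)          # indices of potentially highly influential observations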

Page 16

Assume that we estimate the model via OLS:

mdl_1_fit <- lm(y ~ 1 + x)

Studentized Residuals
The studentized residuals are related to the standardized residuals, as they are defined as:

ti = ε̂i / (σ̂ √(1 − hii))

The main distinction comes from the calculation of σ̂, which can be carried out in two ways:
- Standardized residuals use the internally studentized residual variance estimate:

σ̂² = (1 / (N − (k + 1))) Σ_{j=1}^{N} ε̂j²

- If we suspect the i-th residual of being improbably large (i.e. it cannot be from the same normal distribution as the remaining residuals), we exclude it from the variance estimation by calculating the externally studentized residual variance estimate:

σ̂²(i) = (1 / (N − (k + 1) − 1)) Σ_{j=1, j≠i}^{N} ε̂j²
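Both estimates can be reproduced manually and checked against R's rstandard() and rstudent() (a sketch added here, not in the slides; it uses the leave-one-out identity RSS(i) = RSS − ε̂i²/(1 − hii) to avoid re-fitting the model N times):

e <- resid(mdl_1_fit)
h <- hatvalues(mdl_1_fit)
s2 <- sum(e^2) / (N - 2)                      # internal estimate: N - (k + 1), k = 1
all.equal(e / sqrt(s2 * (1 - h)), rstandard(mdl_1_fit))    # TRUE
s2_i <- (sum(e^2) - e^2 / (1 - h)) / (N - 3)  # external estimate for each i
all.equal(e / sqrt(s2_i * (1 - h)), rstudent(mdl_1_fit))   # TRUE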

Page 17

If the residuals are independent and ε ∼ N(0, σ²I), then the distribution of the studentized residuals depends on the calculation of the variance estimate:
- If the residuals are internally studentized, they follow a tau distribution:

ti ∼ √v · t_{v−1} / √(t²_{v−1} + v − 1), where v = N − (k + 1)

- If the residuals are externally studentized, they follow a Student's t-distribution (we will also refer to them as ti(i)):

ti = ti(i) ∼ t_{N − (k + 1) − 1}

Observations with studentized residual values larger than 3 in absolute value could be considered outliers.
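Applied to our simulated sample, this rule flags only the artificial outlier (a one-line check, not from the slides):

which(abs(rstudent(mdl_1_fit)) > 3)  # observation 100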

Page 18

We can plot the studentized and standardized residuals:

olsrr::ols_plot_resid_stud(mdl_1_fit)

[Figure: "Studentized Residuals Plot" — deleted studentized residuals vs. observation index with threshold abs(3); observation 100 is flagged as an outlier, all others as normal.]

Page 19

olsrr::ols_plot_resid_stand(mdl_1_fit)

[Figure: "Standardized Residuals Chart" — standardized residuals vs. observation index with threshold abs(2); observation 100 stands far below the threshold.]

Page 20

We can examine the same plots on the model with the outlier observation removed from the data:

olsrr::ols_plot_resid_stud(lm(y[-N] ~ 1 + x[-N]))

[Figure: "Studentized Residuals Plot" — threshold abs(3); all observations are marked as normal, none are flagged as outliers.]

Page 21

olsrr::ols_plot_resid_stand(lm(y[-N] ~ 1 + x[-N]))

[Figure: "Standardized Residuals Chart" — threshold abs(2); observations 39, 49, 64, 74 and 96 exceed the threshold.]

Page 22

While the studentized residuals appear to have no outliers, the standardized residuals indicate that a few observations may be influential. Since we have simulated the data, we know that it contained only one outlier. Consequently, we should not treat all observations outside the threshold as definite outliers.

Page 23

We may also be interested in plotting the studentized residuals against the leverage points:

olsrr::ols_plot_resid_lev(mdl_1_fit)

[Figure: "Outlier and Leverage Diagnostics for y" — RStudent vs. leverage with leverage threshold 0.04; observations 6, 16, 18, 26, 44, 57, 70, 72 and 97 are marked as leverage points, and observation 100 as an outlier.]

Page 24

olsrr::ols_plot_resid_lev(lm(y[-N] ~ 1 + x[-N]))

[Figure: "Outlier and Leverage Diagnostics for y[-N]" — RStudent vs. leverage with threshold 0.04; observations 6, 16, 18, 26, 35, 39, 44, 49, 57, 64, 70, 72, 74, 96 and 97 are flagged as leverage points or outliers.]

Page 25

This plot combines the leverage score, which flags influential explanatory variable observations, and the studentized residual plot, which flags outliers among the residuals, i.e. the differences between the actual and fitted dependent variable values.

Page 26

Influential Observations
Influential observations are defined as observations which have a large effect on the results of a regression.

DFBETAS
The DFBETAi vector measures how much an observation i has affected the estimate of the regression coefficient vector. It measures the difference between the regression coefficients calculated on all of the data and the regression coefficients calculated with observation i deleted:

DFBETAi = (β̂ − β̂(i)) / √(σ̂²(i) diag((X⊤X)⁻¹))

Observations with a DFBETA value larger than 2/√N in absolute value should be carefully inspected. The recommended general cutoff (absolute) value is 2.

Page 27

We can calculate the appropriate DFBETAS for the last 5 observations as follows:

dfbetas_manual <- NULL
for(i in (N-4):N){
  mdl_2_fit <- lm(y[-i] ~ 1 + x[-i])
  numerator <- mdl_1_fit$coef - mdl_2_fit$coef
  denominator <- sqrt((summary(mdl_2_fit)$sigma^2) * diag(solve(t(cbind(1, x)) %*% cbind(1, x))))
  dfbetas_manual <- rbind(dfbetas_manual, numerator / denominator)
}
print(dfbetas_manual)

##        (Intercept)            x
## [1,]   0.028743821 -0.022789554
## [2,]   0.030744687 -0.034844559
## [3,]   0.020403791 -0.024298429
## [4,]   0.006702931 -0.004242548
## [5,] -29.230784828 25.362876769

While these calculations are a bit more involved, we can use the built-in functions as well:

print(tail(dfbetas(mdl_1_fit), 5))

##       (Intercept)            x
## 96    0.028743821 -0.022789554
## 97    0.030744687 -0.034844559
## 98    0.020403791 -0.024298429
## 99    0.006702931 -0.004242548
## 100 -29.230784828 25.362876769

Page 28

If we wanted, we could also plot these values:

olsrr::ols_plot_dfbetas(mdl_1_fit)

[Figure: "Influence Diagnostics for (Intercept)" and "Influence Diagnostics for x" — DFBETAS vs. observation index with threshold 0.2; observation 100 dominates both panels.]

Page 29

If we were to remove the last observation and examine the DFBETAS plot:

olsrr::ols_plot_dfbetas(lm(y[-N] ~ 1 + x[-N]))

[Figure: "Influence Diagnostics for (Intercept)" and "Influence Diagnostics for x[-N]" — threshold 0.2; observations 8, 25, 43, 44, 64, 74, 96 and 97 exceed the threshold in one panel or the other.]

Page 30

We see that there are some observations which may be worth examining. In this case, we know that there are no more outliers because we have simulated the data ourselves. So this is a good example that you should not blindly trust the above charts, as the influential observations are not necessarily outliers.

Page 31

DFFITS
DFFITS measures how much an observation i has affected the fitted value of a regression. It is defined as the studentized difference between the fitted values from a regression estimated on all of the data and the fitted values from a regression estimated on the data with observation i deleted:

DFFITSi = (Ŷi − Ŷi(i)) / (σ̂(i) √hii) = ti(i) √(hii / (1 − hii))

where ti(i) is the externally studentized residual.

tmp_val <- dffits(mdl_1_fit)
print(format(tail(cbind(tmp_val), 10), scientific = FALSE))

##            tmp_val
## 91  " -0.0005235787"
## 92  "  0.0031760359"
## 93  "  0.0091761236"
## 94  "  0.0199891169"
## 95  " -0.0226748095"
## 96  "  0.0376495768"
## 97  " -0.0379725909"
## 98  " -0.0287153104"
## 99  "  0.0125545090"
## 100 "-32.6910440413"

Observations with a DFFITS value larger than 2√((k + 1)/N) in absolute value could be considered highly influential.
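The cutoff can be computed and applied directly (a sketch, not from the slides); with k = 1 and N = 100 it equals 2√(2/100) ≈ 0.28, matching the threshold in the plots below:

cutoff <- 2 * sqrt((1 + 1) / N)         # 2 * sqrt((k + 1)/N) with k = 1
which(abs(dffits(mdl_1_fit)) > cutoff)  # observations exceeding the DFFITS cutoff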

Page 32

olsrr::ols_plot_dffits(mdl_1_fit)

[Figure: "Influence Diagnostics for y" — DFFITS vs. observation index with threshold 0.28; observation 100 lies far below the threshold band.]

Page 33

olsrr::ols_plot_dffits(lm(y[-N] ~ 1 + x[-N]))

[Figure: "Influence Diagnostics for y[-N]" — threshold 0.28; observations 8, 43, 44, 49, 64 and 74 exceed the threshold.]

Page 34

Similarly to what we have observed with DFBETAS, we should not blindly trust that each value outside the cutoff region is an outlier. Instead, we should treat them as influential observations, which need additional analysis to determine whether they are acceptable.

Page 35

Cook's Distance
Cook's D measures the aggregate impact of each observation on the group of regression coefficients, as well as on the group of fitted values. It can be used to:
- indicate influential data points (i.e. potential outliers);
- indicate regions where more observations would be needed.

Cook's distance for observation i is defined as:

Di = Σ_{j=1}^{N} (Ŷj − Ŷj(i))² / ((k + 1) σ̂²) = (ε̂i² / ((k + 1) σ̂²)) · [hii / (1 − hii)²]

where:
- Ŷj(i) is the fitted value of Yj, obtained by excluding the i-th observation and re-estimating the same model via OLS;
- σ̂² = ε̂⊤ε̂ / (N − (k + 1)) is the mean squared error of the error term.

Note: in practical terms, it may be easier to use the leverage score expression of Di instead of re-estimating the model for each observation.
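The leverage score expression can be verified against R's built-in cooks.distance() (a check added here, not in the slides):

e <- resid(mdl_1_fit)
h <- hatvalues(mdl_1_fit)
s2 <- sum(e^2) / (N - 2)                        # sigma^2 estimate: N - (k + 1), k = 1
D_manual <- e^2 / ((1 + 1) * s2) * h / (1 - h)^2
all.equal(D_manual, cooks.distance(mdl_1_fit))  # TRUE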

Page 36

tmp_val <- cooks.distance(mdl_1_fit)
print(format(tail(cbind(tmp_val), 10), scientific = FALSE))

##           tmp_val
## 91  "0.0000001384804"
## 92  "0.0000050955563"
## 93  "0.0000425310902"
## 94  "0.0002017917033"
## 95  "0.0002596785491"
## 96  "0.0007154000322"
## 97  "0.0007282312180"
## 98  "0.0004164378945"
## 99  "0.0000796089714"
## 100 "1.2596831219103"

Cook's distance values which are:
- larger than 4/N (the traditional cut-off);
- larger than 3 × (1/N) Σ_{i=1}^{N} Di

could be considered highly influential.
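Both cut-offs are easy to apply (a sketch, not from the slides):

D <- cooks.distance(mdl_1_fit)
which(D > 4 / N)        # traditional cut-off: 4/100 = 0.04
which(D > 3 * mean(D))  # three times the average Cook's distance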

Page 37

We can plot the Di points:

olsrr::ols_plot_cooksd_bar(mdl_1_fit)

[Figure: "Cook's D Bar Plot" — threshold 0.04; observation 100 is flagged as an outlier.]

Page 38

olsrr::ols_plot_cooksd_chart(mdl_1_fit)

[Figure: "Cook's D Chart" — threshold 0.04; observation 100 again stands out.]

Page 39

As well as plot the Di on the data without the outlier observation:

olsrr::ols_plot_cooksd_bar(lm(y[-N] ~ 1 + x[-N]))

[Figure: "Cook's D Bar Plot" — threshold 0.04; observations 8, 43, 44, 49, 64 and 74 exceed the threshold.]

Page 40

olsrr::ols_plot_cooksd_chart(lm(y[-N] ~ 1 + x[-N]))

[Figure: "Cook's D Chart" — threshold 0.04; the same observations 8, 43, 44, 49, 64 and 74 are flagged.]

We again see a similar result, as with DFBETAS and DFFITS.

Page 41

Also note that R has a lot of different plots for the default lm model output:

par(mfrow = c(3, 2), mar = c(2, 2, 2, 2))
for(i in 1:6){
  plot(mdl_1_fit, which = i)
}

[Figure: the six base-R diagnostic panels for mdl_1_fit — Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance, Residuals vs Leverage, and Cook's dist vs Leverage hii/(1 − hii); observations 100, 72, 64 and 18 are labelled across the panels.]

Page 42

par(mfrow = c(3, 2), mar = c(2, 2, 2, 2))
for(i in 1:6){
  plot(lm(y[-N] ~ 1 + x[-N]), which = i)
}

[Figure: the same six diagnostic panels for the model without the outlier; observations 64, 49, 74, 44 and 8 are labelled across the panels.]

Page 43

Addressing Outliers
After determining that a specific observation is indeed an outlier, we want to address it in some way.

Capping the Outliers
If we find that the explanatory variables X1,i, ..., Xk,i of an outlier value Yi are similar to those of other observations with non-outlier values of Yi, we may cap the value of the outlier to match those values, as in the sketch below.
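A minimal capping (winsorizing) sketch, not from the slides; the percentile bounds are an arbitrary illustrative choice:

q <- quantile(y, probs = c(0.01, 0.99))  # lower and upper capping bounds
y_capped <- pmin(pmax(y, q[1]), q[2])    # pull extreme values back to the bounds
range(y); range(y_capped)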

Replacing Outliers with Imputed Values
If we are certain that the outlier is due to some error in the data itself, we could try to impute the observations by treating them as missing values and substituting them with some average value of Y. The Expectation-Maximization (EM) algorithm could be utilized for missing data imputation.

Deleting Outliers
In some cases, if we are absolutely sure that the observation is an outlier which is either completely unlikely, or impossible to encounter again, we could drop it.

Robust Regression
In addition to the methods mentioned before, we could also run a robust regression.

Page 44

In our example, we know that the last observation was generated differently, and is thus an outlier which we can delete. We can compare how our model would look on the whole dataset and after dropping the outlier observation:

plot(x, y)
lines(x, mdl_1_fit$fitted.values, col = "red")
lines(x[-N], lm(y[-N] ~ 1 + x[-N])$fitted.values, col = "blue")
points(x[N], y[N], pch = 19, col = "red")
legend("topleft", lty = 1, col = c("red", "blue"),
       legend = c("with outlier", "deleted outlier"))

[Figure: scatter plot of y against x with two fitted lines — the red line ("with outlier") is pulled towards the extreme observation (marked as a filled red point), while the blue line ("deleted outlier") tracks the bulk of the data.]