Robust Methods in Regression Analysis: Comparison and Improvement
By
Mohammad Abd-Almonem H. Al-Amleh
Supervisor
Professor Faris M. Al-Athari
This Thesis was Submitted in Partial Fulfillment of the Requirements for the Master's
Degree of Science in Mathematics
Faculty of Graduate Studies
Zarqa University
December, 2015
Dedication
To my dear mom, to my dear wife, and to my beloved daughters …
Dana and Ruba.
Acknowledgments
I wish to thank all who have provided me support and assistance during my research. I am
deeply grateful to my supervisor Professor Faris M. Al-Athari. He has provided the
guidance and instruction that I value greatly.
I would like to thank the committee members for their cooperation.
List of Contents
Committee Decision …………………………………………………………….……… ii
Dedication ……………………………………………………………………….……..iii
Acknowledgments ……………………………………………………………….………iv
List of Contents …………………………………………………………………….…….v
List of Tables ……………………………………………………………………………..vii
List of Figures…………………………………………………………………….……..…ix
List of Abbreviations…………………………………………………………………..…..x
Abstract……………………………………………………………….…………….….…xi
Introduction…………………………………………………………………………………1
Chapter one: Regression Analysis
1.1 Linear Regression Model ………………………………………………………....……6
1.2 Least Squares Method……………..……………………………………..……………..8
Chapter 2: Outliers and Influential Points
2.1 Introduction……………………………………………………………………………12
2.2 Residual Analysis ……………………………………………………………..………13
2.3 The Effect of Outliers on the Least Squares Method………………..…………………….14
2.4 Types of Outliers………………………………………………….…………………..16
2.5 Simple Criteria for Model Comparison……………………………………………….18
2.6 Identifying Outlying Y-observation…………………………………………..……….18
2.7 Identifying Outlying X-observation……………………………………………..…….21
2.8 Identifying Influential Cases……..…………………………………………..……….22
2.9 Example……………………………………………………………………..…………23
Chapter 3: Robust Regression Methods
3.1 Overview……………………………………………………………………….……..29
3.2 Properties of Robust Estimators…….…………………………………………...……31
3.3 Robust Regression Methods…………………………………………………….…….33
Chapter 4: Comparison among Robust Regression Methods
4.1 Real Life Data…………………..……………………………………………………..46
4.2 Simulation Study……………………………………………………………………...55
4.3 Discussion…………………………………………………………………..……...….65
Conclusion…………………………………………………………………………………66
References………………………………………………………………………..………..68
Abstract ( in Arabic) ………………………………………………………..…………….72
List of Tables
Table (2.1): Steel employment by Country in Europe…………………………………….24
Table (2.2): Fitted values, residuals, leverage, studentized residuals, R-student residuals,
and Cook's distances.............................................................................................................25
Table (2.3): Estimated parameters, √MSRes, leverage, and R² for different scenarios in the
Steel Employment data………………………….…………………………...…….………28
Table (3.1): Different objective functions ρ(u), and their properties: range of u, influence
function φ(u), and weight w(u)………………………………….…………….37
Table (4.1): The estimated parameters in 7 robust methods and OLS for the Steel
Employment data set that contain outliers and clean data ………………………………...47
Table (4.2): Fitted values, residuals, and weights of OLS (10)…………………..…….…51
Table (4.3): Fitted values, residuals, and weights of M-H (10)……………………………51
Table (4.4): Fitted values, residuals, and weights of M-T (10)……………………………52
Table (4.5): Fitted values, and residuals of RM (10)………………………….………….52
Table (4.6): Fitted values, and residuals of LMS (10)………………………...………….53
Table (4.7): Fitted values, and residuals of LTS (10)……………………………..….…..53
Table (4.8): Fitted Values, and residuals of S(10)……………………………………..…54
Table (4.9): Fitted values, residuals, and weights of MM (10)………………………..….54
Table (4.10): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case1:ϵ~ N(0,1)……………………………………………...….57
Table (4.11): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case1:ϵ~ N(0,1)…………………………………………….…57
Table (4.12): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 2: ϵ ~ N (0,1) with 10% identical outliers in y direction
(where we let the first 10% of y's equal to 40)…………………………………………….58
Table (4.13): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 2:ϵ ~ N (0,1) with 10% identical outliers in y direction
(where we let the first 10% of y's equal to 40)…………………………………………….58
Table (4.14): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case3:ϵ~ N(0,1) with 25% identical outliers in y direction (where
we let the first 25 % of y's equal to 40)………………………………...………………….59
Table (4.15): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 3: ϵ ~ N(0,1) with 25% identical outliers in y direction
(where we let the first 25% of y's equal to 40)…………………………………………....59
Table (4.16): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case4:ϵ~ N(0,1) with 10 % identical outliers in x direction (where
we let the first 10 % of x's equal to 30)..........................................………………………..60
Table (4.17): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 4:ϵ~ N(0,1) with 10 % identical outliers in x direction
(where we let the first 10 % of x's equal to 30)……………………………………………60
Table (4.18): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 5:ϵ~ N(0,1) with 25 % identical outliers in x direction
(where we let the first 25 % of x's equal to 30)……………................................................61
Table (4.19): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 5:ϵ~ N(0,1) with 25 % identical outliers in x direction
(where we let the first 25 % of x's equal to 30)…………………………………………..61
Table (4.20): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 6: ϵ ~ 0.90N(0, 1) + 0.10N(0, 100) ………………………..62
Table (4.21): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 6: ϵ ~ 0.90N(0, 1) + 0.10N(0, 100)………………………62
Table (4.22): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 7: ϵ ~ Laplace(0,4)………………………………………...63
Table (4.23): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 7: ϵ ~ Laplace(0,4)………………………………………..63
Table (4.24): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 8: ϵ ~ t₃……………………………………………………..64
Table (4.25): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 50 Case 8: ϵ ~ t₃……………….………………………………….…64
List of Figures
Figure (2.1): The effect of outlying points on the ordinary least squares method……………..15
Figure (2.2): Scatter plot for different types of outlying observations…………………….17
Figure (2.3): Plot of the steel employment data…………………………………….…..…24
Figure (2.4): The least squares line for the steel employment example……………………25
Figure (2.5): Plot of the residuals of least squares estimates in the steel employment example……….26
Figure (2.6): Plot of the hat values hᵢᵢ of least squares estimates in the steel employment example……………27
Figure (2.7): Plot of the studentized residuals of least squares estimates in the steel employment example………...…27
Figure (2.8): Plot of the R-student residuals of least squares estimates in the steel employment example……………...27
Figure (2.9): Plot of the Cook's distances of least squares estimates in the steel employment example……………...28
Figure (3.1): Plot of Huber's weight function, with tuning constant a = 1.345…………..…38
Figure (3.2): Plot of Tukey's weight function, with tuning constant b = 4.685……………..38
Figure (3.3): Plot of the ordinary least squares weight function……………………………....38
Figure (4.1): Fitted lines for OLS(10), M-H(10), OLS(8), and M-H(8) in the steel employment example……………..48
Figure (4.2): Fitted lines for OLS(10), M-T(10), OLS(8), and M-T(8) in the steel employment example……………..48
Figure (4.3): Fitted lines for OLS(10), RM(10), OLS(8), and RM(8) in the steel employment example……………..49
Figure (4.4): Fitted lines for OLS(10), LMS(10), OLS(8), and LMS(8) in the steel employment example……………..49
Figure (4.5): Fitted lines for OLS(10), LTS(10), OLS(8), and LTS(8) in the steel employment example……………..49
Figure (4.6): Fitted lines for OLS(10), S(10), OLS(8), and S(8) in the steel employment example……………..50
Figure (4.7): Fitted lines for OLS(10), MM(10), OLS(8), and MM(8) in the steel employment example……………..50
List of Abbreviations
𝐸(. ) Expected value
Cov(. ) Variance-Covariance matrix
ŷᵢ Fitted values
𝑒𝑖 Residuals
𝜀𝑖 Errors
𝜎2 Variance
MSRes Residual mean square
SSE Residual sum of squares
H Hat matrix
𝑅2 Coefficient of determination
OLS Ordinary Least Squares Method
LAD Least Absolute Deviation Method
M-H M-estimation with Huber weights
M-T M-estimation with Tukey weights
RM Repeated Median Method
LMS Least Median of Squares Method
LTS Least Trimmed sum of Squares
S S-estimators
MM MM-estimators
MSE Mean Square Error
RMSE Relative Mean Square Error
Abstract
The ordinary least squares method is the best method for regression analysis under certain assumptions. Violations of these assumptions, such as the presence of outliers, may have an adverse effect on the estimates and on the results associated with this method. Robust methods have therefore been proposed to cope with such unusual observations. The purpose of this thesis is to describe the behavior of outliers in linear regression and their influence on the ordinary least squares method. In addition, we investigated various robust regression methods, including M-estimation, repeated medians, least median of squares, least trimmed squares estimation, scale estimation (S-estimation), and MM-estimation.
Finally, we compared these robust estimators, in terms of their robustness and efficiency, through a simulation study. A real-data application is also provided to compare the robust estimates with the traditional least squares estimator.
In this thesis, we found that robust regression methods are not sensitive to the presence of outliers, since they can downweight or ignore such observations. We also found that MM-estimators are the best robust estimators and can be used efficiently under different contamination models.
Introduction
Regression Analysis is a statistical technique for investigating and modeling the
relationship between dependent and independent variables. Applications of regression are
numerous and occur in almost every field, including engineering, the physical and
chemical sciences, economics, management, biological sciences, and the social sciences. In
fact, regression analysis may be the most widely used statistical technique. The main objectives of regression analysis are:
1. Prediction of future observations.
2. Assessment of the effect of, or the relationship between, the independent variables and the response (dependent) variable.
3. A general description of the data structure.
4. Control purposes.
Frequently in regression analysis applications, the data set contains some observations
which are outliers or extremes, i.e., observations which are well separated from the rest of
the data. These outlying observations may involve large residuals and often have dramatic
effects on the fitted least squares regression function. It is, therefore, important to study the
outlying observations carefully and their effect on the adequacy of the model. An
observation can be an outlier due to the dependent variable or any one or more of the
independent variables having values outside the expected limits. A data point may be an
outlier or a potentially influential point because of errors in the conduct of the study
(machine malfunction; recording, coding, or data entry errors; failure to follow the
experimental protocol) or because the data point is from a different population. The
danger of outlying observations to the least squares estimation is that they can have a
strong adverse effect on the estimate and they may remain unnoticed. Therefore, statistical
techniques that are able to cope with or to detect outlying observations have been
developed. Robust regression is an important method for analyzing data that are
contaminated with outliers. It can be used to detect outliers and to provide resistant results
in the presence of outliers.
Purposes of the Thesis
The main purposes of this thesis are:
1. Studying outliers and influential points and their effect on the least squares method.
2. Reviewing various robust methods that are used in regression analysis, and studying their properties, such as breakdown point and efficiency.
3. Comparing the robust methods and their performance in different contamination situations and under heavy-tailed distributions.
Methodology
In order to study the effect of outliers and influential observations on the least squares method, we used real data containing outliers, and the least squares estimates were then compared with and without the outliers. The same technique was used to compare the robust regression methods and their properties. In addition, a simulation study was used to compare the robust methods and the least squares method under different scenarios. Calculations and simulations were done using R version 3.2.2.
In fact, we used references [6], [8], [9], and [29] as a guide for this program.
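The simulations in the thesis were carried out in R 3.2.2, as noted above. Purely as an illustrative sketch of this kind of comparison (not the thesis code, and in Python/NumPy rather than R), the following compares ordinary least squares with a simplified Huber M-fit under y-direction contamination, mirroring one of the simulation scenarios; the data, seed, and helper names are made up for illustration:

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares fit."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def huber_m(X, y, a=1.345, iters=50):
    """Simplified Huber M-estimator via iteratively reweighted least squares."""
    beta = ols(X, y)
    for _ in range(iters):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # robust scale (MAD)
        s = s if s > 0 else 1.0
        u = r / s
        w = a / np.maximum(np.abs(u), a)    # Huber weights: min(1, a/|u|)
        sw = np.sqrt(w)
        beta = ols(X * sw[:, None], y * sw)
    return beta

rng = np.random.default_rng(1)
n, beta_true = 20, np.array([1.0, 2.0])
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = X @ beta_true + rng.standard_normal(n)
y[:2] = 40.0                    # 10% identical outliers in the y direction

print("OLS:  ", ols(X, y))      # pulled away from the true parameters
print("Huber:", huber_m(X, y))  # typically stays near the true (1, 2)
```

The robust fit differs from OLS only in the weights it assigns: observations with large standardized residuals are downweighted before refitting, which is the general pattern of the M-estimation methods studied in Chapter 3.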
Literature Survey
There are various methods that can be used to construct a linear regression model and
estimate the regression coefficients. One of the most popular methods is the ordinary least
squares method. It was discovered by Carl Friedrich Gauss in Germany and Adrien-Marie Legendre in France around 1805. The purpose of this procedure is to optimize the fit by minimizing the sum of the squares of the errors. Since the OLS estimator could be calculated explicitly from the data, it remained the only feasible approach in regression for many decades. After Gauss introduced the normal (Gaussian) distribution as the error distribution, ordinary least squares became optimal and, in addition, very important mathematical results were obtained. Even now, the OLS procedure is still used because of tradition and ease of computation.
Edgeworth proposed the least absolute deviation method in 1887. He argued that since the errors are squared in the least squares procedure, outliers have a large effect on the estimates. Hence, he suggested minimizing the sum of the absolute values of the errors (Birkes and Dodge, 1993).
Recently, there has been an increasing interest in other methods, from the realization that it is very difficult for real-life data to satisfy the necessary assumptions. Another reason is the advance in computer technology, which has decreased the computational difficulties of other methods.
Many robust methods have been proposed to deal with outliers. Huber introduced a class of estimators known as M-estimators that aim to minimize ∑ᵢ₌₁ⁿ ρ(uᵢ), where ρ(uᵢ) is some symmetric function of the errors. Several ρ functions have been proposed; those of Huber and Hampel are two examples (Huber, 2009).
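As a concrete example, Huber's ρ function is quadratic for small errors and linear for large ones, so its derivative, the influence function, is bounded. A minimal sketch of the three associated functions (using the tuning constant a = 1.345, the value that appears later in the thesis for Huber weights):

```python
import numpy as np

A = 1.345  # Huber's tuning constant, as used for the weight function in Chapter 3

def rho(u, a=A):
    """Huber's objective function: quadratic in the middle, linear in the tails."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= a, 0.5 * u**2, a * np.abs(u) - 0.5 * a**2)

def psi(u, a=A):
    """Influence function psi(u) = rho'(u): bounded at +/- a."""
    u = np.asarray(u, dtype=float)
    return np.clip(u, -a, a)

def weight(u, a=A):
    """Weight function w(u) = psi(u)/u: 1 in the middle, a/|u| in the tails."""
    u = np.asarray(u, dtype=float)
    return np.minimum(1.0, a / np.maximum(np.abs(u), a))

# A large residual gets bounded influence and a small weight.
print(float(psi(10.0)), float(weight(10.0)))
```

Because ψ is bounded, a single wild observation contributes at most a fixed amount to the estimating equations, which is the source of the robustness of M-estimators.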
The least median of squares (LMS) estimate was introduced by Rousseeuw (1984), who replaced the sum of squared deviations in ordinary least squares by the median of the squared deviations, which is a robust estimator of location.
Rousseeuw (1984) also developed the least trimmed sum of squares (LTS), which minimizes the sum of the h smallest ordered squared residuals, where h is a constant that must be determined; the largest squared residuals are thus excluded from the summation.
S-estimation is a high breakdown value method introduced by Rousseeuw and Yohai (1984) that minimizes the scale of the residuals. Generalized S-estimates (GS-estimates), proposed by Croux et al. (1994), maintain the same high breakdown point as S-estimates and have slightly higher efficiency.
MM-estimates, proposed by Yohai (1987), can simultaneously attain a high breakdown point and high efficiency. Gervini and Yohai (2002) proposed a new class of high breakdown point and high efficiency robust estimates called the robust and efficient weighted least squares estimator (REWLSE).
Lee et al. (2011) proposed a new class of robust methods based on the regularization of case-specific parameters for each response. They further proved that the M-estimator (with Huber's function) is a special case of their proposed estimator. Other estimators can be found in Maronna et al. (2006).
Another robust measure related to regression analysis is the robust coefficient of determination introduced by Renaud and Victoria-Feser (2010).
A comparison among robust methods and the least squares method was carried out by Alma (2011), in which four robust regression methods were compared: least trimmed squares (LTS), M-estimation, S-estimation, and MM-estimation. The study concluded that the S-estimate and MM-estimate perform best overall against a comprehensive set of outlier conditions.
Another comparison study was carried out by Mohebbi et al. (2007), in which two robust methods, the Huber M-estimate and least absolute deviations (LAD), were compared, in addition to a nonparametric method. The conclusion was that the LAD and Huber M-estimates are more appropriate for heavy-tailed distributions, while the nonparametric and LAD regressions are better choices for skewed data.
Noor and Mohammad (2013) compared three robust methods and some nonparametric methods in simple linear regression models. They found that the LAD and M-estimators are the best methods when compared to the nonparametric methods and OLS.
Chapter 1: Regression Analysis
(1.1) Linear Regression Model
Regression analysis is used for explaining and modeling the relationship between a single variable y, called the response, output, or dependent variable, and one or more predictor, input, or explanatory variables x₁, x₂, …, xₖ. When k = 1 it is called simple regression, but when k > 1 it is called multiple regression or, sometimes, multivariate regression. Suppose there is a relationship between a response variable y and k explanatory variables x₁, x₂, …, xₖ; then the multiple linear regression model can be expressed as:
𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 +⋯+ 𝛽𝑘𝑥𝑘 + 𝜀 . (1.1)
The parameters βⱼ, j = 0, 1, 2, …, k, are called the regression coefficients. The model is called linear because it is linear in the parameters.
The error ε is the difference between the observed value of y and the function β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ. It is convenient to think of ε as a statistical error, that is, a random variable that accounts for the failure of the model to fit the data exactly (Montgomery, 2006).
In terms of the observed data, the linear regression model can be expressed as:
𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖1 + 𝛽2𝑥𝑖2 +⋯+ 𝛽𝑘𝑥𝑖𝑘 + 𝜀𝑖 , 𝑖 = 1,2, … , 𝑛. (1.2)
So, we can write each observation as:
𝑦1 = 𝛽0 + 𝛽1𝑥11 + 𝛽2𝑥12 +⋯+ 𝛽𝑘𝑥1𝑘 + 𝜀1
𝑦2 = 𝛽0 + 𝛽1𝑥21 + 𝛽2𝑥22 +⋯+ 𝛽𝑘𝑥2𝑘 + 𝜀2
⋮
𝑦𝑛 = 𝛽0 + 𝛽1𝑥𝑛1 + 𝛽2𝑥𝑛2 +⋯+ 𝛽𝑘𝑥𝑛𝑘 + 𝜀𝑛
These n equations can be written in matrix form as:
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}
Or equivalently:
Y = Xβ + ε,  (1.3)
where X is an n × (k + 1) matrix whose elements are known constants, β is the vector of the (k + 1) parameters, ε is the vector of errors, and Y is the vector of the n observations.
The linear regression model makes several assumptions about the data:
1. The expected value of the error term is E(ε) = 0, and hence E(Y) = Xβ.
2. The variance-covariance matrix of the error term is cov(ε) = σ²I, so cov(Y) = σ²I. This means that the variance of the error term is assumed to be constant for all εᵢ, i = 1, 2, …, n, and that the errors are assumed to be uncorrelated.
So, the response variable Y is a random vector with mean Xβ and variance-covariance matrix σ²I. In most real-world problems, the values of the parameters βⱼ, j = 0, 1, 2, …, k, and the error variance σ² will not be known, and they must be estimated from the sample data (Montgomery, 2006).
(1.2) Least Squares Method
As we showed in the previous section, the parameters 𝛽𝑗 , 𝑗 = 0,1, … , 𝑘 are unknown
quantities that characterize a regression model. Estimates of these parameters are
computable functions of data and are, therefore, statistics.
To keep this distinction clear, parameters are denoted by Greek letters like α, β, and σ, and
estimates of parameters are denoted by putting a "hat" over the corresponding Greek letter
(Weisberg, 2005).
In the least squares method of regression, the overall size of the error is measured by the sum of the squares of the errors, ∑ᵢ₌₁ⁿ εᵢ². The least squares estimates of βⱼ, j = 0, 1, …, k, are the values that minimize this sum.
In order to obtain the least squares estimates β̂₀, β̂₁, …, β̂ₖ, we differentiate
∑ᵢ₌₁ⁿ εᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ₁ − β₂xᵢ₂ − ⋯ − βₖxᵢₖ)²
with respect to each βⱼ and set the results equal to zero, which yields (k + 1) equations that can be solved simultaneously for the βⱼ's.
Formulas for the least squares estimates can be expressed in matrix form as :
𝛽∧
= (𝑋′𝑋)−1𝑋′𝑌 (1.4)
provided that the inverse matrix (X′X)⁻¹ exists. The inverse will always exist if the explanatory variables x₁, x₂, …, xₖ are linearly independent, that is, if no column of the X matrix is a linear combination of the other columns (Montgomery, 2006).
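As a numerical illustration of formula (1.4), the following Python/NumPy sketch (with made-up data, not the thesis data) computes β̂ via the normal equations and checks it against a library least squares solver:

```python
import numpy as np

# Illustrative data: n = 5 observations, k = 1 explanatory variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix X with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least squares estimate: beta_hat = (X'X)^{-1} X'y  -- formula (1.4).
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's dedicated least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```

In practice, solving the normal equations with `np.linalg.solve` (or using `lstsq` directly) is numerically preferable to forming the explicit inverse; the inverse form is shown here only because it matches (1.4).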
Two important concepts that will be seen later are:
1. The fitted value ŷᵢ, defined for observation i as:
ŷᵢ = Ê(Y | X = xᵢ) = β̂₀ + β̂₁xᵢ₁ + ⋯ + β̂ₖxᵢₖ.  (1.5)
2. The i-th residual eᵢ, defined as the difference between the observed value yᵢ and the corresponding fitted value ŷᵢ:
eᵢ = yᵢ − ŷᵢ.  (1.6)
Properties of the Least Squares Estimator β̂
If the previous assumptions E(Y) = Xβ and cov(Y) = σ²I are satisfied, then the least squares estimator β̂ has the following good properties:
1. β̂ is an unbiased estimator of β, that is, E(β̂) = β.
2. The covariance matrix of β̂ is given by the formula cov(β̂) = σ²(X′X)⁻¹.
3. β̂ has the minimum variance among all linear unbiased estimators. This property is known as the Gauss-Markov theorem, which can be stated as follows: if E(Y) = Xβ and cov(Y) = σ²I, the least squares estimators β̂₀, β̂₁, …, β̂ₖ are the best linear unbiased estimators (BLUE). In this expression, "best" means minimum variance, and "linear" indicates that the estimators are linear functions of y (Rencher, 2008).
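The unbiasedness property E(β̂) = β can be illustrated with a small Monte Carlo check: averaging the least squares estimates over many simulated samples should recover the true parameters. A sketch with arbitrary true values β = (2, 1) (chosen only for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([2.0, 1.0])   # arbitrary true parameters for the demo
x = np.linspace(0, 10, 20)
X = np.column_stack([np.ones_like(x), x])

# Repeatedly draw Y = X beta + eps with eps ~ N(0, 1) and re-estimate beta.
estimates = []
for _ in range(2000):
    y = X @ beta_true + rng.standard_normal(len(x))
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

# The average estimate should be close to the true beta (unbiasedness).
print(np.mean(estimates, axis=0))
```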
Normal Error Regression Model
No matter what may be the form of the distribution of the error terms (and hence of
the 𝑦𝑖), the least squares method provides unbiased point estimators of 𝛽𝑗 , that have
minimum variance among all unbiased linear estimators.
The normal error model is the same as regression model (1.1) with unspecified error distribution, except that it assumes the errors εᵢ are normally distributed. Thus, the assumption of uncorrelatedness of the εᵢ in regression model (1.1) becomes one of independence in the normal error model. Hence, the error term in any trial has no effect on the error term for any other trial, whether it is positive or negative, small or large.
The normal error regression model implies that the 𝑦𝑖 's are independent normal random
variables, with mean 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 +⋯+ 𝛽𝑘𝑥𝑘 and variance 𝜎2 .
The importance of this model comes from the fact that inferences, such as testing hypotheses and constructing confidence intervals about the model parameters, as well as predictions of Y, can be made based on the normality of the errors and the observed values.
Estimation of σ²
We need to estimate σ² because a variety of inferences concerning the regression function and the prediction of Y require an estimate of σ². The method of least squares, however, does not yield a function of the y and x values in the sample that we can minimize to obtain an estimator of σ². Nevertheless, there is an unbiased estimator of σ² based on the least squares estimator β̂, which depends on the residuals.
In fact, the estimate of σ² can be obtained from the residual sum of squares (SSE):
SSE = ∑ᵢ₌₁ⁿ eᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²  (1.7)
So, by the previous assumptions, we can show that σ² can be estimated by the corresponding average from the sample as follows:
MSRes = σ̂² = (1/(n − p)) ∑ᵢ₌₁ⁿ (yᵢ − xᵢ′β̂)²  (1.8)
where n is the sample size and p is the number of parameters (p = k + 1). MSRes is called the residual mean square, and its square root √MSRes is called the standard error of regression. Note that MSRes can be written as
MSRes = SSE / (n − p)  (1.9)
which is an unbiased estimator of σ².
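As a small numerical illustration of (1.7)-(1.9), using made-up data with n = 5 and one predictor (so p = 2):

```python
import numpy as np

# Illustrative data (made up): n = 5, one predictor, so p = 2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

n, p = X.shape
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

residuals = y - X @ beta_hat          # e_i = y_i - y_hat_i
SSE = np.sum(residuals ** 2)          # residual sum of squares (1.7)
MS_res = SSE / (n - p)                # residual mean square (1.9)
se_regression = np.sqrt(MS_res)       # standard error of regression

print(SSE, MS_res)
```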
Because 𝑀𝑆𝑅𝑒𝑠 = �̂�2 depends on the residual sum of squares, any violation of the
assumptions on the model errors or any misspecification of the model form may seriously
damage the usefulness of �̂�2 as an estimate of 𝜎2. Because �̂�2 is computed from the
regression model residuals, we say that it’s a model-dependent estimate of 𝜎2
(Montgomery, 2006).
Chapter 2: Outliers and Influential Points
(2.1) Introduction:
As we stated, for the classical regression model the method of estimating the regression parameters is the least squares method. If the error term in the regression model is normally distributed, the least squares estimates of the regression parameters are the same as the maximum likelihood estimates. After the estimates of the linear regression model are obtained, the next step is to check whether this linear regression model reasonably reflects the true relationship between the response variable and the independent variables. This falls into the area of regression diagnostics, which has two aspects. One is to check whether a chosen model is reasonable enough to reflect the true relationship between the response variable and the independent variables. The other is to check whether there are any data points that deviate significantly from the assumed model. The first question relates to model diagnostics, and the second to checking for outliers and influential observations. In this chapter we focus on the detection of outliers and influential observations.
Identifying outliers and influential observations for a regression model is based on the assumption that the regression model is correctly specified, that is, that the selected regression model adequately describes the relationship between the response variable and the independent variables. Any data points that fit the assumed model well are the right data points for that model. Sometimes, however, not all data points fit the model equally well; there may be some data points that deviate significantly from the assumed model. These data points may be considered outliers if we believe that the selected model is correct.
Geometrically, linear regression is a line (for simple linear regression) or a hyperplane (for multiple regression). If a regression model is appropriately selected, most data points should be fairly close to the regression line or hyperplane. The data points which are far away from the regression line or hyperplane may not be "ideal" data points for the selected model and could potentially be identified as outliers for the model. An outlier is a data point that is statistically far away from the chosen model, if we believe that the selected regression model is correct (Yan and Su, 2009).
An influential observation is one that has a relatively large impact on the estimates of one or more regression parameters; that is, including or excluding it in the model fitting results in an unusual change in one or more of the estimates. We will now discuss the statistical procedures for identifying outliers and influential observations.
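The idea of an influential observation can be made concrete with a leave-one-out experiment: refit the model with each point deleted and see how much the estimates move. A sketch with made-up data, where the last point has an extreme x value (high leverage):

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Made-up data: the last point is far out in the x direction.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 2.0])
X = np.column_stack([np.ones_like(x), x])

beta_full = ols(X, y)
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    beta_i = ols(X[keep], y[keep])
    # Large changes flag influential cases (cf. Cook's distance, Section 2.8).
    print(i, np.round(beta_i - beta_full, 3))
```

Deleting the high-leverage point changes the fitted slope dramatically, while deleting any of the other points barely moves the estimates; this is exactly the behavior that deletion diagnostics such as Cook's distance quantify.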
(2.2) Residual Analysis
As we saw in the previous chapter, the residual of the linear regression model is defined as the difference between the observed response yᵢ and the fitted value ŷᵢ. The regression error term ε is unobservable, while the residual is observable. The residual is an important measure of how close the calculated response from the fitted regression model is to the observed response.
The regression residuals can be expressed in vector form as:
e = Y − Ŷ = Y − X(X′X)⁻¹X′Y = (I − H)Y  (2.1)
where H = X(X′X)⁻¹X′ is called the hat matrix. Note that (I − H) is symmetric and idempotent, that is, (I − H)′ = (I − H) and (I − H)² = (I − H). Further, we can express the variance-covariance matrix of the residuals in terms of the hat matrix as:
σ²(e) = σ²(I − H)  (2.2)
Therefore, the variance of the residual eᵢ, denoted by σ²(eᵢ), is:
σ²(eᵢ) = σ²(1 − hᵢᵢ)  (2.3)
where hᵢᵢ is the i-th element on the main diagonal of the hat matrix, and the covariance between residuals eᵢ and eⱼ (i ≠ j) is:
σ(eᵢ, eⱼ) = σ²(0 − hᵢⱼ) = −hᵢⱼσ²,  i ≠ j  (2.4)
where hᵢⱼ is the element in the i-th row and j-th column of the hat matrix. So, the estimated counterparts are:
σ̂²(eᵢ) = σ̂²(1 − hᵢᵢ)  (2.5)
σ̂(eᵢ, eⱼ) = −hᵢⱼ MSRes,  i ≠ j
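These hat-matrix identities are easy to verify numerically. A minimal Python/NumPy sketch with made-up data (not the thesis data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)^{-1}X'
I = np.eye(len(y))
M = I - H

# (I - H) is symmetric and idempotent.
print(np.allclose(M, M.T), np.allclose(M @ M, M))

# Residuals via e = (I - H)y agree with y minus the fitted values.
e = M @ y
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(e, y - X @ beta_hat))

# The diagonal elements h_ii drive the residual variances sigma^2 (1 - h_ii);
# note they are largest at the extreme x values.
print(np.round(np.diag(H), 3))  # → [0.6 0.3 0.2 0.3 0.6]
```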
(2.3) The Effect of Outliers on the Least Squares Method:
Outliers are observations that appear inconsistent with the rest of the data set. Outliers occur very frequently in real data, and they often go unnoticed because much data is processed by computers without careful inspection or screening. Outliers may be a result of errors in recording, may come from another population, or may be unusual observations from the assumed distribution. For example, if the errors εᵢ are distributed as N(0, σ²), a value of εᵢ whose absolute value is greater than 3σ would occur with probability 0.0027 (Rencher, 2008).
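The stated probability comes from the standard normal distribution, P(|ε| > 3σ) = 2(1 − Φ(3)) ≈ 0.0027, which can be verified with the complementary error function:

```python
import math

# P(|Z| > 3) for Z ~ N(0, 1), using Phi(z) = (1 + erf(z / sqrt(2))) / 2,
# so 2*(1 - Phi(3)) = erfc(3 / sqrt(2)).
p = math.erfc(3 / math.sqrt(2))
print(round(p, 4))  # 0.0027
```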
Outliers can create great difficulty. When we encounter one, our first suspicion is that the
observation resulted from a mistake or other extraneous effect, and hence should be
discarded. A major reason for discarding it is that under the least squares method, a fitted
model may be pulled disproportionately toward an outlying observation because the sum of
the squared deviations is minimized. This could cause a misleading fit if indeed the
outlying observation resulted from a mistake or other extraneous cause. On the other hand,
outliers may convey significant information, as when an outlier occurs because of an
interaction with another predictor variable omitted from the model (Kutner, et al., 2005).
As an illustration, Figure (2.1) shows the least squares fitted lines for (a) the data with one outlier, and (b) the data with the outlier removed. As shown in the figure, the least squares estimates were highly affected by this outlier. Figure (2.1-a) represents the ordinary least squares fit with only one outlier; the estimated parameters are β̂₀ = 6.5562 and β̂₁ = 0.4611, while the estimated parameters without the outlier (in Figure (2.1-b)) are β̂₀ = 2.187 and β̂₁ = 1.063.
Figure (2.1): The effect of outlying points on the ordinary least squares method; (a) represents the least squares fit with one outlier, and (b) represents the least squares fit without the outlier.
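The effect illustrated in Figure (2.1) is easy to reproduce numerically; the following sketch uses made-up data (the specific numbers differ from those in the figure):

```python
import numpy as np

def fit_line(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
x = np.arange(1.0, 9.0)                  # 8 clean design points
y = 2.0 + 1.0 * x + 0.2 * rng.standard_normal(8)

b_clean = fit_line(x, y)                 # close to the true (2, 1)

x_out = np.append(x, 2.0)                # add a single outlier high in y
y_out = np.append(y, 14.0)
b_outlier = fit_line(x_out, y_out)

# One outlier at low x flattens the slope and inflates the intercept.
print(np.round(b_clean, 2), np.round(b_outlier, 2))
```

Because the squared residual of the outlier dominates the least squares criterion, the fitted line is pulled toward that single point, which is precisely the vulnerability that the robust methods of Chapter 3 are designed to avoid.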
(2.4) Types of Outliers:
Outliers can be classified as follows (Ryan, 1997; Adnan and Mohammad, 2003):
1. Regression Outlier
A point that deviates from the linear relationship determined from the other n-1 points, or
at least from the majority of those points.
2. Residual Outlier
A point that has a large residual, according to measures that will be discussed later. It is important to distinguish between a regression outlier and a residual outlier: a point can be a regression outlier without being a residual outlier (if the point is influential), and a point can be a residual outlier without there being strong evidence that it is also a regression outlier.
3. X-space outlier

This is an observation that is remote in one or more $x$ coordinates. An X-space outlier could also be a regression outlier and/or a residual outlier.
4. Y-space outlier
This is a point that is outlying only because its y-coordinate is extreme. The manner and
extent to which such an outlier will affect the parameter estimates will depend upon both
its x-coordinate and the general configuration of the other points.
Thus, the point might also be a regression and/or residual outlier.
5. X- and Y-outlier:
A point that is outlying in both coordinates may be a regression outlier, or a residual outlier
(or both), or it may have a very small effect on the regression equation. The determining
factor is the general configuration of the other points.
Figure (2.2) illustrates the different types of outliers. The ellipse encloses the majority of the data. Point A is an outlier in Y-space because its y value is significantly different from the rest of the data, while point B is an X-space outlier, that is, its x value is unusual; such a point is also referred to as a leverage point. Points C and D are both X- and Y-outliers. Note that point D has virtually no impact on the regression line, so it is neither a regression outlier nor a residual outlier. Point C is a regression outlier, and it may or may not be a residual outlier, depending on other measures that will be discussed in the next sections.
Figure (2.2): Scatter plot for different types of outlying observations
(2.5) Simple Criteria for Model Comparison
The following are some basic criteria that are commonly used for regression model
diagnosis:
1. Coefficient of determination:

$R^2 = 1 - \frac{SSE}{SST}$    (2.6)

where $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, and the total sum of squares is $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$. The preferred model is the one whose $R^2$ value is close to 1. If the data fit the regression model well, then each $y_i$ should be close to $\hat{y}_i$; hence SSE should be fairly close to zero, and therefore $R^2$ should be close to 1.
2. Estimate of error variance $MS_{Res} = \hat{\sigma}^2$. Among a set of candidate regression models, the preferred model is the one with the smaller value of $\hat{\sigma}^2$, since this corresponds to fitted values that are closer to the observed responses as a whole.
(2.6) Identifying Outlying Y Observations

If the difference between the fitted value $\hat{y}_i$ and the response $y_i$ is large, we may suspect that the $i$th observation is a potential outlier. The purpose of outlier detection in regression analysis is to eliminate observations with relatively large residuals so that the model fit may be improved. However, observations cannot be eliminated solely on the basis of a statistical procedure; the decision should be made jointly by statisticians and subject-matter scientists (Yan, 2009). In practice, eliminating observations is rarely straightforward. Observations that do not fit the model well may indicate a flaw in the selected model, and outlier detection might actually lead to altering the model.
The analysis of residuals carries the most useful information for model fitting. Three measures of residuals commonly used in outlier detection will be discussed: the standardized residual, the studentized residual, and the R-student residual.
1. Standardized Residual: let $\sqrt{MS_{Res}}$ be the standard error of the regression model. The standardized residual is defined as:

$z_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, 2, \ldots, n$    (2.7)

The standardized residual is simply the normalized residual, i.e., the z-score of the residual. The standardized residuals have mean zero and approximately unit variance. Consequently, a large absolute standardized residual ($|z_i| > 3$) potentially indicates an outlier (Montgomery, 2006).
2. Studentized Residual:

The studentized residual is defined as:

$r_i = \frac{e_i}{\hat{s}(e_i)} = \frac{e_i}{\sqrt{MS_{Res}(1 - h_{ii})}}, \quad i = 1, 2, \ldots, n$    (2.8)

Since the residuals may have different estimated variances $\sigma^2(e_i)$, it is natural to take the magnitude of each $e_i$ relative to its estimated standard deviation $\hat{s}(e_i)$ (see eq. (2.5)) to give recognition to differences in the sampling errors of the residuals (Kutner et al., 2005). The ratio of $e_i$ to $\hat{s}(e_i)$ is called the studentized (or internally studentized) residual. While the residuals $e_i$ will have substantially different sampling variations if their standard deviations differ markedly, the studentized residuals $r_i$ have constant variance when the model is appropriate (Kutner et al., 2005).
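As a minimal sketch (not from the thesis), the standardized residuals of equation (2.7) and the studentized residuals of equation (2.8) can be computed directly from the hat matrix; the small data set below is hypothetical, with one aberrant response.

```python
import numpy as np

# Hypothetical data: a straight-line trend with one unusual response at x = 6.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS estimates
e = y - X @ beta                             # ordinary residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
h = np.diag(H)                               # leverages h_ii

ms_res = (e @ e) / (n - p)                   # MS_Res, estimate of sigma^2
z = e / np.sqrt(ms_res)                      # standardized residuals, eq. (2.7)
r = e / np.sqrt(ms_res * (1.0 - h))          # studentized residuals, eq. (2.8)

print("largest |r| at index", int(np.argmax(np.abs(r))))
```

On this toy data the largest studentized residual belongs to the aberrant last point, as expected.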
3. R-Student Residuals

The studentized residual $r_i$ discussed above is often used as an outlier diagnostic. In computing $r_i$, $MS_{Res}$ was used as an estimate of $\sigma^2$. This is referred to as internal scaling of the residual, because $MS_{Res}$ is an internally generated estimate of $\sigma^2$ obtained from fitting the model to all $n$ observations.

Another approach is to use an estimate of $\sigma^2$ based on the data set with the $i$th observation removed. Denote this estimate by $S_{(i)}^2$. It can be shown that:

$S_{(i)}^2 = \frac{(n - p)\,MS_{Res} - e_i^2/(1 - h_{ii})}{n - p - 1}$    (2.9)

This estimate of $\sigma^2$ is used in place of $MS_{Res}$ to produce an externally studentized residual, usually called R-student, given by:

$t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, 2, \ldots, n$    (2.10)

In many situations $t_i$ will differ little from the studentized residual $r_i$. However, if the $i$th observation is influential, then $S_{(i)}^2$ can differ significantly from $MS_{Res}$, and the R-student statistic will be more sensitive to this point.
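Continuing the hedged sketch above, equations (2.9) and (2.10) can be coded directly; the data are again hypothetical, and the test relies on the known identity $t_i = r_i \sqrt{(n-p-1)/(n-p-r_i^2)}$ relating the two residual types.

```python
import numpy as np

# Hypothetical data: a straight-line trend with one aberrant response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
ms_res = (e @ e) / (n - p)
r = e / np.sqrt(ms_res * (1.0 - h))              # studentized residuals, eq. (2.8)

# Eq. (2.9): sigma^2 estimated with the i-th observation removed.
s2_del = ((n - p) * ms_res - e**2 / (1.0 - h)) / (n - p - 1)
# Eq. (2.10): externally studentized (R-student) residuals.
t = e / np.sqrt(s2_del * (1.0 - h))
```

Because $|t_i|$ is a monotone increasing transform of $|r_i|$, both statistics flag the same most extreme case here.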
Test of outliers:
We identify as outlying Y observations those cases whose studentized deleted residuals are
large in absolute value. In addition, we can conduct a formal test by means of the
Bonferroni test procedure of whether the case with the largest absolute studentized deleted
residual is an outlier. Since we do not know in advance which case will have the largest
absolute value $|t_i|$, we consider the family of tests to include $n$ tests, one for each case. If the regression model is appropriate, so that no case is outlying because of a change in the model, then each studentized deleted residual follows the t-distribution with $n - p - 1$ degrees of freedom. The appropriate Bonferroni critical value is therefore $t\left(1 - \frac{\alpha}{2n};\; n - p - 1\right)$. Note that the test is two-sided, since we are concerned not with the direction of the residuals but only with their absolute values (Montgomery, 2006).
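A hedged sketch of the Bonferroni critical value, using SciPy's t-distribution quantile function; the setting $n = 10$, $p = 2$, $\alpha = 0.05$ matches the steel employment example discussed later in this chapter.

```python
from scipy.stats import t

def bonferroni_critical(alpha, n, p):
    """Bonferroni critical value t(1 - alpha/(2n); n - p - 1) for the
    outlier test on the largest absolute studentized deleted residual."""
    return t.ppf(1.0 - alpha / (2.0 * n), n - p - 1)

# Setting used in the steel employment example: n = 10 cases, p = 2 parameters.
crit = bonferroni_critical(0.05, 10, 2)
print("Bonferroni critical value:", round(crit, 3))
```

Any case with $|t_i|$ exceeding this value is declared an outlier at familywise level $\alpha$.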
(2.7) Identifying Outlying X Observations
The hat matrix, as we saw, plays an important role in determining identifying outlying Y
observations. The hat matrix is also helpful in directly identifying outlying X observations.
In particular, the diagonal elements of the hat matrix are useful indicators in a
multivariable setting of whether or not a case is outlying with respect to its X values.
The diagonal elements $h_{ii}$ of the hat matrix have some useful properties. In particular, their values are always between 0 and 1, and their sum is $p$:

$0 \le h_{ii} \le 1 \quad \text{and} \quad \sum_{i=1}^{n} h_{ii} = p$    (2.11)
where $p$ is the number of regression parameters in the regression function. In addition, it can be shown that $h_{ii}$ is a measure of the distance between the X values for the $i$th case and the means of the X values for all $n$ cases. Thus, a large value of $h_{ii}$ indicates that the $i$th case is distant from the center of all X observations. The diagonal element $h_{ii}$ in this context is called the leverage (in terms of the X values) of the $i$th case (Kutner, et al., 2005). Hence, the farther a point is from the center of the X space, the more leverage it has.
Because the sum of the leverage values is $p$, observation $i$ can be considered an X outlier if its leverage exceeds twice the mean leverage value $\bar{h}$, which according to (2.11) is:

$\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$    (2.12)

Hence, leverage values greater than $2p/n$ are considered by this rule to indicate cases outlying with regard to their X values.
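The $2p/n$ rule of thumb can be sketched as follows; the data set is hypothetical, with one deliberately extreme x value acting as a leverage point.

```python
import numpy as np

# Hypothetical data with one extreme x value (a leverage point at x = 20).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages h_ii
cutoff = 2 * p / n                              # rule of thumb: 2p/n
flags = h > cutoff                              # True where the case is an X outlier
print("flagged cases:", np.where(flags)[0])
```

Only the remote point exceeds the cutoff; note also that the leverages sum to $p$, as equation (2.11) requires.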
(2.8) Identifying Influential Cases
After identifying cases that are outlying with respect to their Y values and /or their X
values, we want to determine the influence of these observations on the regression model.
We choose one measure of influence that is widely used in practice, which is Cook's
distance.
Cook's distance

A measure of the overall influence an outlying observation has on the fitted values was proposed by R. D. Cook (1977). Cook's distance measure, denoted by $D_i$, is an aggregate influence measure showing the effect of the $i$th case on all $n$ fitted values:

$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot MS_{Res}}$    (2.13)

Note that each of the $n$ fitted values $\hat{y}_j$ is compared with the corresponding fitted value $\hat{y}_{j(i)}$ obtained when the $i$th case is deleted in fitting the regression model. These differences are then squared and summed, so that the aggregate influence of the $i$th case is measured without regard to the signs of the effects.
An equivalent expression for Cook's distance is:

$D_i = \frac{e_i^2}{p \cdot MS_{Res}} \left[\frac{h_{ii}}{(1 - h_{ii})^2}\right]$    (2.14)

In this expression we note that $D_i$ depends on both the residual $e_i$ and the leverage $h_{ii}$ of the $i$th observation. A large value of $D_i$ indicates that the observed $y_i$ value has a strong influence on the fitted values (since the residual, the leverage, or both will be large). Values of $D_i$ can be compared with the F distribution with $\nu_1 = p$ and $\nu_2 = n - p$ degrees of freedom. Usually, an observation whose $D_i$ falls at or above the 50th percentile of this F distribution is considered influential (Mendenhall, 2012).
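As a hedged check that equations (2.13) and (2.14) really agree, the sketch below (on a hypothetical data set) computes Cook's distance both by the closed form and by explicitly deleting each case and refitting.

```python
import numpy as np

# Hypothetical data: a straight-line trend with one aberrant response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
e = y - yhat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
ms_res = (e @ e) / (n - p)

# Eq. (2.14): closed form using residual and leverage.
D_closed = (e**2 / (p * ms_res)) * (h / (1.0 - h)**2)

# Eq. (2.13): delete each case in turn, refit, and compare all fitted values.
D_delete = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_delete[i] = ((yhat - X @ beta_i)**2).sum() / (p * ms_res)
```

The two computations coincide, and the most influential case is the aberrant last point.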
(2.9) Example
Table (2.1) shows steel employment by country in thousands of people for the years 1974
and 1992, with x = the steel employment in thousands in year 1974 and y = the steel
employment in thousands in year 1992. First we make a plot of the data; this is shown in
Figure (2.3). We see that a straight line might describe the overall pattern of the data
reasonably well, and so the simple linear regression model 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜀 seems like a
good model to try. We use the least squares method to estimate the parameters $\beta_0$ and $\beta_1$; the result is $\hat{\beta}_0 = -0.3139$ and $\hat{\beta}_1 = 0.4004$, so the fitted regression line is $\hat{y} = -0.3139 + 0.4004x$. This is the line in figure (2.4).
The fitted responses $\hat{y}_i$, the residuals $e_i$, the diagonal elements $h_{ii}$ of the hat matrix, the studentized residuals $r_i$, the R-student statistics $t_i$, and the Cook's distances associated with each observation are calculated and listed in table (2.2), and plots of each measure are provided in figures (2.5) to (2.9).
Country           1974 (x)   1992 (y)
Germany             232        132
Italy                96         50
France              158         43
United Kingdom      194         41
Spain                89         33
Belgium              64         25
Netherlands          25         16
Luxembourg           23          8
Portugal              4          3
Denmark               2          1

Table (2.1): Steel Employment by Country in Europe, in Thousands, 1974 and 1992. Source: (Draper and Smith, 1998, page 573)
Figure (2.3) : plot of the steel employment data
Table (2.2): fitted values, residuals, leverage, studentized residuals, R-student residuals, and Cook's distances

Country            y_i     ŷ_i      e_i      h_ii    r_i      t_i      D_i
*Germany           132     92.57    39.43    0.44    2.53     5.34     2.54
Italy               50     38.12    11.88    0.10    0.60     0.58     0.02
France              43     62.95   -19.95    0.18   -1.06    -1.07     0.12
*United Kingdom     41     77.36   -36.36    0.28   -2.07     2.83     0.85
Spain               33     35.32    -2.32    0.10   -0.12    -0.11     0.00
Belgium             25     25.31    -0.31    0.11   -0.016   -0.015    0.00
Netherlands         16      9.70     6.30    0.17    0.33     0.31     0.01
Luxembourg           8      8.90    -0.90    0.17   -0.05    -0.04     0.00
Portugal             3      1.29     1.71    0.22    0.09     0.087    0.00
Denmark              1      0.49     0.51    0.23    0.03     0.026    0.00

Figure (2.4): the least squares line for the Steel Employment example.
Figure (2.5): plot of the residuals of the least squares fit in the Steel Employment example.
It can be seen from table (2.2) that observations 1 and 4 (Germany and the United Kingdom) have the largest residuals, $e_1 = 39.43$ and $e_4 = -36.36$, and the corresponding values of the R-student statistic are $t_1 = 5.34$ and $t_4 = 2.83$. At test level $\alpha = 0.05$, the Bonferroni critical value is $t\left(1 - \frac{0.05}{2 \times 10};\; 10 - 2 - 1\right) = 4.029$, so the first observation (Germany) is identified as an outlier, while the fourth observation (United Kingdom) is not.

As we can also note from table (2.2), the largest $h_{ii}$ value belongs to the first observation (Germany), with $h_{11} = 0.44$. Comparing this value with $\frac{2p}{n} = \frac{2 \times 2}{10} = 0.4$, observation 1 is considered an X outlier, while all other observations are not.

Finally, to determine which observations are influential, we use the last column, which contains the Cook's distances $D_i$. As shown in the table, the largest values are $D_1 = 2.54$ and $D_4 = 0.85$. Referring to the corresponding F distribution $F(p, n - p) = F(2, 8)$, we find that 2.54 falls at the 86th percentile and 0.85 at the 54th percentile of this distribution. Hence these two observations have a major influence on the regression fit.
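The ordinary least squares fit of the steel employment example can be reproduced as a short sketch; the data are taken from table (2.1), and the computed estimates, $R^2$, and $\sqrt{MS_{Res}}$ match the values reported in the text and in table (2.3).

```python
import numpy as np

# Steel employment data from table (2.1): x = 1974, y = 1992 (thousands).
x = np.array([232, 96, 158, 194, 89, 64, 25, 23, 4, 2], dtype=float)
y = np.array([132, 50, 43, 41, 33, 25, 16, 8, 3, 1], dtype=float)

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS estimates (beta0, beta1)

e = y - X @ beta
sse = e @ e
sst = ((y - y.mean())**2).sum()
r2 = 1.0 - sse / sst                          # coefficient of determination, eq. (2.6)
se = np.sqrt(sse / (n - p))                   # standard error sqrt(MS_Res)

print("beta0 = %.5f, beta1 = %.5f, R^2 = %.4f" % (beta[0], beta[1], r2))
```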
Figure (2.6): plot of the hat values $h_{ii}$ of the least squares fit in the Steel Employment example.

Figure (2.7): plot of the studentized residuals of the least squares fit in the Steel Employment example.

Figure (2.8): plot of the R-student residuals of the least squares fit in the Steel Employment example.
Figure (2.9): plot of the Cook's distances of the least squares fit in the Steel Employment example.
Now we perform the least squares estimation for two cases: first for all 10 observations, and second for only 8 observations (observations 1 and 4 excluded from the data set). In each case we compute the standard error of regression $\sqrt{MS_{Res}}$ and the coefficient of determination $R^2$; the results are summarized in table (2.3).

The results indicate that the best properties are obtained when observations 1 and 4 (the influential observations) are excluded: table (2.3) shows the decrease in $\sqrt{MS_{Res}}$ and the increase in $R^2$ in the second case.

Method                               $\hat{\beta}_0$   $\hat{\beta}_1$   $\sqrt{MS_{Res}}$   $R^2$
OLS for all cases                    -0.31386          0.40038           20.81               0.7357
OLS without observations 1 and 4      4.61012          0.30828            8.271              0.8281

Table (2.3): estimated parameters, $\sqrt{MS_{Res}}$, and $R^2$ for different scenarios in the Steel Employment data.
Chapter Three: Robust Regression Methods
(3.1) Overview
Outliers should be investigated carefully. Often they contain valuable information about
the process under investigation or the data gathering and recording process. Before
considering the possible elimination of these points from the data, one should try to
understand why they appeared and whether it is likely similar values will continue to
appear. Of course, outliers are often bad data points.
When the observations 𝑌 in the linear regression model 𝑌 = 𝑋𝛽 + 𝜀 are normally
distributed, the method of least squares is a good parameter estimation procedure in the
sense that it produces an estimator of the parameter vector 𝛽 that has good statistical
properties. However, there are many situations where we have evidence that the
distribution of the response variable is (considerably) non normal and/or there are outliers
that affect the regression model. A case of considerable practical interest is one in which
the observations follow a distribution that has longer or heavier tails than the normal.
These heavy-tailed distributions tend to generate outliers, and these outliers may have a
strong influence on the method of least squares in the sense that they "pull" the regression
equation too much in their direction.
To conclude, regression outliers (in either the X or the Y space) pose a serious threat to ordinary least squares analysis. Basically, there are two ways out of this problem. The first, and probably most well-known, approach is to construct regression diagnostics, which were discussed in chapter 2. Diagnostics are quantities computed from the data with the purpose of pinpointing influential points; these outliers can then be removed or corrected, followed by a least squares analysis on the remaining cases. When there is only a single outlier, some of these methods work quite well by looking at the effect of deleting one point at a time. Unfortunately, it is much more difficult to diagnose outliers when there are several of them, and diagnostics for such multiple outliers are quite involved and often require extensive computation.
The other approach is robust regression, which tries to devise estimators that are not so
strongly affected by outliers. Therefore, diagnostics and robust regression really have the
same goals, only in the opposite order: When using diagnostic tools, one first tries to delete
the outliers and then to fit the "good" data by least squares, whereas a robust analysis first
wants to fit a regression to the majority of the data and then to discover the outliers as
those points which possess large residuals from that robust solution. The following step is
to think about the structure that has been uncovered. For instance, one may go back to the
original data set and use subject-matter knowledge to study the outliers and explain their
origin (Rousseeuw and Leroy, 1987).
In this chapter, we study the most popular robust regression methods used in linear models; in addition, some properties of these methods will be discussed.
(3.2) Properties of Robust Estimators:
In this section we introduce two important properties of robust estimators: breakdown and
efficiency.
(3.2.1) Breakdown Point
In chapter two, we saw that even a single regression outlier can totally offset the least
squares estimator (page 15). On the other hand, we will see that there exist estimators that
can deal with data containing a certain percentage of outliers. In order to formalize this
aspect, the breakdown point was introduced.
Take any sample of $n$ data points, $Z = \{(x_{11}, \ldots, x_{1p}, y_1), \ldots, (x_{n1}, \ldots, x_{np}, y_n)\}$, and let $T$ be an estimator; applying $T$ to such a sample $Z$ yields a vector of regression coefficients $T(Z) = \hat{\beta}$. Now consider all possible corrupted samples $Z'$ obtained by replacing any $m$ of the original data points by arbitrary values (this allows for very bad outliers). Denote by bias$(m; T, Z)$ the maximum bias that can be caused by such contamination:

$\text{bias}(m; T, Z) = \sup_{Z'} \| T(Z') - T(Z) \|$    (3.1)

where the supremum is over all possible $Z'$. If bias$(m; T, Z)$ is infinite, then $m$ outliers can have an arbitrarily large effect on $T$, which may be expressed by saying that the estimator "breaks down". Therefore, the (finite-sample) breakdown point of the estimator $T$ at the sample $Z$ is defined as:

$b_n^*(T, Z) = \min \left\{ \frac{m}{n} : \text{bias}(m; T, Z) \text{ is infinite} \right\}$    (3.2)

In other words, it is the smallest fraction of contamination that can cause the estimator $T$ to take on values arbitrarily far from $T(Z)$ (Rousseeuw and Leroy, 1987).

For least squares, we have seen that one outlier is sufficient to carry $T$ over all bounds. Therefore, its breakdown point equals:

$b_n^*(T, Z) = \frac{1}{n}$    (3.3)

which tends to zero as the sample size $n$ increases, so the least squares method can be said to have a breakdown point of 0%. This again reflects the extreme sensitivity of the least squares method to outliers.
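The 0% breakdown point of least squares can be illustrated with a small hedged sketch: corrupting a single response ever more severely drives the OLS slope arbitrarily far from its clean value. The data below are simulated, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 21.0)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, size=x.size)  # clean linear data

def ols_slope(x, y):
    """OLS slope of a simple linear regression with intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

clean_slope = ols_slope(x, y)
slopes = []
for bad in [1e2, 1e4, 1e6]:          # corrupt one y value ever more severely
    y_bad = y.copy()
    y_bad[-1] = bad
    slopes.append(ols_slope(x, y_bad))
print("clean slope:", clean_slope, "corrupted slopes:", slopes)
```

A single corrupted point (out of 20) is enough to push the slope without bound, exactly as $b_n^* = 1/n$ predicts.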
(3.2.2) Efficiency
Suppose that the data set has no gross errors, there are no influential observations, and the
observations come from a normal distribution. If we use a robust estimator on such a data
set, we would want the results to be virtually identical to OLS, since OLS is the
appropriate technique for such data. The efficiency of a robust estimator can be thought of
as the residual mean square obtained from the OLS divided by the residual mean square of
the robust procedure ; we want this efficiency measure to be close to unity.
There is much emphasis in the robust regression literature on asymptotic efficiency, that is, the efficiency of an estimator as the sample size $n$ becomes infinite. This is a useful concept in comparing robust estimators, but many practical regression problems involve small to moderate sample sizes ($n < 50$, for instance), and small-sample efficiencies are known to differ dramatically from their asymptotic values. Consequently, a model-builder should be interested in the asymptotic behavior of any estimator that might be used in a given situation but should not be unduly excited about it. What is more important from a practical viewpoint is the finite-sample efficiency: how well a particular estimator works, with reference to OLS on "clean" data, for sample sizes consistent with those of interest in the problem at hand (Montgomery, 2006). The finite-sample efficiency of a robust estimator is defined as the ratio of the OLS residual mean square to the robust estimator's residual mean square:

$\text{Efficiency} = \frac{MS_{Res}(\text{OLS})}{MS_{Res}(\text{robust estimator})}$    (3.4)

where OLS is applied only to the clean data.
(3.3) Robust Regression Methods:
(3.3.1) M-estimation

M-estimators are "maximum likelihood type" estimators. Suppose the errors are independently distributed and all follow the same distribution $f(\varepsilon)$. Then the maximum likelihood estimator (MLE) of $\beta$ in the model $y = X\beta + \varepsilon$ is the $\hat{\beta}$ that maximizes:

$\prod_{i=1}^{n} f(y_i - x_i'\beta)$    (3.5)

where $x_i'$ is the $i$th row of $X$, $i = 1, 2, \ldots, n$. Equivalently, the MLE of $\beta$ maximizes:

$\sum_{i=1}^{n} \ln f(y_i - x_i'\beta)$    (3.6)

If the errors are normally distributed, this leads to minimizing the sum of squares:

$\sum_{i=1}^{n} (y_i - x_i'\beta)^2$    (3.7)

which is least squares estimation, and if the errors follow the double exponential (Laplace) distribution, we minimize:

$\sum_{i=1}^{n} |y_i - x_i'\beta|$    (3.8)

which is least absolute deviation (LAD) estimation, also called $L_1$-norm estimation.
This idea can be extended as follows. Suppose $\rho(\varepsilon)$ is a given function of $\varepsilon$; the M-estimation method minimizes a function $\rho$ of the errors:

$\min_\beta \sum_{i=1}^{n} \rho(\varepsilon_i) = \min_\beta \sum_{i=1}^{n} \rho(y_i - x_i'\beta)$    (3.9)

In general, M-estimation is not scale invariant (i.e., if the errors $y_i - x_i'\beta$ were multiplied by a constant, the new solution to (3.9) might not be the same as the old one). To obtain a scale-invariant version of this estimator, we solve:

$\min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{\varepsilon_i}{s}\right) = \min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{y_i - x_i'\beta}{s}\right)$    (3.10)

where $s$ is a robust estimate of scale. A popular choice for $s$ is based on the median absolute deviation (MAD), which is highly resistant to outlying observations:

$s = \frac{\text{median}\,|e_i - \text{median}(e_i)|}{h}, \quad i = 1, 2, \ldots, n$    (3.11)

The constant $h$ is suggested to be 0.6745, which makes $s$ an approximately unbiased estimator of $\sigma$ when $n$ is large and the error distribution is normal (Draper and Smith, 1998).
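The MAD-based scale of equation (3.11) takes only a few lines to sketch; the example checks the two properties the text claims, approximate unbiasedness for normal errors (here on simulated data) and resistance to a gross outlier.

```python
import numpy as np

def mad_scale(e, h=0.6745):
    """Robust scale estimate s of eq. (3.11): MAD of the residuals divided by h."""
    e = np.asarray(e, dtype=float)
    return np.median(np.abs(e - np.median(e))) / h

# For a large normal sample, s approximates the true sigma (here sigma = 2).
rng = np.random.default_rng(1)
e = rng.normal(0.0, 2.0, size=100_000)
s = mad_scale(e)
print("estimated scale:", round(s, 3))
```

A single gross outlier leaves the estimate essentially untouched, unlike the sample standard deviation.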
We see that if $\rho(\varepsilon) = \varepsilon^2$, the criterion minimized is the same as equation (3.7), while if $\rho(\varepsilon) = |\varepsilon|$ we get equation (3.8). So, in these specific cases, the form of $\rho$ and the underlying distribution are directly related: knowledge of an appropriate distribution for the errors tells us which $\rho(\varepsilon)$ to use.

To minimize equation (3.10), we equate the first partial derivatives of the objective with respect to $\beta_j$ ($j = 0, \ldots, k$) to zero, yielding a necessary condition for a minimum. This gives a system of $p = k + 1$ equations:
$\sum_{i=1}^{n} x_{ij}\, \psi\!\left(\frac{y_i - x_i'\beta}{s}\right) = 0, \quad j = 0, 1, \ldots, k$    (3.12)

where $\psi = d\rho/d\varepsilon$, $x_{ij}$ is the $i$th observation on the $j$th predictor, and $x_{i0} = 1$.

These equations do not in general have an explicit solution, and iterative methods or nonlinear optimization techniques must be used to solve them. In practice, iteratively reweighted least squares (IRLS) is often used. To use IRLS, first define the weights:

$w_i^{\beta} = \begin{cases} \dfrac{\psi\!\left((y_i - x_i'\beta)/s\right)}{(y_i - x_i'\beta)/s}, & y_i \ne x_i'\beta \\[2ex] 1, & y_i = x_i'\beta \end{cases}$    (3.13)

Then (3.12) becomes:

$\sum_{i=1}^{n} x_{ij}\, w_i^{\beta} (y_i - x_i'\beta) = 0, \quad j = 0, 1, \ldots, k$    (3.14)

which can be written in matrix notation as:

$X' W_\beta X \beta = X' W_\beta y$    (3.15)

where $W_\beta$ is an $n \times n$ diagonal matrix of weights with diagonal elements $w_1^{\beta}, w_2^{\beta}, \ldots, w_n^{\beta}$ given by equation (3.13). Equations (3.14) and (3.15) have the form of the usual least squares normal equations.
To solve equation (3.15), we follow these steps:

1. The least squares method is used to fit an initial model to the data, yielding the initial estimate $\hat{\beta}^{(0)}$ of the regression coefficients.
2. The initial residuals $e_i^{(0)}$ are found using $\hat{\beta}^{(0)}$ and are used to calculate the initial scale $s_0$ (eq. (3.11)).
3. A weight function $w(u)$ is chosen and applied to $e_i^{(0)}/s_0$, and (3.13) is used to obtain the initial weights $W_0$.
4. Using $W_0$, we obtain $\hat{\beta}^{(1)}$ from (3.15).
5. Using $\hat{\beta}^{(1)}$, new residuals $e_i^{(1)}$ are found; calculating $s_1$ and applying the weight function gives $W_1$.
6. $W_1$ is then used to get $\hat{\beta}^{(2)}$, and so on.

Usually several iterations are required to achieve convergence. We can write the iterative solution as:

$\hat{\beta}^{(q+1)} = (X' W_q X)^{-1} X' W_q y, \quad q = 0, 1, 2, \ldots$    (3.16)

The procedure may be stopped when all the estimates change by less than some preset amount, say 0.1% or 0.01%, or after a selected number of iterations.
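Steps 1-6 can be sketched as a minimal IRLS routine; this is an illustrative implementation under stated assumptions (Huber weights with tuning constant a = 1.345, MAD scale as in eq. (3.11)), not a production solver, and the data set is hypothetical.

```python
import numpy as np

def huber_weight(u, a=1.345):
    """Huber weight w(u) = psi(u)/u: 1 for |u| <= a, a/|u| otherwise."""
    au = np.abs(u)
    w = np.ones_like(u)
    mask = au > a
    w[mask] = a / au[mask]
    return w

def irls_m_estimate(X, y, a=1.345, tol=1e-8, max_iter=100):
    """M-estimation via iteratively reweighted least squares (steps 1-6)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # step 1: OLS start
    for _ in range(max_iter):
        e = y - X @ beta                                   # steps 2/5: residuals
        s = np.median(np.abs(e - np.median(e))) / 0.6745   # robust scale, eq. (3.11)
        if s == 0:
            break
        w = huber_weight(e / s, a)                         # step 3: weights, eq. (3.13)
        W = np.diag(w)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # eq. (3.16)
        if np.max(np.abs(beta_new - beta)) < tol:          # stopping rule
            beta = beta_new
            break
        beta = beta_new
    return beta

# Hypothetical data: a clean line y = 1 + 2x with one gross outlier in y.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[9] = 100.0
X = np.column_stack([np.ones_like(x), x])
beta_irls = irls_m_estimate(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

On this data the IRLS estimate stays close to the true line while the OLS slope is pulled far away by the single outlier.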
A number of popular criterion (objective) functions $\rho(u)$ have been proposed, where $u$ denotes the scaled residual. Two of them will be discussed and studied here:

1. Huber's function.
2. Tukey's (bisquare) function.

The objective function $\rho(u)$, its derivative $\psi(u)$, and the weight function $w(u)$ for each criterion, in addition to those of the least squares method, are summarized in table (3.1).
Table (3.1): objective functions $\rho(u)$, influence functions $\psi(u)$, and weight functions $w(u) = \psi(u)/u$ for the Huber, Tukey, and least squares criteria.

Huber's criterion:

$\rho(u) = \begin{cases} \frac{1}{2}u^2, & |u| \le a \\ a|u| - \frac{1}{2}a^2, & |u| > a \end{cases}$
$\psi(u) = \begin{cases} u, & |u| \le a \\ a\,\text{sign}(u), & |u| > a \end{cases}$
$w(u) = \begin{cases} 1, & |u| \le a \\ a/|u|, & |u| > a \end{cases}$

Tukey's (bisquare) criterion:

$\rho(u) = \begin{cases} \frac{b^2}{6}\left(1 - \left(1 - (u/b)^2\right)^3\right), & |u| \le b \\ \frac{1}{6}b^2, & |u| > b \end{cases}$
$\psi(u) = \begin{cases} u\left(1 - (u/b)^2\right)^2, & |u| \le b \\ 0, & |u| > b \end{cases}$
$w(u) = \begin{cases} \left(1 - (u/b)^2\right)^2, & |u| \le b \\ 0, & |u| > b \end{cases}$

Least squares criterion:

$\rho(u) = \frac{1}{2}u^2, \quad \psi(u) = u, \quad w(u) = 1, \quad -\infty < u < \infty$

Robust regression procedures can be classified by the behavior of their $\psi$ function, which is called the influence function. The $\psi$ function controls the weight given to each residual. In the least squares method the $\psi$ function is unbounded, so least squares is not robust when the data follow a heavy-tailed distribution. Huber's criterion has a monotone $\psi$ function and does not weight large residuals as heavily as least squares, while Tukey's influence function is a hard redescender, that is, its $\psi$ function equals zero for sufficiently large $|u|$.

Figures (3.1), (3.2), and (3.3) show the weight functions of the Huber, Tukey, and least squares criteria, respectively. It can be seen that the least squares method gives a weight of 1 to all residuals, large or small, while the other criteria down-weight large residuals in different ways. In addition, the symmetry around $u = 0$ can be noted in each weight function. The constants $a = 1.345$ in the Huber weight function and $b = 4.685$ in the Tukey weight function, called tuning constants, are chosen to make the IRLS robust procedure 95 percent efficient for data generated by the normal error regression model (Kutner, et al., 2005). More about M-estimation can be found in (Hogg and Craig, 2006).

Figure (3.1): plot of the Huber weight function, with tuning constant a = 1.345.
Figure (3.2): plot of Tukey's weight function, with tuning constant b = 4.685.
Figure (3.3): plot of the ordinary least squares weight function.
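The two weight functions of table (3.1) can be sketched directly from their formulas; the checks below confirm the behavior described in the text: full weight near zero, Huber down-weighting proportional to $a/|u|$, Tukey's hard redescent to zero beyond $b$, and symmetry about $u = 0$.

```python
import numpy as np

def huber_w(u, a=1.345):
    """Huber weight function from table (3.1)."""
    u = np.asarray(u, dtype=float)
    au = np.abs(u)
    # np.maximum(au, a) avoids division by zero; for |u| > a it equals |u|.
    return np.where(au <= a, 1.0, a / np.maximum(au, a))

def tukey_w(u, b=4.685):
    """Tukey (bisquare) weight function from table (3.1)."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= b, (1.0 - (u / b)**2)**2, 0.0)
```

Plugging either function into an IRLS loop as the $w(u)$ of step 3 reproduces the corresponding M-estimator.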
(3.3.2) Repeated Median Estimators

The repeated median (RM) method was proposed by Siegel (1982). To understand the repeated median algorithm, we start with the bivariate case ($p = 2$). Given data $(x_1, y_1), \ldots, (x_n, y_n)$ with distinct $x_i$, we wish to predict $y$ from $x$ by selecting a robust straight line of the form $y = \beta_0 + \beta_1 x$.

To estimate $\beta_1$, note that each pair of points determines a line; we denote the slopes of these lines by:

$\hat{\beta}_1(i, j) = \frac{y_j - y_i}{x_j - x_i}$

All possible pairs of points give us $n(n-1)/2$ such estimates of $\beta_1$, which are to be combined in a resistant way. The RM estimate is the result of two stages of medians:

$\hat{\beta}_1 = \text{med}_i\; \text{med}_{j \ne i}\; \hat{\beta}_1(i, j)$    (3.17)

At the first (inner) stage we take the median of the slopes of the $n - 1$ lines passing through a given point $i$ and one other point. At the second (outer) stage we take the median of these $n$ medians; this is why the method is called "repeated median".

There are two ways to estimate $\beta_0$. If $\hat{\beta}_1$ is used, then to each point we can associate the intercept $\hat{\beta}_{0,i} = y_i - \hat{\beta}_1 x_i$. Only a single median is now needed, and the RM estimate is
$\hat{\beta}_0 = \text{med}_i\; \hat{\beta}_{0,i}$    (3.18)

Alternatively, if we do not use $\hat{\beta}_1$, then we need a pair of points for each estimate of $\beta_0$, using $\hat{\beta}_0(i, j) = \frac{x_j y_i - x_i y_j}{x_j - x_i}$. A double (repeated) median is then taken as in (3.17), and the repeated median estimate of $\beta_0$ not using $\hat{\beta}_1$ is:

$\hat{\beta}_0 = \text{med}_i\; \text{med}_{j \ne i}\; \hat{\beta}_0(i, j)$    (3.19)
For the general case, for any $p$ observations with indices $\{i_1, \ldots, i_p\}$ we compute the coefficients $\beta_0(i_1, \ldots, i_p), \ldots, \beta_k(i_1, \ldots, i_p)$, with $p = k + 1$, such that the corresponding surface fits these $p$ points exactly. The $j$th coefficient of the repeated median estimator is then defined as:

$\hat{\beta}_j = \text{med}_{i_1}\left(\text{med}_{i_2}\left(\cdots\left(\text{med}_{i_p}\; \beta_j(i_1, \ldots, i_p)\right)\cdots\right)\right), \quad j = 0, 1, \ldots, k$    (3.20)

where the outer median is over all choices of $i_1$, the next over all choices of $i_2 \ne i_1$, and so on. This estimator can be computed explicitly but requires consideration of all subsets of $p$ points, which may be very time-consuming; it has been successfully applied to problems with small $p$ (Rousseeuw and Leroy, 1987).

The repeated median estimator was the first robust regression method with a 50% breakdown point, and it is robust against both X and Y outliers.
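The bivariate repeated median of equations (3.17) and (3.18) can be sketched in a few lines; the data below are hypothetical, a clean line with one gross outlier, which the double median ignores completely.

```python
import numpy as np

def repeated_median_line(x, y):
    """Repeated median slope and intercept, eqs. (3.17) and (3.18)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    inner = np.empty(n)
    for i in range(n):
        j = np.arange(n) != i
        inner[i] = np.median((y[j] - y[i]) / (x[j] - x[i]))  # med_j slope(i, j)
    b1 = np.median(inner)                                    # outer median, eq. (3.17)
    b0 = np.median(y - b1 * x)                               # eq. (3.18)
    return b0, b1

# Hypothetical data: y = 3 + 0.5x with one gross outlier at the first point.
x = np.arange(1.0, 11.0)
y = 3.0 + 0.5 * x
y[0] = 50.0
b0, b1 = repeated_median_line(x, y)
```

With nine of ten points exactly collinear, both medians recover the true slope 0.5 and intercept 3 despite the outlier.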
(3.3.3) Least Median of Squares

The least median of squares (LMS) estimator is obtained by minimizing the $m$th-ordered squared error:

$\min_\beta\; \text{med}_i\left(\varepsilon_i^2\right)$    (3.21)

where $m$ must be determined; possible choices are $m = \left[\frac{n}{2}\right] + \left[\frac{p+1}{2}\right]$ and $m = \left[\frac{n+1}{2}\right]$, with $n$ and $p$ denoting the sample size and the number of parameters, respectively, and $[\,\cdot\,]$ denoting the integer part of the argument. This estimator was introduced by Rousseeuw (1984). Whereas the least squares method minimizes the sum of the squared errors, the mean is typically not a good estimator of location when there are outliers, and the median is often preferable; therefore, LMS minimizes the median of the squared errors.
To view the LMS estimator geometrically, assume that we have a single explanatory
variable (𝑝 = 2). The LMS fitted model is the equation of the line that is in the center of
the narrowest strip that will cover the majority of the data, with distance measured
vertically.
In general, the advantage of LMS is that it has a high breakdown point, theoretically 50%.
That is, up to half of the data could be anomalous without rendering the regression model
useless (Rousseeuw and Leroy 1987).
Basically, LMS fits just half of the data, so it is possible for outliers that are aligned with good data values to pull the regression model away from the desired equation for the good data. As a result, LMS can perform poorly relative to OLS when OLS is the appropriate criterion. The asymptotic efficiency of LMS is actually zero (Ryan, 1997), because LMS requires that the squared residual be minimized at a particular point, in effect ignoring the fit at the other $n - 1$ observations. As $n$ becomes large, we would expect the fit at these $n - 1$ points to become poor relative to the least squares fit. The finite-sample efficiency of LMS can also be very low.

In an effort to improve the efficiency of LMS, Rousseeuw and Leroy (1987) suggested using the LMS estimates as starting values for computing a one-step M-estimator.
(3.3.4) Least Trimmed Sum of Squares Estimators
Another method developed by Rousseeuw (1984) is the least trimmed sum of squares (LTS)
estimator. Extending the idea of the trimmed mean, the LTS estimator minimizes the trimmed
sum of the squared residuals. The LTS estimator is found by:

min_𝛽 Σ_{𝑖=1}^{ℎ} 𝑒(𝑖)²                (3.22)

where 𝑒(1)² ≤ 𝑒(2)² ≤ … ≤ 𝑒(𝑛)² are the ordered squared residuals, from smallest to
largest (the residuals are first squared and then ordered), and the value of ℎ must be
determined. We might let ℎ = 𝑛/2, so that the estimator has a high breakdown point (50%),
but on the other hand a high breakdown point can sometimes produce poor results in normal
situations. Another choice is ℎ = [𝑛/2] + [(𝑝 + 1)/2], which increases the efficiency of
the estimator. Consequently, it seems preferable to use a larger value of ℎ and to speak of
a trimming percentage 𝛼; Rousseeuw and Leroy (1987) suggest that ℎ might be selected as
ℎ = [𝑛(1 − 𝛼)] + 1.
LTS is intuitively more appealing than LMS, due in part to the fact that the objective
function is not based on the fit at any particular point.
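To make the trimmed criterion concrete: since an LTS optimum coincides with the OLS fit of one of the h-point subsets, tiny problems can be solved by brute force over all subsets. The Python sketch below is illustrative only; practical implementations use the FAST-LTS algorithm instead:

```python
from itertools import combinations

def ols_fit(x, y):
    # Closed-form least squares for a straight line.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def lts_line(x, y, h):
    """Brute-force LTS: minimize the sum of the h smallest squared
    residuals by trying the OLS fit of every h-point subset.  Feasible
    only for tiny n; real implementations use FAST-LTS."""
    best_crit, best_fit = float("inf"), None
    for subset in combinations(range(len(x)), h):
        b0, b1 = ols_fit([x[i] for i in subset], [y[i] for i in subset])
        r2 = sorted((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
        crit = sum(r2[:h])  # criterion (3.22): trimmed sum of squares
        if crit < best_crit:
            best_crit, best_fit = crit, (b0, b1)
    return best_fit
```

When the trimmed subset happens to be exactly the clean points, the result is the OLS fit of the good data, which is the appeal described above.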
Part of the appeal of LTS is that if we are fortunate enough to trim exactly the bad data
points without trimming any good data points, we obtain the optimal estimator (OLS)
applied to the good data points. Currently, LTS is considered the preferred choice of
Rousseeuw and Ryan (Ryan, 1997).
(3.3.5) S-Estimators:
This method was first proposed by Rousseeuw and Yohai (1984), who chose to call these
estimators S-estimators because they are based on estimates of scale. In the same way that
the least squares estimator minimizes the variance of the errors, S-estimators minimize the
dispersion of the residuals, so the estimator 𝛽̂𝑆, in the notation used here, is obtained as:

min_𝛽 𝑠(𝑒1(𝛽), … , 𝑒𝑛(𝛽))                (3.23)

where 𝑒1(𝛽), … , 𝑒𝑛(𝛽) denote the 𝑛 residuals for a given candidate 𝛽, and
𝑠(𝑒1(𝛽), … , 𝑒𝑛(𝛽)) is given by the solution of

(1/𝑛) Σ_{𝑖=1}^{𝑛} 𝜌(𝑒𝑖 ⁄ 𝑠) = 𝑘                (3.24)
where 𝑘 is a constant, and the objective function 𝜌(. ) must be selected with the following
conditions:
1. 𝜌 is symmetric and continuously differentiable, and 𝜌(0) = 0.
2. There exists 𝑐 > 0 such that 𝜌 is strictly increasing on [0, 𝑐] and constant on [𝑐, ∞).
3. 𝑘 ⁄ 𝜌(𝑐) = 1/2.
The second condition on the objective function means that the associated influence function
will be redescending. A possible objective function to use is the one associated with the
Tukey bisquare objective function, given in Table (3.1).
𝜌(𝑥) = 𝑥²/2 − 𝑥⁴/(2𝑐²) + 𝑥⁶/(6𝑐⁴)  for |𝑥| ≤ 𝑐,  and  𝜌(𝑥) = 𝑐²/6  for |𝑥| > 𝑐.
The third condition is required to obtain a breakdown point of 50%. Often, 𝑘 is chosen so
that the resulting 𝑠 is an estimator of σ when the errors are normally distributed. To
achieve this, 𝑘 is set to 𝐸𝜙(𝜌(𝑢)), the expected value of the objective function when 𝑢 is
assumed to have a standard normal distribution (Rousseeuw and Leroy, 1987).
The asymptotic efficiency of this class of estimators depends on the objective function by
which they are defined; the tuning constants of this function cannot be chosen to give the
estimator simultaneously high breakdown point and high asymptotic efficiency.
When using the Tukey bisquare objective function, Rousseeuw and Yohai (1984) state that
setting 𝑐 = 1.548 satisfies the third condition, and so results in an S-estimator with 50%
breakdown point and about 28% asymptotic efficiency.
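The constants quoted above can be checked numerically: with the Tukey bisquare objective and c = 1.548, setting k = 𝐸𝜙(𝜌(u)) makes the ratio k/𝜌(c) come out very close to 1/2, the 50% breakdown condition. A small self-contained Python check (trapezoidal integration against the standard normal density; function names are ours):

```python
import math

def rho_bisquare(u, c):
    # Tukey bisquare objective: increasing on [0, c], constant beyond c.
    if abs(u) <= c:
        return u**2 / 2 - u**4 / (2 * c**2) + u**6 / (6 * c**4)
    return c**2 / 6

def expected_rho(c, half_width=8.0, steps=100_000):
    # k = E[rho(u)] for u ~ N(0, 1), by the trapezoidal rule.
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        u = -half_width + i * h
        phi = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * rho_bisquare(u, c) * phi
    return total * h

c = 1.548
k = expected_rho(c)
ratio = k / rho_bisquare(c, c)   # should be very close to 1/2
```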
A trade-off between breakdown and efficiency is possible through the choice of the tuning
constants 𝑐 and 𝑘; for example, choosing 𝑐 = 5.182 and 𝑘 = 0.4475 gives an asymptotic
efficiency of 96.6% but only a 10% breakdown point (Rousseeuw and Leroy, 1987).
The final scale estimate, s, is the standard deviation of the residuals from the fit that
minimized the dispersion of the residuals.
(3.3.6) MM-Estimation:
MM-estimation was first proposed by Yohai (1987); it combines a high breakdown point (50%)
with good efficiency. The "MM" in the name refers to the fact that more than one
M-estimation procedure is used to calculate the final estimates. As in the M-estimation
case, iteratively reweighted least squares (IRLS) is employed to find the estimates. The
procedure is as follows:
1. Initial estimates of the coefficients 𝛽 and the corresponding residuals 𝑒 are taken from a
highly resistant regression (i.e., a regression with a breakdown point of 50%). It is not
necessary that this estimator be efficient; as a result, S-estimation with Huber or
bisquare weights is typically employed at this stage, and these initial estimates of 𝛽 are
carried forward to the next steps.
2. The residuals 𝑒 from the initial estimation in step 1 are used to compute an M-estimate
of the scale of the residuals, 𝑠𝑛.
3. The initial residuals 𝑒 from step 1 and the residual scale 𝑠𝑛 from step 2 are used in
the first iteration of reweighted least squares to determine the M-estimates of the
regression coefficients, so an MM-estimator 𝛽̂ is defined as a solution to:

Σ_{𝑖=1}^{𝑛} 𝑤𝑖(𝑒𝑖 ⁄ 𝑠𝑛) 𝑥𝑖 = 0                (3.25)

where the 𝑤𝑖 are typically Huber or Tukey bisquare weights.
4. New weights 𝑤𝑖 are calculated using the residuals from the iteration in step 3.
5. Keeping the measure of the scale of the residuals 𝑠𝑛 fixed at its value from step 2,
steps 3 and 4 are iterated until convergence.
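Steps 3 to 5 amount to iteratively reweighted least squares with the residual scale held fixed. The Python sketch below illustrates the idea for a straight-line fit with Huber weights and a MAD scale; it is a simplified stand-in, since a genuine MM fit would start from a 50%-breakdown S-estimate rather than from OLS:

```python
from statistics import median

def huber_w(u, a=1.345):
    # Huber weight: 1 for small scaled residuals, a/|u| beyond a.
    return 1.0 if abs(u) <= a else a / abs(u)

def wls_fit(x, y, w):
    # Weighted least squares for a straight line.
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b1 * mx, b1

def irls_line(x, y, tol=1e-8, max_iter=100):
    """IRLS straight-line fit with Huber weights and a fixed MAD scale,
    mimicking steps 3-5 above.  Simplified sketch: the starting values
    come from OLS instead of a 50%-breakdown S-estimate."""
    b0, b1 = wls_fit(x, y, [1.0] * len(x))       # step-1 stand-in
    r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s = median(abs(ri) for ri in r) / 0.6745     # step 2: MAD scale
    if s == 0:
        return b0, b1
    for _ in range(max_iter):                    # steps 3-5
        w = [huber_w(ri / s) for ri in r]
        nb0, nb1 = wls_fit(x, y, w)
        done = abs(nb0 - b0) < tol and abs(nb1 - b1) < tol
        b0, b1 = nb0, nb1
        r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        if done:
            break
    return b0, b1
```

Even this simplified version pulls the fit much closer to the clean-data line than OLS when a single y-outlier is present.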
Although MM-estimation aims to obtain estimates that combine a high breakdown value with
high efficiency, MM-estimates can have trouble with high-leverage outliers in small- to
moderate-dimension data (Alma, 2011).
46
Chapter Four: Comparison among Robust Regression
Methods
In order to compare the robust methods discussed in Chapter Three, we consider two studies.
The first is based on real-life data that contain outliers, on which a comparison of these
robust methods is carried out. The second is a simulation study, in which different
scenarios are used to compare the robust regression methods and their properties are
discussed.
(4.1) Real-life data:
We refer to our example in Chapter Two; the data set is shown in Table (2.1), page 23, and
the plot of the data is shown in Figure (2.3), page 24. In this example we found that
observations 1 and 4 (Germany and the United Kingdom) were classified as influential
observations that can have a large effect on the regression fit.
Now, we find the estimates of the parameters 𝛽0 and 𝛽1 in the proposed model 𝑦 = 𝛽0 +
𝛽1𝑥 + 𝜀 using the ordinary least squares method (OLS) and using the robust methods
discussed in Chapter 3. The estimates are found for two cases: one for all 10 observations,
including the influential points, and the other with observations 1 and 4 removed
(8 observations).
The fitted models are compared to the OLS model with and without the influential points;
each fitted model is plotted against the least squares model, and hence the effect of the
unusual observations on each model can be seen. In addition, the efficiency of each robust
method can be judged from the performance of its model in the clean-data case
(8 observations). Plots of these models are provided in Figures (4.1) to (4.7).
The calculations were carried out using the program R 3.2.2. For the M-estimation with
Huber weights, we used the tuning constant a = 1.345 and a convergence accuracy of 0.001;
the estimates converged after 20 iterations (in M-H(10)) and 13 iterations (in M-H(8)).
For the Tukey weights, we used the tuning constant b = 4.685 and a convergence accuracy of
0.001; the estimates converged after 11 iterations (in M-T(10)) and 13 iterations
(in M-T(8)).
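For reference, the two weight functions behind these tuning constants can be written in a few lines (shown here as an illustrative Python sketch; the function names are ours): Huber weights with a = 1.345 and Tukey bisquare weights with b = 4.685.

```python
def huber_weight(u, a=1.345):
    # Full weight inside [-a, a]; hyperbolic decay outside.
    return 1.0 if abs(u) <= a else a / abs(u)

def tukey_weight(u, b=4.685):
    # Smooth descent to zero; scaled residuals beyond b get weight 0.
    return (1 - (u / b) ** 2) ** 2 if abs(u) <= b else 0.0
```

Both take the residual scaled by the residual scale estimate; Huber never fully discards a point, while the bisquare rejects scaled residuals beyond b outright.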
Method    𝛽̂0    𝛽̂1
OLS (10) −0.3139 0.4004
OLS(8) 4.6101 0.3083
M-H(10) 2.9593 0.3298
M-H (8) 4.4053 0.2881
M-T(10) 6.6582 0.2282
M-T (8) 4.2220 0.2846
RM (10) 1.5333 0.3727
RM (8) 1.5333 0.3727
LMS (10) 0.5500 0.3667
LMS (8) 0.8966 0.3678
LTS (10) 4.6101 0.3083
LTS (8) 1.9693 0.3584
S (10) 7.0645 0.2305
S (8) 3.7508 0.3128
MM (10) 7.0706 0.2307
MM (8) 4.2297 0.2860
Table (4.1): the estimated parameters for the 7 robust methods and OLS in the Steel
Employment data set containing outliers (10 observations) and in the clean data
(8 observations).
For the LTS estimates, we used ℎ = [𝑛/2] + [(𝑝 + 1)/2] = 5 + 1 = 6. For the LMS
estimates we used 𝑚 = [(𝑛 + 1)/2] = 5. Finally, the Tukey bisquare objective function was
used for the S-estimates and MM-estimates, with c = 1.548; the MM-estimates converged
after 5 iterations (in MM(10)) and 12 iterations (in MM(8)).
Figure (4.1): fitted lines for OLS(10) (solid) versus M-H(10) (dashed), and OLS(8) (solid) versus M-H(8) (dashed), in the steel employment example.
Figure (4.2): fitted lines for OLS(10) (solid) versus M-T(10) (dashed), and OLS(8) (solid) versus M-T(8) (dashed), in the steel employment example.
Figure (4.3): fitted lines for OLS(10) (solid) versus RM(10) (dashed), and OLS(8) (solid) versus RM(8) (dashed), in the steel employment example.
Figure (4.4): fitted lines for OLS(10) (solid) versus LMS(10) (dashed), and OLS(8) (solid) versus LMS(8) (dashed), in the steel employment example.
Figure (4.5): fitted lines for OLS(10) (solid) versus LTS(10) (dashed), and OLS(8) (solid) versus LTS(8) (dashed), in the steel employment example.
In order to explore the performance of each robust method and of OLS in the presence of
influential points, fitted values and residuals (and weights for M-H, M-T, and MM) are
calculated and summarized in Tables (4.2) to (4.9). The results will be discussed after
the simulation study.
Figure (4.6): fitted lines for OLS(10) (solid) versus S(10) (dashed), and OLS(8) (solid) versus S(8) (dashed), in the steel employment example.
Figure (4.7): fitted lines for OLS(10) (solid) versus MM(10) (dashed), and OLS(8) (solid) versus MM(8) (dashed), in the steel employment example.
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 92.5746957 39.4253043 1
50 38.1227863 11.8772137 1
43 62.9464509 -19.9464509 1
41 77.3601916 -36.3601916 1
33 35.3201145 -2.3201145 1
25 25.3105723 -0.3105723 1
16 9.6956866 6.3043134 1
8 8.8949232 -0.8949232 1
3 1.2876712 1.7123288 1
1 0.4869078 0.5130922 1
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 79.4854 52.5146 0.1408
50 34.6253 15.3747 0.4809
43 55.0762 -12.0762 0.6105
41 66.9509 -25.9509 0.2843
33 32.3163 0.6837 1
25 24.0699 0.9300 1
16 11.2057 4.7943 1
8 10.5460 -2.5460 1
3 4.2787 -1.2787 1
1 3.6190 -2.6190 1
Table (4.2): fitted values, residuals, and weights of OLS(10)
Table (4.3): fitted values, residuals, and weights of M-H(10)
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 59.611592 72.3884082 0
50 28.569936 21.4300635 0.4371
43 42.721279 0.2787207 1
41 50.938188 -9.9381881 0.8596
33 26.972204 6.0277958 0.9471
25 21.266018 3.7339824 0.9795
16 12.364366 3.6356335 0.9806
8 11.907872 -3.9078715 0.9776
3 7.571170 -4.5711697 0.9694
1 7.114675 -6.1146747 0.9456
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 88.006061 43.99393939
50 37.315152 12.68484848
43 60.424242 -17.42424242
41 73.842424 -32.84242424
33 34.706061 -1.70606061
25 25.387879 -0.38787879
16 10.851515 5.14848485
8 10.106061 -2.10606061
3 3.024242 -0.02424242
1 2.278788 -1.27878788
Table (4.4): fitted values, residuals, and weights of M-T(10)
Table (4.5): fitted values and residuals of RM(10)
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 85.616667 46.3833333
50 35.750000 14.2500000
43 58.483333 -15.4833333
41 71.683333 -30.6833333
33 33.183333 -0.1833333
25 24.016667 0.9833333
16 9.716667 6.2833333
8 8.983333 -0.9833333
3 2.016667 0.9833333
1 1.283333 -0.2833333
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 76.132078 55.8679219
50 34.205411 15.7945893
43 53.319038 -10.3190385
41 64.417274 -23.4172740
33 32.047421 0.9525795
25 24.340313 0.6596875
16 12.317224 3.6827759
8 11.700655 -3.7006555
3 5.843253 -2.8432534
1 5.226685 -4.2266848
Table (4.6): fitted values and residuals of LMS(10)
Table (4.7): fitted values and residuals of LTS(10)
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 59.461828 72.5381724
50 27.672119 22.3278811
43 42.164486 0.8355139
41 50.579409 -9.5794090
33 26.035884 6.9641161
25 20.192187 4.8078125
16 11.076021 4.9239790
8 10.608525 -2.6085253
3 6.167316 -3.1673160
1 5.699820 -4.6998203
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 60.598189 71.4018113 0
50 29.219942 20.7800582 0.6911654
43 43.524731 -0.5247309 0.9997857
41 51.830737 -10.8307374 0.9104973
33 27.604885 5.3951150 0.9773915
25 21.836825 3.1631751 0.9921988
16 12.838651 3.1613488 0.9922088
8 12.377206 -4.3772064 0.9850926
3 7.993481 -4.9934807 0.9806204
1 7.532036 -6.5320359 0.9669535
Table (4.8): fitted values and residuals of S(10)
Table (4.9): fitted values, residuals, and weights of MM(10)
(4.2) Simulation study
In this section, we introduce a simulation study carried out to illustrate the robustness
of the estimators under different conditions. Simulation was used to compare the mean
squared errors (MSE) and the relative mean squared errors (RMSE) of the estimates of the
regression parameters for each estimation method, where:
MSE(𝛽̂) = (1/𝑚) Σ_{𝑖=1}^{𝑚} (𝛽̂𝑖 − 𝛽)²                (4.26)
where m is the number of replications, 𝛽 is the true parameter, and �̂� is the estimated
parameter.
RMSE(𝛽̂) = [MSE(𝛽̂_OLS) − MSE(𝛽̂_other method)] / MSE(𝛽̂_OLS)                (4.27)
where 𝑀𝑆𝐸(�̂�𝑂𝐿𝑆) stands for mean squared errors of the estimated parameter �̂� using OLS,
and 𝑀𝑆𝐸(�̂�𝑜𝑡ℎ𝑒𝑟 𝑚𝑒𝑡ℎ𝑜𝑑) stands for the mean squared errors of the estimated parameter �̂�
using the other methods.
The relative mean squared error is used as a measure of the quality of parameter
estimation. In fact, RMSE can be interpreted as a proportionate change from a baseline,
using the MSE of the OLS estimator within a given data condition as the baseline value.
Positive values of RMSE indicate a proportional reduction in the MSE of a given estimator
with respect to OLS estimation. Hence, RMSE is a relative measure of performance above and
beyond that of the OLS estimator (Nevitt and Tam, 1998).
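Equations (4.26) and (4.27) translate directly into code. A minimal Python sketch, assuming the replicated estimates are collected in plain lists:

```python
def mse(estimates, true_beta):
    # Equation (4.26): mean squared error over m replicates.
    m = len(estimates)
    return sum((b - true_beta) ** 2 for b in estimates) / m

def rmse(estimates_method, estimates_ols, true_beta):
    # Equation (4.27): proportional change in MSE relative to OLS.
    # Positive values mean the method improves on OLS.
    mse_ols = mse(estimates_ols, true_beta)
    return (mse_ols - mse(estimates_method, true_beta)) / mse_ols
```

An RMSE of 1 means the method's MSE is zero relative to OLS, while negative values mean the method does worse than OLS.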
We explore eight different regression estimators: ordinary least squares (OLS),
M-estimation using Huber weights (M-H), M-estimation using Tukey weights (M-T), repeated
medians (RM), least median of squares (LMS), least trimmed squares (LTS), the S-estimate,
and the MM-estimate. The program R 3.2.2 is used, and the same criteria that were
described in the last section are used here as well.
This simulation study is performed for sample sizes n =20, and n =100, to compare the
performance of these eight methods:
The model: we generate n samples {(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑛, 𝑦𝑛)} from the model
𝑦 = 3 + 2𝑥 + 𝜖
where 𝑥 ~ 𝑈𝑛𝑖𝑓𝑜𝑟𝑚 (−5,5), with the following cases:
Case 1: ϵ ~ N(0, 1) (standard normal distribution).
Case 2: ϵ ~ N(0, 1) with 10% identical outliers in the y direction (where we let the first
10% of the y's equal 40).
Case 3: ϵ ~ N(0, 1) with 25% identical outliers in the y direction (where we let the first
25% of the y's equal 40).
Case 4: ϵ ~ N(0, 1) with 10% identical high-leverage outliers (where we let the first 10%
of the x's equal 30).
Case 5: ϵ ~ N(0, 1) with 25% identical high-leverage outliers (where we let the first 25%
of the x's equal 30).
Case 6: ϵ ~ 0.90 N(0, 1) + 0.10 N(0, 100) (contaminated normal mixture).
Case 7: ϵ ~ Laplace(0, 4) (double exponential distribution, with mean = 0 and scale = 4).
Case 8: ϵ ~ 𝑡3 (t-distribution with 3 degrees of freedom).
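A sketch of the data-generating scheme for a few of the cases above (an illustrative Python stand-in for the R code; in the leverage case we keep the y values at their original model values, which is one reading of "the first 10% of x's equal to 30" and is consistent with the degraded OLS slopes in Tables (4.16) to (4.19)):

```python
import random

def make_replicate(n, case, rng):
    """One simulated sample from y = 3 + 2x + e, x ~ Uniform(-5, 5).
    Handles cases 1, 2, 4, and 6 of the scheme above; the remaining
    cases follow the same pattern with other error distributions."""
    x = [rng.uniform(-5, 5) for _ in range(n)]
    y = [3 + 2 * xi + rng.gauss(0, 1) for xi in x]
    k = n // 10                       # 10% contamination fraction
    if case == 2:                     # identical outliers in y
        for i in range(k):
            y[i] = 40
    elif case == 4:                   # identical high-leverage x values
        for i in range(k):
            x[i] = 30                 # y kept at its original value
    elif case == 6:                   # 0.90 N(0,1) + 0.10 N(0,100)
        # N(0, 100) read as variance 100, i.e. standard deviation 10.
        y = [3 + 2 * xi + rng.gauss(0, 10 if rng.random() < 0.1 else 1)
             for xi in x]
    return x, y
```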
Tables (4.10) to (4.25) report the estimates of the simulated parameters 𝛽̂ and their MSE
and RMSE for each estimation method, with sample sizes n = 20 and n = 100. The number of
replicates is 1000, and the true model is y = 3 + 2x + ϵ.
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.995978 0.04583142 0 1.999841 0.0048380 0
M-H 2.995145 0.04911524 -0.07164994 2.000145 0.0050053 -0.03459696
M-T 2.995071 0.05135233 -0.12046105 2.000573 0.0051709 -0.06882508
RM 2.992821 0.07895747 -0.72278011 1.999981 0.0064337 -0.32983238
LMS 2.992933 0.23095011 -4.03912151 2.007179 0.0217917 -3.50428522
LTS 2.995637 0.08152247 -0.77874623 2.003723 0.0083696 -0.72997397
S 2.994282 0.13839046 -2.01955409 2.002924 0.0141564 -1.92609450
MM 2.994530 0.05029169 -0.09731885 2.000454 0.0051294 -0.06023841
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.994201 0.01054727 0 2.000185 0.0013832 0
M-H 2.994965 0.01116005 -0.05809806 2.000301 0.0014644 -0.05865381
M-T 2.995026 0.01122873 -0.06460928 2.000321 0.0014717 -0.06393916
RM 2.994757 0.01576845 -0.49502622 2.000446 0.0019903 -0.43881233
LMS 2.993902 0.06875218 -5.51847774 1.999665 0.0086844 -5.27808152
LTS 2.994125 0.01521376 -0.44243480 2.001154 0.0020216 -0.46147568
S 2.999137 0.03217913 -2.05094290 2.000461 0.0045241 -2.27058359
MM 2.994909 0.01118382 -0.06035128 2.000339 0.0014734 -0.06516871
Table (4.10): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 1: ϵ ~ N(0, 1).
Table (4.11): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 1: ϵ ~ N(0, 1).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 6.249055 10.60751701 0 2.209871 0.0510380 0
M-H 3.177105 0.09552916 0.9909942 2.021826 0.0090712 0.8222659
M-T 3.003997 0.06468554 0.9939019 2.000270 0.0087058 0.8294253
RM 3.165193 0.12628024 0.9880952 2.005892 0.0133337 0.7387498
LMS 2.992060 0.24748724 0.9766687 2.015049 0.0339619 0.3345769
LTS 3.006635 0.08264286 0.9922090 2.002126 0.0121035 0.7628529
S 3.001087 0.13283632 0.9874772 2.003805 0.0186237 0.6351005
MM 3.004187 0.06304418 0.9940567 2.000396 0.0085174 0.8331166
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 6.902860 15.24168571 0 1.327680 0.4531280 0
M-H 3.200050 0.05203813 0.9965858 1.975971 0.0020372 0.9955041
M-T 3.000867 0.01184571 0.9992228 2.001818 0.0014626 0.9967721
RM 3.144881 0.03844485 0.9974777 1.986661 0.0021626 0.9952273
LMS 3.007946 0.06356390 0.9958296 1.999836 0.0076666 0.9830806
LTS 3.001091 0.01324594 0.9991309 2.002061 0.0016720 0.9963099
S 3.001456 0.02861111 0.9981228 2.002527 0.0035112 0.9922510
MM 3.000773 0.01180496 0.9992255 2.001805 0.0014594 0.9967791
Table (4.12): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 2: ϵ ~ N(0, 1) with 10% identical outliers in the y direction (the first 10% of the y's set to 40).
Table (4.13): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 2: ϵ ~ N(0, 1) with 10% identical outliers in the y direction (the first 10% of the y's set to 40).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 12.07198 82.3398961 0 1.599496 0.1635614 0
M-H 3.929834 1.00614249 0.9877806 2.047522 0.0099448 0.9391979
M-T 3.013318 0.07005663 0.9991492 2.001874 0.0074642 0.9543644
RM 3.425122 0.30628199 0.9962803 2.035532 0.0114673 0.9298898
LMS 3.009960 0.20560661 0.9975030 2.003935 0.0189918 0.8838856
LTS 3.010615 0.07204460 0.9991250 2.002237 0.0077773 0.9524500
S 3.013494 0.10521390 0.9987222 2.000408 0.0106107 0.9351269
MM 3.012502 0.06928318 0.9991586 2.001979 0.0073938 0.9547946
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 12.344139 87.320997 0 1.361525 0.408577 0
M-H 3.946737 0.9280973 0.9893714 1.977289 0.002140 0.9947616
M-T 3.000979 0.0145881 0.9998329 1.998911 0.001625 0.9960215
RM 3.478967 0.2537359 0.9970942 1.967613 0.003292 0.9919410
LMS 3.002669 0.0568738 0.9993487 2.000991 0.006130 0.9849964
LTS 3.000531 0.0149311 0.9998290 1.998954 0.001644 0.9959756
S 3.003359 0.0234364 0.9997316 1.999011 0.002554 0.9937489
MM 3.000738 0.0145242 0.9998337 1.998913 0.001619 0.9960355
Table (4.14): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 3: ϵ ~ N(0, 1) with 25% identical outliers in the y direction (the first 25% of the y's set to 40).
Table (4.15): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 3: ϵ ~ N(0, 1) with 25% identical outliers in the y direction (the first 25% of the y's set to 40).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 1.288070 2.98099470 0 0.244617 3.0814555 0
M-H 1.089816 3.70799281 -0.2438777 0.237815 3.1053989 -0.00777016
M-T 1.126466 3.56746045 -0.1967349 0.232027 3.1258236 -0.01439843
RM 2.849285 0.13472359 0.9548058 1.915814 0.0190538 0.993816596
LMS 2.993807 0.24455836 0.9179608 2.005190 0.0334030 0.989159975
LTS 2.990957 0.07803726 0.9738217 1.999392 0.0113597 0.996313501
S 2.992096 0.14175331 0.9524476 1.999533 0.0186715 0.993940688
MM 2.996869 0.06041228 0.9797342 2.002878 0.0082177 0.997333146
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.085423 0.84655160 0 0.236196 3.1110229 0
M-H 2.028269 0.95464874 -0.1276911 0.234379 3.1174351 -0.00206110
M-T 2.012842 0.98480013 -0.1633079 0.227152 3.1430088 -0.01028145
RM 2.918987 0.02560080 0.9697587 1.931363 0.0066212 0.997871677
LMS 2.993970 0.06617510 0.9218298 2.003158 0.0077459 0.997510175
LTS 2.995040 0.01253096 0.9851976 2.003321 0.0016736 0.999462027
S 2.991978 0.02953308 0.9651137 2.003435 0.0038285 0.998769369
MM 2.995322 0.01083792 0.9871976 2.002815 0.0014156 0.999544955
Table (4.16): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 4: ϵ ~ N(0, 1) with 10% identical outliers in the x direction (the first 10% of the x's set to 30).
Table (4.17): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 4: ϵ ~ N(0, 1) with 10% identical outliers in the x direction (the first 10% of the x's set to 30).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.096356 0.88028827 0 0.154525 3.4058541 0
M-H 2.058576 0.95462952 -0.08445102 0.154592 3.4056108 7.1446e-05
M-T 2.059764 0.95091454 -0.08023084 0.149945 3.4227817 -4.97013e-03
RM 3.006138 0.27467702 0.68796924 1.753267 0.0767181 9.77474e-01
LMS 3.031105 2.17743042 -1.47354247 1.885200 0.2465114 9.27621e-01
LTS 3.095153 0.89125082 -0.01245336 1.949623 0.1165513 9.65779e-01
S 2.998620 0.11055581 0.87440954 2.001756 0.0119567 9.96489e-01
MM 3.011508 0.06661944 0.92432088 2.000884 0.0072728 9.97864e-01
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 3.135412 0.03200837 0 0.112413 3.5629991 0
M-H 3.211086 0.06459291 -1.0180009 0.097751 3.6185800 -0.01559946
M-T 3.201407 0.06087894 -0.9019697 0.096080 3.6249452 -0.01738593
RM 3.428996 0.21501995 -5.7176173 1.748848 0.0662590 0.98140356
LMS 2.993440 0.05512972 -0.7223533 2.000750 0.0064637 0.99818588
LTS 2.997902 0.01401274 0.5622163 2.000876 0.0016153 0.99954662
S 2.998735 0.02225039 0.3048571 2.001040 0.0026848 0.99924646
MM 2.997181 0.01375233 0.5703519 2.000983 0.0015690 0.99955964
Table (4.18): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 5: ϵ ~ N(0, 1) with 25% identical outliers in the x direction (the first 25% of the x's set to 30).
Table (4.19): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 5: ϵ ~ N(0, 1) with 25% identical outliers in the x direction (the first 25% of the x's set to 30).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.701215 48.8654771 0 2.099622 5.8676453 0
M-H 2.997033 0.09496866 0.9980565 2.007969 0.0108911 0.9981439
M-T 3.002315 0.06410098 0.9986882 2.005798 0.0072504 0.9987643
RM 3.002522 0.10835005 0.9977827 2.010181 0.0113361 0.9980680
LMS 3.001221 0.24580465 0.9949698 2.007500 0.0260866 0.9955542
LTS 3.008946 0.08172036 0.9983276 2.005251 0.0096179 0.9983608
S 3.004459 0.13328527 0.9972724 2.011206 0.0149691 0.9974489
MM 3.002500 0.00630079 0.9987050 2.006079 0.0071685 0.9987783
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.884353 9.72510518 0 2.047791 0.92958906 0
M-H 2.998212 0.01604806 0.9983498 2.001620 0.00162942 0.9982472
M-T 2.999628 0.01251891 0.9987127 2.000164 0.00120415 0.9987046
RM 2.998187 0.02138542 0.9978010 2.002489 0.00183232 0.9980289
LMS 3.007326 0.07177498 0.9926196 2.000362 0.00600956 0.9935352
LTS 2.999713 0.01426245 0.9985334 2.000604 0.00136659 0.9985299
S 2.999414 0.03015062 0.9968997 2.001449 0.00288640 0.9968950
MM 2.999562 0.01247196 0.9987175 2.000160 0.00120070 0.9987084
Table (4.20): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 6: ϵ ~ 0.90 N(0, 1) + 0.10 N(0, 100).
Table (4.21): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 6: ϵ ~ 0.90 N(0, 1) + 0.10 N(0, 100).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.977393 1.571826 0 2.014600 0.1959045 0
M-H 2.968720 1.227067 0.21933656 2.006962 0.1543497 0.21211768
M-T 2.965907 1.245367 0.20769422 2.003825 0.1604707 0.18087269
RM 2.988083 1.517493 0.03456703 2.001139 0.1688664 0.13801660
LMS 2.976926 3.119543 -0.98466147 2.007580 0.4165185 -1.12613043
LTS 2.950252 1.540569 0.01988604 2.006313 0.1992438 -0.01704538
S 3.004480 1.940194 -0.23435647 1.986397 0.2623846 -0.33934969
MM 2.969494 1.227378 0.21913853 2.007090 0.1574824 0.19612649
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 3.017684 0.3425886 0 2.003316 0.0388032 0
M-H 3.018909 0.2481231 0.2757403 2.002451 0.0281271 0.2751344
M-T 3.022460 0.2482821 0.2752763 2.001754 0.0288226 0.2572122
RM 3.008481 0.2234773 0.3476801 1.997753 0.0262700 0.3229953
LMS 3.021568 0.6906369 -1.0159366 1.993617 0.0751224 -0.9359819
LTS 3.013680 0.2493054 0.2722891 1.996493 0.0288520 0.2564533
S 3.004208 0.2896170 0.1546215 1.992827 0.0347323 0.1049114
MM 3.020064 0.2473662 0.2779497 2.001668 0.0287226 0.2597878
Table (4.22): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 7: ϵ ~ Laplace(0, 4).
Table (4.23): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 7: ϵ ~ Laplace(0, 4).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 3.001632 0.15626264 0 2.002546 0.01717085 0
M-H 3.008082 0.09657450 0.38197320 2.002850 0.01210816 0.2948419
M-T 3.010029 0.09422788 0.39699032 2.003081 0.01234496 0.2810516
RM 3.004152 0.12667501 0.18934547 2.004627 0.01392421 0.1890786
LMS 3.008877 0.10430774 0.33248445 2.002809 0.01323696 0.2291026
LTS 3.013229 0.13166488 0.15741288 2.003121 0.01901876 -0.1076190
S 3.011049 0.16906975 -0.08195891 2.004820 0.02258645 -0.3153952
MM 3.010296 0.09390307 0.39906896 2.002237 0.01229587 0.2839105
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.995572 0.03194884 0 1.997565 0.00332865 0
M-H 2.997807 0.01592158 0.5016541 1.999147 0.00191222 0.42552514
M-T 2.998543 0.01556788 0.5127249 1.999688 0.00191913 0.42344990
RM 3.000725 0.01919937 0.3990591 1.999239 0.00210048 0.36897003
LMS 2.998201 0.01721277 0.4612396 1.998625 0.00204566 0.38543730
LTS 3.000598 0.01754478 0.4508477 1.999787 0.00229034 0.31193150
S 3.003417 0.02694119 0.1567397 2.000963 0.00357390 -0.07367771
MM 2.998975 0.01561128 0.5113663 1.999643 0.00191381 0.42504901
Table (4.24): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 8: ϵ ~ 𝑡3.
Table (4.25): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 8: ϵ ~ 𝑡3.
(4.3) Discussion
Under ideal conditions (normal error distribution, no contamination), the first two
tables, (4.10) and (4.11), show that the best estimates are the ordinary least squares
(OLS) estimates, as expected. However, the M-estimates with the Huber and Tukey criteria
and the MM-estimates are very close to the OLS estimates, which indicates that they have
high efficiency. The repeated median (RM) performed reasonably well, so it has good
efficiency. The RMSE values of the LTS estimates improved when n increased from 20 to 100,
but LTS is not efficient in the normal case. In fact, the poor performance of the
remaining estimates, LMS and S, is apparent when they are compared to the other methods,
and the corresponding RMSE values confirm this result.
In the other tables, as expected, the least squares estimates perform poorly in all
contamination scenarios.
For Y outliers at both 10% and 25% (Tables (4.12) to (4.15)), the robust estimators in
general have good estimated coefficients and positive RMSE values; in particular, the
M-H, M-T, MM, RM, and LTS estimates perform strongly.
For X outliers at both 10% and 25% (Tables (4.16) to (4.19)), the results are different:
the poor performance of M-H and M-T shows in their negative RMSE values, while the other
methods performed well; the MM-estimates and LTS estimates are considered the best in the
presence of X outliers.
For Tables (4.20) and (4.21), which contain the contaminated normal distribution, all
methods perform well, with an advantage to the MM- and M-T estimates.
For the heavy-tailed distributions (Laplace(0, 4) and 𝑡3) in Tables (4.22) to (4.25),
M-H is considered the best estimator, while the MM-estimates and M-T also performed well
under these distributions.
Overall, the best estimates are the MM-estimates, which have the smallest MSE in most
scenarios, for both Y-outliers and X-outliers (leverage points), and for both sample sizes
(n = 20, n = 100), although they are not the most appropriate choice when the errors
follow a Laplace distribution.
M-estimators with Huber and Tukey weights are not sensitive to Y-outliers, but they are
highly sensitive to leverage points (X-outliers). Repeated median estimation is not
sensitive to either Y-outliers or X-outliers, and it is competitive across scenarios.
S-estimation and the LTS method are robust to both X and Y outliers, and the S-estimates
perform well when the sample size is large (n = 100). LMS estimation is not sensitive to
X- or Y-outliers, but it is not a competitive method when compared to the others.
If we refer to the real data set of the Steel Employment example discussed in the last
section, we find that the robust regression methods can be classified into two sets: the
first contains estimators that use all available data but resist outliers by downweighting
their effects (M-H, M-T, and MM); the second contains estimators that ignore outliers and
do not use all available data (RM, LMS, LTS, and S). Our results in the real data example
and in the simulation study suggest that the methods which use all observations provide
more accurate estimates of the true population values. Hence, M-estimators and
MM-estimators are recommended.
Conclusion
The usual assumption for a linear regression model is that the error terms have a normal
distribution, under which the OLS estimation procedure gives optimal results. However, in
real life it is rare to find a data set that fully satisfies the normality assumption. The
poor performance of OLS estimation under contaminated data conditions and non-normal error
distributions underlines both the importance of assessing the underlying assumptions as
part of any regression analysis and the need for alternatives to OLS regression.
Looking at the summary tables for the estimators, OLS can be badly distorted by even a
single outlying observation in the data. Not only is detection of these influential points
important, but using regression methods that are less sensitive to them is even more
important in regression analysis. The poor performance of the OLS estimators in the
presence of outliers confirms the need for alternative methods.
In this thesis, seven robust regression methods were comparatively evaluated against OLS
regression method. In general, as expected, robust estimators have better performance than
ordinary least squares method (OLS) in presence of outliers.
The best estimator is the MM-estimator, which combines a high breakdown point with
high efficiency and performs well across the different outlier scenarios.
The repeated median method is also very robust; it has a high breakdown point and good
efficiency, and performs well with outliers in both the X and Y spaces.
M-estimation is the best method when only Y-outliers are present, and it is also highly
efficient when the data contain no outliers.
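This behavior of M-estimation under Y-outliers can be illustrated with a small sketch (ours, not code from the thesis): the following Python fragment fits a Huber M-estimator by iteratively reweighted least squares (IRLS) on simulated data contaminated with gross Y-outliers and compares it with OLS. The tuning constant c = 1.345 and the MAD-based scale estimate are standard choices; all names (`huber_m_fit`, etc.) are our own.

```python
import numpy as np

def huber_weights(u, c=1.345):
    """Huber weight function: 1 for small standardized residuals, c/|u| beyond c."""
    a = np.abs(u)
    w = np.ones_like(a)
    mask = a > c
    w[mask] = c / a[mask]
    return w

def huber_m_fit(x, y, c=1.345, n_iter=50):
    """Huber M-estimator for simple linear regression via IRLS."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS starting values
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale estimate
        w = huber_weights(r / s, c)
        WX = X * w[:, None]                               # weighted design matrix
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)        # weighted normal equations
    return beta

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, x.size)  # true intercept 2, true slope 3
y[-3:] -= 40.0                                    # three gross Y-outliers at large x

X = np.column_stack([np.ones_like(x), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_rob = huber_m_fit(x, y)
```

On this simulated data the OLS slope is dragged well away from the true value 3 by the three outliers, while the Huber M-fit down-weights them and stays close, which is exactly the behavior described above.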
The least trimmed squares (LTS) method and the S-estimator have good robustness
properties, but their efficiencies are not high.
Finally, it is recommended that robust regression be used in conjunction with, and as a
check on, the method of least squares. If the results of the two procedures are substantially
the same, the least squares fit should be used, since confidence intervals and tests on the
regression coefficients are then available. If, on the other hand, the two analyses give quite
different results, the robust method should be preferred.
REFERENCES
[1] Adnan, R. and Mohamad, N.M. (2003), Multiple Outliers Detection Procedures in
Linear Regression. Matematika, Jabatan Matematik, UTM, Volume (19): 29-45.
[2] Alma, O.G. (2011), Comparison of Robust Regression Methods in Linear Regression.
International Journal of Contemporary Mathematical Sciences, Volume (6): 409-421.
[3] Birkes, D. and Dodge, Y. (1993), Alternative Methods of Regression. John Wiley &
Sons Ltd.
[4] Cook, R.D. (1977), Detection of Influential Observation in Linear Regression.
Technometrics, Volume (19): 15-18.
[5] Croux, C., Rousseeuw, P.J. and Hössjer, O. (1994), Generalized S-estimators. Journal
of the American Statistical Association, Volume (89): 1271-1281.
[6] Dalgaard, P. (2008), Introductory Statistics with R. Second Edition. Springer.
[7] Draper, N.R. and Smith, H. (1998), Applied Regression Analysis. Third edition.
Wiley-Interscience Publication.
[8] Faraway, J.J., (2002), Practical Regression and Anova Using R. Springer.
[9] Fox, J., (2002), An R and S-Plus Companion to Applied Regression. Sage
Publications, Inc.
[10] Gervini, D. and Yohai, V.J. (2002), A Class of Robust and Fully Efficient Regression
Estimators. The Annals of Statistics, Volume (30): 583-616.
[11] Hogg, R.V., McKean, J.W. and Craig, A. T. (2006), Introduction to Mathematical
Statistics. Sixth edition. Pearson Education International.
[12] Huber, P. J. and Ronchetti E. M. (2009), Robust Statistics. Second edition. John
Wiley & Sons Ltd.
[13] Kutner, M.H., Nachtsheim, C.J. and Neter, J. (2005), Applied Linear Statistical
Models. Fifth edition. McGraw-Hill.
[14] Lee, Y., MacEachern, S.N. and Jung, Y. (2011), Regularization of Case-Specific
Parameters for Robustness and Efficiency. Statistical Science, Volume (27): 350-372,
DOI: 10.1214/11-STS377.
[15] Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006), Robust Statistics: Theory and
Methods. John Wiley & Sons Ltd., England.
[16] Mendenhall, W. and Sincich, T. (2012), Regression Analysis: A Second Course in
Statistics. 7th edition. Upper Saddle River, NJ: Prentice Hall.
[17] Mohebbi, M.K., Nourijelyani, K.H. and Zeraati, H. (2007), A Simulation Study on
Robust Alternatives of Least Squares Regression. Journal of Applied Sciences, Volume
(7): 3469-3476.
[18] Montgomery, D. C. and Peck, E. A. (2006), Introduction to Linear Regression
Analysis, 4th Edition. John Wiley & Sons, Inc., New York.
[19] Nevitt, J. and Tam, H.P. (1998), A Comparison of Robust and Nonparametric
Estimators under the Simple Linear Regression Model. Multiple Linear Regression
Viewpoints, Volume (25): 54-69.
[20] Noor N.H. and Mohammad A.A. (2013), Model of Robust Regression with Parametric
and Nonparametric Methods. Mathematical Theory and Modeling, Volume (3): 27-39.
[21] Renaud, O. and Victoria-Feser, M.-P. (2010), A Robust Coefficient of Determination
for Regression, Journal of Statistical Planning and Inference, Volume (140): 1852-
1862.
[22] Rencher, A.C. and Schaalje, G.B. (2008), Linear Models in Statistics, Second
Edition, John Wiley & Sons ,Inc., New York.
[23] Rousseeuw, P.J. (1984), Least Median of Squares Regression. Journal of the
American Statistical Association, Volume (79): 871-880.
[24] Rousseeuw, P.J. and Leroy, A.M. (1987), Robust Regression and Outlier Detection.
John Wiley & Sons, New York.
[25] Rousseeuw, P.J. and Yohai, V.J. (1984), Robust Regression by Means of S-estimators.
In J. Franke, W. Härdle and D. Martin (Eds.), Lecture Notes in Statistics, Springer Verlag,
New York, Volume (26): 256-272.
[26] Ryan, T.P. (1997), Modern Regression Analysis. John Wiley & Sons, New York.
[27] Siegel, A. F. (1982), Robust Regression Using Repeated Medians. Biometrika,
Volume (69): 242-244.
[28] Staudte, R.G. and Sheather, S.J. (1990), Robust Estimation and Testing. John Wiley &
Sons, New York.
[29] Venables, W.N. and Ripley, B.D. (2002), Modern Applied Statistics with S. Fourth
Edition. Springer.
[30] Weisberg, S. (2005), Applied Linear Regression. John Wiley & Sons, New Jersey.
[31] Yan, X. and Su, X.G. (2009), Linear Regression Analysis: Theory and Computing.
World Scientific Publishing Co. Pte. Ltd.
[32] Yohai, V. J. (1987), High Breakdown-point and High Efficiency Robust Estimates for
Regression. The Annals of Statistics, Volume (15): 642-656.
Robust Methods in Regression Analysis: Comparison and Improvement
Prepared by
Mohammad Abd-Almonem Hussein Al-Amleh
Supervisor
Professor Faris Al-Athari
Abstract
The least squares method is the best method used in regression analysis, but only under
certain appropriate conditions. Failure to satisfy these conditions, together with the
presence of outliers, can adversely affect the estimates and results obtained by this method.
Robust methods, which are not affected by the presence of such outliers, have therefore
been proposed. This study aims to examine the behavior of outliers and their effect on
linear regression models fitted by the least squares method. In addition, it studies the
robust methods used in regression analysis, namely: M-estimation, the repeated medians
method, least median of squares, least trimmed squares, S-estimation, and MM-estimation.
Finally, we compared these robust methods with the least squares method, in terms of
robustness properties and efficiency, through a simulation study, and real data were also
used to carry out these comparisons. In this study we found that the robust methods are not
sensitive to the presence of outliers; they either down-weight these values or ignore them.
We also concluded that MM-estimation is the best and most efficient of the robust methods
in most cases.