Robust Methods in Regression Analysis: Comparison and Improvement
By
Mohammad Abd-Almonem H. Al-Amleh
Supervisor
Professor Faris M. Al-Athari
This Thesis was Submitted in Partial Fulfillment of the Requirements for the Master's
Degree of Science in Mathematics
Faculty of Graduate Studies
Zarqa University
December, 2015
Dedication
To my dear mom, to my dear wife, and to my beloved daughters …
Dana and Ruba.
Acknowledgments
I wish to thank all who have provided me support and assistance during my research. I am
deeply grateful to my supervisor Professor Faris M. Al-Athari. He has provided the
guidance and instruction that I value greatly.
I would like to thank the committee members for their cooperation.
List of Contents
Committee Decision …………………………………………………………….……… ii
Dedication ……………………………………………………………………….……..iii
Acknowledgments ……………………………………………………………….………iv
List of Contents …………………………………………………………………….…….v
List of Tables ……………………………………………………………………………..vii
List of Figures…………………………………………………………………….……..…ix
List of Abbreviations…………………………………………………………………..…..x
Abstract……………………………………………………………….…………….….…xi
Introduction…………………………………………………………………………………1
Chapter one: Regression Analysis
1.1 Linear Regression Model ………………………………………………………....……6
1.2 Least Squares Method……………..……………………………………..……………..8
Chapter 2: Outliers and Influential Points
2.1 Introduction……………………………………………………………………………12
2.2 Residual Analysis ……………………………………………………………..………13
2.3 The Effect of Outliers on the Least Squares Method………………..…………………….14
2.4 Types of Outliers………………………………………………….…………………..16
2.5 Simple Criteria for Model Comparison……………………………………………….18
2.6 Identifying Outlying Y-observation…………………………………………..……….18
2.7 Identifying Outlying X-observation……………………………………………..…….21
2.8 Identifying Influential Cases……..…………………………………………..……….22
2.9 Example……………………………………………………………………..…………23
Chapter 3: Robust Regression Methods
3.1 Overview……………………………………………………………………….……..29
3.2 Properties of Robust Estimators…….…………………………………………...……31
3.3 Robust Regression Methods…………………………………………………….…….33
Chapter 4: Comparison among Robust Regression Methods
4.1 Real Life Data…………………..……………………………………………………..46
4.2 Simulation Study……………………………………………………………………...55
4.3 Discussion…………………………………………………………………..……...….65
Conclusion…………………………………………………………………………………66
References………………………………………………………………………..………..68
Abstract ( in Arabic) ………………………………………………………..…………….72
List of Tables
Table (2.1): Steel employment by Country in Europe…………………………………….24
Table (2.2): Fitted values, residuals, leverage, studentized residuals, R-student residuals,
and Cook's distances.............................................................................................................25
Table (2.3): Estimated parameters, √MSRes, leverage, and R² for different scenarios in the
Steel Employment data………………………….…………………………...…….………28
Table (3.1): Different objective functions ρ(u), and their properties: range of u, influence
function φ(u), and weight w(u)………………………………….…………….37
Table (4.1): The estimated parameters in 7 robust methods and OLS for the Steel
Employment data set that contain outliers and clean data ………………………………...47
Table (4.2): Fitted values, residuals, and weights of OLS (10)…………………..…….…51
Table (4.3): Fitted values, residuals, and weights of M-H (10)……………………………51
Table (4.4): Fitted values, residuals, and weights of M-T (10)……………………………52
Table (4.5): Fitted values, and residuals of RM (10)………………………….………….52
Table (4.6): Fitted values, and residuals of LMS (10)………………………...………….53
Table (4.7): Fitted values, and residuals of LTS (10)……………………………..….…..53
Table (4.8): Fitted Values, and residuals of S(10)……………………………………..…54
Table (4.9): Fitted values, residuals, and weights of MM (10)………………………..….54
Table (4.10): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case1:ϵ~ N(0,1)……………………………………………...….57
Table (4.11): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case1:ϵ~ N(0,1)…………………………………………….…57
Table (4.12): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 2: ϵ ~ N (0,1) with 10% identical outliers in y direction
(where we let the first 10% of y's equal to 40)…………………………………………….58
Table (4.13): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 2:ϵ ~ N (0,1) with 10% identical outliers in y direction
(where we let the first 10% of y's equal to 40)…………………………………………….58
Table (4.14): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case3:ϵ~ N(0,1) with 25% identical outliers in y direction (where
we let the first 25 % of y's equal to 40)………………………………...………………….59
Table (4.15): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 3: ϵ ~ N(0,1) with 25% identical outliers in y direction
(where we let the first 25% of y's equal to 40)…………………………………………....59
Table (4.16): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case4:ϵ~ N(0,1) with 10 % identical outliers in x direction (where
we let the first 10 % of x's equal to 30)..........................................………………………..60
Table (4.17): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 4:ϵ~ N(0,1) with 10 % identical outliers in x direction
(where we let the first 10 % of x's equal to 30)……………………………………………60
Table (4.18): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 5:ϵ~ N(0,1) with 25 % identical outliers in x direction
(where we let the first 25 % of x's equal to 30)……………................................................61
Table (4.19): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 5:ϵ~ N(0,1) with 25 % identical outliers in x direction
(where we let the first 25 % of x's equal to 30)…………………………………………..61
Table (4.20): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 6: ϵ ~ 0.90N(0, 1) + 0.10N(0, 100) ………………………..62
Table (4.21): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 6: ϵ ~ 0.90N(0, 1) + 0.10N(0, 100)………………………62
Table (4.22): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 7: ϵ ~ Laplace(0,4)………………………………………...63
Table (4.23): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 100 Case 7: ϵ ~ Laplace(0,4)………………………………………..63
Table (4.24): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 20 Case 8: ϵ ~ t₃……………………………………………………..64
Table (4.25): Simulated Mean of the Estimated Parameters, MSE, and RMSE of Point
Estimates, with n = 50 Case 8: ϵ ~ t₃……………….………………………………….…64
List of Figures
Figure (2.1): The effect of outlying points on the ordinary least squares method……………..15
Figure (2.2): Scatter plot for different types of outlying observations…………………….17
Figure (2.3): Plot of the steel employment data…………………………………….…..…24
Figure (2.4): The least squares line for the steel employment example……………………25
Figure (2.5): Plot of the residuals of least squares estimates in the steel employment example……….26
Figure (2.6): Plot of the hat values hᵢᵢ of least squares estimates in the steel employment example……………27
Figure (2.7): Plot of the studentized residuals of least squares estimates in the steel employment example………...…27
Figure (2.8): Plot of the R-student residuals of least squares estimates in the steel employment example……………...27
Figure (2.9): Plot of the Cook's distances of least squares estimates in the steel employment example……………...28
Figure (3.1): Plot of Huber's weight function, with tuning constant a = 1.345…………..…38
Figure (3.2): Plot of Tukey's weight function, with tuning constant b = 4.685……………..38
Figure (3.3): Plot of the ordinary least squares weight function……………………………....38
Figure (4.1): Fitted lines for OLS(10), M-H(10), OLS(8), and M-H(8) in the steel employment example……………..48
Figure (4.2): Fitted lines for OLS(10), M-T(10), OLS(8), and M-T(8) in the steel employment example……………..48
Figure (4.3): Fitted lines for OLS(10), RM(10), OLS(8), and RM(8) in the steel employment example……………..49
Figure (4.4): Fitted lines for OLS(10), LMS(10), OLS(8), and LMS(8) in the steel employment example……………..49
Figure (4.5): Fitted lines for OLS(10), LTS(10), OLS(8), and LTS(8) in the steel employment example……………..49
Figure (4.6): Fitted lines for OLS(10), S(10), OLS(8), and S(8) in the steel employment example……………..50
Figure (4.7): Fitted lines for OLS(10), MM(10), OLS(8), and MM(8) in the steel employment example……………..50
List of Abbreviations
𝐸(. ) Expected value
Cov(. ) Variance-Covariance matrix
ŷᵢ Fitted values
𝑒𝑖 Residuals
𝜀𝑖 Errors
𝜎2 Variance
MSRes Residual mean square
SSE Residual sum of squares
H Hat matrix
𝑅2 Coefficient of determination
OLS Ordinary Least Squares Method
LAD Least Absolute Deviation Method
M-H M-estimation with Huber weights
M-T M-estimation with Tukey weights
RM Repeated Median Method
LMS Least Median of Squares Method
LTS Least Trimmed sum of Squares
S S-estimators
MM MM-estimators
MSE Mean Square Error
RMSE Relative Mean Square Error
Abstract
The ordinary least squares method is the best method for regression analysis under certain assumptions. Violations of these assumptions, such as the presence of outliers, may have an adverse effect on the estimates and on the results associated with this method. Robust methods have therefore been proposed to cope with such unusual observations. The purpose of this thesis is to describe the behavior of outliers in linear regression and their influence on the ordinary least squares method. In addition, we investigated various robust regression methods, including M-estimation, repeated medians, least median of squares, least trimmed squares estimation, scale estimation (S-estimation), and MM-estimation.
Finally, we compared these robust estimators, in terms of their robustness and efficiency, through a simulation study. A real-data application is also provided to compare the robust estimates with the traditional least squares estimator.
In this thesis, we found that robust regression methods are not sensitive to the presence of outliers, since they can downweight or ignore such observations. We also found that MM-estimators are the best robust estimators and can be used efficiently under different contamination models.
Introduction
Regression Analysis is a statistical technique for investigating and modeling the
relationship between dependent and independent variables. Applications of regression are
numerous and occur in almost every field, including engineering, the physical and
chemical sciences, economics, management, biological sciences, and the social sciences. In
fact, regression analysis may be the most widely used statistical technique. The main objectives of regression analysis are:
1. Prediction of future observations.
2. Assessment of the effect of, or the relationship between, the independent variables and the response (dependent) variable.
3. A general description of the data structure.
4. Control purposes.
Frequently in regression analysis applications, the data set contains some observations
which are outliers or extremes, i.e., observations which are well separated from the rest of
the data. These outlying observations may involve large residuals and often have dramatic
effects on the fitted least squares regression function. It is, therefore, important to study the
outlying observations carefully and their effect on the adequacy of the model. An
observation can be an outlier due to the dependent variable or any one or more of the
independent variables having values outside the expected limits. A data point may be an
outlier or a potentially influential point because of errors in the conduct of the study
(machine malfunction; recording, coding, or data entry errors; failure to follow the
experimental protocol) or because the data point is from a different population. The
danger of outlying observations to the least squares estimation is that they can have a
strong adverse effect on the estimate and they may remain unnoticed. Therefore, statistical
techniques that are able to cope with or to detect outlying observations have been
developed. Robust regression is an important method for analyzing data that are
contaminated with outliers. It can be used to detect outliers and to provide resistant results
in the presence of outliers.
Purposes of the Thesis
The main purposes of this thesis are:
1. Studying outliers and influential points and their effect on the least squares method.
2. Reviewing various robust methods that are used in regression analysis, and studying their properties, such as breakdown point and efficiency.
3. Comparing the robust methods and their performance in different contamination situations and under heavy-tailed distributions.
Methodology
In order to study the effect of outliers and influential observations on the least squares method, we used real data containing outliers, and the least squares estimates were then compared with and without the outliers. The same technique was used to compare the robust regression methods and their properties. In addition, a simulation study was used to compare the robust methods and the least squares method under different scenarios. Calculations and simulations were done using R version 3.2.2.
In fact, we used references [6], [8], [9], and [29] as a guide for this program.
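The simulations in the thesis were carried out in R 3.2.2, as noted above. Purely as an illustrative sketch of this kind of comparison (not the thesis code, and in Python/NumPy rather than R), the following compares ordinary least squares with a simplified Huber M-fit under y-direction contamination, mirroring one of the simulation scenarios; the data, seed, and helper names are made up for illustration:

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares fit."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def huber_m(X, y, a=1.345, iters=50):
    """Simplified Huber M-estimator via iteratively reweighted least squares."""
    beta = ols(X, y)
    for _ in range(iters):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # robust scale (MAD)
        s = s if s > 0 else 1.0
        u = r / s
        w = a / np.maximum(np.abs(u), a)    # Huber weights: min(1, a/|u|)
        sw = np.sqrt(w)
        beta = ols(X * sw[:, None], y * sw)
    return beta

rng = np.random.default_rng(1)
n, beta_true = 20, np.array([1.0, 2.0])
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = X @ beta_true + rng.standard_normal(n)
y[:2] = 40.0                    # 10% identical outliers in the y direction

print("OLS:  ", ols(X, y))      # pulled away from the true parameters
print("Huber:", huber_m(X, y))  # typically stays near the true (1, 2)
```

The robust fit differs from OLS only in the weights it assigns: observations with large standardized residuals are downweighted before refitting, which is the general pattern of the M-estimation methods studied in Chapter 3.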
Literature Survey
There are various methods that can be used to construct a linear regression model and
estimate the regression coefficients. One of the most popular methods is the ordinary least
squares method. It was discovered by Carl Friedrich Gauss in Germany and Adrien-Marie Legendre in France around 1805. The purpose of this procedure is to optimize the fit by minimizing the sum of the squares of the errors. Since the OLS estimator could be calculated explicitly from the data, it remained the only feasible approach in regression for many decades. After Gauss introduced the normal (Gaussian) distribution as the error distribution, ordinary least squares became optimal and, in addition, very important mathematical results were obtained. Even now, the OLS procedure is still used because of tradition and ease of computation.
Edgeworth proposed the least absolute deviation method in 1887. He argued that since the errors are squared in the least squares procedure, outliers have a large effect on the estimates. Hence, he suggested minimizing the sum of the absolute values of the errors (Birkes and Dodge, 1993).
Recently, there has been an increasing interest in other methods, from the realization that it is very difficult for real-life data to satisfy the necessary assumptions. Another reason is the advance in computer technology, which has decreased the computational difficulties of other methods.
Many robust methods have been proposed to deal with outliers. Huber introduced a class of estimators known as M-estimators that aim to minimize ∑ᵢ₌₁ⁿ ρ(uᵢ), where ρ(uᵢ) is some symmetric function of the errors. Several ρ functions have been proposed; those of Huber and Hampel are two examples (Huber, 2009).
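As a concrete example, Huber's ρ function is quadratic for small errors and linear for large ones, so its derivative, the influence function, is bounded. A minimal sketch of the three associated functions (using the tuning constant a = 1.345, the value that appears later in the thesis for Huber weights):

```python
import numpy as np

A = 1.345  # Huber's tuning constant, as used for the weight function in Chapter 3

def rho(u, a=A):
    """Huber's objective function: quadratic in the middle, linear in the tails."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= a, 0.5 * u**2, a * np.abs(u) - 0.5 * a**2)

def psi(u, a=A):
    """Influence function psi(u) = rho'(u): bounded at +/- a."""
    u = np.asarray(u, dtype=float)
    return np.clip(u, -a, a)

def weight(u, a=A):
    """Weight function w(u) = psi(u)/u: 1 in the middle, a/|u| in the tails."""
    u = np.asarray(u, dtype=float)
    return np.minimum(1.0, a / np.maximum(np.abs(u), a))

# A large residual gets bounded influence and a small weight.
print(float(psi(10.0)), float(weight(10.0)))
```

Because ψ is bounded, a single wild observation contributes at most a fixed amount to the estimating equations, which is the source of the robustness of M-estimators.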
The least median of squares (LMS) estimate was introduced by Rousseeuw (1984), who replaced the sum of squared deviations in ordinary least squares by the median of the squared deviations, which is a robust estimator of location.
Rousseeuw (1984) also developed the least trimmed sum of squares (LTS), which minimizes the sum of the h smallest ordered squared residuals, where h is a constant that must be determined; the largest squared residuals are thus excluded from the summation.
S-estimation is a high breakdown value method introduced by Rousseeuw and Yohai (1984) that minimizes the scale of the residuals. Generalized S-estimates (GS-estimates), proposed by Croux et al. (1994), maintain the same high breakdown point as S-estimates and have slightly higher efficiency.
MM-estimates, proposed by Yohai (1987), can simultaneously attain a high breakdown point and high efficiency. Gervini and Yohai (2002) proposed a new class of high breakdown point and high efficiency robust estimates called the robust and efficient weighted least squares estimator (REWLSE).
Lee et al. (2011) proposed a new class of robust methods based on the regularization of case-specific parameters for each response. They further proved that the M-estimator (with Huber's function) is a special case of their proposed estimator. Other estimators can be found in Maronna et al. (2006).
Another robust measure related to regression analysis is the robust coefficient of determination introduced by Renaud and Victoria-Feser (2010).
A comparison among robust methods and the least squares method was carried out by Alma (2011), in which four robust regression methods were compared: least trimmed squares (LTS), M-estimation, S-estimation, and MM-estimation. The study concluded that the S-estimate and MM-estimate perform best overall against a comprehensive set of outlier conditions.
Another comparison study was carried out by Mohebbi et al. (2007), in which two robust methods, the Huber M-estimate and least absolute deviations (LAD), were compared, in addition to a nonparametric method. The conclusion was that the LAD and Huber M-estimates are more appropriate for heavy-tailed distributions, while the nonparametric and LAD regressions are better choices for skewed data.
Noor and Mohammad (2013) compared three robust methods and some nonparametric methods in simple linear regression models. They found that the LAD and M-estimators are the best methods when compared to the nonparametric methods and OLS.
Chapter 1: Regression Analysis
(1.1) Linear Regression Model
Regression analysis is used for explaining and modeling the relationship between a single variable y, called the response, output, or dependent variable, and one or more predictor, input, or explanatory variables x₁, x₂, …, xₖ. When k = 1 it is called simple regression, but when k > 1 it is called multiple regression or, sometimes, multivariate regression. Suppose there is a relationship between a response variable y and k explanatory variables x₁, x₂, …, xₖ; then the multiple linear regression model can be expressed as:
𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 +⋯+ 𝛽𝑘𝑥𝑘 + 𝜀 . (1.1)
The parameters βⱼ, j = 0, 1, 2, …, k, are called the regression coefficients. The model is called linear because it is linear in the parameters.
The error ε is the difference between the observed value of y and the function β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ. It is convenient to think of ε as a statistical error, that is, a random variable that accounts for the failure of the model to fit the data exactly (Montgomery, 2006).
In terms of the observed data, the linear regression model can be expressed as:
𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖1 + 𝛽2𝑥𝑖2 +⋯+ 𝛽𝑘𝑥𝑖𝑘 + 𝜀𝑖 , 𝑖 = 1,2, … , 𝑛. (1.2)
So, we can write each observation as:
𝑦1 = 𝛽0 + 𝛽1𝑥11 + 𝛽2𝑥12 +⋯+ 𝛽𝑘𝑥1𝑘 + 𝜀1
𝑦2 = 𝛽0 + 𝛽1𝑥21 + 𝛽2𝑥22 +⋯+ 𝛽𝑘𝑥2𝑘 + 𝜀2
⋮
𝑦𝑛 = 𝛽0 + 𝛽1𝑥𝑛1 + 𝛽2𝑥𝑛2 +⋯+ 𝛽𝑘𝑥𝑛𝑘 + 𝜀𝑛
These n equations can be written in matrix form as:
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}
Or equivalently:
Y = Xβ + ε,  (1.3)
where X is an n × (k + 1) matrix whose elements are known constants, β is the vector of the (k + 1) parameters, ε is the vector of errors, and Y is the vector of the n observations.
The linear regression model makes several assumptions about the data:
1. The expected value of the error term is E(ε) = 0, and hence E(Y) = Xβ.
2. The variance-covariance matrix of the error term is cov(ε) = σ²I, so cov(Y) = σ²I. This means that the variance of the error term is assumed to be constant for all εᵢ, i = 1, 2, …, n, and that the errors are assumed to be uncorrelated.
So, the response variable Y is a random vector with mean Xβ and variance-covariance matrix σ²I. In most real-world problems, the values of the parameters βⱼ, j = 0, 1, 2, …, k, and the error variance σ² will not be known, and they must be estimated from the sample data (Montgomery, 2006).
(1.2) Least Squares Method
As we showed in the previous section, the parameters 𝛽𝑗 , 𝑗 = 0,1, … , 𝑘 are unknown
quantities that characterize a regression model. Estimates of these parameters are
computable functions of data and are, therefore, statistics.
To keep this distinction clear, parameters are denoted by Greek letters like α, β, and σ, and
estimates of parameters are denoted by putting a "hat" over the corresponding Greek letter
(Weisberg, 2005).
In the least squares method of regression, the overall size of the error is measured by the sum of the squares of the errors, ∑ᵢ₌₁ⁿ εᵢ². The least squares estimates of βⱼ, j = 0, 1, …, k, are the values that minimize this sum.
In order to obtain the least squares estimates β̂₀, β̂₁, …, β̂ₖ, we differentiate
∑ᵢ₌₁ⁿ εᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ₁ − β₂xᵢ₂ − ⋯ − βₖxᵢₖ)²
with respect to each βⱼ and set the results equal to zero, which yields (k + 1) equations that can be solved simultaneously for the βⱼ's.
Formulas for the least squares estimates can be expressed in matrix form as :
𝛽∧
= (𝑋′𝑋)−1𝑋′𝑌 (1.4)
provided that the inverse matrix (X′X)⁻¹ exists. The inverse will always exist if the explanatory variables x₁, x₂, …, xₖ are linearly independent, that is, if no column of the X matrix is a linear combination of the other columns (Montgomery, 2006).
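As a numerical illustration of formula (1.4), the following Python/NumPy sketch (with made-up data, not the thesis data) computes β̂ via the normal equations and checks it against a library least squares solver:

```python
import numpy as np

# Illustrative data: n = 5 observations, k = 1 explanatory variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix X with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least squares estimate: beta_hat = (X'X)^{-1} X'y  -- formula (1.4).
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's dedicated least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```

In practice, solving the normal equations with `np.linalg.solve` (or using `lstsq` directly) is numerically preferable to forming the explicit inverse; the inverse form is shown here only because it matches (1.4).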
Two important concepts that will be seen later are:
1. The fitted value ŷᵢ, defined for observation i as:
ŷᵢ = Ê(Y | X = xᵢ) = β̂₀ + β̂₁xᵢ₁ + ⋯ + β̂ₖxᵢₖ.  (1.5)
2. The i-th residual eᵢ, defined as the difference between the observed value yᵢ and the corresponding fitted value ŷᵢ:
eᵢ = yᵢ − ŷᵢ.  (1.6)
Properties of the Least Squares Estimator β̂
If the previous assumptions E(Y) = Xβ and cov(Y) = σ²I are satisfied, then the least squares estimator β̂ has the following good properties:
1. β̂ is an unbiased estimator of β, that is, E(β̂) = β.
2. The covariance matrix of β̂ is given by the formula cov(β̂) = σ²(X′X)⁻¹.
3. β̂ has the minimum variance among all linear unbiased estimators. This property is known as the Gauss-Markov theorem, which can be stated as follows: if E(Y) = Xβ and cov(Y) = σ²I, the least squares estimators β̂₀, β̂₁, …, β̂ₖ are the best linear unbiased estimators (BLUE). In this expression, "best" means minimum variance, and "linear" indicates that the estimators are linear functions of y (Rencher, 2008).
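The unbiasedness property E(β̂) = β can be illustrated with a small Monte Carlo check: averaging the least squares estimates over many simulated samples should recover the true parameters. A sketch with arbitrary true values β = (2, 1) (chosen only for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([2.0, 1.0])   # arbitrary true parameters for the demo
x = np.linspace(0, 10, 20)
X = np.column_stack([np.ones_like(x), x])

# Repeatedly draw Y = X beta + eps with eps ~ N(0, 1) and re-estimate beta.
estimates = []
for _ in range(2000):
    y = X @ beta_true + rng.standard_normal(len(x))
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

# The average estimate should be close to the true beta (unbiasedness).
print(np.mean(estimates, axis=0))
```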
Normal Error Regression Model
No matter what may be the form of the distribution of the error terms (and hence of
the 𝑦𝑖), the least squares method provides unbiased point estimators of 𝛽𝑗 , that have
minimum variance among all unbiased linear estimators.
The normal error model is the same as regression model (1.1) with unspecified error distribution, except that it assumes the errors εᵢ are normally distributed. Thus, the assumption of uncorrelatedness of the εᵢ in regression model (1.1) becomes one of independence in the normal error model. Hence, the error term in any trial has no effect on the error term for any other trial, whether it is positive or negative, small or large.
The normal error regression model implies that the 𝑦𝑖 's are independent normal random
variables, with mean 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 +⋯+ 𝛽𝑘𝑥𝑘 and variance 𝜎2 .
The importance of this model comes from the fact that inferences, such as testing hypotheses and constructing confidence intervals about the model parameters, as well as predictions of Y, can be made based on the normality of the errors and the observed values.
Estimation of σ²
We need to estimate σ² because a variety of inferences concerning the regression function and the prediction of Y require an estimate of σ². The method of least squares, however, does not yield a function of the y and x values in the sample that we can minimize to obtain an estimator of σ². Nevertheless, there is an unbiased estimator of σ² based on the least squares estimator β̂, which depends on the residuals.
In fact, the estimate of σ² can be obtained from the residual sum of squares (SSE):
SSE = ∑ᵢ₌₁ⁿ eᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²  (1.7)
So, by the previous assumptions, we can show that σ² can be estimated by the corresponding average from the sample as follows:
MSRes = σ̂² = (1/(n − p)) ∑ᵢ₌₁ⁿ (yᵢ − xᵢ′β̂)²  (1.8)
where n is the sample size and p is the number of parameters (p = k + 1). MSRes is called the residual mean square, and its square root √MSRes is called the standard error of regression. Note that MSRes can be written as
MSRes = SSE / (n − p)  (1.9)
which is an unbiased estimator of σ².
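As a small numerical illustration of (1.7)-(1.9), using made-up data with n = 5 and one predictor (so p = 2):

```python
import numpy as np

# Illustrative data (made up): n = 5, one predictor, so p = 2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

n, p = X.shape
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

residuals = y - X @ beta_hat          # e_i = y_i - y_hat_i
SSE = np.sum(residuals ** 2)          # residual sum of squares (1.7)
MS_res = SSE / (n - p)                # residual mean square (1.9)
se_regression = np.sqrt(MS_res)       # standard error of regression

print(SSE, MS_res)
```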
Because 𝑀𝑆𝑅𝑒𝑠 = �̂�2 depends on the residual sum of squares, any violation of the
assumptions on the model errors or any misspecification of the model form may seriously
damage the usefulness of �̂�2 as an estimate of 𝜎2. Because �̂�2 is computed from the
regression model residuals, we say that it’s a model-dependent estimate of 𝜎2
(Montgomery, 2006).
Chapter 2: Outliers and Influential Points
(2.1) Introduction:
As we stated, for the classical regression model the method of estimating the regression parameters is the least squares method. If the error term in the regression model is normally distributed, the least squares estimates of the regression parameters are the same as the maximum likelihood estimates. After the estimates of the linear regression model are obtained, the next step is to check whether this linear regression model reasonably reflects the true relationship between the response variable and the independent variables. This falls into the area of regression diagnostics, which has two aspects. One is to check whether a chosen model is reasonable enough to reflect the true relationship between the response variable and the independent variables. The other is to check whether there are any data points that deviate significantly from the assumed model. The first question relates to model diagnostics, and the second to checking for outliers and influential observations. In this chapter we focus on the detection of outliers and influential observations.
Identifying outliers and influential observations for a regression model is based on the assumption that the regression model is correctly specified, that is, that the selected regression model adequately describes the relationship between the response variable and the independent variables. Any data points that fit the assumed model well are the right data points for that model. Sometimes, however, not all data points fit the model equally well; there may be some data points that deviate significantly from the assumed model. These data points may be considered outliers if we believe that the selected model is correct.
Geometrically, linear regression is a line (for simple linear regression) or a hyperplane (for multiple regression). If a regression model is appropriately selected, most data points should be fairly close to the regression line or hyperplane. The data points which are far away from the regression line or hyperplane may not be "ideal" data points for the selected model and could potentially be identified as outliers for the model. An outlier is a data point that is statistically far away from the chosen model, if we believe that the selected regression model is correct (Yan and Su, 2009).
An influential observation is one that has a relatively large impact on the estimates of one or more regression parameters; that is, including or excluding it in the model fitting results in an unusual change in one or more of the estimates. We will now discuss the statistical procedures for identifying outliers and influential observations.
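The idea of an influential observation can be made concrete with a leave-one-out experiment: refit the model with each point deleted and see how much the estimates move. A sketch with made-up data, where the last point has an extreme x value (high leverage):

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Made-up data: the last point is far out in the x direction.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 2.0])
X = np.column_stack([np.ones_like(x), x])

beta_full = ols(X, y)
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    beta_i = ols(X[keep], y[keep])
    # Large changes flag influential cases (cf. Cook's distance, Section 2.8).
    print(i, np.round(beta_i - beta_full, 3))
```

Deleting the high-leverage point changes the fitted slope dramatically, while deleting any of the other points barely moves the estimates; this is exactly the behavior that deletion diagnostics such as Cook's distance quantify.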
(2.2) Residual Analysis
As we saw in the previous chapter, the residual of the linear regression model is defined as the difference between the observed response yᵢ and the fitted value ŷᵢ. The regression error term ε is unobservable, while the residual is observable. The residual is an important measure of how close the calculated response from the fitted regression model is to the observed response.
The regression residuals can be expressed in vector form as:
e = Y − Ŷ = Y − X(X′X)⁻¹X′Y = (I − H)Y  (2.1)
where H = X(X′X)⁻¹X′ is called the hat matrix. Note that (I − H) is symmetric and idempotent, that is, (I − H)′ = (I − H) and (I − H)² = (I − H). Further, we can express the variance-covariance matrix of the residuals in terms of the hat matrix as:
σ²(e) = σ²(I − H)  (2.2)
Therefore, the variance of the residual eᵢ, denoted by σ²(eᵢ), is:
σ²(eᵢ) = σ²(1 − hᵢᵢ)  (2.3)
where hᵢᵢ is the i-th element on the main diagonal of the hat matrix, and the covariance between residuals eᵢ and eⱼ (i ≠ j) is:
σ(eᵢ, eⱼ) = σ²(0 − hᵢⱼ) = −hᵢⱼσ²,  i ≠ j  (2.4)
where hᵢⱼ is the element in the i-th row and j-th column of the hat matrix. So, the estimated counterparts are:
σ̂²(eᵢ) = σ̂²(1 − hᵢᵢ)  (2.5)
σ̂(eᵢ, eⱼ) = −hᵢⱼ MSRes,  i ≠ j
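These hat-matrix identities are easy to verify numerically. A minimal Python/NumPy sketch with made-up data (not the thesis data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)^{-1}X'
I = np.eye(len(y))
M = I - H

# (I - H) is symmetric and idempotent.
print(np.allclose(M, M.T), np.allclose(M @ M, M))

# Residuals via e = (I - H)y agree with y minus the fitted values.
e = M @ y
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(e, y - X @ beta_hat))

# The diagonal elements h_ii drive the residual variances sigma^2 (1 - h_ii);
# note they are largest at the extreme x values.
print(np.round(np.diag(H), 3))  # → [0.6 0.3 0.2 0.3 0.6]
```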
(2.3) The Effect of Outliers on the Least Squares Method:
Outliers are observations that appear inconsistent with the rest of the data set. Outliers occur very frequently in real data, and they often go unnoticed because much data is processed by computers without careful inspection or screening. Outliers may be a result of errors in recording, may come from another population, or may be unusual observations from the assumed distribution. For example, if the errors εᵢ are distributed as N(0, σ²), a value of εᵢ whose absolute value is greater than 3σ would occur with probability 0.0027 (Rencher, 2008).
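The stated probability comes from the standard normal distribution, P(|ε| > 3σ) = 2(1 − Φ(3)) ≈ 0.0027, which can be verified with the complementary error function:

```python
import math

# P(|Z| > 3) for Z ~ N(0, 1), using Phi(z) = (1 + erf(z / sqrt(2))) / 2,
# so 2*(1 - Phi(3)) = erfc(3 / sqrt(2)).
p = math.erfc(3 / math.sqrt(2))
print(round(p, 4))  # 0.0027
```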
Outliers can create great difficulty. When we encounter one, our first suspicion is that the
observation resulted from a mistake or other extraneous effect, and hence should be
discarded. A major reason for discarding it is that under the least squares method, a fitted
model may be pulled disproportionately toward an outlying observation because the sum of
the squared deviations is minimized. This could cause a misleading fit if indeed the
outlying observation resulted from a mistake or other extraneous cause. On the other hand,
outliers may convey significant information, as when an outlier occurs because of an
interaction with another predictor variable omitted from the model (Kutner, et al., 2005).
As an illustration, Figure (2.1) shows the least squares fitted lines for (a) the data with one outlier, and (b) the data with the outlier removed. As shown in the figure, the least squares estimates were highly affected by this outlier. Figure (2.1-a) represents the ordinary least squares fit with only one outlier; the estimated parameters are β̂₀ = 6.5562 and β̂₁ = 0.4611, while the estimated parameters without the outlier (in Figure (2.1-b)) are β̂₀ = 2.187 and β̂₁ = 1.063.
Figure (2.1): The effect of outlying points on the ordinary least squares method; (a) represents the least squares fit with one outlier, and (b) represents the least squares fit without the outlier.
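The effect illustrated in Figure (2.1) is easy to reproduce numerically; the following sketch uses made-up data (the specific numbers differ from those in the figure):

```python
import numpy as np

def fit_line(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
x = np.arange(1.0, 9.0)                  # 8 clean design points
y = 2.0 + 1.0 * x + 0.2 * rng.standard_normal(8)

b_clean = fit_line(x, y)                 # close to the true (2, 1)

x_out = np.append(x, 2.0)                # add a single outlier high in y
y_out = np.append(y, 14.0)
b_outlier = fit_line(x_out, y_out)

# One outlier at low x flattens the slope and inflates the intercept.
print(np.round(b_clean, 2), np.round(b_outlier, 2))
```

Because the squared residual of the outlier dominates the least squares criterion, the fitted line is pulled toward that single point, which is precisely the vulnerability that the robust methods of Chapter 3 are designed to avoid.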
(2.4) Types of Outliers:
Outliers can be classified as follows (Ryan, 1997; Adnan and Mohammad, 2003):
1. Regression Outlier
A point that deviates from the linear relationship determined from the other n-1 points, or
at least from the majority of those points.
2. Residual Outlier
A point that has a large residual, according to measures that will be discussed later. It is important to distinguish between a regression outlier and a residual outlier: a point can be a regression outlier without being a residual outlier (if the point is influential), and a point can be a residual outlier without there being strong evidence that it is also a regression outlier.
3. X-space outlier

This is an observation that is remote in one or more $x$ coordinates. An X-space outlier could also be a regression outlier and/or a residual outlier.
4. Y-space outlier
This is a point that is outlying only because its y-coordinate is extreme. The manner and
extent to which such an outlier will affect the parameter estimates will depend upon both
its x-coordinate and the general configuration of the other points.
Thus, the point might also be a regression and/or residual outlier.
5. X- and Y-outlier:
A point that is outlying in both coordinates may be a regression outlier, or a residual outlier
(or both), or it may have a very small effect on the regression equation. The determining
factor is the general configuration of the other points.
Figure (2.2) illustrates the different types of outliers. The ellipse encloses the majority of the data. Point A is an outlier in Y-space because its y value is significantly different from the rest of the data, while point B is an X-space outlier, that is, its x value is unusual; such a point is also referred to as a leverage point. Points C and D are both X- and Y-outliers. Note that point D has virtually no impact on the regression line, so it is neither a regression outlier nor a residual outlier. Point C is a regression outlier, and it may or may not be a residual outlier, depending on other measures that will be discussed in the next sections.
Figure (2.2): Scatter plot for different types of outlying observations
(2.5) Simple Criteria for Model Comparison
The following are some basic criteria that are commonly used for regression model
diagnosis:
1. Coefficient of determination:

$R^2 = 1 - \frac{SSE}{SST}$    (2.6)

where $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, and the total sum of squares is $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$. The preferred model is the one whose $R^2$ value is close to 1. If the data fit the regression model well, then each $y_i$ should be close to $\hat{y}_i$; hence SSE should be fairly close to zero, and therefore $R^2$ should be close to 1.
2. Estimate of error variance $MS_{Res} = \hat{\sigma}^2$. Among a set of candidate regression models, the preferred model is the one with the smaller value of $\hat{\sigma}^2$, since this corresponds to fitted values that are closer to the observed responses as a whole.
(2.6) Identifying Outlying Y Observations

If the difference between the fitted value $\hat{y}_i$ and the response $y_i$ is large, we may suspect that the $i$th observation is a potential outlier. The purpose of outlier detection in regression analysis is to eliminate observations with relatively large residuals so that the model fit may be improved. However, observations cannot be eliminated solely on the basis of a statistical procedure; the decision should be made jointly by statisticians and subject-matter scientists (Yan, 2009). In practice, eliminating observations is rarely straightforward. Observations that do not fit the model well may indicate a flaw in the selected model, and outlier detection might actually lead to altering the model.
The analysis of residuals carries the most useful information for model fitting. Three measures of residuals commonly used in outlier detection will be discussed: the standardized residual, the studentized residual, and the R-student residual.
1. Standardized Residual: let $\sqrt{MS_{Res}}$ be the standard error of the regression model. The standardized residual is defined as:

$z_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, 2, \ldots, n$    (2.7)

The standardized residual is simply the normalized residual, i.e., the z-score of the residual. The standardized residuals have mean zero and approximately unit variance. Consequently, a large absolute standardized residual ($|z_i| > 3$) potentially indicates an outlier (Montgomery, 2006).
2. Studentized Residual:

The studentized residual is defined as:

$r_i = \frac{e_i}{\hat{s}(e_i)} = \frac{e_i}{\sqrt{MS_{Res}(1 - h_{ii})}}, \quad i = 1, 2, \ldots, n$    (2.8)

Since the residuals may have different estimated variances $\sigma^2(e_i)$, it is natural to take the magnitude of each $e_i$ relative to its estimated standard deviation $\hat{s}(e_i)$ (see eq. (2.5)) to give recognition to differences in the sampling errors of the residuals (Kutner et al., 2005). The ratio of $e_i$ to $\hat{s}(e_i)$ is called the studentized (or internally studentized) residual. While the residuals $e_i$ will have substantially different sampling variations if their standard deviations differ markedly, the studentized residuals $r_i$ have constant variance when the model is appropriate (Kutner et al., 2005).
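As a minimal sketch (not from the thesis), the standardized residuals of equation (2.7) and the studentized residuals of equation (2.8) can be computed directly from the hat matrix; the small data set below is hypothetical, with one aberrant response.

```python
import numpy as np

# Hypothetical data: a straight-line trend with one unusual response at x = 6.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS estimates
e = y - X @ beta                             # ordinary residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
h = np.diag(H)                               # leverages h_ii

ms_res = (e @ e) / (n - p)                   # MS_Res, estimate of sigma^2
z = e / np.sqrt(ms_res)                      # standardized residuals, eq. (2.7)
r = e / np.sqrt(ms_res * (1.0 - h))          # studentized residuals, eq. (2.8)

print("largest |r| at index", int(np.argmax(np.abs(r))))
```

On this toy data the largest studentized residual belongs to the aberrant last point, as expected.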
3. R-Student Residuals

The studentized residual $r_i$ discussed above is often used as an outlier diagnostic. In computing $r_i$, $MS_{Res}$ was used as an estimate of $\sigma^2$. This is referred to as internal scaling of the residual, because $MS_{Res}$ is an internally generated estimate of $\sigma^2$ obtained from fitting the model to all $n$ observations.

Another approach is to use an estimate of $\sigma^2$ based on the data set with the $i$th observation removed. Denote this estimate by $S_{(i)}^2$. It can be shown that:

$S_{(i)}^2 = \frac{(n - p)\,MS_{Res} - e_i^2/(1 - h_{ii})}{n - p - 1}$    (2.9)

This estimate of $\sigma^2$ is used in place of $MS_{Res}$ to produce an externally studentized residual, usually called R-student, given by:

$t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, 2, \ldots, n$    (2.10)

In many situations $t_i$ will differ little from the studentized residual $r_i$. However, if the $i$th observation is influential, then $S_{(i)}^2$ can differ significantly from $MS_{Res}$, and the R-student statistic will be more sensitive to this point.
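Continuing the hedged sketch above, equations (2.9) and (2.10) can be coded directly; the data are again hypothetical, and the test relies on the known identity $t_i = r_i \sqrt{(n-p-1)/(n-p-r_i^2)}$ relating the two residual types.

```python
import numpy as np

# Hypothetical data: a straight-line trend with one aberrant response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
ms_res = (e @ e) / (n - p)
r = e / np.sqrt(ms_res * (1.0 - h))              # studentized residuals, eq. (2.8)

# Eq. (2.9): sigma^2 estimated with the i-th observation removed.
s2_del = ((n - p) * ms_res - e**2 / (1.0 - h)) / (n - p - 1)
# Eq. (2.10): externally studentized (R-student) residuals.
t = e / np.sqrt(s2_del * (1.0 - h))
```

Because $|t_i|$ is a monotone increasing transform of $|r_i|$, both statistics flag the same most extreme case here.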
Test of outliers:
We identify as outlying Y observations those cases whose studentized deleted residuals are
large in absolute value. In addition, we can conduct a formal test by means of the
Bonferroni test procedure of whether the case with the largest absolute studentized deleted
residual is an outlier. Since we do not know in advance which case will have the largest
absolute value $|t_i|$, we consider the family of tests to include $n$ tests, one for each case. If the regression model is appropriate, so that no case is outlying because of a change in the model, then each studentized deleted residual follows the t-distribution with $n - p - 1$ degrees of freedom. The appropriate Bonferroni critical value is therefore $t\left(1 - \frac{\alpha}{2n};\; n - p - 1\right)$. Note that the test is two-sided, since we are concerned not with the direction of the residuals but only with their absolute values (Montgomery, 2006).
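A hedged sketch of the Bonferroni critical value, using SciPy's t-distribution quantile function; the setting $n = 10$, $p = 2$, $\alpha = 0.05$ matches the steel employment example discussed later in this chapter.

```python
from scipy.stats import t

def bonferroni_critical(alpha, n, p):
    """Bonferroni critical value t(1 - alpha/(2n); n - p - 1) for the
    outlier test on the largest absolute studentized deleted residual."""
    return t.ppf(1.0 - alpha / (2.0 * n), n - p - 1)

# Setting used in the steel employment example: n = 10 cases, p = 2 parameters.
crit = bonferroni_critical(0.05, 10, 2)
print("Bonferroni critical value:", round(crit, 3))
```

Any case with $|t_i|$ exceeding this value is declared an outlier at familywise level $\alpha$.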
(2.7) Identifying Outlying X Observations
The hat matrix, as we saw, plays an important role in determining identifying outlying Y
observations. The hat matrix is also helpful in directly identifying outlying X observations.
In particular, the diagonal elements of the hat matrix are useful indicators in a
multivariable setting of whether or not a case is outlying with respect to its X values.
The diagonal elements $h_{ii}$ of the hat matrix have some useful properties. In particular, their values are always between 0 and 1, and their sum is $p$:

$0 \le h_{ii} \le 1 \quad \text{and} \quad \sum_{i=1}^{n} h_{ii} = p$    (2.11)
where $p$ is the number of regression parameters in the regression function. In addition, it can be shown that $h_{ii}$ is a measure of the distance between the X values for the $i$th case and the means of the X values for all $n$ cases. Thus, a large value of $h_{ii}$ indicates that the $i$th case is distant from the center of all X observations. The diagonal element $h_{ii}$ in this context is called the leverage (in terms of the X values) of the $i$th case (Kutner, et al., 2005). Hence, the farther a point is from the center of the X space, the more leverage it has.
Because the sum of the leverage values is $p$, observation $i$ can be considered an X outlier if its leverage exceeds twice the mean leverage value $\bar{h}$, which according to (2.11) is:

$\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$    (2.12)

Hence, leverage values greater than $2p/n$ are considered by this rule to indicate cases outlying with regard to their X values.
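The $2p/n$ rule of thumb can be sketched as follows; the data set is hypothetical, with one deliberately extreme x value acting as a leverage point.

```python
import numpy as np

# Hypothetical data with one extreme x value (a leverage point at x = 20).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages h_ii
cutoff = 2 * p / n                              # rule of thumb: 2p/n
flags = h > cutoff                              # True where the case is an X outlier
print("flagged cases:", np.where(flags)[0])
```

Only the remote point exceeds the cutoff; note also that the leverages sum to $p$, as equation (2.11) requires.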
(2.8) Identifying Influential Cases
After identifying cases that are outlying with respect to their Y values and /or their X
values, we want to determine the influence of these observations on the regression model.
We choose one measure of influence that is widely used in practice, which is Cook's
distance.
Cook's distance

A measure of the overall influence an outlying observation has on the fitted values was proposed by R. D. Cook (1977). Cook's distance measure, denoted by $D_i$, is an aggregate influence measure showing the effect of the $i$th case on all $n$ fitted values:

$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot MS_{Res}}$    (2.13)

Note that each of the $n$ fitted values $\hat{y}_j$ is compared with the corresponding fitted value $\hat{y}_{j(i)}$ obtained when the $i$th case is deleted in fitting the regression model. These differences are then squared and summed, so that the aggregate influence of the $i$th case is measured without regard to the signs of the effects.
An equivalent expression for Cook's distance is:

$D_i = \frac{e_i^2}{p \cdot MS_{Res}} \left[\frac{h_{ii}}{(1 - h_{ii})^2}\right]$    (2.14)

In this expression we note that $D_i$ depends on both the residual $e_i$ and the leverage $h_{ii}$ of the $i$th observation. A large value of $D_i$ indicates that the observed $y_i$ value has a strong influence on the fitted values (since the residual, the leverage, or both will be large). Values of $D_i$ can be compared with the F distribution with $\nu_1 = p$ and $\nu_2 = n - p$ degrees of freedom. Usually, an observation whose $D_i$ falls at or above the 50th percentile of this F distribution is considered influential (Mendenhall, 2012).
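As a hedged check that equations (2.13) and (2.14) really agree, the sketch below (on a hypothetical data set) computes Cook's distance both by the closed form and by explicitly deleting each case and refitting.

```python
import numpy as np

# Hypothetical data: a straight-line trend with one aberrant response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
e = y - yhat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
ms_res = (e @ e) / (n - p)

# Eq. (2.14): closed form using residual and leverage.
D_closed = (e**2 / (p * ms_res)) * (h / (1.0 - h)**2)

# Eq. (2.13): delete each case in turn, refit, and compare all fitted values.
D_delete = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_delete[i] = ((yhat - X @ beta_i)**2).sum() / (p * ms_res)
```

The two computations coincide, and the most influential case is the aberrant last point.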
(2.9) Example
Table (2.1) shows steel employment by country in thousands of people for the years 1974
and 1992, with x = the steel employment in thousands in year 1974 and y = the steel
employment in thousands in year 1992. First we make a plot of the data; this is shown in
Figure (2.3). We see that a straight line might describe the overall pattern of the data
reasonably well, and so the simple linear regression model 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜀 seems like a
good model to try. We use the least squares method to estimate the parameters $\beta_0$ and $\beta_1$; the result is $\hat{\beta}_0 = -0.3139$ and $\hat{\beta}_1 = 0.4004$, so the fitted regression line is $\hat{y} = -0.3139 + 0.4004x$. This is the line in figure (2.4).
The fitted responses $\hat{y}_i$, the residuals $e_i$, the diagonal elements $h_{ii}$ of the hat matrix, the studentized residuals $r_i$, the R-student statistics $t_i$, and the Cook's distances associated with each observation are calculated and listed in table (2.2), and plots of each measure are provided in figures (2.5) to (2.9).
Country           1974 (x)   1992 (y)
Germany             232        132
Italy                96         50
France              158         43
United Kingdom      194         41
Spain                89         33
Belgium              64         25
Netherlands          25         16
Luxembourg           23          8
Portugal              4          3
Denmark               2          1

Table (2.1): Steel Employment by Country in Europe, in Thousands, 1974 and 1992. Source: (Draper and Smith, 1998, page 573)
Figure (2.3) : plot of the steel employment data
Table (2.2): fitted values, residuals, leverage, studentized residuals, R-student residuals, and Cook's distances

Country            y_i     ŷ_i      e_i      h_ii    r_i      t_i      D_i
*Germany           132     92.57    39.43    0.44    2.53     5.34     2.54
Italy               50     38.12    11.88    0.10    0.60     0.58     0.02
France              43     62.95   -19.95    0.18   -1.06    -1.07     0.12
*United Kingdom     41     77.36   -36.36    0.28   -2.07     2.83     0.85
Spain               33     35.32    -2.32    0.10   -0.12    -0.11     0.00
Belgium             25     25.31    -0.31    0.11   -0.016   -0.015    0.00
Netherlands         16      9.70     6.30    0.17    0.33     0.31     0.01
Luxembourg           8      8.90    -0.90    0.17   -0.05    -0.04     0.00
Portugal             3      1.29     1.71    0.22    0.09     0.087    0.00
Denmark              1      0.49     0.51    0.23    0.03     0.026    0.00

Figure (2.4): the least squares line for the Steel Employment example.
Figure (2.5): plot of the residuals of the least squares fit in the Steel Employment example.
It can be seen from table (2.2) that observations 1 and 4 (Germany and the United Kingdom) have the largest residuals, $e_1 = 39.43$ and $e_4 = -36.36$, and the corresponding values of the R-student statistic are $t_1 = 5.34$ and $t_4 = 2.83$. At test level $\alpha = 0.05$, the Bonferroni critical value is $t\left(1 - \frac{0.05}{2 \times 10};\; 10 - 2 - 1\right) = 4.029$, so the first observation (Germany) is identified as an outlier, while the fourth observation (United Kingdom) is not.

As we can also note from table (2.2), the largest $h_{ii}$ value belongs to the first observation (Germany), with $h_{11} = 0.44$. Comparing this value with $\frac{2p}{n} = \frac{2 \times 2}{10} = 0.4$, observation 1 is considered an X outlier, while all other observations are not.

Finally, to determine which observations are influential, we use the last column, which contains the Cook's distances $D_i$. As shown in the table, the largest values are $D_1 = 2.54$ and $D_4 = 0.85$. Referring to the corresponding F distribution $F(p, n - p) = F(2, 8)$, we find that 2.54 falls at the 86th percentile and 0.85 at the 54th percentile of this distribution. Hence these two observations have a major influence on the regression fit.
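The ordinary least squares fit of the steel employment example can be reproduced as a short sketch; the data are taken from table (2.1), and the computed estimates, $R^2$, and $\sqrt{MS_{Res}}$ match the values reported in the text and in table (2.3).

```python
import numpy as np

# Steel employment data from table (2.1): x = 1974, y = 1992 (thousands).
x = np.array([232, 96, 158, 194, 89, 64, 25, 23, 4, 2], dtype=float)
y = np.array([132, 50, 43, 41, 33, 25, 16, 8, 3, 1], dtype=float)

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS estimates (beta0, beta1)

e = y - X @ beta
sse = e @ e
sst = ((y - y.mean())**2).sum()
r2 = 1.0 - sse / sst                          # coefficient of determination, eq. (2.6)
se = np.sqrt(sse / (n - p))                   # standard error sqrt(MS_Res)

print("beta0 = %.5f, beta1 = %.5f, R^2 = %.4f" % (beta[0], beta[1], r2))
```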
Figure (2.6): plot of the hat values $h_{ii}$ of the least squares fit in the Steel Employment example.

Figure (2.7): plot of the studentized residuals of the least squares fit in the Steel Employment example.

Figure (2.8): plot of the R-student residuals of the least squares fit in the Steel Employment example.
Figure (2.9): plot of the Cook's distances of the least squares fit in the Steel Employment example.
Now we perform the least squares estimation for two cases: first for all 10 observations, and second for only 8 observations (observations 1 and 4 excluded from the data set). In each case we compute the standard error of regression $\sqrt{MS_{Res}}$ and the coefficient of determination $R^2$; the results are summarized in table (2.3).

The results indicate that the best properties are obtained when observations 1 and 4 (the influential observations) are excluded: table (2.3) shows the decrease in $\sqrt{MS_{Res}}$ and the increase in $R^2$ in the second case.

Method                               $\hat{\beta}_0$   $\hat{\beta}_1$   $\sqrt{MS_{Res}}$   $R^2$
OLS for all cases                    -0.31386          0.40038           20.81               0.7357
OLS without observations 1 and 4      4.61012          0.30828            8.271              0.8281

Table (2.3): estimated parameters, $\sqrt{MS_{Res}}$, and $R^2$ for different scenarios in the Steel Employment data.
Chapter Three: Robust Regression Methods
(3.1) Overview
Outliers should be investigated carefully. Often they contain valuable information about
the process under investigation or the data gathering and recording process. Before
considering the possible elimination of these points from the data, one should try to
understand why they appeared and whether it is likely similar values will continue to
appear. Of course, outliers are often bad data points.
When the observations 𝑌 in the linear regression model 𝑌 = 𝑋𝛽 + 𝜀 are normally
distributed, the method of least squares is a good parameter estimation procedure in the
sense that it produces an estimator of the parameter vector 𝛽 that has good statistical
properties. However, there are many situations where we have evidence that the
distribution of the response variable is (considerably) non normal and/or there are outliers
that affect the regression model. A case of considerable practical interest is one in which
the observations follow a distribution that has longer or heavier tails than the normal.
These heavy-tailed distributions tend to generate outliers, and these outliers may have a
strong influence on the method of least squares in the sense that they "pull" the regression
equation too much in their direction.
To conclude, regression outliers (in either the X or the Y space) pose a serious threat to ordinary least squares analysis. Basically, there are two ways out of this problem. The first, and probably most well-known, approach is to construct regression diagnostics, which were discussed in chapter 2. Diagnostics are quantities computed from the data with the purpose of pinpointing influential points; these outliers can then be removed or corrected, followed by a least squares analysis on the remaining cases. When there is only a single outlier, some of these methods work quite well by looking at the effect of deleting one point at a time. Unfortunately, it is much more difficult to diagnose outliers when there are several of them, and diagnostics for such multiple outliers are quite involved and often require extensive computation.
The other approach is robust regression, which tries to devise estimators that are not so
strongly affected by outliers. Therefore, diagnostics and robust regression really have the
same goals, only in the opposite order: When using diagnostic tools, one first tries to delete
the outliers and then to fit the "good" data by least squares, whereas a robust analysis first
wants to fit a regression to the majority of the data and then to discover the outliers as
those points which possess large residuals from that robust solution. The following step is
to think about the structure that has been uncovered. For instance, one may go back to the
original data set and use subject-matter knowledge to study the outliers and explain their
origin (Rousseeuw and Leroy, 1987).
In this chapter, we study the most popular robust regression methods used in linear models; in addition, some properties of these methods will be discussed.
(3.2) Properties of Robust Estimators:
In this section we introduce two important properties of robust estimators: breakdown and
efficiency.
(3.2.1) Breakdown Point
In chapter two, we saw that even a single regression outlier can totally offset the least
squares estimator (page 15). On the other hand, we will see that there exist estimators that
can deal with data containing a certain percentage of outliers. In order to formalize this
aspect, the breakdown point was introduced.
Take any sample of $n$ data points, $Z = \{(x_{11}, \ldots, x_{1p}, y_1), \ldots, (x_{n1}, \ldots, x_{np}, y_n)\}$, and let $T$ be an estimator; applying $T$ to such a sample $Z$ yields a vector of regression coefficients $T(Z) = \hat{\beta}$. Now consider all possible corrupted samples $Z'$ obtained by replacing any $m$ of the original data points by arbitrary values (this allows for very bad outliers). Denote by bias$(m; T, Z)$ the maximum bias that can be caused by such contamination:

$\text{bias}(m; T, Z) = \sup_{Z'} \| T(Z') - T(Z) \|$    (3.1)

where the supremum is over all possible $Z'$. If bias$(m; T, Z)$ is infinite, then $m$ outliers can have an arbitrarily large effect on $T$, which may be expressed by saying that the estimator "breaks down". Therefore, the (finite-sample) breakdown point of the estimator $T$ at the sample $Z$ is defined as:

$b_n^*(T, Z) = \min \left\{ \frac{m}{n} : \text{bias}(m; T, Z) \text{ is infinite} \right\}$    (3.2)

In other words, it is the smallest fraction of contamination that can cause the estimator $T$ to take on values arbitrarily far from $T(Z)$ (Rousseeuw and Leroy, 1987).

For least squares, we have seen that one outlier is sufficient to carry $T$ over all bounds. Therefore, its breakdown point equals:

$b_n^*(T, Z) = \frac{1}{n}$    (3.3)

which tends to zero as the sample size $n$ increases, so the least squares method can be said to have a breakdown point of 0%. This again reflects the extreme sensitivity of the least squares method to outliers.
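The 0% breakdown point of least squares can be illustrated with a small hedged sketch: corrupting a single response ever more severely drives the OLS slope arbitrarily far from its clean value. The data below are simulated, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 21.0)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, size=x.size)  # clean linear data

def ols_slope(x, y):
    """OLS slope of a simple linear regression with intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

clean_slope = ols_slope(x, y)
slopes = []
for bad in [1e2, 1e4, 1e6]:          # corrupt one y value ever more severely
    y_bad = y.copy()
    y_bad[-1] = bad
    slopes.append(ols_slope(x, y_bad))
print("clean slope:", clean_slope, "corrupted slopes:", slopes)
```

A single corrupted point (out of 20) is enough to push the slope without bound, exactly as $b_n^* = 1/n$ predicts.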
(3.2.2) Efficiency
Suppose that the data set has no gross errors, there are no influential observations, and the
observations come from a normal distribution. If we use a robust estimator on such a data
set, we would want the results to be virtually identical to OLS, since OLS is the
appropriate technique for such data. The efficiency of a robust estimator can be thought of
as the residual mean square obtained from the OLS divided by the residual mean square of
the robust procedure ; we want this efficiency measure to be close to unity.
There is much emphasis in the robust regression literature on asymptotic efficiency, that is, the efficiency of an estimator as the sample size $n$ becomes infinite. This is a useful concept in comparing robust estimators, but many practical regression problems involve small to moderate sample sizes ($n < 50$, for instance), and small-sample efficiencies are known to differ dramatically from their asymptotic values. Consequently, a model-builder should be interested in the asymptotic behavior of any estimator that might be used in a given situation but should not be unduly excited about it. What is more important from a practical viewpoint is the finite-sample efficiency: how well a particular estimator works, with reference to OLS on "clean" data, for sample sizes consistent with those of interest in the problem at hand (Montgomery, 2006). The finite-sample efficiency of a robust estimator is defined as the ratio of the OLS residual mean square to the robust estimator's residual mean square:

$\text{Efficiency} = \frac{MS_{Res}(\text{OLS})}{MS_{Res}(\text{robust estimator})}$    (3.4)

where OLS is applied only to the clean data.
(3.3) Robust Regression Methods:
(3.3.1) M-estimation

M-estimators are "maximum likelihood type" estimators. Suppose the errors are independently distributed and all follow the same distribution $f(\varepsilon)$. Then the maximum likelihood estimator (MLE) of $\beta$ in the model $y = X\beta + \varepsilon$ is the $\hat{\beta}$ that maximizes:

$\prod_{i=1}^{n} f(y_i - x_i'\beta)$    (3.5)

where $x_i'$ is the $i$th row of $X$, $i = 1, 2, \ldots, n$. Equivalently, the MLE of $\beta$ maximizes:

$\sum_{i=1}^{n} \ln f(y_i - x_i'\beta)$    (3.6)

If the errors are normally distributed, this leads to minimizing the sum of squares:

$\sum_{i=1}^{n} (y_i - x_i'\beta)^2$    (3.7)

which is least squares estimation, and if the errors follow the double exponential (Laplace) distribution, we minimize:

$\sum_{i=1}^{n} |y_i - x_i'\beta|$    (3.8)

which is least absolute deviation (LAD) estimation, also called $L_1$-norm estimation.
This idea can be extended as follows. Suppose $\rho(\varepsilon)$ is a given function of $\varepsilon$; the M-estimation method minimizes a function $\rho$ of the errors:

$\min_\beta \sum_{i=1}^{n} \rho(\varepsilon_i) = \min_\beta \sum_{i=1}^{n} \rho(y_i - x_i'\beta)$    (3.9)

In general, M-estimation is not scale invariant (i.e., if the errors $y_i - x_i'\beta$ were multiplied by a constant, the new solution to (3.9) might not be the same as the old one). To obtain a scale-invariant version of this estimator, we solve:

$\min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{\varepsilon_i}{s}\right) = \min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{y_i - x_i'\beta}{s}\right)$    (3.10)

where $s$ is a robust estimate of scale. A popular choice for $s$ is based on the median absolute deviation (MAD), which is highly resistant to outlying observations:

$s = \frac{\text{median}\,|e_i - \text{median}(e_i)|}{h}, \quad i = 1, 2, \ldots, n$    (3.11)

The constant $h$ is suggested to be 0.6745, which makes $s$ an approximately unbiased estimator of $\sigma$ when $n$ is large and the error distribution is normal (Draper and Smith, 1998).
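The MAD-based scale of equation (3.11) takes only a few lines to sketch; the example checks the two properties the text claims, approximate unbiasedness for normal errors (here on simulated data) and resistance to a gross outlier.

```python
import numpy as np

def mad_scale(e, h=0.6745):
    """Robust scale estimate s of eq. (3.11): MAD of the residuals divided by h."""
    e = np.asarray(e, dtype=float)
    return np.median(np.abs(e - np.median(e))) / h

# For a large normal sample, s approximates the true sigma (here sigma = 2).
rng = np.random.default_rng(1)
e = rng.normal(0.0, 2.0, size=100_000)
s = mad_scale(e)
print("estimated scale:", round(s, 3))
```

A single gross outlier leaves the estimate essentially untouched, unlike the sample standard deviation.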
We see that if $\rho(\varepsilon) = \varepsilon^2$, the criterion minimized is the same as equation (3.7), while if $\rho(\varepsilon) = |\varepsilon|$ we get equation (3.8). So, in these specific cases, the form of $\rho$ and the underlying distribution are directly related: knowledge of an appropriate distribution for the errors tells us which $\rho(\varepsilon)$ to use.

To minimize equation (3.10), we equate the first partial derivatives of the objective with respect to $\beta_j$ ($j = 0, \ldots, k$) to zero, yielding a necessary condition for a minimum. This gives a system of $p = k + 1$ equations:
$\sum_{i=1}^{n} x_{ij}\, \psi\!\left(\frac{y_i - x_i'\beta}{s}\right) = 0, \quad j = 0, 1, \ldots, k$    (3.12)

where $\psi = d\rho/d\varepsilon$, $x_{ij}$ is the $i$th observation on the $j$th predictor, and $x_{i0} = 1$.

These equations do not in general have an explicit solution, and iterative methods or nonlinear optimization techniques must be used to solve them. In practice, iteratively reweighted least squares (IRLS) is often used. To use IRLS, first define the weights:

$w_i^{\beta} = \begin{cases} \dfrac{\psi\!\left((y_i - x_i'\beta)/s\right)}{(y_i - x_i'\beta)/s}, & y_i \ne x_i'\beta \\[2ex] 1, & y_i = x_i'\beta \end{cases}$    (3.13)

Then (3.12) becomes:

$\sum_{i=1}^{n} x_{ij}\, w_i^{\beta} (y_i - x_i'\beta) = 0, \quad j = 0, 1, \ldots, k$    (3.14)

which can be written in matrix notation as:

$X' W_\beta X \beta = X' W_\beta y$    (3.15)

where $W_\beta$ is an $n \times n$ diagonal matrix of weights with diagonal elements $w_1^{\beta}, w_2^{\beta}, \ldots, w_n^{\beta}$ given by equation (3.13). Equations (3.14) and (3.15) have the form of the usual least squares normal equations.
To solve equation (3.15), we follow these steps:

1. The least squares method is used to fit an initial model to the data, yielding the initial estimate $\hat{\beta}^{(0)}$ of the regression coefficients.
2. The initial residuals $e_i^{(0)}$ are found using $\hat{\beta}^{(0)}$ and are used to calculate the initial scale $s_0$ (eq. (3.11)).
3. A weight function $w(u)$ is chosen and applied to $e_i^{(0)}/s_0$, and (3.13) is used to obtain the initial weights $W_0$.
4. Using $W_0$, we obtain $\hat{\beta}^{(1)}$ from (3.15).
5. Using $\hat{\beta}^{(1)}$, new residuals $e_i^{(1)}$ are found; calculating $s_1$ and applying the weight function gives $W_1$.
6. $W_1$ is then used to get $\hat{\beta}^{(2)}$, and so on.

Usually several iterations are required to achieve convergence. We can write the iterative solution as:

$\hat{\beta}^{(q+1)} = (X' W_q X)^{-1} X' W_q y, \quad q = 0, 1, 2, \ldots$    (3.16)

The procedure may be stopped when all the estimates change by less than some preset amount, say 0.1% or 0.01%, or after a selected number of iterations.
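Steps 1-6 can be sketched as a minimal IRLS routine; this is an illustrative implementation under stated assumptions (Huber weights with tuning constant a = 1.345, MAD scale as in eq. (3.11)), not a production solver, and the data set is hypothetical.

```python
import numpy as np

def huber_weight(u, a=1.345):
    """Huber weight w(u) = psi(u)/u: 1 for |u| <= a, a/|u| otherwise."""
    au = np.abs(u)
    w = np.ones_like(u)
    mask = au > a
    w[mask] = a / au[mask]
    return w

def irls_m_estimate(X, y, a=1.345, tol=1e-8, max_iter=100):
    """M-estimation via iteratively reweighted least squares (steps 1-6)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # step 1: OLS start
    for _ in range(max_iter):
        e = y - X @ beta                                   # steps 2/5: residuals
        s = np.median(np.abs(e - np.median(e))) / 0.6745   # robust scale, eq. (3.11)
        if s == 0:
            break
        w = huber_weight(e / s, a)                         # step 3: weights, eq. (3.13)
        W = np.diag(w)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # eq. (3.16)
        if np.max(np.abs(beta_new - beta)) < tol:          # stopping rule
            beta = beta_new
            break
        beta = beta_new
    return beta

# Hypothetical data: a clean line y = 1 + 2x with one gross outlier in y.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[9] = 100.0
X = np.column_stack([np.ones_like(x), x])
beta_irls = irls_m_estimate(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

On this data the IRLS estimate stays close to the true line while the OLS slope is pulled far away by the single outlier.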
A number of popular criterion (objective) functions $\rho(u)$ have been proposed, where $u$ denotes the scaled residual. Two of them will be discussed and studied here:

1. Huber's function.
2. Tukey's (bisquare) function.

The objective function $\rho(u)$, its derivative $\psi(u)$, and the weight function $w(u)$ for each criterion, in addition to those of the least squares method, are summarized in table (3.1).
Table (3.1): objective functions $\rho(u)$, influence functions $\psi(u)$, and weight functions $w(u) = \psi(u)/u$ for the Huber, Tukey, and least squares criteria.

Huber's criterion:

$\rho(u) = \begin{cases} \frac{1}{2}u^2, & |u| \le a \\ a|u| - \frac{1}{2}a^2, & |u| > a \end{cases}$
$\psi(u) = \begin{cases} u, & |u| \le a \\ a\,\text{sign}(u), & |u| > a \end{cases}$
$w(u) = \begin{cases} 1, & |u| \le a \\ a/|u|, & |u| > a \end{cases}$

Tukey's (bisquare) criterion:

$\rho(u) = \begin{cases} \frac{b^2}{6}\left(1 - \left(1 - (u/b)^2\right)^3\right), & |u| \le b \\ \frac{1}{6}b^2, & |u| > b \end{cases}$
$\psi(u) = \begin{cases} u\left(1 - (u/b)^2\right)^2, & |u| \le b \\ 0, & |u| > b \end{cases}$
$w(u) = \begin{cases} \left(1 - (u/b)^2\right)^2, & |u| \le b \\ 0, & |u| > b \end{cases}$

Least squares criterion:

$\rho(u) = \frac{1}{2}u^2, \quad \psi(u) = u, \quad w(u) = 1, \quad -\infty < u < \infty$

Robust regression procedures can be classified by the behavior of their $\psi$ function, which is called the influence function. The $\psi$ function controls the weight given to each residual. In the least squares method the $\psi$ function is unbounded, so least squares is not robust when the data follow a heavy-tailed distribution. Huber's criterion has a monotone $\psi$ function and does not weight large residuals as heavily as least squares, while Tukey's influence function is a hard redescender, that is, its $\psi$ function equals zero for sufficiently large $|u|$.

Figures (3.1), (3.2), and (3.3) show the weight functions of the Huber, Tukey, and least squares criteria, respectively. It can be seen that the least squares method gives a weight of 1 to all residuals, large or small, while the other criteria down-weight large residuals in different ways. In addition, the symmetry around $u = 0$ can be noted in each weight function. The constants $a = 1.345$ in the Huber weight function and $b = 4.685$ in the Tukey weight function, called tuning constants, are chosen to make the IRLS robust procedure 95 percent efficient for data generated by the normal error regression model (Kutner, et al., 2005). More about M-estimation can be found in (Hogg and Craig, 2006).

Figure (3.1): plot of the Huber weight function, with tuning constant a = 1.345.
Figure (3.2): plot of Tukey's weight function, with tuning constant b = 4.685.
Figure (3.3): plot of the ordinary least squares weight function.
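The two weight functions of table (3.1) can be sketched directly from their formulas; the checks below confirm the behavior described in the text: full weight near zero, Huber down-weighting proportional to $a/|u|$, Tukey's hard redescent to zero beyond $b$, and symmetry about $u = 0$.

```python
import numpy as np

def huber_w(u, a=1.345):
    """Huber weight function from table (3.1)."""
    u = np.asarray(u, dtype=float)
    au = np.abs(u)
    # np.maximum(au, a) avoids division by zero; for |u| > a it equals |u|.
    return np.where(au <= a, 1.0, a / np.maximum(au, a))

def tukey_w(u, b=4.685):
    """Tukey (bisquare) weight function from table (3.1)."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= b, (1.0 - (u / b)**2)**2, 0.0)
```

Plugging either function into an IRLS loop as the $w(u)$ of step 3 reproduces the corresponding M-estimator.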
(3.3.2) Repeated Median Estimators

The repeated median (RM) method was proposed by Siegel (1982). To understand the repeated median algorithm, we start with the bivariate case ($p = 2$). Given data $(x_1, y_1), \ldots, (x_n, y_n)$ with distinct $x_i$, we wish to predict $y$ from $x$ by selecting a robust straight line of the form $y = \beta_0 + \beta_1 x$.

To estimate $\beta_1$, note that each pair of points determines a line; we denote the slopes of these lines by:

$\hat{\beta}_1(i, j) = \frac{y_j - y_i}{x_j - x_i}$

All possible pairs of points give us $n(n-1)/2$ such estimates of $\beta_1$, which are to be combined in a resistant way. The RM estimate is the result of two stages of medians:

$\hat{\beta}_1 = \text{med}_i\; \text{med}_{j \ne i}\; \hat{\beta}_1(i, j)$    (3.17)

At the first (inner) stage we take the median of the slopes of the $n - 1$ lines passing through a given point $i$ and one other point. At the second (outer) stage we take the median of these $n$ medians; this is why the method is called "repeated median".

There are two ways to estimate $\beta_0$. If $\hat{\beta}_1$ is used, then to each point we can associate the intercept $\hat{\beta}_{0,i} = y_i - \hat{\beta}_1 x_i$. Only a single median is now needed, and the RM estimate is
$\hat{\beta}_0 = \text{med}_i\; \hat{\beta}_{0,i}$    (3.18)

Alternatively, if we do not use $\hat{\beta}_1$, then we need a pair of points for each estimate of $\beta_0$, using $\hat{\beta}_0(i, j) = \frac{x_j y_i - x_i y_j}{x_j - x_i}$. A double (repeated) median is then taken as in (3.17), and the repeated median estimate of $\beta_0$ not using $\hat{\beta}_1$ is:

$\hat{\beta}_0 = \text{med}_i\; \text{med}_{j \ne i}\; \hat{\beta}_0(i, j)$    (3.19)
For the general case, for any $p$ observations with indices $\{i_1, \ldots, i_p\}$ we compute the coefficients $\beta_0(i_1, \ldots, i_p), \ldots, \beta_k(i_1, \ldots, i_p)$, with $p = k + 1$, such that the corresponding surface fits these $p$ points exactly. The $j$th coefficient of the repeated median estimator is then defined as:

$\hat{\beta}_j = \text{med}_{i_1}\left(\text{med}_{i_2}\left(\cdots\left(\text{med}_{i_p}\; \beta_j(i_1, \ldots, i_p)\right)\cdots\right)\right), \quad j = 0, 1, \ldots, k$    (3.20)

where the outer median is over all choices of $i_1$, the next over all choices of $i_2 \ne i_1$, and so on. This estimator can be computed explicitly but requires consideration of all subsets of $p$ points, which may be very time-consuming; it has been successfully applied to problems with small $p$ (Rousseeuw and Leroy, 1987).

The repeated median estimator was the first robust regression method with a 50% breakdown point, and it is robust against both X and Y outliers.
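The bivariate repeated median of equations (3.17) and (3.18) can be sketched in a few lines; the data below are hypothetical, a clean line with one gross outlier, which the double median ignores completely.

```python
import numpy as np

def repeated_median_line(x, y):
    """Repeated median slope and intercept, eqs. (3.17) and (3.18)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    inner = np.empty(n)
    for i in range(n):
        j = np.arange(n) != i
        inner[i] = np.median((y[j] - y[i]) / (x[j] - x[i]))  # med_j slope(i, j)
    b1 = np.median(inner)                                    # outer median, eq. (3.17)
    b0 = np.median(y - b1 * x)                               # eq. (3.18)
    return b0, b1

# Hypothetical data: y = 3 + 0.5x with one gross outlier at the first point.
x = np.arange(1.0, 11.0)
y = 3.0 + 0.5 * x
y[0] = 50.0
b0, b1 = repeated_median_line(x, y)
```

With nine of ten points exactly collinear, both medians recover the true slope 0.5 and intercept 3 despite the outlier.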
(3.3.3) Least Median of Squares

The least median of squares (LMS) estimator is obtained by minimizing the $m$th-ordered squared error:

$\min_\beta\; \text{med}_i\left(\varepsilon_i^2\right)$    (3.21)

where $m$ must be determined; possible choices are $m = \left[\frac{n}{2}\right] + \left[\frac{p+1}{2}\right]$ and $m = \left[\frac{n+1}{2}\right]$, with $n$ and $p$ denoting the sample size and the number of parameters, respectively, and $[\,\cdot\,]$ denoting the integer part of the argument. This estimator was introduced by Rousseeuw (1984). Whereas the least squares method minimizes the sum of the squared errors, the mean is typically not a good estimator of location when there are outliers, and the median is often preferable; therefore, LMS minimizes the median of the squared errors.
To view the LMS estimator geometrically, assume that we have a single explanatory
variable (𝑝 = 2). The LMS fitted model is the equation of the line that is in the center of
the narrowest strip that will cover the majority of the data, with distance measured
vertically.
In general, the advantage of LMS is that it has a high breakdown point, theoretically 50%.
That is, up to half of the data could be anomalous without rendering the regression model
useless (Rousseeuw and Leroy 1987).
Basically, LMS fits just half of the data, so it is possible for outliers that are aligned with good data values to pull the regression model away from the desired equation for the good data. As a result, LMS can perform poorly relative to OLS when OLS is the appropriate criterion. The asymptotic efficiency of LMS is actually zero (Ryan, 1997), because LMS requires that the squared residual be minimized at a particular point, in effect ignoring the fit at the other $n - 1$ observations. As $n$ becomes large, we would expect the fit at these $n - 1$ points to become poor relative to the least squares fit. The finite-sample efficiency of LMS can also be very low.

In an effort to improve the efficiency of LMS, Rousseeuw and Leroy (1987) suggested using the LMS estimates as starting values for computing a one-step M-estimator.
(3.3.4) Least Trimmed Sum of Squares Estimators
Another method developed by Rousseeuw (1984) is the least trimmed sum of squares (LTS)
estimator. Extending the idea of the trimmed mean, the LTS estimator minimizes the trimmed
sum of the squared residuals. The LTS estimator is found by:

min_𝛽 Σ_{𝑖=1}^{ℎ} 𝑒(𝑖)²                (3.22)

where 𝑒(1)² ≤ 𝑒(2)² ≤ … ≤ 𝑒(𝑛)² are the ordered squared residuals, from smallest to
largest (the residuals are first squared and then ordered), and the value of ℎ must be
determined. We might let ℎ = 𝑛/2, so that the estimator has a high breakdown point (50%),
but on the other hand a high breakdown point can sometimes produce poor results in normal
situations. Another choice is ℎ = [𝑛/2] + [(𝑝 + 1)/2], which increases the efficiency of
the estimator. Consequently, it seems preferable to use a larger value of ℎ and to speak of
a trimming percentage 𝛼; Rousseeuw and Leroy (1987) suggest that ℎ might be selected as
ℎ = [𝑛(1 − 𝛼)] + 1.
LTS is intuitively more appealing than LMS, due in part to the fact that the objective
function is not based on the fit at any particular point.
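To make the trimmed criterion concrete: since an LTS optimum coincides with the OLS fit of one of the h-point subsets, tiny problems can be solved by brute force over all subsets. The Python sketch below is illustrative only; practical implementations use the FAST-LTS algorithm instead:

```python
from itertools import combinations

def ols_fit(x, y):
    # Closed-form least squares for a straight line.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def lts_line(x, y, h):
    """Brute-force LTS: minimize the sum of the h smallest squared
    residuals by trying the OLS fit of every h-point subset.  Feasible
    only for tiny n; real implementations use FAST-LTS."""
    best_crit, best_fit = float("inf"), None
    for subset in combinations(range(len(x)), h):
        b0, b1 = ols_fit([x[i] for i in subset], [y[i] for i in subset])
        r2 = sorted((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
        crit = sum(r2[:h])  # criterion (3.22): trimmed sum of squares
        if crit < best_crit:
            best_crit, best_fit = crit, (b0, b1)
    return best_fit
```

When the trimmed subset happens to be exactly the clean points, the result is the OLS fit of the good data, which is the appeal described above.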
Part of the appeal of LTS is that if we are fortunate enough to trim exactly the bad data
points without trimming any good data points, we obtain the optimal estimator (OLS)
applied to the good data points. Currently, LTS is considered the preferred choice of
Rousseeuw and Ryan (Ryan, 1997).
(3.3.5) S-Estimators:
This method was first proposed by Rousseeuw and Yohai (1984), who chose to call these
estimators S-estimators because they are based on estimates of scale. In the same way that
the least squares estimator minimizes the variance of the errors, S-estimators minimize the
dispersion of the residuals, so the estimator 𝛽̂𝑆, in the notation used here, is obtained as:

min_𝛽 𝑠(𝑒1(𝛽), … , 𝑒𝑛(𝛽))                (3.23)

where 𝑒1(𝛽), … , 𝑒𝑛(𝛽) denote the 𝑛 residuals for a given candidate 𝛽, and
𝑠(𝑒1(𝛽), … , 𝑒𝑛(𝛽)) is given by the solution of

(1/𝑛) Σ_{𝑖=1}^{𝑛} 𝜌(𝑒𝑖 ⁄ 𝑠) = 𝑘                (3.24)
where 𝑘 is a constant, and the objective function 𝜌(. ) must be selected with the following
conditions:
1. 𝜌 is symmetric and continuously differentiable, and 𝜌(0) = 0.
2. There exists 𝑐 > 0 such that 𝜌 is strictly increasing on [0, 𝑐] and constant on [𝑐, ∞).
3. 𝑘 ⁄ 𝜌(𝑐) = 1/2.
The second condition on the objective function means that the associated influence function
will be redescending. A possible objective function to use is the one associated with the
Tukey bisquare objective function, given in Table (3.1).
𝜌(𝑥) = 𝑥²/2 − 𝑥⁴/(2𝑐²) + 𝑥⁶/(6𝑐⁴)  for |𝑥| ≤ 𝑐,  and  𝜌(𝑥) = 𝑐²/6  for |𝑥| > 𝑐.
The third condition is required to obtain a breakdown point of 50%. Often, 𝑘 is chosen so
that the resulting 𝑠 is an estimator of σ when the errors are normally distributed. To
achieve this, 𝑘 is set to 𝐸𝜙(𝜌(𝑢)), the expected value of the objective function when 𝑢 is
assumed to have a standard normal distribution (Rousseeuw and Leroy, 1987).
The asymptotic efficiency of this class of estimators depends on the objective function by
which they are defined; the tuning constants of this function cannot be chosen to give the
estimator simultaneously high breakdown point and high asymptotic efficiency.
When using the Tukey bisquare objective function, Rousseeuw and Yohai (1984) state that
setting 𝑐 = 1.548 satisfies the third condition, and so results in an S-estimator with 50%
breakdown point and about 28% asymptotic efficiency.
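The constants quoted above can be checked numerically: with the Tukey bisquare objective and c = 1.548, setting k = 𝐸𝜙(𝜌(u)) makes the ratio k/𝜌(c) come out very close to 1/2, the 50% breakdown condition. A small self-contained Python check (trapezoidal integration against the standard normal density; function names are ours):

```python
import math

def rho_bisquare(u, c):
    # Tukey bisquare objective: increasing on [0, c], constant beyond c.
    if abs(u) <= c:
        return u**2 / 2 - u**4 / (2 * c**2) + u**6 / (6 * c**4)
    return c**2 / 6

def expected_rho(c, half_width=8.0, steps=100_000):
    # k = E[rho(u)] for u ~ N(0, 1), by the trapezoidal rule.
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        u = -half_width + i * h
        phi = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * rho_bisquare(u, c) * phi
    return total * h

c = 1.548
k = expected_rho(c)
ratio = k / rho_bisquare(c, c)   # should be very close to 1/2
```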
A trade-off between breakdown and efficiency is possible through the choice of the tuning
constants 𝑐 and 𝑘; for example, choosing 𝑐 = 5.182 and 𝑘 = 0.4475 gives an asymptotic
efficiency of 96.6% but only a 10% breakdown point (Rousseeuw and Leroy, 1987).
The final scale estimate, s, is the standard deviation of the residuals from the fit that
minimized the dispersion of the residuals.
(3.3.6) MM-Estimation:
MM-estimation was first proposed by Yohai (1987); it combines a high breakdown point (50%)
with good efficiency. The "MM" in the name refers to the fact that more than one
M-estimation procedure is used to calculate the final estimates. As in the M-estimation
case, iteratively reweighted least squares (IRLS) is employed to find the estimates. The
procedure is as follows:
1. Initial estimates of the coefficients 𝛽 and the corresponding residuals 𝑒 are taken from a
highly resistant regression (i.e., a regression with a breakdown point of 50%). It is not
necessary that this estimator be efficient; as a result, S-estimation with Huber or
bisquare weights is typically employed at this stage, and these initial estimates of 𝛽 are
carried forward to the next steps.
2. The residuals 𝑒 from the initial estimation in step 1 are used to compute an M-estimate
of the scale of the residuals, 𝑠𝑛.
3. The initial residuals 𝑒 from step 1 and the residual scale 𝑠𝑛 from step 2 are used in
the first iteration of reweighted least squares to determine the M-estimates of the
regression coefficients, so an MM-estimator 𝛽̂ is defined as a solution to:

Σ_{𝑖=1}^{𝑛} 𝑤𝑖(𝑒𝑖 ⁄ 𝑠𝑛) 𝑥𝑖 = 0                (3.25)

where the 𝑤𝑖 are typically Huber or Tukey bisquare weights.
4. New weights 𝑤𝑖 are calculated using the residuals from the iteration in step 3.
5. Keeping the measure of the scale of the residuals 𝑠𝑛 fixed at its value from step 2,
steps 3 and 4 are iterated until convergence.
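Steps 3 to 5 amount to iteratively reweighted least squares with the residual scale held fixed. The Python sketch below illustrates the idea for a straight-line fit with Huber weights and a MAD scale; it is a simplified stand-in, since a genuine MM fit would start from a 50%-breakdown S-estimate rather than from OLS:

```python
from statistics import median

def huber_w(u, a=1.345):
    # Huber weight: 1 for small scaled residuals, a/|u| beyond a.
    return 1.0 if abs(u) <= a else a / abs(u)

def wls_fit(x, y, w):
    # Weighted least squares for a straight line.
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b1 * mx, b1

def irls_line(x, y, tol=1e-8, max_iter=100):
    """IRLS straight-line fit with Huber weights and a fixed MAD scale,
    mimicking steps 3-5 above.  Simplified sketch: the starting values
    come from OLS instead of a 50%-breakdown S-estimate."""
    b0, b1 = wls_fit(x, y, [1.0] * len(x))       # step-1 stand-in
    r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s = median(abs(ri) for ri in r) / 0.6745     # step 2: MAD scale
    if s == 0:
        return b0, b1
    for _ in range(max_iter):                    # steps 3-5
        w = [huber_w(ri / s) for ri in r]
        nb0, nb1 = wls_fit(x, y, w)
        done = abs(nb0 - b0) < tol and abs(nb1 - b1) < tol
        b0, b1 = nb0, nb1
        r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        if done:
            break
    return b0, b1
```

Even this simplified version pulls the fit much closer to the clean-data line than OLS when a single y-outlier is present.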
Although MM-estimation aims to obtain estimates that combine a high breakdown value with
high efficiency, MM-estimates can have trouble with high-leverage outliers in small- to
moderate-dimension data (Alma, 2011).
46
Chapter Four: Comparison among Robust Regression
Methods
In order to compare the robust methods discussed in Chapter Three, we consider two studies.
The first is based on real-life data that contain outliers, on which a comparison of these
robust methods is carried out. The second is a simulation study, in which different
scenarios are used to compare the robust regression methods and their properties are
discussed.
(4.1) Real-life data:
We refer to our example in Chapter Two; the data set is shown in Table (2.1), page 23, and
the plot of the data is shown in Figure (2.3), page 24. In this example we found that
observations 1 and 4 (Germany and the United Kingdom) were classified as influential
observations that can have a large effect on the regression fit.
Now, we find the estimates of the parameters 𝛽0 and 𝛽1 in the proposed model 𝑦 = 𝛽0 +
𝛽1𝑥 + 𝜀 using the ordinary least squares method (OLS) and using the robust methods
discussed in Chapter 3. The estimates are found for two cases: one for all 10 observations,
including the influential points, and the other with observations 1 and 4 removed
(8 observations).
The fitted models are compared to the OLS model with and without the influential points;
each fitted model is plotted against the least squares model, and hence the effect of the
unusual observations on each model can be seen. In addition, the efficiency of each robust
method can be judged from the performance of its model in the clean-data case
(8 observations). Plots of these models are provided in Figures (4.1) to (4.7).
The calculations were carried out using the program R 3.2.2. For the M-estimation with
Huber weights, we used the tuning constant a = 1.345 and a convergence accuracy of 0.001;
the estimates converged after 20 iterations (in M-H(10)) and 13 iterations (in M-H(8)).
For the Tukey weights, we used the tuning constant b = 4.685 and a convergence accuracy of
0.001; the estimates converged after 11 iterations (in M-T(10)) and 13 iterations
(in M-T(8)).
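For reference, the two weight functions behind these tuning constants can be written in a few lines (shown here as an illustrative Python sketch; the function names are ours): Huber weights with a = 1.345 and Tukey bisquare weights with b = 4.685.

```python
def huber_weight(u, a=1.345):
    # Full weight inside [-a, a]; hyperbolic decay outside.
    return 1.0 if abs(u) <= a else a / abs(u)

def tukey_weight(u, b=4.685):
    # Smooth descent to zero; scaled residuals beyond b get weight 0.
    return (1 - (u / b) ** 2) ** 2 if abs(u) <= b else 0.0
```

Both take the residual scaled by the residual scale estimate; Huber never fully discards a point, while the bisquare rejects scaled residuals beyond b outright.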
Method    𝛽̂0    𝛽̂1
OLS (10) −0.3139 0.4004
OLS(8) 4.6101 0.3083
M-H(10) 2.9593 0.3298
M-H (8) 4.4053 0.2881
M-T(10) 6.6582 0.2282
M-T (8) 4.2220 0.2846
RM (10) 1.5333 0.3727
RM (8) 1.5333 0.3727
LMS (10) 0.5500 0.3667
LMS (8) 0.8966 0.3678
LTS (10) 4.6101 0.3083
LTS (8) 1.9693 0.3584
S (10) 7.0645 0.2305
S (8) 3.7508 0.3128
MM (10) 7.0706 0.2307
MM (8) 4.2297 0.2860
Table (4.1): the estimated parameters for the 7 robust methods and OLS in the Steel
Employment data set containing outliers (10 observations) and in the clean data
(8 observations).
For the LTS estimates, we used ℎ = [𝑛/2] + [(𝑝 + 1)/2] = 5 + 1 = 6. For the LMS
estimates we used 𝑚 = [(𝑛 + 1)/2] = 5. Finally, the Tukey bisquare objective function was
used for the S-estimates and MM-estimates, with c = 1.548; the MM-estimates converged
after 5 iterations (in MM(10)) and 12 iterations (in MM(8)).
Figure (4.1): fitted lines for OLS(10) (solid) versus M-H(10) (dashed), and OLS(8) (solid) versus M-H(8) (dashed), in the steel employment example.
Figure (4.2): fitted lines for OLS(10) (solid) versus M-T(10) (dashed), and OLS(8) (solid) versus M-T(8) (dashed), in the steel employment example.
Figure (4.3): fitted lines for OLS(10) (solid) versus RM(10) (dashed), and OLS(8) (solid) versus RM(8) (dashed), in the steel employment example.
Figure (4.4): fitted lines for OLS(10) (solid) versus LMS(10) (dashed), and OLS(8) (solid) versus LMS(8) (dashed), in the steel employment example.
Figure (4.5): fitted lines for OLS(10) (solid) versus LTS(10) (dashed), and OLS(8) (solid) versus LTS(8) (dashed), in the steel employment example.
In order to explore the performance of each robust method and of OLS in the presence of
influential points, fitted values and residuals (and weights for M-H, M-T, and MM) are
calculated and summarized in Tables (4.2) to (4.9). The results will be discussed after
the simulation study.
Figure (4.6): fitted lines for OLS(10) (solid) versus S(10) (dashed), and OLS(8) (solid) versus S(8) (dashed), in the steel employment example.
Figure (4.7): fitted lines for OLS(10) (solid) versus MM(10) (dashed), and OLS(8) (solid) versus MM(8) (dashed), in the steel employment example.
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 92.5746957 39.4253043 1
50 38.1227863 11.8772137 1
43 62.9464509 -19.9464509 1
41 77.3601916 -36.3601916 1
33 35.3201145 -2.3201145 1
25 25.3105723 -0.3105723 1
16 9.6956866 6.3043134 1
8 8.8949232 -0.8949232 1
3 1.2876712 1.7123288 1
1 0.4869078 0.5130922 1
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 79.4854 52.5146 0.1408
50 34.6253 15.3747 0.4809
43 55.0762 -12.0762 0.6105
41 66.9509 -25.9509 0.2843
33 32.3163 0.6837 1
25 24.0699 0.9300 1
16 11.2057 4.7943 1
8 10.5460 -2.5460 1
3 4.2787 -1.2787 1
1 3.6190 -2.6190 1
Table (4.2): fitted values, residuals, and weights of OLS(10)
Table (4.3): fitted values, residuals, and weights of M-H(10)
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 59.611592 72.3884082 0
50 28.569936 21.4300635 0.4371
43 42.721279 0.2787207 1
41 50.938188 -9.9381881 0.8596
33 26.972204 6.0277958 0.9471
25 21.266018 3.7339824 0.9795
16 12.364366 3.6356335 0.9806
8 11.907872 -3.9078715 0.9776
3 7.571170 -4.5711697 0.9694
1 7.114675 -6.1146747 0.9456
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 88.006061 43.99393939
50 37.315152 12.68484848
43 60.424242 -17.42424242
41 73.842424 -32.84242424
33 34.706061 -1.70606061
25 25.387879 -0.38787879
16 10.851515 5.14848485
8 10.106061 -2.10606061
3 3.024242 -0.02424242
1 2.278788 -1.27878788
Table (4.4): fitted values, residuals, and weights of M-T(10)
Table (4.5): fitted values and residuals of RM(10)
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 85.616667 46.3833333
50 35.750000 14.2500000
43 58.483333 -15.4833333
41 71.683333 -30.6833333
33 33.183333 -0.1833333
25 24.016667 0.9833333
16 9.716667 6.2833333
8 8.983333 -0.9833333
3 2.016667 0.9833333
1 1.283333 -0.2833333
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 76.132078 55.8679219
50 34.205411 15.7945893
43 53.319038 -10.3190385
41 64.417274 -23.4172740
33 32.047421 0.9525795
25 24.340313 0.6596875
16 12.317224 3.6827759
8 11.700655 -3.7006555
3 5.843253 -2.8432534
1 5.226685 -4.2266848
Table (4.6): fitted values and residuals of LMS(10)
Table (4.7): fitted values and residuals of LTS(10)
𝑦𝑖    𝑦̂𝑖    𝑒𝑖
132 59.461828 72.5381724
50 27.672119 22.3278811
43 42.164486 0.8355139
41 50.579409 -9.5794090
33 26.035884 6.9641161
25 20.192187 4.8078125
16 11.076021 4.9239790
8 10.608525 -2.6085253
3 6.167316 -3.1673160
1 5.699820 -4.6998203
𝑦𝑖    𝑦̂𝑖    𝑒𝑖    𝑤𝑖
132 60.598189 71.4018113 0
50 29.219942 20.7800582 0.6911654
43 43.524731 -0.5247309 0.9997857
41 51.830737 -10.8307374 0.9104973
33 27.604885 5.3951150 0.9773915
25 21.836825 3.1631751 0.9921988
16 12.838651 3.1613488 0.9922088
8 12.377206 -4.3772064 0.9850926
3 7.993481 -4.9934807 0.9806204
1 7.532036 -6.5320359 0.9669535
Table (4.8): fitted values and residuals of S(10)
Table (4.9): fitted values, residuals, and weights of MM(10)
(4.2) Simulation study
In this section, we introduce a simulation study carried out to illustrate the robustness
of the estimators under different conditions. Simulation was used to compare the mean
squared errors (MSE) and the relative mean squared errors (RMSE) of the estimates of the
regression parameters for each estimation method, where:
MSE(𝛽̂) = (1/𝑚) Σ_{𝑖=1}^{𝑚} (𝛽̂𝑖 − 𝛽)²                (4.26)
where m is the number of replications, 𝛽 is the true parameter, and �̂� is the estimated
parameter.
RMSE(𝛽̂) = [MSE(𝛽̂_OLS) − MSE(𝛽̂_other method)] / MSE(𝛽̂_OLS)                (4.27)
where 𝑀𝑆𝐸(�̂�𝑂𝐿𝑆) stands for mean squared errors of the estimated parameter �̂� using OLS,
and 𝑀𝑆𝐸(�̂�𝑜𝑡ℎ𝑒𝑟 𝑚𝑒𝑡ℎ𝑜𝑑) stands for the mean squared errors of the estimated parameter �̂�
using the other methods.
The relative mean squared error is used as a measure of the quality of parameter
estimation. In fact, RMSE can be interpreted as a proportionate change from a baseline,
using the MSE of the OLS estimator within a given data condition as the baseline value.
Positive values of RMSE indicate a proportional reduction in the MSE of a given estimator
with respect to OLS estimation. Hence, RMSE is a relative measure of performance above and
beyond that of the OLS estimator (Nevitt and Tam, 1998).
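Equations (4.26) and (4.27) translate directly into code. A minimal Python sketch, assuming the replicated estimates are collected in plain lists:

```python
def mse(estimates, true_beta):
    # Equation (4.26): mean squared error over m replicates.
    m = len(estimates)
    return sum((b - true_beta) ** 2 for b in estimates) / m

def rmse(estimates_method, estimates_ols, true_beta):
    # Equation (4.27): proportional change in MSE relative to OLS.
    # Positive values mean the method improves on OLS.
    mse_ols = mse(estimates_ols, true_beta)
    return (mse_ols - mse(estimates_method, true_beta)) / mse_ols
```

An RMSE of 1 means the method's MSE is zero relative to OLS, while negative values mean the method does worse than OLS.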
We explore eight different regression estimators: ordinary least squares (OLS),
M-estimation using Huber weights (M-H), M-estimation using Tukey weights (M-T), repeated
medians (RM), least median of squares (LMS), least trimmed squares (LTS), the S-estimate,
and the MM-estimate. The program R 3.2.2 is used, and the same criteria that were
described in the last section are used here as well.
This simulation study is performed for sample sizes n =20, and n =100, to compare the
performance of these eight methods:
The model: we generate n samples {(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑛, 𝑦𝑛)} from the model
𝑦 = 3 + 2𝑥 + 𝜖
where 𝑥 ~ 𝑈𝑛𝑖𝑓𝑜𝑟𝑚 (−5,5), with the following cases:
Case 1: ϵ ~ N(0, 1) (standard normal distribution).
Case 2: ϵ ~ N(0, 1) with 10% identical outliers in the y direction (where we let the first
10% of the y's equal 40).
Case 3: ϵ ~ N(0, 1) with 25% identical outliers in the y direction (where we let the first
25% of the y's equal 40).
Case 4: ϵ ~ N(0, 1) with 10% identical high-leverage outliers (where we let the first 10%
of the x's equal 30).
Case 5: ϵ ~ N(0, 1) with 25% identical high-leverage outliers (where we let the first 25%
of the x's equal 30).
Case 6: ϵ ~ 0.90 N(0, 1) + 0.10 N(0, 100) (contaminated normal mixture).
Case 7: ϵ ~ Laplace(0, 4) (double exponential distribution, with mean = 0 and scale = 4).
Case 8: ϵ ~ 𝑡3 (t-distribution with 3 degrees of freedom).
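A sketch of the data-generating scheme for a few of the cases above (an illustrative Python stand-in for the R code; in the leverage case we keep the y values at their original model values, which is one reading of "the first 10% of x's equal to 30" and is consistent with the degraded OLS slopes in Tables (4.16) to (4.19)):

```python
import random

def make_replicate(n, case, rng):
    """One simulated sample from y = 3 + 2x + e, x ~ Uniform(-5, 5).
    Handles cases 1, 2, 4, and 6 of the scheme above; the remaining
    cases follow the same pattern with other error distributions."""
    x = [rng.uniform(-5, 5) for _ in range(n)]
    y = [3 + 2 * xi + rng.gauss(0, 1) for xi in x]
    k = n // 10                       # 10% contamination fraction
    if case == 2:                     # identical outliers in y
        for i in range(k):
            y[i] = 40
    elif case == 4:                   # identical high-leverage x values
        for i in range(k):
            x[i] = 30                 # y kept at its original value
    elif case == 6:                   # 0.90 N(0,1) + 0.10 N(0,100)
        # N(0, 100) read as variance 100, i.e. standard deviation 10.
        y = [3 + 2 * xi + rng.gauss(0, 10 if rng.random() < 0.1 else 1)
             for xi in x]
    return x, y
```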
Tables (4.10) to (4.25) report the estimates of the simulated parameters 𝛽̂ and their MSE
and RMSE for each estimation method, with sample sizes n = 20 and n = 100. The number of
replicates is 1000, and the true model is y = 3 + 2x + ϵ.
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.995978 0.04583142 0 1.999841 0.0048380 0
M-H 2.995145 0.04911524 -0.07164994 2.000145 0.0050053 -0.03459696
M-T 2.995071 0.05135233 -0.12046105 2.000573 0.0051709 -0.06882508
RM 2.992821 0.07895747 -0.72278011 1.999981 0.0064337 -0.32983238
LMS 2.992933 0.23095011 -4.03912151 2.007179 0.0217917 -3.50428522
LTS 2.995637 0.08152247 -0.77874623 2.003723 0.0083696 -0.72997397
S 2.994282 0.13839046 -2.01955409 2.002924 0.0141564 -1.92609450
MM 2.994530 0.05029169 -0.09731885 2.000454 0.0051294 -0.06023841
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.994201 0.01054727 0 2.000185 0.0013832 0
M-H 2.994965 0.01116005 -0.05809806 2.000301 0.0014644 -0.05865381
M-T 2.995026 0.01122873 -0.06460928 2.000321 0.0014717 -0.06393916
RM 2.994757 0.01576845 -0.49502622 2.000446 0.0019903 -0.43881233
LMS 2.993902 0.06875218 -5.51847774 1.999665 0.0086844 -5.27808152
LTS 2.994125 0.01521376 -0.44243480 2.001154 0.0020216 -0.46147568
S 2.999137 0.03217913 -2.05094290 2.000461 0.0045241 -2.27058359
MM 2.994909 0.01118382 -0.06035128 2.000339 0.0014734 -0.06516871
Table (4.10): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 1: ϵ ~ N(0, 1).
Table (4.11): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 1: ϵ ~ N(0, 1).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 6.249055 10.60751701 0 2.209871 0.0510380 0
M-H 3.177105 0.09552916 0.9909942 2.021826 0.0090712 0.8222659
M-T 3.003997 0.06468554 0.9939019 2.000270 0.0087058 0.8294253
RM 3.165193 0.12628024 0.9880952 2.005892 0.0133337 0.7387498
LMS 2.992060 0.24748724 0.9766687 2.015049 0.0339619 0.3345769
LTS 3.006635 0.08264286 0.9922090 2.002126 0.0121035 0.7628529
S 3.001087 0.13283632 0.9874772 2.003805 0.0186237 0.6351005
MM 3.004187 0.06304418 0.9940567 2.000396 0.0085174 0.8331166
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 6.902860 15.24168571 0 1.327680 0.4531280 0
M-H 3.200050 0.05203813 0.9965858 1.975971 0.0020372 0.9955041
M-T 3.000867 0.01184571 0.9992228 2.001818 0.0014626 0.9967721
RM 3.144881 0.03844485 0.9974777 1.986661 0.0021626 0.9952273
LMS 3.007946 0.06356390 0.9958296 1.999836 0.0076666 0.9830806
LTS 3.001091 0.01324594 0.9991309 2.002061 0.0016720 0.9963099
S 3.001456 0.02861111 0.9981228 2.002527 0.0035112 0.9922510
MM 3.000773 0.01180496 0.9992255 2.001805 0.0014594 0.9967791
Table (4.12): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 2: ϵ ~ N(0, 1) with 10% identical outliers in the y direction (the first 10% of the y's set to 40).
Table (4.13): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 2: ϵ ~ N(0, 1) with 10% identical outliers in the y direction (the first 10% of the y's set to 40).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 12.07198 82.3398961 0 1.599496 0.1635614 0
M-H 3.929834 1.00614249 0.9877806 2.047522 0.0099448 0.9391979
M-T 3.013318 0.07005663 0.9991492 2.001874 0.0074642 0.9543644
RM 3.425122 0.30628199 0.9962803 2.035532 0.0114673 0.9298898
LMS 3.009960 0.20560661 0.9975030 2.003935 0.0189918 0.8838856
LTS 3.010615 0.07204460 0.9991250 2.002237 0.0077773 0.9524500
S 3.013494 0.10521390 0.9987222 2.000408 0.0106107 0.9351269
MM 3.012502 0.06928318 0.9991586 2.001979 0.0073938 0.9547946
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 12.344139 87.320997 0 1.361525 0.408577 0
M-H 3.946737 0.9280973 0.9893714 1.977289 0.002140 0.9947616
M-T 3.000979 0.0145881 0.9998329 1.998911 0.001625 0.9960215
RM 3.478967 0.2537359 0.9970942 1.967613 0.003292 0.9919410
LMS 3.002669 0.0568738 0.9993487 2.000991 0.006130 0.9849964
LTS 3.000531 0.0149311 0.9998290 1.998954 0.001644 0.9959756
S 3.003359 0.0234364 0.9997316 1.999011 0.002554 0.9937489
MM 3.000738 0.0145242 0.9998337 1.998913 0.001619 0.9960355
Table (4.14): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 3: ϵ ~ N(0, 1) with 25% identical outliers in the y direction (the first 25% of the y's set to 40).
Table (4.15): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 3: ϵ ~ N(0, 1) with 25% identical outliers in the y direction (the first 25% of the y's set to 40).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 1.288070 2.98099470 0 0.244617 3.0814555 0
M-H 1.089816 3.70799281 -0.2438777 0.237815 3.1053989 -0.00777016
M-T 1.126466 3.56746045 -0.1967349 0.232027 3.1258236 -0.01439843
RM 2.849285 0.13472359 0.9548058 1.915814 0.0190538 0.993816596
LMS 2.993807 0.24455836 0.9179608 2.005190 0.0334030 0.989159975
LTS 2.990957 0.07803726 0.9738217 1.999392 0.0113597 0.996313501
S 2.992096 0.14175331 0.9524476 1.999533 0.0186715 0.993940688
MM 2.996869 0.06041228 0.9797342 2.002878 0.0082177 0.997333146
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.085423 0.84655160 0 0.236196 3.1110229 0
M-H 2.028269 0.95464874 -0.1276911 0.234379 3.1174351 -0.00206110
M-T 2.012842 0.98480013 -0.1633079 0.227152 3.1430088 -0.01028145
RM 2.918987 0.02560080 0.9697587 1.931363 0.0066212 0.997871677
LMS 2.993970 0.06617510 0.9218298 2.003158 0.0077459 0.997510175
LTS 2.995040 0.01253096 0.9851976 2.003321 0.0016736 0.999462027
S 2.991978 0.02953308 0.9651137 2.003435 0.0038285 0.998769369
MM 2.995322 0.01083792 0.9871976 2.002815 0.0014156 0.999544955
Table (4.16): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 4: ϵ ~ N(0, 1) with 10% identical outliers in the x direction (the first 10% of the x's set to 30).
Table (4.17): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 4: ϵ ~ N(0, 1) with 10% identical outliers in the x direction (the first 10% of the x's set to 30).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.096356 0.88028827 0 0.154525 3.4058541 0
M-H 2.058576 0.95462952 -0.08445102 0.154592 3.4056108 7.1446e-05
M-T 2.059764 0.95091454 -0.08023084 0.149945 3.4227817 -4.97013e-03
RM 3.006138 0.27467702 0.68796924 1.753267 0.0767181 9.77474e-01
LMS 3.031105 2.17743042 -1.47354247 1.885200 0.2465114 9.27621e-01
LTS 3.095153 0.89125082 -0.01245336 1.949623 0.1165513 9.65779e-01
S 2.998620 0.11055581 0.87440954 2.001756 0.0119567 9.96489e-01
MM 3.011508 0.06661944 0.92432088 2.000884 0.0072728 9.97864e-01
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 3.135412 0.03200837 0 0.112413 3.5629991 0
M-H 3.211086 0.06459291 -1.0180009 0.097751 3.6185800 -0.01559946
M-T 3.201407 0.06087894 -0.9019697 0.096080 3.6249452 -0.01738593
RM 3.428996 0.21501995 -5.7176173 1.748848 0.0662590 0.98140356
LMS 2.993440 0.05512972 -0.7223533 2.000750 0.0064637 0.99818588
LTS 2.997902 0.01401274 0.5622163 2.000876 0.0016153 0.99954662
S 2.998735 0.02225039 0.3048571 2.001040 0.0026848 0.99924646
MM 2.997181 0.01375233 0.5703519 2.000983 0.0015690 0.99955964
Table (4.18): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 5: ϵ ~ N(0, 1) with 25% identical outliers in the x direction (the first 25% of the x's set to 30).
Table (4.19): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 5: ϵ ~ N(0, 1) with 25% identical outliers in the x direction (the first 25% of the x's set to 30).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.701215 48.8654771 0 2.099622 5.8676453 0
M-H 2.997033 0.09496866 0.9980565 2.007969 0.0108911 0.9981439
M-T 3.002315 0.06410098 0.9986882 2.005798 0.0072504 0.9987643
RM 3.002522 0.10835005 0.9977827 2.010181 0.0113361 0.9980680
LMS 3.001221 0.24580465 0.9949698 2.007500 0.0260866 0.9955542
LTS 3.008946 0.08172036 0.9983276 2.005251 0.0096179 0.9983608
S 3.004459 0.13328527 0.9972724 2.011206 0.0149691 0.9974489
MM 3.002500 0.00630079 0.9987050 2.006079 0.0071685 0.9987783
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.884353 9.72510518 0 2.047791 0.92958906 0
M-H 2.998212 0.01604806 0.9983498 2.001620 0.00162942 0.9982472
M-T 2.999628 0.01251891 0.9987127 2.000164 0.00120415 0.9987046
RM 2.998187 0.02138542 0.9978010 2.002489 0.00183232 0.9980289
LMS 3.007326 0.07177498 0.9926196 2.000362 0.00600956 0.9935352
LTS 2.999713 0.01426245 0.9985334 2.000604 0.00136659 0.9985299
S 2.999414 0.03015062 0.9968997 2.001449 0.00288640 0.9968950
MM 2.999562 0.01247196 0.9987175 2.000160 0.00120070 0.9987084
Table (4.20): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 6: ϵ ~ 0.90 N(0, 1) + 0.10 N(0, 100).
Table (4.21): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 6: ϵ ~ 0.90 N(0, 1) + 0.10 N(0, 100).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.977393 1.571826 0 2.014600 0.1959045 0
M-H 2.968720 1.227067 0.21933656 2.006962 0.1543497 0.21211768
M-T 2.965907 1.245367 0.20769422 2.003825 0.1604707 0.18087269
RM 2.988083 1.517493 0.03456703 2.001139 0.1688664 0.13801660
LMS 2.976926 3.119543 -0.98466147 2.007580 0.4165185 -1.12613043
LTS 2.950252 1.540569 0.01988604 2.006313 0.1992438 -0.01704538
S 3.004480 1.940194 -0.23435647 1.986397 0.2623846 -0.33934969
MM 2.969494 1.227378 0.21913853 2.007090 0.1574824 0.19612649
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 3.017684 0.3425886 0 2.003316 0.0388032 0
M-H 3.018909 0.2481231 0.2757403 2.002451 0.0281271 0.2751344
M-T 3.022460 0.2482821 0.2752763 2.001754 0.0288226 0.2572122
RM 3.008481 0.2234773 0.3476801 1.997753 0.0262700 0.3229953
LMS 3.021568 0.6906369 -1.0159366 1.993617 0.0751224 -0.9359819
LTS 3.013680 0.2493054 0.2722891 1.996493 0.0288520 0.2564533
S 3.004208 0.2896170 0.1546215 1.992827 0.0347323 0.1049114
MM 3.020064 0.2473662 0.2779497 2.001668 0.0287226 0.2597878
Table (4.22): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 7: ϵ ~ Laplace(0, 4).
Table (4.23): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 7: ϵ ~ Laplace(0, 4).
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 3.001632 0.15626264 0 2.002546 0.01717085 0
M-H 3.008082 0.09657450 0.38197320 2.002850 0.01210816 0.2948419
M-T 3.010029 0.09422788 0.39699032 2.003081 0.01234496 0.2810516
RM 3.004152 0.12667501 0.18934547 2.004627 0.01392421 0.1890786
LMS 3.008877 0.10430774 0.33248445 2.002809 0.01323696 0.2291026
LTS 3.013229 0.13166488 0.15741288 2.003121 0.01901876 -0.1076190
S 3.011049 0.16906975 -0.08195891 2.004820 0.02258645 -0.3153952
MM 3.010296 0.09390307 0.39906896 2.002237 0.01229587 0.2839105
Method    𝛽̂0    MSE(𝛽̂0)    RMSE(𝛽̂0)    𝛽̂1    MSE(𝛽̂1)    RMSE(𝛽̂1)
OLS 2.995572 0.03194884 0 1.997565 0.00332865 0
M-H 2.997807 0.01592158 0.5016541 1.999147 0.00191222 0.42552514
M-T 2.998543 0.01556788 0.5127249 1.999688 0.00191913 0.42344990
RM 3.000725 0.01919937 0.3990591 1.999239 0.00210048 0.36897003
LMS 2.998201 0.01721277 0.4612396 1.998625 0.00204566 0.38543730
LTS 3.000598 0.01754478 0.4508477 1.999787 0.00229034 0.31193150
S 3.003417 0.02694119 0.1567397 2.000963 0.00357390 -0.07367771
MM 2.998975 0.01561128 0.5113663 1.999643 0.00191381 0.42504901
Table (4.24): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 20. Case 8: ϵ ~ 𝑡3.
Table (4.25): simulated mean of the estimated parameters, MSE, and RMSE of point estimates, with n = 100. Case 8: ϵ ~ 𝑡3.
(4.3) Discussion
Under ideal conditions (normal error distribution, no contamination), the first two
tables, (4.10) and (4.11), show that the best estimates are the ordinary least squares
(OLS) estimates, as expected. However, the M-estimates with the Huber and Tukey criteria
and the MM-estimates are very close to the OLS estimates, which indicates that they have
high efficiency. The repeated median (RM) performed reasonably well, so it has good
efficiency. The RMSE values of the LTS estimates improved when n increased from 20 to 100,
but LTS is not efficient in the normal case. In fact, the poor performance of the
remaining estimates, LMS and S, is apparent when they are compared to the other methods,
and the corresponding RMSE values confirm this result.
In the other tables, as expected, the least squares estimates perform poorly in all
contamination scenarios.
For Y outliers at both 10% and 25% (Tables (4.12) to (4.15)), the robust estimators in
general have good estimated coefficients and positive RMSE values; in particular, the
M-H, M-T, MM, RM, and LTS estimates perform strongly.
For X outliers at both 10% and 25% (Tables (4.16) to (4.19)), the results are different:
the poor performance of M-H and M-T shows in their negative RMSE values, while the other
methods performed well; the MM-estimates and LTS estimates are considered the best in the
presence of X outliers.
For Tables (4.20) and (4.21), which contain the contaminated normal distribution, all
methods perform well, with an advantage to the MM- and M-T estimates.
For the heavy-tailed distributions (Laplace(0, 4) and 𝑡3) in Tables (4.22) to (4.25),
M-H is considered the best estimator, while the MM-estimates and M-T also performed well
under these distributions.
Overall, the best estimates are the MM-estimates, which have the smallest MSE in most
scenarios, for both Y-outliers and X-outliers (leverage points), and for both sample sizes
(n = 20, n = 100), although they are not the most appropriate choice when the errors
follow a Laplace distribution.
M-estimators with Huber and Tukey weights are not sensitive to Y-outliers, but they are
highly sensitive to leverage points (X-outliers). Repeated median estimation is not
sensitive to either Y-outliers or X-outliers, and it is competitive across scenarios.
S-estimation and the LTS method are robust to both X and Y outliers, and the S-estimates
perform well when the sample size is large (n = 100). LMS estimation is not sensitive to
X- or Y-outliers, but it is not a competitive method when compared to the others.
If we refer to the real data set of the Steel Employment example discussed in the last
section, we find that the robust regression methods can be classified into two sets: the
first contains estimators that use all available data but resist outliers by downweighting
their effects (M-H, M-T, and MM); the second contains estimators that ignore outliers and
do not use all available data (RM, LMS, LTS, and S). Our results in the real data example
and in the simulation study suggest that the methods which use all observations provide
more accurate estimates of the true population values. Hence, M-estimators and
MM-estimators are recommended.
Conclusion
The usual assumption for a linear regression model is that the error terms have a normal
distribution, under which the OLS estimation procedure gives optimal results. However, in
real life it is rare to find a data set that fully satisfies the normality assumption. The
poor performance of OLS estimation under contaminated data conditions and non-normal error
distributions underlines both the importance of assessing the underlying assumptions as
part of any regression analysis and the need for alternatives to OLS regression.
Looking at the summary tables for the estimators, OLS can be badly distorted by even a
single outlying observation in the data. Not only is detection of these influential points
important, but using regression methods that are less sensitive to them is even more
important in regression analysis. The poor performance of the OLS estimators in the
presence of outliers confirms the need for alternative methods.
In this thesis, seven robust regression methods were comparatively evaluated against OLS
regression method. In general, as expected, robust estimators have better performance than
ordinary least squares method (OLS) in presence of outliers.
The best estimator is the MM-estimator, which combines a high breakdown point with
high efficiency and performs well across the different outlier scenarios.
The repeated median method is also very robust; it has a high breakdown point and good
efficiency, and performs well with outliers in both the X and Y spaces.
M-estimation is the best method when only Y-outliers are present, and it is also highly
efficient when the data contain no outliers.
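This behavior of M-estimation under Y-outliers can be illustrated with a small sketch (ours, not code from the thesis): the following Python fragment fits a Huber M-estimator by iteratively reweighted least squares (IRLS) on simulated data contaminated with gross Y-outliers and compares it with OLS. The tuning constant c = 1.345 and the MAD-based scale estimate are standard choices; all names (`huber_m_fit`, etc.) are our own.

```python
import numpy as np

def huber_weights(u, c=1.345):
    """Huber weight function: 1 for small standardized residuals, c/|u| beyond c."""
    a = np.abs(u)
    w = np.ones_like(a)
    mask = a > c
    w[mask] = c / a[mask]
    return w

def huber_m_fit(x, y, c=1.345, n_iter=50):
    """Huber M-estimator for simple linear regression via IRLS."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS starting values
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale estimate
        w = huber_weights(r / s, c)
        WX = X * w[:, None]                               # weighted design matrix
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)        # weighted normal equations
    return beta

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, x.size)  # true intercept 2, true slope 3
y[-3:] -= 40.0                                    # three gross Y-outliers at large x

X = np.column_stack([np.ones_like(x), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_rob = huber_m_fit(x, y)
```

On this simulated data the OLS slope is dragged well away from the true value 3 by the three outliers, while the Huber M-fit down-weights them and stays close, which is exactly the behavior described above.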
The least trimmed squares (LTS) method and the S-estimator have good robustness
properties, but their efficiencies are not high.
Finally, it is recommended that robust regression be used in conjunction with, and as a
check on, the method of least squares. If the results of the two procedures are substantially
the same, the least squares fit should be used, since confidence intervals and tests on the
regression coefficients are then available. If, on the other hand, the two analyses give quite
different results, the robust method should be preferred.
REFERENCES
[1] Adnan, R. and Mohamad, N.M. (2003), Multiple Outliers Detection Procedures in
Linear Regression. Matematika, Jabatan Matematik, UTM, Volume (19): 29-45.
[2] Alma, O.G. (2011), Comparison of Robust Regression Methods in Linear Regression.
International Journal of Contemporary Mathematical Sciences, Volume (6): 409-421.
[3] Birkes, D. and Dodge, Y. (1993), Alternative Methods of Regression. John Wiley &
Sons Ltd.
[4] Cook, R.D. (1977), Detection of Influential Observation in Linear Regression.
Technometrics, Volume (19): 15-18.
[5] Croux, C., Rousseeuw, P.J. and Hössjer, O. (1994), Generalized S-estimators. Journal
of the American Statistical Association, Volume (89): 1271-1281.
[6] Dalgaard, P. (2008), Introductory Statistics with R. Second Edition. Springer.
[7] Draper, N.R. and Smith, H. (1998), Applied Regression Analysis. Third edition.
Wiley-Interscience Publication.
[8] Faraway, J.J., (2002), Practical Regression and Anova Using R. Springer.
[9] Fox, J., (2002), An R and S-Plus Companion to Applied Regression. Sage
Publications, Inc.
[10] Gervini, D. and Yohai, V.J. (2002), A Class of Robust and Fully Efficient Regression
Estimators. The Annals of Statistics, Volume (30): 583-616.
[11] Hogg, R.V., McKean, J.W. and Craig, A. T. (2006), Introduction to Mathematical
Statistics. Sixth edition. Pearson Education International.
[12] Huber, P. J. and Ronchetti E. M. (2009), Robust Statistics. Second edition. John
Wiley & Sons Ltd.
[13] Kutner, M.H., Nachtsheim, C.J. and Neter, J. (2005), Applied Linear Statistical
Models. Fifth edition. McGraw-Hill.
[14] Lee, Y., MacEachern, S.N. and Jung, Y. (2011), Regularization of Case-Specific
Parameters for Robustness and Efficiency. Statistical Science, Volume (27): 350-372,
DOI: 10.1214/11-STS377.
[15] Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006), Robust Statistics: Theory and
Methods. John Wiley & Sons Ltd., England.
[16] Mendenhall, W. and Sincich, T. (2012), Regression Analysis: A Second Course in
Statistics. 7th edition. Upper Saddle River, NJ: Prentice Hall.
[17] Mohebbi, M.K., Nourijelyani, K.H. and Zeraati, H. (2007), A Simulation Study on
Robust Alternatives of Least Squares Regression. Journal of Applied Sciences, Volume
(7): 3469-3476.
[18] Montgomery, D. C. and Peck, E. A. (2006), Introduction to Linear Regression
Analysis, 4th Edition. John Wiley & Sons, Inc., New York.
[19] Nevitt, J. and Tam, H.P. (1998), A Comparison of Robust and Nonparametric
Estimators under the Simple Linear Regression Model. Multiple Linear Regression
Viewpoints, Volume (25): 54-69.
[20] Noor N.H. and Mohammad A.A. (2013), Model of Robust Regression with Parametric
and Nonparametric Methods. Mathematical Theory and Modeling, Volume (3): 27-39.
[21] Renaud, O. and Victoria-Feser, M.-P. (2010), A Robust Coefficient of Determination
for Regression, Journal of Statistical Planning and Inference, Volume (140): 1852-
1862.
[22] Rencher, A.C. and Schaalje, G.B. (2008), Linear Models in Statistics, Second
Edition, John Wiley & Sons ,Inc., New York.
[23] Rousseeuw, P.J. (1984), Least Median of Squares Regression. Journal of the
American Statistical Association, Volume (79): 871-880.
[24] Rousseeuw, P.J. and Leroy, A.M. (1987), Robust Regression and Outlier Detection.
John Wiley & Sons, New York.
[25] Rousseeuw, P.J. and Yohai, V.J. (1984), Robust Regression by Means of S-estimators.
In J. Franke, W. Härdle and D. Martin (Eds.), Lecture Notes in Statistics, Springer Verlag,
New York, Volume (26): 256-272.
[26] Ryan, T.P. (1997), Modern Regression Analysis. John Wiley & Sons, New York.
[27] Siegel, A. F. (1982), Robust Regression Using Repeated Medians. Biometrika,
Volume (69): 242-244.
[28] Staudte, R.G. and Sheather, S.J. (1990), Robust Estimation and Testing. John Wiley &
Sons, New York.
[29] Venables, W.N. and Ripley, B.D. (2002), Modern Applied Statistics with S. Fourth
Edition. Springer.
[30] Weisberg, S. (2005), Applied Linear Regression. John Wiley & Sons, New Jersey.
[31] Yan, X. and Su, X.G. (2009), Linear Regression Analysis: Theory and Computing.
World Scientific Publishing Co. Pte. Ltd.
[32] Yohai, V. J. (1987), High Breakdown-point and High Efficiency Robust Estimates for
Regression. The Annals of Statistics, Volume (15): 642-656.
Robust Methods in Regression Analysis: Comparison and Improvement
Prepared by
Mohammad Abd-Almonem Hussein Al-Amleh
Supervisor
Professor Faris Al-Athari
Abstract
The least squares method is the best method used in regression analysis, but only under
certain appropriate conditions. Failure to satisfy these conditions, together with the
presence of outliers, can adversely affect the estimates and results obtained by this method.
Robust methods, which are not affected by the presence of such outliers, have therefore
been proposed. This study aims to examine the behavior of outliers and their effect on
linear regression models fitted by the least squares method. In addition, it studies the
robust methods used in regression analysis, namely: M-estimation, the repeated medians
method, least median of squares, least trimmed squares, S-estimation, and MM-estimation.
Finally, we compared these robust methods with the least squares method, in terms of
robustness properties and efficiency, through a simulation study, and real data were also
used to carry out these comparisons. In this study we found that the robust methods are not
sensitive to the presence of outliers; they either down-weight these values or ignore them.
We also concluded that MM-estimation is the best and most efficient of the robust methods
in most cases.