Half-Day 2: Robust Regression Estimation
Andreas Ruckstuhl, Institut für Datenanalyse und Prozessdesign, Zürcher Hochschule für Angewandte Wissenschaften
WBL Statistik 2018 — Robust Fitting
School of Engineering, IDP Institute of Data Analysis and Process Design, Zurich University of Applied Sciences
Outline:
Half-Day 1: Regression Model and the Outlier Problem; Measuring Robustness; Location M-Estimation; Inference; Regression M-Estimation; Example from Molecular Spectroscopy
Half-Day 2: General Regression M-Estimation; Regression MM-Estimation; Example from Finance; Robust Inference; Robust Estimation with GLM
Half-Day 3: Robust Estimation of the Covariance Matrix; Principal Component Analysis; Linear Discriminant Analysis; Baseline Removal: an application of robust fitting beyond theory
2.3 General Regression M-Estimation
Regression M-estimators fail in the presence of leverage points. To see this, compare the following examples (modified Air Quality data):
[Figure: two scatter plots of log(Daily Mean) vs. log(Annual Mean); the left panel compares the LS and robust fits, the right panel compares the LS and robust (M) fits on the modified data.]
or check the influence function:

\[ IF\langle x, y;\, \hat\theta_M, N\rangle = \psi\!\left\langle \frac{r}{\sigma} \right\rangle M\, x , \]

where the factor M x is unbounded in x.

⇒ Bound the total influence function!
A First Workaround: GM-Estimation
A simple modification of Huber's ψ-function can remedy this:
Either (Mallows)

\[ \sum_{i=1}^{n} \psi_c\!\left\langle \frac{r_i\langle\theta\rangle}{\sigma} \right\rangle x_i^{(k)}\, w\langle d\langle x_i\rangle\rangle = 0, \qquad k = 1, \dots, p, \]

or (Schweppe)

\[ \sum_{i=1}^{n} \psi_c\!\left\langle \frac{r_i\langle\theta\rangle}{\sigma \cdot w\langle d\langle x_i\rangle\rangle} \right\rangle x_i^{(k)}\, w\langle d\langle x_i\rangle\rangle \;=\; \sum_{i=1}^{n} \psi_{c \cdot w\langle d\langle x_i\rangle\rangle}\!\left\langle \frac{r_i\langle\theta\rangle}{\sigma} \right\rangle x_i^{(k)} = 0 , \]
where w⟨·⟩ is a suitable weight function and d⟨x_i⟩ is some measure of the "outlyingness" of x_i.

Examples for w⟨·⟩ and d⟨x_i⟩:
- d⟨x_i⟩ = (x_i − median⟨x_k⟩)/MAD⟨x_k⟩ or the Mahalanobis distance, with w⟨d⟨x_i⟩⟩ = Huber's weight function (cf. notes 2.3.b)
- w⟨x_i⟩ = 1 − H_ii or w⟨x_i⟩ = √(1 − H_ii), where H_ii is the leverage
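These ingredients can be sketched numerically. The following is a minimal Python illustration (the course's own code is R; the names `mad`, `outlyingness`, and `huber_weight` are ours, not from the slides), computing the MAD-based outlyingness d⟨x_i⟩ and Huber's weight for a one-dimensional predictor:

```python
def _median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def mad(v):
    # median absolute deviation, scaled to be consistent at the Gaussian
    m = _median(v)
    return 1.4826 * _median([abs(xi - m) for xi in v])

def outlyingness(v):
    # d<x_i> = (x_i - median<x_k>) / MAD<x_k>, as on the slide
    m, s = _median(v), mad(v)
    return [(xi - m) / s for xi in v]

def huber_weight(d, c=1.345):
    # Huber's weight function: full weight for |d| <= c, down-weighted beyond
    return 1.0 if abs(d) <= c else c / abs(d)

x = [1.0, 1.2, 0.9, 1.1, 1.0, 8.0]   # 8.0 plays the role of a leverage point
w = [huber_weight(d) for d in outlyingness(x)]
```

The bulk of the data keeps full weight 1, while the leverage point receives a weight close to 0, which is exactly the down-weighting the Mallows and Schweppe schemes exploit.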
Example Air Quality (modified)

x.h <- 1 - hat(model.matrix(y ~ x, AQ))
AQ.GMfit <- rlm(y ~ x, data=AQ, weights=x.h, wt.method="case")
[Figure: scatter plot of log(Daily Mean) vs. log(Annual Mean) with LS, robust (M), and robust (GM) fit lines.]
Example Air Quality (Initial Data)
[Figure: scatter plot of log(Daily Mean) vs. log(Annual Mean) with LS, robust (Mallows), and robust (GM tuned) fit lines.]
Example Air Quality: The Map
By using robust estimation methods, we
- are able to run a regression analysis every hour automatically, and
- obtain reliable estimates each time.

Hence robust methods provide a sound basis for the false-colour map! The outliers are identified and analysed separately; the result of this analysis is part of the false-colour map.
Breakdown Point of GM-Estimators
However, the maximum breakdown point of general regression M-estimators cannot exceed 1/p (p is the number of variables)!

Hence it is not possible to detect clusters of leverage points with the projection matrix H in residual analysis. The following example gives some insight into why this happens: look at the "Residuals vs Leverage" plots, where one, two, and three outliers are placed at the outlying leverage point:
[Figure: three "Residuals vs Leverage" plots (standardized residuals vs. leverage, with Cook's distance contours at 0.5 and 1) with one, two, and three outliers placed at the outlying leverage point.]
2.4 Robust Regression MM-Estimation
Regression M-Estimator with Redescending ψ

Computational experiments show: regression M-estimators are robust if distant outliers are rejected completely! Theoretically and computationally, it is more convenient to ignore the influence of distant outliers gradually, for example with a so-called redescending ψ-function like Tukey's biweight function (ψ-function (left) and corresponding weight function (right)):
[Figure: Tukey's biweight ψ-function (left) and the corresponding weight function (right).]
But the equation defining the M-estimator has many solutions, and only one may identify the outliers correctly. The solution depends on the starting value, so good initial values are required!
Robust Estimator With High Breakdown Point
Regression estimators with a high breakdown point are, e.g., the S-estimators. Instead of

\[ \sum_{i=1}^{n} \left( \frac{r_i\langle\theta\rangle}{\sigma} \right)^2 \;\overset{!}{=}\; \min_{\theta} \]

solve

\[ \frac{1}{n-p} \sum_{i=1}^{n} \rho\!\left\langle \frac{y_i - x_i^T \theta}{s} \right\rangle = 0.5 , \]

where s must be as small as possible (i.e., the equation should have a solution in θ).
A high breakdown point implies that ρ⟨·⟩ must be symmetric and bounded. The function ρ⟨·⟩ can be

\[ \rho_{b_o}\langle u\rangle := \begin{cases} 1 - \left(1 - \left(\dfrac{u}{b_o}\right)^2\right)^{3} & \text{if } |u| < b_o \\[4pt] 1 & \text{otherwise} \end{cases} \]

(its derivative is Tukey's bisquare function). To get a breakdown point of 0.5, the tuning constant b_o must be 1.548.
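This ρ-function is short enough to check numerically. A minimal Python sketch (not course code; the name `rho_bisquare` is ours) verifying that it is symmetric, vanishes at 0, and is bounded by 1 outside [−b_o, b_o]:

```python
def rho_bisquare(u, b0=1.548):
    # rho<u> = 1 - (1 - (u/b0)^2)^3 for |u| < b0, and 1 otherwise:
    # symmetric and bounded, as a high breakdown point requires
    if abs(u) >= b0:
        return 1.0
    return 1.0 - (1.0 - (u / b0) ** 2) ** 3

vals = [rho_bisquare(u) for u in (-3.0, -1.0, 0.0, 1.0, 3.0)]
```

Note the boundedness: unlike the least-squares ρ⟨u⟩ = u², a gross outlier contributes at most 1 to the objective.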
[Figure: ρ-function of the least squares estimator (left) and of Tukey's bisquare function (right).]
Pros:
- high breakdown point of (about) 0.5
- computable (at least approximately, for small data sets, i.e., a few thousand observations and about 20–30 predictor variables)

Cons:
- efficiency of just 28.7% at the Gaussian distribution!
- challenging computation, basically done by a random resampling algorithm. Such an approach may yield different solutions when the algorithm is run repeatedly on the same data but with different seeds.
Regression MM-Estimator
We have
- a redescending M-estimator, which is highly efficient but requires suitable starting values;
- an S-estimator, which is highly resistant but very inefficient.

Combining the strengths of both estimators yields the regression MM-estimator (modified M-estimator):
1. An S-estimator with breakdown point ε* = 1/2 (Tukey's bisquare ρ-function with b_o = 1.548) is used as the initial estimator, yielding θ^(o) and s_o.
2. The redescending regression M-estimator is then applied, using Tukey's bisquare ψ-function with b_1 = 4.687 and fixed scale parameter σ = s_o from the initial estimation; the starting value is θ^(o).

The regression MM-estimator has a breakdown point of ε* = 1/2, and an efficiency and an asymptotic distribution like the regression M-estimator.
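The two-stage idea can be sketched on a toy problem. The following Python sketch is NOT lmrob's algorithm: a resistant pairwise-median slope (our stand-in for the S-step) supplies the starting value and a fixed MAD scale, then a redescending M-step with Tukey bisquare weights is iterated; the function name `mm_fit_1d` and all implementation choices are ours.

```python
def mm_fit_1d(x, y, b1=4.687, iters=50):
    # Stage 1 stand-in: resistant initial fit for y = a + b*x
    n = len(x)
    slopes = sorted((y[j] - y[i]) / (x[j] - x[i])
                    for i in range(n) for j in range(i + 1, n) if x[j] != x[i])
    b = slopes[len(slopes) // 2]                       # median of pairwise slopes
    a = sorted(y[i] - b * x[i] for i in range(n))[n // 2]
    absr = sorted(abs(y[i] - a - b * x[i]) for i in range(n))
    s = 1.4826 * absr[n // 2]                          # fixed robust scale (MAD)
    if s == 0.0:
        s = 1.0
    def w(u):                                          # bisquare weight psi(u)/u
        return (1.0 - (u / b1) ** 2) ** 2 if abs(u) < b1 else 0.0
    for _ in range(iters):                             # Stage 2: IRLS, scale held fixed
        ws = [w((y[i] - a - b * x[i]) / s) for i in range(n)]
        sw = sum(ws)
        sx = sum(ws[i] * x[i] for i in range(n))
        sy = sum(ws[i] * y[i] for i in range(n))
        sxx = sum(ws[i] * x[i] ** 2 for i in range(n))
        sxy = sum(ws[i] * x[i] * y[i] for i in range(n))
        b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
        a = (sy - b * sx) / sw
    return a, b

x = list(range(10))
y = [1.0 + 2.0 * xi for xi in x]
y[8], y[9] = -50.0, 100.0            # two gross outliers
a, b = mm_fit_1d(x, y)               # intercept and slope despite the outliers
```

The redescending weights give the two gross outliers weight 0, so the M-step effectively refits the clean observations; this is the behaviour the MM construction is designed to deliver.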
Example from Finance
Return-Based Style Analysis of Funds of Hedge Funds (joint work with P. Meier and his group)

A fund of hedge funds (FoHF) is a fund that invests in a portfolio of different hedge funds to diversify the risks associated with a single hedge fund. A hedge fund is an investment instrument that undertakes a wider range of investment and trading activities than traditional long-only investment funds.

One of the difficulties in risk monitoring of FoHF is their limited transparency:
- Many FoHF disclose only partial information on their underlying portfolio.
- The underlying investment strategy (the style of the FoHF), which is the crucial characterisation of a FoHF, is self-declared.

A return-based style analysis searches for the combination of indices of hedge-fund sub-styles that would most closely replicate the actual performance of the FoHF over a specified time period.
Such a style analysis is basically done by fitting a (in finance) so-called multifactor model:

\[ R_t = \alpha + \sum_{k=1}^{p} \beta_k\, I_{k,t} + E_t , \]

where
- R_t = return on the FoHF at time t
- α = the excess return (a constant) of the FoHF
- I_{k,t} = the index return of sub-style k (= factor) at time t
- β_k = the change in the return on the FoHF per unit change in factor k
- p = the number of sub-indices used
- E_t = residual (error) which cannot be explained by the factors
Residuals vs. Time

[Figure: residuals vs. time (1998–2006) with ±2.5 sd bands; LS fit using lm(...) (left) and robust MM-fit using lmrob(...) (right).]
The robust MM-fit clearly identifies two different investment periods: one before April 2000 and one afterwards.
Residual Analysis with MM-Fit: plot(FoHF.rlm)

[Figure: diagnostic plots of the MM-fit — standardized residuals vs. robust distances, normal Q–Q plot of the residuals, response vs. fitted values, residuals vs. fitted values, and square root of absolute residuals vs. fitted values.]
2.5 Robust Inference and Variable Selection

Outliers may also crucially influence the result of a classical test. It might happen that the null hypothesis is rejected because an interfering alternative H_I (outliers) is present; that is, the rejection of the null hypothesis is justified, but accepting the actual alternative H_A is not.

To understand such situations better, one can explore the effect of contamination on the level and power of hypothesis tests. Heritier and Ronchetti (1994) showed that the effects of contamination on both the level and the power of a test are inherited from the underlying estimator (= test statistic). That means that the test is robust if its test statistic is based on a robust estimator.
Asymptotic Distribution of the MM-Estimator
The regression MM-estimator is asymptotically Gaussian distributed with
- mean (expectation) θ, and
- covariance matrix σ²τ C⁻¹, where C = (1/n) ∑_i x_i x_i^T.

The covariance matrix of θ̂ is estimated by

\[ \hat V = \frac{s_o^2}{n}\, \hat\tau\, \hat C^{-1}, \quad \text{where} \quad \hat C = \frac{\frac{1}{n}\sum_i w_i\, x_i x_i^T}{\frac{1}{n}\sum_i w_i}, \qquad \hat\tau = \frac{\frac{1}{n}\sum_{i=1}^{n} \left(\psi_{b_1}\langle r_i\rangle\right)^2}{\left(\frac{1}{n}\sum_{i=1}^{n} \psi'\langle r_i\rangle\right)^2}, \]

with r_i := (y_i − x_i^T θ^(o))/s_o and w_i := ψ_{b_1}⟨r_i⟩/r_i. Note that θ^(o) and s_o come from the initial estimation.
A Further Modification to the MM-Estimator

- The estimated covariance matrix V̂ depends on quantities (θ^(o) and s_o) from the initial estimator.
- The initial S-estimator, however, is known to be very inefficient.

Koller and Stahel (2011, 2014) investigated the effects of this construction on the efficiency of the estimated confidence intervals, and, as a consequence, came up with an additional modification: extend the current regression MM-estimator by two additional steps:
- replace s_o by a more efficient scale estimator;
- then apply another M-estimator, but with a more slowly redescending ψ-function.

They called this estimation procedure the regression SMDM-estimator; it is implemented in lmrob(..., setting="KS2014").
Example from Finance with another target FoHF
Return-Based Style Analysis of Funds of Hedge Funds — RBSA2

lm(FoHF ~ ., data=FoHF2)  vs.  lmrob(FoHF ~ ., data=FoHF2, setting="KS2011")

              ------- lm -------        ------ lmrob ------
       Estimate     se  Pr(>|t|)   Estimate     se  Pr(>|t|)
(I)     -0.0019 0.0017    0.2610    -0.0030 0.0014    0.0381
RV       0.0062 0.3306    0.9850     0.3194 0.2803    0.2575
CA      -0.0926 0.1658    0.5780    -0.0671 0.1383    0.6288
FIA      0.0757 0.1472    0.6083    -0.0204 0.1279    0.8733
EMN      0.1970 0.1558    0.2094     0.2721 0.1328    0.0434
ED      -0.3010 0.1614    0.0655    -0.4763 0.1389    0.0009
EDD      0.0687 0.1301    0.5986     0.1019 0.1112    0.3621
EDRA     0.0735 0.1882    0.6971     0.0903 0.1583    0.5698
LSE      0.4407 0.1521    0.0047     0.5813 0.1295    0.0000
GM       0.1723 0.0822    0.0390    -0.0159 0.0747    0.8322
EM       0.1527 0.0667    0.0245     0.1968 0.0562    0.0007
SS       0.0282 0.0414    0.4973     0.0749 0.0356    0.0382

Residual standard error: 0.009315 (lm) and 0.007723 (lmrob)

The 95% confidence interval of β_SS is
0.028 ± 1.98 · 0.041 = [−0.053, 0.109] for lm, where 1.98 is the 0.975 quantile of t_108, and
0.075 ± 1.96 · 0.036 = [0.004, 0.146] for lmrob, where 1.96 is the 0.975 quantile of N⟨0, 1⟩.
Example from Finance: RBSA2 (cont.)
A fund of hedge funds may be classified by the style of its target funds into focussed directional, focussed non-directional, or diversified. If the FoHF under consideration is a focussed non-directional FoHF, then it should be invested in LSE, GM, EM, SS, and hence the other parameters should be zero.

Goal: We want to test the hypothesis that q < p of the p elements of the parameter vector θ are zero.

First, let's introduce some notation to express the results more easily. There is no real loss of generality if we suppose that the model is parameterized so that the null hypothesis can be expressed as H_0 : θ_1 = 0, where θ = (θ_1^T, θ_2^T)^T. Further, let V_11 be the quadratic submatrix containing the first q rows and columns of V.
The so-called Wald-type test statistic can now be expressed as

\[ W = \hat\theta_1^T \left(\hat V_{11}\right)^{-1} \hat\theta_1 . \]

It can be shown that this test statistic is asymptotically χ²-distributed with q degrees of freedom.

This test statistic also provides the basis for confidence intervals of a single parameter θ_k:

\[ \hat\theta_k \pm q^{N}_{1-\alpha/2} \cdot \sqrt{\hat V_{kk}} , \]

where q^N_{1−α/2} is the (1 − α/2) quantile of the standard Gaussian distribution.

Comments: Since we have estimated the covariance matrix Cov⟨θ̂⟩ by V̂, it may seem obvious from classical regression inference that replacing χ²_q by F_{q, n−p} is a reasonable adjustment of the distribution of the test statistic to account for estimating the variance σ². However, doing so has no formal justification, and it is better simply to avoid very small sample sizes (then (1/q) · χ²_q ≈ F_{q, n−p}), because all results are asymptotic in nature anyway.
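The confidence-interval formula is simple arithmetic; the following hedged Python sketch (the helper name `wald_ci` is ours) reproduces the robust β_SS interval from the RBSA2 slide:

```python
def wald_ci(theta_k, se_k, z=1.96):
    # theta_k +/- z * sqrt(V_kk); se_k stands for sqrt(V_kk),
    # and z = 1.96 is the 0.975 quantile of the standard Gaussian
    return theta_k - z * se_k, theta_k + z * se_k

# robust beta_SS estimate and standard error from the RBSA2 table
lo, hi = wald_ci(0.075, 0.036)
```

Rounded to three decimals, this gives the interval [0.004, 0.146] reported on the slide; since it excludes 0, the robust fit flags SS as significant while the LS interval [−0.053, 0.109] does not.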
For an MM-estimator, we may define a robust deviance by

\[ D\langle y, \hat\theta_{MM}\rangle = 2 \cdot s_o^2 \cdot \sum_{i=1}^{n} \rho\!\left\langle \frac{y_i - x_i^T \hat\theta_{MM}}{s_o} \right\rangle . \]

The robust generalisation of the F-test

\[ \frac{\left(SS_{reduced} - SS_{full}\right)/q}{SS_{full}/(n-p)} \;=\; \frac{\left(SS_{reduced} - SS_{full}\right)/q}{\hat\sigma^2} \]

is to replace the sums of squares by the robust deviance (similar to generalised linear models), so that we obtain the test statistic

\[ \Delta^* = \tau^* \cdot \frac{D\langle y, \hat\theta^{\,r}_{MM}\rangle - D\langle y, \hat\theta^{\,f}_{MM}\rangle}{s_o^2} \quad \text{with} \quad \tau^* = \left( \frac{1}{n}\sum_{i=1}^{n} \psi'_{b_1}\langle r_i\rangle \right) \Big/ \left( \frac{1}{n}\sum_{i=1}^{n} \left(\psi_{b_1}\langle r_i\rangle\right)^2 \right). \]

Then Δ* is asymptotically χ²_q-distributed under the null hypothesis.
Example from Finance — RBSA2

# Least squares estimator:
> FoHF2.lm1 <- lm(FoHF ~ ., data=FoHF2)
> FoHF2.lm2 <- lm(FoHF ~ LSE + GM + EM + SS, data=FoHF2)
> anova(FoHF2.lm2, FoHF2.lm1)
Analysis of Variance Table
Model 1: FoHF ~ LSE + GM + EM + SS
Model 2: FoHF ~ RV + CA + FIA + EMN + ED + EDD + EDRA + LSE + GM + EM + SS
  Res.Df       RSS Df  Sum of Sq      F Pr(>F)
1     96 0.0085024
2     89 0.0077231  7 0.00077937 1.2831 0.2679

# Robust with SMDM-estimator
> FoHF.rlm <- lmrob(FoHF ~ ., data=FoHF, setting="KS2011")
> anova(FoHF.rlm, FoHF ~ LSE + GM + EM + SS, test="Wald")
Robust Wald Test Table
Model 1: FoHF ~ RV + CA + FIA + EMN + ED + EDD + EDRA + LSE + GM + EM + SS
Model 2: FoHF ~ LSE + GM + EM + SS
Largest model fitted by lmrob(), i.e. SMDM
  pseudoDf Test.Stat Df Pr(>chisq)
1       89
2       96    24.956  7  0.0007727 ***
Example from Finance — RBSA2 (cont.)

# Robust with SMDM-estimator and Wald test
> FoHF.rlm <- lmrob(FoHF ~ ., data=FoHF, setting="KS2011")
> anova(FoHF.rlm, FoHF ~ LSE + GM + EM + SS, test="Wald")
Robust Wald Test Table
Model 1: FoHF ~ RV + CA + FIA + EMN + ED + EDD + EDRA + LSE + GM + EM + SS
Model 2: FoHF ~ LSE + GM + EM + SS
Largest model fitted by lmrob(), i.e. SMDM
  pseudoDf Test.Stat Df Pr(>chisq)
1       89
2       96    24.956  7  0.0007727 ***

# Robust with SMDM-estimator and Deviance test
> anova(FoHF.rlm, FoHF ~ LSE + GM + EM + SS, test="Deviance")
Robust Deviance Table
Model 1: FoHF ~ RV + CA + FIA + EMN + ED + EDD + EDRA + LSE + GM + EM + SS
Model 2: FoHF ~ LSE + GM + EM + SS
Largest model fitted by lmrob(), i.e. SMDM
  pseudoDf Test.Stat Df Pr(>chisq)
1       89
2       96    25.089  7  0.0007318 ***
Example from Finance — RBSA2 (i.e., Slide 21 again)

[The coefficient table comparing lm(FoHF ~ ., data=FoHF) with lmrob(FoHF ~ ., data=FoHF, setting="KS2011") is repeated here; see the earlier RBSA2 slide. Residual standard errors: 0.009315 (lm) and 0.007723 (lmrob).]
Example from Finance — RBSA2: Residuals vs. Time

[Figure: residuals vs. time (2000–2008) with ±2.5 sd bands; LS fit (left) and robust MM-fit (right).]
The robust MM-fit clearly identifies two outliers.
Example from Finance — RBSA2: Residual Analysis with LS-Fit

[Figure: standard lm diagnostic plots — Residuals vs Fitted, Normal Q–Q, Scale–Location, and Residuals vs Leverage (with Cook's distance contours at 0.5 and 1); the observations 2000-03, 2000-04, 2000-07, and 2001-01 are flagged.]
Example from Finance — RBSA2: Residual Analysis with Robust Fit

[Figure: robust diagnostic plots — standardized residuals vs. robust distances, normal Q–Q plot of the residuals, response vs. fitted values, residuals vs. fitted values, and square root of absolute residuals vs. fitted values.]
3 Generalised Linear Models
3.1 Unified Model Formulation
Generalized linear models were formulated by John Nelder and RobertWedderburn as a way of unifying various statistical regression models,including linear regression, logistic regression, Poisson regression and Gammaregression.
The generalization is based on a reformulation of the linear regression model. Instead of

\[ Y_i = \theta_0 + \theta_1 x_i^{(1)} + \dots + \theta_p x_i^{(p)} + E_i, \quad i = 1, \dots, n, \quad \text{with } E_i \text{ ind.} \sim \mathcal{N}\langle 0, \sigma^2\rangle , \]

use

\[ Y_i \text{ ind.} \sim \mathcal{N}\langle \mu_i, \sigma^2\rangle \quad \text{with} \quad \mu_i = \theta_0 + \theta_1 x_i^{(1)} + \dots + \theta_p x_i^{(p)} . \]

The expectation μ_i may be linked to the linear predictor η_i = θ_0 + θ_1 x_i^(1) + … + θ_p x_i^(p) by another function than the identity function. In general, we assume

\[ g\langle \mu_i \rangle = \eta_i . \]
Discrete Generalised Linear Models
The two discrete generalised linear models are the binary/binomial regression model and the Poisson regression model. Let Y_i, i = 1, …, n, be the response and η_i = x_i^T θ = ∑_{j=1}^p x_i^(j) · θ_j its linear predictor. Then:

Binary / Binomial Regression: Y_i indep. ∼ B⟨π_i, m_i⟩ with
E⟨Y_i/m_i | x_i⟩ = π_i,  Var⟨Y_i/m_i | x_i⟩ = π_i · (1 − π_i)/m_i.

Poisson Regression: Y_i indep. ∼ P⟨λ_i⟩ with
E⟨Y_i | x_i⟩ = λ_i,  Var⟨Y_i | x_i⟩ = λ_i.

The mean responses π_i and λ_i, respectively, are related to the linear predictor η_i by the link function g⟨·⟩: g⟨π_i⟩ = η_i. Popular choices for the link include
- g⟨π_i⟩ = log⟨π_i/(1 − π_i)⟩ — logit model
- g⟨π_i⟩ = Φ⁻¹⟨π_i⟩ — probit model
- g⟨π_i⟩ = log⟨−log⟨1 − π_i⟩⟩ — complementary log-log model
- g⟨λ_i⟩ = log⟨λ_i⟩ — log-linear model
- g⟨λ_i⟩ = λ_i — identity
- g⟨λ_i⟩ = √λ_i — square root
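As a quick numerical illustration of a link function (a Python sketch, not course code; the names `logit` and `inv_logit` are ours), the logit link maps a probability to the real line and its inverse recovers the probability from the linear predictor:

```python
import math

def logit(p):
    # g<pi> = log(pi / (1 - pi)), mapping (0, 1) onto the real line
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    # inverse link: recovers the probability from the linear predictor eta
    return 1.0 / (1.0 + math.exp(-eta))

p = inv_logit(logit(0.8))   # round trip: back to 0.8
```

The same pattern holds for every link listed above: g sends the mean onto the scale of the linear predictor, and its inverse maps the fit back to the mean scale.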
Gamma Regression
Let Y_i, i = 1, …, n, be the response and η_i = x_i^T θ = ∑_{j=1}^p x_i^(j) · θ_j its linear predictor. Then Y_i indep. ∼ Gamma⟨α_i, β_i⟩ with
E⟨Y_i | x_i⟩ = α_i/β_i,  Var⟨Y_i | x_i⟩ = α_i/β_i².
Common link functions are
- g⟨µ_i⟩ = 1/µ_i — inverse
- g⟨µ_i⟩ = log⟨µ_i⟩
- g⟨µ_i⟩ = µ_i — identity.

In GLM it is assumed that
- the response Y_i is independently distributed according to a distribution from the exponential family with expectation E⟨Y_i⟩ = µ_i;
- the expectation µ_i is linked by a function g to the linear predictor x_i^T β: g⟨µ_i⟩ = x_i^T β;
- the variance of the response depends on E⟨Y_i⟩: Var⟨Y_i⟩ = φ V⟨µ_i⟩. The so-called variance function V⟨µ_i⟩ is determined by the distribution; φ is the dispersion parameter.
Estimating Equation

The estimating equations of GLM can be written in a unified form,

\[ 0 = \sum_{i=1}^{n} \frac{y_i - \mu_i}{V\langle\mu_i\rangle}\, \mu_i'\, x_i \;=\; \sum_{i=1}^{n} \frac{y_i - \mu_i}{\sqrt{V\langle\mu_i\rangle}} \cdot \frac{\mu_i'}{\sqrt{V\langle\mu_i\rangle}}\, x_i , \]

where µ_i' = ∂µ⟨η_i⟩/∂η_i is the derivative of the inverse link function. The quantities (y_i − µ_i)/√V⟨µ_i⟩ are called Pearson residuals. If there are no leverage points, their variance is approximately constant:

\[ \operatorname{Var}\!\left\langle \frac{y_i - \mu_i}{\sqrt{V\langle\mu_i\rangle}} \right\rangle \approx \phi \sqrt{1 - H_{ii}} . \]

In R, use glm(Y ~ ..., family=..., data=...) to fit a GLM to data.
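Pearson residuals are easy to compute by hand for a concrete family. A minimal Python sketch for the Poisson case, where the variance function is V⟨µ⟩ = µ (the function name and the toy numbers are ours, not from the slides):

```python
import math

def pearson_residuals_poisson(y, mu):
    # (y_i - mu_i) / sqrt(V<mu_i>) with variance function V<mu> = mu for Poisson
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

# toy counts y and fitted means mu
r = pearson_residuals_poisson([2, 0, 5, 1], [1.8, 0.5, 4.0, 1.2])
```

For another family, only the variance function changes, e.g. V⟨µ⟩ = µ(1 − µ)/m for binomial proportions.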
In GLM, we face similar problems with the standard estimator as in linear regression when contaminated data are present.

If only two observations are misclassified . . .
[Figure: binary data y vs. x with a fitted logistic curve, illustrating the effect of two misclassified observations.]
Mallows-Type (Robust) Quasi-Likelihood Estimator
Cantoni and Ronchetti (2001) suggested a Mallows-type robustification of the estimating equation of the GLM-estimator:

\[ 0 = \sum_{i=1}^{n} \left( \psi_c\langle r_i\rangle\, \frac{\mu_i'}{\sqrt{V\langle\mu_i\rangle}}\, x_i\, w\langle x_i\rangle \;-\; fc_c\langle\theta\rangle \right), \]

where ψ_c⟨·⟩ is the Huber function and the vector constant fc_c⟨θ⟩ ensures the Fisher consistency of the estimator.

The "weights" w⟨x_i⟩ can be used to down-weight leverage points:
- If w⟨x_i⟩ = 1 for all observations i, then the influence of position is not bounded (cf. regression M-estimator).
- To bound the total influence, one may, e.g., use w⟨x_i⟩ = √(1 − H_ii).
Since such an estimator will not yet have a high breakdown point, it is better to use the inverse of the Mahalanobis distance of x_i, based on a robust covariance estimator with high breakdown point (cf. later).
Inference and Implementation

The advantage of this approach is that standard inference is available, based both on the asymptotic Gaussian distribution of the estimator and on robust quasi-deviances. I do not dare to present the formulas here, because they look frightening. But they are implemented in R, and you can run the inference procedures just as with glm(...) and anova(...).

The theory and the implementation cover Poisson, binomial, Gamma, and Gaussian (i.e., linear regression GM-estimator) responses.

Fitting in R is done by
glmrob(Y ~ ..., family=..., data=..., weights.on.x=c("none", "hat", "robCov", "covMcd"))

Testing in R is done by
anova(Fit1, Fit2, test=c("Wald", "QD", "QDapprox"))
Take Home Message Half-Day 2
- Least-squares estimation is unreliable if contaminated observations are present.
- Better use a regression MM- (or SMDM-) estimator.

In the R package 'robustbase' you find:
- lmrob(...) — regression MM-estimator
- lmrob(..., setting="KS2011") — regression SMDM-estimator
- anova(...) — comparing models using robust procedures
- plot("lmrob object") — graphics for a residual analysis

Remark: lmrob(...) is based on an improved algorithm (since robustbase 0.9-2) and can handle both numeric and factor variables as explanatory variables.

Robust GM-estimators are also available for generalised linear models (GLMs). In the R package 'robustbase' you find:
- glmrob(...) — (Mallows-type) regression GM-estimators
- anova(...) — comparing models using robust procedures