Mahalanobis distances on factor structured data
Transcript of Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Mahalanobis distances on factor structured data
Deliang Dai1
1Department of Economics and StatisticsLinnaeus University, Vaxjo
August 24, 2014
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Outline
1 IntroductionBackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
2 Definitions and assumptions
3 Distributional properties
4 Contaminated data
5 Empirical application
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
Outline
1 IntroductionBackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
2 Definitions and assumptions
3 Distributional properties
4 Contaminated data
5 Empirical application
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
Background of Mahalanobis distance
Mahalanobis distance (MD) was proposed by Mahalanobis (1936).It is used for measuring the distance between two random vectors orone random vector and its mean vector. It is widely used in manyareas such as cluster analysis; discriminant analysis (Fisher; 1940);assessing multivariate normality (Mardia; 1974; Mardia et al.; 1980;Mitchell and Krzanowski; 1985; Holgersson and Shukur; 2001); hy-pothesis testing and outlier detecting (Wilks; 1963; Mardia et al.;1980).
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
Definition of Mahalanobis distance
The definition of Mahalanobis distance is given as follows,
Definition 1
Let Xi : p × 1 be a random vector such that E [Xi ] = µp×1 andE[(Xi − µ) (Xi − µ)′
]= Σp×p for i = 1, ..., n. Then we make the
following definition:
Dii := (Xi − µ)′Σ−1 (Xi − µ) . (1)
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
Advantages of MD
The Mahalanobis distance which includes the covariance matrix islinear transformation invariant.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
Background of factor model
Factor model is widely applied in many researches on the unobservedmeasurements. For example, the arbitrage pricing theory (Ross;1976) based on the assumption that asset returns follow a factorstructure. In consumer theory, factor analysis shows that utilitymaximizing could be described by at most three factors (Gormanet al.; 1981; Lewbel; 1991). In desegregate business cycle analysis,the factor model is used to identify the common and specific shocksthat drive a country’s economy (Gregory and Head; 1999; Bai; 2003).
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
BackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
The combination of the factor model and Mahalanobis distance of-fers an option for both approximation of the structure of covarianceand measuring the distance between observations at the same time.The balance between simplicity of structure and information offersa good method for the dependent structured data.In this presentation, we only concern the detection of outlier.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Outline
1 IntroductionBackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
2 Definitions and assumptions
3 Distributional properties
4 Contaminated data
5 Empirical application
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Definition of factor model
Definition 2
Let x ∼ N(µ,Σ), be the p × 1 random observations with mean µand covariance matrix Σ. The factor model could be representedas
x− µ(p×1) = L(p×m)F(m×1) + ε(p×1),
where x is the observed date (p > m), µ the corresponding ex-pectation , L the factor loadings and F is a m × 1 vector of factorscores.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Assumptions of factor model
Beside the definition of factor model above, we need some moreassumptions as follows.
Definition 3
Let E (F) = 0, Cov(F) = Im×m and E (ε) = 0. As defined above,assume ε ∼ N(0p×1,Ψ) , where Ψ is a diagonal matrix andF ∼ N(0m×1, I) are distributed independently (Cov(ε,F) = 0).
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Definitions of MDs on factor model
Definition 4
The MD on F part is
D (Fi ) = Fi′cov(Fi )
−1Fi = Fi′Fi ,
while for the ε part, the MD is
D (εi ) = εi′Ψ−1εi .
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Definitions of MDs on factor model
Definition 5
The MDs for the estimated factor scores and error are:
D(
Fi
)= F′icov(Fi )
−1Fi and D (εi ) = ε′iΨ−1εi .
where Fi = L′S−1
(xi − x) is the estimated factor scores from theregression method (Johnson and Wichern; 2002).
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Outline
1 IntroductionBackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
2 Definitions and assumptions
3 Distributional properties
4 Contaminated data
5 Empirical application
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Joint distribution of seperated MDs
Proposition 1
The joint distribution of D (Fi ) and D (εi ) is,
Z =
[D (Fi )D (εi )
]∼
[χ2
(m)
χ2(p)
], cov (Z ) =
[2m 00 2p
].
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Joint distribution of seperated MDs
Proposition 2
The asymptotic distributions of the MD under the factor model withan unknown factor structure is
Y =
[D(
Fi
)D (εi )
]`→[χ2
(m)
χ2(p)
], cov
(Y)
=
[2m 00 2p
].
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Outline
1 IntroductionBackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
2 Definitions and assumptions
3 Distributional properties
4 Contaminated data
5 Empirical application
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Definitions of contaminated data
Definition 6
Assuming two kinds of outliers are interested: (i) an additive outlierin εi and (ii) an additive outlier in Fi : m × 1. The correspondingdefinitions are given as follows (assuming known mean):
(i) Xi = LFi + εi + δi , δi = 0,∀i 6=k , δi : p × 1 and δk′δk = 1.
(ii) Xi = L (Fi + δi ) + εi , δi = 0,∀i 6=k , δk : m × 1.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Contaminated data
Let dδj , j = 1, 2 be the MDs for the two kinds of outliers, andlet S = n−1
∑ni=1 (LFi + εi ) (LFi + εi )
′ be the sample covariancematrix with known mean. To simplify calculations, let δk⊥Fi andδk⊥εi by assumption, so that they are not only uncorrelated but alsoorthogonal. The following expressions separate the contaminatedand un-contaminated parts of the MD:
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Contaminated data
Proposition 3
For case (i),
dδ1 =
dii i 6= k
dkk −n(1− xk
′n−1S−1δk)2
1 + δ′kn−1S−1δk+ n i = k
For case (ii),
dδ2 =
dii i 6= k
dkk −n(1− xk
′n−1S−1Lδk)2
1 + δk′L′n−1S−1Lδk
+ n. i = k
where dii = nxi′W−1xi = xi
′S−1xi .
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Outline
1 IntroductionBackgroundsAdvantages of Mahalanobis DistancesFactor modelMD on factor structured data
2 Definitions and assumptions
3 Distributional properties
4 Contaminated data
5 Empirical application
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Dataset
We implicate the results on a data setting which is consisted by themonthly closing stock prices from ten companies.The companies are Exxon Mobil Corp., Chevron Corp., Apple Inc.,General Electric Co., Wells Fargo Co., Procter Gamble Co., Mi-crosoft Corp., Johnson & Johnson, Google Inc., Johnson ControlInc. They follow by the order of variables.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Dataset
We modified the closing prices into the monthly return rate (r),
r =yt − yt−1
yt−1,
where yt stands for the closing price in the tth month. The wholeanalysis based on the returns on the stock prices.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Boxplot
Figure: The boxplot of the original MDs.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Figure: The boxplot of error terms.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Figure: The boxplot of factor scores.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Bibliography I
Bai, J. (2003). Inferential theory for factor models of largedimensions, Econometrica 71(1): 135–171.
Fisher, R. A. (1940). The precision of discriminant functions,Annals of Human Genetics 10(1): 422–429.
Gorman, N., McConnell, I. and Lachmann, P. (1981).Characterisation of the third component of canine and felinecomplement, Veterinary immunology and immunopathology2(4): 309–320.
Gregory, A. W. and Head, A. C. (1999). Common andcountry-specific fluctuations in productivity, investment, and thecurrent account, Journal of Monetary Economics44(3): 423–451.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Bibliography II
Holgersson, H. and Shukur, G. (2001). Some aspects ofnon-normality tests in systems of regression equations,Communications in Statistics-Simulation and Computation30(2): 291–310.
Johnson, R. A. and Wichern, D. W. (2002). Applied multivariatestatistical analysis, Vol. 5, Prentice hall Upper Saddle River, NJ.
Lewbel, A. (1991). The rank of demand systems: theory andnonparametric estimation, Econometrica: Journal of theEconometric Society pp. 711–730.
Mahalanobis, P. (1936). On the generalized distance in statistics,Proceedings of the National Institute of Sciences of India,Vol. 2, New Delhi, pp. 49–55.
Mahalanobis distances on factor structured data
IntroductionDefinitions and assumptions
Distributional propertiesContaminated data
Empirical applicationReferences
Bibliography III
Mardia, K. (1974). Applications of some measures of multivariateskewness and kurtosis in testing normality and robustnessstudies, Sankhya: The Indian Journal of Statistics, Series Bpp. 115–128.
Mardia, K., Kent, J. and Bibby, J. (1980). Multivariate analysis.
Mitchell, A. and Krzanowski, W. (1985). The mahalanobisdistance and elliptic distributions, Biometrika 72(2): 464–467.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing,Journal of economic theory 13(3): 341–360.
Wilks, S. (1963). Multivariate statistical outliers, Sankhya: TheIndian Journal of Statistics, Series A pp. 407–426.
Mahalanobis distances on factor structured data