Mahalanobis distances on factor structured data


Deliang Dai

Department of Economics and Statistics, Linnaeus University, Växjö

August 24, 2014





Outline

1 Introduction
    Background
    Advantages of Mahalanobis distances
    Factor model
    MD on factor structured data

2 Definitions and assumptions

3 Distributional properties

4 Contaminated data

5 Empirical application












Background of Mahalanobis distance

The Mahalanobis distance (MD) was proposed by Mahalanobis (1936). It measures the distance between two random vectors, or between a random vector and its mean vector. It is widely used in areas such as cluster analysis; discriminant analysis (Fisher, 1940); assessing multivariate normality (Mardia, 1974; Mardia et al., 1980; Mitchell and Krzanowski, 1985; Holgersson and Shukur, 2001); and hypothesis testing and outlier detection (Wilks, 1963; Mardia et al., 1980).






Definition of Mahalanobis distance

The Mahalanobis distance is defined as follows.

Definition 1

Let $X_i$ ($p \times 1$) be a random vector such that $E[X_i] = \mu_{p \times 1}$ and $E[(X_i - \mu)(X_i - \mu)'] = \Sigma_{p \times p}$ for $i = 1, \dots, n$. Then we make the following definition:

$$D_{ii} := (X_i - \mu)' \Sigma^{-1} (X_i - \mu). \qquad (1)$$
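As a minimal sketch of equation (1), the following NumPy snippet evaluates $D_{ii}$ for each row of a data matrix when $\mu$ and $\Sigma$ are known. The function name `mahalanobis_sq` and the numerical values are illustrative assumptions, not part of the presentation.

```python
import numpy as np

def mahalanobis_sq(X, mu, Sigma):
    """Return D_ii = (X_i - mu)' Sigma^{-1} (X_i - mu) for each row X_i of X."""
    centered = np.asarray(X, dtype=float) - np.asarray(mu, dtype=float)
    # Solve Sigma z = (X_i - mu) instead of forming the inverse explicitly.
    solved = np.linalg.solve(Sigma, centered.T).T
    return np.einsum("ij,ij->i", centered, solved)

# Illustrative values only (not taken from the presentation).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])
X = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
D = mahalanobis_sq(X, mu, Sigma)
```

Solving the linear system is usually preferred to inverting $\Sigma$ because it is numerically more stable; the result is the same quadratic form.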






Advantages of MD

Because the Mahalanobis distance incorporates the covariance matrix, it is invariant under nonsingular linear transformations of the data.
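The invariance can be checked in a few lines. The derivation below is a sketch that assumes a nonsingular $p \times p$ matrix $A$ and a $p \times 1$ vector $b$ (both introduced here for illustration) and uses the notation of Definition 1.

```latex
\begin{align*}
Y_i &= A X_i + b, \qquad E[Y_i] = A\mu + b, \qquad \operatorname{Cov}(Y_i) = A \Sigma A', \\
D_{ii}(Y) &= (Y_i - A\mu - b)' (A \Sigma A')^{-1} (Y_i - A\mu - b) \\
          &= (X_i - \mu)' A' (A')^{-1} \Sigma^{-1} A^{-1} A (X_i - \mu) \\
          &= (X_i - \mu)' \Sigma^{-1} (X_i - \mu) = D_{ii}(X).
\end{align*}
```

In particular, rescaling the variables (a diagonal $A$) leaves the distance unchanged, which is not true of the Euclidean distance.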






Background of factor model

Factor models are widely applied in research involving unobserved measurements. For example, the arbitrage pricing theory (Ross, 1976) is based on the assumption that asset returns follow a factor structure. In consumer theory, factor analysis shows that utility maximization can be described by at most three factors (Gorman et al., 1981; Lewbel, 1991). In disaggregate business cycle analysis, the factor model is used to identify the common and specific shocks that drive a country's economy (Gregory and Head, 1999; Bai, 2003).






Combining the factor model with the Mahalanobis distance makes it possible to approximate the covariance structure and to measure the distance between observations at the same time. The balance between simplicity of structure and retained information makes this a good method for data with a dependence structure.

In this presentation, we are concerned only with outlier detection.











Definition of factor model

Definition 2

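Let $x \sim N(\mu, \Sigma)$ be a $p \times 1$ random observation with mean $\mu$ and covariance matrix $\Sigma$. The factor model can be represented as

$$\underset{(p \times 1)}{x - \mu} = \underset{(p \times m)}{L}\,\underset{(m \times 1)}{F} + \underset{(p \times 1)}{\varepsilon},$$

where $x$ is the observed data ($p > m$), $\mu$ the corresponding expectation, $L$ the matrix of factor loadings, $F$ an $m \times 1$ vector of factor scores, and $\varepsilon$ the $p \times 1$ error term.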





Assumptions of factor model

Besides the definition of the factor model above, we need some further assumptions.

Definition 3

Let $E(F) = 0$, $\mathrm{Cov}(F) = I_{m \times m}$ and $E(\varepsilon) = 0$. With the model defined as above, assume that $\varepsilon \sim N(0_{p \times 1}, \Psi)$, where $\Psi$ is a diagonal matrix, and $F \sim N(0_{m \times 1}, I)$ are independently distributed, so that $\mathrm{Cov}(\varepsilon, F) = 0$.
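To make the assumptions concrete, here is a small simulation sketch. Under Definitions 2 and 3 the covariance of $x$ is $LL' + \Psi$, and the snippet compares that implied matrix with the sample covariance of simulated data. The loadings, the diagonal of $\Psi$ and the sample size are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100_000, 4, 2                       # illustrative sizes, p > m

# Hypothetical loadings and error variances (not from the presentation).
L = np.array([[0.9, 0.0],
              [0.8, 0.1],
              [0.1, 0.7],
              [0.0, 0.6]])
psi = np.array([0.5, 1.0, 1.5, 2.0])          # diagonal of Psi
mu = np.zeros(p)

F = rng.normal(size=(n, m))                   # F ~ N(0, I_m)
eps = rng.normal(size=(n, p)) * np.sqrt(psi)  # eps ~ N(0, Psi), independent of F
X = mu + F @ L.T + eps                        # x - mu = L F + eps, one row per observation

Sigma_implied = L @ L.T + np.diag(psi)        # covariance implied by the assumptions
Sigma_sample = np.cov(X, rowvar=False)
max_gap = np.abs(Sigma_sample - Sigma_implied).max()
```

With a large simulated sample, `max_gap` should be small, reflecting that the factor structure reduces the covariance to $pm + p$ parameters.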





Definitions of MDs on factor model

Definition 4

The MD on the $F$ part is

$$D(F_i) = F_i' \,\mathrm{cov}(F_i)^{-1} F_i = F_i' F_i,$$

while for the $\varepsilon$ part, the MD is

$$D(\varepsilon_i) = \varepsilon_i' \Psi^{-1} \varepsilon_i.$$





Definitions of MDs on factor model

Definition 5

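The MDs for the estimated factor scores and errors are

$$D(\hat{F}_i) = \hat{F}_i' \,\mathrm{cov}(\hat{F}_i)^{-1} \hat{F}_i \quad \text{and} \quad D(\hat{\varepsilon}_i) = \hat{\varepsilon}_i' \hat{\Psi}^{-1} \hat{\varepsilon}_i,$$

where $\hat{F}_i = \hat{L}' S^{-1} (x_i - \bar{x})$ are the estimated factor scores from the regression method (Johnson and Wichern, 2002).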
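The regression-method scores can be sketched as follows. The slides do not say how the loadings are estimated, so this sketch assumes principal component loadings computed from $S$, takes the residual as $x_i - \bar{x} - \hat{L}\hat{F}_i$, and uses the diagonal of $S - \hat{L}\hat{L}'$ for $\hat{\Psi}$. These choices, the function name `estimated_mds` and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def estimated_mds(X, m):
    """MDs of regression-method factor scores and residuals (PC loadings assumed)."""
    X = np.asarray(X, dtype=float)
    xc = X - X.mean(axis=0)                        # x_i - x_bar
    S = np.cov(X, rowvar=False)                    # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(S)             # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:m]             # indices of the m largest
    L_hat = eigvec[:, top] * np.sqrt(eigval[top])  # principal component loadings (assumption)
    psi_hat = np.diag(S - L_hat @ L_hat.T)         # diagonal of Psi_hat (assumption)

    # Rows hold F_hat_i' = (x_i - x_bar)' S^{-1} L_hat, i.e. F_hat_i = L_hat' S^{-1} (x_i - x_bar).
    F_hat = np.linalg.solve(S, xc.T).T @ L_hat
    eps_hat = xc - F_hat @ L_hat.T                 # residual part

    cov_F = np.atleast_2d(np.cov(F_hat, rowvar=False))
    D_F = np.einsum("ij,ij->i", F_hat, np.linalg.solve(cov_F, F_hat.T).T)
    D_eps = np.sum(eps_hat**2 / psi_hat, axis=1)
    return D_F, D_eps

# Synthetic factor-structured data, for illustration only.
rng = np.random.default_rng(2)
n, p, m = 500, 6, 2
L_true = rng.normal(size=(p, m))
X_demo = rng.normal(size=(n, m)) @ L_true.T + rng.normal(size=(n, p))
D_F_hat, D_eps_hat = estimated_mds(X_demo, m)
```

Other loading estimators (for example maximum likelihood) would slot into the same place as `L_hat`; only that one line would change.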











Joint distribution of separated MDs

Proposition 1

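The joint distribution of $D(F_i)$ and $D(\varepsilon_i)$ is

$$Z = \begin{bmatrix} D(F_i) \\ D(\varepsilon_i) \end{bmatrix} \sim \begin{bmatrix} \chi^2_{(m)} \\ \chi^2_{(p)} \end{bmatrix}, \qquad \mathrm{cov}(Z) = \begin{bmatrix} 2m & 0 \\ 0 & 2p \end{bmatrix}.$$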
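A quick Monte Carlo sketch can illustrate Proposition 1: with $F_i$ and $\varepsilon_i$ drawn as in Definition 3, the sample mean of $Z$ should be close to $(m, p)'$ and its sample covariance close to $\mathrm{diag}(2m, 2p)$. The sizes and the diagonal of $\Psi$ below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 100_000, 5, 2                       # illustrative sizes
psi = np.linspace(0.5, 2.5, p)                # hypothetical diagonal of Psi

F = rng.normal(size=(n, m))                   # F_i ~ N(0, I_m)
eps = rng.normal(size=(n, p)) * np.sqrt(psi)  # eps_i ~ N(0, Psi), independent of F_i

D_F = np.sum(F**2, axis=1)                    # D(F_i) = F_i' F_i
D_eps = np.sum(eps**2 / psi, axis=1)          # D(eps_i) = eps_i' Psi^{-1} eps_i

Z = np.column_stack([D_F, D_eps])
mean_Z = Z.mean(axis=0)                       # compare with (m, p)
cov_Z = np.cov(Z, rowvar=False)               # compare with diag(2m, 2p)
```

The near-zero off-diagonal entries of `cov_Z` reflect the independence of the factor and error parts.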





Joint distribution of separated MDs

Proposition 2

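The asymptotic distribution of the MDs under the factor model with an unknown factor structure is

$$Y = \begin{bmatrix} D(\hat{F}_i) \\ D(\hat{\varepsilon}_i) \end{bmatrix} \xrightarrow{\mathcal{L}} \begin{bmatrix} \chi^2_{(m)} \\ \chi^2_{(p)} \end{bmatrix}, \qquad \mathrm{cov}(Y) = \begin{bmatrix} 2m & 0 \\ 0 & 2p \end{bmatrix}.$$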











Definitions of contaminated data

Definition 6

Two kinds of outliers are of interest: (i) an additive outlier in $\varepsilon_i$ and (ii) an additive outlier in $F_i$ ($m \times 1$). The corresponding definitions are given as follows (assuming a known mean):

(i) $X_i = L F_i + \varepsilon_i + \delta_i$, with $\delta_i = 0 \ \forall i \neq k$, $\delta_i$: $p \times 1$ and $\delta_k' \delta_k = 1$.

(ii) $X_i = L (F_i + \delta_i) + \varepsilon_i$, with $\delta_i = 0 \ \forall i \neq k$, $\delta_k$: $m \times 1$.
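The two contamination schemes can be sketched on simulated data as follows. The loadings, the index $k$ and the outlier directions are hypothetical, and $\Psi = I$ is assumed purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m, k = 50, 4, 2, 10                     # illustrative sizes; k indexes the outlier

L = rng.normal(size=(p, m))                   # hypothetical loadings
F = rng.normal(size=(n, m))
eps = rng.normal(size=(n, p))                 # Psi = I here, for simplicity
X_clean = F @ L.T + eps                       # known (zero) mean

# Case (i): additive outlier in eps_k, scaled so that delta_k' delta_k = 1.
delta_eps = np.zeros((n, p))
direction = rng.normal(size=p)
delta_eps[k] = direction / np.linalg.norm(direction)
X_case1 = X_clean + delta_eps

# Case (ii): additive outlier in F_k; it reaches X_k only through the loadings.
delta_F = np.zeros((n, m))
delta_F[k] = rng.normal(size=m)
X_case2 = (F + delta_F) @ L.T + eps
```

In case (ii) the observed shift is $L\delta_k$, so the outlier lies in the column space of the loadings, whereas in case (i) it can point in any direction of the $p$-dimensional space.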





Contaminated data

Let $d_{\delta_j}$, $j = 1, 2$, be the MDs for the two kinds of outliers, and let $S = n^{-1} \sum_{i=1}^{n} (L F_i + \varepsilon_i)(L F_i + \varepsilon_i)'$ be the sample covariance matrix with known mean. To simplify the calculations, assume $\delta_k \perp F_i$ and $\delta_k \perp \varepsilon_i$, so that they are not only uncorrelated but also orthogonal. The following expressions separate the contaminated and uncontaminated parts of the MD:





Contaminated data

Proposition 3

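For case (i),

$$d_{\delta_1} = \begin{cases} d_{ii}, & i \neq k, \\[4pt] d_{kk} - \dfrac{n \left(1 - x_k' n^{-1} S^{-1} \delta_k\right)^2}{1 + \delta_k' n^{-1} S^{-1} \delta_k} + n, & i = k. \end{cases}$$

For case (ii),

$$d_{\delta_2} = \begin{cases} d_{ii}, & i \neq k, \\[4pt] d_{kk} - \dfrac{n \left(1 - x_k' n^{-1} S^{-1} L \delta_k\right)^2}{1 + \delta_k' L' n^{-1} S^{-1} L \delta_k} + n, & i = k, \end{cases}$$

where $d_{ii} = n x_i' W^{-1} x_i = x_i' S^{-1} x_i$ (so that $W = nS$).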
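The $i = k$ branch of case (i) can be checked numerically under one possible reading, which the slides do not state explicitly: that the orthogonality assumption removes the cross terms, so the contaminated scatter matrix is $W + \delta_k \delta_k'$ with $W = nS$. Under that assumption the closed form agrees with the Sherman-Morrison update of $W^{-1}$. All data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 40, 3, 5                            # illustrative sizes; k indexes the outlier
X = rng.normal(size=(n, p))                   # synthetic data with known zero mean
S = X.T @ X / n                               # S = n^{-1} sum_i x_i x_i'
W = n * S
W_inv = np.linalg.inv(W)

delta = rng.normal(size=p)
delta /= np.linalg.norm(delta)                # delta_k' delta_k = 1

x_k = X[k]
d_kk = n * x_k @ W_inv @ x_k                  # uncontaminated MD, equals x_k' S^{-1} x_k

# Closed form from Proposition 3, case (i), i = k (note n^{-1} S^{-1} = W^{-1}).
a = x_k @ W_inv @ delta
b = delta @ W_inv @ delta
d_formula = d_kk - n * (1 - a) ** 2 / (1 + b) + n

# Direct computation under the assumed reading: W is replaced by W + delta delta'.
x_cont = x_k + delta
d_direct = n * x_cont @ np.linalg.inv(W + np.outer(delta, delta)) @ x_cont
```

Case (ii) follows the same pattern with `delta` replaced by the $p \times 1$ vector $L\delta_k$.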











Dataset

We apply the results to a data set consisting of the monthly closing stock prices of ten companies.

The companies are Exxon Mobil Corp., Chevron Corp., Apple Inc., General Electric Co., Wells Fargo & Co., Procter & Gamble Co., Microsoft Corp., Johnson & Johnson, Google Inc. and Johnson Controls Inc. The variables follow this order.





Dataset

We convert the closing prices into monthly rates of return ($r$),

$$r = \frac{y_t - y_{t-1}}{y_{t-1}},$$

where $y_t$ stands for the closing price in the $t$th month. The whole analysis is based on the returns of the stock prices.
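The return calculation is a one-liner on a price array. The sketch below assumes prices are stored with one row per month and one column per stock; the function name `monthly_returns` and the prices are made up for illustration and are not the data used in the presentation.

```python
import numpy as np

def monthly_returns(prices):
    """r_t = (y_t - y_{t-1}) / y_{t-1} for each column of a (T x p) price array."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

# Made-up prices for two hypothetical stocks over three months.
prices = np.array([[100.0, 50.0],
                   [110.0, 45.0],
                   [99.0, 54.0]])
returns = monthly_returns(prices)
```

A series of $T$ monthly prices yields $T - 1$ returns, since the first month has no predecessor.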





Boxplot

Figure: The boxplot of the original MDs.





Figure: The boxplot of error terms.





Figure: The boxplot of factor scores.





Bibliography I

Bai, J. (2003). Inferential theory for factor models of large dimensions, Econometrica 71(1): 135–171.

Fisher, R. A. (1940). The precision of discriminant functions, Annals of Human Genetics 10(1): 422–429.

Gorman, N., McConnell, I. and Lachmann, P. (1981). Characterisation of the third component of canine and feline complement, Veterinary Immunology and Immunopathology 2(4): 309–320.

Gregory, A. W. and Head, A. C. (1999). Common and country-specific fluctuations in productivity, investment, and the current account, Journal of Monetary Economics 44(3): 423–451.





Bibliography II

Holgersson, H. and Shukur, G. (2001). Some aspects of non-normality tests in systems of regression equations, Communications in Statistics-Simulation and Computation 30(2): 291–310.

Johnson, R. A. and Wichern, D. W. (2002). Applied multivariate statistical analysis, Vol. 5, Prentice Hall, Upper Saddle River, NJ.

Lewbel, A. (1991). The rank of demand systems: theory and nonparametric estimation, Econometrica: Journal of the Econometric Society pp. 711–730.

Mahalanobis, P. (1936). On the generalized distance in statistics, Proceedings of the National Institute of Sciences of India, Vol. 2, New Delhi, pp. 49–55.





Bibliography III

Mardia, K. (1974). Applications of some measures of multivariate skewness and kurtosis in testing normality and robustness studies, Sankhya: The Indian Journal of Statistics, Series B pp. 115–128.

Mardia, K., Kent, J. and Bibby, J. (1980). Multivariate analysis.

Mitchell, A. and Krzanowski, W. (1985). The Mahalanobis distance and elliptic distributions, Biometrika 72(2): 464–467.

Ross, S. A. (1976). The arbitrage theory of capital asset pricing, Journal of Economic Theory 13(3): 341–360.

Wilks, S. (1963). Multivariate statistical outliers, Sankhya: The Indian Journal of Statistics, Series A pp. 407–426.

