
Semi-Parametric Estimation for Multivariate Skew-Elliptical Distributions

Jing Huang
ESMT European School of Management and Technology

Abstract

We investigate a class of skew-elliptical distributions which generalize the skewed distributions proposed by Azzalini and Dalla Valle (1996). These distributions are generated by elliptical contours plus an unknown distribution function which assigns weights or densities to these contours. Decomposing such a skewed distribution into the two constituent parts allows us to estimate them separately, with both parametric and non-parametric methods. For the one and two dimensional cases, we use simulated data to illustrate the estimation algorithm. In a parametric framework and under the assumption that the density assigning function has a closed form, our approach leads to a faster estimation process than traditional log likelihood estimation (MLE) through direct optimization or through expectation maximization (EM). Finally, we show the results from fitting our model to a real data set on body mass measurements of Australian athletes.


1 Introduction

A class of skewed distributions was proposed by Azzalini and Dalla Valle (1996). They use a conditional elliptical distribution to create skewness in the marginal distribution. Suppose, for example, that (x, y) come from a two-dimensional normal distribution with parameters µ = [0, 0] and Σ = [1, δ; δ, 1]. Then the conditional distribution f(y | x > 0) follows a skewed distribution with parameter λ(δ), which has density

$$f(y, \lambda) = 2\phi(y)\,\Phi(\lambda y),$$

where φ and Φ denote the N(0,1) density and distribution function. Gupta (2003) constructed a multivariate skewed t-distribution by defining $Y_j = X_j/\sqrt{W/v}$, where $X = (X_1, \dots, X_p)'$ follows a skewed normal distribution and $W \sim \chi^2_v$ is independent of X. Fang and Zhang (1990) suggested a class of elliptical distributions given by $x = R\,u^{(k)}$, where R > 0 follows g(r) and $u^{(k)}$, with u′u = 1, is a random variable uniformly distributed on the unit sphere. Since the density function of F(r) may not exist, they assumed that the elliptically distributed random variable x has characteristic function φ(x). The unit sphere can be changed to an elliptical contour by a change of shape Σ, together with a location parameter µ: $x = \mu + R\,\Sigma\,u^{(k)}$. A general notation for the elliptical contour is $EC_k(\mu, \Sigma, \phi)$. Based on this property, Branco and Dey (2001) derived a general class of multivariate skewed-elliptical distributions by using a generalized density function g(r) instead of the normal distribution or the t distribution. Other parametric forms of skewed distributions are the skew-Cauchy distribution proposed by Arnold and Beaver (2000), and the extended skew-exponential power distribution suggested by Salinas et al. (2007), with a combined expression for both common ways of defining skewed distributions. Dey and Liu (2005) studied the density of different kinds of combinations for skewed distributed random variables.
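As a quick numerical illustration of the skew-normal density 2φ(y)Φ(λy) introduced at the start of this section, the following sketch checks that it integrates to one and that its mean equals δ√(2/π). It assumes the relation λ = δ/√(1−δ²), which is the form used later in the paper; the helper names (`phi`, `Phi`, `simpson`, `skew_normal_pdf`) and the value δ = 0.6 are illustrative choices, not part of the original text.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_normal_pdf(y, lam):
    """Density 2*phi(y)*Phi(lam*y) of the skew-normal distribution."""
    return 2.0 * phi(y) * Phi(lam * y)

def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

delta = 0.6                                  # illustrative correlation
lam = delta / math.sqrt(1.0 - delta ** 2)    # assumed link between lambda and delta

total_mass = simpson(lambda y: skew_normal_pdf(y, lam), -10.0, 10.0)
mean = simpson(lambda y: y * skew_normal_pdf(y, lam), -10.0, 10.0)
expected_mean = delta * math.sqrt(2.0 / math.pi)
```

The integration range [-10, 10] is wide enough that the truncated normal tails are negligible at double precision.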

For the non-parametric case, Genton and Loperfido (2005) suggested using a general skew function $\pi(\Sigma^{-1/2}(x-\mu))$ to replace the cdf part for those definitions in Azzalini and Dalla Valle (1996) or Branco and Dey (2001). In Ma et al. (2005), a semiparametric estimation method is given; they prove that the estimator is locally efficient if the model that is used to estimate the parameters contains the true model. They also assume that the general unknown function $\pi_0(x)$ has the form $\Phi(m(x))$, where Φ is the normal cdf and m(x) is an odd function. They used an odd polynomial $P_K(x)$ of order K in order to estimate m(x), hoping that $\Phi(P_K(x))$ is close to $\pi_0(x)$.

In this paper, we develop an approach for estimating the parameters Σ and µ of the elliptical contour without knowing the explicit form of the distribution of g(r) defined according to Fang and Zhang (1990, p.59). In the later part of this paper, we also assume that the density assigning function is known, in order to compare our estimation approach with direct maximization of the likelihood and with the expectation-maximization algorithm. The elliptical contour appears in many common definitions, for example the multivariate normal or the multivariate t distribution. In fact, any distribution function where the random variable X appears in the form $(X-\mu)'\Sigma^{-1}(X-\mu)$ is an elliptical distribution, and any skewed distribution defined on it can be estimated using the method discussed in this paper. We focus here on the one dimensional case, and make the single assumption that this unknown function follows an elliptical distribution. The definition of elliptical distribution is not unique, but all definitions are equivalent. One definition given by Fang and Zhang (1990) is as follows:

Definition 1 If an n × 1 random vector follows an elliptical distribution, then it has a stochastic representation $R \cdot u^{(n)}$, where R ≥ 0 is independent of $u^{(n)}$, and $u^{(n)}$ follows a uniform distribution on the sphere $\|u^{(n)}\|^2 = 1$.

Thus an elliptical distribution has two independent parts, R ≥ 0 and $u^{(n)}$. Here $u^{(n)}$ is a sphere, and under some transformation (discussed in the next section) $u^{(n)}$ can be an elliptical contour. Figure 1 shows an example of skew distribution generated from the normal distribution. The colored lines between the two graphs indicate the correspondence between densities and elliptical contours.


Figure 1: The density function of R (right) and the corresponding elliptical contours (left)

In this paper, we make three contributions. First, we show how a multivariate distribution (skewed or non-skewed) can be decomposed into two parts: an n-dimensional elliptical sphere, and a univariate function g(r). Second, we develop a non-parametric approach for estimating the function g(r). Third, we add skewness to those elliptical spheres which satisfy the definition in Azzalini and Dalla Valle (1996). In the one-dimensional and two-dimensional cases, we give the moment conditions which can be used for parametric estimation.

2 Skewed elliptical contour

In this section we discuss the skewed elliptical contour in the $R^n$ case, which is the basic element for the skewed elliptical distribution. There are two steps in reaching the skewed elliptical contour from a uniformly distributed ball with radius r: $\|u_n\|^2 = r^2$. In the first step we follow Azzalini and Dalla Valle (1996) and calculate the conditional density of $u_{n-1}$ given $u_1 > 0$, where $[u_{n-1}; u_1]$ is a partition of $u_n$. In the second step we apply the linear transformation $x = T \cdot u_{n-1}$ to change the shape from a half ball to a skewed elliptical contour.

Let $\tilde u_{n-1}$ and $\tilde u_1$ be random variables uniformly distributed on the $R^{n-1}$ ball with radius r and on the $R^1$ ball with radius r, respectively. Then the relationship between $u_{n-1}$, $u_1$ and $\tilde u_{n-1}$, $\tilde u_1$ is

$$\begin{bmatrix} u_{n-1} \\ u_1 \end{bmatrix} \xrightarrow{\,d\,} \begin{bmatrix} d_{n-1}\,\tilde u_{n-1} \\ d_1\,\tilde u_1 \end{bmatrix},$$

where $(d_{n-1}^2, d_1^2) \sim B\left(\tfrac12(n-1), \tfrac12\right)$. So $r = R \cdot d_{n-1}$ has density:

$$g(r) = \frac{2\,\Gamma(\frac{n}{2})}{\sqrt{\pi}\,\Gamma(\frac{n-1}{2})}\, R^{2-n}\, r^{n-2}\, (R^2 - r^2)^{-1/2}. \tag{1}$$

Fang and Zhang (1990) gave another relationship between the multivariate density $f_n(x)$ and g(r):

$$g(r) = \frac{2\pi^{n/2}}{\Gamma(n/2)}\, r^{n-1} f_n(r^2). \tag{2}$$


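In our case, $f_{n-1}(u_{n-1})$ is the multivariate density for $u_{n-1}$, thus

$$f^R_{n-1}(u_{n-1}) = \frac{\Gamma(\frac{n}{2})}{\pi^{n/2}}\, R^{2-n}\, (R^2 - \|u_{n-1}\|^2)^{-1/2}. \tag{3}$$

Now we can proceed to the second step. Let $[x_{n-1}; x_1]$ be a partition of x, where $x_{n-1} \in R^{n-1}$ and $x_1 \in R$. Furthermore, x and $u_n$ have the linear relationship

$$x'\Sigma^{-1}x = \|u_n\|^2,$$

where Σ is a positive semi-definite matrix. We need to find the conditional density of $x_{n-1}$ given $x_1 > 0$. Without loss of generality we assume that x is standardized, so the diagonal elements of Σ are all equal to one and Σ can be written as

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \delta \\ \delta' & 1 \end{bmatrix}.$$

Let $\Sigma_{11.2} = \Sigma_{11} - \delta\delta'$ and $\Sigma_{11.2} = A'_{11.2}A_{11.2}$. Then Σ can be decomposed as

$$\Sigma = \begin{bmatrix} I & \delta \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} \Sigma_{11.2} & 0 \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} I & 0 \\ \delta' & 1 \end{bmatrix}.$$

Thus

$$\begin{bmatrix} x_{n-1} \\ x_1 \end{bmatrix} = \begin{bmatrix} A'_{11.2}\,u_{n-1} + \delta u_1 \\ u_1 \end{bmatrix}. \tag{4}$$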
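To make the transformation (4) concrete, the sketch below checks it in the simplest setting n = 2, where $A_{11.2} = \sqrt{1-\delta^2}$ is a scalar. It walks around a circle of radius R in the (u, u_1) plane, maps each point through (4), and confirms that the image satisfies $x'\Sigma^{-1}x = R^2$. The function names and the values δ = 0.4, R = 1.7 are illustrative assumptions.

```python
import math

delta = 0.4                          # illustrative skewness parameter
a = math.sqrt(1.0 - delta ** 2)      # A_{11.2} in the case n = 2

def to_x(u, u1):
    """Transformation (4) for n = 2: (u, u1) -> (x, x1)."""
    return a * u + delta * u1, u1

def quad_form(x, x1):
    """[x, x1] Sigma^{-1} [x, x1]' for Sigma = [[1, delta], [delta, 1]]."""
    return (x * x - 2.0 * delta * x * x1 + x1 * x1) / (1.0 - delta ** 2)

R = 1.7
max_err = 0.0
for k in range(360):
    theta = 2.0 * math.pi * k / 360.0
    u, u1 = R * math.sin(theta), R * math.cos(theta)   # point on the circle
    x, x1 = to_x(u, u1)
    max_err = max(max_err, abs(quad_form(x, x1) - R * R))
```

Here `max_err` stays at rounding-error level, which is the n = 2 instance of the identity $x'\Sigma^{-1}x = \|u_n\|^2$.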

Then the conditional density function can be written as

$$f^R(x_{n-1} \mid x_1 > 0) = f(A'_{11.2}\,u_{n-1} + \delta u_1 \mid u_1 > 0).$$

Since $\|u_{n-1}\|^2 + \|u_1\|^2 = R^2$ and $x_{n-1} = A'_{11.2}u_{n-1} + \delta u_1$, we can solve for $u_1$ and obtain

$$u_1 = \frac{B(x_{n-1}) \pm \sqrt{B(x_{n-1})^2 - A \cdot (C(x_{n-1}) - R^2)}}{A}, \tag{5}$$

where

$$A = \delta'\Sigma_{11.2}^{-1}\delta + 1, \qquad B(x_{n-1}) = \delta'\Sigma_{11.2}^{-1}x_{n-1}, \qquad C(x_{n-1}) = x'_{n-1}\Sigma_{11.2}^{-1}x_{n-1}.$$

Inserting (5) into (4), we can now solve for $u_{n-1}$ as a function of $x_{n-1}$. Let $|J| = \left|\frac{\partial(u_{n-1})}{\partial x_{n-1}}\right|$ be the determinant of the Jacobian. From (5) and (3) it follows that the marginal density of $x_{n-1}$ is

$$f^R(x_{n-1}) = \frac{\Gamma(\frac{n}{2})}{\pi^{n/2}}\, R^{2-n}\, \|u_1(x_{n-1})\|^{-1}\, |J|. \tag{6}$$

Denote the conditional density by $f^R$. The whole density is determined by two parts: g(r), which is the density of R, and, for each R, the $f^R$ defined on it. The whole density function can be written as

$$f(x_{n-1} \mid x_1 > 0) = \int_0^\infty f^R(x_{n-1})\, g(R)\, dR. \tag{7}$$

By using the properties from Fang and Zhang (1990, p.68-70), after some calculations (see Appendix), we can derive the whole density function (7):

$$f(x_{n-1} \mid x_1 > 0) = 2F^*_b(\lambda(x_{n-1} - \mu))\, f_1(x), \tag{8}$$

where $b = (x_{n-1}-\mu)'\Sigma_{11}^{-1}(x_{n-1}-\mu)$, $\lambda = \delta'\Sigma^{-1}/\sqrt{1 - \delta'\Sigma^{-1}\delta}$ and

$$F^*_b(x) = \frac{\int_{-\infty}^{x} f_n(t^2 + b)\,dt}{\int_{-\infty}^{\infty} f_n(t^2 + b)\,dt}$$


is a cdf function, $f_1(\cdot)$ is the one-dimensional and $f_n(\cdot)$ is the n-dimensional density in (2). This shows that our model is equivalent to the definition in Azzalini and Dalla Valle (1996).

In real life applications, g(r) is usually unknown, so we need to estimate it. If we know $\Sigma_{11}$, let $r = \sqrt{x'_{n-1}\Sigma_{11}^{-1}x_{n-1}}$; then we can estimate the density of r non-parametrically. Denote the estimated density function as $g_{n-1}(r)$, an estimate of the true density of the radius of the (n−1)-dimensional sphere. For deriving properties of conditional densities we need the density (1), therefore we have to find the link between $g_{n-1}(r)$ and g(r). We can again decompose $u_{n-1}$ into $u^{n-1}_1$ and $u^{n-1}_{n-2}$; then $u_1$ and $u^{n-1}_1$ should have the same marginal density. Denote the densities of $u_1$ and $u^{n-1}_1$ on the $R^n$ sphere and on the $R^{n-1}$ sphere as $f^n_1$ and $f^{n-1}_1$ respectively, which can be easily derived from equation (3). Then their relationship can be derived from the equality of the marginals:

$$\int f^{n-1}_1\, g_{n-1}(r)\,dr = \int f^n_1\, g(r)\,dr. \tag{9}$$

In the following two sections we illustrate this structure in the one and two dimensional cases. We discuss in detail a semi-parametric estimation method which can be extended to the high dimensional case in a straightforward manner.

3 Skewed elliptical structure in the one dimensional case

3.1 Conditional density

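When n = 2 and

$$\Sigma = \begin{bmatrix} 1 & \delta \\ \delta & 1 \end{bmatrix},$$

we have $\Sigma_{11.2} = 1 - \delta^2$ and $A_{11.2} = \sqrt{1-\delta^2}$. Denote $x_{n-1} = x$ and $u_{n-1} = u$. Then (5) becomes

$$u_1 = \delta x \pm \sqrt{\delta^2 x^2 - (x^2 - (1-\delta^2)R^2)}. \tag{10}$$

Using (4) we obtain

$$|J| = \left|\frac{\partial u}{\partial x}\right| = \frac{\delta x \pm \sqrt{(1-\delta^2)(R^2 - x^2)}}{\sqrt{R^2 - x^2}}. \tag{11}$$

Inserting (10) and (11) into (6), we obtain the density function

$$f^R(x) = \frac{1}{\pi\sqrt{R^2 - x^2}}. \tag{12}$$

We can write $[u, u_1]$ as a function of $[x, x_1]$. The condition $x_1 > 0$, or equivalently $u_1 > 0$, rules out some areas where x can take values. Without loss of generality we assume δ > 0, so if x < 0, from (10), the "±" can only take "+" and $x \ge -\sqrt{1-\delta^2}\,R$. If x > 0, when $x \le \sqrt{1-\delta^2}\,R$, "±" can only be "+". When $x > \sqrt{1-\delta^2}\,R$, the "±" can take both signs. The conditional density is then

$$f^R(x) = \begin{cases} \dfrac{1}{\pi\sqrt{R^2 - x^2}} & \text{if } -\sqrt{1-\delta^2}\,R \le x \le \sqrt{1-\delta^2}\,R, \\[2ex] \dfrac{2}{\pi\sqrt{R^2 - x^2}} & \text{if } \sqrt{1-\delta^2}\,R < x \le R. \end{cases} \tag{13}$$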
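A short check of (13): using the antiderivative arcsin(x/R)/π of $1/(\pi\sqrt{R^2-x^2})$, the two pieces of the conditional density carry masses that sum to one for any δ in (0, 1), independently of R. The sketch below computes these masses in closed form and also evaluates the piecewise density. The function names are illustrative, and the code assumes δ > 0 as in the text.

```python
import math

def f_R(x, R, delta):
    """Piecewise conditional density (13) on a contour of radius R (delta > 0)."""
    s = math.sqrt(1.0 - delta ** 2)
    if abs(x) >= R or x < -s * R:
        return 0.0
    base = 1.0 / (math.pi * math.sqrt(R * R - x * x))
    return 2.0 * base if x > s * R else base

def contour_masses(delta):
    """Closed-form mass of each piece of (13); neither depends on R."""
    s = math.sqrt(1.0 - delta ** 2)
    central = 2.0 * math.asin(s) / math.pi                     # mass on [-sR, sR]
    right = (2.0 / math.pi) * (math.pi / 2.0 - math.asin(s))   # mass on (sR, R]
    return central, right

masses = {d: contour_masses(d) for d in (0.1, 0.5, 0.9)}
```

Half of the central mass lies to the left of zero, so the probability of a negative value is arcsin(√(1−δ²))/π.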

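If we also consider a location parameter µ, the conditional density becomes

$$f^R(x) = \begin{cases} \dfrac{1}{\pi\sqrt{R^2 - (x-\mu)^2}} & \text{if } -\sqrt{1-\delta^2}\,R + \mu \le x \le \sqrt{1-\delta^2}\,R + \mu, \\[2ex] \dfrac{2}{\pi\sqrt{R^2 - (x-\mu)^2}} & \text{if } \sqrt{1-\delta^2}\,R + \mu < x \le R + \mu. \end{cases}$$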


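For each R, this density is weighted with probability g(r), so the full density function f(x) is $\sum_i f^{r_i}(x)\,g(r_i)\,\Delta r$, or more specifically:

$$f(x) = \begin{cases} \displaystyle\int_{x-\mu}^{\frac{x-\mu}{\sqrt{1-\delta^2}}} \frac{2\,g(r)}{\pi\sqrt{r^2 - (x-\mu)^2}}\,dr + \int_{\frac{x-\mu}{\sqrt{1-\delta^2}}}^{\infty} \frac{g(r)}{\pi\sqrt{r^2 - (x-\mu)^2}}\,dr & \text{if } x \ge \mu, \\[3ex] \displaystyle\int_{\frac{\mu - x}{\sqrt{1-\delta^2}}}^{\infty} \frac{g(r)}{\pi\sqrt{r^2 - (x-\mu)^2}}\,dr & \text{if } x < \mu. \end{cases} \tag{14}$$

For x < µ the lower limit is the smallest radius whose contour reaches x, which by (13) is $(\mu - x)/\sqrt{1-\delta^2}$.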

As an example, let us consider the normal distribution. Replace $x_1$ by y, let µ = 0 without loss of generality, and recall that the two-dimensional normal distribution has density

$$\frac{1}{2\pi|\Sigma|^{1/2}} \exp\left(-\frac12\, [x, y]\,\Sigma^{-1} \begin{bmatrix} x \\ y \end{bmatrix}\right). \tag{15}$$

Let u = sin(θ), $u_1$ = cos(θ), and $\beta = \arcsin(\sqrt{1-\delta^2})/2$. Then we can write the relationship between [x, y] and $[u, u_1]$ as:

$$x = R \cdot \sin(\theta - \beta), \qquad y = R \cdot \sin(\theta + \beta). \tag{16}$$

From (16) and (15) it follows that

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi|\Sigma|^{1/2}} \exp\left(-\frac12\, [x, y]\,\Sigma^{-1} \begin{bmatrix} x \\ y \end{bmatrix}\right) dx\,dy = \int_0^{2\pi}\int_0^{\infty} \frac{1}{2\pi\sqrt{1-\delta^2}}\, \underbrace{|J|}_{= r\sin(2\beta)} \exp\left(-\frac12 r^2\right) dr\,d\theta = \int_0^{\infty} r \cdot \exp\left(-\frac12 r^2\right) dr,$$

where |J| is the Jacobian of the transformation from dx dy to dr dθ. Comparing with (14), the g(r) density in the case of the normal distribution is

$$g(r) = r \cdot \exp\left(-\frac12 r^2\right). \tag{17}$$

Thus the two-dimensional normal distribution can be fully determined by the elliptical contours (14) and the density (17). The same result for g(r) can be obtained from (2). For the convenience of calculations, we therefore use the polar system (16) in the two dimensional case, where $\sin(2\beta) = \sqrt{1-\delta^2}$.
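The remark that (17) also follows from (2) can be checked directly. The sketch below evaluates the right-hand side of (2) for the standard normal density written as a function of the squared radius, and compares it with r·exp(−r²/2) for n = 2. The function names are illustrative, and treating $f_n$ as a function of $r^2$ follows the notation of (2).

```python
import math

def f_n_normal(u, n):
    """N(0, I_n) density written as a function of the squared radius u = r^2."""
    return (2.0 * math.pi) ** (-n / 2.0) * math.exp(-0.5 * u)

def g_from_eq2(r, n, f_n):
    """Radial density from (2): 2*pi^(n/2)/Gamma(n/2) * r^(n-1) * f_n(r^2)."""
    const = 2.0 * math.pi ** (n / 2.0) / math.gamma(n / 2.0)
    return const * r ** (n - 1) * f_n(r * r, n)

r = 1.3
g2 = g_from_eq2(r, 2, f_n_normal)             # should equal r*exp(-r^2/2), as in (17)
g3 = g_from_eq2(r, 3, f_n_normal)             # sqrt(2/pi)*r^2*exp(-r^2/2) for n = 3
```

For n = 3 the same formula gives $\sqrt{2/\pi}\,r^2\exp(-r^2/2)$, the radial density relevant to the two-dimensional skewed case.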

3.2 Distribution function

The goal of this paper is to estimate g(r) so that we can then estimate µ and δ. From (13) we see that as δ decreases (β increases), the density moves from the interval $[-R, -R\sqrt{1-\delta^2}]$ to $[R\sqrt{1-\delta^2}, R]$. Since y must have a symmetric distribution when δ = 0 and β = π/4, the skewness derives from the movement of these densities. So by moving them back, the original symmetric distribution can be recovered. Note that we assume µ = 0. Suppose that when δ = 0, the density function of y is $\tilde f(x)$, where the tilde marks the symmetric density. Then for any ξ ≥ 0, we must have

$$f(x+\xi) - \tilde f(x+\xi) = \tilde f(x-\xi) - f(x-\xi) \;\Rightarrow\; \tilde f(x+\xi) = \tfrac12\left(f(x+\xi) + f(x-\xi)\right).$$

Following Li and Racine (2007), we can use the kernel density k(x, X) to estimate f(x); then we can estimate the symmetric density $\tilde f(y)$. Note that for some distributions (e.g. the multivariate normal), x and y are independent when δ = 0, so $f_n(R^2) = f_x(\sqrt{R^2 - a^2}) \cdot f_y(\sqrt{a^2})$ for any 0 ≤ a ≤ c.

In our setting $f_x = f_y = f$, so we can estimate $f_x$ and then g(c) by fixing a (e.g. a = 0) in (2). The accuracy of this estimation depends on the accuracy of the kernel estimation. But for other distributions (e.g. the multivariate t), x and y are not independent, so we need to check those more general cases.

Since (14) has a singular point which can raise difficulties in estimation, we calculate the distribution function instead of the density function. On the left side of µ, the distribution function is:

$$G(x) = \int_R^{\infty} \frac{1}{\pi} \arcsin\left(\frac{x-\mu}{r}\right) g(r)\,dr + \int_R^{\infty} \frac{1}{\pi}\, 2\beta \cdot g(r)\,dr, \tag{18}$$

where $\mu - R\sin(2\beta) = x$. On the right side of µ, the distribution function is:

$$1 - G(x) = \int_{\underline{R}}^{\infty} g(r)\,dr - \int_{\underline{R}}^{\infty} \frac{1}{\pi} \arcsin\left(\frac{x-\mu}{r}\right) g(r)\,dr - \int_{\underline{R}}^{\overline{R}} \frac{1}{\pi} \arcsin\left(\frac{x-\mu}{r}\right) g(r)\,dr - \int_{\overline{R}}^{\infty} \frac{1}{\pi}\, 2\beta \cdot g(r)\,dr, \tag{19}$$

where $\underline{R} + \mu = x$ and $\overline{R}\sin(2\beta) + \mu = x$. From (18) or (19) we can easily see that, if x = µ,

$$G(\mu) = \frac{2\beta}{\pi}.$$

Since δ is a one-to-one function of β, this provides a fast estimation method for δ if we know µ, or for µ if we know δ, provided that we have a good estimate of G(µ). Note also that when f(x) is given, µ and δ can be determined by each other, so there is one less parameter to estimate. From now on, we only focus on µ.
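The identity G(µ) = 2β/π suggests a very simple plug-in estimate of δ when µ is known. Since sin(2β) = √(1−δ²) and δ > 0, we have δ = cos(2β) = cos(π·G(µ)). The sketch below illustrates this on simulated skew-normal data; the simulation design (bivariate normal, keep y when x > 0), the sample size, the seed and the function names are all illustrative assumptions rather than the procedure used in the paper.

```python
import math
import random

def simulate_y_given_x_positive(n, mu, delta, seed=0):
    """Draw (x, y) bivariate normal with correlation delta; keep y when x > 0."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - delta ** 2)
    kept = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        y = mu + delta * x + s * e     # corr(x, y) = delta, y has location mu
        if x > 0.0:
            kept.append(y)
    return kept

def delta_from_G(sample, mu):
    """Plug-in estimate: G(mu) = 2*beta/pi and delta = cos(2*beta), delta >= 0."""
    G_mu = sum(1 for y in sample if y <= mu) / len(sample)
    return math.cos(math.pi * G_mu)

sample = simulate_y_given_x_positive(200000, mu=2.8, delta=0.5, seed=1)
delta_hat = delta_from_G(sample, mu=2.8)
```

The estimate only needs the empirical distribution function at a single point, so its cost is one pass over the data.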

3.3 Moments of g(r)

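Proposition 1. Let f(x) be the density function defined in (14). Then the following relationship holds:

$$\int_{-\infty}^{\infty} (x-\mu)^n f(x)\,dx = W_n \cdot \int_0^{\infty} r^n g(r)\,dr, \tag{20}$$

where

$$W_n = \frac{\sin^{n-1}(2\beta) + \sin^{n-1}(-2\beta)}{n\pi} \cdot \cos(2\beta) + \frac{n-1}{n}\, W_{n-2}, \quad \text{if } n > 2,$$

and

$$W_n = \frac{1}{\pi}\int_{-2\beta}^{2\beta} \sin^n\theta\,d\theta + \frac{2}{\pi}\int_{2\beta}^{\pi/2} \sin^n\theta\,d\theta.$$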
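The two expressions for $W_n$ in Proposition 1 can be compared numerically. The sketch below evaluates the integral form with Simpson's rule and the recursion with base cases $W_0 = 1$ and $W_1 = 2\cos(2\beta)/\pi$; these base cases are not stated in the text and are obtained here by evaluating the integral form directly, so they should be read as an assumption of this sketch. Function names and the value δ = 0.3 are illustrative.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def W_direct(n, beta):
    """Integral form of W_n from Proposition 1."""
    sin_n = lambda t: math.sin(t) ** n
    return (simpson(sin_n, -2.0 * beta, 2.0 * beta) / math.pi
            + 2.0 * simpson(sin_n, 2.0 * beta, math.pi / 2.0) / math.pi)

def W_recursive(n, beta):
    """Recursion of Proposition 1; base cases derived from the integral form."""
    if n == 0:
        return 1.0
    if n == 1:
        return 2.0 * math.cos(2.0 * beta) / math.pi
    boundary = ((math.sin(2.0 * beta) ** (n - 1) + math.sin(-2.0 * beta) ** (n - 1))
                * math.cos(2.0 * beta) / (n * math.pi))
    return boundary + (n - 1) / n * W_recursive(n - 2, beta)

delta = 0.3
beta = math.asin(math.sqrt(1.0 - delta ** 2)) / 2.0
gaps = [abs(W_direct(n, beta) - W_recursive(n, beta)) for n in range(1, 7)]
```

In particular $W_1 = 2\delta/\pi$ and $W_2 = 1/2$, which are the weights that appear in the first two moment equations.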

3.4 Parameter estimation

We know that f(x) can be uniquely determined by g(r). In this section, we provide a way to estimate the function g(r). Assume we know µ; then the density function of |x − µ| is $g_{n-1}(\cdot)$ in (9). Since in this section n = 2, we write $g_{n-1}(\cdot)$ as $g_1(\cdot)$. According to the result we derived above, in the one-dimensional case (9) becomes:

$$g_1(x) = \int_x^{\infty} \frac{g(R)}{\pi\sqrt{R^2 - x^2}}\,dR. \tag{21}$$

In the one-dimensional case, we write g(r) as $g_2(r)$ to differentiate it from $g_1(r)$. Figure 2 shows the good fitting performance of the estimated g(r) compared to the true g(r), for the normal and $t_3$ distributions.

Figure 2: The estimated $g_2(R)$ versus the true $g_2(R)$ for the normal and $t_3$ distributions, with 10000 and 50000 simulated data (panels: $t_3$, n = 50000; $t_3$, n = 10000; normal, n = 10000; normal, n = 50000)

After obtaining g(r) from the pre-assumed value of µ, we need to check whether this value of µ is correctly chosen. We insert the estimated g(r) back to compute the density or distribution functions and compare the result with the empirical or kernel densities. Since we do not have the closed form of g(r) and computing the density involves integrating, using g(r) in the integrals can lead to significant inaccuracy. Therefore we can use the density function in (8). In the one-dimensional case it becomes

$$f(x) = 2F^*_b(\lambda'(x - \mu))\, f_1(x), \tag{22}$$

where $b = (x-\mu)^2$, $\lambda = \delta/\sqrt{1-\delta^2}$ and

$$F^*_b(x) = \frac{\int_{-\infty}^{x} f_2(t^2 + b)\,dt}{\int_{-\infty}^{\infty} f_2(t^2 + b)\,dt}$$

is a cdf function. In the case of the skewed normal density, $f_1(x) = \frac{1}{\sqrt{2\pi}}\exp(-\frac12 x^2)$ and $f_2(x) = \frac{1}{2\pi}\exp(-\frac12 x^2)$ is the standard two-dimensional density function. Note that $f_1(x)$ is the one-dimensional standard normal density function since the diagonal elements of our Σ are all ones. In section 5 we will briefly discuss the case when they are not.
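For the normal case, the ratio defining $F^*_b$ does not depend on b and reduces to the standard normal cdf Φ, so that (22) becomes the familiar skew-normal density. The sketch below checks this numerically. It assumes that $f_2$ is evaluated as a function of the squared radius, i.e. $f_2(u) = \exp(-u/2)/(2\pi)$ with $u = t^2 + b$; this reading of the notation, the integration limits ±12 and the function names are assumptions of the sketch.

```python
import math

def f2_normal(u):
    """Bivariate standard normal density as a function of squared radius u (assumed reading)."""
    return math.exp(-0.5 * u) / (2.0 * math.pi)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def F_star(x, b, f2, lo=-12.0, hi=12.0):
    """Ratio of integrals defining F*_b(x), truncated to [lo, hi]."""
    integrand = lambda t: f2(t * t + b)
    return simpson(integrand, lo, x) / simpson(integrand, lo, hi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

values = [F_star(0.7, b, f2_normal) for b in (0.0, 1.5, 4.0)]
```

For heavier-tailed choices of $f_2$, such as the t distribution, the factor involving b no longer cancels and $F^*_b$ genuinely depends on b.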

Thus our algorithm for estimating µ is as follows:

a. pre-assign a starting value for µ, and center the data: $\tilde x = x - \mu$.

b. estimate $g_1(x)$ using $g_1(x) = k(|\tilde x|)$, where k(·) is a kernel density or other non-parametric density function.

c. calculate g(r) according to (21).

d. calculate the distance $d = \int |k(x) - f(x)|\,dx$ using a kernel or other non-parametric estimator and (22); write d as a function of µ: d = D(µ).

e. solve the minimization problem $\min_\mu D(\mu)$.

The optimization process in step (e) is not the only possible choice and may not be the optimal choice either. We can also apply other criteria here, for example maximizing the likelihood, minimizing the mean squared error (MSE), etc. We do not discuss these criteria further here, since our purpose is mainly to introduce a new model.


3.5 A simulation study

In order to investigate the performance of our estimation approach, we ran a small simulation study. We generated data $(x_i, y_i)$ with i = 1, ..., 10000, from the two-dimensional t and normal distributions:

$$t_v\left([0, \mu], \begin{bmatrix} 1 & \delta \\ \delta & 1 \end{bmatrix}\right) \quad \text{or} \quad N\left([0, \mu], \begin{bmatrix} 1 & \delta \\ \delta & 1 \end{bmatrix}\right).$$

We retained $y_i$ when $x_i \ge 0$, and applied the estimation algorithm outlined in the previous section. We repeated the simulation 50 times in each scenario. To assess the estimation performance, we computed the bias and mean squared error (MSE) of our estimates as follows:

$$\text{bias}\;\hat\mu = \frac{1}{50}\sum_{n=1}^{50} (\hat\mu_n - \mu),$$

and

$$\text{MSE}\;\hat\mu = \frac{1}{50}\sum_{n=1}^{50} (\hat\mu_n - \mu)^2,$$

where $\hat\mu_n$ is the estimated value for the true µ in the n-th replication. Similarly, we computed the bias and MSE for δ. Table 1 shows the bias and MSE values in different simulation scenarios.
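The two performance measures above are simple averages over replications. A minimal sketch, with hypothetical function names and made-up replication values used only to show the arithmetic:

```python
def bias(estimates, true_value):
    """Average deviation of the replicated estimates from the true value."""
    return sum(e - true_value for e in estimates) / len(estimates)

def mse(estimates, true_value):
    """Average squared deviation of the replicated estimates from the true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)

# Three made-up replications of an estimate of mu = 2.8 (illustrative only).
mu_hats = [2.7, 2.9, 3.0]
b = bias(mu_hats, 2.8)
m = mse(mu_hats, 2.8)
```

By construction the MSE is a mean of squares and therefore cannot be negative.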

Model   True µ   True δ   Bias µ    Bias δ     MSE µ     MSE δ
N       2.8      0.1      -0.007    0.005      0.008     0.012
N       2.9      0.3      -0.09     0.097      0.022     0.022
N       1.3      0.7      -0.080    0.072      0.008     0.006
N       3.0      0.9      0.059     -0.0429    1e-4      5e-5
t3      2.8      0.1      -0.027    0.030      0.014     0.017
t3      2.9      0.3      -0.032    -0.034     0.0192    0.0238
t3      1.3      0.7      0.009     -0.0105    0.003     0.0023
t3      3.0      0.9      0.006     -0.004     -0.0007   0.0002

Table 1: Estimation performance with data generated from the t distribution with 3 and 5 degrees of freedom and from the normal distribution

3.6 Comparison with other parametric models

In this section we compare our method with other popular parametric estimation methods. We assume that the data follow a skew-normal distribution. Direct maximization of the likelihood function does not generally work for this type of skewed data. According to Azzalini and Capitanio (1999), the biggest problem with likelihood maximization is that the expected Fisher information matrix becomes singular at δ = 0. So the expectation-maximization method (EM) is proposed instead in order to compute the MLE. To use EM in this setting, we assume that the whole data is $(x_1, x_2)$, but we only observe $x_1$. Remember that $(x_1, x_2)$ has a two-dimensional half normal distribution: if $(y_1, y_2)$ follows a bivariate normal distribution, then $(y_1, |y_2|)$ follows a bivariate half normal distribution. So $(x_1, x_2)$ has density function given by

$$f(x_1, x_2) = \frac{1}{\pi}|\Sigma|^{-1/2} \exp\left(-\frac12 (x_1 - \mu, x_2)\,\Sigma^{-1}\,(x_1 - \mu, x_2)'\right) = \frac{1}{\pi\sqrt{1-\delta^2}} \exp\left(-\frac{1}{2(1-\delta^2)}\left((x_1-\mu)^2 - 2\delta(x_1-\mu)x_2 + x_2^2\right)\right),$$

where $\Sigma = \begin{bmatrix} 1 & \delta \\ \delta & 1 \end{bmatrix}$.


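The log-likelihood function for $(x_1, x_2)$ is

$$L(\mu, \delta \mid x_1, x_2) = -n\log(\pi) - \frac{n}{2}\log(1-\delta^2) - \frac{1}{2(1-\delta^2)}\sum_{i=1}^{n}(x_{1i}-\mu)^2 + \frac{\delta}{1-\delta^2}\sum_{i=1}^{n}(x_{1i}-\mu)x_{2i} - \frac{1}{2(1-\delta^2)}\sum_{i=1}^{n}x_{2i}^2.$$

Then, writing $\hat\mu$ and $\hat\delta$ for the current parameter estimates,

$$E[L(\mu, \delta \mid x_1, x_2) \mid x_1, \hat\mu, \hat\delta] = -n\log(\pi) - \frac{n}{2}\log(1-\delta^2) - \frac{1}{2(1-\delta^2)}\sum_{i=1}^{n}(x_{1i}-\mu)^2 + \frac{\delta}{1-\delta^2}\sum_{i=1}^{n}(x_{1i}-\mu)E(x_2 \mid x_{1i}) - \frac{1}{2(1-\delta^2)}\sum_{i=1}^{n}E(x_2^2 \mid x_{1i}). \tag{23}$$

Note that $f(x_2 \mid x_1)$ has the density function

$$f(x_2 \mid x_1) = \frac{\dfrac{1}{\sqrt{2\pi(1-\delta^2)}} \exp\left(-\dfrac12\left(\dfrac{x_2}{\sqrt{1-\delta^2}} - \dfrac{\delta(x_1-\mu)}{\sqrt{1-\delta^2}}\right)^2\right)}{\Phi(\lambda(x_1-\mu))},$$

so we have

$$E(x_2 \mid x_1) = \sqrt{1-\hat\delta^2}\,\frac{A_1}{A_0} + \hat\delta(x_1 - \hat\mu),$$

$$E(x_2^2 \mid x_1) = (1-\hat\delta^2)\frac{A_2}{A_0} + 2\hat\delta\sqrt{1-\hat\delta^2}\,(x_1-\hat\mu)\frac{A_1}{A_0} + \hat\delta^2(x_1-\hat\mu)^2, \tag{24}$$

where

$$A_j = \int_{-\hat\lambda(x_1-\hat\mu)}^{\infty} t^j\, \frac{1}{\sqrt{2\pi}} \exp\left(-\frac12 t^2\right) dt,$$

and $\hat\lambda = \hat\delta/\sqrt{1-\hat\delta^2}$. Then the two steps of the EM algorithm are as follows:

• E-step: compute $E[L(\mu, \delta \mid x_1, x_2) \mid x_1, \hat\mu, \hat\delta]$ using (23) and (24).

• M-step: update (µ, δ) by maximizing $E[L(\mu, \delta \mid x_1, x_2) \mid x_1, \hat\mu, \hat\delta]$.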

The biggest drawback of EM is that the iterations require a lot of calculation time until convergence. Here we introduce a fast parametric estimation method based on the moment condition in Proposition 1. Consider the first two moment equations, with weights $W_1 = 2\cos(2\beta)/\pi = 2\delta/\pi$ and $W_2 = 1/2$:

$$\int_{-\infty}^{\infty} x f(x)\,dx - \mu = \frac{2\delta}{\pi}\int_0^{\infty} r\,g(r)\,dr, \tag{25}$$

$$\int_{-\infty}^{\infty} x^2 f(x)\,dx - 2\mu\int_{-\infty}^{\infty} x f(x)\,dx + \mu^2 = \frac12\int_0^{\infty} r^2 g(r)\,dr, \tag{26}$$

where $\int_{-\infty}^{\infty} x^2 f(x)\,dx$ and $\int_{-\infty}^{\infty} x f(x)\,dx$ can be estimated by $\frac1n\sum_{i=1}^n x_i^2$ and $\frac1n\sum_{i=1}^n x_i$ respectively. In the parametric setting we have $g(r) = r \cdot \exp(-\frac12 r^2)$, so we can solve for µ from (26), and then for δ from (25). Using higher moments is not encouraged since the tail has too much weight and we only have a limited number of data points.
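The two moment equations can be solved in closed form in the skew-normal case. With $g(r) = r\exp(-r^2/2)$ we have $\int_0^\infty r\,g(r)\,dr = \sqrt{\pi/2}$ and $\int_0^\infty r^2 g(r)\,dr = 2$, so (26) gives $(m_1 - \mu)^2 = 1 - (m_2 - m_1^2)$ and (25), taking the first-moment weight as $W_1 = 2\delta/\pi$ from Proposition 1, gives $m_1 - \mu = \delta\sqrt{2/\pi}$. The sketch below is one possible implementation under the additional assumption δ ≥ 0, which selects the root µ < m_1; the function names, sample size and seed are illustrative and not taken from the paper.

```python
import math
import random

def se_moment_estimates(sample):
    """Solve (26) for mu and then (25) for delta, assuming g(r) = r*exp(-r^2/2)
    and delta >= 0 (the sign assumption picks the root mu <= sample mean)."""
    n = len(sample)
    m1 = sum(sample) / n
    m2 = sum(v * v for v in sample) / n
    var = m2 - m1 * m1
    gap = math.sqrt(max(0.0, 1.0 - var))        # |m1 - mu| from (26)
    mu_hat = m1 - gap
    delta_hat = gap * math.sqrt(math.pi / 2.0)  # from (25): gap = delta*sqrt(2/pi)
    return mu_hat, delta_hat

def simulate(n, mu, delta, seed=0):
    """Bivariate normal with correlation delta; keep y (location mu) when x > 0."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - delta ** 2)
    out = []
    for _ in range(n):
        x, e = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if x > 0.0:
            out.append(mu + delta * x + s * e)
    return out

mu_hat, delta_hat = se_moment_estimates(simulate(200000, 2.8, 0.5, seed=3))
```

Only the first two sample moments enter, so the work beyond one pass over the data does not grow with the sample size.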

Table 2 shows the results of performance comparisons between different methods. The results show that the MLE has the worst performance; the bias of δ is almost equal to the true δ in each scenario, indicating that the MLE of δ is around 0 most of the time. The EM algorithm performs quite well, but the calculation time is very large. Our estimation approach using moment conditions (SE) performs better than both MLE and EM, since it requires little computation time and leads to accurate estimates. Note that the computation time for our estimation method is almost independent of data size (the only difference is in estimating the mean of x and x²), while the computation time required by EM increases with data size. Therefore, our method is to be preferred particularly when the data set is large.


Model                         Method   Time (seconds)   Bias µ    Bias δ    MSE µ    MSE δ
n = 1000, µ = 2.8, δ = 0.1    SE       0.427            -0.036    0.037     0.020    0.029
                              EM       1523             -0.056    -0.080    0.020    0.031
                              MLE      31.53            0.058     -0.080    0.008    0.014
n = 10000, µ = 2.8, δ = 0.1   SE       0.472            0.010     -0.010    0.004    0.007
                              EM       16802            -0.017    0.022     0.004    0.007
                              MLE      267.01           0.082     -0.100    0.007    0.010
n = 1000, µ = 2.9, δ = 0.3    SE       0.392            0.009     -0.021    0.015    0.019
                              EM       1103             0.024     -0.040    0.016    0.022
                              MLE      25.77            0.231     -0.300    0.055    0.090
n = 10000, µ = 2.9, δ = 0.3   SE       0.431            -0.004    0.005     0.013    0.003
                              EM       7700             -0.003    -0.003    0.002    0.003
                              MLE      252.56           0.240     -0.300    0.057    0.090

Table 2: Performance comparisons. SE stands for the "skewed elliptical" method based on (25) and (26). EM stands for the "expectation maximization" algorithm and MLE stands for "numerical maximum likelihood estimation". The data generation processes are the same as in Table 1. For each algorithm in each scenario the simulation has been carried out 20 times.

3.7 Comparison with other semi-parametric models

Semi-parametric estimation for skewed elliptical data has so far not been researched in depth. Genton and Loperfido (2005) proposed a generalized skew-elliptical distribution with the form

$$f(x) = 2|\Sigma|^{-1/2}\, g(\Sigma^{-1/2}(x-\mu))\, \pi(\Sigma^{-1/2}(x-\mu)), \tag{27}$$

where π(·) is the skewing function which satisfies π(x) ∈ [0, 1] and π(x) + π(−x) = 1. Ma and Genton (2004) proposed using $\Phi(P_k(x))$ to estimate π(x), where $P_k(x)$ is an odd polynomial of order k, and Φ(·) could be the normal cumulative distribution function or the logit link function [i.e., Φ(x) = 1/(1 + exp(−x))]. This estimation is efficient under regularity assumptions (Ma et al. (2005)). Since (27) still assumes a half density form of the distribution function, this model is less general than our model, which only assumes the elliptical form $(x-\mu)'\Sigma^{-1}(x-\mu)$ that is also the basic structure of the generalized skew-elliptical distribution (27).

We conducted a further simulation study in order to compare the estimates from our model with the MLEs from the generalized skew-elliptical distributions (GSE). We used Φ(x) = 1/(1 + exp(−x)) and estimated $P_k(x)$ for k = 1 and k = 3. The results are summarized in Table 3.

Model                        Method       Time               Bias µ    Bias δ     MSE µ     MSE δ
normal, µ = 2.8, δ = 0.1     SE           331.33 (average)   0.090     -0.111     0.0175    0.0103
                             GSE(k = 1)   125.43             -0.004    -          0.005     -
                             GSE(k = 3)   424.53             -0.128    -          0.029     -
t3, µ = 2.8, δ = 0.1         SE           359.05 (average)   0.097     -0.1015    0.004     0.005
                             GSE(k = 1)   166.29             -0.474    -          1.578     -
                             GSE(k = 3)   384.42             -0.539    -          1.817     -

Table 3: Performance comparisons with the MLEs of the generalized skew-elliptical distribution. For each scenario, the simulation was conducted 20 times.

The results in Table 3 show that the MLE of GSE has some limitations. The true data generating process needs to have a known closed form for the first part of (27). Mis-specification of the first part can lead to estimation failure. For the simulation, we assume that the first part of the density in (27) is normal. When the true density is $t_3$, the estimated values are quite far from the true parameter values, while our approach based on Fourier series can still lead to accurate estimates, albeit with a large cost in computation time.

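4 Skewed elliptical structure in the two dimensional case

In this section we provide the specific density function for the two-dimensional case, and briefly discuss parameter estimation. We have n = 3, and let $x_{n-1} = [x; y]$ and

$$\Sigma = \begin{bmatrix} 1 & \rho & \delta_1 \\ \rho & 1 & \delta_2 \\ \delta_1 & \delta_2 & 1 \end{bmatrix}.$$

According to (3), after simple calculations we obtain the density function as

$$f^R(x, y) = \frac{1}{2\pi R\sqrt{(1-\rho^2)R^2 - (x^2 - 2\rho xy + y^2)}}.$$

Here $B(x_{n-1})$ and $C(x_{n-1})$ become

$$B(x, y) = (\delta_1 - \rho\delta_2)x + (\delta_2 - \rho\delta_1)y, \qquad C(x, y) = \frac{1}{k}\left[(1-\delta_2^2)x^2 + 2(\delta_1\delta_2 - \rho)xy + (1-\delta_1^2)y^2\right], \tag{28}$$

where $k = (1-\delta_1^2)(1-\delta_2^2) - (\rho - \delta_1\delta_2)^2$. With a similar argument as in the one-dimensional case, we conclude that the conditional density is given by

$$f^R(x, y \mid u_1 > 0) = \begin{cases} \dfrac{2}{2\pi R\sqrt{(1-\rho^2)R^2 - (x^2 - 2\rho xy + y^2)}} & \text{if } B(x, y) > 0 \text{ and } C(x, y) > R^2, \\[2ex] \dfrac{1}{2\pi R\sqrt{(1-\rho^2)R^2 - (x^2 - 2\rho xy + y^2)}} & \text{otherwise.} \end{cases} \tag{29}$$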

Figure 3: The graph shows the areas where the conditional density is defined. The black line is B(x, y), the red elliptical contour is $C(x, y) = R^2$, and the blue elliptical contour is $x^2 - 2\rho xy + y^2 = (1-\rho^2)R^2$. If B(x, y) > 0, the conditional density moves the density from area II to I (areas between the red curve and the blue curve) and vice versa.

Figure 3 shows the area where a skewed elliptical contour is defined. We can see that, as in the one-dimensional case, the skewness is caused by the "transport" of the density. In the two-dimensional case the density is moved to area I from area II symmetrically with respect to the origin. If there are location parameters, say $\mu_1$ for x and $\mu_2$ for y, then the density transport is symmetric with respect to $[\mu_1, \mu_2]$. This source of skewness suggests recovering the symmetric density by moving the density back to the origin or the center. Once symmetry is recovered, we can estimate ρ with the contour.

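4.1 Marginal density and moment conditions

From (28) and (29) we can derive the marginal density of x on each two-dimensional skewed sphere with radius R (assume that x is centered by $x = x - \mu_1$):

If $\delta_1 < 0$, then

$$f^{r=R}(x) = \begin{cases} \dfrac{1}{R} & \text{if } m_{x_1} \le x < m_{x_2}, \\[1.5ex] \dfrac{1}{R} - \dfrac{\arcsin(\mathrm{upper}(x)) - \arcsin(\mathrm{lower}(x))}{2\pi R} & \text{if } m_{x_2} \le x < m_{x_I}, \\[1.5ex] \dfrac{\arcsin(\mathrm{upper}(x)) - \arcsin(\mathrm{lower}(x))}{2\pi R} + \dfrac{\arcsin(\mathrm{lower}(x)) + \pi/2}{\pi R} & \text{if } m_{x_I} \le x < M_{x_I} \text{ and } \mathrm{sign}(\rho)(\rho\delta_2)^2 > \mathrm{sign}(\rho)(\delta_1 - \rho\delta_2)^2, \\[1.5ex] \dfrac{\pi/2 - \arcsin(\mathrm{upper}(x))}{\pi R} & \text{if } m_{x_I} \le x < M_{x_I} \text{ and } \mathrm{sign}(\rho)(\rho\delta_2)^2 \le \mathrm{sign}(\rho)(\delta_1 - \rho\delta_2)^2, \\[1.5ex] \dfrac{\arcsin(\mathrm{upper}(x)) - \arcsin(\mathrm{lower}(x))}{2\pi R} & \text{if } M_{x_I} \le x < M_{x_2}. \end{cases}$$

If $\delta_1 > 0$, then

$$f^{r=R}(x) = \begin{cases} \dfrac{1}{R} & \text{if } M_{x_2} \le x < M_{x_1}, \\[1.5ex] \dfrac{1}{R} - \dfrac{\arcsin(\mathrm{upper}(x)) - \arcsin(\mathrm{lower}(x))}{2\pi R} & \text{if } M_{x_I} \le x < M_{x_2}, \\[1.5ex] \dfrac{\arcsin(\mathrm{lower}(x)) + \pi/2}{\pi R} & \text{if } m_{x_I} \le x < M_{x_I} \text{ and } \mathrm{sign}(\rho)(\rho\delta_2)^2 < \mathrm{sign}(\rho)(\delta_1 - \rho\delta_2)^2, \\[1.5ex] \dfrac{\pi/2 - \arcsin(\mathrm{upper}(x))}{\pi R} & \text{if } m_{x_I} \le x < M_{x_I} \text{ and } \mathrm{sign}(\rho)(\rho\delta_2)^2 \ge \mathrm{sign}(\rho)(\delta_1 - \rho\delta_2)^2, \\[1.5ex] \dfrac{\arcsin(\mathrm{upper}(x)) - \arcsin(\mathrm{lower}(x))}{2\pi R} & \text{if } m_{x_2} \le x < m_{x_I}, \end{cases} \tag{30}$$

where $M_{x_1} = R$, $m_{x_1} = -R$, $M_{x_2} = R\sqrt{1-\delta_1^2}$, $m_{x_2} = -M_{x_2}$, $I = (\delta_2 - \rho\delta_1)^2/(\delta_1^2 + \delta_2^2 - 2\rho\delta_1\delta_2)$, $M_{x_I} = R\sqrt{I}$, $m_{x_I} = -M_{x_I}$ and

$$\mathrm{upper}(x) = \left(-\frac{\delta_1\delta_2 - \rho}{1-\delta_1^2}\,x - \rho x + \sqrt{\frac{k}{1-\delta_1^2}\left(R^2 - \frac{x^2}{1-\delta_1^2}\right)}\right) \Big/ \sqrt{(1-\rho^2)(R^2 - x^2)},$$

$$\mathrm{lower}(x) = \left(-\frac{\delta_1\delta_2 - \rho}{1-\delta_1^2}\,x - \rho x - \sqrt{\frac{k}{1-\delta_1^2}\left(R^2 - \frac{x^2}{1-\delta_1^2}\right)}\right) \Big/ \sqrt{(1-\rho^2)(R^2 - x^2)}.$$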

To derive the moment function, we first introduce two propositions:

Proposition 2. If $\delta_1 < 0$, then the following equation holds:

$$M_n(x) = \int_{-\infty}^{\infty} (x-\mu)^n \left(\int_{h(x)}^{\infty} f^{r=R}(x)\,g(R)\,dR\right) dx = \int_0^{\infty} \left(\int_{-R}^{\sqrt{1-\delta_1^2}\,R} (x-\mu)^n f^{r=R}(x)\,dx\right) g(R)\,dR,$$

where h(x) is a linear function of x which can be either $(x-\mu)/\sqrt{1-\delta_1^2}$ or µ − x depending on whether x > µ or x ≤ µ.

The proof of this proposition is relatively straightforward: $f^{r=R}(x)$ can be defined to be equal to zero when R < x, so that $\int_{h(x)}^{\infty} f^{r=R}(x)\,g(R)\,dR$ becomes $\int_0^{\infty} f^{r=R}(x)\,g(R)\,dR$. Also note that $f^{r=R}(x) = 0$ when x is outside the interval $[-R, \sqrt{1-\delta_1^2}\,R]$, so that we can write $\int_{-\infty}^{\infty} (x-\mu)^n f^{r=R}(x)\,dx$ as $\int_{-R}^{\sqrt{1-\delta_1^2}\,R} (x-\mu)^n f^{r=R}(x)\,dx$, which completes the proof.

Proposition 3. For densities on spheres with different radius, we have

$$R \cdot f^{r=R}(Rx) = f^{r=1}(x).$$

This proposition follows directly from (30). With these two propositions, we have the following moment conditions ($\delta_1 < 0$):

$$\int_{-\infty}^{\infty} (x-\mu)^n f_{\mathrm{margin}}(x)\,dx = \int_0^{\infty} \left(\int_{-R}^{\sqrt{1-\delta_1^2}\,R} \tilde x^n f^{r=R}(\tilde x)\,d\tilde x\right) g(R)\,dR = \int_{-1}^{\sqrt{1-\delta_1^2}} \tilde x^n f^{r=1}(\tilde x)\,d\tilde x \int_0^{\infty} R^n g(R)\,dR = M^{r=1}_n \cdot \int_0^{\infty} R^n g(R)\,dR, \tag{31}$$

where $\tilde x = x - \mu_1$ and $M^{r=1}_n$ is the n-th moment of the random variable uniformly distributed on the sphere with radius one. The second equality uses Proposition 3 with the substitution $\tilde x \to R\tilde x$. This gives us the relationship between the moments of the random variable and the moments of g(R). As in the previous section, if we know the closed form of g(R), we can estimate the parameters through moment conditions.

4.2 Parameter estimation

Similar to subsection 3.4 we first need to estimate the function g(r). According to (3) and (9) we have

$$\int_x^{\infty} \frac{1}{\pi\sqrt{R^2 - x^2}}\, g_2(R)\,dR = \int_x^{\infty} \frac{1}{2R}\, g(R)\,dR. \tag{32}$$

Then $g_2(R)$ can be estimated non-parametrically if we know $\mu_1$, $\mu_2$ and ρ, and assume that we have data $(x_i, y_i)$, i = 1, ..., n:

a. center the data: $(\tilde x_i, \tilde y_i) = (x_i - \mu_1, y_i - \mu_2)$.

b. calculate the radius: $r_i = \sqrt{(\tilde x_i^2 - 2\rho\,\tilde x_i\tilde y_i + \tilde y_i^2)/(1-\rho^2)}$.

c. calculate $g_2(R)$ in (32) non-parametrically (kernel or empirical method) using $r_i$; denote the estimate by $\hat g_2(R)$.
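Steps (a) and (b) above amount to a centering followed by a quadratic form. A minimal sketch, with a hypothetical function name and made-up input points; step (c) would then apply any kernel or histogram density estimator to the returned radii:

```python
import math

def elliptical_radius(points, mu1, mu2, rho):
    """Steps (a) and (b): center each (x, y) and compute the elliptical radius r_i."""
    radii = []
    for x, y in points:
        xc, yc = x - mu1, y - mu2                       # step (a): centering
        q = (xc * xc - 2.0 * rho * xc * yc + yc * yc) / (1.0 - rho * rho)
        radii.append(math.sqrt(q))                      # step (b): radius
    return radii

# Made-up points, used only to show the call.
radii = elliptical_radius([(0.0, 0.3), (-1.0, 1.3), (0.5, 0.3)], mu1=-1.0, mu2=0.3, rho=0.6)
```

The quadratic form is non-negative whenever |ρ| < 1, so the square root is always defined.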

Once we have $\hat g_2(R)$, we can estimate the left hand side of (32); denote it by $\hat H(x)$. Then g(R) can be estimated by

$$\hat g(R) = -2R\,\frac{\partial \hat H(R)}{\partial R}.$$
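The derivative formula follows from differentiating the right hand side of (32) with respect to its lower limit. As a worked check in the trivariate normal case, (2) with n = 3 gives $g(R) = \sqrt{2/\pi}\,R^2\exp(-R^2/2)$, and then $H(x) = \int_x^\infty g(R)/(2R)\,dR = \exp(-x^2/2)/\sqrt{2\pi}$. The sketch below recovers g from this H with a central finite difference; the closed-form H, the step size and the function names are assumptions of the example, and in practice H would be replaced by its estimate.

```python
import math

def g_from_H(H, R, h=1e-4):
    """Estimate g(R) = -2*R*dH/dR with a central finite difference of width h."""
    dH = (H(R + h) - H(R - h)) / (2.0 * h)
    return -2.0 * R * dH

def H_normal(x):
    """Right hand side of (32) for the trivariate normal case (worked out by hand)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def g3_normal(R):
    """Radial density from (2) with n = 3 for the standard normal."""
    return math.sqrt(2.0 / math.pi) * R * R * math.exp(-0.5 * R * R)

errors = [abs(g_from_H(H_normal, R) - g3_normal(R)) for R in (0.5, 1.0, 2.0)]
```

With an estimated and therefore noisy H, the finite-difference step would need to be much larger, or the derivative taken from a smoothed fit.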

Figure 4: The left plot gives the estimated g(R) versus the true g(R) in the case of the normal distribution, while the right plot shows the case of the $t_3$ distribution (curves: real $g_3(r)$ and estimated $g_3(r)$).

There is a close relationship between µ1, µ2 and ρ by which we can estimate ρ through µ1 and µ2. We discussed earlier how the skewness is caused by density transportation. Figure 5 shows the contours for the two-dimensional t and normal densities with parametric forms ft(x′Σ−1x) and fn(x′Σ−1x) respectively. The red points represent the elliptical contour x′Σ−1x = 1, whose shape is determined by the value of δ. We see that the t and the normal distributions produce exactly the same shape of elliptical contours, even though they are generated by different functions g(r). Therefore the estimation of ρ is independent of the g(r) function. In other words, without knowing the explicit form of g(r) we can still estimate ρ. According to our theory, if x, y follows a skew-elliptical distribution, then x, y follows a conditional distribution. Suppose we also observe the third variable z, and the underlying 3-dimensional density is f(x,y,z); then x, y follows f(x,y | z > 0). It follows that 2µ1 − x, 2µ2 − y follows f(x,y | z ≤ 0), and {x, y} ∪ {2µ1 − x, 2µ2 − y} follow the marginal density

$$\int_{-\infty}^{\infty} f(x,y,z)\, dz.$$

Figure 5: The contours for the t pdf (left), the normal pdf (right), and x′Σ−1x = 1

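Denote $\{\bar{x}, \bar{y}\} = \{x, y\} \cup \{2\mu_1 - x,\; 2\mu_2 - y\}$. Since $\bar{x}, \bar{y}$ follow the non-skew marginal density, ρ can be easily estimated with the correlation of $\bar{x}, \bar{y}$. In other words, if

$$\operatorname{corr}(\bar{x}, \bar{y}) = \begin{bmatrix} 1 & \hat{\rho} \\ \hat{\rho} & 1 \end{bmatrix},$$

then $\hat{\rho}$ is a good estimator of ρ. If we replace ρ by $\hat{\rho}$ in the process of estimating g(R), we only need µ1 and µ2 to estimate $\hat{g}(R)$. So from now on we write $\hat{g}(R) = \hat{g}(R, \mu_1, \mu_2)$.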
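The reflection argument can be illustrated with a short simulation. This is a sketch under stated assumptions: the helper name `estimate_rho` is hypothetical, µ1 and µ2 are treated as known, and the data are drawn from a trivariate normal truncated at z ≥ 0 with the parameter values from the "true" rows of Tables 4 to 6.

```python
import numpy as np

def estimate_rho(x, y, mu1, mu2):
    """Pool the sample with its reflection through (mu1, mu2) and
    return the correlation of the pooled, symmetrized sample."""
    x_bar = np.concatenate([x, 2.0 * mu1 - x])
    y_bar = np.concatenate([y, 2.0 * mu2 - y])
    return float(np.corrcoef(x_bar, y_bar)[0, 1])

# Illustrative data (assumed setup): keep (x, y) where z >= 0.
rng = np.random.default_rng(0)
mu1, mu2 = -1.0, 3.0
cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, -0.5],
                [0.3, -0.5, 1.0]])
draws = rng.multivariate_normal([mu1, mu2, 0.0], cov, size=200_000)
kept = draws[draws[:, 2] >= 0]

rho_hat = estimate_rho(kept[:, 0], kept[:, 1], mu1, mu2)
# Correlation of the skewed sample itself, for comparison only.
rho_naive = float(np.corrcoef(kept[:, 0], kept[:, 1])[0, 1])
```

In this setup the pooled estimate should land close to the true value 0.6, while the correlation of the skewed sample alone is noticeably larger, because the truncation changes both the covariance and the variances.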

After deriving $\hat{g}(R, \mu_1, \mu_2)$, we can recover the whole density of xi, yi by using (29). As in the one-dimensional case, in order to increase the accuracy, we use the difference of the densities at (x, y) and (2µ1 − x, 2µ2 − y), the absolute value of which, by (29) and the discussion of density transportation, should be equal to

$$d(x,y) = \int_{lb(x,y)}^{ub(x,y)} \frac{2\,\hat{g}(R, \mu_1, \mu_2)}{2\pi R \sqrt{(1 - \hat{\rho}^2)R^2 - \big((x-\mu_1)^2 - 2\hat{\rho}(x-\mu_1)(y-\mu_2) + (y-\mu_2)^2\big)}}\, dR$$

where

$$lb(x,y) = \sqrt{\big((x-\mu_1)^2 - 2\hat{\rho}(x-\mu_1)(y-\mu_2) + (y-\mu_2)^2\big)/(1 - \hat{\rho}^2)},$$

$$ub(x,y) = \sqrt{C(x-\mu_1,\, y-\mu_2)} \quad (C(x,y) \text{ is defined in (28)}).$$

We can take different pairs xi, yi and compare d(xi, yi). Note that d(xi, yi) is a function of µ1, µ2, δ1, δ2, so we can write d(xi, yi) as d(xi, yi, µ1, µ2, δ1, δ2). One criterion for choosing the parameters could be minimizing:

$$D(\mu_1, \mu_2, \delta_1, \delta_2) = \iint d(x, y, \mu_1, \mu_2, \delta_1, \delta_2)\, dx\, dy. \tag{33}$$

In practice, we can choose sufficiently many pairs xi, yi and let D(µ1, µ2, δ1, δ2) be the sum of all d(xi, yi). The choice of criterion is again not unique; it can be, for example, minimizing the distance between the estimated densities, maximizing the likelihood function, etc. In the next section we give estimation results from a simulated data set in both a semi-parametric and a parametric framework.

We generate a data set of n observations (xi, yi, zi), i = 1, ..., n from either the normal or the t3 distribution, with [µ1 = −1, µ2 = 3, µ3 = 0] and

$$\Sigma = \begin{bmatrix} 1 & 0.6 & 0.3 \\ 0.6 & 1 & -0.5 \\ 0.3 & -0.5 & 1 \end{bmatrix}.$$

Then we only keep (xi, yi) where zi ≥ 0, so we have about n/2 observations left. First we estimate the function g(r) non-parametrically. The results are shown in Tables 4 and 5.
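The data-generating step can be sketched as below. The function name `simulate_skewed` is hypothetical, and the t3 draws use the standard normal-over-chi-square scale mixture, which is an assumption about how the t3 sample was produced. The value µ2 = 3 follows the "true" rows of Tables 4 to 6.

```python
import numpy as np

MU = np.array([-1.0, 3.0, 0.0])
SIGMA = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, -0.5],
                  [0.3, -0.5, 1.0]])

def simulate_skewed(n, dist="normal", df=3, seed=0):
    """Draw n points (x, y, z) and keep (x, y) for the rows with z >= 0."""
    rng = np.random.default_rng(seed)
    noise = rng.multivariate_normal(np.zeros(3), SIGMA, size=n)
    if dist == "t":
        # Multivariate t as a scale mixture: normal / sqrt(chi2_df / df).
        w = rng.chisquare(df, size=n)
        noise = noise / np.sqrt(w / df)[:, None]
    elif dist != "normal":
        raise ValueError("dist must be 'normal' or 't'")
    full = MU + noise
    return full[full[:, 2] >= 0, :2]

normal_xy = simulate_skewed(100_000, "normal")
t3_xy = simulate_skewed(100_000, "t")
```

Because the third coordinate is symmetric around µ3 = 0 in both cases, roughly half of the n draws survive the truncation.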

Parametric estimation is much more straightforward. Just as in the one-dimensional case, we can use moment conditions to solve for all the parameters. Note that since we have all the information on g(R), we also know its characteristic function, so we can choose n in (31) as a non-integer number. However, the empirical estimation of the higher moment condition is not very accurate, so it is better not to use values for n larger than 3. The results of parametric estimation are shown in Table 6.


Distribution              µ1         µ2         δ1         δ2         ρ
Normal    true           -1          3          0.3       -0.5        0.6
          estimate       -1.0722     3.0788     0.3670    -0.5705     0.5533
t3        true           -1          3          0.3       -0.5        0.6
          estimate       -0.9281     3.1388     0.2251    -0.5910     0.5877

Table 4: Estimation results assuming g(r) unknown and n = 1,000,000.

Distribution              µ1         µ2         δ1         δ2         ρ          d
Normal    true           -1          3          0.3       -0.5        0.6        −
          estimate       -0.9281     3.0659     0.2076    -0.5434     0.6032     0.0841
t3        true           -1          3          0.3       -0.5        0.6        −
          estimate       -0.9650     3.0450     0.2660    -0.5266     0.6057     0.0283

Table 5: Estimation results assuming g(r) unknown and using the true distribution

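We can also calculate the correlation on the sphere with radius one as a function of δ1, δ2, ρ. For example, here we show the correlation in the case when δ1 > 0:

$$
\begin{aligned}
\iint xy\, f^{r=1}(x,y)\, dx\, dy &\equiv \operatorname{cov}^{r=1}(x,y) \\
&= 2\left( \int_{m_{xI}}^{M_{xI}} S\!\left(x, -\tfrac{\pi}{2}, \tfrac{\pi}{2}\right) dx - \int_{M_{xI}}^{M_{x2}} S\big(x, \arcsin(\mathrm{lower}(x)), \arcsin(\mathrm{upper}(x))\big)\, dx \right) \\
&\quad + \int_{m_{x2}}^{M_{x2}} S\big(x, \arcsin(\mathrm{lower}(x)), \arcsin(\mathrm{upper}(x))\big)\, dx \\
&\quad + 2 \int_{M_{x2}}^{M_{x1}} S\!\left(x, -\tfrac{\pi}{2}, \tfrac{\pi}{2}\right) dx + M
\end{aligned}
$$

where

$$S(x, lb, ub) = x \left( \frac{\rho x\, \big(ub(x) - lb(x)\big)}{2\pi} - \frac{\sqrt{(1-\rho^2)(1-x^2)}}{2\pi} \big(\cos(ub(x)) - \cos(lb(x))\big) \right),$$

$$\mathrm{upper}(x) = \left( -\frac{\delta_1\delta_2 - \rho}{1 - \delta_1^2}\, x - \rho x + \sqrt{\frac{k}{1-\delta_1^2}\left(R^2 - \frac{x^2}{1-\delta_1^2}\right)} \right) \Big/ \sqrt{(1-\rho^2)(R^2 - x^2)},$$

$$\mathrm{lower}(x) = \left( -\frac{\delta_1\delta_2 - \rho}{1 - \delta_1^2}\, x - \rho x - \sqrt{\frac{k}{1-\delta_1^2}\left(R^2 - \frac{x^2}{1-\delta_1^2}\right)} \right) \Big/ \sqrt{(1-\rho^2)(R^2 - x^2)},$$

and

$$
M = \begin{cases}
2 \displaystyle\int_{m_{xI}}^{M_{xI}} S\!\left(x, \arcsin(\mathrm{upper}(x)), \tfrac{\pi}{2}\right) dx & \text{if } \operatorname{sign}(\rho)\,\delta_2^2 > \operatorname{sign}(\rho)\,(\delta_1 - \rho\,\delta_2)^2, \\[2ex]
2 \displaystyle\int_{m_{xI}}^{M_{xI}} S\!\left(x, -\tfrac{\pi}{2}, \arcsin(\mathrm{lower}(x))\right) dx & \text{if } \operatorname{sign}(\rho)\,\delta_2^2 \le \operatorname{sign}(\rho)\,(\delta_1 - \rho\,\delta_2)^2.
\end{cases}
$$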

It is easy to check that if f(x,y) is the whole density, the following equation holds:

$$\iint xy\, f(x,y)\, dx\, dy = \iint xy\, f^{r=1}(x,y)\, dx\, dy \int_0^{\infty} R^2 g(R)\, dR. \tag{34}$$

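This can be estimated by $\frac{1}{n}\sum_{i=1}^{n} x_i y_i$ quite accurately. So we can choose the first and second moments and the covariance condition (34) to form the equation system:

$$
\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_1)(y_i - \mu_2) &= \iint xy\, f^{r=1}(x,y)\, dx\, dy \int_0^{\infty} R^2 g(R)\, dR \\
\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_1)^2 &= \int x^2 f^{r=1}(x)\, dx \int_0^{\infty} R^2 g(R)\, dR \\
\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \mu_2)^2 &= \int y^2 f^{r=1}(y)\, dy \int_0^{\infty} R^2 g(R)\, dR \\
\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_1) &= \int x\, f^{r=1}(x)\, dx \int_0^{\infty} R\, g(R)\, dR \\
\frac{1}{n} \sum_{i=1}^{n} (y_i - \mu_2) &= \int y\, f^{r=1}(y)\, dy \int_0^{\infty} R\, g(R)\, dR
\end{aligned}
$$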

These five equations can then be used to solve for the five parameters. Further moment conditions can also be added, as we mentioned above, by letting n be non-integer. The estimation results are shown in Table 6.


Distribution   Computing time              µ1         µ2         δ1         δ2         ρ
Normal         1648s (total)      true    -1          3          0.3       -0.5        0.6
                                  bias    -5.3e-4     3.9e-4     8e-4      -1.2e-4    -6.6e-4
                                  MSE      1.2e-5     7.1e-6     2.1e-5     6.3e-6     1.1e-5
t3             2931s (average)    true    -1          3          0.3       -0.5        0.6
                                  bias     0.0183    -0.0095    -0.0244     0.0117     0.0057
                                  MSE      0.0042     0.0022     0.0039     0.0018     4.6e-4

Table 6: Parametric estimation result assuming g(r) is known.

5 Empirical illustration

In this section we show the results of applying our model to the analysis of a dataset of body mass index measurements on Australian athletes. So far, we have assumed that the diagonal elements of Σ are all one, but in reality this is often not the case. Therefore, we first discuss how the model changes when the variance is not one.

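5.1 Adding variance parameters

With small modifications to Σ in subsection 3.1, it becomes:

$$\Sigma = \begin{bmatrix} \Delta^2 & \delta \\ \delta & 1 \end{bmatrix} = \begin{bmatrix} \Delta & 0 \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1 & \delta/\Delta \\ \delta/\Delta & 1 \end{bmatrix} \cdot \begin{bmatrix} \Delta & 0 \\ 0 & 1 \end{bmatrix}. \tag{35}$$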

Thus, if a univariate random variable x is generated by a skew-elliptical distribution as defined in previous sections with µ and Σ as in (35), then (x − µ)/∆ + µ follows another skew-elliptical distribution with the same density of radius g(r), the same µ and new Σ given by:

$$\begin{bmatrix} 1 & \delta/\Delta \\ \delta/\Delta & 1 \end{bmatrix}.$$

Unfortunately, since we do not know µ, we cannot perform the standardization by subtracting the mean and dividing by the standard deviation. We need to treat ∆ as a new parameter. After some straightforward calculations, the first two moment conditions (25) and (26) become:

$$
\begin{aligned}
\int_{-\infty}^{\infty} (x - \mu)\, f(x)\, dx &= \frac{2\delta}{\pi} \int_0^{\infty} r \cdot g(r)\, dr \\
\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx &= \frac{\Delta^2}{2} \int_0^{\infty} r^2 \cdot g(r)\, dr.
\end{aligned} \tag{36}
$$

In parametric settings, the function g(r) is already known. In non-parametric settings, g(r) is not known and we can use the same method described in subsection 3.4 to estimate g(r). We can then estimate µ, δ, and ∆ as follows:

a. write δ and ∆ as functions of µ according to (36).

b. write the density function f as a function of µ. In parametric settings, we can use (7) to calculate the density. In non-parametric settings, we can use the estimated g(r) and (14). When using (14), we need to transform x to x/∆.

c. compute the difference between the estimated density and the kernel density k(x). This is given by $d(\mu) = \int |k(x) - \hat{f}(x)|\, dx$.

d. solve for µ by minimizing d(µ) and then solve for δ and ∆ from the moment conditions.
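Step a can be sketched as follows for the parametric normal case. This is an illustration under assumptions that are not stated in the text: the radius density is taken to be the Rayleigh density g(r) = r exp(−r²/2), which makes (36) reproduce the usual skew-normal mean δ·sqrt(2/π) and second moment ∆²; the function names are hypothetical; and the data are simulated by truncating a bivariate normal with covariance (35).

```python
import numpy as np

def radial_moment(g, power, r_max=12.0, num=20_001):
    """Trapezoidal approximation of the integral of r**power * g(r) over [0, r_max]."""
    r = np.linspace(0.0, r_max, num)
    f = r**power * g(r)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(r)))

def delta_and_scale(x, mu, g):
    """Step a: invert the two moment conditions (36) for a given mu."""
    m1 = np.mean(x - mu)
    m2 = np.mean((x - mu) ** 2)
    delta = np.pi * m1 / (2.0 * radial_moment(g, 1))
    scale = np.sqrt(2.0 * m2 / radial_moment(g, 2))
    return float(delta), float(scale)

def rayleigh(r):
    """Assumed radius density for the normal case."""
    return r * np.exp(-0.5 * r**2)

# Illustrative data (assumed setup): mu = 10, delta = 4.5, Delta = 5.
rng = np.random.default_rng(1)
mu, delta, Delta = 10.0, 4.5, 5.0
cov = np.array([[Delta**2, delta],
                [delta, 1.0]])
draws = rng.multivariate_normal([mu, 0.0], cov, size=200_000)
x = draws[draws[:, 1] > 0, 0]  # keep x where the latent variable is positive

delta_hat, Delta_hat = delta_and_scale(x, mu, rayleigh)
```

Here µ is fixed at its true value, so the sketch only checks the inversion of (36). In the full procedure, `delta_and_scale` would be called for each candidate µ inside the minimization of d(µ) in steps c and d.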

We conduct two simulation studies to show how our model performs with simulated data. We generated 10000 data points xi, i = 1, ..., 10000 with µ = 10, δ = 4.5, and ∆ = 5 assuming both the normal and the t3 distributions, and we estimated the parameters using both the parametric and the non-parametric methods described above. This scenario was repeated 50 times, and the resulting bias and mean squared errors (MSE) are reported in Tables 7 and 8.

In the non-parametric framework, there is a large bias for parameters δ and ∆.


Model   True µ   True δ   True ∆   Bias µ    Bias δ   Bias ∆   MSE µ   MSE δ   MSE ∆
N       10       4.5      5        -0.025    0.003    0.004    0.002   4e-4    4e-4
t3      10       4.5      5        -0.042    0.002    7e-4     0.003   0.001   6e-4

Table 7: Parametric estimation result with ∆

Model   True µ   True δ   True ∆   Bias µ    MSE µ    Bias δ    Bias ∆    Bias δ/∆   MSE δ/∆
N       10       4.5      5         0.321    0.113    -2.301    -2.405    -0.052     0.003
t3      10       4.5      5        -0.042    0.002     7e-4      0.003     0.001     6e-4

Table 8: Non-parametric estimation result with ∆

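In the two-dimensional case, the decomposition becomes:

$$
\begin{bmatrix} \Delta_1^2 & \rho & \delta_1 \\ \rho & \Delta_2^2 & \delta_2 \\ \delta_1 & \delta_2 & 1 \end{bmatrix}
=
\begin{bmatrix} \Delta_1 & 0 & 0 \\ 0 & \Delta_2 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\cdot
\begin{bmatrix} 1 & \frac{\rho}{\Delta_1\Delta_2} & \frac{\delta_1}{\Delta_1} \\ \frac{\rho}{\Delta_1\Delta_2} & 1 & \frac{\delta_2}{\Delta_2} \\ \frac{\delta_1}{\Delta_1} & \frac{\delta_2}{\Delta_2} & 1 \end{bmatrix}
\cdot
\begin{bmatrix} \Delta_1 & 0 & 0 \\ 0 & \Delta_2 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
$$

The estimation process is similar to that discussed above.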

5.2 Estimation with real data

We next illustrate the use of our model for the analysis of a dataset containing the body mass index (BMI) of Australian athletes. The data were reported by Cook and Weisberg (1994), and include BMI measurements for 102 male and 100 female athletes. First, we assume that the BMI variable has a skew-normal distribution or a skew-t distribution. Table 9 gives the parameter estimates in the different cases.

Model             µ          δ          ∆          δ/∆
N                 20.3346    3.2853     3.6992     0.8881
t12               20.6479    2.7074     3.2502     0.8330
Non-parametric    21.1551    12.5939    19.1074    0.6592

Table 9: Estimation results for the Australian athletes BMI data

Figure 6 shows that neither the skewed normal nor the skewed-t distributions can capture the right tail of the BMI variable very well. We next check if the fit improves when no assumption is made on a specific density for g(r). Figure 7 shows that the BMI data can be fitted much better under a non-parametric framework. The difference between the density from our model estimation and the true density is mainly due to the accuracy of the kernel estimation. With a better estimate of the non-parametric density function, our model would better predict the potential g(r) functions.

6 Conclusions

In this paper, we decomposed the traditional skewed elliptical distribution into component structures given by the elliptical contours and the density assigning function. The decomposition allows us to estimate these two components separately, resulting in a faster and more accurate estimation process than both direct likelihood maximization and the expectation-maximization algorithm, if we assume a parametric form. In a semi-parametric framework, when we assume that the assigning function is unknown, our model still provides accurate estimators.

We also discuss the extension of our approach to higher dimension settings. Since the density assigning functions are always one-dimensional, independent of the dimensions of the skewed data, the assigning function is always easy to estimate no matter how many dimensions we have. This density function can take any form, making our model very flexible and general.

Another important application of our model is its use in the symmetric case. If we let the skew parameters be zero, our model becomes a generalized class of elliptical distributions, including the multivariate normal and t-distribution. The decomposition introduced in our paper thus provides another useful perspective for the analysis of this class of elliptical distributions.

[Figure 6, left panel: mu=20.3346; delta=3.2853; Delta=3.6992; distance=0.087131. Right panel: mu=20.6479; delta=2.7074; Delta=3.2502; distance=0.08853. Each panel plots the kernel density ("kernel") against the fitted density ("est").]

Figure 6: Australian athletes BMI data: fitting distributions by adding ∆ in the normal (left plot) and t (right plot) cases

Appendices

A Proof of (8)

First, recall two lemmas from Fang and Zhang (1990, p. 68-69).

Lemma 1: Suppose that nonnegative random variables R and S are independent, R has cdf F, and S is absolutely continuous with density g. Let T = RS, then

$$P(R \le \rho \mid T = t) = \frac{1}{h(t)} \int_0^{\rho} r^{-1} g(t/r)\, dF(r),$$

where

$$h(t) = \int_0^{\infty} r^{-1} g(t/r)\, dF(r).$$

Lemma 2: Under the notation of Lemma 1, let $R_{a^2} = R(a^2)$ for simplicity. We have:

$$P(R_{a^2} \le \rho) = \frac{\int_a^{(\rho^2 + a^2)^{1/2}} (r^2 - a^2)^{\frac{m}{2} - 1}\, r^{-(n-2)}\, dF(r)}{\int_a^{\infty} (r^2 - a^2)^{\frac{m}{2} - 1}\, r^{-(n-2)}\, dF(r)} \tag{37}$$

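for ρ ≥ 0, a > 0 and F(a) < 1.

Now we start to prove (8) with Lemma 1 and Lemma 2. The density function of z can be calculated as:

$$
\begin{aligned}
f(z) &= \Pr(x_{n-1} = z \mid x_1 > 0) \\
&= \frac{\Pr(x_{n-1} = z,\, x_1 > 0)}{\Pr(x_1 > 0)} = 2 \Pr(x_1 > 0 \mid x_{n-1} = z)\, f_{n-1}(z).
\end{aligned} \tag{38}
$$

To calculate Pr(x1 > 0 | xn−1 = z) we need to know the form of f(x1 | xn−1 = z). In our skewed copula formula, $a^2 = (z - \mu)' \Sigma_{11}^{-1} (z - \mu)$ and m = 1. Denote the denominator in Lemma 2 by $C_{a^2}$; the conditional density is given by taking the derivative of (37):

$$f(x_1 \mid x_{n-1} = z) = \frac{1}{C_{a^2} \sqrt{1 - \delta^T \Omega^{-1} \delta}}\; \frac{\pi^{n/2}}{\Gamma(n/2)}\; h(x_1^2)$$

where

$$h(x_1^2) = f_n\!\left( \frac{\big(x_1 - \delta^T \Sigma_{11}^{-1} (z - \mu)\big)^2}{1 - \delta^T \Sigma_{11}^{-1} \delta} + (z - \mu)^T \Sigma_{11}^{-1} (z - \mu) \right)$$

and

$$
\begin{aligned}
C_{a^2} &= \int_a^{\infty} (r^2 - a^2)^{\frac{m}{2} - 1}\, r^{-(n-2)}\, g_n(r)\, dr \\
&= \int_a^{\infty} (r^2 - a^2)^{-\frac{1}{2}}\, r^{-(n-2)}\, g_n(r)\, dr \\
&= \int_a^{\infty} (r^2 - a^2)^{-\frac{1}{2}}\, r^{-(n-2)}\, \frac{2\pi^{n/2}}{\Gamma(n/2)}\, r^{n-1} f_n(r^2)\, dr \\
&= \frac{2\pi^{n/2}}{\Gamma(n/2)} \int_0^{\infty} f_n(x^2 + a^2)\, dx \\
&= \frac{2\pi^{n/2}}{\Gamma(n/2)} \int_0^{\infty} f_n\big(x^2 + (z - \mu)^T \Sigma_{11}^{-1} (z - \mu)\big)\, dx.
\end{aligned}
$$

It follows that

$$
\begin{aligned}
P(x_1 > 0 \mid x_{n-1} = z) &= \frac{1}{2\sqrt{1 - \delta^T \Sigma_{11}^{-1} \delta}}\; \frac{\int_0^{\infty} h(x^2)\, dx}{\int_0^{\infty} f_n\big(x^2 + (z - \mu)^T \Sigma_{11}^{-1} (z - \mu)\big)\, dx} \\
&\qquad \left(\text{let } t^2 = \frac{\big(x - \delta^T \Sigma_{11}^{-1} (z - \mu)\big)^2}{1 - \delta^T \Sigma_{11}^{-1} \delta} \text{ and rename } t \text{ as } x\right) \\
&= \frac{\int_{-\frac{\delta^T \Sigma_{11}^{-1} (z - \mu)}{\sqrt{1 - \delta^T \Omega^{-1} \delta}}}^{\infty} f_n\big(x^2 + (z - \mu)^T \Sigma_{11}^{-1} (z - \mu)\big)\, dx}{2 \int_0^{\infty} f_n\big(x^2 + (z - \mu)^T \Sigma_{11}^{-1} (z - \mu)\big)\, dx} \\
&= \frac{\int_{-\lambda^T (z - \mu)}^{\infty} f_n\big(x^2 + (z - \mu)^T \Sigma_{11}^{-1} (z - \mu)\big)\, dx}{2 \int_0^{\infty} f_n\big(x^2 + (z - \mu)^T \Sigma_{11}^{-1} (z - \mu)\big)\, dx}.
\end{aligned} \tag{39}
$$

Inserting (39) into (38) now leads to (8).

[Figure 7 panel: mu=21.1551; delta=12.5939; Delta=19.1047; distance=0.019036. The panel plots the kernel density ("kernel") against the fitted density ("est").]

Figure 7: Australian athletes BMI data: fitting distributions adding ∆ and assuming that g(r) is unknown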

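B Proof of Proposition 1

Proof: Let x − µ = r · sin(θ). Then

$$
\begin{aligned}
\int_{-\infty}^{\infty} (x - \mu)^n f(x)\, dx &= \int_{-\infty}^{\infty} (x - \mu)^n \int_0^{\infty} f_r(x, r, \beta)\, g(r)\, dr\, dx \\
&= \underbrace{\frac{1}{\pi} \int_{-2\beta}^{2\beta} \sin^n\theta\, d\theta}_{=:E_n^1} \int_0^{\infty} r^n g(r)\, dr + \underbrace{\frac{2}{\pi} \int_{2\beta}^{\pi/2} \sin^n\theta\, d\theta}_{=:E_n^2} \int_0^{\infty} r^n g(r)\, dr.
\end{aligned}
$$

Integrating by parts,

$$
\begin{aligned}
E_n^1 &= -\frac{1}{\pi} \int_{-2\beta}^{2\beta} \sin^{n-1}\theta\; d\cos\theta \\
&= -\frac{1}{\pi} \left[ \sin^{n-1}\theta \cos\theta \Big|_{-2\beta}^{2\beta} - (n-1) \int_{-2\beta}^{2\beta} \cos^2\theta\, \sin^{n-2}\theta\, d\theta \right] \\
&= \underbrace{-\frac{1}{\pi} \sin^{n-1}\theta \cos\theta \Big|_{-2\beta}^{2\beta}}_{A_1} + \frac{n-1}{\pi} \int_{-2\beta}^{2\beta} \sin^{n-2}\theta\, d\theta - \frac{n-1}{\pi} \int_{-2\beta}^{2\beta} \sin^n\theta\, d\theta
\end{aligned}
$$

$$\Rightarrow\; E_n^1 = \frac{A_1}{n} + \frac{n-1}{n}\, E_{n-2}^1.$$

Similarly,

$$E_n^2 = \frac{2 \sin^{n-1}(2\beta) \cos(2\beta)}{n\pi} + \frac{n-1}{n}\, E_{n-2}^2$$

$$\Rightarrow\; W_n = E_n^1 + E_n^2 = \frac{\sin^{n-1}(2\beta) + \sin^{n-1}(-2\beta)}{n\pi} \cdot \cos(2\beta) + \frac{n-1}{n}\, W_{n-2}.$$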
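The recursion for W_n can be checked numerically against direct quadrature of the two angular integrals. The sketch below is illustrative: the function names are hypothetical, and the starting values W_0 = 1 and W_1 = 2cos(2β)/π are not stated in the text but follow from evaluating the definitions of E1 and E2 at n = 0 and n = 1, assuming 0 ≤ 2β ≤ π/2.

```python
import numpy as np

def w_direct(n, beta, num=200_001):
    """W_n from trapezoidal quadrature of the two angular integrals."""
    def integral(a, b):
        theta = np.linspace(a, b, num)
        f = np.sin(theta) ** n
        return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(theta)))
    first = integral(-2.0 * beta, 2.0 * beta) / np.pi       # E^1_n
    second = 2.0 * integral(2.0 * beta, np.pi / 2) / np.pi  # E^2_n
    return first + second

def w_recursive(n, beta):
    """W_n from the recursion, with W_0 and W_1 as starting values."""
    if n == 0:
        return 1.0
    if n == 1:
        return float(2.0 * np.cos(2.0 * beta) / np.pi)
    s = np.sin(2.0 * beta)
    # sin(-2*beta) = -sin(2*beta), so the second term is (-s)**(n-1).
    boundary = (s ** (n - 1) + (-s) ** (n - 1)) * np.cos(2.0 * beta) / (n * np.pi)
    return float(boundary + (n - 1) / n * w_recursive(n - 2, beta))

beta = 0.3
gaps = [abs(w_recursive(n, beta) - w_direct(n, beta)) for n in range(7)]
```

For even n the boundary term vanishes, so W_2 = 1/2 for every β, which matches the factor 1/2 in the second moment condition (36) when ∆ = 1.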

References

ARNOLD, B. C. AND R. J. BEAVER (2000): “The skew-Cauchy distribution,” Statistics & Probability Letters, 49, 285–290.

AZZALINI, A. AND A. CAPITANIO (1999): “Statistical applications of the multivariate skew normal distribution,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61, 579–602.

AZZALINI, A. AND A. DALLA VALLE (1996): “The multivariate skew-normal distribution,” Biometrika, 83, 715–726.

BRANCO, M. D. AND D. K. DEY (2001): “A general class of multivariate skew-elliptical distributions,” Journal of Multivariate Analysis, 79, 99–113.

COOK, R. D. AND S. WEISBERG (1994): An Introduction to Regression Graphics, Wiley.

DEY, D. K. AND J. LIU (2005): “A new construction for skew multivariate distributions,” Journal of Multivariate Analysis, 95, 323–344.

FANG, K. AND Y.-T. ZHANG (1990): Generalized Multivariate Analysis, Springer.

GENTON, M. G. AND N. M. LOPERFIDO (2005): “Generalized skew-elliptical distributions and their quadratic forms,” Annals of the Institute of Statistical Mathematics, 57, 389–401.

GUPTA, A. (2003): “Multivariate skew t-distribution,” Statistics: A Journal of Theoretical and Applied Statistics, 37, 359–363.

LI, Q. AND J. S. RACINE (2007): Nonparametric Econometrics: Theory and Practice, Princeton University Press.

MA, Y. AND M. G. GENTON (2004): “Flexible class of skew-symmetric distributions,” Scandinavian Journal of Statistics, 31, 459–468.

MA, Y., M. G. GENTON, AND A. A. TSIATIS (2005): “Locally efficient semiparametric estimators for generalized skew-elliptical distributions,” Journal of the American Statistical Association, 100, 980–989.

SALINAS, H. S., R. B. ARELLANO-VALLE, AND H. W. GÓMEZ (2007): “The extended skew-exponential power distribution and its derivation,” Communications in Statistics, Theory and Methods, 36, 1673–1689.
