
An Empirical Comparison of EM Initialization Methods and Model Choice Criteria for Mixtures of Skew-Normal Distributions*

José Raimundo Gomes Pereira, Celso Rômulo Barbosa Cabral, Leyne Abuim de Vasconcelos Marques,

José Mir Justino da Costa

Departamento de Estatística – Universidade Federal do Amazonas – Brazil

Abstract

We investigate, via simulation study, the performance of the EM algorithm for maximum likelihood estimation in finite mixtures of skew-normal distributions with component specific parameters. The study takes into account the initialization method, the number of iterations needed to attain a fixed stopping rule and the ability of some classical model choice criteria to estimate the correct number of mixture components. The results show that the algorithm produces quite reasonable estimates when using the method of moments to obtain the starting points and that, combining them with the AIC, BIC, ICL or EDC criteria, represents a good alternative to estimate the number of components of the mixture. Exceptions occur in the estimation of the skewness parameters, notably when the sample size is relatively small, and in some classical problematic cases, as when the mixture components are poorly separated.

Key Words: EM algorithm; Skew-normal distribution; Finite mixture of distributions.

1 Introduction

We define the (finite) mixture of the densities $f_1,\ldots,f_k$ as the density $g(x) = \sum_{i=1}^{k} p_i f_i(x)$, where $p_1,\ldots,p_k$ are unknown positive numbers, named mixing weights, satisfying $\sum_{i=1}^{k} p_i = 1$. We name $f_i$ the $i$-th component of the mixture.

Finite mixtures have been widely used as a powerful tool to model heterogeneous data and to approximate complicated probability densities, presenting multimodality, skewness and heavy tails. These models have been applied in several areas like genetics, image processing, medicine and economics. For comprehensive surveys, see McLachlan & Peel (2000) and Frühwirth-Schnatter (2006).

Maximum likelihood estimation in finite mixtures is a research area with several challenging aspects. There are nontrivial issues, like lack of identifiability and saddle regions surrounding the possible local maxima of the likelihood. Another problem is that the likelihood is possibly unbounded, which happens when the components are normal densities, for example.

*The authors acknowledge the partial financial support from CAPES, CNPq and FAPEAM. Email addresses: [email protected] (José Raimundo Gomes Pereira), [email protected] (Celso Rômulo Barbosa Cabral), [email protected] (Leyne Abuim de Vasconcelos Marques), [email protected] (José Mir Justino da Costa)


There is a lot of literature involving mixtures of normal distributions; some references can be found in the above-mentioned books. In this work we consider mixtures of skew-normal (SN) distributions, as defined by Azzalini (1985). This distribution is an extension of the normal distribution that accommodates asymmetry.

The standard algorithm for maximum likelihood estimation in finite mixtures is the Expectation-Maximization (EM) algorithm of Dempster et al. (1977); see also McLachlan & Krishnan (2008). It is well known that it has slow convergence and that its performance is strongly dependent on the stopping rule and starting points. For normal mixtures, several authors have computationally investigated the performance of the EM algorithm by taking into account initial values (Karlis & Xekalaki, 2003; Biernacki et al., 2003) and asymptotic properties (Nityasuddhi & Böhning, 2003). See also Dias & Wedel (2004) for a broader simulation experiment comparing EM with SEM and MCMC algorithms.

Although there are some proposals to overcome the unboundedness problem in the normal mixture case, involving constrained optimization and alternative algorithms – see Hathaway (1985), Ingrassia (2004), and Yao (2010), for example – it is interesting to investigate the performance of the (unrestricted) EM algorithm in the presence of skewness in the component distributions, since algorithms of this kind have been presented in recent works such as Lin et al. (2007a), Lin et al. (2007b), Basso et al. (2009), Lin (2009a), Lin (2009b) and Lin & Lin (2009).

The main goal of this work is to study the performance of the estimates produced by the EM algorithm, taking into account two different procedures to obtain initial values – the method of moments and a random initialization method – the number of iterations needed to attain a fixed stopping rule, and the ability of some classical model choice criteria (AIC, BIC, ICL and EDC) to estimate the correct number of mixture components. For each initialization method, the consistency of the estimates is analyzed by computing their bias and mean squared errors over repeatedly generated samples, and the mean number of iterations is observed. Fixing the method of initialization, repeated samples of finite mixtures of SN distributions are generated with different numbers of components, and the percentage of times a given criterion chooses the correct number of components is observed. The work is restricted to the univariate case.

The remainder of the paper is organized as follows. In Sections 2, 3 and 4, for the sake of completeness, we give a brief sketch of the SN distribution, of the skew-normal mixture model and of estimation via the EM algorithm, respectively. The simulation study through the analysis of the initialization methods and number of iterations is presented in Section 5. The study through the analysis of the model choice criteria is presented in Section 6. Finally, in order to compare results obtained by the several initialization and model choice methods, two real data sets are analyzed in Section 7.

2 The Skew-normal Distribution

The skew-normal distribution, introduced by Azzalini (1985), is given by the density
$$SN(y|\mu,\sigma^2,\lambda) = 2\,N(y|\mu,\sigma^2)\,\Phi\!\left(\lambda\,\frac{y-\mu}{\sigma}\right),$$

where N(·|µ,σ²) denotes the univariate normal density with mean µ ∈ R and variance σ² > 0, and Φ(·) is the distribution function of the standard normal distribution. In this definition, µ, λ ∈ R and σ² are parameters regulating location, skewness and scale, respectively. For a random variable Y with this distribution, we use the notation Y ∼ SN(µ,σ²,λ).

For a SN random variable Y, a convenient stochastic representation is given next. It can be used to simulate realizations of Y and to implement the EM-type algorithm that will be shown soon. The proof follows easily from Henze (1986).


Lemma 1. A random variable Y ∼ SN(µ,σ²,λ) has a stochastic representation given by
$$Y = \mu + \sigma\delta T + \sigma(1-\delta^2)^{1/2}T_1,$$
where $\delta = \lambda/\sqrt{1+\lambda^2}$, $T = |T_0|$, $T_0$ and $T_1$ are independent standard normal random variables and |·| denotes absolute value.
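As an illustration, Lemma 1 translates directly into a simulation recipe. Below is a minimal R sketch (R being the system used for all computations in this work); the function name rsn_rep is ours:

```r
# Simulate n realizations of Y ~ SN(mu, sigma2, lambda) via Lemma 1.
rsn_rep <- function(n, mu, sigma2, lambda) {
  sigma <- sqrt(sigma2)
  delta <- lambda / sqrt(1 + lambda^2)
  T0 <- abs(rnorm(n))   # T = |T0| has a half-normal distribution
  T1 <- rnorm(n)        # independent standard normal
  mu + sigma * delta * T0 + sigma * sqrt(1 - delta^2) * T1
}

# Example: 1000 draws from SN(5, 9, 6).
y <- rsn_rep(1000, mu = 5, sigma2 = 9, lambda = 6)
```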

In what follows, in order to reduce computational difficulties related to the implementation of the algorithms used for estimation, we use the parametrization
$$\Gamma = (1-\delta^2)\sigma^2 \quad \text{and} \quad \Delta = \sigma\delta,$$
which was first suggested by Bayes & Branco (2007). Note that $(\lambda,\sigma^2) \to (\Delta,\Gamma)$ is a one-to-one mapping. To recover λ and σ², we use
$$\lambda = \Delta/\sqrt{\Gamma} \quad \text{and} \quad \sigma^2 = \Delta^2 + \Gamma.$$
Then, it follows easily from Lemma 1 that
$$Y\,|\,T = t \sim N(\mu + \Delta t,\, \Gamma) \quad \text{and} \quad T \sim HN(0,1), \qquad (1)$$

where HN(0,1) denotes the half-normal distribution with parameters 0 and 1.

The expectation, variance and skewness coefficient of Y ∼ SN(µ,σ²,λ) are respectively given by
$$E(Y) = \mu + \sigma\delta\sqrt{2/\pi}, \qquad Var[Y] = \sigma^2\left(1 - \frac{2}{\pi}\delta^2\right), \qquad \gamma(Y) = \frac{\kappa\,\delta^3}{\left(1 - \frac{2}{\pi}\delta^2\right)^{3/2}}, \qquad (2)$$
where $\kappa = \frac{4-\pi}{2}\left(\frac{2}{\pi}\right)^{3/2}$; see Azzalini (2005, Lemma 2).
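These moment formulas are used later to build starting values; for reference, a direct R transcription of (2) (the function name is ours):

```r
# Mean, variance and skewness coefficient of SN(mu, sigma2, lambda), eq. (2).
sn_moments <- function(mu, sigma2, lambda) {
  sigma <- sqrt(sigma2)
  delta <- lambda / sqrt(1 + lambda^2)
  kappa <- (4 - pi) / 2 * (2 / pi)^(3 / 2)
  list(mean     = mu + sigma * delta * sqrt(2 / pi),
       var      = sigma2 * (1 - 2 * delta^2 / pi),
       skewness = kappa * delta^3 / (1 - 2 * delta^2 / pi)^(3 / 2))
}
```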

3 The FM-SN Model

In this section we give a brief introduction to the finite mixture of SN distributions model, hereafter the FM-SN model. More details can be found in Basso et al. (2009), where a more general class of distributions is considered, and references therein.

The FM-SN model is defined by considering a random sample $\mathbf{y} = (y_1,\ldots,y_n)^\top$ from a mixture of SN densities given by
$$g(y_j|\Theta) = \sum_{i=1}^{k} p_i\, SN(y_j|\theta_i), \quad j = 1,\ldots,n, \qquad (3)$$
where $p_i \geq 0$, $i = 1,\ldots,k$ are the mixing probabilities, $\sum_{i=1}^{k} p_i = 1$, $\theta_i = (\mu_i, \sigma_i^2, \lambda_i)^\top$ is the specific vector of parameters for component i, and $\Theta = ((p_1,\ldots,p_k)^\top, \theta_1^\top, \ldots, \theta_k^\top)^\top$ is the vector with all parameters.

For each j, consider a latent classification random variable $Z_j$ taking values in $\{1,\ldots,k\}$, such that
$$y_j\,|\,Z_j = i \sim SN(\theta_i), \qquad P(Z_j = i) = p_i, \quad i = 1,\ldots,k;\; j = 1,\ldots,n.$$

Then it is straightforward to prove, integrating out $Z_j$, that $y_j$ has density (3). If we combine this result with (1), we have the following stochastic representation for the FM-SN model:
$$y_j\,|\,T_j = t_j, Z_j = i \sim N(\mu_i + \Delta_i t_j,\, \Gamma_i),$$
$$T_j \sim HN(0,1),$$
$$P(Z_j = i) = p_i, \quad i = 1,\ldots,k;\; j = 1,\ldots,n,$$
where
$$\Gamma_i = (1-\delta_i^2)\sigma_i^2, \qquad \Delta_i = \sigma_i\delta_i, \qquad \delta_i = \lambda_i/\sqrt{1+\lambda_i^2}, \quad i = 1,\ldots,k. \qquad (4)$$


4 Estimation

4.1 An EM-type Algorithm

In this section, for the sake of completeness, we present an EM-type algorithm for estimation of the parameters of a FM-SN distribution. This algorithm was presented before in Basso et al. (2009), where a wider class of distributions that includes the FM-SN one is considered.

Here we obtain estimates of the parameters of the FM-SN model using a faster extension of EM called the ECM algorithm (Meng & Rubin, 1993). When applying it to the FM-SN model, we obtain a simple set of closed form expressions to update a current estimate of the vector Θ, as we will see below. It is important to emphasize that this procedure differs from the algorithm presented by Lin et al. (2007b), because in the former case the updating equations for the component skewness parameter have a closed form.

In what follows we consider the parametrization (4). Let
$$\Theta^{(m)} = ((p_1^{(m)},\ldots,p_k^{(m)})^\top, (\theta_1^{(m)})^\top, \ldots, (\theta_k^{(m)})^\top)^\top$$
be the current estimate (at the m-th iteration of the algorithm) of Θ, where $\theta_i^{(m)} = (\mu_i^{(m)}, \Delta_i^{(m)}, \Gamma_i^{(m)})^\top$. The E-step of the algorithm consists in evaluating the expected complete-data log-likelihood – known as the Q-function – defined as
$$Q(\Theta|\Theta^{(m)}) = E[\ell_c(\Theta)\,|\,\mathbf{y}, \Theta^{(m)}],$$
where $\ell_c(\Theta)$ is the complete-data log-likelihood function, given by
$$\ell_c(\Theta) = c + \sum_{j=1}^{n}\sum_{i=1}^{k} z_{ij}\left(\log p_i - \frac{1}{2}\log\Gamma_i - \frac{1}{2\Gamma_i}(y_j - \mu_i - \Delta_i t_j)^2\right),$$
where $z_{ij}$ is the indicator function of the set $(Z_j = i)$ and c is a constant that is independent of Θ.

The M-step consists in maximizing the Q-function over Θ. As the M-step turns out to be analytically intractable, we use, alternatively, the ECM algorithm, an extension that essentially replaces it with a sequence of conditional maximization (CM) steps.

The following scheme is used to obtain an updated value $\Theta^{(m+1)}$. More details about the conditional expectations involved in the computation of the Q-function and the related maximization steps can be found in Basso et al. (2009). Here, φ denotes the standard normal density.

E-step: Given a current estimate $\Theta^{(m)}$, compute $z_{ij}^{(m)}$, $s_{1ij}^{(m)}$ and $s_{2ij}^{(m)}$, for i = 1,...,k and j = 1,...,n, where
$$z_{ij}^{(m)} = \frac{p_i^{(m)}\, SN(y_j|\theta_i^{(m)})}{\sum_{l=1}^{k} p_l^{(m)}\, SN(y_j|\theta_l^{(m)})}, \qquad (5)$$
$$s_{1ij}^{(m)} = z_{ij}^{(m)}\left(\mu_{T_{ij}}^{(m)} + \frac{\phi\big(\mu_{T_{ij}}^{(m)}/\sigma_{T_i}^{(m)}\big)}{\Phi\big(\mu_{T_{ij}}^{(m)}/\sigma_{T_i}^{(m)}\big)}\,\sigma_{T_i}^{(m)}\right),$$
$$s_{2ij}^{(m)} = z_{ij}^{(m)}\left(\big(\mu_{T_{ij}}^{(m)}\big)^2 + \big(\sigma_{T_i}^{(m)}\big)^2 + \frac{\phi\big(\mu_{T_{ij}}^{(m)}/\sigma_{T_i}^{(m)}\big)}{\Phi\big(\mu_{T_{ij}}^{(m)}/\sigma_{T_i}^{(m)}\big)}\,\mu_{T_{ij}}^{(m)}\sigma_{T_i}^{(m)}\right),$$
with
$$\mu_{T_{ij}}^{(m)} = \frac{\Delta_i^{(m)}}{\Gamma_i^{(m)} + \big(\Delta_i^{(m)}\big)^2}\big(y_j - \mu_i^{(m)}\big), \qquad \sigma_{T_i}^{(m)} = \left(\frac{\Gamma_i^{(m)}}{\Gamma_i^{(m)} + \big(\Delta_i^{(m)}\big)^2}\right)^{1/2}.$$


CM-steps: Update $\Theta^{(m)}$ by maximizing $Q(\Theta|\Theta^{(m)})$ over Θ, which leads to the following closed form expressions:
$$p_i^{(m+1)} = n^{-1}\sum_{j=1}^{n} z_{ij}^{(m)},$$
$$\mu_i^{(m+1)} = \frac{\sum_{j=1}^{n}\big(y_j z_{ij}^{(m)} - \Delta_i^{(m)} s_{1ij}^{(m)}\big)}{\sum_{j=1}^{n} z_{ij}^{(m)}},$$
$$\Gamma_i^{(m+1)} = \frac{\sum_{j=1}^{n}\Big(z_{ij}^{(m)}\big(y_j - \mu_i^{(m+1)}\big)^2 - 2\big(y_j - \mu_i^{(m+1)}\big)\Delta_i^{(m)} s_{1ij}^{(m)} + \big(\Delta_i^{(m)}\big)^2 s_{2ij}^{(m)}\Big)}{\sum_{j=1}^{n} z_{ij}^{(m)}},$$
$$\Delta_i^{(m+1)} = \frac{\sum_{j=1}^{n}\big(y_j - \mu_i^{(m+1)}\big) s_{1ij}^{(m)}}{\sum_{j=1}^{n} s_{2ij}^{(m)}}.$$

The algorithm iterates between the E and CM steps until a suitable convergence rule is satisfied. Some classical proposals are to stop (i) if $||\Theta^{(m+1)} - \Theta^{(m)}||$ is sufficiently small; (ii) if $|\ell(\Theta^{(m+1)}) - \ell(\Theta^{(m)})|$ is small enough; or (iii) if $|\ell(\Theta^{(m+1)})/\ell(\Theta^{(m)}) - 1|$ is small enough, where ℓ(Θ) is the actual log-likelihood. In this work we use rule (ii).
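To make the scheme concrete, the E-step and CM-steps can be assembled into a single iteration as in the R sketch below. This is a minimal illustration under the (Γ, ∆) parametrization, not the implementation used in the paper; dsn_ours and ecm_step are our own names, and th is a list holding p, mu, Delta and Gamma.

```r
# SN density, used in the E-step weights (eq. 5).
dsn_ours <- function(y, mu, sigma2, lambda) {
  s <- sqrt(sigma2)
  2 * dnorm(y, mu, s) * pnorm(lambda * (y - mu) / s)
}

# One ECM iteration for the FM-SN model.
ecm_step <- function(y, th) {
  k <- length(th$p)
  sigma2 <- th$Delta^2 + th$Gamma             # recover (sigma2, lambda)
  lambda <- th$Delta / sqrt(th$Gamma)
  f <- sapply(1:k, function(i) th$p[i] * dsn_ours(y, th$mu[i], sigma2[i], lambda[i]))
  z <- f / rowSums(f)                         # E-step: n x k matrix of z_ij
  out <- th
  for (i in 1:k) {
    muT <- th$Delta[i] / (th$Gamma[i] + th$Delta[i]^2) * (y - th$mu[i])
    sT  <- sqrt(th$Gamma[i] / (th$Gamma[i] + th$Delta[i]^2))
    r   <- dnorm(muT / sT) / pnorm(muT / sT)  # phi/Phi ratio
    s1  <- z[, i] * (muT + r * sT)            # E-step quantities s1ij
    s2  <- z[, i] * (muT^2 + sT^2 + r * muT * sT)  # and s2ij
    # CM-steps: closed-form updates
    out$p[i]     <- mean(z[, i])
    out$mu[i]    <- sum(y * z[, i] - th$Delta[i] * s1) / sum(z[, i])
    out$Gamma[i] <- sum(z[, i] * (y - out$mu[i])^2 -
                        2 * (y - out$mu[i]) * th$Delta[i] * s1 +
                        th$Delta[i]^2 * s2) / sum(z[, i])
    out$Delta[i] <- sum((y - out$mu[i]) * s1) / sum(s2)
  }
  out
}
```

Iterating ecm_step until $|\ell(\Theta^{(m+1)}) - \ell(\Theta^{(m)})|$ falls below the tolerance reproduces stopping rule (ii).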

4.2 Problems with Estimation in Finite Mixtures

4.2.1 Unbounded Likelihood

An important issue is the existence of the maximum likelihood estimator for finite mixture models. It is well known that the likelihood of normal mixtures can be unbounded – see Frühwirth-Schnatter (2006, Chapter 6), for example. FM-SN models also have this feature: for instance, let $y_i$, $i = 1,\ldots,n$, be a random sample from a mixture of two SN densities, $SN(\mu, \sigma_1^2, \lambda_1)$ with weight $p \in (0,1)$ and $SN(\mu, \sigma_2^2, \lambda_2)$. Fixing p, $\sigma_1^2$, $\lambda_1$ and $\lambda_2$, we have that the likelihood function $L(\mu, \sigma_2^2)$ evaluated at $\mu = y_1$ satisfies
$$L(y_1, \sigma_2^2) = g(\sigma_2^2)\prod_{i=2}^{n}\big(c_i + h_i(\sigma_2^2)\big) > g(\sigma_2^2)\prod_{i=2}^{n} c_i,$$
where
$$g(\sigma_2^2) = p\,(2\pi\sigma_1^2)^{-1/2} + (1-p)(2\pi\sigma_2^2)^{-1/2},$$
$$c_i = p\,SN(y_i|y_1, \sigma_1^2, \lambda_1), \qquad h_i(\sigma_2^2) = (1-p)\,SN(y_i|y_1, \sigma_2^2, \lambda_2), \quad i = 2,\ldots,n.$$
The strict inequality is valid because $c_i$ and $h_i(\sigma_2^2)$ are positive. Now, we can choose a sequence of positive numbers $(\sigma_2^2)_m$, $m \in \mathbb{N}$, such that $\lim_{m\to\infty}(\sigma_2^2)_m = 0$. Then, because $\prod_{i=2}^{n} c_i > 0$ and $\lim_{m\to\infty} g((\sigma_2^2)_m) = +\infty$, we have that $\lim_{m\to\infty} L(y_1, (\sigma_2^2)_m) = +\infty$.

Thus, following Nityasuddhi & Böhning (2003), we refer to the estimates produced by the algorithm of Section 4.1 as "EM estimates", that is, some sort of solution of the score equation, instead of "maximum likelihood estimates".
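The divergence above is easy to check numerically; a sketch with arbitrary fixed values of p, σ²₁, λ₁ and λ₂, reusing the dsn_ours helper from Section 4.1:

```r
# Log-likelihood of the two-component mixture with both locations at y[1].
loglik_degen <- function(y, sigma2_2, p = 0.5, sigma2_1 = 1,
                         lambda1 = 2, lambda2 = -2) {
  mix <- p * dsn_ours(y, y[1], sigma2_1, lambda1) +
         (1 - p) * dsn_ours(y, y[1], sigma2_2, lambda2)
  sum(log(mix))
}

set.seed(1)
y <- rnorm(50)
# The log-likelihood grows without bound as sigma2_2 -> 0:
sapply(c(1e-1, 1e-3, 1e-5), function(s2) loglik_degen(y, s2))
```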


4.2.2 Non-identifiability

Another nontrivial issue is the lack of identifiability. Strictly speaking, finite mixtures are always non-identifiable, because an arbitrary permutation of the labels of the component parameters leads to the same finite mixture distribution. In the finite mixture context, a more flexible concept of identifiability is used, defined as follows: the class of FM-SN models is identifiable if distinct component densities correspond to distinct mixtures. More specifically, if we have two representations of the same mixture distribution, say
$$\sum_{i=1}^{k} p_i'\, SN(\cdot\,|\,\mu_i', (\sigma_i')^2, \lambda_i') = \sum_{i=1}^{k} p_i\, SN(\cdot\,|\,\mu_i, \sigma_i^2, \lambda_i),$$
then
$$p_i' = p_{\rho(i)}, \qquad \mu_i' = \mu_{\rho(i)}, \qquad (\sigma_i')^2 = \sigma_{\rho(i)}^2, \qquad \lambda_i' = \lambda_{\rho(i)}$$
for some permutation ρ of the indexes 1,...,k (Titterington et al., 1985, Chapter 3).

The identifiability of the normal mixture model was first verified by Yakowitz & Spragins (1968), but it is interesting to note that subsequent discussions in the related literature concerning mixtures of Student-t distributions – see, for example, Peel & McLachlan (2000), Shoham (2002), Shoham et al. (2003) and Lin et al. (2004) – did not present a formal proof of its identifiability.

The above comments suggest some caution when using maximum likelihood estimation in the finite mixture case. One way to circumvent the unboundedness problem is constrained optimization of the likelihood, imposing conditions on the component variances in order to obtain global maximization; see Hathaway (1985), Ingrassia (2004), Ingrassia & Rocci (2007) and Greselin & Ingrassia (2009), for example. The non-identifiability problem is not a major one if we are interested only in the likelihood values, which are robust to label switching. This is the case, for example, when density estimation is the main goal.

In our study we investigate, as in Nityasuddhi & Böhning (2003), only the performance of the EM algorithm when component specific (that is, unrestricted) parameters of the mixture are considered.

5 A Simulation Study of Initial Values for the EM Algorithm

5.1 A Study of Consistency Through the Analysis of Bias and MSE

The EM algorithm is an efficient standard tool for maximum likelihood estimation in finite mixtures of distributions. It is well known that its performance is strongly dependent on the choice of the convergence criterion and starting points. In this work, we do not consider the stopping rule issue; a detailed discussion about this theme can be found in McLachlan & Peel (2000). Our fixed rule is to stop the process at stage m when
$$|\ell(\Theta^{(m+1)}) - \ell(\Theta^{(m)})| < c,$$
where we fixed $c = 10^{-5}$.

The choice of starting values for the EM algorithm plays a big role in parameter estimation. In the mixture context, its importance is amplified because, as noted by Dias & Wedel (2004), there are various saddle regions surrounding the possible local maxima of the likelihood function, and the EM algorithm can be trapped in some of these subsets of the parameter space.

In this work, we make a simulation study in order to compare some methods to obtain starting points for the algorithm proposed in Section 4.1. Previous related work, considering normal mixtures, can be found in Karlis & Xekalaki (2003) and Biernacki et al. (2003). An interesting question is to investigate, in a similar fashion, the performance of the EM algorithm when a skewness parameter for each component density is considered.


We consider the following methods to obtain initial values:

The True Values Method (TVM): consists in initializing the algorithm with the distributional parameter values used to generate the artificial random samples. Of course, the best results are expected when using this method. It has been included in our experiment playing a gold standard role.

The Random Values Method (RVM): to fit the FM-SN model with k components, we first divide the generated random sample into k sub-samples. To do this, the k-means method is employed – see Johnson & Wichern (2007) for details about this clustering method. Let $\varphi_i$ be sub-sample i. Consider the following points artificially generated from uniform distributions over the specified intervals:
$$\xi_i^{(0)} \sim U(\min\{\varphi_i\}, \max\{\varphi_i\}),$$
$$\omega_i^{(0)} \sim U(0, \mathrm{Var}\{\varphi_i\}), \qquad (6)$$
$$\gamma_i^{(0)} \sim U(-0.9953,\, 0.9953),$$
where min{φᵢ}, max{φᵢ} and Var{φᵢ} denote, respectively, the minimum, the maximum and the sample variance of φᵢ, i = 1,...,k. These quantities are taken as rough estimates of the mean, variance and skewness coefficient associated with sub-population i (from which sub-sample i is supposedly collected). The suggested interval for $\gamma_i^{(0)}$ is motivated by the fact that the range of the skewness coefficient in SN models is (−0.9953, 0.9953).

The starting points for the component-specific location, scale and skewness parameters are given respectively by
$$\mu_i^{(0)} = \xi_i^{(0)} - \sqrt{2/\pi}\;\delta(\lambda_i^{(0)})\,\sigma_i^{(0)},$$
$$\sigma_i^{(0)} = \sqrt{\frac{\omega_i^{(0)}}{1 - \frac{2}{\pi}\delta^2(\lambda_i^{(0)})}}, \qquad (7)$$
$$\lambda_i^{(0)} = \pm\sqrt{\frac{\pi\,(\gamma_i^{(0)})^{2/3}}{2^{1/3}(4-\pi)^{2/3} - (\pi-2)(\gamma_i^{(0)})^{2/3}}},$$
where $\delta(\lambda_i^{(0)}) = \lambda_i^{(0)}/\sqrt{1+(\lambda_i^{(0)})^2}$, i = 1,...,k, and the sign of $\lambda_i^{(0)}$ is the same as that of $\gamma_i^{(0)}$. They are obtained by replacing E(Y), Var(Y) and γ(Y) in (2) with their respective estimators in (6) and solving the resulting equations for $\mu_i$, $\sigma_i$ and $\lambda_i$. The initial values for the weights $p_i$ are obtained as
$$(p_1^{(0)},\ldots,p_k^{(0)}) \sim \mathrm{Dirichlet}(1,\ldots,1),$$
a Dirichlet distribution with all parameters equal to 1, that is, a uniform distribution over the unit simplex $\{(p_1,\ldots,p_k):\ p_i \geq 0,\ \sum_{i=1}^{k} p_i = 1\}$.

Method of Moments (MM): in this case the initial values are obtained using equations (7), but replacing $\xi_i^{(0)}$, $\omega_i^{(0)}$ and $\gamma_i^{(0)}$ with the sample mean, variance and skewness coefficient of sub-sample i, respectively. Let n be the sample size and $n_i$ the size of sub-sample i. The initial values for the weights are given by
$$p_i^{(0)} = \frac{n_i}{n}, \quad i = 1,\ldots,k.$$
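The MM version simply plugs the sub-sample moments into the same inversion; a sketch reusing sn_start above (clipping the sample skewness to the admissible SN range is our own implementation detail):

```r
# MM starting values: sub-sample moments plugged into eq. (7).
mm_init <- function(y, k) {
  cl <- kmeans(y, centers = k)$cluster
  theta <- t(sapply(1:k, function(i) {
    phi <- y[cl == i]
    g <- mean((phi - mean(phi))^3) / sd(phi)^3     # sample skewness coefficient
    g <- max(min(g, 0.9953), -0.9953)              # keep within the SN range
    sn_start(xi = mean(phi), omega = var(phi), gamma = g)
  }))
  list(p = as.vector(table(cl)) / length(y), theta = theta)  # p_i = n_i / n
}
```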


To compare the methods, we generated samples from the FM-SN model with k = 2 and k = 3 components. For each fixed sample, we obtained estimates of the parameters using the algorithm presented in Section 4.1, initialized by each proposed method.

Another important issue is the degree of heterogeneity of the components. For k = 2 we considered the "moderately separated" (2MS), "well separated" (2WS) and "poorly separated" (2PS) cases. For k = 3 we considered the "two poorly separated and one well separated" (3PWS) and "three well separated" (3WWS) cases. These degrees of heterogeneity were obtained informally, based on the location parameter values. The main reason to consider them an important factor in our study is that the convergence of the EM algorithm is typically affected when the components overlap largely – see Park & Ozeki (2009) and the references therein.

Figures 1 and 2 show some histograms exemplifying these degrees of heterogeneity. Tables 1 and 2 present the parameter values used in the study.

Figure 1: Histograms of FM-SN data: (a) 2MS, (b) 2WS and (c) 2PS.

Table 1: FM-SN parameters with k = 2.

Case   µ1   µ2   σ²1   σ²2   λ1   λ2   p1    p2
2MS    5    20   9     16    6    -4   0.6   0.4
2WS    5    40   9     16    6    -4   0.6   0.4
2PS    5    15   9     16    6    -4   0.6   0.4

All the computations were made using the R system (R Development Core Team, 2009). Sample sizes were fixed at n = 500, 1000, 5000 and 10000. For each combination of parameters and sample size, 500 samples from the FM-SN model were artificially generated. Then we computed the bias and mean squared error (MSE) over all samples.


Figure 2: Histograms of FM-SN data: (a) 3PWS and (b) 3WWS.

Table 2: FM-SN parameters with k = 3.

Case    µ1   µ2   µ3   σ²1   σ²2   σ²3   λ1   λ2   λ3   p1    p2    p3
3PWS    5    20   28   9     16    16    6    -4   4    0.4   0.3   0.3
3WWS    5    30   38   9     16    16    6    -4   4    0.4   0.3   0.3

For $\mu_i$ they are defined as
$$\mathrm{bias} = \frac{1}{500}\sum_{j=1}^{500}\left(\hat{\mu}_i^{(j)} - \mu_i\right) \quad \text{and} \quad \mathrm{MSE} = \frac{1}{500}\sum_{j=1}^{500}\left(\hat{\mu}_i^{(j)} - \mu_i\right)^2,$$
respectively, where $\hat{\mu}_i^{(j)}$ is the estimate of $\mu_i$ when the data is sample j. Definitions for the other parameters are obtained by analogy.

As a note about implementation, an expected consequence of the non-identifiability cited in Section 4.2.2 is the permutation of the component labels when using the k-means method to perform an initial clustering of the data. This has serious implications when the main subject is the consistency of the estimates. To overcome this problem, we adopted an order restriction on the initial values of the location parameters.
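For completeness, the two summaries amount to column means over the replications; a small R sketch (components assumed already aligned by the order restriction):

```r
# Bias and MSE over 500 replications: est is a 500 x k matrix of estimates
# of one parameter type, truth a length-k vector of true values.
bias_mse <- function(est, truth) {
  rbind(bias = colMeans(est) - truth,
        MSE  = colMeans(sweep(est, 2, truth)^2))
}
```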

Tables 3 and 4 present, respectively, the bias and MSE of the estimates in the 2MS case. With two moderately separated components, and using the TVM and MM methods, the convergence of the estimates is evidenced, as we can conclude by observing the decreasing values of bias and MSE when the sample size increases. They also show that the estimates of the weights pᵢ and of the location parameters µᵢ have lower bias and MSE. On the other side, investigating the MSE values, we can note a different pattern of (slower) convergence to zero for the skewness parameter estimates. This is possibly due to well known inferential problems related to the skewness parameter (DiCiccio & Monti, 2004), suggesting the use of larger samples in order to attain the consistency property.


When we analyze the performances of the initialization methods, we can see that MM and TVM behave similarly (and well), presenting better results than the RVM method for all sample sizes and parameters. When using the RVM we can see that, in general, the absolute values of the biases of the estimates of σ²ᵢ and λᵢ are very large compared with those obtained using the other methods. In general, according to our criteria, we can conclude that the MM method presented a satisfactory performance in all situations.

The bias and MSE of the estimates for the 2WS case are presented in Tables 5 and 6, respectively. As in the 2MS case, their values decrease when the sample size increases (using the methods TVM and MM), although at a slower rate. Comparing the initialization methods, we can see again the poor performance of RVM, notably when estimating σ²ᵢ and λᵢ. The performances of MM and TVM are similar and satisfactory. Thus, we can say that, in general, the conclusions made for the 2MS case are still valid here.

We present the results for the 2PS case in Tables 7 and 8. Bias and MSE are larger than in the 2MS and 2WS cases (for all sample sizes) when using MM and RVM. Also, the consistency of the estimates seems to be unsatisfactory, clearly not attained in the σ²ᵢ and λᵢ cases. According to the related literature, such drawbacks of the algorithm are expected when the population presents remarkable homogeneity. An exception occurs when the initial values are close to the true parameter values; see, for example, McLachlan & Peel (2000) and the above-mentioned work of Park & Ozeki (2009).

Results for the 3PWS case are shown in Tables 9 and 10. It seems that consistency is achieved for pᵢ, µᵢ and σ²ᵢ using TVM and MM. However, this is not the behavior in the λᵢ case. This instability is common to all initialization methods, according to the MSE criterion. Using the RVM method we obtained, as before, larger values of bias and MSE.

Finally, Tables 11 and 12 present the results for the 3WWS case. Concerning the estimates of pᵢ and µᵢ, very satisfactory results are obtained, with small values of bias and MSE when using TVM and MM. The values of the MSE of σ²ᵢ exhibit a decreasing behavior when the sample size increases. On the other side, although we are in the well separated case, the values of bias and MSE of λᵢ are larger, notably when using RVM as the initialization method.

Concluding this section, we can say that, as a general rule, the MM and TVM methods presented very similar, good results for the initialization of the EM algorithm. Considering TVM as the gold standard, this means that MM can be seen as a good alternative for real applications. If this condition is maintained, our study suggests that the consistency property holds for all EM estimates – maybe with slower convergence in the scale parameter case – except for the skewness parameter, indicating that a sample size larger than 5000 is necessary to achieve consistency in this case. The study also suggests that the degree of heterogeneity of the population has a remarkable influence on the quality of the estimates.

5.2 Density Estimation Analysis

In this section we investigate the density estimation issue, that is, the point estimation of the parameter ℓ(Θ), the log-likelihood evaluated at the true value of the parameter. We considered FM-SN models with two components and restricted ourselves to the cases 2MS, 2WS and 2PS, with sample sizes n = 100, 500, 1000, 5000. The following measure was considered to compare the several methods of initialization:
$$d_r(M) = \left|\frac{\ell(\Theta) - \ell^{(M)}(\hat{\Theta})}{\ell(\Theta)}\right| \times 100,$$
where $\ell^{(M)}(\hat{\Theta})$ is the log-likelihood evaluated at the EM estimate $\hat{\Theta}$, which was obtained using initialization method M. According to this criterion, an initialization method M is better than M′ if $d_r(M) < d_r(M')$. Obviously, we expect the lowest $d_r$ values for the TVM method, the gold standard one.


Table 3: Bias of the estimates – k = 2, moderately separated (2MS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, σ²1, σ²2, λ1, λ2, p1, p2.)

Table 4: MSE of the estimates – k = 2, moderately separated (2MS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, σ²1, σ²2, λ1, λ2, p1, p2.)

Table 5: Bias of the estimates – k = 2, well separated (2WS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, σ²1, σ²2, λ1, λ2, p1, p2.)

Table 6: MSE of the estimates – k = 2, well separated (2WS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, σ²1, σ²2, λ1, λ2, p1, p2.)

Table 7: Bias of the estimates – k = 2, poorly separated (2PS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, σ²1, σ²2, λ1, λ2, p1, p2.)

Table 8: MSE of the estimates – k = 2, poorly separated (2PS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, σ²1, σ²2, λ1, λ2, p1, p2.)

Table 9: Bias of the estimates – k = 3, two poorly separated and one well separated (3PWS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2, p3.)

Table 10: MSE of the estimates – k = 3, two poorly separated and one well separated (3PWS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2, p3.)

Table 11: Bias of the estimates – k = 3, the three well separated (3WWS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2, p3.)

Table 12: MSE of the estimates – k = 3, the three well separated (3WWS). (Rows: methods TVM, RVM and MM for each n = 500, 1000, 5000, 10000; columns: µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2, p3.)

Table 13 presents the means and standard deviations of dr over 500 samples in the 2MS case. For TVM and MM, we can see that these values decrease when the sample size increases. The RVM presented the lowest mean value only when n = 100. For larger sample sizes its performance was inferior. In addition, for all sample sizes considered, it presented the largest standard deviation.

Table 13: Means and standard deviations (×10⁻⁴) of dr. 2MS case.

           TVM             RVM             MM
n       Mean     SD     Mean     SD     Mean     SD
100     1.50   15.90    1.20   21.00    1.50   15.96
500     0.27    2.87    0.87   15.13    0.27    2.87
1000    0.13    1.44    0.84   15.71    0.13    1.44
5000    0.03    0.29    0.88   17.16    0.03    0.29

Table 14: Means and standard deviations (×10⁻⁴) of dr. 2WS case.

           TVM             RVM             MM
n       Mean     SD     Mean     SD     Mean     SD
100     1.48   15.33    1.21   17.32    1.47   15.05
500     0.25    2.58    1.23   19.09    0.25    2.58
1000    0.12    1.26    1.40   20.97    0.12    1.26
5000    0.03    0.25    1.60   22.72    0.03    0.25

For the 2WS case we again have MM and TVM presenting a similar performance – see Table 14 – with a clear monotonicity as the sample size increases. Excepting the RVM, all methods have smaller standard deviations in this well separated case, maybe due to the improved ability of the k-means method to cluster this heterogeneous data set.

Table 15: Means and standard deviations (×10⁻⁴) of dr. 2PS case.

           TVM             RVM             MM
n       Mean     SD     Mean     SD     Mean     SD
100     1.70   18.53    1.29   22.52    1.30   19.19
500     0.30    3.18    0.44    6.54    0.36    6.99
1000    0.15    1.68    0.43    8.53    0.40    7.83
5000    0.02    0.21    0.39    8.76    0.48    5.61

Results for the 2PS case are shown in Table 15. In this case the MM method shows an opposite pattern: we do not observe monotone behavior for dr, and the standard errors are larger than those presented in the 2MS and 2WS cases. It is reasonable to suppose that the homogeneous structure of the data is negatively influencing the ability of the k-means method to cluster it.


The main message is that the MM method seems to be suitable when we are interested in the estimation of the likelihood values, with some caution when the population is highly homogeneous.

5.3 Number of Iterations

It is well known that one of the major drawbacks of the EM algorithm is its slow convergence. The problem becomes more serious when there is a bad choice of the starting values (McLachlan & Krishnan, 2008). Consequently, an important issue is the investigation of the number of iterations necessary for the convergence of the algorithm. Here we consider the 2MS, 2WS and 2PS cases, with k = 2 and n = 100, 500, 1000, 5000. For each combination of parameters and sample size, 500 samples were generated artificially, the number of iterations was observed, and the means and standard deviations of this quantity were computed. The simulation results are reported in Tables 16, 17 and 18.

Table 16: Means and standard deviations of the number of iterations. 2MS case.

           TVM                 RVM                 MM
n       Mean       SD       Mean       SD       Mean       SD
100     2359.46   121.13    1612.09   100.79    2293.78   120.02
500      551.10    15.45     640.23    18.41     573.62    15.26
1000     468.65     8.17     666.58    11.06     501.72     7.67
5000     449.28     6.27     845.64     8.66     501.32     5.49

Table 17: Means and standard deviations of the number of iterations. 2WS case.

           TVM                 RVM                 MM
n       Mean       SD       Mean       SD       Mean       SD
100     1519.10   105.98     942.32    79.56    1520.42   105.26
500      209.42     4.44     369.35     4.26     238.49     5.02
1000     189.85     2.60     403.22     3.88     220.24     2.68
5000     176.04     1.72     577.10     7.02     215.61     2.06

Results suggest that in the moderately and well separated cases (Tables 16 and 17, respectively), and using TVM and MM, the mean number of iterations decreases as the sample size increases, but the same is not true when RVM is adopted as the initialization method.

The 2PS case is illustrated in Table 18. As expected, we have poor behavior, possibly due to the population homogeneity, as commented before. An interesting fact is that, compared with MM, RVM has a smaller mean number of iterations.

6 A Simulation Study of Model Choice

One can argue that an arbitrary density can always be approximated by a finite mixture of normal distributions; see McLachlan & Peel (2000, Chapter 1), for example. However, it is not enough to increase the number of components in order to obtain a suitable fit. The main question is how to achieve this using


Table 18: Means and standard deviations of the number of iterations. 2PS case.

           TVM                 RVM                 MM
n       Mean       SD       Mean       SD       Mean       SD
100     2470.30   105.94    1539.45    93.29    1713.71    90.74
500     1061.02    39.90    1090.39    38.42    1095.16    23.07
1000     900.66    30.85    1063.38    23.97    1135.90    15.85
5000     700.77    12.53    1274.86    19.16    1499.98    15.90

the smallest number of mixture components without overfitting the data. One possible approach is to use some criterion function and compute
$$\hat{k} = \arg\min_k\,\{C(\hat{\Theta}(k)),\ k \in \{k_{\min},\ldots,k_{\max}\}\},$$
where $C(\hat{\Theta}(k))$ is the criterion function evaluated at the EM estimate $\hat{\Theta}(k)$, obtained by modeling the data using the FM-SN model with k components, and $k_{\min}$ and $k_{\max}$ are fixed positive integers.

Our main purpose in this section is to investigate the ability of some classical criteria to estimate thecorrect number of mixture components. We consider the Akaike Information Criterion (AIC) (Akaike,1974), the Bayesian Information Criterion (BIC) (Schwarz, 1978), the Efficient Determination Criterion(EDC) (Bai et al., 1989) and the Integrated Completed Likelihood Criterion (ICL) (Biernacki et al., 2000).AIC, BIC and EDC have the form

−2`(Θ)+dk cn,

where `(·) is the actual log-likelihood, dk is the number of free parameters that have to be estimated underthe model with k components and the penalty term cn is a convenient sequence of positive numbers. Wehave cn = 2 for AIC and cn = log(n) for BIC. For the EDC criterion, cn is chosen so that it satisfies theconditions

c_n/n → 0 and c_n/log log(n) → ∞

as n → ∞. Here we compare the following alternatives: c_n = 0.2√n, c_n = 0.2 log(n), c_n = 0.2 n/log(n), and c_n = 0.5√n.
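To make the use of these criteria concrete, the sketch below (in Python, not the authors' implementation; the helper name criterion and the placeholder log-likelihoods are ours) computes the generic penalized criterion −2ℓ(Θ̂) + d_k c_n for the penalty choices above and picks k by minimization. It uses the fact that a k-component FM-SN model has d_k = 4k − 1 free parameters (μ_i, σ_i^2, λ_i per component, plus k − 1 weights).

    import numpy as np

    def criterion(loglik, d_k, n, penalty="BIC", c=0.2):
        # Generic penalized criterion -2*loglik + d_k * c_n, where loglik is
        # the maximized log-likelihood, d_k the number of free parameters of
        # the k-component model and n the sample size.
        if penalty == "AIC":
            c_n = 2.0
        elif penalty == "BIC":
            c_n = np.log(n)
        elif penalty == "EDC-sqrt":      # c_n = c * sqrt(n)
            c_n = c * np.sqrt(n)
        elif penalty == "EDC-log":       # c_n = c * log(n)
            c_n = c * np.log(n)
        elif penalty == "EDC-nlog":      # c_n = c * n / log(n)
            c_n = c * n / np.log(n)
        else:
            raise ValueError(penalty)
        return -2.0 * loglik + d_k * c_n

    # Illustrative model choice (placeholder log-likelihoods, n = 174):
    fits = {2: -95.62, 3: -92.07}        # k -> maximized log-likelihood
    n = 174
    k_hat = min(fits, key=lambda k: criterion(fits[k], 4 * k - 1, n))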

The ICL is defined as

-2\,\ell^{*}(\hat{\Theta}) + d_k \log(n),

where ℓ*(·) is the integrated log-likelihood of the sample and the indicator latent variables, given by

\ell^{*}(\hat{\Theta}) = \sum_{i=1}^{k} \sum_{j \in C_i} \log\big(\hat{p}_i\, \mathrm{SN}(y_j \mid \hat{\theta}_i)\big),

where C_i is a set of indexes defined as follows: j belongs to C_i if, and only if, the observation y_j is allocated to component i by the following clustering process. After the FM-SN model with k components is fitted using the EM algorithm, we obtain the estimate of the posterior probability that observation y_i belongs to the j-th component of the mixture, ẑ_ij – see equation (5). If q = argmax_j {ẑ_ij}, we allocate y_i to component q.
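The sketch below (ours, not the authors' code) implements this allocation step and the resulting ICL value, with scipy's skewnorm standing in for the SN density; the argument z is assumed to hold the E-step posterior probabilities ẑ_ij.

    import numpy as np
    from scipy.stats import skewnorm

    def icl(y, p, mu, sigma, lam, z):
        # y: (n,) data; p, mu, sigma, lam: (k,) EM estimates, where sigma is
        # the component scale, i.e. sqrt(sigma_i^2); z: (n, k) posterior
        # membership probabilities from the E step. scipy's skewnorm with
        # a=lambda, loc=mu, scale=sigma is Azzalini's SN(mu, sigma^2, lambda).
        n, k = z.shape
        d_k = 4 * k - 1                    # component parameters + k-1 weights
        alloc = np.argmax(z, axis=1)       # MAP allocation of each observation
        ell_star = 0.0
        for i in range(k):
            yi = y[alloc == i]             # observations allocated to component i
            ell_star += np.sum(np.log(p[i] * skewnorm.pdf(yi, lam[i], mu[i], sigma[i])))
        return -2.0 * ell_star + d_k * np.log(n)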

In this study we simulated samples from the FM-SN model with k = 3, p_1 = p_2 = p_3 = 1/3, μ_1 = 5, μ_2 = 20, μ_3 = 28, σ_1^2 = 9, σ_2^2 = 16, σ_3^2 = 16, λ_1 = 6, λ_2 = −4 and λ_3 = 4, and considered the sample sizes n = 200, 300, 500, 1000, 5000. Figure 3 shows a typical sample of size 1000 following this specified setup.


Figure 3: Histogram of a FM-SN sample with k = 3 and n = 1000

For each generated sample (with the number of components fixed at 3) we fitted the FM-SN model with k = 2, k = 3 and k = 4, using the EM algorithm initialized by the method of moments. For each fitted model the criteria AIC, BIC, ICL and EDC were computed. We repeated this procedure 500 times and obtained the percentage of times each criterion chooses the correct number of components. The results are reported in Table 19.

Table 19: Percentage of times the true model (with k = 3) is chosen using different criteria

                                                EDC
  n      AIC    BIC    ICL    c_n = 0.2 log(n)   c_n = 0.2√n   c_n = 0.2 n/log(n)   c_n = 0.5√n
  200    94.2   99.2   99.2         77.8              98.4            99.4              99.4
  300    94.0   98.8   98.8         78.2              98.4            98.8              98.8
  500    95.8   99.8   99.8         86.4              99.8            99.8              99.8
  1000   96.2  100.0  100.0         88.5             100.0           100.0             100.0
  5000   95.6  100.0  100.0         92.8             100.0           100.0             100.0

We can see that BIC and ICL perform better than AIC for all sample sizes. Except for AIC, the rates increase as the sample size increases. This possible drawback of AIC may be due to the fact that its penalty term does not take the sample size into account. Results for BIC and ICL were similar, while EDC showed some dependence on the term c_n. In general, we can say that BIC and ICL have equivalent abilities to choose the correct number of components and that, depending on the choice of c_n, EDC can be not as good as AIC or better than ICL and BIC.


7 Applications to Real Data Sets

Here we apply the modeling approach suggested in the previous sections to the analysis of two real data sets. That is, we model the data using the FM-SN model, estimating parameters through the proposed EM-type algorithm, with initial values obtained by the method of moments and by the random values method, and fixing as stopping rule |ℓ(Θ^(k+1)) − ℓ(Θ^(k))| < 10^{-5}.
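A minimal sketch of an EM loop with this stopping rule is given below. For brevity, the E and M steps shown are those of a two-component normal mixture rather than the FM-SN steps of the paper, since only the convergence check is the point here; the function name em_with_stopping is ours.

    import numpy as np
    from scipy.stats import norm

    def em_with_stopping(y, tol=1e-5, max_iter=20000):
        # Two-component normal mixture EM (a stand-in for the FM-SN steps);
        # the loop stops when |l(k+1) - l(k)| < tol, as in the paper.
        mu = np.array([y.min(), y.max()])          # crude starting values
        s = np.array([y.std(), y.std()])
        w = np.array([0.5, 0.5])
        ll_old = -np.inf
        for it in range(max_iter):
            dens = w * norm.pdf(y[:, None], mu, s)        # (n, 2) weighted densities
            ll = np.log(dens.sum(axis=1)).sum()           # current log-likelihood
            if abs(ll - ll_old) < tol:                    # the stopping rule
                break
            ll_old = ll
            z = dens / dens.sum(axis=1, keepdims=True)    # E step: posteriors
            nk = z.sum(axis=0)                            # M step:
            w = nk / len(y)
            mu = (z * y[:, None]).sum(axis=0) / nk
            s = np.sqrt((z * (y[:, None] - mu) ** 2).sum(axis=0) / nk)
        return mu, s, w, ll, it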

7.1 GDP data set

Figure 4: Histogram of the GDP data set and fitted FM-SN densities with 2 (solid line) and 3 (dotted line) components

In this application we consider observations of the gross domestic product (GDP) per capita (PPP US$) in 1998 from 174 countries. This data set appeared before in Dias & Wedel (2004), where a logarithmic transformation was applied before fitting a mixture of two normal distributions, in order to eliminate the clear strong skewness. Figure 4 shows a histogram of the data, suggesting a bimodal or trimodal pattern. Thus, a mixture of 2 or 3 SN distributions seems to be a natural candidate to model these data, avoiding a possibly unnecessary data transformation. For this application, each observation in the original data set was divided by 10^3.

Table 20 presents the initial values obtained by the MM and RVM methods, and Table 21 shows the EM estimates. We can verify that, besides an apparent label switching, there are differences between the starting points and the EM estimates, which are more marked for the skewness parameter λ_i. In this case we have a small sample, and some care is needed when interpreting these results, because of the deficiencies of the algorithm in estimating this parameter, commented on in the previous sections.

As in Lin et al. (2007a), we applied the Kolmogorov-Smirnov (K-S) test to evaluate the goodness of fit under the different initialization methods of the algorithm. The p-values are presented in Table 22 and indicate a better fit for the model with 3 components, regardless of the initialization method.
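To illustrate how such a goodness-of-fit check can be carried out, the sketch below builds the fitted FM-SN CDF and passes it to scipy's kstest. The parameter values are the k = 3 MM estimates from Table 21, while the data line is a synthetic stand-in for the rescaled GDP observations; this is our illustration, not the authors' code.

    import numpy as np
    from scipy.stats import skewnorm, kstest

    def fmsn_cdf(x, p, mu, sigma, lam):
        # CDF of a finite mixture of skew-normal components.
        x = np.asarray(x)[:, None]
        return (p * skewnorm.cdf(x, lam, loc=mu, scale=sigma)).sum(axis=1)

    # k = 3 EM estimates (MM column of Table 21); sigma = sqrt(sigma_i^2).
    p = np.array([0.2059, 0.2565, 0.5375])
    p = p / p.sum()                                  # renormalize rounded weights
    mu = np.array([1.7915, 0.3968, 0.0454])
    sig = np.sqrt(np.array([0.3102, 0.1840, 0.0521]))
    lam = np.array([0.7189, 7.4099, 2127.1691])

    # `y` should be the rescaled GDP data; a synthetic sample stands in here.
    rng = np.random.default_rng(1)
    comp = rng.choice(3, size=174, p=p)
    y = skewnorm.rvs(lam[comp], loc=mu[comp], scale=sig[comp], random_state=rng)

    stat, pval = kstest(y, lambda x: fmsn_cdf(x, p, mu, sig, lam))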


Table 20: GDP data: initial values obtained by the MM and RVM methods

                              Method
  Parameter   k    RVM                             MM
  μ_i         2    (1.8365; 0.3911)                (0.0047; 1.6078)
              3    (2.4609; 0.9195; 0.0903)        (1.7354; 0.6208; 0.0896)
  σ_i^2       2    (0.3617; 0.1542)                (0.2168; 0.4216)
              3    (0.0445; 0.0660; 0.0007)        (0.3689; 0.1730; 0.0522)
  λ_i         2    (1.0619; -3.9085)               (11.5960; 1.5634)
              3    (-2.2742; -3.2476; 0.6745)      (3.3291; 2.4475; 2.1934)
  p_i         2    (0.5419; 0.4580)                (0.7873; 0.2126)
              3    (0.3566; 0.2274; 0.4158)        (0.1724; 0.2068; 0.6206)
  ℓ(Θ)        2    -260.8486                       -106.5715
              3    -417.0599                       -111.6361

Table 21: GDP data: EM estimates

                              Method
  Parameter   k    RVM                                      MM
  μ_i         2    (1.4124; 0.3043)                         (0.0452; 1.8755)
              3    (1.9950; 0.4773; 0.0456)                 (1.7915; 0.3968; 0.0454)
  σ_i^2       2    (0.5620; 0.0380)                         (0.2056; 0.2897)
              3    (0.4937; 0.0370; 0.0137)                 (0.3102; 0.1840; 0.0521)
  λ_i         2    (0.2926; -0.0665)                        (2387.4927; 0.2941)
              3    (-0.4830; -0.0053; 1825.5701)            (0.7189; 7.4099; 2127.1691)
  p_i         2    (0.3388; 0.6611)                         (0.7799; 0.2200)
              3    (0.2857; 0.3813; 0.3329)                 (0.2059; 0.2565; 0.5375)
  ℓ(Θ)        2    -123.0055                                -95.6176
              3    -94.4735                                 -92.0709

Fixing MM as the initialization method, we computed the model choice criteria AIC, BIC and EDC (with c_n = 0.2√n) for the models with 2 and 3 components. Table 23 shows the results, together with the log-likelihood evaluated at the EM estimates and the number of iterations needed for convergence. Although the model with 3 components has a higher log-likelihood value, it was penalized by the criteria because of its larger number of parameters. Surprisingly, the model with 2 components was rejected by the K-S test.

Figure 4 also presents the plug-in densities – that is, the FM-SN densities with 2 and 3 components in which the parameters have been replaced by the EM estimates – superimposed on a single set of coordinate axes. Based on this graphical visualization, it appears that the model with 2 components provides a quite reasonable and better fit.


Table 22: GDP data: Kolmogorov-Smirnov test

                    2 components          3 components
                    RVM       MM          RVM       MM
  K-S statistic     0.0687    0.0966      0.0365    0.0396
  p-value           0.0250    0.0190      0.1940    0.1820

Table 23: GDP data: criteria values and number of iterations

  k    ℓ(Θ)      AIC      BIC      EDC      Number of iterations
  2    −94.95    205.24   227.35   214.32   7960
  3    −92.08    206.15   240.90   220.43   7164

7.2 Old Faithful Geyser data set

Now we consider the famous Old Faithful Geyser data taken from Silverman (1986), consisting of 272 univariate measurements of eruption lengths (in minutes) of the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA. It has been analyzed by many authors; Lin et al. (2007b), for example, fitted FM-SN and FM-NOR models with two components.

The histogram of these data, shown in Figure 5, clearly suggests the fit of a bimodal distribution with two heterogeneous asymmetric sub-populations. We made an analysis similar to that of the previous example, fitting FM-SN models with 2 and 3 components. Results are displayed in Tables 24, 25, 26 and 27. In this case, the K-S test and the criteria agree on the choice of the model with 2 components.

Figure 5: Histogram of the Old Faithful Geyser data set and fitted FM-SN densities with 2 (solid line) and 3 (dotted line) components

Table 24: Old Faithful data: initial values obtained by the MM and RVM methods

                              Method
  Parameter   k    RVM                              MM
  μ_i         2    (4.7851; 2.8597)                 (4.6874; 1.6713)
              3    (5.7732; 3.5815; 4.5693)         (4.4258; 4.7396; 4.6198)
  σ_i^2       2    (0.0801; 0.0571)                 (0.3125; 0.2236)
              3    (1.4072; 1.4842; 0.8971)         (2.3636; 2.5202; 2.7715)
  λ_i         2    (1.4607; -1.5402)                (-1.7855; 122.9431)
              3    (-3.4674; -3.0006; -2.8209)      (-1.4248; -2.3205; -1.8137)
  p_i         2    (0.7718; 0.2281)                 (0.6397; 0.3602)
              3    (0.2499; 0.1149; 0.6351)         (0.3308; 0.3566; 0.3125)
  ℓ(Θ)        2    -1886.2170                       -274.9906
              3    -3228.8101                       -411.5698

Table 25: Old Faithful data: EM estimates

                              Method
  Parameter   k    RVM                                          MM
  μ_i         2    (4.8193; 2.0185)                             (4.7994; 1.7264)
              3    (5.1021; 1.9993; 4.7367)                     (2.0002; 4.9903; 4.6653)
  σ_i^2       2    (0.5333; 0.0477)                             (0.4688; 0.1451)
              3    (1.4799; 0.0444; 0.3476)                     (0.0442; 1.0963; 0.2489)
  λ_i         2    (-3.9733; -0.0731)                           (-3.4798; 5.8263)
              3    (-1.6523e+03; -3.4013e-03; -3.6525e+00)      (-0.0109; -10.2598; -2.3059)
  p_i         2    (0.6584; 0.3415)                             (0.6511; 0.3488)
              3    (0.1883; 0.3340; 0.4776)                     (0.3344; 0.3005; 0.3650)
  ℓ(Θ)        2    -226.4929                                    -257.5662
              3    -264.4344                                    -265.2770

Table 26: Old Faithful data: Kolmogorov-Smirnov test

                    2 components           3 components
                    RVM       MM           RVM       MM
  K-S statistic     0.0491    0.03569      0.0481    0.0485
  p-value           0.1060    0.1730       0.0940    0.1010

Table 27: Old Faithful data: criteria values and number of iterations

  k    ℓ(Θ)       AIC      BIC      EDC      Number of iterations
  2    −257.57    529.13   554.37   538.22    805
  3    −257.26    536.52   576.19   550.81   1641

8 Final Conclusion

In this work we presented a simulation study investigating the performance of the EM algorithm for maximum likelihood estimation in finite mixtures of skew-normal distributions with component specific parameters. The results show that the algorithm produces quite reasonable estimates, in the sense of consistency and the total number of iterations, when using the method of moments to obtain the starting points. The study also suggested that the random initialization method used here is not a reasonable procedure. When the EM estimates were used to compute some model choice criteria, namely AIC, BIC, ICL and EDC (the latter depending on the penalty term used), a good alternative to estimate the number of components of the mixture was obtained. On the other hand, these patterns do not hold when the mixture components are poorly separated, notably for the skewness parameter estimates which, in addition, showed a performance strongly dependent on large samples. Possible extensions of this work include the multivariate case and a wider family of skewed distributions, like the class of skew-normal independent distributions (Lachos et al., 2010).

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171–178.

Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32, 159–188.

Bai, Z. D., Krishnaiah, P. R. & Zhao, L. C. (1989). On rates of convergence of efficient detection criteria in signal processing with white noise. IEEE Transactions on Information Theory, 35, 380–388.

Basso, R. M., Lachos, V. H., Cabral, C. R. B. & Ghosh, P. (2009). Robust mixture modeling based on scale mixtures of skew-normal distributions. Computational Statistics and Data Analysis. doi:10.1016/j.csda.2009.09.031.

Bayes, C. L. & Branco, M. D. (2007). Bayesian inference for the skewness parameter of the scalar skew-normal distribution. Brazilian Journal of Probability and Statistics, 21, 141–163.

Biernacki, C., Celeux, G. & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 719–725.

Biernacki, C., Celeux, G. & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41, 561–575.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Dias, J. G. & Wedel, M. (2004). An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Statistics and Computing, 14, 323–332.

DiCiccio, T. J. & Monti, A. C. (2004). Inferential aspects of the skew exponential power distribution. Journal of the American Statistical Association, 99, 439–450.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Verlag.

Greselin, F. & Ingrassia, S. (2009). Constrained monotone EM algorithms for mixtures of multivariate t distributions. Statistics and Computing. doi: 10.1007/s11222-008-9112-9.

Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture models. The Annals of Statistics, 13, 795–800.

Henze, N. (1986). A probabilistic representation of the skew-normal distribution. Scandinavian Journal of Statistics, 13, 271–275.

Ingrassia, S. (2004). A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods and Applications, 13, 151–166.

Ingrassia, S. & Rocci, R. (2007). Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics & Data Analysis, 51, 5339–5351.

Johnson, R. A. & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Prentice Hall, 6th edition.

Karlis, D. & Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis, 41, 577–590.

Lachos, V. H., Ghosh, P. & Arellano-Valle, R. B. (2010). Likelihood based inference for skew normal independent linear mixed models. Statistica Sinica, 20, 303–322.

Lin, T. & Lin, T. (2009). Supervised learning of multivariate skew normal mixture models with missing information. Computational Statistics. doi: 10.1007/s00180-009-0169-5.

Lin, T. I. (2009a). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis, 100, 257–265.

Lin, T. I. (2009b). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing. doi: 10.1007/s11222-009-9128-9.

Lin, T. I., Lee, J. C. & Ni, H. F. (2004). Bayesian analysis of mixture modelling using the multivariate t distribution. Statistics and Computing, 14, 119–130.

Lin, T. I., Lee, J. C. & Hsieh, W. J. (2007a). Robust mixture modelling using the skew t distribution. Statistics and Computing, 17, 81–92.

Lin, T. I., Lee, J. C. & Yen, S. Y. (2007b). Finite mixture modelling using the skew normal distribution. Statistica Sinica, 17, 909–927.

McLachlan, G. J. & Krishnan, T. (2008). The EM Algorithm and Extensions. John Wiley and Sons, second edition.

McLachlan, G. J. & Peel, D. (2000). Finite Mixture Models. John Wiley and Sons.

Meng, X. L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267–278.

Nityasuddhi, D. & Böhning, D. (2003). Asymptotic properties of the EM algorithm estimate for normal mixture models with component specific variances. Computational Statistics & Data Analysis, 41, 591–601.

Park, H. & Ozeki, T. (2009). Singularity and slow convergence of the EM algorithm for Gaussian mixtures. Neural Processing Letters, 29, 45–59.

Peel, D. & McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10, 339–348.

R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Shoham, S. (2002). Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions. Pattern Recognition, 35, 1127–1142.

Shoham, S., Fellows, M. R. & Normann, R. A. (2003). Robust, automatic spike sorting using mixtures of multivariate t-distributions. Journal of Neuroscience Methods, 127, 111–122.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons.

Yakowitz, S. J. & Spragins, J. D. (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39, 209–214.

Yao, W. (2010). A profile likelihood method for normal mixture with unequal variance. Journal of Statistical Planning and Inference. doi:10.1016/j.jspi.2010.02.004.
