Hierarchical Mixture Modeling With Normalized Inverse-Gaussian Priors

Antonio LIJOI, Ramsés H. MENA, and Igor PRÜNSTER

In recent years the Dirichlet process prior has experienced a great success in the context of Bayesian mixture modeling. The idea of overcoming discreteness of its realizations by exploiting it in hierarchical models, combined with the development of suitable sampling techniques, represents one of the reasons of its popularity. In this article we propose the normalized inverse-Gaussian (N–IG) process as an alternative to the Dirichlet process to be used in Bayesian hierarchical models. The N–IG prior is constructed via its finite-dimensional distributions. This prior, although sharing the discreteness property of the Dirichlet prior, is characterized by a more elaborate and sensible clustering which makes use of all the information contained in the data. Whereas in the Dirichlet case the mass assigned to each observation depends solely on the number of times that it occurred, for the N–IG prior the weight of a single observation depends heavily on the whole number of ties in the sample. Moreover, expressions corresponding to relevant statistical quantities, such as a priori moments and the predictive distributions, are as tractable as those arising from the Dirichlet process. This implies that well-established sampling schemes can be easily extended to cover hierarchical models based on the N–IG process. The mixture of N–IG process and the mixture of Dirichlet process are compared using two examples involving mixtures of normals.

KEY WORDS: Bayesian nonparametrics; Density estimation; Dirichlet process; Inverse-Gaussian distribution; Mixture models; Predictive distribution; Semiparametric inference.

1. INTRODUCTION

Since the appearance of the article by Ferguson (1973), illustrating the statistical implications of the Dirichlet process, the Bayesian nonparametric literature has grown rapidly. Research has focused on both the analysis of various properties of the Dirichlet process and the proposal of alternative priors. In particular, Blackwell (1973) showed that the Dirichlet prior selects discrete distributions almost surely, which implies a positive probability of ties in a sample drawn from it, as highlighted by Antoniak (1974). It is clear that such a property is undesirable when one is interested in modeling continuous data.

Various extensions of the Dirichlet process have been proposed to overcome this drawback. For instance, Mauldin, Sudderth, and Williams (1992) and Lavine (1992) introduced Pólya-tree priors. Although these can be constructed in such a way as to put probability one on continuous distributions, inference based on them depends heavily on the tree of partitions used for their construction. Paddock, Ruggeri, Lavine, and West (2003) tried to get rid of this unpleasant feature by making the partition itself random. The most fruitful approach to exploiting the Dirichlet process in inferential procedures is represented by the mixture of Dirichlet process (MDP), introduced by Lo (1984) and further developed by Ferguson (1983). This prior is concentrated on the space of densities, thus overcoming the discreteness problem, and it has been the focus of much attention during the 1990s as a consequence of the introduction of simulation-based inference, first developed by Escobar (1988). Relevant contributions include those

Antonio Lijoi is Assistant Professor, Department of Economics and Quantitative Methods, University of Pavia, Pavia, Italy, and Research Associate at the Institute of Applied Mathematics and Computer Science (IMATI), National Research Council (CNR), Milan, Italy (E-mail: lijoi@unipv.it). Ramsés H. Mena is Associate Researcher, Research Institute for Applied Mathematics and Systems (IIMAS), National University of Mexico (UNAM), Mexico City, Mexico (E-mail: ramses@sigma.iimas.unam.mx). Igor Prünster is Assistant Professor, Department of Economics and Quantitative Methods, University of Pavia, Pavia, Italy, and Fellow of the International Center of Economic Research (ICER), Turin, Italy (E-mail: igor.pruenster@unipv.it). A. Lijoi and I. Prünster were supported in part by the Italian Ministry of University and Research (MIUR) research project "Bayesian Nonparametric Methods and Their Applications." The authors are grateful to two anonymous referees for valuable comments that have led to a substantial improvement in the presentation. Special thanks also to Eugenio Regazzini for some helpful suggestions.

by, among others, Escobar (1994), Escobar and West (1995), MacEachern (1994), MacEachern and Müller (1998), and Green and Richardson (2001). The model has been comprehensively reviewed by Dey, Müller, and Sinha (1998). The reason for the success of the MDP, as pointed out by Green and Richardson (2001), is that it exploits the discreteness of the Dirichlet process rather than combating it, thus providing a flexible model for clustering of items in a hierarchical setting.

This article proposes an alternative prior, the normalized inverse-Gaussian (N–IG) prior, to be used in the context of mixture modeling. Our proposed approach consists of the hierarchical model (or a suitable semiparametric variation of it) given by

$$(Y_i \mid X_i) \overset{\text{ind}}{\sim} \mathcal{L}(Y_i \mid X_i), \qquad i = 1,\ldots,n,$$
$$(X_i \mid \tilde P) \overset{\text{iid}}{\sim} \tilde P, \tag{1}$$
$$\tilde P \sim \text{N–IG}.$$

This represents a valid alternative to the corresponding model based on the Dirichlet process, because the N–IG prior, although sharing the discreteness property, turns out to have some advantages: whereas it preserves almost the same tractability as the Dirichlet process, it is characterized by a more elaborate clustering property that makes use of all of the information contained in the data. It is well known that the mass assigned to each observation by the Dirichlet process depends solely on the number of times that it occurs; in contrast, for the N–IG process, the weight given to a single observation depends heavily on the whole number of ties in the sample. For making inference with nonparametric hierarchical mixture models, the prior distribution of the number of distinct observations within the sample, which takes on the interpretation of the distribution of the number of components in the mixture, is of great interest. Such a distribution, obtained by Antoniak (1974) in the Dirichlet case, admits a simple closed-form expression for the N–IG process. Using two examples involving mixtures of normals, we compare the behavior of the N–IG process mixture and the MDP.

© 2005 American Statistical Association
Journal of the American Statistical Association
December 2005, Vol. 100, No. 472, Theory and Methods
DOI 10.1198/016214505000000132



Like the Dirichlet process, the N–IG process is a special case of various classes of random probability measures. Indeed, it is a particular case of normalized random measure with independent increments (RMI). This class was introduced by Kingman (1975) and subsequently studied by Perman, Pitman, and Yor (1992) and Pitman (2003). In a Bayesian context it was first considered by Regazzini, Lijoi, and Prünster (2003), who generalized it also to nonhomogeneous measures (see also Prünster 2002). Further extensions and results have been given by James (2002). The N–IG process is also, when its parameter measure is nonatomic, a special case of species-sampling model, a family of random probability measures introduced by Pitman (1996) and studied in a Bayesian framework by James (2002) and Ishwaran and James (2003). The N–IG and Dirichlet processes play special roles within these classes (and indeed within all random probability measures): to our knowledge, they are the only priors for which completely explicit expressions of their finite-dimensional distributions are available. Moreover, the N–IG process leads to explicit and tractable expressions for quantities of statistical relevance. What distinguishes the Dirichlet process from normalized RMIs and species-sampling models (and thus also from the N–IG process) is its conjugacy, as shown by Prünster (2002). However, this is no longer a problem, given the availability of suitable sampling schemes. It is worth noting that a posterior characterization of the N–IG process in terms of a latent variable can be deduced immediately from work of James (2002). Based on the families of discrete random probability measures mentioned earlier, general classes of mixture models have recently been proposed. Ishwaran and James (2001) proposed replacing the Dirichlet process by stick-breaking priors and studied mixtures based on the two-parameter Poisson–Dirichlet process, due to Pitman (1995), in great detail. Ishwaran and James (2003) extended the analysis to cover species-sampling mixture models. Another proposal for generalizing the MDP is represented by the normalized random measures driven by increasing additive processes, due to Nieto-Barajas, Prünster, and Walker (2004) (see also Prünster 2002).

The outline of the article is as follows. We start with the construction of an N–IG process. We do this by first deriving the N–IG distribution on (0,1) and its multivariate analog on the $(n-1)$-dimensional simplex in Section 2, where we compute some of its moments as well. In Section 3 we show that there exists a random probability measure with these finite-dimensional distributions, which we term the N–IG process. Such a construction mimics the procedure adopted by Ferguson (1973) in defining the Dirichlet process. The role played by the gamma distribution in the Dirichlet case is attributed to the inverse-Gaussian distribution in our case. Assuming exchangeability of the observations, we obtain the predictive distributions corresponding to an N–IG process. We further develop the analysis of this prior by illustrating the mechanism through which the weights are allocated to the observations. This represents a key feature for its potential success in applications. Moreover, we obtain a closed-form expression for the prior probability distribution $p(\cdot|n)$ of the number of distinct observations in a sample of size $n$. In Section 4 we consider the mixture of N–IG process and show that well-known simulation techniques, such as those set forth by Escobar and West (1995) and Ishwaran and James (2001), can be extended in a straightforward way to deal with a mixture of N–IG process. For illustrative purposes, we consider simulated datasets where observations are drawn from uniform mixtures of normals and compare the performance of the mixture of N–IG process and the MDP in terms of both posterior probabilities on the correct number of components and log-Bayes factors. Finally, the extensively studied "galaxy" dataset is analyzed. Another fruitful use of discrete nonparametric priors in a continuous setting is represented by the possibility of considering their filtered-variate version, as proposed by Dickey and Jiang (1998). This naturally leads to considering prior and posterior distributions of means of random probability measures. In Section 5 we provide a description of the posterior distribution of means of N–IG processes.

2. THE NORMALIZED INVERSE-GAUSSIAN DISTRIBUTION

The beta distribution and its multidimensional analog, the Dirichlet distribution, are commonly used as priors for the binomial and multinomial models. It is well known that, given $n$ independent gamma random variables with $V_i \sim \mathrm{Ga}(\alpha_i, 1)$, the Dirichlet distribution is defined as the distribution of the vector $(W_1,\ldots,W_n)$, where $W_i = V_i / \sum_{j=1}^n V_j$ for $i = 1,\ldots,n$. If $\alpha_i > 0$ for every $i$, then the $(n-1)$-dimensional distribution of $(W_1,\ldots,W_{n-1})$ is absolutely continuous with respect to the product measure $\lambda^{n-1}$, where $\lambda$ is the Lebesgue measure on the real line, and its density function on the simplex $\Delta_{n-1} = \{(w_1,\ldots,w_{n-1}) : w_i \ge 0,\ i = 1,\ldots,n-1,\ \sum_{i=1}^{n-1} w_i \le 1\}$ is given by

$$f(w_1,\ldots,w_{n-1}) = \frac{\Gamma\big(\sum_{i=1}^{n}\alpha_i\big)}{\prod_{i=1}^{n}\Gamma(\alpha_i)} \left(\prod_{i=1}^{n-1} w_i^{\alpha_i-1}\right) \left(1 - \sum_{i=1}^{n-1} w_i\right)^{\alpha_n-1}. \tag{2}$$

Clearly, if $n = 2$, then (2) reduces to the beta density with parameter $(\alpha_1, \alpha_2)$ (see, e.g., Bilodeau and Brenner 1999).

In this section we derive a distribution on the unit interval and its multivariate analog on the simplex by substituting the gamma distribution with the inverse-Gaussian distribution in the foregoing construction. A random variable $V$ has inverse-Gaussian distribution with shape parameter $\alpha \ge 0$ and scale parameter $\gamma > 0$, denoted by $V \sim \mathrm{IG}(\alpha, \gamma)$, if it has density with respect to the Lebesgue measure given by

$$f(v) = \frac{\alpha}{\sqrt{2\pi}}\, v^{-3/2} \exp\left\{-\frac{1}{2}\left(\frac{\alpha^2}{v} + \gamma^2 v\right) + \gamma\alpha\right\}, \qquad v \ge 0, \tag{3}$$

for $\alpha > 0$, whereas $V = 0$ (almost surely) if $\alpha = 0$. Seshadri (1993) has provided an exhaustive account of the inverse-Gaussian distribution. In what follows we assume, without loss of generality, that $\gamma = 1$.

Let $V_1,\ldots,V_n$ be independent random variables with $V_i \sim \mathrm{IG}(\alpha_i, 1)$, where $\alpha_i \ge 0$ for all $i$ and $\alpha_i > 0$ for some $i$, $i = 1,\ldots,n$. We define the N–IG distribution with parameter $(\alpha_1,\ldots,\alpha_n)$, denoted by N–IG$(\alpha_1,\ldots,\alpha_n)$, as the distribution of the random vector $(W_1,\ldots,W_n)$, where $W_i = V_i / \sum_{j=1}^n V_j$ for $i = 1,\ldots,n$. Clearly, if $\alpha_j = 0$ for some $j$, then $W_j = 0$ almost surely.
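This construction is straightforward to simulate. The following minimal sketch (ours, for illustration; it assumes numpy and uses the fact that the IG(α, γ = 1) law above coincides with numpy's Wald distribution with mean α and scale α²) draws N–IG vectors by normalizing independent inverse-Gaussian variables:

```python
# Sketch: sampling from N-IG(alpha_1,...,alpha_n) by normalization.
# Assumption: IG(alpha, gamma=1) equals numpy's Wald law with
# mean = alpha/gamma = alpha and shape parameter (scale) = alpha^2.
import numpy as np

def sample_nig(alphas, size=10000, rng=None):
    """Draw `size` vectors (W_1,...,W_n) ~ N-IG(alphas) as V_i / sum_j V_j."""
    rng = np.random.default_rng() if rng is None else rng
    V = np.stack([rng.wald(a, a**2, size=size) for a in np.atleast_1d(alphas)],
                 axis=1)
    return V / V.sum(axis=1, keepdims=True)

W = sample_nig([0.25, 0.25], size=200000)
print(W.mean(axis=0))   # should be close to p_i = alpha_i / a = (.5, .5)
```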


The following proposition provides the density function over $\Delta_{n-1}$ of an N–IG-distributed random vector.

Proposition 1. Suppose that the random vector $(W_1,\ldots,W_m)$ is N–IG$(\alpha_1,\ldots,\alpha_m)$ distributed. If $\alpha_i > 0$ for every $i = 1,\ldots,m$, then the distribution of $(W_1,\ldots,W_{m-1})$ is absolutely continuous with respect to $\lambda^{m-1}$, and its density function on $\Delta_{m-1}$ coincides with

$$f(w_1,\ldots,w_{m-1}) = \frac{e^{\sum_{i=1}^{m}\alpha_i} \prod_{i=1}^{m}\alpha_i}{2^{m/2-1}\pi^{m/2}}\; \frac{K_{-m/2}\big(\sqrt{A_m(w_1,\ldots,w_{m-1})}\big)}{w_1^{3/2}\cdots w_{m-1}^{3/2}\big(1-\sum_{j=1}^{m-1} w_j\big)^{3/2}\, \big[A_m(w_1,\ldots,w_{m-1})\big]^{m/4}}, \tag{4}$$

where $A_m(w_1,\ldots,w_{m-1}) = \sum_{i=1}^{m-1} \alpha_i^2/w_i + \alpha_m^2/\big(1-\sum_{j=1}^{m-1} w_j\big)$ and $K$ denotes the modified Bessel function of the third type. If $m = 2$, then (4) reduces to a univariate density on (0,1) of the form

$$f(w) = \frac{e^{\alpha_1+\alpha_2}\alpha_1\alpha_2}{\pi}\; \frac{K_{-1}\big(\sqrt{A_2(w)}\big)}{w^{3/2}(1-w)^{3/2}\,(A_2(w))^{1/2}}. \tag{5}$$

One of the most important features of the Dirichlet distribution is the additive property inherited from the gamma distribution through which the Dirichlet distribution is defined. The same happens for the N–IG distribution, in the sense that if $(W_1,\ldots,W_n)$ is N–IG$(\alpha_1,\ldots,\alpha_n)$ distributed and $m_1 < m_2 < \cdots < m_k = n$ are integers, then the vector $\big(\sum_{i=1}^{m_1} W_i,\ \sum_{i=m_1+1}^{m_2} W_i,\ \ldots,\ \sum_{i=m_{k-1}+1}^{n} W_i\big)$ has distribution N–IG$\big(\sum_{i=1}^{m_1}\alpha_i,\ \sum_{i=m_1+1}^{m_2}\alpha_i,\ \ldots,\ \sum_{i=m_{k-1}+1}^{n}\alpha_i\big)$. This can be easily verified on the basis of the additive property of the inverse-Gaussian distribution.

A common desire is to associate to a probability distribution some descriptive indexes that summarize its characteristics. The following proposition provides some of these for the N–IG distribution. Set $a = \sum_{j=1}^{n}\alpha_j$ and $p_i = \alpha_i/a$ for every $i = 1,\ldots,n$.

Proposition 2. Suppose that the random vector $(W_1,\ldots,W_n)$ is N–IG$(\alpha_1,\ldots,\alpha_n)$ distributed. Then

$$\mathrm{E}[W_i] = p_i,$$
$$\mathrm{var}[W_i] = p_i(1-p_i)\, a^2 e^a\, \Gamma(-2, a),$$

and

$$\mathrm{cov}(W_i, W_j) = -p_i p_j\, a^2 e^a\, \Gamma(-2, a) \qquad \text{for } i \neq j,$$

where $\Gamma(\cdot,\cdot)$ denotes the incomplete gamma function.

It is worth noting that the moments corresponding to an N–IG distribution are quite similar to those of the Dirichlet distribution. Indeed, the structure is the same, and they differ just by a multiplicative constant. Figures 1 and 2 depict various features of Dirichlet and N–IG distributions. For comparative purposes, the parameters have been chosen in such a way that their means and variances coincide. Recall that if a random vector is Dirichlet distributed, then $\mathrm{E}[W_i] = \alpha_i/\tilde a = \tilde p_i$ and $\mathrm{var}[W_i] = \tilde p_i(1-\tilde p_i)/(1+\tilde a)$ for any $i = 1,\ldots,n$, having set $\tilde a = \sum_{i=1}^{n}\alpha_i$. Thus the means coincide if $p_i = \tilde p_i$ for any $i$, whereas, given $\tilde a$, the variance match is achieved by solving for $a$ the equation $a^2 e^a \Gamma(-2, a) = (\tilde a + 1)^{-1}$. The plots seem to suggest a slightly lower concentration of the mass of the N–IG distribution around the barycenter of the simplex. Such a feature can be interpreted as the N–IG being less informative than the Dirichlet prior. This will become apparent in the sequel.
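The variance match is easy to verify numerically. The quick sketch below (ours, assuming scipy) uses the recursion $\Gamma(-2,a) = [a^{-2}e^{-a} - a^{-1}e^{-a} + E_1(a)]/2$, which follows from $\Gamma(s+1,x) = s\Gamma(s,x) + x^s e^{-x}$, to recover the matched Dirichlet mass from a given $a$:

```python
# Variance factor a^2 e^a Gamma(-2, a) of the N-IG distribution,
# computed from the exponential integral E1 (scipy.special.exp1).
import numpy as np
from scipy.special import exp1

def nig_variance_factor(a):
    inc_gamma_m2 = (np.exp(-a) / a**2 - np.exp(-a) / a + exp1(a)) / 2.0
    return a**2 * np.exp(a) * inc_gamma_m2

a = 0.5
print(1.0 / nig_variance_factor(a) - 1.0)   # matched Dirichlet mass, ~1.737
```

Solving in the opposite direction (finding $a$ for a given $\tilde a$) requires a one-dimensional root search, because the factor has no closed-form inverse.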

3. THE NORMALIZED INVERSE-GAUSSIAN PROCESS

3.1 Definition and Existence of the Normalized Inverse-Gaussian Process

In this section we define and show the existence of the N–IG process. Toward this end, we suppose that the observations take values in some complete and separable metric space $\mathbb{X}$, endowed with its Borel $\sigma$-field $\mathscr{X}$. The N–IG process $\tilde P$ is a random probability measure on $(\mathbb{X}, \mathscr{X})$ that can be defined via its family of finite-dimensional distributions, as was done by Ferguson (1973) for the Dirichlet process.

Let $\mathcal{P} = \{Q_{A_1,\ldots,A_n} : A_1,\ldots,A_n \in \mathscr{X},\ n \ge 1\}$ be a family of probability distributions, and let $\alpha$ be a finite measure on $(\mathbb{X}, \mathscr{X})$ with $\alpha(\mathbb{X}) = a > 0$. If $\{A_1,\ldots,A_n\}$ is a measurable partition of $\mathbb{X}$, then set

$$Q_{A_1,\ldots,A_n}(C) = \int_{C \cap \Delta_{n-1}} f(v_1,\ldots,v_{n-1})\, dv_1 \cdots dv_{n-1} \tag{6}$$

for all $C \in \mathscr{B}(\mathbb{R}^n)$, where $f$ is defined as in (4) with $\alpha_i = \alpha(A_i) > 0$ for all $i$. If $A_1,\ldots,A_n$ is not a partition of $\mathbb{X}$, then consider the partition $B_1,\ldots,B_m$ that it generates and set

$$Q_{A_1,\ldots,A_n}(C) = Q_{B_1,\ldots,B_m}\Big(\Big\{(x_1,\ldots,x_m) \in [0,1]^m : \Big(\sum_{(1)} x_i,\ \ldots,\ \sum_{(n)} x_i\Big) \in C\Big\}\Big),$$

where $\sum_{(j)}$ means that the sum extends over all indices $i \in \{1,\ldots,m\}$ for which $B_i \subset A_j$. We can show that $\mathcal{P}$ satisfies the following conditions:

(C1) For any $n \ge 1$ and any finite permutation $\pi$ of $(1,\ldots,n)$,
$$Q_{A_1,\ldots,A_n}(C) = Q_{A_{\pi(1)},\ldots,A_{\pi(n)}}(\pi C) \qquad \forall C \in \mathscr{B}(\mathbb{R}^n),$$
where $\pi C = \{(x_{\pi(1)},\ldots,x_{\pi(n)}) : (x_1,\ldots,x_n) \in C\}$.

(C2) $Q_{\mathbb{X}} = \delta_1$, where $\delta_x$ represents the point mass at $x$.

(C3) For any family of sets $A_1,\ldots,A_n$ in $\mathscr{X}$, let $D_1,\ldots,D_h$ be a measurable partition of $\mathbb{X}$ finer than the partition generated by $A_1,\ldots,A_n$. Then, for any $C \in \mathscr{B}(\mathbb{R}^n)$,
$$Q_{A_1,\ldots,A_n}(C) = Q_{D_1,\ldots,D_h}(C'),$$
where
$$C' = \Big\{(x_1,\ldots,x_h) \in [0,1]^h : \Big(\sum_{(1)} x_i,\ \ldots,\ \sum_{(n)} x_i\Big) \in C\Big\}.$$


Figure 1. N–IG and Dirichlet Densities. (a) N–IG (dotted) and Dirichlet (solid) densities on (0,1) with $p_1 = \tilde p_1 = 1/2$, $a = .5$, and $\tilde a = 1.737$. (b) Dirichlet density on $\Delta_2$ with $\tilde p_1 = \tilde p_2 = 1/3$ and $\tilde a = 1.737$. (c) N–IG density on $\Delta_2$ with $p_1 = p_2 = 1/3$ and $a = .5$. (d) N–IG (dotted) and Dirichlet (solid) densities on (0,1) with $p_1 = \tilde p_1 = 1/2$, $a = 1.8$, and $\tilde a = 3.270$. (e) Dirichlet density on $\Delta_2$ with $\tilde p_1 = \tilde p_2 = 1/3$ and $\tilde a = 3.270$. (f) N–IG density on $\Delta_2$ with $p_1 = p_2 = 1/3$ and $a = 1.8$.

(C4) For any sequence $(A_n)_{n\ge1}$ of measurable subsets of $\mathbb{X}$ such that $A_n \downarrow \emptyset$,
$$Q_{A_n} \Rightarrow \delta_0,$$
where the symbol $\Rightarrow$ denotes, as usual, weak convergence of a sequence of probability measures.

Hence, according to Proposition 3.9.2 of Regazzini (2001), there exists a unique random probability measure admitting $\mathcal{P}$ as its family of finite-dimensional distributions. We term such a random probability measure $\tilde P$ an N–IG process.

3.2 Relationship to Other Classes of Discrete Random Probability Measures

In this section we discuss the relationship of the N–IG process to other classes of discrete random probability measures. First, recall that Ferguson (1973) also proposed an alternative construction of the Dirichlet process, as a normalized gamma process. The same can be done in this case, by replacing the gamma process with an inverse-Gaussian process, that is, an increasing Lévy process $\xi = \{\xi_t : t \ge 0\}$, which is uniquely characterized by its Lévy measure $\nu(dv) = (2\pi v^3)^{-1/2} e^{-v/2}\, dv$. If $\alpha$ is a finite and nonnull measure on $\mathbb{R}$, then the time change $t = A(x) = \alpha((-\infty, x])$ yields a reparameterized process $\xi_A$, which still has independent, but generally not stationary, increments. Because $\xi_A$ is finite and positive almost surely, we are in a position to consider

$$\tilde F(x) = \frac{\xi_{A(x)}}{\xi_a} \qquad \text{for every } x \in \mathbb{R} \tag{7}$$


Figure 2. Contour Plots for the Dirichlet [(a)-(c)] and N–IG [(d)-(f)] Densities on $\Delta_2$ With $p_1 = p_2 = \tilde p_1 = \tilde p_2 = 1/3$, for Different Choices of $a$ and $\tilde a$.

as a random probability distribution function on $\mathbb{R}$, where $a = \alpha(\mathbb{R})$. The corresponding random probability measure $\tilde P$ is an N–IG process on $\mathbb{R}$. As shown by Regazzini et al. (2003), such a construction holds for any increasing additive process, provided that the total mass of the Lévy measure is unbounded, giving rise to the class of normalized RMI. James (2002) extended normalized RMIs to Polish spaces. Thus the N–IG process clearly represents a particular case of normalized RMI. We remark that the idea of defining a general class of discrete random probability measures via normalization is due to Kingman (1975); previous stimulating work was done by Perman et al. (1992) and Pitman (2003).

The N–IG process, provided that $\alpha$ is nonatomic, turns out to also be a species-sampling model. This class of random probability measures, due to Pitman (1996), is defined as

$$\tilde P(\cdot) = \sum_{i\ge1} P_i\, \delta_{X_i}(\cdot) + \Big(1 - \sum_{i\ge1} P_i\Big) H(\cdot), \tag{8}$$

where $0 < P_i < 1$ are random weights such that $\sum_{i\ge1} P_i \le 1$, independent of the locations $X_i$, which are iid with some nonatomic distribution $H$. A species-sampling model is completely characterized by $H$, which is the prior guess at the shape of $\tilde P$, and the so-called "exchangeable partition probability function" (EPPF) (see Pitman 1996). Although the definition of species-sampling model provides an intuitive and quite general framework, it leaves a difficult open problem, namely the concrete assignment of the random weights $P_i$. It is clear that such an issue is crucial for applicability of these models. Up to now, all tractable species-sampling models have been defined by exploiting the so-called "stick-breaking" procedure, already adopted by Halmos (1944) and Freedman (1963). (See Ishwaran and James 2001 and references therein for examples of what they call "stick-breaking priors.") More recently, Pitman (2003) considered Poisson–Kingman models, a subclass of normalized RMI, which are defined via normalization


of homogeneous random measures with independent increments, under the additional assumption of $\alpha$ being nonatomic. He showed that Poisson–Kingman models are a subclass of species-sampling models and characterized them by deriving an expression of the EPPF. From these considerations it follows that an N–IG process, provided that $\alpha$ is nonatomic, is also a species-sampling model. Note that in the N–IG case a closed-form expression of the EPPF can be derived: based on work of Pitman (2003) or James (2002), combined with the arguments used in the proof of Proposition 3, one obtains (A.1) in the Appendix. In this article we do not pursue the approach based on Poisson process partition calculus, and we refer the reader to the article by James (2002) for extensive calculus relative to EPPFs. We just remark that the N–IG process represents the first example (at least to our knowledge) of a tractable species-sampling model that cannot be derived via a stick-breaking procedure.

Before proceeding, it seems worthwhile to point out that the peculiarity of the N–IG and Dirichlet processes, compared with other members of these classes (and indeed within all random probability measures), is represented by the fact that their finite-dimensional distributions are known explicitly.

3.3 Some Properties of the Normalized Inverse-Gaussian Process

In this section we study some properties of the N–IG process. In particular, we discuss discreteness, derive some of its moments and the predictive distribution, and consider the distribution of the number of distinct observations within a sample drawn from an N–IG process.

A first interesting issue is the almost-sure discreteness of an N–IG process. On the basis of the construction provided in Section 3.1, it follows that the distribution of the N–IG process admits a version that selects discrete probability measures almost surely. By a result of James (2003) that makes use of the construction in (7), we have that all versions of $\tilde P$ are discrete. Hence we can affirm that the N–IG process is discrete almost surely.

The moments of an N–IG process with parameter $\alpha$ follow immediately from Proposition 2. Let $B, B_1, B_2 \in \mathscr{X}$, and set $C = B_1 \cap B_2$ and $P_0(\cdot) = \alpha(\cdot)/a$. Then

$$\mathrm{E}[\tilde P(B)] = P_0(B),$$
$$\mathrm{var}[\tilde P(B)] = P_0(B)\big(1 - P_0(B)\big)\, a^2 e^a\, \Gamma(-2, a),$$

and

$$\mathrm{cov}\big(\tilde P(B_1), \tilde P(B_2)\big) = \big[P_0(C) - P_0(B_1)P_0(B_2)\big]\, a^2 e^a\, \Gamma(-2, a).$$

Note that if $P_0$ is diffuse, then the mean of $\tilde P$ follows also from the fact that the N–IG process is a species-sampling model. These quantities are usually exploited for incorporating real qualitative prior knowledge into the model. For instance, Walker and Damien (1998) suggested controlling the prior guess $P_0$, as well as the variance of the random probability measure at issue. Walker, Damien, Laud, and Smith (1999) provided a detailed discussion of the prior specification in nonparametric problems.

An important goal in inferential procedures is predicting future values of a random quantity based on its past outcomes. Suppose that a sequence $(X_n)_{n\ge1}$ of exchangeable observations is defined in such a way that, given $\tilde P$, the $X_i$'s are iid with distribution $\tilde P$. Moreover, let $X^*_1,\ldots,X^*_k$ denote the $k$ distinct observations within the sample $X^{(n)} = (X_1,\ldots,X_n)$, with $n_j > 0$ terms being equal to $X^*_j$ for $j = 1,\ldots,k$ and such that $\sum_j n_j = n$. The next proposition provides the predictive distributions corresponding to an N–IG process.

Proposition 3. If $\tilde P$ is an N–IG process with diffuse parameter $\alpha$, then the predictive distributions are of the form

$$\Pr\big(X_{n+1} \in B \mid X^{(n)}\big) = w^{(n)}_0\, P_0(B) + w^{(n)}_1 \sum_{j=1}^{k} (n_j - 1/2)\, \delta_{X^*_j}(B)$$

for every $B \in \mathscr{X}$, with

$$w^{(n)}_0 = \frac{\sum_{r=0}^{n} \binom{n}{r} (-a^2)^{-r+1}\, \Gamma(k+1+2r-2n,\, a)}{2n \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a)}$$

and

$$w^{(n)}_1 = \frac{\sum_{r=0}^{n} \binom{n}{r} (-a^2)^{-r+1}\, \Gamma(k+2r-2n,\, a)}{n \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a)}.$$

Note that the proof of Proposition 3 exploits a general result of Pitman (2003), later derived independently by means of a different technique by Prünster (2002) (see also James 2002).

Thus the predictive distribution is a linear combination of the prior guess $P_0$ and a weighted empirical distribution, with explicit expression for the weights. Moreover, a generalized Pólya urn scheme follows immediately, which is given in (11). To highlight its distinctive features, it is useful to compare it with the prediction rule of the Dirichlet process. In the latter case, one has that $X_{n+1}$ is new with probability $\tilde a/(\tilde a+n)$ and that it coincides with $X^*_j$ with probability $n_j/(\tilde a+n)$ for $j = 1,\ldots,k$. Thus the probability allocated to previous observations is $n/(\tilde a+n)$ and does not depend on the number $k$ of distinct observations within the sample. Moreover, the weight assigned to each $X^*_j$ depends only on the number of observations equal to $X^*_j$, a characterizing property of the Dirichlet process that at the same time represents one of its drawbacks (see Ferguson 1974). In contrast, for the N–IG process the mechanism is quite interesting and exploits all available information. Given $X^{(n)}$, the $(n+1)$st observation is new with probability $w^{(n)}_0$ and coincides with one of the previous ones with probability $(n - k/2)\, w^{(n)}_1$. In this case the balancing between new and old observations takes the number of distinct values $k$ into account: it is easy to verify numerically that $w^{(n)}_0$ is decreasing as $k$ decreases, and thus one removes more mass from the prior if fewer distinct observations (i.e., more ties) are recorded. Moreover, allocation of the probability to each $X^*_j$ is more elaborate than in the Dirichlet case: instead of increasing the weight proportionally to the number of ties, the probability assigned to $X^*_j$ is reinforced more than proportionally each time a new tie is recorded. This can be explained by the argument that the more often $X^*_j$ is observed, the stronger the statistical evidence in its favor, and thus it is sensible to reallocate mass toward it.

A small numerical example may clarify this point. When comparing the Dirichlet and N–IG processes, it is sensible to fix their parameters such that they have the same mean and variance, as was done in Section 2. Recall that this is tantamount


Table 1. Posterior Probability Allocation of the Dirichlet and N–IG Processes

n = 10                  Dirichlet process    N–IG process
New observation              .1480              .2182
X*_1 = .3, n_1 = 4           .3408              .3420
X*_2 = .1, n_2 = 3           .2556              .2443
X*_3 = .6, n_3 = 2           .1704              .1466
X*_4 = .5, n_4 = 1           .0852              .0489

to requiring them to have the same prior guess $P_0$ and imposing $\mathrm{var}(D_\alpha(B)) = \mathrm{var}(\tilde P(B))$ for every $B \in \mathscr{X}$, which is equivalent to $(\tilde a + 1)^{-1} = a^2 e^a \Gamma(-2, a)$, where $\tilde a$ and $a$ are the total masses of the parameter measures of the Dirichlet and N–IG processes.

To highlight the reinforcement mechanism, let $\mathbb{X} = [0,1]$ and $P_0$ equal to the uniform distribution, with $a = .5$ and $\tilde a = 1.737$. Suppose that one has observed $X^{(10)}$ with four observations equal to .3, three observations equal to .1, two observations equal to .6, and one observation equal to .5. Table 1 gives the probabilities assigned by the predictive distribution of the Dirichlet and N–IG processes.

From this table the reinforcement becomes apparent. In the Dirichlet case, having four ties at .3 means assigning to it four times the probability given to an observation that appeared once. For the N–IG process things are very different: the mass assigned to the prior (i.e., to observing a new value) is higher, indicating the greater flexibility of the N–IG prior. Moreover, the probability assigned to .6 is more than two times the mass assigned to .5, the probability assigned to .1 is greater than 3/2 times the mass given to .6, and so on. Having a sample with many ties means having significant statistical evidence with reference to the associated clustering structure, and the N–IG process makes use of this information.
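The weights of Proposition 3 are directly computable. The sketch below (ours, assuming mpmath; high precision is advisable because the alternating sums cancel badly, consistent with the 14-decimal precision mentioned in Section 4.1) should reproduce the N–IG column of Table 1 for $n = 10$, $k = 4$, $a = .5$:

```python
# Predictive weights w0, w1 of Proposition 3. The upper incomplete gamma
# Gamma(s, x) at integer s is obtained by recursion from Gamma(0, x) = E1(x)
# and Gamma(1, x) = e^{-x}, using Gamma(s+1, x) = s Gamma(s, x) + x^s e^{-x}.
from mpmath import mp, mpf, exp, e1, binomial

mp.dps = 50

def inc_gamma(s, x):
    """Upper incomplete Gamma(s, x) for any integer s."""
    if s >= 1:
        g, t = exp(-x), 1                          # Gamma(1, x) = e^{-x}
        while t < s:
            g = t * g + x**t * exp(-x)             # Gamma(t+1, x)
            t += 1
        return g
    g = e1(x)                                      # Gamma(0, x)
    for t in range(0, s, -1):
        g = (g - x**(t - 1) * exp(-x)) / (t - 1)   # Gamma(t-1, x)
    return g

def nig_predictive_weights(n, k, a):
    num0 = sum(binomial(n, r) * (-a**2)**(-r + 1)
               * inc_gamma(k + 1 + 2*r - 2*n, a) for r in range(n + 1))
    num1 = sum(binomial(n, r) * (-a**2)**(-r + 1)
               * inc_gamma(k + 2*r - 2*n, a) for r in range(n + 1))
    den = sum(binomial(n - 1, r) * (-a**2)**(-r)
              * inc_gamma(k + 2 + 2*r - 2*n, a) for r in range(n))
    return num0 / (2 * n * den), num1 / (n * den)

w0, w1 = nig_predictive_weights(10, 4, mpf('0.5'))
print(w0)                     # mass on a new value; compare .2182 in Table 1
print((4 - mpf('0.5')) * w1)  # mass on X*_1 (n_1 = 4); compare .3420
```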

A further remarkable issue concerns determination of the prior distribution for the number $k$ of distinct observations $X^*_i$'s among the $n$ being observed. Antoniak (1974) remarked that such a distribution is induced by the distribution assigned to the random probability measure and gave an expression for the Dirichlet case. The corresponding formula for the N–IG process is given in the following proposition.

Proposition 4. The distribution of the number of distinct observations $k$ in a sample of size $n$ is given by

$$p(k|n) = \binom{2n-k-1}{n-1} \frac{e^a (-a^2)^{n-1}}{2^{2n-k-1}\, \Gamma(k)} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a) \tag{9}$$

for $k = 1,\ldots,n$.

Note that the expression for $p(k|n)$ in the Dirichlet case is

$$c_n(k)\, \tilde a^k\, \frac{\Gamma(\tilde a)}{\Gamma(\tilde a + n)}, \tag{10}$$

where $c_n(k)$ is the absolute value of a Stirling number of the first kind (for details, see Green and Richardson 2001). Unlike in (10), the evaluation of which can be achieved by resorting to recursive formulas for Stirling numbers, evaluation of the expression in (9) is straightforward. Table 2 compares the priors $p(k|n)$ corresponding to the Dirichlet and N–IG processes in the same setup as Table 1.

In this toy example, note that the distribution $p(\cdot|n)$ induced by the N–IG prior is flatter than that corresponding to the Dirichlet process. This fact plays a role in the context of mixture modeling, to be examined in the next section.
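Both priors on $k$ are easy to tabulate. The sketch below (ours; it assumes mpmath and the inc_gamma() helper from the previous sketch) evaluates (9) directly and (10) via the standard recursion for unsigned Stirling numbers of the first kind; it should reproduce the rows of Table 2:

```python
# Prior distribution of the number of distinct observations: (9) for the
# N-IG process and (10) for the Dirichlet process.
# Assumes inc_gamma() from the previous sketch is in scope.
from mpmath import mp, mpf, exp, gamma, binomial

mp.dps = 50

def nig_p_k(n, a):
    out = []
    for k in range(1, n + 1):
        s = sum(binomial(n - 1, r) * (-a**2)**(-r)
                * inc_gamma(k + 2 + 2*r - 2*n, a) for r in range(n))
        out.append(binomial(2*n - k - 1, n - 1) * exp(a) * (-a**2)**(n - 1)
                   / (2**(2*n - k - 1) * gamma(k)) * s)
    return out

def dirichlet_p_k(n, a_tilde):
    c = [[mpf(0)] * (n + 1) for _ in range(n + 1)]   # |Stirling, 1st kind|
    c[0][0] = mpf(1)
    for m in range(1, n + 1):
        for k in range(1, m + 1):
            c[m][k] = c[m - 1][k - 1] + (m - 1) * c[m - 1][k]
    norm = gamma(a_tilde) / gamma(a_tilde + n)
    return [c[n][k] * a_tilde**k * norm for k in range(1, n + 1)]

print(float(sum(nig_p_k(10, mpf('0.5')))))        # should be 1
print(float(dirichlet_p_k(10, mpf('1.737'))[0]))  # ~.0274, as in Table 2
```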

4. THE MIXTURE OF NORMALIZED INVERSE-GAUSSIAN PROCESS

In light of the previous results, it is clear that the reinforcement mechanism makes the N–IG prior an interesting alternative to the Dirichlet process. Nowadays the Dirichlet process is seldom used directly for assessing inferences; indeed, it is commonly exploited as the crucial building block in a hierarchical mixture model of type (1). Here we aim to study a mixture of N–IG process model, as set in (1), and suitable semiparametric variations of it. The mixture of N–IG process represents a particular case of normalized random measures driven by increasing additive processes (see Nieto-Barajas et al. 2004) and also of species-sampling mixture models (see Ishwaran and James 2003). With respect to both classes, the mixture of N–IG process stands out for its tractability.

To exploit a mixture of N–IG process for inferential purposes, it is essential to derive an appropriate sampling scheme. In such a framework, knowledge of the predictive distributions determined in Proposition 3 is crucial. Indeed, most of the simulation algorithms developed in Bayesian nonparametrics rely on variations of the Blackwell–MacQueen Pólya urn scheme and on the development of appropriate Gibbs sampling procedures. (See Escobar 1994; MacEachern 1994; Escobar and West 1995 for the Dirichlet case, and Pitman 1996; Ishwaran and James 2001 for extensions to stick-breaking priors.) In the case of an N–IG process, it follows that the joint distribution of $X^{(n)}$ can be characterized by the following generalized Pólya urn scheme. Let $Z_1,\ldots,Z_n$ be an iid sample from $P_0$. Then a sample $X^{(n)}$ from an N–IG process can be generated as follows: set $X_1 = Z_1$ and, for $i = 2,\ldots,n$,

$$\big(X_i \mid X^{(i-1)}\big) = \begin{cases} Z_i & \text{with probability } w^{(i-1)}_0, \\ X^*_{i,j} & \text{with probability } (n_j - 1/2)\, w^{(i-1)}_1, \quad j = 1,\ldots,k_i, \end{cases} \tag{11}$$

Table 2. Prior Probabilities for k Corresponding to the Dirichlet and N–IG Processes

n = 10                 k = 1   k = 2   k = 3   k = 4   k = 5   k = 6   k = 7   k = 8   k = 9   k = 10
Dirichlet (ã = 1.737)  .0274   .1350   .2670   .2870   .1850   .0756   .0196   .0031   .00029  .00001
N–IG (a = .5)          .0523   .1286   .1813   .1936   .1716   .1296   .0824   .042    .0155   .0031


Figure 3. Prior Probabilities for k (n = 100) Corresponding to the N–IG Process and the Dirichlet Process for Different Values of a (ã). These values were chosen to match the modes at 6, 14, and 28 [Dirichlet (ã = 1.2, 4.2, 12.6), from left (ã = 1.2) to right (ã = 12.6); N–IG (a = .1, 1, 5), from left (a = .1) to right (a = 5)].

where $k_i$ represents the number of distinct observations, denoted by $X^*_{i,1},\ldots,X^*_{i,k_i}$, in $X^{(i-1)}$, and $w^{(i-1)}_0$ and $w^{(i-1)}_1$ are as given in Proposition 3.
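A direct implementation of the urn (11) is immediate. The following sketch is ours; it assumes numpy and the nig_predictive_weights() helper from the sketch in Section 3.3, and takes $P_0 = N(0,1)$ as an illustrative choice:

```python
# Generalized Polya urn (11): X_1 = Z_1 and, at step i, a new draw from
# P0 with probability w0, or a tie at X*_j with probability (n_j - 1/2) w1.
import numpy as np

def sample_nig_urn(n, a, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    X = [rng.standard_normal()]                      # Z_1 ~ P0 = N(0, 1)
    for i in range(2, n + 1):
        values, counts = np.unique(X, return_counts=True)
        k = len(values)
        w0, w1 = (float(w) for w in nig_predictive_weights(i - 1, k, a))
        probs = np.append(w0, (counts - 0.5) * w1)   # new value, then ties
        probs /= probs.sum()                         # guards rounding error
        j = rng.choice(k + 1, p=probs)
        X.append(rng.standard_normal() if j == 0 else float(values[j - 1]))
    return np.array(X)

X = sample_nig_urn(50, 0.5)
print(len(np.unique(X)))   # number k of distinct values among n = 50
```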

Moreover, the result in Proposition 4 can be interpreted as providing the prior distribution of the number $k$ of components in the mixture for a given sample size $n$. As in the Dirichlet case, a smaller total mass $a$ yields a $p(\cdot|n)$ more concentrated on smaller values of $k$. This can be explained by the fact that a smaller $a$ gives rise to a smaller $w^{(n)}_0$; that is, it generates new data with lower probability. However, the $p(\cdot|n)$ induced by the N–IG prior is apparently less informative than that corresponding to the Dirichlet process prior and thus is more robust with respect to changes in $a$. A qualitative illustration is given in Figure 3, where the distribution of $k$ given $n = 100$ observations is depicted for the N–IG and Dirichlet processes. We connected the probability points by straight lines only for visual simplification. Note that the mode of the distribution of $k$ corresponding to the N–IG prior never has probability larger than .07.

Recently, new algorithms have been proposed for dealing with mixtures. Extending the work of Brunner, Chan, James, and Lo (2001) on an iid sequential importance sampling procedure for fitting MDPs, Ishwaran and James (2003) proposed a generalized weighted Chinese restaurant algorithm that covers species-sampling mixture models. They formally derived the posterior distribution of a species-sampling mixture. To draw approximate samples from this distribution, they devised a computational scheme that requires knowledge of the conditional distribution of the species-sampling model $\tilde P$ given the missing values $X_1,\ldots,X_n$. When feasible, such an algorithm has the merit of reducing the Monte Carlo error (see Ishwaran and James 2003 for details and further references). In the case of an N–IG process, sampling from its posterior law is not straightforward, because it is characterized in terms of a latent variable (see James 2002). Further investigations are needed to implement these interesting sampling procedures efficiently in the N–IG case, and more generally for normalized RMI. Problems of the same type arise if one is willing to implement the scheme proposed by Nieto-Barajas et al. (2004).

To carry out a more detailed comparison between Dirichlet and N–IG mixtures, we next consider two illustrative examples.

4.1 Simulated Data

Here we consider simulated data sampled from uniform mixtures of normal distributions with different numbers of components and compare the behavior of the MDP and mixture of N–IG process for different choices of priors on the number of components. We then evaluate the performance in terms of posterior probabilities on the correct number of components and in terms of log-Bayes factors for testing $k = k^*$ against $k \neq k^*$, where $k^*$ represents the correct number of components.

We first analyze a dataset $Y^{(100)}$ simulated from a mixture of three normal distributions with means $-4$, $0$, and $8$, variance $1$, and corresponding weights .33, .34, and .33. Let $N(\cdot \mid m, v)$ denote the normal distribution with mean $m$ and variance $v > 0$. We consider the following hierarchical model to represent such data:

$$(Y_i \mid X_i) \overset{\text{ind}}{\sim} N(Y_i \mid X_i, 1), \qquad i = 1,\ldots,100,$$
$$(X_i \mid \tilde P) \overset{\text{iid}}{\sim} \tilde P,$$
$$\tilde P \sim \mathcal{P},$$

where $\mathcal{P}$ refers to either an N–IG process or a Dirichlet process. Both are centered at the same prior guess $N(\cdot \mid \bar Y, t^2)$, where $t$ is the data range, that is, $t = \max Y_i - \min Y_i$. To fix the total masses $a$ and $\tilde a$, we no longer consider the issue of


Figure 4. Prior Probabilities for k (n = 100) Corresponding to the Dirichlet and N–IG Processes [Dirichlet (ã = 2); N–IG (a = .2365)].

matching variances. For comparative purposes, in this mixture-modeling framework it seems more plausible to start with similar priors for the number of components $k$. Hence $a$ and $\tilde a$ are such that the mode of $p(\cdot|n)$ is the same in both the Dirichlet and N–IG cases. We choose to set the mode equal to $k = 8$ in both cases, yielding $\tilde a = 2$ and $a = .2365$. The corresponding distributions are plotted in Figure 4.

In this setup we draw a sample from the posterior distribution $\mathcal{L}(X^{(100)} \mid Y^{(100)})$ by iteratively drawing samples from the distributions of $(X_i \mid X^{(100)}_{-i}, Y^{(100)})$ for $i = 1,\ldots,100$, given by

$$P\big(X_i \in \cdot \mid X^{(100)}_{-i}, Y^{(100)}\big) = q^*_{i0}\, N\!\left(X_i \in \cdot \,\Big|\, \frac{t^2 Y_i + \bar Y}{1+t^2},\ \frac{t^2}{t^2+1}\right) + \sum_{j=1}^{k_i} q^*_{ij}\, \delta_{X^*_j}(\cdot), \tag{12}$$

where $X^{(100)}_{-i}$ is the vector $X^{(100)}$ with the $i$th coordinate deleted and $k_i$ refers to the number of $X^*_j$'s in $X^{(100)}_{-i}$. The mixing proportions are given by

$$q^*_{i0} \propto w^{(99)}_{i0}\, N(\cdot \mid \bar Y,\ t^2+1) \qquad \text{and} \qquad q^*_{ij} \propto (n_j - 1/2)\, w^{(99)}_{i1}\, N(\cdot \mid X^*_j,\ 1),$$

subject to the constraint $\sum_{j=0}^{k_i} q^*_{ij} = 1$. The values for $w^{(99)}_{i0}$ and $w^{(99)}_{i1}$ are given as in Proposition 3, applied to determine the "predictive" distribution of $X_i$ given $X^{(100)}_{-i}$; they are computed with a precision of 14 decimal places. Note that, in general, $w^{(99)}_{i0}$ and $w^{(99)}_{i1}$ depend only on $a$, $n$, and $k_i$, and not on the specific partition $\xi$ of $n = n_{\xi_1} + \cdots + n_{\xi_k}$. This implies that a table containing the values of $w^{(99)}_{i0}$ and $w^{(99)}_{i1}$ for given $a$, $n$, and $k = 1,\ldots,99$ can be generated in advance for use in the Pólya urn Gibbs sampler. Further details are available on request from the authors.
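A minimal sketch of one sweep of this Pólya urn Gibbs sampler follows (ours; it assumes numpy/scipy and the nig_predictive_weights() helper from Section 3.3, and it recomputes the weights at each step rather than using the precomputed table just described):

```python
# One Gibbs sweep for the N-IG mixture of normals, following (12).
import numpy as np
from scipy.stats import norm

def gibbs_sweep(X, Y, a, t2, ybar, rng):
    n = len(Y)
    for i in range(n):
        X_minus = np.delete(X, i)
        values, counts = np.unique(X_minus, return_counts=True)
        k = len(values)
        w0, w1 = (float(w) for w in nig_predictive_weights(n - 1, k, a))
        q = np.append(w0 * norm.pdf(Y[i], ybar, np.sqrt(t2 + 1.0)),
                      (counts - 0.5) * w1 * norm.pdf(Y[i], values, 1.0))
        q /= q.sum()
        j = rng.choice(k + 1, p=q)
        if j == 0:   # open a new cluster: conjugate normal update
            X[i] = rng.normal((t2 * Y[i] + ybar) / (1.0 + t2),
                              np.sqrt(t2 / (t2 + 1.0)))
        else:        # join an existing cluster
            X[i] = values[j - 1]
    return X

# Usage sketch: Y = data; X = Y.astype(float).copy()
# t2 = (Y.max() - Y.min())**2; ybar = Y.mean()
# rng = np.random.default_rng(1)
# for _ in range(12000):
#     X = gibbs_sweep(X, Y, a=0.2365, t2=t2, ybar=ybar, rng=rng)
```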

To obtain our estimates, we resorted to the Pólya urn Gibbs sampler, such as the one set forth by Escobar and West (1995). As pointed out by Ishwaran and James (2001), such a generalized Pólya urn characterization can also be considered for the N–IG case, the only ingredients being the weights of the predictive distribution determined in Proposition 3. All of the following results were obtained by implementing the MCMC algorithm using 10,000 iterations after 2,000 burn-in sweeps. Figure 5 depicts the posterior density estimates and the true density that generated the data.

The predictive density corresponding to the mixture of N–IG process clearly represents a more accurate estimate. Table 3 provides the prior and posterior probabilities of the number of components. The prior probabilities are computed according to Proposition 4, whereas the posterior probabilities follow from the MCMC output. One may note that the posterior mass assigned to $k = 3$ by the N–IG process is significantly higher than that corresponding to the MDP.

As far as the log-Bayes factors are concerned, one obtains $BF_{D_\alpha} = 5.46$ and $BF_{\text{N–IG}_\alpha} = 4.97$, thus favoring the MDP. The reason for this outcome seems to be the fact that the prior probability on the correct number of components is much lower for the MDP.

It is interesting to look also at the case in which the mode of $p(k|n)$ corresponds to the correct number of components, $k^* = 3$. In such a case the prior probability on $k^* = 3$ is much higher for the MDP, being .285 against .0567 for the mixture of N–IG process. This results in a posterior probability on $k^* = 3$ of .9168 for the MDP and .86007 for the mixture of N–IG process, whereas the log-Bayes factors favor the N–IG mixture, being $BF_{D_\alpha} = 3.32$ and $BF_{\text{N–IG}_\alpha} = 4.63$. This again seems to be due to the fact that the prior probability of one process on the correct number of components is much lower than that corresponding to the other.


Figure 5. Posterior Density Estimates for the Mixture of Three Normals With Means −4, 0, 8, Variance 1, and Mixing Proportions .33, .34, and .33 [Dirichlet (ã = 2); N–IG (a = .2365)].

The next dataset $Y^{(120)}$ that we deal with in this framework is drawn from a uniform mixture of six normal distributions with means $-10$, $-6$, $-2$, $2$, $6$, and $10$ and variance .7. Both the Dirichlet and N–IG processes are again centered at the same prior guess $N(\cdot \mid \bar Y, t^2)$, where $t$ represents the data range. Set $a = .01$ and $\tilde a = .5$, such that $p(k|n)$ has mode at $k = 3$ for both the MDP and the mixture of N–IG process. Such an example is meant to shed some light on the ability of the mixture of N–IG process to detect clusters when starting from a prior with mode at a small number of clusters and data coming from a mixture with more components. In this case the mixture of N–IG process performs better than the MDP, both in terms of posterior probabilities, as shown in Table 4, and in terms of log-Bayes factors, because $BF_{D_\alpha} = 3.13$ whereas $BF_{\text{N–IG}_\alpha} = 3.78$.

If one considers data drawn from mixtures with more components, then the mixture of N–IG process does systematically better in terms of posterior probabilities, but the phenomenon of "relatively low probability on the correct number of components" of the MDP reappears, leading it to be favored in terms of Bayes factors. For instance, we have taken 120 data from a uniform mixture of eight normals with means $-14$, $-10$, $-6$, $-2$, $2$, $6$, $10$, and $14$ and variance .7, and set $a = .01$ and $\tilde a = .5$, so that $p(k|120)$ has mode at $k = 3$. This yields a priori probabilities on $k^* = 8$ of .00568 and .0489, and a posteriori probabilities of .0937 and .4338, for the MDP and mixture of N–IG process. In contrast, the log-Bayes factors are $BF_{D_\alpha} = 2.90$ and $BF_{\text{N–IG}_\alpha} = 2.73$. One can alternatively match the a priori probability on the correct number of components, such that $p(8|120) = .04765$ for both priors. This happens if we set $\tilde a = .86$, which results in the MDP having mode at $k = 4$, closer to the correct number $k^* = 8$, whereas the mixture of N–IG process remains unchanged. This results in a posterior probability on $k^* = 8$ for the MDP of .1134 and, having removed the influence of the low probability $p(8|120)$, in a log-Bayes factor of .94.

4.2 Galaxy Data

In this section we reanalyze the galaxy dataset popularized by Roeder (1990). These data, consisting of the relative velocities (in km/sec) of 82 galaxies, have been widely studied within the framework of Bayesian statistics (see, e.g., Roeder and Wasserman 1997; Green and Richardson 2001; Petrone and Veronese 2002). For comparison purposes, we use the following semiparametric setting, also analyzed by Escobar and West (1995):

$$(Y_i \mid m_i, V_i) \overset{\text{ind}}{\sim} N(Y_i \mid m_i, V_i), \qquad i = 1,\ldots,82,$$
$$\big((m_i, V_i) \mid \tilde P\big) \overset{\text{iid}}{\sim} \tilde P,$$
$$\tilde P \sim \mathcal{P},$$

Table 3. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|100) Has Mode at k = 8

n = 100                       k ≤ 2     k = 3     k = 4     k = 5     k = 6     k = 7     k = 8     k ≥ 9
Dirichlet (ã = 2)    Prior    .00225    .00997    .03057    .06692    .11198    .14974    .16501    .46356
                 Posterior              .7038     .2509     .0406     .0046     .0001
N–IG (a = .2365)     Prior    .02404    .0311     .0419     .04937    .05386    .05611    .05677    .68685
                 Posterior              .8223     .1612     .0160     .0005


Table 4. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|120) Has Mode at 3

n = 120                    k = 1     k = 2     k = 3     k = 4     k = 5     k = 6     k = 7     k = 8      k ≥ 9
Dirichlet (ã = .5)  Prior  .08099    .21706    .27433    .21954    .12583    .05532    .01949    .005678    .00176
               Posterior                                 .0148     .3455     .5722     .064      .0033      .0002
N–IG (a = .01)      Prior  .04389    .05096    .05169    .05142    .05082    .04997    .0489     .04765     .6047
               Posterior                                           .0085     .6981     .2395     .0479      .006

where, as before, $\mathcal{P}$ represents either the N–IG process or the Dirichlet process. Again, the prior guess $P_0$ is the same for both nonparametric priors and is characterized by

$$P_0(A) = \int_A N(x \mid \mu, \tau v^{-1})\, \mathrm{Ga}(v \mid s/2, S/2)\, dx\, dv,$$

where $A \in \mathscr{B}(\mathbb{R}\times\mathbb{R}^+)$ and $\mathrm{Ga}(\cdot \mid c, d)$ is the density corresponding to a gamma distribution with mean $c/d$. Similar to Escobar and West (1995), we assume a further hierarchy for $\mu$ and $\tau$, namely $\mu \sim N(\cdot \mid g, h)$ and $\tau^{-1} \sim \mathrm{Ga}(\cdot \mid w, W)$. We choose the hyperparameters for this illustration to fit those used by Escobar and West (1995), namely $g = 0$, $h = .001$, $w = 1$, and $W = 100$. For comparative purposes, we again choose $a$ to achieve the mode match: if the mode of $p(\cdot|82)$ is at $k = 5$, as it is for the Dirichlet process with $\tilde a = 1$, then $a = .0829$. From Table 5 it is apparent that the N–IG process provides a noninformative prior for $k$ compared with the distribution induced by the Dirichlet process.

The details of the Pólya urn Gibbs sampler are as provided by Escobar and West (1995), with the only difference being the replacement of the weights with those corresponding to the N–IG process, given in Proposition 3. Figure 6 shows the posterior density estimates based on 10,000 iterations considered after a burn-in period of 2,000 sweeps. Table 5 displays the prior and the posterior probabilities of the number of components in the mixture.

Some comments on the results in Table 5 are in order. Escobar and West (1995) extensively discussed the issue of learning about the total mass $a$. This is motivated by the sensitivity of posterior inferences to the choice of the parameter $a$; hence they randomized $a$ and specified a prior distribution for it. It seems worth noting that the posterior distribution of the number of components for the N–IG process with a fixed $a$ essentially coincides with the corresponding distribution arising from an MDP with random $a$, given in Table 6 of Escobar and West (1995). Some explanations of this phenomenon might be that the prior $p(k|n)$ is essentially noninformative in a broad neighborhood of its mode, thus mitigating the influence of $a$ on posterior inferences, and that the aggressiveness in reducing/detecting clusters of the N–IG mixture shown in the previous example seems also to be a plausible reason for such a gain in "parsimony." However, for a mixture of N–IG process it may be of interest to have a random $a$. To achieve this, a Metropolis step must be added to the algorithm. In principle this is straightforward; the only drawback would be an increase in the computational burden, because computing the weights for the N–IG process is, after all, not as quick as computing those corresponding to the Dirichlet process.
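For completeness, here is a hypothetical sketch of such a Metropolis step (ours, not the authors'; it assumes the nig_p_k() helper from Section 3.3 and treats $p(k \mid n, a)$ from Proposition 4 as the conditional likelihood of $a$ given the current number of clusters $k$, with an Exp(1) prior on $a$ as an arbitrary illustrative choice):

```python
# Random-walk Metropolis update for the total mass a (hypothetical sketch).
import numpy as np
from mpmath import mpf, log

def update_total_mass(a, k, n, rng, step=0.25):
    a_new = a + step * rng.standard_normal()
    if a_new <= 0:
        return a                                   # reject: a must be positive
    log_ratio = (float(log(nig_p_k(n, mpf(a_new))[k - 1]))
                 - float(log(nig_p_k(n, mpf(a))[k - 1]))
                 - (a_new - a))                    # Exp(1) prior on a
    return a_new if np.log(rng.uniform()) < log_ratio else a
```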

5. THE MEAN OF A NORMALIZED INVERSE-GAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for inference with continuous data is represented by histogram smoothing. Such a problem can be handled by exploiting the so-called "filtered-variate" random probability measures, as defined by Dickey and Jiang (1998). These quantities essentially coincide with suitable means of random probability measures. Here we focus attention on means of N–IG processes. After the recent breakthrough achieved by Cifarelli and Regazzini (1990), this has become a very active area of research in Bayesian nonparametrics (see, e.g., Diaconis and Kemperman 1996; Regazzini, Guglielmi, and Di Nunno 2002 for the Dirichlet case; and Epifani, Lijoi, and Prünster 2003; Hjort 2003; James 2002 for results beyond the Dirichlet case). In particular, Regazzini et al. (2003) dealt with means of normalized priors, providing conditions for their existence and their exact prior and approximate posterior distribution. Here we specialize their general results to the N–IG process and derive the exact posterior density.

Letting $\mathbb{X} = \mathbb{R}$ and $\mathscr{X} = \mathscr{B}(\mathbb{R})$, we study the distribution of the mean $\int_{\mathbb{R}} x\, \tilde P(dx)$. The first issue to face is its finiteness. A necessary and sufficient condition for this to hold can be derived from Proposition 1 of Regazzini et al. (2003) and is given by

$$\int_{\mathbb{R}} \sqrt{2\lambda |x| + 1}\; \alpha(dx) < +\infty \qquad \text{for every } \lambda > 0.$$

As far as the problem of the distribution of a mean is concerned, Proposition 2 of Regazzini et al. (2003) and some calculations lead to an expression of the prior distribution function of the mean as

$$F(\sigma) = \frac{1}{2} - \frac{e^a}{\pi} \int_0^{+\infty} \frac{1}{t} \exp\left\{-\int_{\mathbb{R}} \sqrt[4]{1+4t^2(x-\sigma)^2}\, \cos\left[\frac{1}{2}\arctan\big(2t(x-\sigma)\big)\right] \alpha(dx)\right\}$$
$$\qquad\qquad \times \sin\left\{\int_{\mathbb{R}} \sqrt[4]{1+4t^2(x-\sigma)^2}\, \sin\left[\frac{1}{2}\arctan\big(2t(x-\sigma)\big)\right] \alpha(dx)\right\} dt \tag{13}$$

for any $\sigma \in \mathbb{R}$.
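Expression (13) can be evaluated by routine numerical integration. The sketch below (ours, assuming scipy and taking $\alpha = a \cdot N(0,1)$, which satisfies the finiteness condition above) nests two quadratures; at $\sigma = 0$ it should return 1/2 by symmetry of $P_0$:

```python
# Prior cdf (13) of the mean of an N-IG process with alpha = a * N(0, 1).
import numpy as np
from scipy.integrate import quad

def F_mean(sigma, a):
    def inner(t, trig):   # integral over x against alpha(dx)
        g = lambda x: ((1 + 4 * t**2 * (x - sigma)**2) ** 0.25
                       * trig(0.5 * np.arctan(2 * t * (x - sigma)))
                       * a * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
        return quad(g, -np.inf, np.inf)[0]
    outer = lambda t: (np.exp(-inner(t, np.cos))
                       * np.sin(inner(t, np.sin)) / t)
    return 0.5 - np.exp(a) / np.pi * quad(outer, 0, np.inf, limit=200)[0]

print(F_mean(0.0, 0.5))   # ~0.5: the mean is symmetric about 0 here
```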

Table 5. Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes in the Galaxy Data Example

n = 82                     k ≤ 3    k = 4    k = 5    k = 6    k = 7    k = 8    k = 9    k = 10   k ≥ 11
Dirichlet (ã = 1)   Prior  .2140    .2060    .214     .169     .106     .0548    .0237    .0088    .0038
                Posterior           .0465    .125     .2524    .2484    .2011    .08      .019     .0276
N–IG (a = .0829)    Prior  .1309    .0617    .0627    .0621    .0607    .0586    .0562    .0534    .4537
                Posterior  .003     .0754    .168     .2301    .2338    .1225    .0941    .0352    .0379


Figure 6. Posterior Density Estimates for the Galaxy Dataset [—— N–IG (a = 0.829); - - - Dirichlet (a = 2)].

Before providing the posterior density of a mean of an N–IG process, it is useful to introduce the Liouville–Weyl fractional integral, defined as

\[
I_{c+}^{n} h(\sigma) = \int_c^{\sigma} \frac{(\sigma - u)^{n-1}}{(n-1)!}\, h(u)\, du
\]

for n ≥ 1, and set equal to the identity operator for n = 0, for any c < σ. Moreover, let Re z and Im z denote the real and imaginary parts of z ∈ ℂ.

Proposition 5. If P̃ is an N–IG process with diffuse parameter α and −∞ ≤ c = inf supp(α), then the posterior density of the mean is of the form

\[
\rho_{x^{(n)}}(\sigma) = \frac{1}{\pi}\, I_{c+}^{n-1} \operatorname{Im} \psi(\sigma) \qquad \text{if } n \text{ is even} \tag{14}
\]

and

\[
\rho_{x^{(n)}}(\sigma) = -\frac{1}{\pi}\, I_{c+}^{n-1} \operatorname{Re} \psi(\sigma) \qquad \text{if } n \text{ is odd,} \tag{15}
\]

with

\[
\psi(\sigma) = \frac{(n-1)!\; 2^{n-1}}
{a^{2n-(2+k)} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k + 2 + 2r - 2n,\, a)}
\int_0^{+\infty} t^{n-1} e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x-\sigma)}\,\alpha(dx)}
\prod_{j=1}^{k} \bigl(1 - 2it(x_j^* - \sigma)\bigr)^{-n_j + 1/2}\, dt.
\]

6. CONCLUDING REMARKS

In this article we have studied some of the statistical implications of the use of the N–IG process as an alternative to the Dirichlet process. Both priors are almost surely discrete, allow for explicit derivation of their finite-dimensional distributions, and lead to tractable expressions of relevant quantities. Their natural use is in the context of mixture modeling, where they present remarkably different behaviors. Indeed, it has been shown that the prior distribution on the number of components induced by the N–IG process is wider than that induced by the Dirichlet process. This seems to mitigate the influence of the choice of the total mass a of the parameter measure. Moreover, the mixture of N–IG process seems to be more aggressive in reducing/detecting clusters on the basis of ties present in the data. Such conclusions are, however, preliminary and still need to be validated by more extensive applications, which will constitute the object of our future research.

APPENDIX: PROOFS

A.1 Proof of Proposition 1

Having (3) at hand, the computation of (4) is as follows. One operates the transformation $W_i = V_i (\sum_{j=1}^{n} V_j)^{-1}$ for i = 1, …, n − 1 and $W_n = \sum_{j=1}^{n} V_j$. Some algebra and formula 3.471.12 of Gradshteyn and Ryzhik (2000) lead to the desired result.

A.2 Proof of Proposition 2

In proving this result, we do not use the distribution of an N–IG random variable directly. The key point is exploiting the independence of the variables (V_1, …, V_n) used for defining (W_1, …, W_n). Set $V = \sum_{j=1}^{n} V_j$ and $V_{-i} = \sum_{j \neq i} V_j$, and recall that the moment-generating function of an IG(α, γ)-distributed random variable is given by $E[e^{-\lambda X}] = e^{-\alpha(\sqrt{2\lambda + \gamma^2} - \gamma)}$. As far as the mean is concerned, for any i = 1, …, n we have

\begin{align*}
E[W_i] &= E[V_i V^{-1}]
= \int_0^{+\infty} E[e^{-uV_{-i}}]\, E[e^{-uV_i} V_i]\, du \\
&= \int_0^{+\infty} E[e^{-uV_{-i}}]\, E\!\left[ -\frac{d}{du} e^{-uV_i} \right] du \\
&= -\int_0^{+\infty} e^{-(a-\alpha_i)(\sqrt{2u+1}-1)}\, \frac{d}{du}\, e^{-\alpha_i(\sqrt{2u+1}-1)}\, du \\
&= \alpha_i \int_0^{+\infty} (2u+1)^{-1/2} e^{-a(\sqrt{2u+1}-1)}\, du \\
&= \frac{\alpha_i}{a} \int_0^{+\infty} -\frac{d}{du}\, e^{-a(\sqrt{2u+1}-1)}\, du = \frac{\alpha_i}{a} = p_i,
\end{align*}

having applied Fubini's theorem and theorem 16.8 of Billingsley (1995). The variance and covariance can be deduced by similar arguments, combined with integration by parts and some tedious algebra.
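As an aside, the moment formulas of Proposition 2 are easy to check by simulation, since W_i is an explicit function of independent IG variables. The sketch below (our illustration, not code from the paper) uses numpy's Wald sampler, noting that IG(α, 1) in the parameterization of (3) corresponds to a Wald distribution with mean α and scale α²; the helper upper_gamma, our own construction, extends Γ(s, x) to s ≤ 0 via the standard recurrence.

```python
# Monte Carlo check of E[W_i] = p_i and var[W_i] = p_i (1 - p_i) a^2 e^a Gamma(-2, a).
import numpy as np
from scipy.special import gamma, gammaincc, exp1

def upper_gamma(s, x):
    # upper incomplete gamma Gamma(s, x); scipy covers s > 0, Gamma(0, x) = E1(x),
    # and the recurrence Gamma(s, x) = (Gamma(s+1, x) - x**s * exp(-x)) / s extends it to s < 0
    if s > 0:
        return gamma(s) * gammaincc(s, x)
    if s == 0:
        return exp1(x)
    return (upper_gamma(s + 1, x) - x**s * np.exp(-x)) / s

rng = np.random.default_rng(0)
alphas = np.array([1.0, 2.0, 1.5])  # an assumed (alpha_1, ..., alpha_n)
a = alphas.sum()

# IG(alpha, gamma = 1) equals numpy's Wald(mean = alpha, scale = alpha**2)
V = np.column_stack([rng.wald(al, al**2, size=500_000) for al in alphas])
W = V / V.sum(axis=1, keepdims=True)

p = alphas / a
print(W.mean(axis=0), p)  # empirical vs. theoretical means
print(W.var(axis=0), p * (1 - p) * a**2 * np.exp(a) * upper_gamma(-2.0, a))
```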

A.3 Proof of Proposition 3

From Pitman (2003) (see also James 2002; Prünster 2002), we can deduce that the predictive distribution associated with an N–IG process with diffuse parameter α is of the form

\[
P\bigl(X_{n+1} \in \cdot \mid X^{(n)}\bigr) = w^{(n)}\, \frac{\alpha(\cdot)}{a} + \frac{1}{n} \sum_{j=1}^{k} w_j^{(n)}\, \delta_{X_j^*}(\cdot),
\]

with weights equal to

\[
w^{(n)} = \frac{a \int_{\mathbb{R}^+} u^{n} e^{-a(\sqrt{2u+1}-1)} \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, \mu_1(u)\, du}
{n \int_{\mathbb{R}^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)} \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du}
\]

and

\[
w_j^{(n)} = \frac{\int_{\mathbb{R}^+} u^{n} e^{-a(\sqrt{2u+1}-1)} \mu_{n_1}(u) \cdots \mu_{n_j+1}(u) \cdots \mu_{n_k}(u)\, du}
{\int_{\mathbb{R}^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)} \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du},
\]

where $\mu_n(u) = \int_{\mathbb{R}^+} v^{n} e^{-uv}\, \nu(dv)$ for any positive u and n = 1, 2, …, and $\nu(dv) = (2\pi v^3)^{-1/2} e^{-v/2}\, dv$. We can easily verify that $\mu_n(u) = 2^{n-1} \Gamma(n - \frac{1}{2}) (\sqrt{\pi}\, [2u+1]^{n-1/2})^{-1}$, and moreover that

\[
\int_{\mathbb{R}^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)} \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du
= \frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right)
\int_0^{+\infty} \frac{u^{n-1} e^{-a\sqrt{2u+1}}}{[2u+1]^{n-k/2}}\, du.
\]

By the change of variable $\sqrt{2u+1} = y$, the latter expression is equal to

\begin{align*}
&\frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right)
\frac{1}{2^{n-1}} \int_1^{+\infty} \frac{(y^2 - 1)^{n-1} e^{-ay}}{y^{2n-k-1}}\, dy \\
&\quad = \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right)
\sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} \int_1^{+\infty} \frac{e^{-ay}}{y^{2n-k-2r-1}}\, dy \\
&\quad = \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right)
\sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r}\, a^{2n-2-k-2r}\, \Gamma(k + 2r - 2n + 2,\, a).
\end{align*}

Now the result follows by rearranging the terms appropriately, combined with some algebra.
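The closed form for μ_n(u) used above is easy to verify numerically; the following sketch (ours, for illustration) compares it against direct quadrature of the defining integral.

```python
# Check that mu_n(u) = 2^(n-1) Gamma(n - 1/2) / (sqrt(pi) (2u + 1)^(n - 1/2)),
# where mu_n(u) = int_0^inf v^n e^(-u v) nu(dv) and nu(dv) = (2 pi v^3)^(-1/2) e^(-v/2) dv.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def mu_numeric(n, u):
    f = lambda v: v**n * (2 * np.pi * v**3) ** -0.5 * np.exp(-u * v - v / 2)
    return quad(f, 0, np.inf)[0]

def mu_closed(n, u):
    return 2 ** (n - 1) * gamma(n - 0.5) / (np.sqrt(np.pi) * (2 * u + 1) ** (n - 0.5))

print(mu_numeric(3, 0.7), mu_closed(3, 0.7))  # the two values should agree
```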

A.4 Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations (X*_1, …, X*_k) and the random partition coincides with

\[
\alpha(dx_1) \cdots \alpha(dx_k)\,
\frac{e^{a} (-a^2)^{n-1}}{a^{k}\, 2^{k-1}\, \pi^{k/2}\, \Gamma(n)}
\prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right)
\sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k + 2 + 2r - 2n,\, a).
\]

The conditional distribution of the vector (k, n_1, …, n_k), given n, can be obtained by marginalizing with respect to (X*_1, …, X*_k), thus yielding

\[
\frac{e^{a} (-a^2)^{n-1}}{2^{k-1}\, \pi^{k/2}\, \Gamma(n)}
\prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right)
\sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k + 2 + 2r - 2n,\, a). \tag{A.1}
\]

To determine p(k|n), one needs to compute

\[
\sum_{(*)} \prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right),
\]

where (*) means that the sum extends over all partitions of the set of integers {1, …, n} into k components:

\[
\sum_{(*)} \prod_{j=1}^{k} \Gamma\!\left(n_j - \frac{1}{2}\right)
= \pi^{k/2} \sum_{(*)} \prod_{j=1}^{k} \left(\frac{1}{2}\right)_{n_j - 1}
= \binom{2n-k-1}{n-1} \frac{\Gamma(n)}{\Gamma(k)}\, 2^{2k-2n},
\]

where, for any positive b, (b)_n = Γ(b + n)/Γ(b), and the result follows.
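For completeness, the resulting prior p(k|n) of Proposition 4 can be evaluated along the same lines as the predictive weights; the sketch below (our illustration, not the authors' code) checks that the probabilities sum to 1 over k = 1, …, n, again relying on mpmath for Γ(s, a) with s ≤ 0.

```python
# Sketch of the prior on the number of distinct values, p(k|n), from Proposition 4.
from mpmath import mp, mpf, binomial, gamma, gammainc, exp

mp.dps = 30

def p_k_given_n(k, n, a):
    a = mpf(a)
    s = sum(binomial(n - 1, r) * (-a**2)**(-r)
            * gammainc(k + 2 + 2*r - 2*n, a) for r in range(n))
    return binomial(2*n - k - 1, n - 1) * exp(a) * (-a**2)**(n - 1) \
        / (2**(2*n - k - 1) * gamma(k)) * s

probs = [p_k_given_n(k, n=10, a=1.737) for k in range(1, 11)]
print(sum(probs))  # should return 1 (up to numerical precision)
```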

A.5 Proof of Proposition 5

Let {ℬ_m = {A_{m,i}: i = 1, …, k_m}, m ≥ 1} be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize α by setting α_m = Σ_{i=1}^{k_m} α_{m,i} δ_{s_{m,i}}, where α_{m,i} = α(A_{m,i}) and s_{m,i} is a point in A_{m,i}. Hence, whenever the jth element in the sample, X_j, belongs to A_{m,i_j}, it is as if we observed s_{m,i_j}. Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to expressing the posterior density function for the discretized mean as (14) and (15), with

\[
\psi_m(\sigma) = -C_m\bigl(x^{(n)}\bigr)\, \frac{e^{a}\, 2^{n-k}}{(\sqrt{\pi})^{k}}
\left( \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\!\left(n_j - \frac{1}{2}\right) \right)
\int_0^{+\infty} t^{n-1} e^{-\sum_{j=1}^{k_m} \sqrt{1 - 2it(s_j - \sigma)}\, \alpha_{m,j}}
\left( \prod_{r=1}^{k} \left[ \bigl(1 - 2it(s_{i_r} - \sigma)\bigr)^{n_r - 1/2}
+ g_m(\alpha_{m,i_r}, s_{m,i_r}, t) \right] \right)^{-1} dt, \tag{A.2}
\]

where $C_m(x^{(n)})^{-1}$ is the marginal distribution of the discretized sample and $g_m(\alpha_{m,i_r}, s_{m,i_r}, t) = O(\alpha_{m,i_r}^2)$ as m → +∞ for any t. On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions {ℬ_m, m ≥ 1}. But in the particular N–IG case, we are able to compute $C_m(x^{(n)})^{-1}$ by mimicking the technique used in the proof of Proposition 1 and exploiting the diffuseness of α. Thus we have

\[
C_m\bigl(x^{(n)}\bigr)^{-1} = \frac{e^{a}\, 2^{n-k}}{(n-1)!\, (\sqrt{\pi})^{k}}
\left( \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\!\left(n_j - \frac{1}{2}\right) \right)
\int_{(0,+\infty)} \frac{u^{n-1} e^{-a\sqrt{2u+1}}}
{\prod_{r=1}^{k} \left[ [1+2u]^{n_r - 1/2} + h_m(\alpha_{m,i_r}, u) \right]}\, du,
\]

where $h_m(\alpha_{m,i_r}, u) = O(\alpha_{m,i_r}^2)$ as m → +∞ for any u. Insert the previous expression in (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15), with

\[
\psi(\sigma) = -\frac{(n-1)!}{\int_0^{+\infty} u^{n-1} e^{-a\sqrt{2u+1}} \bigl(\sqrt{2u+1}\bigr)^{k-2n}\, du}
\int_0^{+\infty} t^{n-1} e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x-\sigma)}\,\alpha(dx)}
\prod_{j=1}^{k} \bigl(1 - 2it(x_j^* - \sigma)\bigr)^{-n_j + 1/2}\, dt.
\]

Now arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.
Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.
Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.
Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.
Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.
Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.
Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.
Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.
Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.
——— (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.
Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.
Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.
——— (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.
——— (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.
Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.
Gradshteyn, I. S., and Ryzhik, J. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.
Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.
Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.
Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.
Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.
——— (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.
James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.
——— (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.
Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.
Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.
Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates: I. Density Estimates," The Annals of Statistics, 12, 351–357.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.
MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.
MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.
Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.
Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.
Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.
Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.
Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.
——— (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.
——— (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.
Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.
Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.
Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.
Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.
Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.
Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.
Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.
Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.
Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.

Page 2: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1279

Like the Dirichlet process the NndashIG process is a specialcase of various classes of random probability measures In-deed it is a particular case of normalized random measure withindependent increments (RMI) This class was introduced byKingman (1975) and subsequently studied by Perman Pitmanand Yor (1992) and Pitman (2003) In a Bayesian context itwas first considered by Regazzini Lijoi and Pruumlnster (2003)who generalized it also to nonhomogeneous measures (see alsoPruumlnster 2002) Further extensions and results have been givenby James (2002) The NndashIG process is also when its parame-ter measure is nonatomic a special case of species samplingmodel a family of random probability measures introducedby Pitman (1996) and studied in a Bayesian framework byJames (2002) and Ishwaran and James (2003) The NndashIG andDirichlet processes play special roles within these classes (andindeed within all random probability measures) to our knowl-edge they are the only priors for which completely explicitexpressions of their finite-dimensional distributions are avail-able Moreover the NndashIG process leads to explicit and tractableexpressions for quantities of statistical relevance What dis-tinguishes the Dirichlet process from normalized RMIs andspecies-sampling models (and thus also from the NndashIG process)is its conjugacy as shown by Pruumlnster (2002) However thisis no longer a problem given the availability of suitable sam-pling schemes It is worth noting that a posterior characteri-zation of the NndashIG process in terms of a latent variable canbe deduced immediately from work of James (2002) Based onthe families of discrete random probability measures mentionedearlier recently general classes of mixture models have beenproposed Ishwaran and James (2001) proposed replacing theDirichlet process by stick-breaking priors and study mixturesbased on the two-parameter PoissonndashDirichlet process due toPitman (1995) in great detail Ishwaran and James (2003) ex-tended the analysis to cover species-sampling mixture mod-els Another proposal for generalizing MDP is represented bythe normalized random measures driven by increasing additiveprocesses due to Nieto-Barajas Pruumlnster and Walker (2004)(see also Pruumlnster 2002)

The outline of the article is as follows We start with the con-struction of an NndashIG process We do this by first deriving theNndashIG distribution on (01) and its multivariate analog on the(n minus 1)-dimensional simplex in Section 2 where we computesome of its moments as well In Section 3 we show that there ex-ists a random probability measure with these finite-dimensionaldistributions which we term the NndashIG process Such a con-struction mimics the procedure adopted by Ferguson (1973) indefining the Dirichlet process The role played by the gammadistribution in the Dirichlet case is attributed to the inverse-Gaussian distribution in our case Assuming exchangeability ofobservations we obtain the predictive distributions correspond-ing to an NndashIG process We further develop the analysis of thisprior by illustrating the mechanism through which the weightsare allocated to the observations This represents a key featurefor its potential success in applications Moreover we obtaina closed-form expression for the prior probability distributionp(middot|n) of the number of distinct observations in a sample ofsize n In Section 4 we consider the mixture of NndashIG processand show that well-known simulation techniques such as thoseset forth by Escobar and West (1995) and Ishwaran and James

(2001) can be extended in a straightforward way to deal witha mixture of NndashIG process For illustrative purposes we con-sider simulated datasets where observations are drawn fromuniform mixtures of normals and compare the performance ofthe mixture of NndashIG process and the MDP in terms of bothposterior probabilities on the correct number of componentsand log-Bayes factors Finally the extensively studied ldquogalaxyrdquodataset is analyzed Another fruitful use of discrete nonpara-metric priors in a continuous setting is represented by the pos-sibility of considering their filtered-variate version as proposedby Dickey and Jiang (1998) This naturally leads to consideringprior and posterior distributions of means of random probabilitymeasures In Section 5 we provide a description of the posteriordistribution of means of NndashIG processes

2 THE NORMALIZEDINVERSEndashGAUSSIAN DISTRIBUTION

The beta distribution and its multidimensional analog theDirichlet distribution are commonly used as priors for thebinomial and multinomial models It is well known thatgiven n independent gamma random variables with Vi simGa(αi1) the Dirichlet distribution is defined as the distri-bution of the vector (W1 Wn) where Wi = Vi

sumnj=1 Vj

for i = 1 n If αi gt 0 for every i then the (n minus 1)-dimensional distribution of (W1 Wnminus1) is absolutely con-tinuous with respect to the product measure λnminus1 where λ isthe Lebesgue measure on the real line and its density functionon the simplex nminus1 = (w1 wnminus1) wi ge 0 i = 1

n minus 1sumnminus1

i=1 wi le 1 is given by

f (w1 wnminus1)

= (sumn

i=1 αi)prodn

i=1 (αi)

(nminus1prod

i=1

wαiminus1i

)(

1 minusnminus1sum

i=1

wi

)αnminus1

(2)

Clearly if n = 2 then (2) reduces to the beta density with para-meter (α1 α2) (see eg Bilodeau and Brenner 1999)

In this section we derive a distribution on the unit intervaland its multivariate analog on the simplex by substituting thegamma distribution with the inverse-Gaussian distribution inthe foregoing construction A random variable V has inverse-Gaussian distribution with shape parameter α ge 0 and scale pa-rameter γ gt 0 denoted by V sim IG(α γ ) if it has density withrespect to the Lebesgue measure given by

f (v) = αradic2π

vminus32 exp

minus1

2

(α2

v+ γ 2v

)

+ γ α

v ge 0 (3)

for α gt 0 whereas V = 0 (almost surely) if α = 0 Seshadri(1993) has provided an exhaustive account of the inverse-Gaussian distribution In what follows we assume without lossof generality that γ = 1

Let V1 Vn be independent random variables with Vi simIG(αi1) where αi ge 0 for all i and αi gt 0 for some ii = 1 n We define the NndashIG distribution with parameter(α1 αn) denoted by NndashIG(α1 αn) as the distributionof the random vector (W1 Wn) where Wi = Vi

sumnj=1 Vj

for i = 1 n Clearly if αj = 0 for some j then Wj = 0 al-most surely

1280 Journal of the American Statistical Association December 2005

The following proposition provides the density functionover nminus1 of an NndashIG-distributed random vector

Proposition 1 Suppose that the random vector (W1

Wm) is NndashIG(α1 αm) distributed If αi gt 0 for every i =1 m then the distribution of (W1 Wmminus1) is absolutelycontinuous with respect to λmminus1 and its density function onmminus1 coincides with

f (w1 wmminus1)

= esumm

i=1 αiprodm

i=1 αi

2m2minus1πm2

times Kminusm2(radic

Am(w1 wmminus1))

times(

w132 middot middot middotwmminus1

32(

1 minusmminus1sum

j=1

wj

)32

times [Am(w1 wmminus1)

]m4)minus1

(4)

where Am(w1 wmminus1) = summminus1i=1

α2i

wi+ α2

m(1 minus summminus1j=1 wj)

and K denotes the modified Bessel function of the third typeIf m = 2 then (4) reduces to a univariate density on (01) ofthe form

f (w) = eα1+α2α1α2

π

Kminus1(radicA2(w) )

w32(1 minus w)32(A2(w))12 (5)

One of the most important features of the Dirichlet dis-tribution is the additive property inherited from the gammadistribution through which the Dirichlet distribution is de-fined The same happens for the NndashIG distribution in thesense that if (W1 Wn) is NndashIG(α1 αn) distributedand m1 lt m2 lt middot middot middot lt mk = n are integers then the vec-tor (

summ1i=1 Wi

summ2i=m1+1 Wi

sumni=mkminus1+1 Wi) has distribu-

tion NndashIG(summ1

i=1 αisumm2

i=m1+1 αi sumn

i=mkminus1+1 αi) This canbe easily verified on the basis of the additive property of theinverse-Gaussian distribution

A common desire is to associate to a probability distributionsome descriptive indexes that summarize its characteristics Thefollowing proposition provides some of these for the NndashIG dis-tribution Set a = sumn

j=1 αj and pi = αia for every i = 1 n

Proposition 2 Suppose that the random vector (W1 Wn)

is NndashIG(α1 αn) distributed then

E[Wi] = pi

var[Wi] = pi(1 minus pi)a2ea(minus2a)

and

cov(WiWj) = minuspipja2ea(minus2a) for i = j

where (middot middot) denotes the incomplete gamma function

It is worth noting that the moments corresponding to an NndashIGdistribution are quite similar to those of the Dirichlet distribu-tion Indeed the structure is the same and they differ just bya multiplicative constant Figures 1 and 2 depict various fea-tures of Dirichlet and NndashIG distributions For comparative pur-poses the parameters have been chosen in such a way thattheir means and variances coincide Recall that if a random

vector is Dirichlet distributed then E[Wi] = αia = pi andvar[Wi] = pi(1 minus pi)(1 + a) for any i = 1 n having seta = sumn

i=1 αi Thus the means coincide if the pi = pi for any iwhereas given a the variance match is achieved by solving for athe equation a2ea(minus2a) = (a + 1)minus1 The plots seems tosuggest a slightly lower concentration of the mass of the NndashIGdistribution around the barycenter of the simplex Such a fea-ture can be interpreted as the NndashIG being less informative thanthe Dirichlet prior This will become apparent in the sequel

3 THE NORMALIZEDINVERSEndashGAUSSIAN PROCESS

31 Definition and Existence of the NormalizedInverse-Gaussian Process

In this section we define and show existence of the NndashIGprocess Toward this end we suppose that the observations takevalues on some complete and separable metric space X en-dowed with its Borel σ -field X The NndashIG process P is arandom probability measure on (XX ) that can be defined viaits family of finite-dimensional distributions as was done byFerguson (1973) for the Dirichlet process

Let P = QA1An A1 An isin X n ge 1 be a familyof probability distributions and let α be a finite measure on(XX ) with α(X) = a gt 0 If A1 An is a measurable par-tition of X then set

QA1An(C) =int

Ccapnminus1

f (v1 vnminus1)dv1 middot middot middot dvnminus1 (6)

forallC isin Rn where f is defined as in (4) with αi = α(Ai) gt 0 forall i If A1 An is not a partition of X then consider thepartition B1 Bm that it generates and set

QA1An(C) = QB1Bm

(

(x1 xm) isin [01]m

(sum

(1)

xi sum

(n)

xi

)

isin C

)

wheresum

( j) means that the sum extends over all indices i isin1 m for which Bi sub Aj We can show that P satisfies thefollowing conditions

(C1) For any n ge 1 and any finite permutation π of(1 n)

QA1An(C) = QAπ(1)Aπ(n)(πC) forallC isin B(Rn)

where πC = (xπ(1) xπ(n)) (x1 xn) isin C(C2) QX = δ1 where δx represents the point mass at x(C3) For any family of sets A1 An in X let D1

Dh be a measurable partition of X such that it is finerthan the partition generated by A1 An Then forany C isin B(Rn)

QA1An(C) = QD1Dh(Cprime)

where

Cprime =

(x1 xh) isin [01]h

(sum

(1)

xi sum

(n)

xi

)

isin C

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1281

Figure 1 NndashIG and Dirichlet Densities (a) NndashIG (middotmiddotmiddotmiddotmiddot) and Dirichlet (mdashndash) densities on (0 1) with p1 = p1 = 12 a = 5 and a = 1737 (b) Dirichletdensity on ∆2 with p1 = p2 = 13 and a = 1737 (c) NndashIG density on ∆2 with p1 = p2 = 13 and a = 5 (d) NndashIG (middotmiddotmiddotmiddotmiddot) and Dirichlet (mdashndash) densitieson (0 1) with p1 = p1 = 12 a = 18 and a = 3270 (e) Dirichlet density on ∆2 with p1 = p2 = 13 and a = 3270 (f) NndashIG density on ∆2 withp1 = p2 = 13 and a = 18

(C4) For any sequence (An)nge1 of measurable subsets of X

such that An darr empty

QAn rArr δ0

where the symbol rArr denotes as usual weak conver-gence of a sequence of probability measures

Hence according to proposition 392 of Regazzini (2001)there exists a unique random probability measure admitting Pas its family of finite-dimensional distributions We term such arandom probability measure P an NndashIG process

32 Relationship to Other Classes of DiscreteRandom Probability Measures

In this section we discuss the relationship of the NndashIGprocess to other classes of discrete random probability mea-

sures First recall that Ferguson (1973) also proposed an al-ternative construction of the Dirichlet process as a normal-ized gamma process The same can be done in this caseby replacing the gamma process with an inverse-Gaussianprocess that is an increasing Leacutevy process ξ = ξt t ge 0which is uniquely characterized by its Leacutevy measure ν(dv) =(2πv3)minus12eminusv2 dv If α is a finite and nonnull measure on Rthen the time change t = A(x) = α((minusinfin x]) yields a reparame-terized process ξA which still has independent but generallynot stationary increments Because ξA is finite and positive al-most surely we are in a position to consider

F(x) = ξA(x)

ξafor every x isin R (7)

1282 Journal of the American Statistical Association December 2005

Figure 2 Contour Plots for the Dirichlet (a)ndash(c) and NndashIG (d)ndash(f) Densities on ∆2 With p1 = p2 = p1 = p2 = 13 for Different Choices ofa and a

as a random probability distribution function on R wherea = α(R) The corresponding random probability P is an NndashIGprocess on R As shown by Regazzini et al (2003) such a con-struction holds for any increasing additive process providedthat the total mass of the Leacutevy measure is unbounded givingrise to the class of normalized RMI James (2002) extendednormalized RMIs to Polish spaces Thus the NndashIG processrepresents clearly a particular case of normalized RMI Weremark that the idea of defining a general class of discrete ran-dom probability measures via normalization is due to Kingman(1975) and previous stimulating work was done by Permanet al (1992) and Pitman (2003)

The NndashIG process provided that α is nonatomic turns out toalso be a species-sampling model This class of random proba-bility measures due to Pitman (1996) is defined as

P(middot) =sum

ige1

PiδXi(middot) +(

1 minussum

ige1

Pi

)

H(middot) (8)

where 0 lt Pi lt 1 are random weights such thatsum

igt1 Pi le 1independent of the locations Xi which are iid with somenonatomic distribution H A species-sampling model is com-pletely characterized by H which is the prior guess at theshape of P and the so-called ldquoexchangeable partition proba-bility functionrdquo (EPPF) (see Pitman 1996) Although the de-finition of species-sampling model provides an intuitive andquite general framework it leaves a difficult open problemnamely the concrete assignment of the random weights Pi Itis clear that such an issue is crucial for applicability of thesemodels Up to now all tractable species sampling models havebeen defined by exploiting the so-called ldquostick-breakingrdquo proce-dure already adopted by Halmos (1944) and Freedman (1963)(See Ishwaran and James 2001 and references therein for exam-ples of what they call ldquostick-breaking priorsrdquo) More recentlyPitman (2003) considered PoissonndashKingman models a sub-class of normalized RMI which are defined via normaliza-

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1283

tion of homogeneous random measures with independent incre-ments under the additional assumption of α being nonatomicHe showed that PoissonndashKingman models are a subclass ofspecies-sampling models and characterized them by deriving anexpression of the EPPF From these considerations it followsthat an NndashIG process provided that α is nonatomic is also aspecies-sampling model Note that in the NndashIG case a closed-form expression of the EPPF can be derived based on work ofPitman (2003) or James (2002) combined with the argumentsused in the proof of Proposition 3 one obtains (A1) in the Ap-pendix In this article we do not pursue the approach based onPoisson process partition calculus and refer the reader to the ar-ticle by James (2002) for extensive calculus relative to EPPFsWe just remark that the NndashIG process represents the first exam-ple (at least to our knowledge) of a tractable species-samplingmodel which cannot be derived via a stick-breaking procedure

Before proceeding it seems worthwhile to point out thatthe peculiarities of the NndashIG and Dirichlet processes comparedwith other members of these classes (and indeed within all ran-dom probability measures) is represented by the fact that theirfinite-dimensional distributions are known explicitly

33 Some Properties of the NormalizedInverse-Gaussian Process

In this section we study some properties of the NndashIG processIn particular we discuss discreteness derive some of its mo-ments and the predictive distribution and consider the distribu-tion of the number of distinct observations with a sample drawnfrom an NndashIG process

A first interesting issue is the almost-sure discreteness of anNndashIG process On the basis of the construction provided in Sec-tion 31 it follows that the distribution of the NndashIG processadmits a version that selects discrete probability measures al-most surely By a result of James (2003) that makes use of theconstruction in (7) we have that all versions of P are discreteHence we can affirm that the NndashIG process is discrete almostsurely

The moments of an NndashIG process with parameter α followimmediately from Proposition 2 Let BB1B2 isin X and setC = B1 cap B2 and P0(middot) = α(middot)a Then

E[P(B)] = P0(B)

var[P(B)] = P0(B)(1 minus P0(B))a2ea (minus2a)

and

cov(P(B1) P(B2)

) = [P0(C) minus P0(B1)P0(B2)]a2ea(minus2a)

Note that if P0 is diffuse then the mean of P follows alsofrom the fact that the NndashIG process is a species-samplingmodel These quantities are usually exploited for incorporatingreal qualitative prior knowledge into the model For instanceWalker and Damien (1998) suggested controlling the priorguess P0 as well as also the variance of the random probabil-ity measure at issue Walker Damien Laud and Smith (1999)provided a detailed discussion on the prior specification in non-parametric problems

An important goal in inferential procedures is predicting fu-ture values of a random quantity based on its past outcomesSuppose that a sequence (Xn)nge1 of exchangeable observa-tions is defined in such a way that given P the Xirsquos are iid

with distribution P Moreover let Xlowast1 Xlowast

k denote the k dis-tinct observations within the sample X(n) = (X1 Xn) withnj gt 0 terms being equal to Xlowast

j for j = 1 k and such thatsum

j nj = n The next proposition provides the predictive distrib-utions corresponding to an NndashIG process

Proposition 3 If P is an NndashIG process with diffuse parame-ter α then the predictive distributions are of the form

Pr(Xn+1 isin B

∣∣X(n)

) = w(n)0 P0(B) + w(n)

1

ksum

j=1

(nj minus 12)δXlowastj(B)

for every B isin X with

w(n)0 =

sumnr=0

(nr

)(minusa2)minusr+1(k + 1 + 2r minus 2na)

2nsumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

and

w(n)1 =

sumnr=0

(nr

)(minusa2)minusr+1(k + 2r minus 2na)

nsumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

Note that the proof of Proposition 3 exploits a general resultof Pitman (2003) later derived independently by means of adifferent technique by Pruumlnster (2002) (see also James 2002)

Thus the predictive distribution is a linear combination ofthe prior guess P0 and a weighted empirical distribution withexplicit expression for the weights Moreover a generalizedPoacutelya urn scheme follows immediately which is given in (11)To highlight its distinctive features it is useful to compare itwith the prediction rule of the Dirichlet process In the lat-ter case one has that Xn+1 is new with probability a(a + n)

and that it coincides with Xlowastj with probability nj(a + n) for

j = 1 k Thus the probability allocated to previous obser-vations is n(a + n) and does not depend on the number k ofdistinct observations within the sample Moreover the weightassigned to each Xlowast

j depends only on the number of observa-tions equal to Xlowast

j a characterizing property of the Dirichletprocess that at the same time represents one if its drawbacks(see Ferguson 1974) In contrast for the NndashIG process themechanism is quite interesting and exploits all available in-formation Given X(n) the (n + 1)st observation is new withprobability w(n)

0 and coincides with one of the previous withprobability (n minus k2)w(n)

1 In this case the balancing betweennew and old observations takes the number of distinct val-ues k into account it is easy to verify numerically that w(n)

0 isdecreasing as k decreases and thus one removes more massfrom the prior if fewer distinct observations (ie more ties) arerecorded Moreover allocation of the probability to each Xlowast

j ismore elaborate than for the Dirichlet case instead of increasingthe weight proportionally to the number of ties the probabilityassigned to Xlowast

j is reinforced more than proportionally each timea new tie has been recorded This can be explained by the ar-gument that the more often Xlowast

j is observed the stronger is thestatistical evidence in its favor and thus it is sensible to reallo-cate mass toward it

A small numerical example may clarify this point Whencomparing the Dirichlet and NndashIG processes it is sensible tofix their parameter such that they have the same mean and vari-ance as was done in Section 2 Recall that this is tantamount

1284 Journal of the American Statistical Association December 2005

Table 1 Posterior Probability Allocation of the Dirichletand NndashIG Processes

n = 10 Dirichlet process NndashIG process

New observation 1480 2182

X1 = 3 n1 = 4 3408 3420

X2 = 1 n2 = 3 2556 2443

X3 = 6 n3 = 2 1704 1466

X4 = 5 n4 = 1 0852 0489

to requiring them to have the same prior guess P0 and impos-ing var(Dα(B)) = var(P(B)) for every B isin X which is equiv-alent to (a + 1)minus1 = a2ea(minus2a) where a and a are the totalmasses of the parameter measures of the Dirichlet and NndashIGprocesses

To highlight the reinforcement mechanism let X = [01]P0 equal to the uniform distribution with a = 5 and a = 1737Suppose that one has observed X(10) with four observationsequal to 3 three observations equal to 1 two observationsequal to 6 and one observation equal to 5 Table 1 givesthe probabilities assigned by the predictive distribution of theDirichlet and NndashIG processes

From this table the reinforcement becomes apparent In theDirichlet case having four ties in 3 means assigning to itfour times the probability given to an observation that appearedonce For the NndashIG process things are very different the massassigned to the prior (ie to observing a new value) is higherindicating the greater flexibility of the NndashIG prior Moreoverthe probability assigned to 6 is more than two times the massassigned to 5 the probability assigned to 1 is greater than 32times the mass given to 6 and so on Having a sample withmany ties means having significant statistical evidence withreference to the associated clustering structure and the NndashIGprocess makes use of this information

A further remarkable issue concerns determination of theprior distribution for the number k of distinct observations Xlowast

i rsquosamong the n being observed Antoniak (1974) remarked thatsuch a distribution is induced by the distribution assigned tothe random probability measure and gave an expression for theDirichlet case The corresponding formula for the NndashIG processis given in the following proposition

Proposition 4 The distribution of the number of distinct ob-servations k in a sample of size n is given by

p(k|n) =(

2n minus k minus 1n minus 1

)ea(minusa2)nminus1

22nminuskminus1(k)

timesnminus1sum

r=0

(n minus 1

r

)

(minusa2)minusr

times (k + 2 + 2r minus 2n a) (9)

for k = 1 n

Note that the expression for p(k|n) in the Dirichlet case is

cn(k)ak (a)

(a + n) (10)

where cn(k) is the absolute value of a Stirling number of thefirst kind (for details see Green and Richardson 2001) Unlikein (10) the evaluation of which can be achieved by resorting torecursive formulas for Stirling numbers evaluation of the ex-pression in (9) is straightforward Table 2 compares the priorsp(k|n) corresponding to the Dirichlet and NndashIG processes inthe same setup of Table 1

In this toy example note that the distribution p(middot|n) inducedby the NndashIG prior is flatter than that corresponding to theDirichlet process This fact plays a role in the context of mixturemodeling to be examined in the next section

4 THE MIXTURE OF NORMALIZEDINVERSEndashGAUSSIAN PROCESS

In light of the previous results it is clear that the reinforce-ment mechanism makes the NndashIG prior an interesting alterna-tive to the Dirichlet process Nowadays the Dirichlet processis seldom used directly for assessing inferences indeed it iscommonly exploited as the crucial building block in a hier-archical mixture model of type (1) Here we aim to study amixture of NndashIG process model as set in (1) and suitable semi-parametric variations of it The mixture of NndashIG process repre-sents a particular case of normalized random measures drivenby increasing additive processes (see Nieto-Barajas et al 2004)and also of species-sampling mixture models (see Ishwaran andJames 2003) With respect to both classes the mixture of NndashIGprocess stands out for its tractability

To exploit a mixture of NndashIG process for inferential pur-poses it is essential to derive an appropriate sampling schemeIn such a framework knowledge of the predictive distributionsdetermined in Proposition 3 is crucial Indeed most of the sim-ulation algorithms developed in Bayesian nonparametrics relyon variations of the BlackwellndashMacQueen Poacutelya urn schemeand on the development of appropriate Gibbs sampling pro-cedures (See Escobar 1994 MacEachern 1994 Escobar andWest 1995 for the Dirichlet case and Pitman 1996 Ishwaranand James 2001 for extensions to stick-breaking priors) In thecase of an NndashIG process it follows that the joint distributionof X(n) can be characterized by the following generalized Poacutelyaurn scheme Let Z1 Zn be an iid sample from P0 Then asample X(n) from an NndashIG process can be generated as followsSet X1 = Z1 and for i = 2 n

(Xi

∣∣X(iminus1)

) =

Zi with probability w(iminus1)0

Xlowastij with probability

(nj minus 12)w(iminus1)1 j = 1 ki

(11)

Table 2 Prior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes

n = 10 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10

Dirichleta = 1737 0274 1350 2670 2870 1850 0756 0196 0031 00029 00001

NndashIGa = 5 0523 1286 1813 1936 1716 1296 0824 042 0155 0031

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1285

Figure 3 Prior Probabilities for k (n = 100) Corresponding to the NndashIG Process and the Dirichlet Process for Different Values of a ( a) Thesevalues were chosen to match the modes at 6 14 and 28 [ Dirichlet ( a = 12 42 126) (left a = 12 to right a = 126) NndashIG (a = 1 1 5)(left a = 1 to right a = 5)]

where ki represents the number of distinct observations de-noted by Xlowast

i1 Xlowastiki

in X(iminus1) and w(iminus1)0 and w(iminus1)

1 are asgiven in Proposition 3

Moreover the result in Proposition 4 can be interpreted asproviding the prior distribution of the number k of componentsin the mixture for a given sample size n As in the Dirichletcase a smaller total mass a yields a p(middot|n) more concentratedon smaller values of k This can be explained by the fact that asmaller a gives rise to a smaller w(n)

0 that is it generates newdata with lower probability However the p(middot|n) induced by theNndashIG prior is apparently less informative than that correspond-ing to the Dirichlet process prior and thus is more robust withrespect to changes in a A qualitative illustration is given in Fig-ure 3 where the distribution of k given n = 100 observations isdepicted for the NndashIG and Dirichlet processes We connectedthe probability points by straight lines only for visual simplifi-cation Note that the mode of the distribution of k correspondingto the NndashIG never has probability larger than 07

Recently new algorithms have been proposed for dealingwith mixtures Extending the work of Brunner Chan Jamesand Lo (2001) on an iid sequential importance sampling pro-cedure for fitting MDPs Ishwaran and James (2003) proposeda generalized weighted Chinese restaurant algorithm that cov-ers species-sampling mixture models They formally derivedthe posterior distribution of a species-sampling mixture Todraw approximate samples from this distribution they deviseda computational scheme that requires knowledge of the condi-tional distribution of the species-sampling model P given themissing values X1 Xn When feasible such an algorithmhas the merit of reducing the Monte Carlo error (see Ishawaranand James 2003 for details and further references) In the caseof an NndashIG process sampling from its posterior law is notstraightforward because it is characterized in terms of a latent

variable (see James 2002) Further investigations are needed toimplement these interesting sampling procedures efficiently inthe NndashIG case and more generally for normalized RMI Prob-lems of the same type arise if one is willing to implement thescheme proposed by Nieto-Barajas et al (2004)

To carry out a more detailed comparison between Dirichletand NndashIG mixtures we next consider two illustrative examples

41 Simulated Data

Here we consider simulated data sampled from uniform mix-tures of normal distributions with different number of compo-nents and compare the behavior of the MDP and mixture ofNndashIG process for different choices of priors on the number ofcomponents We then evaluate the performance in terms of pos-terior probabilities on the correct number of components and interms of log-Bayes factors for testing k = klowast against k = klowastwhere klowast represents the correct number of components

We first analyze a dataset Y(100) simulated from a mixtureof three normal distributions with means minus40 and 8 variance1 and corresponding weights 33 34 and 33 Let N(middot|m v)denote the normal distribution with mean m and variance v gt 0We consider the following hierarchical model to represent suchdata

(Yi|Xi)indsim N(Yi|Xi1) i = 1 100

(Xi|P)iidsim P

P sim P

where P refers to either an NndashIG process or a Dirichletprocess Both are centered at the same prior guess N(middot|Y t2)where t is the data range that is t = max Yi minus min Yi To fixthe total masses a and a we no longer consider the issue of

1286 Journal of the American Statistical Association December 2005

Figure 4 Prior Probabilities for k (n = 100) Corresponding to the Dirichlet and NndashIG Processes [ Dirichlet ( a = 2) NndashIG (a = 2365)]

matching variances For comparative purposes in this mixture-modeling framework it seems more plausible to start with sim-ilar priors for the number of components k Hence a and a aresuch that the mode of p(middot|n) is the same in both the Dirichletand NndashIG cases We choose to set the mode equal to k = 8 inboth cases yielding a = 2 and a = 2365 The correspondingdistributions are plotted in Figure 4

In this setup we draw a sample from the posterior distribu-tion L (X(100)|Y(100)) by iteratively drawing samples from thedistributions of (Xi|X(100)

minusi Y(100)) for i = 1 100 given by

P(Xi isin middot∣∣X(100)

minusi Y(100))

= qlowasti0N

(

Xi isin middot∣∣∣t2Yi + Y

1 + t2

t2

t2 + 1

)

+kisum

j=1

qlowastijδXlowast

j(middot) (12)

where X(100)minusi is the vector X(100) with the ith coordinate deleted

and ki refers to the number of Xlowastj rsquos in X(100)

minusi The mixing pro-portions are given by

qlowasti0 prop w(99)

i0 N(middot|Y t2 + 1)

and

qlowastij prop

(nj minus 12)w(99)i1

N(middot|Xlowast

j 1)

subject to the constraintsumki

j=0 qlowastij = 1 The values for w(99)

i0and w(99)

i1 are given as in Proposition 3 when applied for de-termining the ldquopredictiverdquo distribution of Xi given X(100)

minusi theyare computed with a precision of 14 decimal places Note thatin general w(99)

i0 and w(99)i1 depend only on a n and ki and not

on the different partition ξ of n = nξ1 + middot middot middot + nξk This implies

that a table containing the values for w(99)i0 and w(99)

i1 for a givena n and k = 1 99 can be generated in advance for use inthe Poacutelya urn Gibbs sampler Further details are available onrequest from the authors

To obtain our estimates we resorted to the Poacutelya urn Gibbssampler such as the one set forth by Escobar and West (1995)As pointed out by Ishwaran and James (2001) such a gener-alized Poacutelya urn characterization also can be considered forthe NndashIG case the only ingredients being the weights of thepredictive distribution determined in Proposition 3 All of thefollowing results were obtained by implementing the MCMCalgorithm using 10000 iterations after 2000 burn-in sweepsFigure 5 depicts the posterior density estimates and the truedensity that generated the data

The predictive density corresponding to the mixture of NndashIGprocess clearly represents a more accurate estimate Table 3provides the prior and posterior probabilities of the number ofcomponents The prior probabilities are computed according toProposition 4 whereas the posterior probabilities follow fromthe MCMC output One may note that the posterior mass as-signed to k = 3 by the NndashIG process is significantly higher thanthat corresponding to the MDP

As far as the log-Bayes factors are concerned one obtainsBFDα

= 546 and BFNndashIGα = 497 thus favoring the MDP Thereason of this outcome seems to be due to the fact that the priorprobability on the correct number of components is much lowerfor the MDP

It is interesting to look also at the case in which the modeof p(k|n) corresponds to the correct number of componentsklowast = 3 In such a case the prior probability on klowast = 3 is muchhigher for the MDP being 285 against 0567 of the mixture ofNndashIG process This results in a posterior probability on klowast = 3of 9168 for the MDP and 86007 for the mixture of NndashIGprocess whereas the log-Bayes factors favor the NndashIG mixturebeing BFDα

= 332 and BFNminusIGα = 463 This again seems tobe due to the fact that the prior probability of one process onthe correct number of components is much lower than the onecorresponding to the other

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1287

Figure 5 Posterior Density Estimates for the Mixture of Three Normals With Means minus4 0 8 Variance 1 and Mixing Proportions 33 34and 33 [ Dirichlet ( a = 2) model NndashIG (a = 2365)]

The next dataset Y(120) that we deal with in this frameworkis drawn from a uniform mixture of six normal distributionswith means minus10minus6minus226 and 10 and variance 7 Both theDirichlet and NndashIG processes are again centered at the sameprior guess N(middot|Y t2) where t represents for the data rangeSet a = 01 and a = 5 such that p(k|n) have mode at k = 3for both the MDP and mixture of NndashIG process Such an ex-ample is meant to shed some light on the ability of the mix-ture of NndashIG process to detect clusters when starting from aprior with mode at a small number of clusters and data comingfrom a mixture with more components In this case the mix-ture of NndashIG process performs better than the MDP both interms of posterior probabilities as shown in Table 4 and alsoin terms of log-Bayes factors because BFDα

= 313 whereasBFNminusIGα = 378

If one considers data drawn from mixtures with more com-ponents then the mixture of NndashIG process does systematicallybetter in terms of posterior probabilities but the phenom-enon of ldquorelatively low probability on the correct number ofcomponentsrdquo of the MDP reappears leading it to be favoredin terms of Bayes factors For instance we have taken 120data from an uniform mixture of eight normals with meansminus14minus10minus6minus22610 and 14 and variance 7 and seta = 01 and a = 5 so the p(k|120) have mode at k = 3 Thisyields a priori probabilities on klowast = 8 of 00568 and 0489 and

a posteriori probabilities of 0937 and 4338 for the MDP andmixture of NndashIG process In contrast the log-Bayes factorsare BFDα

= 290 and BFNminusIGα = 273 One can alternativelymatch the a priori probability on the correct number of compo-nents such that p(8|120) = 04765 for both priors This happensif we set a = 86 which results in the MDP having mode atk = 4 closer to the correct number klowast = 8 whereas the mixtureof NndashIG process remains unchanged This results in a poste-rior probability on klowast = 8 for the MDP of 1134 and havingremoved the influence of the low probability p(8|120) in a log-Bayes factor of 94

42 Galaxy Data

In this section we reanalyze the galaxy dataset popularizedby Roeder (1990) These data consisting of the relative ve-locities (in kmsec) of 82 galaxies have been widely studiedwithin the framework of Bayesian statistics (see eg Roederand Wasserman 1997 Green and Richardson 2001 Petrone andVeronese 2002) For comparison purposes we use the follow-ing semiparametric setting also analyzed by Escobar and West(1995)

(Yi|miVi)indsim N(Yi|miVi) i = 1 82

(miVi|P)iidsim P

P sim P

Table 3 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|100) Has Mode at k = 8

n = 100 k le 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 00225 00997 03057 06692 11198 14974 16501 46356a = 2 Posterior 7038 2509 0406 0046 0001

NndashIG Prior 02404 0311 0419 04937 05386 05611 05677 68685a = 2365 Posterior 8223 1612 0160 0005

1288 Journal of the American Statistical Association December 2005

Table 4 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|120) Has Mode at 3

n = 120 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 08099 21706 27433 21954 12583 05532 01949 005678 00176a = 5 Posterior 0148 3455 5722 064 0033 0002

NndashIG Prior 04389 05096 05169 05142 05082 04997 0489 04765 6047a = 01 Posterior 0085 6981 2395 0479 006

where as before P represents either the NndashIG process or theDirichlet process Again the prior guess P0 is the same for bothnonparametric priors and is characterized by

P0(A) =int

AN(x|microτvminus1)Ga(v|s2S2)dx dv

where A isin B(RtimesR+) and Ga(middot|cd) is the density correspond-

ing to a gamma distribution with mean cd Similar to Escobarand West (1995) we assume a further hierarchy for micro and τ namely micro sim N(middot|gh) and τminus1 sim Ga(middot|wW) We choose thehyperparameters for this illustration to fit those used by Escobarand West (1995) namely g = 0h = 001w = 1 and W = 100For comparative purposes we again choose a to achieve themode match if the mode of p(middot|82) is in k = 5 as it is for theDirichlet process with a = 1 then a = 0829 From Table 5 it isapparent that the NndashIG process provides a noninformative priorfor k compared with the distribution induced by the Dirichletprocess

The details of the Poacutelya urn Gibbs sampler are as provided byEscobar and West (1995) with the only difference being the re-placement of the weights with those corresponding to the NndashIGprocess given in Proposition 3 Figure 6 shows the posteriordensity estimates based on 10000 iterations considered after aburnndashin period of 2000 sweeps Table 5 displays the prior andthe posterior probabilities of the number of components in themixture

Some comments on the results in Table 5 are in order. Escobar and West (1995) extensively discussed the issue of learning about the total mass ã. This is motivated by the sensitivity of posterior inferences to the choice of the parameter ã. Hence they randomized ã and specified a prior distribution for it. It seems worth noting that the posterior distribution of the number of components for the N–IG process with a fixed a essentially coincides with the corresponding distribution arising from an MDP with random ã, given in table 6 of Escobar and West (1995). Some explanations of this phenomenon might be that the prior p(k|n) is essentially noninformative in a broad neighborhood of its mode, thus mitigating the influence of a on posterior inferences, and that the aggressiveness in reducing/detecting clusters of the N–IG mixture, shown in the previous example, seems also to be a plausible reason for such a gain in "parsimony." However, for a mixture of N–IG process it may be of interest to have a random a. To achieve this, a Metropolis step must be added in the algorithm. In principle this is straightforward; the only drawback would be an increase in the computational burden, because computing the weights for the N–IG process is, after all, not as quick as computing those corresponding to the Dirichlet process.
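To make the remark about a random a concrete, one possible shape for the extra step is sketched below; this is our own sketch, with an illustrative Ga(1, 1) prior on a and a proposal on the log scale, not an implementation from the paper. It relies on the observation that a enters the joint EPPF (A.1) only through the factor e^a(−a²)^{n−1} Σ_r binom(n−1, r)(−a²)^{−r}Γ(k + 2 + 2r − 2n, a), which is the same a-dependent factor appearing in p(k|n) of Proposition 4, so the acceptance ratio needs only that factor.

```python
import math, random
from mpmath import mp, mpf, binomial, gammainc, inf, log

mp.dps = 50

def log_lik_a(a, n, k):
    """Log of the a-dependent factor of the EPPF (A.1); terms free of a drop."""
    a = mpf(a)
    s = sum(binomial(n - 1, r) * (-a**2)**(-r)
            * gammainc(k + 2 + 2*r - 2*n, a, inf) for r in range(n))
    return a + (n - 1) * log(a**2) + log((-1)**(n - 1) * s)  # signed sum is positive

def update_a(a, n, k, sd=0.5):
    prop = a * math.exp(sd * random.gauss(0.0, 1.0))  # random walk on log(a)
    log_r = (log_lik_a(prop, n, k) - log_lik_a(a, n, k)
             - (prop - a)                             # Ga(1, 1) prior ratio
             + math.log(prop / a))                    # Jacobian of the log-scale move
    return prop if math.log(random.random()) < log_r else a
```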

5. THE MEAN OF A NORMALIZED INVERSE–GAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for inference with continuous data is represented by histogram smoothing. Such a problem can be handled by exploiting the so-called "filtered-variate" random probability measures, as defined by Dickey and Jiang (1998). These quantities essentially coincide with suitable means of random probability measures. Here we focus attention on means of N–IG processes. After the recent breakthrough achieved by Cifarelli and Regazzini (1990), this has become a very active area of research in Bayesian nonparametrics (see, e.g., Diaconis and Kemperman 1996; Regazzini, Guglielmi, and Di Nunno 2002 for the Dirichlet case; and Epifani, Lijoi, and Prünster 2003; Hjort 2003; James 2002 for results beyond the Dirichlet case). In particular, Regazzini et al. (2003) dealt with means of normalized priors, providing conditions for their existence and their exact prior and approximate posterior distribution. Here we specialize their general results to the N–IG process and derive the exact posterior density.

Letting X = ℝ and 𝒳 = B(ℝ), we study the distribution of the mean ∫_ℝ x P̃(dx). The first issue to face is its finiteness. A necessary and sufficient condition for this to hold can be derived from proposition 1 of Regazzini et al. (2003) and is given by

$$\int_{\mathbb{R}} \sqrt{2\lambda |x| + 1}\; \alpha(dx) < +\infty \qquad \text{for every } \lambda > 0.$$

(The condition is satisfied, for instance, by any α with a Gaussian density, because the integrand grows only like √|x|.)

As far as the problem of the distribution of a mean is concerned, proposition 2 of Regazzini et al. (2003) and some calculations lead to an expression of the prior distribution function of the mean as

$$
F(\sigma) = \frac{1}{2} - \frac{e^{a}}{\pi} \int_0^{+\infty} \frac{1}{t}\,
\exp\left\{ -\int_{\mathbb{R}} \left[1 + 4t^2(x-\sigma)^2\right]^{1/4}
\cos\!\left[\frac{1}{2}\arctan\bigl(2t(x-\sigma)\bigr)\right] \alpha(dx) \right\}
\times
\sin\left\{ \int_{\mathbb{R}} \left[1 + 4t^2(x-\sigma)^2\right]^{1/4}
\sin\!\left[\frac{1}{2}\arctan\bigl(2t(x-\sigma)\bigr)\right] \alpha(dx) \right\} dt
\tag{13}
$$

for any σ ∈ ℝ.
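For a concrete sense of (13), the following is a crude numerical sketch, our own and purely illustrative, that evaluates F(σ) by nested quadrature when α is taken to be a times a standard normal density; it assumes scipy, and the lower limit of the outer integral is truncated at 10⁻⁸ to sidestep the removable singularity of the integrand at t = 0.

```python
import numpy as np
from scipy.integrate import quad

a = 1.0  # total mass; alpha(dx) = a * N(0, 1) dx is an illustrative choice of ours

def inner(t, sigma, trig):
    """Inner integral of (13), with trig = np.cos or np.sin."""
    g = lambda x: (a * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
                   * (1 + 4 * t**2 * (x - sigma)**2) ** 0.25
                   * trig(0.5 * np.arctan(2 * t * (x - sigma))))
    return quad(g, -np.inf, np.inf)[0]

def F(sigma):
    h = lambda t: np.exp(-inner(t, sigma, np.cos)) * np.sin(inner(t, sigma, np.sin)) / t
    return 0.5 - (np.exp(a) / np.pi) * quad(h, 1e-8, np.inf, limit=200)[0]

print(F(0.0))  # = 0.5: alpha is symmetric about 0, so the sin{...} factor vanishes
print(F(1.0))  # prior mass that the mean places to the left of 1
```

The value F(0) = 1/2 provides a quick correctness check, because for a symmetric α the inner sine integral vanishes at σ = 0.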

Table 5. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes in the Galaxy Data Example

n = 82                          k ≤ 3    k = 4    k = 5    k = 6    k = 7    k = 8    k = 9    k = 10   k ≥ 11
Dirichlet (ã = 1)    Prior      .2140    .2060    .214     .169     .106     .0548    .0237    .0088    .0038
                     Posterior     —     .0465    .125     .2524    .2484    .2011    .08      .019     .0276
N–IG (a = 0.829)     Prior      .1309    .0617    .0627    .0621    .0607    .0586    .0562    .0534    .4537
                     Posterior  .003     .0754    .168     .2301    .2338    .1225    .0941    .0352    .0379


Figure 6. Posterior Density Estimates for the Galaxy Dataset [N–IG (a = 0.829); Dirichlet (ã = 2)].

Before providing the posterior density of a mean of an N–IG process, it is useful to introduce the Liouville–Weyl fractional integral, defined as

$$I_{c+}^{n} h(\sigma) = \int_c^{\sigma} \frac{(\sigma-u)^{n-1}}{(n-1)!}\, h(u)\, du$$

for n ≥ 1, and set equal to the identity operator for n = 0, for any c < σ. Moreover, let Re z and Im z denote the real and imaginary parts of z ∈ ℂ.

Proposition 5. If P̃ is an N–IG process with diffuse parameter α and −∞ ≤ c = inf supp(α), then the posterior density of the mean is of the form

$$\rho_{x^{(n)}}(\sigma) = \frac{1}{\pi}\, I_{c+}^{n-1} \operatorname{Im} \psi(\sigma) \qquad \text{if } n \text{ is even} \tag{14}$$

and

$$\rho_{x^{(n)}}(\sigma) = -\frac{1}{\pi}\, I_{c+}^{n-1} \operatorname{Re} \psi(\sigma) \qquad \text{if } n \text{ is odd,} \tag{15}$$

with

$$\psi(\sigma) = \frac{\Gamma(n-1)\, 2^{n-1} \displaystyle\int_0^{\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1-2it(x-\sigma)}\;\alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x_j^{*} - \sigma)\bigr)^{-n_j + 1/2}\, dt}{a^{2n-(2+k)} \displaystyle\sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a)}.$$

6. CONCLUDING REMARKS

In this article we have studied some of the statistical implications of the use of the N–IG process as an alternative to the Dirichlet process. Both priors are almost surely discrete, allow for explicit derivation of their finite-dimensional distributions, and lead to tractable expressions of relevant quantities. Their natural use is in the context of mixture modeling, where they present remarkably different behaviors. Indeed, it has been shown that the prior distribution on the number of components induced by the N–IG process is wider than that induced by the Dirichlet process. This seems to mitigate the influence of the choice of the total mass a of the parameter measure. Moreover, the mixture of N–IG process seems to be more aggressive in reducing/detecting clusters on the basis of ties present in the data. Such conclusions are, however, preliminary and still need to be validated by more extensive applications, which will constitute the object of our future research.

APPENDIX: PROOFS

A.1 Proof of Proposition 1

Having (3) at hand, the computation of (4) is as follows. One operates the transformation W_i = V_i (Σ_{l=1}^n V_l)⁻¹ for i = 1, …, n − 1, and W_n = Σ_{j=1}^n V_j. Some algebra and formula 3.471.12 of Gradshteyn and Ryzhik (2000) lead to the desired result.

A.2 Proof of Proposition 2

In proving this result, we do not use the distribution of an N–IG random variable directly. The key point is exploiting the independence of the variables (V₁, …, V_n) used for defining (W₁, …, W_n). Set V = Σ_{j=1}^n V_j and V₋ᵢ = Σ_{j≠i} V_j, and recall that the moment-generating function of an IG(α, γ)-distributed random variable is given by E[e^{−λX}] = e^{−α(√(2λ+γ²)−γ)}. As far as the mean is concerned, for any i = 1, …, n we have

$$
\begin{aligned}
E[W_i] &= E[V_i V^{-1}] = \int_0^{+\infty} E[e^{-uV_{-i}}]\, E[e^{-uV_i} V_i] \, du \\
&= \int_0^{+\infty} E[e^{-uV_{-i}}]\, E\left[-\frac{d}{du}\, e^{-uV_i}\right] du \\
&= -\int_0^{+\infty} e^{-(a-\alpha_i)(\sqrt{2u+1}-1)}\, \frac{d}{du}\, e^{-\alpha_i(\sqrt{2u+1}-1)} \, du \\
&= \alpha_i \int_0^{+\infty} (2u+1)^{-1/2}\, e^{-a(\sqrt{2u+1}-1)} \, du \\
&= \frac{\alpha_i}{a} \int_0^{+\infty} -\frac{d}{du}\, e^{-a(\sqrt{2u+1}-1)} \, du = \frac{\alpha_i}{a} = p_i,
\end{aligned}
$$

having applied Fubini's theorem and theorem 16.8 of Billingsley (1995). The variance and covariance can be deduced by similar arguments, combined with integration by parts and some tedious algebra.
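Both the mean just derived and the variance formula of Proposition 2 are easy to probe by simulation, because an N–IG vector is, by construction, a normalized vector of independent IG variables. The following is a small sketch of ours; the identification of the paper's IG(αᵢ, 1) with numpy's Wald(mean = αᵢ, scale = αᵢ²) is our assumption, justified by matching moment-generating functions.

```python
import numpy as np
from mpmath import gammainc, e, inf

rng = np.random.default_rng(0)
alphas = np.array([1.0, 2.0, 0.5])
a = float(alphas.sum())

# V_i ~ IG(alpha_i, 1), simulated as Wald(mean=alpha_i, scale=alpha_i**2)
V = np.column_stack([rng.wald(al, al**2, size=500_000) for al in alphas])
W = V / V.sum(axis=1, keepdims=True)       # (W_1, W_2, W_3) ~ N-IG(1.0, 2.0, 0.5)

p = alphas / a
print(W.mean(axis=0), p)                   # E[W_i] = p_i
c = float(a**2 * e**a * gammainc(-2, a, inf))
print(W.var(axis=0), p * (1 - p) * c)      # var[W_i] = p_i(1-p_i) a^2 e^a Gamma(-2, a)
```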

A.3 Proof of Proposition 3

From Pitman (2003) (see also James 2002; Prünster 2002), we can deduce that the predictive distribution associated with an N–IG process with diffuse parameter α is of the form

$$P\bigl(X_{n+1} \in \cdot \mid X^{(n)}\bigr) = w^{(n)}\, \frac{\alpha(\cdot)}{a} + \frac{1}{n} \sum_{j=1}^{k} w_j^{(n)}\, \delta_{X_j^{*}}(\cdot),$$

with weights equal to

$$w^{(n)} = \frac{a \int_{\mathbb{R}^+} u^{n}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, \mu_1(u) \, du}{n \int_{\mathbb{R}^+} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u) \, du}$$

and

$$w_j^{(n)} = \frac{\int_{\mathbb{R}^+} u^{n}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_j+1}(u) \cdots \mu_{n_k}(u) \, du}{\int_{\mathbb{R}^+} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u) \, du},$$

where μ_n(u) = ∫_{ℝ⁺} v^n e^{−uv} ν(dv) for any positive u and n = 1, 2, …, and ν(dv) = (2πv³)^{−1/2} e^{−v/2} dv. We can easily verify that μ_n(u) = 2^{n−1} Γ(n − 1/2)(√π [2u + 1]^{n−1/2})^{−1} and, moreover, that

$$\int_{\mathbb{R}^+} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u) \, du
= \frac{2^{n-k} e^{a}}{(\sqrt{\pi}\,)^k} \left\{ \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right\} \int_0^{+\infty} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{[2u+1]^{n-k/2}}\, du.$$

By the change of variable √(2u + 1) = y, the latter expression is equal to

$$
\begin{aligned}
&\frac{2^{n-k} e^{a}}{(\sqrt{\pi}\,)^k} \left\{ \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right\} \frac{1}{2^{n-1}} \int_1^{+\infty} \frac{(y^2-1)^{n-1}\, e^{-ay}}{y^{2n-k-1}}\, dy \\
&\qquad = \frac{e^{a}}{2^{k-1}(\sqrt{\pi}\,)^k} \left\{ \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right\} \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} \int_1^{+\infty} \frac{e^{-ay}}{y^{2n-k-2r-1}}\, dy \\
&\qquad = \frac{e^{a}}{2^{k-1}(\sqrt{\pi}\,)^k} \left\{ \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right\} \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r}\, a^{2n-2-k-2r}\, \Gamma(k + 2r - 2n + 2,\, a).
\end{aligned}
$$

Now the result follows by rearranging the terms appropriately, combined with some algebra.
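The closed form for μ_n(u) invoked in this proof is easy to confirm numerically; the following is a small sketch of ours, assuming mpmath.

```python
from mpmath import mp, quad, gamma, sqrt, pi, exp, inf, mpf

mp.dps = 20

def mu_numeric(n, u):
    # mu_n(u) = int_0^inf v^n e^{-uv} (2 pi v^3)^{-1/2} e^{-v/2} dv
    return quad(lambda v: v**n * exp(-u*v - v/2) / sqrt(2*pi*v**3), [0, inf])

def mu_closed(n, u):
    # closed form stated in the proof
    return 2**(n-1) * gamma(n - mpf(1)/2) / (sqrt(pi) * (2*u + 1)**(n - mpf(1)/2))

print(mu_numeric(3, mpf("1.2")))  # both print the same value,
print(mu_closed(3, mpf("1.2")))   # confirming the identity numerically
```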

A.4 Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations (X₁*, …, X_k*) and the random partition coincides with

$$\alpha(dx_1) \cdots \alpha(dx_k)\, \frac{e^{a}(-a^2)^{n-1}}{a^k\, 2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \left\{ \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right\} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a).$$

The conditional distribution of the vector (k, n₁, …, n_k) given n can be obtained by marginalizing with respect to (X₁*, …, X_k*), thus yielding

$$\frac{e^{a}(-a^2)^{n-1}}{2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \left\{ \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right\} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a). \tag{A.1}$$

To determine p(k|n), one needs to compute

$$\sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr),$$

where (*) means that the sum extends over all partitions of the set of integers {1, …, n} into k components:

$$\sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) = \pi^{k/2} \sum_{(*)} \prod_{j=1}^{k} \Bigl(\frac{1}{2}\Bigr)_{n_j - 1} = \binom{2n-k-1}{n-1} \frac{\Gamma(n)}{\Gamma(k)}\, 2^{2k-2n},$$

where, for any positive b, (b)_n = Γ(b + n)/Γ(b), and the result follows.
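As a numerical cross-check of the resulting expression (9) for p(k|n) given in Proposition 4, the probabilities must add up to 1 over k = 1, …, n; here is a sketch under the same mpmath assumption as before, with the transcription of (9) being ours.

```python
from mpmath import mp, mpf, binomial, gammainc, gamma, inf, e

mp.dps = 50

def p_k(k, n, a):
    """Our transcription of eq. (9): p(k|n) for the N-IG process."""
    a = mpf(a)
    s = sum(binomial(n - 1, r) * (-a**2)**(-r)
            * gammainc(k + 2 + 2*r - 2*n, a, inf) for r in range(n))
    return (binomial(2*n - k - 1, n - 1) * e**a * (-a**2)**(n - 1)
            / (2**(2*n - k - 1) * gamma(k)) * s)

for a in (mpf("0.829"), mpf(5)):
    print(sum(p_k(k, 10, a) for k in range(1, 11)))  # both print 1.0 up to precision
```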

A.5 Proof of Proposition 5

Let 𝒫_m = {A_{m,i} : i = 1, …, k_m}, m ≥ 1, be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize α by setting α_m = Σ_{i=1}^{k_m} α_{m,i} δ_{s_{m,i}}, where α_{m,i} = α(A_{m,i}) and s_{m,i} is a point in A_{m,i}. Hence, whenever the jth element in the sample, X_j, belongs to A_{m,i_j}, it is as if we observed s_{m,i_j}. Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to express the posterior density function for the discretized mean as (14) and (15) with

$$
\psi_m(\sigma) = -C_m\bigl(x^{(n)}\bigr)\, \frac{e^{a}\, 2^{n-k}}{(\sqrt{\pi}\,)^k} \left( \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right)
\int_0^{+\infty} t^{n-1}\, e^{-\sum_{j=1}^{k_m} \sqrt{1 - 2it(s_j - \sigma)}\; \alpha_{m,j}}
\left( \prod_{r=1}^{k} \Bigl[ \bigl(1 - 2it(s_{i_r} - \sigma)\bigr)^{n_r - 1/2} + g_m\bigl(\alpha_{m,i_r}, s_{m,i_r}, t\bigr) \Bigr] \right)^{-1} dt, \tag{A.2}
$$

where C_m(x^{(n)})⁻¹ is the marginal distribution of the discretized sample and g_m(α_{m,i_r}, s_{m,i_r}, t) = O(α²_{m,i_r}) as m → +∞ for any t. On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions (𝒫_m)_{m≥1}. But in the particular N–IG case, we are able to compute C_m(x^{(n)})⁻¹ by mimicking the technique used in the proof of Proposition 1 and exploiting the diffuseness of α. Thus we have

$$
C_m\bigl(x^{(n)}\bigr)^{-1} = \frac{e^{a}\, 2^{n-k}}{\Gamma(n-1)(\sqrt{\pi}\,)^k} \left( \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \right)
\int_{(0,+\infty)} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{\prod_{r=1}^{k} \bigl\{ [1+2u]^{n_r - 1/2} + h_m(\alpha_{m,i_r}, u) \bigr\}}\, du,
$$

where h_m(α_{m,i_r}, u) = O(α²_{m,i_r}) as m → +∞ for any u. Insert the previous expression in (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15) with

$$
\psi(\sigma) = -\frac{\Gamma(n-1) \displaystyle\int_0^{\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x-\sigma)}\;\alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x_j^{*} - \sigma)\bigr)^{-n_j + 1/2}\, dt}{\displaystyle\int_0^{+\infty} u^{n-1}\, e^{-a\sqrt{2u+1}}\, \bigl(\sqrt{2u+1}\,\bigr)^{-2n+k}\, du}.
$$

Now arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.

Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.

Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.

Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.

Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.

Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.

Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.

Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.

Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.

Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.

Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.

——— (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.

Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.

Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.

——— (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.

——— (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.

Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.

Gradshteyn, I. S., and Ryzhik, J. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.

Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.

Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.

Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.

Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.

——— (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.

James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.

——— (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.

Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.

Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.

Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates, I: Density Estimates," The Annals of Statistics, 12, 351–357.

Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.

MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.

MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.

Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.

Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.

Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.

Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.

Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.

——— (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.

——— (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.

Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.

Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.

Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.

Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.

Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.

Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.

Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.

Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.

Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.

42 Galaxy Data

In this section we reanalyze the galaxy dataset popularizedby Roeder (1990) These data consisting of the relative ve-locities (in kmsec) of 82 galaxies have been widely studiedwithin the framework of Bayesian statistics (see eg Roederand Wasserman 1997 Green and Richardson 2001 Petrone andVeronese 2002) For comparison purposes we use the follow-ing semiparametric setting also analyzed by Escobar and West(1995)

(Yi|miVi)indsim N(Yi|miVi) i = 1 82

(miVi|P)iidsim P

P sim P

Table 3 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|100) Has Mode at k = 8

n = 100 k le 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 00225 00997 03057 06692 11198 14974 16501 46356a = 2 Posterior 7038 2509 0406 0046 0001

NndashIG Prior 02404 0311 0419 04937 05386 05611 05677 68685a = 2365 Posterior 8223 1612 0160 0005

1288 Journal of the American Statistical Association December 2005

Table 4 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|120) Has Mode at 3

n = 120 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 08099 21706 27433 21954 12583 05532 01949 005678 00176a = 5 Posterior 0148 3455 5722 064 0033 0002

NndashIG Prior 04389 05096 05169 05142 05082 04997 0489 04765 6047a = 01 Posterior 0085 6981 2395 0479 006

where as before P represents either the NndashIG process or theDirichlet process Again the prior guess P0 is the same for bothnonparametric priors and is characterized by

P0(A) =int

AN(x|microτvminus1)Ga(v|s2S2)dx dv

where A isin B(RtimesR+) and Ga(middot|cd) is the density correspond-

ing to a gamma distribution with mean cd Similar to Escobarand West (1995) we assume a further hierarchy for micro and τ namely micro sim N(middot|gh) and τminus1 sim Ga(middot|wW) We choose thehyperparameters for this illustration to fit those used by Escobarand West (1995) namely g = 0h = 001w = 1 and W = 100For comparative purposes we again choose a to achieve themode match if the mode of p(middot|82) is in k = 5 as it is for theDirichlet process with a = 1 then a = 0829 From Table 5 it isapparent that the NndashIG process provides a noninformative priorfor k compared with the distribution induced by the Dirichletprocess

The details of the Poacutelya urn Gibbs sampler are as provided byEscobar and West (1995) with the only difference being the re-placement of the weights with those corresponding to the NndashIGprocess given in Proposition 3 Figure 6 shows the posteriordensity estimates based on 10000 iterations considered after aburnndashin period of 2000 sweeps Table 5 displays the prior andthe posterior probabilities of the number of components in themixture

Some comments on the results in Table 5 are in orderEscobar and West (1995) extensively discussed the issue oflearning about the total mass a This is motivated by the sen-sitivity of posterior inferences on the choice of the parameter aHence they randomized a and specified a prior distributionfor it It seems worth noting that the posterior distributionof the number of components for the NndashIG process with afixed a essentially coincides with the corresponding distribu-tion arising from an MDP with random a given in table 6 ofEscobar and West (1995) Some explanations of this phenom-enon might be that the prior p(k|n) is essentially noninforma-tive in a broad neighborhood of its mode thus mitigating theinfluence of a on posterior inferences and the aggressivenessin reducingdetecting clusters of the NndashIG mixture shown inthe previous example seems also to be a plausible reason ofsuch a gain in ldquoparsimonyrdquo However for a mixture of NndashIGprocess it may be of interest to have a random a To achievethis a Metropolis step must be added in the algorithm In prin-ciple this is straightforward The only drawback would be an

increase in the computational burden because computing theweights for the NndashIG process is after all not as quick as com-puting those corresponding to the Dirichlet process

5 THE MEAN OF A NORMALIZEDINVERSEndashGAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for infer-ence with continuous data is represented by histogram smooth-ing Such a problem can be handled by exploiting the so-calledldquofiltered-variaterdquo random probability measures as defined byDickey and Jiang (1998) These quantities essentially coincidewith suitable means of random probability measures Here wefocus attention on means of NndashIG processes After the recentbreakthrough achieved by Cifarelli and Regazzini (1990) thishas become a very active area of research in Bayesian nonpara-metrics (see eg Diaconis and Kemperman 1996 RegazziniGuglielmi and Di Nunno 2002 for the Dirichlet case andEpifani Lijoi and Pruumlnster 2003 Hjort 2003 James 2002for results beyond the Dirichlet case) In particular Regazziniet al (2003) dealt with means of normalized priors providingconditions for their existence their exact prior and approximateposterior distribution Here we specialize their general results tothe NndashIG process and derive the exact posterior density

Letting X = R and X = B(R) we study the distributionof the mean

intR

xP(dx) The first issue to face is its finitenessA necessary and sufficient condition for this to hold can be de-rived from proposition 1 of Regazzini et al (2003) and is givenby

intR

radic2λx + 1α(dx) lt +infin for every λ gt 0

As far as the problem of the distribution of a mean is con-cerned proposition 2 of Regazzini et al (2003) and some cal-culations lead to an expression of the prior distribution functionof the mean as

F(σ ) = 1

2minus 1

πea

int +infin

0

1

texp

minusint

R

4radic

1 + 4t2(x minus σ)2

times cos

[1

2arctan(2t(x minus σ))

]

α(dx)

times sin

int

R

4radic

1 + 4t2(x minus σ)2

times sin

[1

2arctan(2t(x minus σ))

]

α(dx)

dt

(13)

Table 5 Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes in the Galaxy Data Example

n = 82 k le 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10 k ge 11

Dirichlet Prior 2140 2060 214 169 106 0548 0237 0088 0038a = 1 Posterior 0465 125 2524 2484 2011 08 019 0276

NndashIG Prior 1309 0617 0627 0621 0607 0586 0562 0534 4537a = 0829 Posterior 003 0754 168 2301 2338 1225 0941 0352 0379

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1289

Figure 6 Posterior Density Estimates for the Galaxy Dataset [ NndashIG (a = 0829) Dirichlet (a = 2)]

for any σ isin R Before providing the posterior density of a meanof an NndashIG process it is useful to introduce the LiouvillendashWeylfractional integral defined as

Inc+h(σ ) =

int σ

c

(σ minus u)nminus1

(n minus 1) h(u)du

for n ge 1 and set equal to the identity operator for n = 0 for anyc lt σ Moreover let Re z and Im z denote the real and imagi-nary part of z isin C

Proposition 5 If P is an NndashIG process with diffuse parame-ter α and minusinfin le c = inf supp(α) then the posterior density ofthe mean is of the form

ρx(n) (σ ) = 1

πInminus1c+ Imψ(σ) if n is even (14)

and

ρx(n) (σ ) = minus1

πInminus1c+ Reψ(σ) if n is odd (15)

with

ψ(σ) = (n minus 1)2nminus1int infin

0 tnminus1eminus intR

radic1minusit2(xminusσ)α(dx)

a2nminus(2+k)sumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

timeskprod

j=1

(1 minus it2(xlowast

j minus σ))minusnj+12 dt

6 CONCLUDING REMARKS

In this article we have studied some of the statistical im-plications of the use of the NndashIG process as an alternative tothe Dirichlet process Both priors are almost surely discreteallow for explicit derivation of their finite-dimensional distri-bution and lead to tractable expressions of relevant quantitiesTheir natural use is in the context of mixture modeling where

they present remarkably different behaviors Indeed it has beenshown that the prior distribution on the number of componentsinduced by the NndashIG process is wider than that induced by theDirichlet process This seems to mitigate the influence of thechoice of the total mass a of the parameter measure Moreoverthe mixture of NndashIG process seems to be more aggressive in re-ducingdetecting clusters on the basis of ties present in the dataSuch conclusions are however preliminary and still need to bevalidated by more extensive applications which will constitutethe object of our future research

APPENDIX: PROOFS

A.1. Proof of Proposition 1

Having (3) at hand, the computation of (4) is as follows. One operates the transformation $W_{i} = V_{i}\bigl(\sum_{i=1}^{n} V_{i}\bigr)^{-1}$ for $i = 1, \ldots, n-1$ and $W_{n} = \sum_{j=1}^{n} V_{j}$. Some algebra and formula 3.471.12 of Gradshteyn and Ryzhik (2000) leads to the desired result.

A.2. Proof of Proposition 2

In proving this result we do not use the distribution of an N–IG random variable directly. The key point is exploiting the independence of the variables $(V_{1}, \ldots, V_{n})$ used for defining $(W_{1}, \ldots, W_{n})$. Set $V = \sum_{j=1}^{n} V_{j}$ and $V_{-i} = \sum_{j \neq i} V_{j}$, and recall that the moment-generating function of an IG$(\alpha, \gamma)$-distributed random variable is given by $E[e^{-\lambda X}] = e^{-\alpha(\sqrt{2\lambda + \gamma^{2}} - \gamma)}$. As far as the mean is concerned, for any $i = 1, \ldots, n$ we have
$$
E[W_{i}] = E[V_{i} V^{-1}] = \int_{0}^{+\infty} E[e^{-uV_{-i}}]\, E[e^{-uV_{i}} V_{i}]\, du
= \int_{0}^{+\infty} E[e^{-uV_{-i}}]\, E\Bigl[-\frac{d}{du}\, e^{-uV_{i}}\Bigr]\, du
$$
$$
= -\int_{0}^{+\infty} e^{-(a-\alpha_{i})(\sqrt{2u+1}-1)}\, \frac{d}{du}\, e^{-\alpha_{i}(\sqrt{2u+1}-1)}\, du
= \alpha_{i} \int_{0}^{+\infty} (2u+1)^{-1/2}\, e^{-a(\sqrt{2u+1}-1)}\, du
$$
$$
= \frac{\alpha_{i}}{a} \int_{0}^{+\infty} -\frac{d}{du}\, e^{-a(\sqrt{2u+1}-1)}\, du = \frac{\alpha_{i}}{a} = p_{i},
$$
having applied Fubini's theorem and theorem 16.8 of Billingsley (1995). The variance and covariance can be deduced by similar arguments, combined with integration by parts and some tedious algebra.
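The mean formula can also be checked by simulation. The sketch below assumes $\gamma = 1$ and our own configuration of the $\alpha_{i}$; numpy's Wald (inverse-Gaussian) sampler with mean $\alpha_{i}$ and scale $\alpha_{i}^{2}$ has Laplace transform $e^{-\alpha_{i}(\sqrt{2u+1}-1)}$, that is, the law of the $V_{i}$ used in the proof.

```python
# Monte Carlo check of E[W_i] = alpha_i / a (gamma = 1), with an illustrative
# configuration of our own.  numpy's Wald sampler with mean alpha_i and scale
# alpha_i^2 has Laplace transform exp{-alpha_i (sqrt(2u + 1) - 1)}.
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.5, 1.0, 2.5])    # alpha_i = alpha(B_i); total mass a = 4
a = alphas.sum()

V = np.column_stack([rng.wald(al, al ** 2, size=200_000) for al in alphas])
W = V / V.sum(axis=1, keepdims=True)  # normalized inverse-Gaussian weights

print("empirical   E[W_i]:", W.mean(axis=0).round(4))
print("theoretical p_i   :", (alphas / a).round(4))
```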

A.3. Proof of Proposition 3

From Pitman (2003) (see also James 2002; Prünster 2002), we can deduce that the predictive distribution associated with an N–IG process with diffuse parameter $\alpha$ is of the form
$$
P\bigl(X_{n+1} \in \cdot \,\big|\, X^{(n)}\bigr) = w^{(n)}\, \frac{\alpha(\cdot)}{a} + \frac{1}{n} \sum_{j=1}^{k} w^{(n)}_{j}\, \delta_{X^{*}_{j}}(\cdot),
$$
with weights equal to
$$
w^{(n)} = \frac{a \int_{\mathbb{R}^{+}} u^{n}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_{1}}(u) \cdots \mu_{n_{k}}(u)\, \mu_{1}(u)\, du}{n \int_{\mathbb{R}^{+}} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_{1}}(u) \cdots \mu_{n_{k}}(u)\, du}
$$
and
$$
w^{(n)}_{j} = \frac{\int_{\mathbb{R}^{+}} u^{n}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_{1}}(u) \cdots \mu_{n_{j}+1}(u) \cdots \mu_{n_{k}}(u)\, du}{\int_{\mathbb{R}^{+}} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_{1}}(u) \cdots \mu_{n_{k}}(u)\, du},
$$
where $\mu_{n}(u) = \int_{\mathbb{R}^{+}} v^{n} e^{-uv}\, \nu(dv)$ for any positive $u$ and $n = 1, 2, \ldots$, and $\nu(dv) = (2\pi v^{3})^{-1/2} e^{-v/2}\, dv$. We can easily verify that $\mu_{n}(u) = 2^{n-1} \Gamma(n - \frac{1}{2}) \bigl(\sqrt{\pi}\, [2u+1]^{n-1/2}\bigr)^{-1}$ and, moreover, that
$$
\int_{\mathbb{R}^{+}} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_{1}}(u) \cdots \mu_{n_{k}}(u)\, du
= \frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \int_{0}^{+\infty} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{[2u+1]^{n-k/2}}\, du.
$$
By the change of variable $\sqrt{2u+1} = y$, the latter expression is equal to
$$
\frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \frac{1}{2^{n-1}} \int_{1}^{+\infty} \frac{(y^{2}-1)^{n-1}\, e^{-ay}}{y^{2n-k-1}}\, dy
$$
$$
= \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} \int_{1}^{+\infty} \frac{e^{-ay}}{y^{2n-k-2r-1}}\, dy
$$
$$
= \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r}\, a^{2n-2-k-2r}\, \Gamma(k + 2r - 2n + 2;\, a).
$$
Now the result follows by rearranging the terms appropriately, combined with some algebra.
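The chain of identities above can be checked numerically for any given configuration; the sketch below does so for one arbitrary choice of multiplicities and total mass (ours, for illustration), with the incomplete gamma computed by quadrature since its first argument may be negative.

```python
# Numerical check of the integral identity above for one arbitrary
# configuration (ours): n_j = (3, 2), a = 1.3.  gamma_inc computes
# Gamma(x; a) by quadrature since x = k + 2r - 2n + 2 may be negative.
import numpy as np
from math import comb, pi, sqrt
from scipy.integrate import quad
from scipy.special import gamma as G

a = 1.3
ms = [3, 2]                   # multiplicities n_j
n, k = sum(ms), len(ms)

def gamma_inc(x, s):
    val, _ = quad(lambda u: u ** (x - 1) * np.exp(-u), s, np.inf)
    return val

def mu(m, u):
    # mu_m(u) = 2^{m-1} Gamma(m - 1/2) / (sqrt(pi) (2u + 1)^{m - 1/2})
    return 2 ** (m - 1) * G(m - 0.5) / (sqrt(pi) * (2 * u + 1) ** (m - 0.5))

def lhs_integrand(u):
    val = u ** (n - 1) * np.exp(-a * (np.sqrt(2 * u + 1) - 1))
    for m in ms:
        val *= mu(m, u)
    return val

lhs, _ = quad(lhs_integrand, 0.0, np.inf, limit=200)

rhs = (np.exp(a) / (2 ** (k - 1) * sqrt(pi) ** k)
       * np.prod([G(m - 0.5) for m in ms])
       * sum(comb(n - 1, r) * (-1) ** (n - 1 - r)
             * a ** (2 * n - 2 - k - 2 * r)
             * gamma_inc(k + 2 * r - 2 * n + 2, a) for r in range(n)))

print(lhs, rhs)   # the two values agree to quadrature accuracy
```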

A.4. Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations $(X^{*}_{1}, \ldots, X^{*}_{k})$ and the random partition coincides with
$$
\alpha(dx_{1}) \cdots \alpha(dx_{k})\, \frac{e^{a} (-a^{2})^{n-1}}{a^{k}\, 2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^{2})^{-r}\, \Gamma(k + 2 + 2r - 2n;\, a).
$$
The conditional distribution of the vector $(k, n_{1}, \ldots, n_{k})$, given $n$, can be obtained by marginalizing with respect to $(X^{*}_{1}, \ldots, X^{*}_{k})$, thus yielding
$$
\frac{e^{a} (-a^{2})^{n-1}}{2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^{2})^{-r}\, \Gamma(k + 2 + 2r - 2n;\, a). \qquad \text{(A.1)}
$$
To determine $p(k|n)$, one needs to compute
$$
\sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr),
$$
where $(*)$ means that the sum extends over all partitions of the set of integers $\{1, \ldots, n\}$ into $k$ components:
$$
\sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) = \pi^{k/2} \sum_{(*)} \prod_{j=1}^{k} \Bigl(\frac{1}{2}\Bigr)_{n_{j}-1}
= \pi^{k/2} \binom{2n-k-1}{n-1} \frac{\Gamma(n)}{\Gamma(k)}\, 2^{2k-2n},
$$
where, for any positive $b$, $(b)_{n} = \Gamma(b+n)/\Gamma(b)$, and the result follows.
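The partition identity can be verified by brute force for small $n$ and $k$; in the sketch below, enumerate_partitions is our own helper, not taken from the article, and the two printed values agree.

```python
# Brute-force check of the partition identity for small n, k; the helper
# enumerate_partitions is ours, not the paper's.
from math import comb, pi, prod
from scipy.special import gamma as G

def enumerate_partitions(items, k):
    # All partitions of `items` into exactly k nonempty blocks.
    if len(items) == k:
        yield [[x] for x in items]
        return
    if k == 1:
        yield [list(items)]
        return
    first, rest = items[0], items[1:]
    for part in enumerate_partitions(rest, k - 1):
        yield [[first]] + part                    # `first` opens a new block
    for part in enumerate_partitions(rest, k):
        for i in range(len(part)):                # `first` joins an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

n, k = 6, 3
lhs = sum(prod(G(len(b) - 0.5) for b in part)
          for part in enumerate_partitions(list(range(n)), k))
rhs = pi ** (k / 2) * comb(2 * n - k - 1, n - 1) * G(n) / G(k) * 2 ** (2 * k - 2 * n)
print(lhs, rhs)   # the two values agree
```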

A.5. Proof of Proposition 5

Let $\{\mathcal{B}_{m} = \{A_{m,i}: i = 1, \ldots, k_{m}\}\}_{m}$ be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize $\alpha$ by setting $\alpha_{m} = \sum_{i=1}^{k_{m}} \alpha_{m,i}\, \delta_{s_{m,i}}$, where $\alpha_{m,i} = \alpha(A_{m,i})$ and $s_{m,i}$ is a point in $A_{m,i}$. Hence, whenever the $j$th element in the sample, $X_{j}$, belongs to $A_{m,i_{j}}$, it is as if we observed $s_{m,i_{j}}$. Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to express the posterior density function for the discretized mean as (14) and (15) with
$$
\psi_{m}(\sigma) = -C_{m}\bigl(x^{(n)}\bigr)\, \frac{e^{a}\, 2^{n-k}}{(\sqrt{\pi})^{k}} \Biggl( \prod_{j=1}^{k} \alpha_{m,i_{j}}\, \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \Biggr) \int_{0}^{+\infty} t^{n-1}\, e^{-\sum_{j=1}^{k_{m}} \sqrt{1 - 2it(s_{j}-\sigma)}\, \alpha_{m,j}}
\Biggl( \prod_{r=1}^{k} \Bigl[ \bigl(1 - 2it(s_{i_{r}} - \sigma)\bigr)^{n_{r}-1/2} + g_{m}\bigl(\alpha_{m,i_{r}}, s_{m,i_{r}}, t\bigr) \Bigr] \Biggr)^{-1} dt, \qquad \text{(A.2)}
$$
where $C_{m}(x^{(n)})^{-1}$ is the marginal distribution of the discretized sample and $g_{m}(\alpha_{m,i_{r}}, s_{m,i_{r}}, t) = O(\alpha_{m,i_{r}}^{2})$ as $m \to +\infty$ for any $t$. On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions $\{\mathcal{B}_{m}\}$. But in the particular N–IG case we are able to compute $C_{m}(x^{(n)})^{-1}$ by mimicking the technique used in Proposition 1 and exploiting the diffuseness of $\alpha$. Thus we have
$$
C_{m}\bigl(x^{(n)}\bigr)^{-1} = \frac{e^{a}\, 2^{n-k}}{(n-1)!\, (\sqrt{\pi})^{k}} \Biggl( \prod_{j=1}^{k} \alpha_{m,i_{j}}\, \Gamma\Bigl(n_{j} - \frac{1}{2}\Bigr) \Biggr) \int_{(0,+\infty)} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{\prod_{r=1}^{k} \bigl[ [1+2u]^{n_{r}-1/2} + h_{m}(\alpha_{m,i_{r}}, u) \bigr]}\, du,
$$
where $h_{m}(\alpha_{m,i_{r}}, u) = O(\alpha_{m,i_{r}}^{2})$ as $m \to +\infty$ for any $u$. Insert the previous expression in (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15) with
$$
\psi(\sigma) = -\frac{(n-1)! \int_{0}^{\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1-2it(x-\sigma)}\, \alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x^{*}_{j} - \sigma)\bigr)^{-n_{j}+1/2}\, dt}{\int_{0}^{+\infty} u^{n-1}\, e^{-a\sqrt{2u+1}}\, (2u+1)^{-n+k/2}\, du}.
$$
Now arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.
Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.
Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.
Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.
Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.
Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.
Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.
Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.
Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.
——— (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.
Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.
Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.
——— (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.
——— (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.
Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.
Gradshteyn, I. S., and Ryzhik, J. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.
Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.
Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.
Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.
Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.
——— (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.
James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.
——— (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.
Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.
Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.
Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates: I. Density Estimates," The Annals of Statistics, 12, 351–357.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.
MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.
MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.
Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.
Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.
Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.
Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.
Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.
——— (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.
——— (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.
Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.
Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.
Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.
Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.
Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.
Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.
Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.
Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.
Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.




Regazzini E Lijoi A and Pruumlnster I (2003) ldquoDistributional Results forMeans of Random Measures With Independent Incrementsrdquo The Annals ofStatistics 31 560ndash585

Roeder K (1990) ldquoDensity Estimation With Confidence Sets Exemplified bySuperclusters and Voids in the Galaxiesrdquo Journal of the American StatisticalAssociation 85 617ndash624

Roeder K and Wasserman L (1997) ldquoPractical Bayesian Density EstimationUsing Mixtures of Normalsrdquo Journal of the American Statistical Association92 894ndash902

Seshadri V (1993) The Inverse Gaussian Distribution New York Oxford Uni-versity Press

Walker S and Damien P (1998) ldquoA Full Bayesian Non-Parametric AnalysisInvolving a Neutral to the Right Processrdquo Scandinavia Journal of Statistics25 669ndash680

Walker S Damien P Laud P W and Smith A F M (1999) ldquoBayesianNonparametric Inference for Random Distributions and Related Functionsrdquo(with discussion) Journal of the Royal Statistical Society Ser B 61485ndash527

Page 5: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

1282 Journal of the American Statistical Association December 2005

Figure 2. Contour Plots for the Dirichlet [(a)-(c)] and N–IG [(d)-(f)] Densities on ∆2 With p1 = p2 = p̃1 = p̃2 = 1/3 for Different Choices of a and ã.

as a random probability distribution function on R, where a = α(R). The corresponding random probability measure P̃ is an N–IG process on R. As shown by Regazzini et al. (2003), such a construction holds for any increasing additive process, provided that the total mass of the Lévy measure is unbounded, giving rise to the class of normalized RMI. James (2002) extended normalized RMIs to Polish spaces. Thus the N–IG process clearly represents a particular case of normalized RMI. We remark that the idea of defining a general class of discrete random probability measures via normalization is due to Kingman (1975); previous stimulating work was done by Perman et al. (1992) and Pitman (2003).

The N–IG process, provided that α is nonatomic, turns out to also be a species-sampling model. This class of random probability measures, due to Pitman (1996), is defined as

    \tilde{P}(\cdot) \;=\; \sum_{i \ge 1} P_i \,\delta_{X_i}(\cdot) \;+\; \Bigl(1 - \sum_{i \ge 1} P_i\Bigr) H(\cdot), \qquad (8)

where 0 < P_i < 1 are random weights such that \sum_{i \ge 1} P_i \le 1, independent of the locations X_i, which are iid with some nonatomic distribution H. A species-sampling model is completely characterized by H, which is the prior guess at the shape of P̃, and by the so-called "exchangeable partition probability function" (EPPF) (see Pitman 1996). Although the definition of a species-sampling model provides an intuitive and quite general framework, it leaves a difficult open problem, namely the concrete assignment of the random weights P_i. It is clear that such an issue is crucial for the applicability of these models. Up to now, all tractable species-sampling models have been defined by exploiting the so-called "stick-breaking" procedure, already adopted by Halmos (1944) and Freedman (1963). (See Ishwaran and James 2001 and references therein for examples of what they call "stick-breaking priors.") More recently, Pitman (2003) considered Poisson–Kingman models, a subclass of normalized RMI, which are defined via normalization of homogeneous random measures with independent increments under the additional assumption of α being nonatomic. He showed that Poisson–Kingman models are a subclass of species-sampling models and characterized them by deriving an expression of the EPPF. From these considerations it follows that an N–IG process, provided that α is nonatomic, is also a species-sampling model. Note that in the N–IG case a closed-form expression of the EPPF can be derived: based on the work of Pitman (2003) or James (2002), combined with the arguments used in the proof of Proposition 3, one obtains (A.1) in the Appendix. In this article we do not pursue the approach based on Poisson process partition calculus, and we refer the reader to James (2002) for extensive calculus relative to EPPFs. We just remark that the N–IG process represents the first example (at least to our knowledge) of a tractable species-sampling model that cannot be derived via a stick-breaking procedure.

Before proceeding, it seems worthwhile to point out that the peculiarity of the N–IG and Dirichlet processes, compared with other members of these classes (and indeed among all random probability measures), is that their finite-dimensional distributions are known explicitly.

3.3 Some Properties of the Normalized Inverse-Gaussian Process

In this section we study some properties of the N–IG process. In particular, we discuss its discreteness, derive some of its moments and the predictive distribution, and consider the distribution of the number of distinct observations within a sample drawn from an N–IG process.

A first interesting issue is the almost-sure discreteness of an N–IG process. On the basis of the construction provided in Section 3.1, it follows that the distribution of the N–IG process admits a version that selects discrete probability measures almost surely. By a result of James (2003), which makes use of the construction in (7), all versions of P̃ are discrete. Hence we can affirm that the N–IG process is discrete almost surely.

The moments of an N–IG process with parameter α follow immediately from Proposition 2. Let B, B1, B2 ∈ 𝒳, and set C = B1 ∩ B2 and P0(·) = α(·)/a. Then

    E[\tilde{P}(B)] = P_0(B),

    \operatorname{var}[\tilde{P}(B)] = P_0(B)\bigl(1 - P_0(B)\bigr)\, a^2 e^a \,\Gamma(-2, a),

and

    \operatorname{cov}\bigl(\tilde{P}(B_1), \tilde{P}(B_2)\bigr) = \bigl[P_0(C) - P_0(B_1) P_0(B_2)\bigr]\, a^2 e^a \,\Gamma(-2, a),

where Γ(·, ·) denotes the incomplete gamma function. Note that if P0 is diffuse, then the mean of P̃ also follows from the fact that the N–IG process is a species-sampling model. These quantities are usually exploited for incorporating real qualitative prior knowledge into the model. For instance, Walker and Damien (1998) suggested controlling the prior guess P0, as well as the variance, of the random probability measure at issue. Walker, Damien, Laud, and Smith (1999) provided a detailed discussion of prior specification in nonparametric problems.
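As a quick numerical illustration (not part of the original derivation), the common factor a²eᵃΓ(−2, a) in the variance and covariance can be evaluated with mpmath, whose gammainc accepts the negative first argument of the incomplete gamma function; a minimal sketch:

```python
from mpmath import mp, mpf, exp, gammainc

mp.dps = 30  # working precision

def nig_var_factor(a):
    """Factor a^2 e^a Gamma(-2, a) appearing in var[P(B)] and cov(P(B1), P(B2))."""
    a = mpf(a)
    return a**2 * exp(a) * gammainc(-2, a)  # gammainc(z, a) = upper incomplete gamma

# e.g., prior variance of P(B) when P0(B) = 1/3 and a = 1
p = mpf(1) / 3
print(p * (1 - p) * nig_var_factor(1))
```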

An important goal in inferential procedures is predicting future values of a random quantity based on its past outcomes. Suppose that a sequence (X_n)_{n≥1} of exchangeable observations is defined in such a way that, given P̃, the X_i's are iid with distribution P̃. Moreover, let X*_1, ..., X*_k denote the k distinct observations within the sample X^{(n)} = (X_1, ..., X_n), with n_j > 0 terms being equal to X*_j, for j = 1, ..., k, and such that ∑_j n_j = n. The next proposition provides the predictive distributions corresponding to an N–IG process.

Proposition 3. If P̃ is an N–IG process with diffuse parameter α, then the predictive distributions are of the form

    \Pr\bigl(X_{n+1} \in B \mid X^{(n)}\bigr) \;=\; w^{(n)}_0 \, P_0(B) \;+\; w^{(n)}_1 \sum_{j=1}^{k} \Bigl(n_j - \frac12\Bigr) \delta_{X^*_j}(B)

for every B ∈ 𝒳, with

    w^{(n)}_0 \;=\; \frac{\sum_{r=0}^{n} \binom{n}{r} (-a^2)^{-r+1}\, \Gamma(k+1+2r-2n,\, a)}{2n \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a)}

and

    w^{(n)}_1 \;=\; \frac{\sum_{r=0}^{n} \binom{n}{r} (-a^2)^{-r+1}\, \Gamma(k+2r-2n,\, a)}{n \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a)}.

Note that the proof of Proposition 3 exploits a general result of Pitman (2003), later derived independently, by means of a different technique, by Prünster (2002) (see also James 2002).
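Because the weights involve incomplete gamma functions with negative first argument, a sketch of their numerical evaluation may be useful; the following illustration (assuming mpmath's gammainc for Γ(·, a)) computes w₀⁽ⁿ⁾ and w₁⁽ⁿ⁾ and verifies the identity w₀⁽ⁿ⁾ + (n − k/2)w₁⁽ⁿ⁾ = 1. The alternating sums cancel badly for moderate n, which is consistent with the high-precision computation mentioned in Section 4.1.

```python
from mpmath import mp, mpf, binomial, gammainc, fsum

mp.dps = 50  # the alternating sums below suffer heavy cancellation

def nig_predictive_weights(n, k, a):
    """Weights w0, w1 of Proposition 3 for an N-IG process with total mass a,
    given a sample of size n containing k distinct values."""
    a = mpf(a)
    den = fsum(binomial(n - 1, r) * (-a**2)**(-r) * gammainc(k + 2 + 2*r - 2*n, a)
               for r in range(n))
    num0 = fsum(binomial(n, r) * (-a**2)**(1 - r) * gammainc(k + 1 + 2*r - 2*n, a)
                for r in range(n + 1))
    num1 = fsum(binomial(n, r) * (-a**2)**(1 - r) * gammainc(k + 2*r - 2*n, a)
                for r in range(n + 1))
    return num0 / (2*n*den), num1 / (n*den)

w0, w1 = nig_predictive_weights(n=10, k=4, a=0.5)
print(w0, w1, w0 + (10 - 4/2) * w1)  # last printed value should equal 1
```

For n = 10, k = 4, and total mass .5 (the setup of Table 1 below), w₀ should come out near .2182.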

Thus the predictive distribution is a linear combination of the prior guess P0 and a weighted empirical distribution, with explicit expressions for the weights. Moreover, a generalized Pólya urn scheme follows immediately; it is given in (11). To highlight its distinctive features, it is useful to compare it with the prediction rule of the Dirichlet process. In the latter case, X_{n+1} is new with probability a/(a + n), and it coincides with X*_j with probability n_j/(a + n), for j = 1, ..., k. Thus the probability allocated to previous observations is n/(a + n) and does not depend on the number k of distinct observations within the sample. Moreover, the weight assigned to each X*_j depends only on the number of observations equal to X*_j, a characterizing property of the Dirichlet process that at the same time represents one of its drawbacks (see Ferguson 1974). In contrast, for the N–IG process the mechanism is quite interesting and exploits all available information. Given X^{(n)}, the (n + 1)st observation is new with probability w₀^{(n)} and coincides with one of the previous ones with probability (n − k/2)w₁^{(n)}. In this case the balancing between new and old observations takes the number of distinct values k into account: it is easy to verify numerically that w₀^{(n)} decreases as k decreases, and thus one removes more mass from the prior if fewer distinct observations (i.e., more ties) are recorded. Moreover, the allocation of probability to each X*_j is more elaborate than in the Dirichlet case: instead of increasing the weight proportionally to the number of ties, the probability assigned to X*_j is reinforced more than proportionally each time a new tie is recorded. This can be explained by the argument that the more often X*_j is observed, the stronger the statistical evidence in its favor, and thus it is sensible to reallocate mass toward it.

A small numerical example may clarify this point. When comparing the Dirichlet and N–IG processes, it is sensible to fix their parameters such that they have the same mean and variance, as was done in Section 2. Recall that this is tantamount to requiring them to have the same prior guess P0 and imposing var(D_α(B)) = var(P̃(B)) for every B ∈ 𝒳, which is equivalent to

    (a + 1)^{-1} \;=\; \tilde{a}^{\,2} e^{\tilde{a}}\, \Gamma(-2, \tilde{a}),

where a and ã are the total masses of the parameter measures of the Dirichlet and N–IG processes.

To highlight the reinforcement mechanism, let X = [0, 1] and P0 equal to the uniform distribution, with ã = .5 and a = 1.737. Suppose that one has observed X^{(10)} with four observations equal to .3, three observations equal to .1, two observations equal to .6, and one observation equal to .5. Table 1 gives the probabilities assigned by the predictive distributions of the Dirichlet and N–IG processes.

Table 1. Posterior Probability Allocation of the Dirichlet and N–IG Processes

  n = 10                Dirichlet process    N–IG process
  New observation             .1480             .2182
  X*1 = .3, n1 = 4            .3408             .3420
  X*2 = .1, n2 = 3            .2556             .2443
  X*3 = .6, n3 = 2            .1704             .1466
  X*4 = .5, n4 = 1            .0852             .0489

From this table the reinforcement becomes apparent. In the Dirichlet case, having four ties at .3 means assigning to it four times the probability given to an observation that appeared only once. For the N–IG process, things are very different: the mass assigned to the prior (i.e., to observing a new value) is higher, indicating the greater flexibility of the N–IG prior. Moreover, the probability assigned to .6 is more than two times the mass assigned to .5, the probability assigned to .1 is greater than 3/2 times the mass given to .6, and so on. Having a sample with many ties means having significant statistical evidence about the associated clustering structure, and the N–IG process makes use of this information.
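Both columns of Table 1, as well as the variance-matching pair (a, ã), can be checked numerically; the following sketch uses the same conventions (and weight routine) as the snippet after Proposition 3:

```python
from mpmath import mp, mpf, binomial, gammainc, fsum, exp

mp.dps = 50

def nig_w(n, k, a):
    a = mpf(a)
    den = fsum(binomial(n-1, r) * (-a**2)**(-r) * gammainc(k+2+2*r-2*n, a) for r in range(n))
    num0 = fsum(binomial(n, r) * (-a**2)**(1-r) * gammainc(k+1+2*r-2*n, a) for r in range(n+1))
    num1 = fsum(binomial(n, r) * (-a**2)**(1-r) * gammainc(k+2*r-2*n, a) for r in range(n+1))
    return num0/(2*n*den), num1/(n*den)

# variance match: (a + 1)^(-1) = atil^2 e^atil Gamma(-2, atil), solved here for a
atil = mpf('.5')
a = 1 / (atil**2 * exp(atil) * gammainc(-2, atil)) - 1
print(a)  # approximately 1.737

# predictive masses for n = 10 with ties (4, 3, 2, 1), hence k = 4 distinct values
w0, w1 = nig_w(10, 4, atil)
print('N-IG:     ', w0, [(nj - mpf(1)/2) * w1 for nj in (4, 3, 2, 1)])
print('Dirichlet:', a/(a + 10), [nj/(a + 10) for nj in (4, 3, 2, 1)])
```

The printed values should agree with Table 1 up to rounding.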

A further remarkable issue concerns the determination of the prior distribution for the number k of distinct observations X*_i among the n being observed. Antoniak (1974) remarked that such a distribution is induced by the distribution assigned to the random probability measure, and gave an expression for the Dirichlet case. The corresponding formula for the N–IG process is given in the following proposition.

Proposition 4. The distribution of the number of distinct observations k in a sample of size n is given by

    p(k \mid n) \;=\; \binom{2n-k-1}{n-1} \frac{e^a (-a^2)^{n-1}}{2^{2n-k-1}\,\Gamma(k)} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a) \qquad (9)

for k = 1, ..., n.

Note that the expression for p(k|n) in the Dirichlet case is

    c_n(k)\, a^k\, \frac{\Gamma(a)}{\Gamma(a+n)}, \qquad (10)

where c_n(k) is the absolute value of a Stirling number of the first kind (for details, see Green and Richardson 2001). Unlike (10), whose evaluation can be achieved by resorting to recursive formulas for Stirling numbers, the expression in (9) is straightforward to evaluate. Table 2 compares the priors p(k|n) corresponding to the Dirichlet and N–IG processes in the same setup as Table 1.
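To make the comparison concrete, here is a sketch that evaluates (9) directly and (10) via the standard recursion c(n, k) = c(n−1, k−1) + (n−1)c(n−1, k) for unsigned Stirling numbers of the first kind; it should reproduce the rows of Table 2 up to rounding:

```python
from mpmath import mp, mpf, binomial, gammainc, gamma, fsum, exp, nstr

mp.dps = 50

def nig_p_k(n, k, a):
    """Equation (9) for the N-IG process with total mass a."""
    a = mpf(a)
    s = fsum(binomial(n-1, r) * (-a**2)**(-r) * gammainc(k+2+2*r-2*n, a) for r in range(n))
    return binomial(2*n-k-1, n-1) * exp(a) * (-a**2)**(n-1) / (2**(2*n-k-1) * gamma(k)) * s

def dir_p_k(n, k, a):
    """Equation (10) for the Dirichlet process with total mass a."""
    c = [[mpf(0)] * (n+1) for _ in range(n+1)]
    c[0][0] = mpf(1)
    for m in range(1, n+1):
        for j in range(1, m+1):
            c[m][j] = c[m-1][j-1] + (m-1) * c[m-1][j]   # |s(m, j)| recursion
    a = mpf(a)
    return c[n][k] * a**k * gamma(a) / gamma(a + n)

print([nstr(dir_p_k(10, k, 1.737), 3) for k in range(1, 11)])
print([nstr(nig_p_k(10, k, 0.5), 3) for k in range(1, 11)])
```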

In this toy example, note that the distribution p(·|n) induced by the N–IG prior is flatter than that corresponding to the Dirichlet process. This fact plays a role in the context of mixture modeling, to be examined in the next section.

4. THE MIXTURE OF NORMALIZED INVERSE–GAUSSIAN PROCESS

In light of the previous results, it is clear that the reinforcement mechanism makes the N–IG prior an interesting alternative to the Dirichlet process. Nowadays the Dirichlet process is seldom used directly for assessing inferences; indeed, it is commonly exploited as the crucial building block in a hierarchical mixture model of type (1). Here we aim to study a mixture of N–IG process model as set in (1), together with suitable semiparametric variations of it. The mixture of N–IG process represents a particular case of normalized random measures driven by increasing additive processes (see Nieto-Barajas et al. 2004) and also of species-sampling mixture models (see Ishwaran and James 2003). With respect to both classes, the mixture of N–IG process stands out for its tractability.

To exploit a mixture of N–IG process for inferential purposes, it is essential to derive an appropriate sampling scheme. In such a framework, knowledge of the predictive distributions determined in Proposition 3 is crucial. Indeed, most of the simulation algorithms developed in Bayesian nonparametrics rely on variations of the Blackwell–MacQueen Pólya urn scheme and on the development of appropriate Gibbs sampling procedures. (See Escobar 1994; MacEachern 1994; Escobar and West 1995 for the Dirichlet case, and Pitman 1996; Ishwaran and James 2001 for extensions to stick-breaking priors.) In the case of an N–IG process, it follows that the joint distribution of X^{(n)} can be characterized by the following generalized Pólya urn scheme. Let Z1, ..., Z_n be an iid sample from P0. Then a sample X^{(n)} from an N–IG process can be generated as follows. Set X1 = Z1 and, for i = 2, ..., n,

    (X_i \mid X^{(i-1)}) \;=\;
    \begin{cases}
      Z_i & \text{with probability } w^{(i-1)}_0, \\
      X^*_{ij} & \text{with probability } \bigl(n_j - \frac12\bigr) w^{(i-1)}_1, \quad j = 1, \ldots, k_i,
    \end{cases} \qquad (11)

where k_i represents the number of distinct observations, denoted by X*_{i1}, ..., X*_{ik_i}, in X^{(i−1)}, and w₀^{(i−1)} and w₁^{(i−1)} are as given in Proposition 3. A simulation sketch of this scheme is given below.

Table 2. Prior Probabilities for k Corresponding to the Dirichlet and N–IG Processes

  n = 10                 k = 1   k = 2   k = 3   k = 4   k = 5   k = 6   k = 7   k = 8   k = 9   k = 10
  Dirichlet (a = 1.737)  .0274   .1350   .2670   .2870   .1850   .0756   .0196   .0031   .00029  .00001
  N–IG (ã = .5)          .0523   .1286   .1813   .1936   .1716   .1296   .0824   .042    .0155   .0031

Figure 3. Prior Probabilities for k (n = 100) Corresponding to the N–IG Process and the Dirichlet Process for Different Values of a (ã). These values were chosen to match the modes at 6, 14, and 28 [Dirichlet: a = 1.2, 4.2, 12.6, left to right; N–IG: ã = .1, 1, 5, left to right].

Moreover, the result in Proposition 4 can be interpreted as providing the prior distribution of the number k of components in the mixture for a given sample size n. As in the Dirichlet case, a smaller total mass a yields a p(·|n) more concentrated on smaller values of k. This can be explained by the fact that a smaller a gives rise to a smaller w₀^{(n)}; that is, it generates new data with lower probability. However, the p(·|n) induced by the N–IG prior is apparently less informative than that corresponding to the Dirichlet process prior, and thus it is more robust with respect to changes in a. A qualitative illustration is given in Figure 3, where the distribution of k given n = 100 observations is depicted for the N–IG and Dirichlet processes. We connected the probability points by straight lines only for visual simplification. Note that the mode of the distribution of k corresponding to the N–IG process never has probability larger than .07.

Recently, new algorithms have been proposed for dealing with mixtures. Extending the work of Brunner, Chan, James, and Lo (2001) on an iid sequential importance sampling procedure for fitting MDPs, Ishwaran and James (2003) proposed a generalized weighted Chinese restaurant algorithm that covers species-sampling mixture models. They formally derived the posterior distribution of a species-sampling mixture. To draw approximate samples from this distribution, they devised a computational scheme that requires knowledge of the conditional distribution of the species-sampling model P̃ given the missing values X1, ..., X_n. When feasible, such an algorithm has the merit of reducing the Monte Carlo error (see Ishwaran and James 2003 for details and further references). In the case of an N–IG process, sampling from its posterior law is not straightforward, because it is characterized in terms of a latent variable (see James 2002). Further investigations are needed to implement these interesting sampling procedures efficiently in the N–IG case, and more generally for normalized RMI. Problems of the same type arise if one is willing to implement the scheme proposed by Nieto-Barajas et al. (2004).

To carry out a more detailed comparison between Dirichlet and N–IG mixtures, we next consider two illustrative examples.

4.1 Simulated Data

Here we consider simulated data sampled from uniform mixtures of normal distributions with different numbers of components, and we compare the behavior of the MDP and the mixture of N–IG process for different choices of priors on the number of components. We then evaluate the performance in terms of posterior probabilities on the correct number of components and in terms of log-Bayes factors for testing k = k* against k ≠ k*, where k* represents the correct number of components.

We first analyze a dataset Y^{(100)} simulated from a mixture of three normal distributions with means −4, 0, and 8, variance 1, and corresponding weights .33, .34, and .33. Let N(·|m, v) denote the normal distribution with mean m and variance v > 0. We consider the following hierarchical model to represent such data:

    (Y_i \mid X_i) \overset{\text{ind}}{\sim} N(Y_i \mid X_i, 1), \qquad i = 1, \ldots, 100,
    (X_i \mid \tilde{P}) \overset{\text{iid}}{\sim} \tilde{P},
    \tilde{P} \sim \mathscr{P},

where 𝒫 refers to either an N–IG process or a Dirichlet process. Both are centered at the same prior guess N(·|Ȳ, t²), where t is the data range, that is, t = max Y_i − min Y_i. To fix the total masses a and ã, we no longer consider the issue of matching variances. For comparative purposes, in this mixture-modeling framework it seems more plausible to start with similar priors on the number of components k. Hence a and ã are chosen such that the mode of p(·|n) is the same in both the Dirichlet and N–IG cases. We choose to set the mode equal to k = 8 in both cases, yielding a = 2 and ã = .2365. The corresponding distributions are plotted in Figure 4.

Figure 4. Prior Probabilities for k (n = 100) Corresponding to the Dirichlet and N–IG Processes [Dirichlet (a = 2); N–IG (ã = .2365)].

In this setup we draw a sample from the posterior distribution L(X^{(100)} | Y^{(100)}) by iteratively drawing samples from the distributions of (X_i | X^{(100)}_{−i}, Y^{(100)}), for i = 1, ..., 100, given by

    P\bigl(X_i \in \cdot \mid X^{(100)}_{-i}, Y^{(100)}\bigr) \;=\; q^*_{i0}\, N\Bigl(X_i \in \cdot \,\Big|\, \frac{t^2 Y_i + \bar{Y}}{1 + t^2},\; \frac{t^2}{t^2 + 1}\Bigr) \;+\; \sum_{j=1}^{k_i} q^*_{ij}\, \delta_{X^*_j}(\cdot), \qquad (12)

where X^{(100)}_{−i} is the vector X^{(100)} with the ith coordinate deleted, and k_i refers to the number of X*_j's in X^{(100)}_{−i}. The mixing proportions are given by

    q^*_{i0} \;\propto\; w^{(99)}_{i0}\, N(Y_i \mid \bar{Y}, t^2 + 1)

and

    q^*_{ij} \;\propto\; \bigl(n_j - \tfrac12\bigr) w^{(99)}_{i1}\, N(Y_i \mid X^*_j, 1),

subject to the constraint ∑_{j=0}^{k_i} q*_{ij} = 1. The values for w^{(99)}_{i0} and w^{(99)}_{i1} are given as in Proposition 3, applied to determine the "predictive" distribution of X_i given X^{(100)}_{−i}; they are computed with a precision of 14 decimal places. Note that, in general, w^{(99)}_{i0} and w^{(99)}_{i1} depend only on a, n, and k_i, and not on the particular partition ξ of n = n_{ξ1} + ··· + n_{ξk}. This implies that a table containing the values of w^{(99)}_{i0} and w^{(99)}_{i1} for given a, n, and k = 1, ..., 99 can be generated in advance for use in the Pólya urn Gibbs sampler. Further details are available on request from the authors.

To obtain our estimates, we resorted to a Pólya urn Gibbs sampler such as the one set forth by Escobar and West (1995). As pointed out by Ishwaran and James (2001), such a generalized Pólya urn characterization can also be considered in the N–IG case, the only new ingredients being the weights of the predictive distribution determined in Proposition 3. All of the following results were obtained by implementing the MCMC algorithm using 10,000 iterations after 2,000 burn-in sweeps. Figure 5 depicts the posterior density estimates and the true density that generated the data.

The predictive density corresponding to the mixture of N–IG process clearly represents a more accurate estimate. Table 3 provides the prior and posterior probabilities of the number of components. The prior probabilities are computed according to Proposition 4, whereas the posterior probabilities follow from the MCMC output. One may note that the posterior mass assigned to k = 3 by the N–IG process is significantly higher than that corresponding to the MDP.

As far as the log-Bayes factors are concerned, one obtains BF_{Dα} = 5.46 and BF_{N–IGα} = 4.97, thus favoring the MDP. The reason for this outcome seems to be that the prior probability on the correct number of components is much lower for the MDP.

It is also interesting to look at the case in which the mode of p(k|n) corresponds to the correct number of components, k* = 3. In such a case the prior probability on k* = 3 is much higher for the MDP, being .285 against the .0567 of the mixture of N–IG process. This results in a posterior probability on k* = 3 of .9168 for the MDP and .86007 for the mixture of N–IG process, whereas the log-Bayes factors favor the N–IG mixture, being BF_{Dα} = 3.32 and BF_{N–IGα} = 4.63. This again seems to be due to the fact that the prior probability of one process on the correct number of components is much lower than that corresponding to the other.

Figure 5. Posterior Density Estimates for the Mixture of Three Normals With Means −4, 0, 8, Variance 1, and Mixing Proportions .33, .34, and .33 [Dirichlet (a = 2); N–IG (ã = .2365)].

The next dataset Y^{(120)} that we deal with in this framework is drawn from a uniform mixture of six normal distributions with means −10, −6, −2, 2, 6, and 10 and variance .7. Both the Dirichlet and N–IG processes are again centered at the same prior guess N(·|Ȳ, t²), where t represents the data range. We set ã = .01 and a = .5, such that p(k|n) has mode at k = 3 for both the MDP and the mixture of N–IG process. Such an example is meant to shed some light on the ability of the mixture of N–IG process to detect clusters when starting from a prior with mode at a small number of clusters and data coming from a mixture with more components. In this case the mixture of N–IG process performs better than the MDP, both in terms of posterior probabilities, as shown in Table 4, and in terms of log-Bayes factors, because BF_{Dα} = 3.13, whereas BF_{N–IGα} = 3.78.

If one considers data drawn from mixtures with more components, then the mixture of N–IG process does systematically better in terms of posterior probabilities, but the phenomenon of "relatively low probability on the correct number of components" of the MDP reappears, leading it to be favored in terms of Bayes factors. For instance, we have taken 120 data from a uniform mixture of eight normals with means −14, −10, −6, −2, 2, 6, 10, and 14 and variance .7, and we set ã = .01 and a = .5, so that p(k|120) has mode at k = 3. This yields a priori probabilities on k* = 8 of .00568 and .0489, and a posteriori probabilities of .0937 and .4338, for the MDP and the mixture of N–IG process. In contrast, the log-Bayes factors are BF_{Dα} = 2.90 and BF_{N–IGα} = 2.73. One can alternatively match the a priori probability on the correct number of components, such that p(8|120) = .04765 for both priors. This happens if we set a = .86, which results in the MDP having mode at k = 4, closer to the correct number k* = 8, whereas the mixture of N–IG process remains unchanged. This results in a posterior probability on k* = 8 for the MDP of .1134 and, having removed the influence of the low probability p(8|120), in a log-Bayes factor of .94.

4.2 Galaxy Data

In this section we reanalyze the galaxy dataset popularized by Roeder (1990). These data, consisting of the relative velocities (in km/sec) of 82 galaxies, have been widely studied within the framework of Bayesian statistics (see, e.g., Roeder and Wasserman 1997; Green and Richardson 2001; Petrone and Veronese 2002). For comparison purposes, we use the following semiparametric setting, also analyzed by Escobar and West (1995):

    (Y_i \mid m_i, V_i) \overset{\text{ind}}{\sim} N(Y_i \mid m_i, V_i), \qquad i = 1, \ldots, 82,
    \bigl((m_i, V_i) \mid \tilde{P}\bigr) \overset{\text{iid}}{\sim} \tilde{P},
    \tilde{P} \sim \mathscr{P},

Table 3. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|100) Has Mode at k = 8

  n = 100                        k ≤ 2    k = 3    k = 4    k = 5    k = 6    k = 7    k = 8    k ≥ 9
  Dirichlet (a = 2)    Prior     .00225   .00997   .03057   .06692   .11198   .14974   .16501   .46356
                       Posterior          .7038    .2509    .0406    .0046    .0001
  N–IG (ã = .2365)     Prior     .02404   .0311    .0419    .04937   .05386   .05611   .05677   .68685
                       Posterior          .8223    .1612    .0160    .0005

Table 4. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|120) Has Mode at 3

  n = 120                        k = 1    k = 2    k = 3    k = 4    k = 5    k = 6    k = 7    k = 8     k ≥ 9
  Dirichlet (a = .5)   Prior     .08099   .21706   .27433   .21954   .12583   .05532   .01949   .005678   .00176
                       Posterior                            .0148    .3455    .5722    .064     .0033     .0002
  N–IG (ã = .01)       Prior     .04389   .05096   .05169   .05142   .05082   .04997   .0489    .04765    .6047
                       Posterior                                     .0085    .6981    .2395    .0479     .006

where, as before, 𝒫 represents either the N–IG process or the Dirichlet process. Again the prior guess P0 is the same for both nonparametric priors and is characterized by

    P_0(A) \;=\; \int_A N(x \mid \mu, \tau v^{-1})\, \mathrm{Ga}(v \mid s/2, S/2)\, dx\, dv,

where A ∈ B(R × R⁺) and Ga(·|c, d) is the density corresponding to a gamma distribution with mean c/d. Similar to Escobar and West (1995), we assume a further hierarchy for μ and τ, namely μ ∼ N(·|g, h) and τ⁻¹ ∼ Ga(·|w, W). We choose the hyperparameters for this illustration to fit those used by Escobar and West (1995), namely g = 0, h = .001, w = 1, and W = 100. For comparative purposes, we again choose ã to achieve the mode match: if the mode of p(·|82) is at k = 5, as it is for the Dirichlet process with a = 1, then ã = .0829. From Table 5 it is apparent that the N–IG process provides a noninformative prior for k compared with the distribution induced by the Dirichlet process.
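A draw from this prior guess, with the stated hyperpriors, can be sketched as follows; the values of s and S are illustrative stand-ins (they are not restated in this part of the text), and the normal's second argument is read as a variance:

```python
import numpy as np

rng = np.random.default_rng(2)

g, h, w, W = 0.0, 0.001, 1.0, 100.0   # hyperparameters given in the text
s, S = 4.0, 2.0                       # illustrative stand-ins, not from the paper

def draw_p0():
    """One draw (m, V) from P0, with (mu, tau) drawn from the hyperpriors
    mu ~ N(g, h) and tau^{-1} ~ Ga(w, W); Ga(c, d) has mean c/d (rate d)."""
    mu = rng.normal(g, np.sqrt(h))
    tau = 1.0 / rng.gamma(shape=w, scale=1.0 / W)
    V = rng.gamma(shape=s / 2.0, scale=2.0 / S)      # V ~ Ga(s/2, S/2)
    m = rng.normal(mu, np.sqrt(tau / V))             # m | V ~ N(mu, tau V^{-1})
    return m, V

print([draw_p0() for _ in range(3)])
```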

The details of the Pólya urn Gibbs sampler are as provided by Escobar and West (1995), with the only difference being the replacement of the weights with those corresponding to the N–IG process, given in Proposition 3. Figure 6 shows the posterior density estimates based on 10,000 iterations, considered after a burn-in period of 2,000 sweeps. Table 5 displays the prior and the posterior probabilities of the number of components in the mixture.

Some comments on the results in Table 5 are in order. Escobar and West (1995) extensively discussed the issue of learning about the total mass a. This is motivated by the sensitivity of posterior inferences to the choice of the parameter a; hence they randomized a and specified a prior distribution for it. It seems worth noting that the posterior distribution of the number of components for the N–IG process with fixed ã essentially coincides with the corresponding distribution arising from an MDP with random a, given in table 6 of Escobar and West (1995). Some explanations of this phenomenon might be that the prior p(k|n) is essentially noninformative in a broad neighborhood of its mode, thus mitigating the influence of ã on posterior inferences, and that the aggressiveness in reducing/detecting clusters of the N–IG mixture, shown in the previous example, also seems to be a plausible reason for such a gain in "parsimony." However, for a mixture of N–IG process it may be of interest to have a random ã. To achieve this, a Metropolis step must be added to the algorithm. In principle this is straightforward; the only drawback would be an increase in the computational burden, because computing the weights for the N–IG process is, after all, not as quick as computing those corresponding to the Dirichlet process.

5. THE MEAN OF A NORMALIZED INVERSE–GAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for inference with continuous data is represented by histogram smoothing. Such a problem can be handled by exploiting the so-called "filtered-variate" random probability measures, as defined by Dickey and Jiang (1998). These quantities essentially coincide with suitable means of random probability measures. Here we focus attention on means of N–IG processes. After the recent breakthrough achieved by Cifarelli and Regazzini (1990), this has become a very active area of research in Bayesian nonparametrics (see, e.g., Diaconis and Kemperman 1996 and Regazzini, Guglielmi, and Di Nunno 2002 for the Dirichlet case, and Epifani, Lijoi, and Prünster 2003, Hjort 2003, and James 2002 for results beyond the Dirichlet case). In particular, Regazzini et al. (2003) dealt with means of normalized priors, providing conditions for their existence and their exact prior and approximate posterior distributions. Here we specialize their general results to the N–IG process and derive the exact posterior density.

Letting X = R and 𝒳 = B(R), we study the distribution of the mean ∫_R x P̃(dx). The first issue to face is its finiteness. A necessary and sufficient condition for this to hold can be derived from proposition 1 of Regazzini et al. (2003) and is given by

    \int_{\mathbb{R}} \sqrt{2\lambda |x| + 1}\; \alpha(dx) \;<\; +\infty \qquad \text{for every } \lambda > 0.

As far as the problem of the distribution of a mean is concerned, proposition 2 of Regazzini et al. (2003) and some calculations lead to an expression for the prior distribution function of the mean:

    F(\sigma) \;=\; \frac12 \;-\; \frac{e^a}{\pi} \int_0^{+\infty} \frac1t \exp\Bigl\{ -\int_{\mathbb{R}} \sqrt[4]{1 + 4t^2(x-\sigma)^2}\, \cos\Bigl[\frac12 \arctan\bigl(2t(x-\sigma)\bigr)\Bigr] \alpha(dx) \Bigr\}
    \qquad\qquad \times \sin\Bigl\{ \int_{\mathbb{R}} \sqrt[4]{1 + 4t^2(x-\sigma)^2}\, \sin\Bigl[\frac12 \arctan\bigl(2t(x-\sigma)\bigr)\Bigr] \alpha(dx) \Bigr\}\, dt \qquad (13)
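For a concrete sense of (13), the following sketch evaluates F(σ) numerically for the purely illustrative choice α = a·N(0, 1), a Gaussian parameter measure for which the finiteness condition above clearly holds; mpmath's quad handles the nested integrals:

```python
from mpmath import mp, mpf, exp, sin, cos, atan, npdf, quad, inf, pi

mp.dps = 15

def F(sigma, a=1.0):
    """Prior cdf (13) of the mean, for alpha(dx) = a N(x | 0, 1) dx (illustrative)."""
    a, sigma = mpf(a), mpf(sigma)

    def inner(t, trig):
        # integral over R of (1 + 4 t^2 (x-sigma)^2)^(1/4) trig(arctan(2t(x-sigma))/2) alpha(dx)
        f = lambda x: ((1 + 4*t**2*(x - sigma)**2) ** mpf('0.25')
                       * trig(atan(2*t*(x - sigma)) / 2) * a * npdf(x))
        return quad(f, [-inf, sigma, inf])

    integrand = lambda t: exp(-inner(t, cos)) * sin(inner(t, sin)) / t
    return mpf('0.5') - exp(a) / pi * quad(integrand, [0, 1, inf])

# by symmetry of this alpha, the distribution of the mean is symmetric around 0
print(F(0.0))   # should be close to 1/2
```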

Table 5. Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes in the Galaxy Data Example

  n = 82                         k ≤ 3    k = 4    k = 5    k = 6    k = 7    k = 8    k = 9    k = 10   k ≥ 11
  Dirichlet (a = 1)    Prior     .2140    .2060    .214     .169     .106     .0548    .0237    .0088    .0038
                       Posterior          .0465    .125     .2524    .2484    .2011    .08      .019     .0276
  N–IG (ã = .0829)     Prior     .1309    .0617    .0627    .0621    .0607    .0586    .0562    .0534    .4537
                       Posterior .003     .0754    .168     .2301    .2338    .1225    .0941    .0352    .0379

Figure 6. Posterior Density Estimates for the Galaxy Dataset [N–IG (ã = .0829); Dirichlet (a = 1)].

for any σ ∈ R. Before providing the posterior density of a mean of an N–IG process, it is useful to introduce the Liouville–Weyl fractional integral, defined as

    I^n_{c+} h(\sigma) \;=\; \int_c^{\sigma} \frac{(\sigma - u)^{n-1}}{(n-1)!}\, h(u)\, du

for n ≥ 1, and set equal to the identity operator for n = 0, for any c < σ. Moreover, let Re z and Im z denote the real and imaginary parts of z ∈ C.

Proposition 5. If P̃ is an N–IG process with diffuse parameter α and −∞ ≤ c = inf supp(α), then the posterior density of the mean is of the form

    \rho_{x^{(n)}}(\sigma) \;=\; \frac{1}{\pi}\, I^{n-1}_{c+} \operatorname{Im} \psi(\sigma) \qquad \text{if } n \text{ is even} \qquad (14)

and

    \rho_{x^{(n)}}(\sigma) \;=\; -\frac{1}{\pi}\, I^{n-1}_{c+} \operatorname{Re} \psi(\sigma) \qquad \text{if } n \text{ is odd,} \qquad (15)

with

    \psi(\sigma) \;=\; \frac{(n-1)!\, 2^{n-1}}{a^{2n-(2+k)} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a)}
    \qquad\qquad \times \int_0^{+\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x-\sigma)}\, \alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x^*_j - \sigma)\bigr)^{-n_j + 1/2}\, dt.

6. CONCLUDING REMARKS

In this article we have studied some of the statistical implications of using the N–IG process as an alternative to the Dirichlet process. Both priors are almost surely discrete, allow for explicit derivation of their finite-dimensional distributions, and lead to tractable expressions of relevant quantities. Their natural use is in the context of mixture modeling, where they present remarkably different behaviors. Indeed, it has been shown that the prior distribution on the number of components induced by the N–IG process is wider than that induced by the Dirichlet process. This seems to mitigate the influence of the choice of the total mass a of the parameter measure. Moreover, the mixture of N–IG process seems to be more aggressive in reducing/detecting clusters on the basis of ties present in the data. Such conclusions are, however, preliminary and still need to be validated by more extensive applications, which will constitute the object of our future research.

APPENDIX: PROOFS

A.1 Proof of Proposition 1

Having (3) at hand, the computation of (4) is as follows. One operates the transformation W_i = V_i (∑_{i=1}^n V_i)^{−1} for i = 1, ..., n−1 and W_n = ∑_{j=1}^n V_j. Some algebra and formula 3.471.12 of Gradshteyn and Ryzhik (2000) lead to the desired result.

A.2 Proof of Proposition 2

In proving this result, we do not use the distribution of an N–IG random variable directly. The key point is exploiting the independence of the variables (V_1, ..., V_n) used for defining (W_1, ..., W_n). Set V = ∑_{j=1}^n V_j and V_{−i} = ∑_{j≠i} V_j, and recall that the moment-generating function of an IG(a, γ)-distributed random variable is given by E[e^{−λX}] = e^{−a(√(2λ+γ²)−γ)}. As far as the mean is concerned, for any i = 1, ..., n we have

    E[W_i] = E[V_i V^{-1}]
    = \int_0^{+\infty} E[e^{-u V_{-i}}]\, E[e^{-u V_i} V_i]\, du
    = \int_0^{+\infty} E[e^{-u V_{-i}}]\, E\Bigl[-\frac{d}{du} e^{-u V_i}\Bigr] du
    = -\int_0^{+\infty} e^{-(a - \alpha_i)(\sqrt{2u+1} - 1)}\, \frac{d}{du} e^{-\alpha_i (\sqrt{2u+1} - 1)}\, du
    = \alpha_i \int_0^{+\infty} (2u+1)^{-1/2}\, e^{-a(\sqrt{2u+1} - 1)}\, du
    = \frac{\alpha_i}{a} \int_0^{+\infty} -\frac{d}{du} e^{-a(\sqrt{2u+1} - 1)}\, du \;=\; \frac{\alpha_i}{a} \;=\; p_i,

having applied Fubini's theorem and theorem 16.8 of Billingsley (1995). The variance and covariance can be deduced by similar arguments, combined with integration by parts and some tedious algebra.

A.3 Proof of Proposition 3

From Pitman (2003) (see also James 2002; Prünster 2002), we can deduce that the predictive distribution associated with an N–IG process with diffuse parameter α is of the form

    P\bigl(X_{n+1} \in \cdot \mid X^{(n)}\bigr) \;=\; w^{(n)}\, \frac{\alpha(\cdot)}{a} \;+\; \frac1n \sum_{j=1}^{k} w^{(n)}_j\, \delta_{X^*_j}(\cdot),

with weights equal to

    w^{(n)} \;=\; \frac{a \int_{\mathbb{R}^+} u^n e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, \mu_1(u)\, du}{n \int_{\mathbb{R}^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du}

and

    w^{(n)}_j \;=\; \frac{\int_{\mathbb{R}^+} u^n e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_j+1}(u) \cdots \mu_{n_k}(u)\, du}{\int_{\mathbb{R}^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du},

where μ_n(u) = ∫_{R⁺} vⁿ e^{−uv} ν(dv) for any positive u and n = 1, 2, ..., and ν(dv) = (2πv³)^{−1/2} e^{−v/2} dv. We can easily verify that μ_n(u) = 2^{n−1} Γ(n − 1/2)(√π [2u+1]^{n−1/2})^{−1} and, moreover, that

    \int_{\mathbb{R}^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du
    \;=\; \frac{2^{n-k} e^a}{(\sqrt{\pi})^k} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr) \int_0^{+\infty} \frac{u^{n-1} e^{-a\sqrt{2u+1}}}{[2u+1]^{n-k/2}}\, du.

By the change of variable √(2u+1) = y, the latter expression is equal to

    \frac{2^{n-k} e^a}{(\sqrt{\pi})^k} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr) \frac{1}{2^{n-1}} \int_1^{+\infty} \frac{(y^2 - 1)^{n-1} e^{-ay}}{y^{2n-k-1}}\, dy
    \;=\; \frac{e^a}{2^{k-1} (\sqrt{\pi})^k} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} \int_1^{+\infty} \frac{e^{-ay}}{y^{2n-k-2r-1}}\, dy
    \;=\; \frac{e^a}{2^{k-1} (\sqrt{\pi})^k} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} a^{2n-2-k-2r}\, \Gamma(k + 2r - 2n + 2,\, a).

Now the result follows by rearranging the terms appropriately, combined with some algebra.

A.4 Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations (X*_1, ..., X*_k) and the random partition coincides with

    \alpha(dx_1) \cdots \alpha(dx_k)\, \frac{e^a (-a^2)^{n-1}}{a^k\, 2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a).

The conditional distribution of the vector (k, n_1, ..., n_k), given n, can be obtained by marginalizing with respect to (X*_1, ..., X*_k), thus yielding

    \frac{e^a (-a^2)^{n-1}}{2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a). \qquad (A.1)

To determine p(k|n), one needs to compute

    \sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr),

where (*) means that the sum extends over all partitions of the set of integers {1, ..., n} into k components:

    \sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac12\Bigr) \;=\; \pi^{k/2} \sum_{(*)} \prod_{j=1}^{k} \Bigl(\frac12\Bigr)_{n_j - 1} \;=\; \binom{2n-k-1}{n-1} \frac{\Gamma(n)}{\Gamma(k)}\, 2^{2k-2n}\, \pi^{k/2},

where, for any positive b, (b)_n = Γ(b + n)/Γ(b), and the result follows.

A.5 Proof of Proposition 5

Let ε_m = {A_{m,1}, ..., A_{m,k_m}}, m ≥ 1, be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize α by setting α_m = ∑_{i=1}^{k_m} α_{m,i} δ_{s_{m,i}}, where α_{m,i} = α(A_{m,i}) and s_{m,i} is a point in A_{m,i}. Hence, whenever the jth element in the sample, X_j, belongs to A_{m,i_j}, it is as if we observed s_{m,i_j}. Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to express the posterior density function for the discretized mean as (14) and (15), with

    \psi_m(\sigma) \;=\; -C_m\bigl(x^{(n)}\bigr)\, \frac{e^a\, 2^{n-k}}{(\sqrt{\pi})^k} \Bigl( \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\Bigl(n_j - \frac12\Bigr) \Bigr)
    \qquad\qquad \times \int_0^{+\infty} t^{n-1}\, e^{-\sum_{j=1}^{k_m} \sqrt{1 - 2it(s_j - \sigma)}\, \alpha_{m,j}}
    \qquad\qquad \times \Bigl( \prod_{r=1}^{k} \Bigl[ \bigl(1 - 2it(s_{i_r} - \sigma)\bigr)^{n_r - 1/2} + g_m\bigl(\alpha_{m,i_r}, s_{m,i_r}, t\bigr) \Bigr] \Bigr)^{-1} dt, \qquad (A.2)

where C_m(x^{(n)})^{−1} is the marginal distribution of the discretized sample and g_m(α_{m,i_r}, s_{m,i_r}, t) = O(α²_{m,i_r}) as m → +∞ for any t. On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions {ε_m: m ≥ 1}. But in the particular N–IG case we are able to compute C_m(x^{(n)})^{−1} by mimicking the technique used in the proof of Proposition 1 and exploiting the diffuseness of α. Thus we have

    C_m\bigl(x^{(n)}\bigr)^{-1} \;=\; \frac{e^a\, 2^{n-k}}{\Gamma(n)\, (\sqrt{\pi})^k} \Bigl( \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\Bigl(n_j - \frac12\Bigr) \Bigr)
    \qquad\qquad \times \int_{(0,+\infty)} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{\prod_{r=1}^{k} \bigl[ (1+2u)^{n_r - 1/2} + h_m(\alpha_{m,i_r}, u) \bigr]}\, du,

where h_m(α_{m,i_r}, u) = O(α²_{m,i_r}) as m → +∞ for any u. Insert the previous expression in (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15), with

    \psi(\sigma) \;=\; -\frac{\Gamma(n)}{\int_0^{+\infty} u^{n-1} e^{-a\sqrt{2u+1}}\, (2u+1)^{k/2 - n}\, du} \int_0^{+\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x - \sigma)}\, \alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x^*_j - \sigma)\bigr)^{-n_j + 1/2}\, dt.

Now arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.
Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.
Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.
Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.
Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.
Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.
Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.
Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.
Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.
Escobar, M. D. (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.
Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.
Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.
Ferguson, T. S. (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.
Ferguson, T. S. (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.
Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.
Gradshteyn, I. S., and Ryzhik, I. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.
Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.
Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.
Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.
Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.
Ishwaran, H., and James, L. F. (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.
James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.
James, L. F. (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.
Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.
Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.
Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates, I: Density Estimates," The Annals of Statistics, 12, 351–357.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.
MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.
MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.
Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.
Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.
Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.
Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.
Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.
Pitman, J. (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.
Pitman, J. (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.
Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.
Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.
Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.
Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.
Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.
Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.
Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.
Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.
Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.

Page 6: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1283

tion of homogeneous random measures with independent incre-ments under the additional assumption of α being nonatomicHe showed that PoissonndashKingman models are a subclass ofspecies-sampling models and characterized them by deriving anexpression of the EPPF From these considerations it followsthat an NndashIG process provided that α is nonatomic is also aspecies-sampling model Note that in the NndashIG case a closed-form expression of the EPPF can be derived based on work ofPitman (2003) or James (2002) combined with the argumentsused in the proof of Proposition 3 one obtains (A1) in the Ap-pendix In this article we do not pursue the approach based onPoisson process partition calculus and refer the reader to the ar-ticle by James (2002) for extensive calculus relative to EPPFsWe just remark that the NndashIG process represents the first exam-ple (at least to our knowledge) of a tractable species-samplingmodel which cannot be derived via a stick-breaking procedure

Before proceeding it seems worthwhile to point out thatthe peculiarities of the NndashIG and Dirichlet processes comparedwith other members of these classes (and indeed within all ran-dom probability measures) is represented by the fact that theirfinite-dimensional distributions are known explicitly

33 Some Properties of the NormalizedInverse-Gaussian Process

In this section we study some properties of the NndashIG processIn particular we discuss discreteness derive some of its mo-ments and the predictive distribution and consider the distribu-tion of the number of distinct observations with a sample drawnfrom an NndashIG process

A first interesting issue is the almost-sure discreteness of anNndashIG process On the basis of the construction provided in Sec-tion 31 it follows that the distribution of the NndashIG processadmits a version that selects discrete probability measures al-most surely By a result of James (2003) that makes use of theconstruction in (7) we have that all versions of P are discreteHence we can affirm that the NndashIG process is discrete almostsurely

The moments of an NndashIG process with parameter α followimmediately from Proposition 2 Let BB1B2 isin X and setC = B1 cap B2 and P0(middot) = α(middot)a Then

E[P(B)] = P0(B)

var[P(B)] = P0(B)(1 minus P0(B))a2ea (minus2a)

and

cov(P(B1) P(B2)

) = [P0(C) minus P0(B1)P0(B2)]a2ea(minus2a)

Note that if P0 is diffuse then the mean of P follows alsofrom the fact that the NndashIG process is a species-samplingmodel These quantities are usually exploited for incorporatingreal qualitative prior knowledge into the model For instanceWalker and Damien (1998) suggested controlling the priorguess P0 as well as also the variance of the random probabil-ity measure at issue Walker Damien Laud and Smith (1999)provided a detailed discussion on the prior specification in non-parametric problems

An important goal in inferential procedures is predicting fu-ture values of a random quantity based on its past outcomesSuppose that a sequence (Xn)nge1 of exchangeable observa-tions is defined in such a way that given P the Xirsquos are iid

with distribution P Moreover let Xlowast1 Xlowast

k denote the k dis-tinct observations within the sample X(n) = (X1 Xn) withnj gt 0 terms being equal to Xlowast

j for j = 1 k and such thatsum

j nj = n The next proposition provides the predictive distrib-utions corresponding to an NndashIG process

Proposition 3 If P is an NndashIG process with diffuse parame-ter α then the predictive distributions are of the form

Pr(Xn+1 isin B

∣∣X(n)

) = w(n)0 P0(B) + w(n)

1

ksum

j=1

(nj minus 12)δXlowastj(B)

for every B isin X with

w(n)0 =

sumnr=0

(nr

)(minusa2)minusr+1(k + 1 + 2r minus 2na)

2nsumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

and

w(n)1 =

sumnr=0

(nr

)(minusa2)minusr+1(k + 2r minus 2na)

nsumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

Note that the proof of Proposition 3 exploits a general resultof Pitman (2003) later derived independently by means of adifferent technique by Pruumlnster (2002) (see also James 2002)

Thus the predictive distribution is a linear combination of the prior guess P0 and a weighted empirical distribution, with explicit expressions for the weights. Moreover, a generalized Pólya urn scheme follows immediately, which is given in (11). To highlight its distinctive features, it is useful to compare it with the prediction rule of the Dirichlet process. In the latter case, one has that X_{n+1} is new with probability ã/(ã + n) and that it coincides with X*_j with probability n_j/(ã + n) for j = 1, ..., k. Thus the probability allocated to previous observations is n/(ã + n) and does not depend on the number k of distinct observations within the sample. Moreover, the weight assigned to each X*_j depends only on the number of observations equal to X*_j, a characterizing property of the Dirichlet process that at the same time represents one of its drawbacks (see Ferguson 1974). In contrast, for the N–IG process the mechanism is quite interesting and exploits all available information. Given X^{(n)}, the (n + 1)st observation is new with probability w^{(n)}_0 and coincides with one of the previous ones with probability (n − k/2) w^{(n)}_1. In this case the balancing between new and old observations takes the number of distinct values k into account: it is easy to verify numerically that w^{(n)}_0 is decreasing as k decreases, and thus one removes more mass from the prior if fewer distinct observations (i.e., more ties) are recorded. Moreover, allocation of the probability to each X*_j is more elaborate than in the Dirichlet case: instead of increasing the weight proportionally to the number of ties, the probability assigned to X*_j is reinforced more than proportionally each time a new tie is recorded. This can be explained by the argument that the more often X*_j is observed, the stronger is the statistical evidence in its favor, and thus it is sensible to reallocate mass toward it.

A small numerical example may clarify this point. When comparing the Dirichlet and N–IG processes, it is sensible to fix their parameters such that they have the same mean and variance, as was done in Section 2. Recall that this is tantamount to requiring them to have the same prior guess P0 and imposing var(𝒟(B)) = var(P̃(B)) for every B ∈ 𝒳, which is equivalent to (ã + 1)^{−1} = a² eᵃ Γ(−2, a), where ã and a are the total masses of the parameter measures of the Dirichlet and N–IG processes.

To highlight the reinforcement mechanism, let 𝕏 = [0, 1] and P0 equal to the uniform distribution, with a = .5 and ã = 1.737. Suppose that one has observed X^{(10)}, with four observations equal to .3, three observations equal to .1, two observations equal to .6, and one observation equal to .5. Table 1 gives the probabilities assigned by the predictive distributions of the Dirichlet and N–IG processes.

Table 1. Posterior Probability Allocation of the Dirichlet and N–IG Processes

n = 10                  Dirichlet process    N–IG process
New observation         .1480                .2182
X*_1 = .3, n_1 = 4      .3408                .3420
X*_2 = .1, n_2 = 3      .2556                .2443
X*_3 = .6, n_3 = 2      .1704                .1466
X*_4 = .5, n_4 = 1      .0852                .0489
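The variance matching can be carried out numerically by solving (ã + 1)^{−1} = a² eᵃ Γ(−2, a) for ã. A minimal sketch of ours, again assuming mpmath:

```python
import mpmath as mp

a = mp.mpf("0.5")                          # N-IG total mass
v = a**2 * mp.e**a * mp.gammainc(-2, a)    # common factor var(P(B)) / [P0(B)(1 - P0(B))]
a_dirichlet = 1 / v - 1                    # matching Dirichlet total mass
print(a_dirichlet)                         # approximately 1.737, as used in Table 1
```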

From this table the reinforcement becomes apparent. In the Dirichlet case, having four ties in .3 means assigning to it four times the probability given to an observation that appeared only once. For the N–IG process things are very different: the mass assigned to the prior (i.e., to observing a new value) is higher, indicating the greater flexibility of the N–IG prior. Moreover, the probability assigned to .6 is more than two times the mass assigned to .5, the probability assigned to .1 is greater than 3/2 times the mass given to .6, and so on. Having a sample with many ties means having significant statistical evidence with reference to the associated clustering structure, and the N–IG process makes use of this information.

A further remarkable issue concerns determination of the prior distribution for the number k of distinct observations X*_i's among the n being observed. Antoniak (1974) remarked that such a distribution is induced by the distribution assigned to the random probability measure and gave an expression for the Dirichlet case. The corresponding formula for the N–IG process is given in the following proposition.

Proposition 4. The distribution of the number of distinct observations k in a sample of size n is given by

\[
p(k|n) = \binom{2n-k-1}{n-1}\, \frac{e^a (-a^2)^{n-1}}{2^{2n-k-1}\,\Gamma(k)} \sum_{r=0}^{n-1}\binom{n-1}{r}(-a^2)^{-r}\,\Gamma(k+2+2r-2n,\,a) \tag{9}
\]

for k = 1, ..., n.

Note that the expression for p(k|n) in the Dirichlet case is

\[
c_n(k)\,\tilde a^{\,k}\,\frac{\Gamma(\tilde a)}{\Gamma(\tilde a + n)}, \tag{10}
\]

where c_n(k) is the absolute value of a Stirling number of the first kind (for details, see Green and Richardson 2001). Unlike in (10), the evaluation of which can be achieved by resorting to recursive formulas for Stirling numbers, evaluation of the expression in (9) is straightforward. Table 2 compares the priors p(k|n) corresponding to the Dirichlet and N–IG processes in the same setup as Table 1.

In this toy example, note that the distribution p(·|n) induced by the N–IG prior is flatter than that corresponding to the Dirichlet process. This fact plays a role in the context of mixture modeling, to be examined in the next section.
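Expression (9) is indeed straightforward to evaluate, up to the same numerical care demanded by the alternating sum. A sketch of ours, under the same mpmath assumption as before:

```python
import mpmath as mp

def nig_prior_k(n, a, dps=60):
    """p(k|n), k = 1, ..., n, of Proposition 4 for the N-IG process with total mass a."""
    mp.mp.dps = dps
    out = []
    for k in range(1, n + 1):
        s = mp.fsum(mp.binomial(n - 1, r) * (-a**2)**(-r) *
                    mp.gammainc(k + 2 + 2*r - 2*n, a) for r in range(n))
        out.append(mp.binomial(2*n - k - 1, n - 1) * mp.e**a * (-a**2)**(n - 1)
                   / (2**(2*n - k - 1) * mp.gamma(k)) * s)
    return out

p = nig_prior_k(10, mp.mpf("0.5"))
print([mp.nstr(q, 3) for q in p])   # should match the N-IG row of Table 2
print(mp.nstr(mp.fsum(p), 6))       # the probabilities sum to 1
```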

4. THE MIXTURE OF NORMALIZED INVERSE–GAUSSIAN PROCESS

In light of the previous results, it is clear that the reinforcement mechanism makes the N–IG prior an interesting alternative to the Dirichlet process. Nowadays the Dirichlet process is seldom used directly for assessing inferences; indeed, it is commonly exploited as the crucial building block in a hierarchical mixture model of type (1). Here we aim to study a mixture of N–IG process model, as set in (1), and suitable semiparametric variations of it. The mixture of N–IG process represents a particular case of normalized random measures driven by increasing additive processes (see Nieto-Barajas et al. 2004) and also of species-sampling mixture models (see Ishwaran and James 2003). With respect to both classes, the mixture of N–IG process stands out for its tractability.

To exploit a mixture of N–IG process for inferential purposes, it is essential to derive an appropriate sampling scheme. In such a framework, knowledge of the predictive distributions determined in Proposition 3 is crucial. Indeed, most of the simulation algorithms developed in Bayesian nonparametrics rely on variations of the Blackwell–MacQueen Pólya urn scheme and on the development of appropriate Gibbs sampling procedures. (See Escobar 1994; MacEachern 1994; Escobar and West 1995 for the Dirichlet case, and Pitman 1996; Ishwaran and James 2001 for extensions to stick-breaking priors.) In the case of an N–IG process, it follows that the joint distribution of X^{(n)} can be characterized by the following generalized Pólya urn scheme. Let Z_1, ..., Z_n be an iid sample from P0. Then a sample X^{(n)} from an N–IG process can be generated as follows. Set X_1 = Z_1, and for i = 2, ..., n,

\[
\big(X_i \,\big|\, X^{(i-1)}\big) =
\begin{cases}
Z_i & \text{with probability } w^{(i-1)}_0,\\[4pt]
X^*_{i,j} & \text{with probability } \big(n_j - \tfrac{1}{2}\big)\, w^{(i-1)}_1, \quad j = 1, \ldots, k_i,
\end{cases} \tag{11}
\]

Table 2. Prior Probabilities for k Corresponding to the Dirichlet and N–IG Processes

n = 10                 k = 1   k = 2   k = 3   k = 4   k = 5   k = 6   k = 7   k = 8   k = 9    k = 10
Dirichlet, ã = 1.737   .0274   .1350   .2670   .2870   .1850   .0756   .0196   .0031   .00029   .00001
N–IG, a = .5           .0523   .1286   .1813   .1936   .1716   .1296   .0824   .042    .0155    .0031


Figure 3. Prior Probabilities for k (n = 100) Corresponding to the N–IG Process and the Dirichlet Process for Different Values of a (ã). These values were chosen to match the modes at 6, 14, and 28 [Dirichlet (ã = 1.2, 4.2, 12.6; left ã = 1.2 to right ã = 12.6); N–IG (a = .1, 1, 5; left a = .1 to right a = 5)].

where k_i represents the number of distinct observations, denoted by X*_{i,1}, ..., X*_{i,k_i}, in X^{(i−1)}, and w^{(i−1)}_0 and w^{(i−1)}_1 are as given in Proposition 3.

Moreover, the result in Proposition 4 can be interpreted as providing the prior distribution of the number k of components in the mixture for a given sample size n. As in the Dirichlet case, a smaller total mass a yields a p(·|n) more concentrated on smaller values of k. This can be explained by the fact that a smaller a gives rise to a smaller w^{(n)}_0; that is, it generates new data with lower probability. However, the p(·|n) induced by the N–IG prior is apparently less informative than that corresponding to the Dirichlet process prior and thus is more robust with respect to changes in a. A qualitative illustration is given in Figure 3, where the distribution of k given n = 100 observations is depicted for the N–IG and Dirichlet processes. We connected the probability points by straight lines only for visual simplification. Note that the mode of the distribution of k corresponding to the N–IG process never has probability larger than .07.

Recently, new algorithms have been proposed for dealing with mixtures. Extending the work of Brunner, Chan, James, and Lo (2001) on an iid sequential importance sampling procedure for fitting MDPs, Ishwaran and James (2003) proposed a generalized weighted Chinese restaurant algorithm that covers species-sampling mixture models. They formally derived the posterior distribution of a species-sampling mixture. To draw approximate samples from this distribution, they devised a computational scheme that requires knowledge of the conditional distribution of the species-sampling model P̃ given the missing values X_1, ..., X_n. When feasible, such an algorithm has the merit of reducing the Monte Carlo error (see Ishwaran and James 2003 for details and further references). In the case of an N–IG process, sampling from its posterior law is not straightforward, because it is characterized in terms of a latent variable (see James 2002). Further investigations are needed to implement these interesting sampling procedures efficiently in the N–IG case, and more generally for normalized RMI. Problems of the same type arise if one is willing to implement the scheme proposed by Nieto-Barajas et al. (2004).

To carry out a more detailed comparison between Dirichlet and N–IG mixtures, we next consider two illustrative examples.

4.1 Simulated Data

Here we consider simulated data sampled from uniform mixtures of normal distributions with different numbers of components, and we compare the behavior of the MDP and the mixture of N–IG process for different choices of priors on the number of components. We then evaluate the performance in terms of posterior probabilities on the correct number of components and in terms of log-Bayes factors for testing k = k* against k ≠ k*, where k* represents the correct number of components.

We first analyze a dataset Y^{(100)} simulated from a mixture of three normal distributions with means −4, 0, and 8, variance 1, and corresponding weights .33, .34, and .33. Let N(·|m, v) denote the normal distribution with mean m and variance v > 0. We consider the following hierarchical model to represent such data:

\[
(Y_i|X_i) \stackrel{ind}{\sim} N(Y_i|X_i, 1), \quad i = 1, \ldots, 100,
\]
\[
(X_i|\tilde P) \stackrel{iid}{\sim} \tilde P,
\]
\[
\tilde P \sim \mathscr{P},
\]

where 𝒫 refers to either an N–IG process or a Dirichlet process. Both are centered at the same prior guess N(·|Ȳ, t²), where t is the data range; that is, t = max Y_i − min Y_i. To fix the total masses ã and a, we no longer consider the issue of matching variances. For comparative purposes, in this mixture-modeling framework it seems more plausible to start with similar priors for the number of components k. Hence ã and a are such that the mode of p(·|n) is the same in both the Dirichlet and N–IG cases. We choose to set the mode equal to k = 8 in both cases, yielding ã = 2 and a = 2.365. The corresponding distributions are plotted in Figure 4.

Figure 4. Prior Probabilities for k (n = 100) Corresponding to the Dirichlet and N–IG Processes [Dirichlet (ã = 2); N–IG (a = 2.365)].
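The mode matching can be checked numerically: compute p(·|100) under both priors and locate the argmax. A sketch of ours, reusing nig_prior_k from Section 3; the Dirichlet side builds the signless Stirling numbers of the first kind by the standard recursion c_{n+1}(k) = n c_n(k) + c_n(k−1).

```python
import mpmath as mp

n = 100

# N-IG: mode of p(k|100) with a = 2.365
p_nig = nig_prior_k(n, mp.mpf("2.365"))
print(1 + max(range(n), key=lambda i: p_nig[i]))   # expected mode: 8

# Dirichlet (10): p(k|n) = c_n(k) a^k Gamma(a) / Gamma(a + n)
a = mp.mpf(2)
row = [mp.mpf(1)]                                  # c_1(1) = 1
for m in range(1, n):                              # c_{m+1}(k) = m c_m(k) + c_m(k-1)
    row = [(m * row[j] if j < len(row) else mp.mpf(0)) +
           (row[j - 1] if j >= 1 else mp.mpf(0)) for j in range(len(row) + 1)]
p_dir = [row[j] * a**(j + 1) * mp.gamma(a) / mp.gamma(a + n) for j in range(n)]
print(1 + max(range(n), key=lambda i: p_dir[i]))   # expected mode: 8
```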

In this setup we draw a sample from the posterior distribution ℒ(X^{(100)}|Y^{(100)}) by iteratively drawing samples from the distributions of (X_i|X^{(100)}_{−i}, Y^{(100)}), for i = 1, ..., 100, given by

\[
P\big(X_i \in \cdot \,\big|\, X^{(100)}_{-i}, Y^{(100)}\big) = q^*_{i0}\, N\Big(X_i \in \cdot \,\Big|\, \frac{t^2 Y_i + \bar Y}{1 + t^2},\ \frac{t^2}{t^2 + 1}\Big) + \sum_{j=1}^{k_i} q^*_{ij}\, \delta_{X^*_j}(\cdot), \tag{12}
\]

where X^{(100)}_{−i} is the vector X^{(100)} with the ith coordinate deleted and k_i refers to the number of X*_j's in X^{(100)}_{−i}. The mixing proportions are given by

\[
q^*_{i0} \propto w^{(99)}_{i0}\, N(Y_i|\bar Y,\, t^2 + 1)
\]

and

\[
q^*_{ij} \propto \big(n_j - \tfrac{1}{2}\big)\, w^{(99)}_{i1}\, N(Y_i|X^*_j,\, 1),
\]

subject to the constraint Σ_{j=0}^{k_i} q*_{ij} = 1. The values for w^{(99)}_{i0} and w^{(99)}_{i1} are given as in Proposition 3, when applied for determining the "predictive" distribution of X_i given X^{(100)}_{−i}; they are computed with a precision of 14 decimal places. Note that, in general, w^{(99)}_{i0} and w^{(99)}_{i1} depend only on a, n, and k_i, and not on the different partitions ξ of n = n_{ξ1} + ··· + n_{ξk}. This implies that a table containing the values of w^{(99)}_{i0} and w^{(99)}_{i1} for given a, n, and k = 1, ..., 99 can be generated in advance for use in the Pólya urn Gibbs sampler. Further details are available on request from the authors.

To obtain our estimates, we resorted to the Pólya urn Gibbs sampler, such as the one set forth by Escobar and West (1995). As pointed out by Ishwaran and James (2001), such a generalized Pólya urn characterization also can be considered for the N–IG case, the only ingredients being the weights of the predictive distribution determined in Proposition 3. All of the following results were obtained by implementing the MCMC algorithm using 10,000 iterations after 2,000 burn-in sweeps. Figure 5 depicts the posterior density estimates and the true density that generated the data.

The predictive density corresponding to the mixture of N–IG process clearly represents a more accurate estimate. Table 3 provides the prior and posterior probabilities of the number of components. The prior probabilities are computed according to Proposition 4, whereas the posterior probabilities follow from the MCMC output. One may note that the posterior mass assigned to k = 3 by the N–IG process is significantly higher than that corresponding to the MDP.

As far as the log-Bayes factors are concerned, one obtains BF_{D_α} = 5.46 and BF_{N–IG_α} = 4.97, thus favoring the MDP. The reason for this outcome seems to be the fact that the prior probability on the correct number of components is much lower for the MDP.

It is also interesting to look at the case in which the mode of p(k|n) corresponds to the correct number of components, k* = 3. In such a case the prior probability on k* = 3 is much higher for the MDP, being .285 against .0567 for the mixture of N–IG process. This results in a posterior probability on k* = 3 of .9168 for the MDP and .86007 for the mixture of N–IG process, whereas the log-Bayes factors favor the N–IG mixture, being BF_{D_α} = 3.32 and BF_{N–IG_α} = 4.63. This again seems to be due to the fact that the prior probability of one process on the correct number of components is much lower than that corresponding to the other.


Figure 5. Posterior Density Estimates for the Mixture of Three Normals With Means −4, 0, 8, Variance 1, and Mixing Proportions .33, .34, and .33 [Dirichlet (ã = 2); N–IG (a = 2.365)].

The next dataset, Y^{(120)}, that we deal with in this framework is drawn from a uniform mixture of six normal distributions with means −10, −6, −2, 2, 6, and 10 and variance .7. Both the Dirichlet and N–IG processes are again centered at the same prior guess N(·|Ȳ, t²), where t represents the data range. Set ã = .5 and a = .1, such that p(k|n) has mode at k = 3 for both the MDP and the mixture of N–IG process. Such an example is meant to shed some light on the ability of the mixture of N–IG process to detect clusters when starting from a prior with mode at a small number of clusters and data coming from a mixture with more components. In this case the mixture of N–IG process performs better than the MDP, both in terms of posterior probabilities, as shown in Table 4, and in terms of log-Bayes factors, because BF_{D_α} = 3.13, whereas BF_{N–IG_α} = 3.78.

If one considers data drawn from mixtures with more components, then the mixture of N–IG process does systematically better in terms of posterior probabilities, but the phenomenon of "relatively low prior probability on the correct number of components" of the MDP reappears, leading it to be favored in terms of Bayes factors. For instance, we have taken 120 data from a uniform mixture of eight normals with means −14, −10, −6, −2, 2, 6, 10, and 14 and variance .7, and set ã = .5 and a = .1, so that p(k|120) has mode at k = 3 for both priors. This yields a priori probabilities on k* = 8 of .00568 and .0489, and a posteriori probabilities of .0937 and .4338, for the MDP and the mixture of N–IG process. In contrast, the log-Bayes factors are BF_{D_α} = 2.90 and BF_{N–IG_α} = 2.73. One can alternatively match the a priori probability on the correct number of components such that p(8|120) = .04765 for both priors. This happens if we set ã = .86, which results in the MDP having mode at k = 4, closer to the correct number k* = 8, whereas the mixture of N–IG process remains unchanged. This results in a posterior probability on k* = 8 for the MDP of .1134 and, having removed the influence of the low prior probability p(8|120), in a log-Bayes factor of .94.

4.2 Galaxy Data

In this section we reanalyze the galaxy dataset popularized by Roeder (1990). These data, consisting of the relative velocities (in km/sec) of 82 galaxies, have been widely studied within the framework of Bayesian statistics (see, e.g., Roeder and Wasserman 1997; Green and Richardson 2001; Petrone and Veronese 2002). For comparison purposes, we use the following semiparametric setting, also analyzed by Escobar and West (1995):

\[
(Y_i|m_i, V_i) \stackrel{ind}{\sim} N(Y_i|m_i, V_i), \quad i = 1, \ldots, 82,
\]
\[
(m_i, V_i|\tilde P) \stackrel{iid}{\sim} \tilde P,
\]
\[
\tilde P \sim \mathscr{P},
\]

Table 3. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|100) Has Mode at k = 8

n = 100                          k ≤ 2     k = 3     k = 4     k = 5     k = 6     k = 7     k = 8     k ≥ 9
Dirichlet, ã = 2      Prior      .00225    .00997    .03057    .06692    .11198    .14974    .16501    .46356
                      Posterior            .7038     .2509     .0406     .0046     .0001
N–IG, a = 2.365       Prior      .02404    .0311     .0419     .04937    .05386    .05611    .05677    .68685
                      Posterior            .8223     .1612     .0160     .0005


Table 4. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|120) Has Mode at k = 3

n = 120                        k = 1     k = 2     k = 3     k = 4     k = 5     k = 6     k = 7     k = 8      k ≥ 9
Dirichlet, ã = .5    Prior     .08099    .21706    .27433    .21954    .12583    .05532    .01949    .005678    .00176
                     Posterior                               .0148     .3455     .5722     .064      .0033      .0002
N–IG, a = .1         Prior     .04389    .05096    .05169    .05142    .05082    .04997    .0489     .04765     .6047
                     Posterior                                         .0085     .6981     .2395     .0479      .006

where, as before, 𝒫 represents either the N–IG process or the Dirichlet process. Again, the prior guess P0 is the same for both nonparametric priors and is characterized by

\[
P_0(A) = \int_A N(x|\mu, \tau v^{-1})\, Ga(v|s/2, S/2)\, dx\, dv,
\]

where A ∈ B(ℝ × ℝ+) and Ga(·|c, d) is the density corresponding to a gamma distribution with mean c/d. Similar to Escobar and West (1995), we assume a further hierarchy for μ and τ, namely μ ~ N(·|g, h) and τ^{−1} ~ Ga(·|w, W). We choose the hyperparameters for this illustration to fit those used by Escobar and West (1995), namely g = 0, h = .01, w = 1, and W = 100. For comparative purposes, we again choose a to achieve the mode match: if the mode of p(·|82) is at k = 5, as it is for the Dirichlet process with ã = 1, then a = .829. From Table 5 it is apparent that the N–IG process provides a noninformative prior for k compared with the distribution induced by the Dirichlet process.

The details of the Pólya urn Gibbs sampler are as provided by Escobar and West (1995), with the only difference being the replacement of the weights with those corresponding to the N–IG process given in Proposition 3. Figure 6 shows the posterior density estimates based on 10,000 iterations considered after a burn-in period of 2,000 sweeps. Table 5 displays the prior and the posterior probabilities of the number of components in the mixture.

Some comments on the results in Table 5 are in order. Escobar and West (1995) extensively discussed the issue of learning about the total mass ã. This is motivated by the sensitivity of posterior inferences to the choice of the parameter ã. Hence they randomized ã and specified a prior distribution for it. It seems worth noting that the posterior distribution of the number of components for the N–IG process with a fixed a essentially coincides with the corresponding distribution arising from an MDP with random ã, given in table 6 of Escobar and West (1995). Some explanations of this phenomenon might be that the prior p(k|n) is essentially noninformative in a broad neighborhood of its mode, thus mitigating the influence of a on posterior inferences, and that the aggressiveness in reducing/detecting clusters of the N–IG mixture shown in the previous example is also a plausible reason for such a gain in "parsimony." However, for a mixture of N–IG process it may be of interest to have a random a. To achieve this, a Metropolis step must be added in the algorithm. In principle this is straightforward. The only drawback would be an increase in the computational burden, because computing the weights for the N–IG process is, after all, not as quick as computing those corresponding to the Dirichlet process.

5. THE MEAN OF A NORMALIZED INVERSE–GAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for inference with continuous data is represented by histogram smoothing. Such a problem can be handled by exploiting the so-called "filtered-variate" random probability measures, as defined by Dickey and Jiang (1998). These quantities essentially coincide with suitable means of random probability measures. Here we focus attention on means of N–IG processes. After the recent breakthrough achieved by Cifarelli and Regazzini (1990), this has become a very active area of research in Bayesian nonparametrics (see, e.g., Diaconis and Kemperman 1996; Regazzini, Guglielmi, and Di Nunno 2002 for the Dirichlet case, and Epifani, Lijoi, and Prünster 2003; Hjort 2003; James 2002 for results beyond the Dirichlet case). In particular, Regazzini et al. (2003) dealt with means of normalized priors, providing conditions for their existence and their exact prior and approximate posterior distributions. Here we specialize their general results to the N–IG process and derive the exact posterior density.

Letting 𝕏 = ℝ and 𝒳 = B(ℝ), we study the distribution of the mean ∫_ℝ x P̃(dx). The first issue to face is its finiteness. A necessary and sufficient condition for this to hold can be derived from proposition 1 of Regazzini et al. (2003) and is given by

\[
\int_{\mathbb R} \sqrt{2\lambda |x| + 1}\;\alpha(dx) < +\infty \quad \text{for every } \lambda > 0.
\]

As far as the problem of the distribution of a mean is concerned, proposition 2 of Regazzini et al. (2003) and some calculations lead to an expression for the prior distribution function of the mean:

\[
F(\sigma) = \frac{1}{2} - \frac{e^a}{\pi} \int_0^{+\infty} \frac{1}{t}\, \exp\Big\{-\int_{\mathbb R} \big(1 + 4t^2(x-\sigma)^2\big)^{1/4} \cos\Big[\frac{1}{2}\arctan\big(2t(x-\sigma)\big)\Big]\alpha(dx)\Big\}
\]
\[
\times \sin\Big\{\int_{\mathbb R} \big(1 + 4t^2(x-\sigma)^2\big)^{1/4} \sin\Big[\frac{1}{2}\arctan\big(2t(x-\sigma)\big)\Big]\alpha(dx)\Big\}\, dt \tag{13}
\]

for any σ ∈ ℝ.
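Formula (13) lends itself to direct numerical quadrature. The sketch below is ours: it assumes mpmath, takes α = a × Uniform[0, 1] as in the toy example of Section 3, and truncates the outer integral at a finite upper limit, an approximation justified by the exponential decay of the integrand; it is slow but serviceable.

```python
import mpmath as mp

a = mp.mpf("0.5")   # alpha = a * Uniform[0, 1], chosen here only for illustration

def F_mean(sigma, T=2000):
    """Prior cdf (13) of the mean of an N-IG process; outer integral truncated at T."""
    def inner(t, trig):
        f = lambda x: (1 + 4 * t**2 * (x - sigma)**2) ** mp.mpf("0.25") \
                      * trig(mp.atan(2 * t * (x - sigma)) / 2)
        return a * mp.quad(f, [0, 1])          # integration against alpha(dx) = a dx
    g = lambda t: mp.exp(-inner(t, mp.cos)) * mp.sin(inner(t, mp.sin)) / t
    return mp.mpf("0.5") - mp.e**a / mp.pi * mp.quad(g, [0, 1, 10, 100, T])

print(F_mean(mp.mpf("0.5")))   # 0.5, because alpha is symmetric around 1/2
```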

Table 5. Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes in the Galaxy Data Example

n = 82                         k ≤ 3     k = 4     k = 5     k = 6     k = 7     k = 8     k = 9     k = 10    k ≥ 11
Dirichlet, ã = 1     Prior     .2140     .2060     .214      .169      .106      .0548     .0237     .0088     .0038
                     Posterior           .0465     .125      .2524     .2484     .2011     .08       .019      .0276
N–IG, a = .829       Prior     .1309     .0617     .0627     .0621     .0607     .0586     .0562     .0534     .4537
                     Posterior .003      .0754     .168      .2301     .2338     .1225     .0941     .0352     .0379


Figure 6. Posterior Density Estimates for the Galaxy Dataset [N–IG (a = .829); Dirichlet (ã = 2)].

Before providing the posterior density of a mean of an N–IG process, it is useful to introduce the Liouville–Weyl fractional integral, defined as

\[
I^n_{c+} h(\sigma) = \int_c^{\sigma} \frac{(\sigma - u)^{n-1}}{(n-1)!}\, h(u)\, du
\]

for n ≥ 1, and set equal to the identity operator for n = 0, for any c < σ. Moreover, let Re z and Im z denote the real and imaginary parts of z ∈ ℂ.
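For completeness, the operator I^n_{c+} is itself a one-line quadrature; a sketch of ours, again assuming mpmath, which is what (14) and (15) below require once ψ is available:

```python
import mpmath as mp

def liouville_weyl(h, n, c, sigma):
    """I^n_{c+} h(sigma): the identity for n = 0, otherwise
    int_c^sigma (sigma - u)^(n-1) / (n-1)! * h(u) du."""
    if n == 0:
        return h(sigma)
    return mp.quad(lambda u: (sigma - u)**(n - 1) * h(u), [c, sigma]) / mp.factorial(n - 1)
```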

Proposition 5. If P̃ is an N–IG process with diffuse parameter α and −∞ ≤ c = inf supp(α), then the posterior density of the mean is of the form

\[
\rho_{x^{(n)}}(\sigma) = \frac{1}{\pi}\, I^{n-1}_{c+} \operatorname{Im} \psi(\sigma) \quad \text{if } n \text{ is even} \tag{14}
\]

and

\[
\rho_{x^{(n)}}(\sigma) = -\frac{1}{\pi}\, I^{n-1}_{c+} \operatorname{Re} \psi(\sigma) \quad \text{if } n \text{ is odd}, \tag{15}
\]

with

\[
\psi(\sigma) = \frac{(n-1)!\, 2^{n-1} \int_0^{\infty} t^{n-1}\, e^{-\int_{\mathbb R} \sqrt{1 - 2it(x-\sigma)}\,\alpha(dx)} \prod_{j=1}^{k} \big(1 - 2it(x^*_j - \sigma)\big)^{-n_j + 1/2}\, dt}{a^{2n-(2+k)} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\,a)}.
\]

6. CONCLUDING REMARKS

In this article we have studied some of the statistical implications of the use of the N–IG process as an alternative to the Dirichlet process. Both priors are almost surely discrete, allow for explicit derivation of their finite-dimensional distributions, and lead to tractable expressions for relevant quantities. Their natural use is in the context of mixture modeling, where they exhibit remarkably different behaviors. Indeed, it has been shown that the prior distribution on the number of components induced by the N–IG process is wider than that induced by the Dirichlet process. This seems to mitigate the influence of the choice of the total mass a of the parameter measure. Moreover, the mixture of N–IG process seems to be more aggressive in reducing/detecting clusters on the basis of ties present in the data. Such conclusions are, however, preliminary and still need to be validated by more extensive applications, which will constitute the object of our future research.

APPENDIX: PROOFS

A.1 Proof of Proposition 1

Having (3) at hand, the computation of (4) is as follows. One operates the transformation W_i = V_i (Σ_{i=1}^n V_i)^{−1} for i = 1, ..., n − 1 and W_n = Σ_{j=1}^n V_j. Some algebra and formula 3.471.12 of Gradshteyn and Ryzhik (2000) lead to the desired result.

A.2 Proof of Proposition 2

In proving this result, we do not use the distribution of an N–IG random variable directly. The key point is exploiting the independence of the variables (V_1, ..., V_n) used for defining (W_1, ..., W_n). Set V = Σ_{j=1}^n V_j and V_{−i} = Σ_{j≠i} V_j, and recall that the moment-generating function of an IG(α, γ)-distributed random variable is given by E[e^{−λX}] = e^{−α(√(2λ+γ²)−γ)}. As far as the mean is concerned, for any i = 1, ..., n we have

\[
E[W_i] = E[V_i V^{-1}] = \int_0^{+\infty} E\big[e^{-uV_{-i}}\big]\, E\big[e^{-uV_i} V_i\big]\, du = \int_0^{+\infty} E\big[e^{-uV_{-i}}\big]\, E\Big[-\frac{d}{du}\, e^{-uV_i}\Big]\, du
\]
\[
= -\int_0^{+\infty} e^{-(a-\alpha_i)(\sqrt{2u+1}-1)}\, \frac{d}{du}\, e^{-\alpha_i(\sqrt{2u+1}-1)}\, du = \alpha_i \int_0^{+\infty} (2u+1)^{-1/2}\, e^{-a(\sqrt{2u+1}-1)}\, du
\]
\[
= \frac{\alpha_i}{a} \int_0^{+\infty} -\frac{d}{du}\, e^{-a(\sqrt{2u+1}-1)}\, du = \frac{\alpha_i}{a} = p_i,
\]

having applied Fubini's theorem and theorem 16.8 of Billingsley (1995). The variance and covariance can be deduced by similar arguments, combined with integration by parts and some tedious algebra.
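The finite-dimensional construction behind this proof is easy to exercise by simulation. A sketch of ours: numpy's wald(mean, scale) sampler coincides with the IG(α, 1) law used here when mean = α and scale = α², a mapping obtained by matching the Laplace transform E[e^{−λX}] = e^{−α(√(2λ+1)−1)}; the normalized vector can then be checked against E[W_i] = α_i/a.

```python
import numpy as np

rng = np.random.default_rng(42)

# alpha_i = alpha(B_i) over a partition (B_1, B_2, B_3); total mass a = 0.5
alpha = np.array([0.1, 0.15, 0.25])
a = alpha.sum()

# V_i ~ IG(alpha_i, 1) via wald(mean = alpha_i, scale = alpha_i**2)
V = rng.wald(alpha, alpha**2, size=(200_000, 3))
W = V / V.sum(axis=1, keepdims=True)   # (W_1, W_2, W_3): N-IG probabilities

print(W.mean(axis=0))                  # should approach alpha / a = [0.2, 0.3, 0.5]
```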

A.3 Proof of Proposition 3

From Pitman (2003) (see also James 2002; Prünster 2002), we can deduce that the predictive distribution associated with an N–IG process with diffuse parameter α is of the form

\[
P\big(X_{n+1}\in\cdot \,\big|\, X^{(n)}\big) = w^{(n)}\, \frac{\alpha(\cdot)}{a} + \frac{1}{n} \sum_{j=1}^{k} w^{(n)}_j\, \delta_{X^*_j}(\cdot),
\]

with weights equal to

\[
w^{(n)} = \frac{a \int_{\mathbb R^+} u^n e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u)\cdots\mu_{n_k}(u)\, \mu_1(u)\, du}{n \int_{\mathbb R^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u)\cdots\mu_{n_k}(u)\, du}
\]

and

\[
w^{(n)}_j = \frac{\int_{\mathbb R^+} u^n e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u)\cdots\mu_{n_j+1}(u)\cdots\mu_{n_k}(u)\, du}{\int_{\mathbb R^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u)\cdots\mu_{n_k}(u)\, du},
\]

where μ_n(u) = ∫_{ℝ+} v^n e^{−uv} ν(dv) for any positive u and n = 1, 2, ..., and ν(dv) = (2πv³)^{−1/2} e^{−v/2} dv. We can easily verify that μ_n(u) = 2^{n−1} Γ(n − 1/2)(√π [2u + 1]^{n−1/2})^{−1} and, moreover, that

\[
\int_{\mathbb R^+} u^{n-1} e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u)\cdots\mu_{n_k}(u)\, du = \frac{2^{n-k} e^a}{(\sqrt{\pi})^k} \Big(\prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \int_0^{+\infty} \frac{u^{n-1} e^{-a\sqrt{2u+1}}}{[2u+1]^{n-k/2}}\, du.
\]

By the change of variable √(2u + 1) = y, the latter expression is equal to

\[
\frac{2^{n-k} e^a}{(\sqrt{\pi})^k} \Big(\prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \frac{1}{2^{n-1}} \int_1^{+\infty} \frac{(y^2-1)^{n-1}\, e^{-ay}}{y^{2n-k-1}}\, dy
\]
\[
= \frac{e^a}{2^{k-1}(\sqrt{\pi})^k} \Big(\prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} \int_1^{+\infty} \frac{e^{-ay}}{y^{2n-k-2r-1}}\, dy
\]
\[
= \frac{e^a}{2^{k-1}(\sqrt{\pi})^k} \Big(\prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r}\, a^{2n-2-k-2r}\, \Gamma(k + 2r - 2n + 2,\, a).
\]

Now the result follows by rearranging the terms appropriately, combined with some algebra.

A.4 Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations (X*_1, ..., X*_k) and the random partition coincides with

\[
\alpha(dx_1)\cdots\alpha(dx_k)\, \frac{e^a (-a^2)^{n-1}}{a^k\, 2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \Big(\prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a).
\]

The conditional distribution of the vector (k, n_1, ..., n_k) given n can be obtained by marginalizing with respect to (X*_1, ..., X*_k), thus yielding

\[
\frac{e^a (-a^2)^{n-1}}{2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \Big(\prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n,\, a). \tag{A.1}
\]

To determine p(k|n), one needs to compute

\[
\sum_{(*)} \prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big),
\]

where (*) means that the sum extends over all partitions of the set of integers {1, ..., n} into k components:

\[
\sum_{(*)} \prod_{j=1}^{k} \Gamma\big(n_j - \tfrac{1}{2}\big) = \pi^{k/2} \sum_{(*)} \prod_{j=1}^{k} \Big(\frac{1}{2}\Big)_{n_j - 1} = \binom{2n-k-1}{n-1} \frac{\Gamma(n)}{\Gamma(k)\, 2^{2k-2n}},
\]

where, for any positive b, (b)_n = Γ(b + n)/Γ(b), and the result follows.

A.5 Proof of Proposition 5

Let {𝒫_m = (A_{m,1}, ..., A_{m,k_m}); m ≥ 1} be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize α by setting α_m = Σ_{i=1}^{k_m} α_{m,i} δ_{s_{m,i}}, where α_{m,i} = α(A_{m,i}) and s_{m,i} is a point in A_{m,i}. Hence, whenever the jth element in the sample, X_j, belongs to A_{m,i_j}, it is as if we observed s_{m,i_j}. Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to express the posterior density function for the discretized mean as (14) and (15), with

\[
\psi_m(\sigma) = -C_m\big(x^{(n)}\big)\, \frac{e^a\, 2^{n-k}}{(\sqrt{\pi})^k} \Big(\prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \int_0^{+\infty} t^{n-1}\, e^{-\sum_{j=1}^{k_m} \sqrt{1 - 2it(s_j - \sigma)}\,\alpha_{m,j}}
\]
\[
\times \Big(\prod_{r=1}^{k} \Big[\big(1 - 2it(s_{i_r} - \sigma)\big)^{n_r - 1/2} + g_m\big(\alpha_{m,i_r}, s_{m,i_r}, t\big)\Big]\Big)^{-1} dt, \tag{A.2}
\]

where C_m(x^{(n)})^{−1} is the marginal distribution of the discretized sample and g_m(α_{m,i_r}, s_{m,i_r}, t) = O(α²_{m,i_r}) as m → +∞ for any t. On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions {𝒫_m; m ≥ 1}. But in the particular N–IG case we are able to compute C_m(x^{(n)})^{−1}, by mimicking the technique used in the proof of Proposition 1 and exploiting the diffuseness of α. Thus we have

\[
C_m\big(x^{(n)}\big)^{-1} = \frac{e^a\, 2^{n-k}}{(n-1)!\, (\sqrt{\pi})^k} \Big(\prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\big(n_j - \tfrac{1}{2}\big)\Big) \int_{(0,+\infty)} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{\prod_{r=1}^{k} \big[[1+2u]^{n_r - 1/2} + h_m(\alpha_{m,i_r}, u)\big]}\, du,
\]

where h_m(α_{m,i_r}, u) = O(α²_{m,i_r}) as m → +∞ for any u. Insert the previous expression into (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15), with

\[
\psi(\sigma) = -\frac{(n-1)! \int_0^{\infty} t^{n-1}\, e^{-\int_{\mathbb R} \sqrt{1 - 2it(x-\sigma)}\,\alpha(dx)} \prod_{j=1}^{k} \big(1 - 2it(x^*_j - \sigma)\big)^{-n_j + 1/2}\, dt}{\int_0^{+\infty} u^{n-1}\, e^{-a\sqrt{2u+1}}\, \big(\sqrt{2u+1}\,\big)^{-2n+k}\, du}.
\]

Now arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.
Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.
Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.
Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.
Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.
Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.
Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.
Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.
Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.
——— (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.
Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.
Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.
——— (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.
——— (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.
Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.
Gradshteyn, I. S., and Ryzhik, J. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.
Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.
Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.
Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.
Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.
——— (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.
James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.
——— (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.
Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.
Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.
Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates: I. Density Estimates," The Annals of Statistics, 12, 351–357.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.
MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.
MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.
Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.
Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.
Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.
Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.
Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.
——— (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.
——— (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.
Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.
Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.
Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.
Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.
Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.
Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.
Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.
Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.
Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.

Page 7: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

1284 Journal of the American Statistical Association December 2005

Table 1 Posterior Probability Allocation of the Dirichletand NndashIG Processes

n = 10 Dirichlet process NndashIG process

New observation 1480 2182

X1 = 3 n1 = 4 3408 3420

X2 = 1 n2 = 3 2556 2443

X3 = 6 n3 = 2 1704 1466

X4 = 5 n4 = 1 0852 0489

to requiring them to have the same prior guess P0 and impos-ing var(Dα(B)) = var(P(B)) for every B isin X which is equiv-alent to (a + 1)minus1 = a2ea(minus2a) where a and a are the totalmasses of the parameter measures of the Dirichlet and NndashIGprocesses

To highlight the reinforcement mechanism let X = [01]P0 equal to the uniform distribution with a = 5 and a = 1737Suppose that one has observed X(10) with four observationsequal to 3 three observations equal to 1 two observationsequal to 6 and one observation equal to 5 Table 1 givesthe probabilities assigned by the predictive distribution of theDirichlet and NndashIG processes

From this table the reinforcement becomes apparent In theDirichlet case having four ties in 3 means assigning to itfour times the probability given to an observation that appearedonce For the NndashIG process things are very different the massassigned to the prior (ie to observing a new value) is higherindicating the greater flexibility of the NndashIG prior Moreoverthe probability assigned to 6 is more than two times the massassigned to 5 the probability assigned to 1 is greater than 32times the mass given to 6 and so on Having a sample withmany ties means having significant statistical evidence withreference to the associated clustering structure and the NndashIGprocess makes use of this information

A further remarkable issue concerns determination of theprior distribution for the number k of distinct observations Xlowast

i rsquosamong the n being observed Antoniak (1974) remarked thatsuch a distribution is induced by the distribution assigned tothe random probability measure and gave an expression for theDirichlet case The corresponding formula for the NndashIG processis given in the following proposition

Proposition 4 The distribution of the number of distinct ob-servations k in a sample of size n is given by

p(k|n) =(

2n minus k minus 1n minus 1

)ea(minusa2)nminus1

22nminuskminus1(k)

timesnminus1sum

r=0

(n minus 1

r

)

(minusa2)minusr

times (k + 2 + 2r minus 2n a) (9)

for k = 1 n

Note that the expression for p(k|n) in the Dirichlet case is

cn(k)ak (a)

(a + n) (10)

where cn(k) is the absolute value of a Stirling number of thefirst kind (for details see Green and Richardson 2001) Unlikein (10) the evaluation of which can be achieved by resorting torecursive formulas for Stirling numbers evaluation of the ex-pression in (9) is straightforward Table 2 compares the priorsp(k|n) corresponding to the Dirichlet and NndashIG processes inthe same setup of Table 1

In this toy example note that the distribution p(middot|n) inducedby the NndashIG prior is flatter than that corresponding to theDirichlet process This fact plays a role in the context of mixturemodeling to be examined in the next section

4 THE MIXTURE OF NORMALIZEDINVERSEndashGAUSSIAN PROCESS

In light of the previous results it is clear that the reinforce-ment mechanism makes the NndashIG prior an interesting alterna-tive to the Dirichlet process Nowadays the Dirichlet processis seldom used directly for assessing inferences indeed it iscommonly exploited as the crucial building block in a hier-archical mixture model of type (1) Here we aim to study amixture of NndashIG process model as set in (1) and suitable semi-parametric variations of it The mixture of NndashIG process repre-sents a particular case of normalized random measures drivenby increasing additive processes (see Nieto-Barajas et al 2004)and also of species-sampling mixture models (see Ishwaran andJames 2003) With respect to both classes the mixture of NndashIGprocess stands out for its tractability

To exploit a mixture of NndashIG process for inferential pur-poses it is essential to derive an appropriate sampling schemeIn such a framework knowledge of the predictive distributionsdetermined in Proposition 3 is crucial Indeed most of the sim-ulation algorithms developed in Bayesian nonparametrics relyon variations of the BlackwellndashMacQueen Poacutelya urn schemeand on the development of appropriate Gibbs sampling pro-cedures (See Escobar 1994 MacEachern 1994 Escobar andWest 1995 for the Dirichlet case and Pitman 1996 Ishwaranand James 2001 for extensions to stick-breaking priors) In thecase of an NndashIG process it follows that the joint distributionof X(n) can be characterized by the following generalized Poacutelyaurn scheme Let Z1 Zn be an iid sample from P0 Then asample X(n) from an NndashIG process can be generated as followsSet X1 = Z1 and for i = 2 n

(Xi

∣∣X(iminus1)

) =

Zi with probability w(iminus1)0

Xlowastij with probability

(nj minus 12)w(iminus1)1 j = 1 ki

(11)

Table 2 Prior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes

n = 10 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10

Dirichleta = 1737 0274 1350 2670 2870 1850 0756 0196 0031 00029 00001

NndashIGa = 5 0523 1286 1813 1936 1716 1296 0824 042 0155 0031

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1285

Figure 3 Prior Probabilities for k (n = 100) Corresponding to the NndashIG Process and the Dirichlet Process for Different Values of a ( a) Thesevalues were chosen to match the modes at 6 14 and 28 [ Dirichlet ( a = 12 42 126) (left a = 12 to right a = 126) NndashIG (a = 1 1 5)(left a = 1 to right a = 5)]

where ki represents the number of distinct observations de-noted by Xlowast

i1 Xlowastiki

in X(iminus1) and w(iminus1)0 and w(iminus1)

1 are asgiven in Proposition 3

Moreover the result in Proposition 4 can be interpreted asproviding the prior distribution of the number k of componentsin the mixture for a given sample size n As in the Dirichletcase a smaller total mass a yields a p(middot|n) more concentratedon smaller values of k This can be explained by the fact that asmaller a gives rise to a smaller w(n)

0 that is it generates newdata with lower probability However the p(middot|n) induced by theNndashIG prior is apparently less informative than that correspond-ing to the Dirichlet process prior and thus is more robust withrespect to changes in a A qualitative illustration is given in Fig-ure 3 where the distribution of k given n = 100 observations isdepicted for the NndashIG and Dirichlet processes We connectedthe probability points by straight lines only for visual simplifi-cation Note that the mode of the distribution of k correspondingto the NndashIG never has probability larger than 07

Recently new algorithms have been proposed for dealingwith mixtures Extending the work of Brunner Chan Jamesand Lo (2001) on an iid sequential importance sampling pro-cedure for fitting MDPs Ishwaran and James (2003) proposeda generalized weighted Chinese restaurant algorithm that cov-ers species-sampling mixture models They formally derivedthe posterior distribution of a species-sampling mixture Todraw approximate samples from this distribution they deviseda computational scheme that requires knowledge of the condi-tional distribution of the species-sampling model P given themissing values X1 Xn When feasible such an algorithmhas the merit of reducing the Monte Carlo error (see Ishawaranand James 2003 for details and further references) In the caseof an NndashIG process sampling from its posterior law is notstraightforward because it is characterized in terms of a latent

variable (see James 2002) Further investigations are needed toimplement these interesting sampling procedures efficiently inthe NndashIG case and more generally for normalized RMI Prob-lems of the same type arise if one is willing to implement thescheme proposed by Nieto-Barajas et al (2004)

To carry out a more detailed comparison between Dirichletand NndashIG mixtures we next consider two illustrative examples

41 Simulated Data

Here we consider simulated data sampled from uniform mix-tures of normal distributions with different number of compo-nents and compare the behavior of the MDP and mixture ofNndashIG process for different choices of priors on the number ofcomponents We then evaluate the performance in terms of pos-terior probabilities on the correct number of components and interms of log-Bayes factors for testing k = klowast against k = klowastwhere klowast represents the correct number of components

We first analyze a dataset Y(100) simulated from a mixtureof three normal distributions with means minus40 and 8 variance1 and corresponding weights 33 34 and 33 Let N(middot|m v)denote the normal distribution with mean m and variance v gt 0We consider the following hierarchical model to represent suchdata

(Yi|Xi)indsim N(Yi|Xi1) i = 1 100

(Xi|P)iidsim P

P sim P

where P refers to either an NndashIG process or a Dirichletprocess Both are centered at the same prior guess N(middot|Y t2)where t is the data range that is t = max Yi minus min Yi To fixthe total masses a and a we no longer consider the issue of

1286 Journal of the American Statistical Association December 2005

Figure 4 Prior Probabilities for k (n = 100) Corresponding to the Dirichlet and NndashIG Processes [ Dirichlet ( a = 2) NndashIG (a = 2365)]

matching variances For comparative purposes in this mixture-modeling framework it seems more plausible to start with sim-ilar priors for the number of components k Hence a and a aresuch that the mode of p(middot|n) is the same in both the Dirichletand NndashIG cases We choose to set the mode equal to k = 8 inboth cases yielding a = 2 and a = 2365 The correspondingdistributions are plotted in Figure 4

In this setup we draw a sample from the posterior distribu-tion L (X(100)|Y(100)) by iteratively drawing samples from thedistributions of (Xi|X(100)

minusi Y(100)) for i = 1 100 given by

P(Xi isin middot∣∣X(100)

minusi Y(100))

= qlowasti0N

(

Xi isin middot∣∣∣t2Yi + Y

1 + t2

t2

t2 + 1

)

+kisum

j=1

qlowastijδXlowast

j(middot) (12)

where X(100)minusi is the vector X(100) with the ith coordinate deleted

and ki refers to the number of Xlowastj rsquos in X(100)

minusi The mixing pro-portions are given by

qlowasti0 prop w(99)

i0 N(middot|Y t2 + 1)

and

qlowastij prop

(nj minus 12)w(99)i1

N(middot|Xlowast

j 1)

subject to the constraintsumki

j=0 qlowastij = 1 The values for w(99)

i0and w(99)

i1 are given as in Proposition 3 when applied for de-termining the ldquopredictiverdquo distribution of Xi given X(100)

minusi theyare computed with a precision of 14 decimal places Note thatin general w(99)

i0 and w(99)i1 depend only on a n and ki and not

on the different partition ξ of n = nξ1 + middot middot middot + nξk This implies

that a table containing the values for w(99)i0 and w(99)

i1 for a givena n and k = 1 99 can be generated in advance for use inthe Poacutelya urn Gibbs sampler Further details are available onrequest from the authors

To obtain our estimates we resorted to the Poacutelya urn Gibbssampler such as the one set forth by Escobar and West (1995)As pointed out by Ishwaran and James (2001) such a gener-alized Poacutelya urn characterization also can be considered forthe NndashIG case the only ingredients being the weights of thepredictive distribution determined in Proposition 3 All of thefollowing results were obtained by implementing the MCMCalgorithm using 10000 iterations after 2000 burn-in sweepsFigure 5 depicts the posterior density estimates and the truedensity that generated the data

The predictive density corresponding to the mixture of NndashIGprocess clearly represents a more accurate estimate Table 3provides the prior and posterior probabilities of the number ofcomponents The prior probabilities are computed according toProposition 4 whereas the posterior probabilities follow fromthe MCMC output One may note that the posterior mass as-signed to k = 3 by the NndashIG process is significantly higher thanthat corresponding to the MDP

As far as the log-Bayes factors are concerned one obtainsBFDα

= 546 and BFNndashIGα = 497 thus favoring the MDP Thereason of this outcome seems to be due to the fact that the priorprobability on the correct number of components is much lowerfor the MDP

It is interesting to look also at the case in which the modeof p(k|n) corresponds to the correct number of componentsklowast = 3 In such a case the prior probability on klowast = 3 is muchhigher for the MDP being 285 against 0567 of the mixture ofNndashIG process This results in a posterior probability on klowast = 3of 9168 for the MDP and 86007 for the mixture of NndashIGprocess whereas the log-Bayes factors favor the NndashIG mixturebeing BFDα

= 332 and BFNminusIGα = 463 This again seems tobe due to the fact that the prior probability of one process onthe correct number of components is much lower than the onecorresponding to the other

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1287

Figure 5 Posterior Density Estimates for the Mixture of Three Normals With Means minus4 0 8 Variance 1 and Mixing Proportions 33 34and 33 [ Dirichlet ( a = 2) model NndashIG (a = 2365)]

The next dataset Y(120) that we deal with in this frameworkis drawn from a uniform mixture of six normal distributionswith means minus10minus6minus226 and 10 and variance 7 Both theDirichlet and NndashIG processes are again centered at the sameprior guess N(middot|Y t2) where t represents for the data rangeSet a = 01 and a = 5 such that p(k|n) have mode at k = 3for both the MDP and mixture of NndashIG process Such an ex-ample is meant to shed some light on the ability of the mix-ture of NndashIG process to detect clusters when starting from aprior with mode at a small number of clusters and data comingfrom a mixture with more components In this case the mix-ture of NndashIG process performs better than the MDP both interms of posterior probabilities as shown in Table 4 and alsoin terms of log-Bayes factors because BFDα

= 313 whereasBFNminusIGα = 378

If one considers data drawn from mixtures with more com-ponents then the mixture of NndashIG process does systematicallybetter in terms of posterior probabilities but the phenom-enon of ldquorelatively low probability on the correct number ofcomponentsrdquo of the MDP reappears leading it to be favoredin terms of Bayes factors For instance we have taken 120data from an uniform mixture of eight normals with meansminus14minus10minus6minus22610 and 14 and variance 7 and seta = 01 and a = 5 so the p(k|120) have mode at k = 3 Thisyields a priori probabilities on klowast = 8 of 00568 and 0489 and

a posteriori probabilities of 0937 and 4338 for the MDP andmixture of NndashIG process In contrast the log-Bayes factorsare BFDα

= 290 and BFNminusIGα = 273 One can alternativelymatch the a priori probability on the correct number of compo-nents such that p(8|120) = 04765 for both priors This happensif we set a = 86 which results in the MDP having mode atk = 4 closer to the correct number klowast = 8 whereas the mixtureof NndashIG process remains unchanged This results in a poste-rior probability on klowast = 8 for the MDP of 1134 and havingremoved the influence of the low probability p(8|120) in a log-Bayes factor of 94

42 Galaxy Data

In this section we reanalyze the galaxy dataset popularizedby Roeder (1990) These data consisting of the relative ve-locities (in kmsec) of 82 galaxies have been widely studiedwithin the framework of Bayesian statistics (see eg Roederand Wasserman 1997 Green and Richardson 2001 Petrone andVeronese 2002) For comparison purposes we use the follow-ing semiparametric setting also analyzed by Escobar and West(1995)

(Yi|miVi)indsim N(Yi|miVi) i = 1 82

(miVi|P)iidsim P

P sim P

Table 3 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|100) Has Mode at k = 8

n = 100 k le 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 00225 00997 03057 06692 11198 14974 16501 46356a = 2 Posterior 7038 2509 0406 0046 0001

NndashIG Prior 02404 0311 0419 04937 05386 05611 05677 68685a = 2365 Posterior 8223 1612 0160 0005

1288 Journal of the American Statistical Association December 2005

Table 4 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|120) Has Mode at 3

n = 120 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 08099 21706 27433 21954 12583 05532 01949 005678 00176a = 5 Posterior 0148 3455 5722 064 0033 0002

NndashIG Prior 04389 05096 05169 05142 05082 04997 0489 04765 6047a = 01 Posterior 0085 6981 2395 0479 006

where as before P represents either the NndashIG process or theDirichlet process Again the prior guess P0 is the same for bothnonparametric priors and is characterized by

P0(A) =int

AN(x|microτvminus1)Ga(v|s2S2)dx dv

where A isin B(RtimesR+) and Ga(middot|cd) is the density correspond-

ing to a gamma distribution with mean cd Similar to Escobarand West (1995) we assume a further hierarchy for micro and τ namely micro sim N(middot|gh) and τminus1 sim Ga(middot|wW) We choose thehyperparameters for this illustration to fit those used by Escobarand West (1995) namely g = 0h = 001w = 1 and W = 100For comparative purposes we again choose a to achieve themode match if the mode of p(middot|82) is in k = 5 as it is for theDirichlet process with a = 1 then a = 0829 From Table 5 it isapparent that the NndashIG process provides a noninformative priorfor k compared with the distribution induced by the Dirichletprocess

The details of the Poacutelya urn Gibbs sampler are as provided byEscobar and West (1995) with the only difference being the re-placement of the weights with those corresponding to the NndashIGprocess given in Proposition 3 Figure 6 shows the posteriordensity estimates based on 10000 iterations considered after aburnndashin period of 2000 sweeps Table 5 displays the prior andthe posterior probabilities of the number of components in themixture

Some comments on the results in Table 5 are in order. Escobar and West (1995) extensively discussed the issue of learning about the total mass a. This is motivated by the sensitivity of posterior inferences to the choice of the parameter a; hence they randomized a and specified a prior distribution for it. It seems worth noting that the posterior distribution of the number of components for the N–IG process with a fixed a essentially coincides with the corresponding distribution arising from an MDP with random a, given in table 6 of Escobar and West (1995). Some explanations of this phenomenon might be that the prior p(k|n) is essentially noninformative in a broad neighborhood of its mode, thus mitigating the influence of a on posterior inferences, and that the aggressiveness in reducing/detecting clusters of the N–IG mixture, shown in the previous example, is also a plausible reason for such a gain in "parsimony." However, for a mixture of N–IG process it may be of interest to have a random a. To achieve this, a Metropolis step must be added in the algorithm; in principle this is straightforward (a sketch is given below). The only drawback would be an increase in the computational burden, because computing the weights for the N–IG process is, after all, not as quick as computing those corresponding to the Dirichlet process.
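In that spirit, a random-walk Metropolis update for a on the log scale could look like the following minimal sketch, where log_eppf (the log of the partition distribution (A.1) in the Appendix, viewed as a function of a at the current cluster sizes) and log_prior are user-supplied placeholders:

```python
import numpy as np

def metropolis_update_a(a, counts, log_eppf, log_prior, rng, step=0.3):
    """One random-walk Metropolis step for the total mass a on the log scale.

    log_eppf(a, counts) and log_prior(a) are hypothetical user-supplied
    functions; counts holds the current cluster sizes (n_1, ..., n_k).
    """
    a_prop = a * np.exp(step * rng.standard_normal())
    # log acceptance ratio; log(a_prop/a) is the Jacobian of the log-scale move
    log_ratio = (log_eppf(a_prop, counts) + log_prior(a_prop)
                 - log_eppf(a, counts) - log_prior(a)
                 + np.log(a_prop) - np.log(a))
    return a_prop if np.log(rng.uniform()) < log_ratio else a
```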

5. THE MEAN OF A NORMALIZED INVERSE-GAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for inference with continuous data is represented by histogram smoothing. Such a problem can be handled by exploiting the so-called "filtered-variate" random probability measures, as defined by Dickey and Jiang (1998). These quantities essentially coincide with suitable means of random probability measures. Here we focus attention on means of N–IG processes. After the recent breakthrough achieved by Cifarelli and Regazzini (1990), this has become a very active area of research in Bayesian nonparametrics (see, e.g., Diaconis and Kemperman 1996; Regazzini, Guglielmi, and Di Nunno 2002 for the Dirichlet case, and Epifani, Lijoi, and Prünster 2003; Hjort 2003; James 2002 for results beyond the Dirichlet case). In particular, Regazzini et al. (2003) dealt with means of normalized priors, providing conditions for their existence and their exact prior and approximate posterior distributions. Here we specialize their general results to the N–IG process and derive the exact posterior density.

Letting 𝕏 = ℝ and 𝒳 = B(ℝ), we study the distribution of the mean ∫_ℝ x P̃(dx). The first issue to face is its finiteness. A necessary and sufficient condition for this to hold can be derived from proposition 1 of Regazzini et al. (2003) and is given by

$$ \int_{\mathbb{R}} \sqrt{2\lambda |x| + 1}\, \alpha(dx) < +\infty \qquad \text{for every } \lambda > 0. $$

As far as the problem of the distribution of a mean is concerned, proposition 2 of Regazzini et al. (2003) and some calculations lead to an expression of the prior distribution function of the mean as

$$ F(\sigma) = \frac{1}{2} - \frac{e^{a}}{\pi} \int_0^{+\infty} \frac{1}{t} \exp\left\{ -\int_{\mathbb{R}} \sqrt[4]{1 + 4t^2(x-\sigma)^2}\, \cos\!\left[\tfrac{1}{2}\arctan\bigl(2t(x-\sigma)\bigr)\right] \alpha(dx) \right\} \times \sin\left\{ \int_{\mathbb{R}} \sqrt[4]{1 + 4t^2(x-\sigma)^2}\, \sin\!\left[\tfrac{1}{2}\arctan\bigl(2t(x-\sigma)\bigr)\right] \alpha(dx) \right\} dt \qquad (13) $$

for any σ ∈ ℝ.
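For a concrete sense of (13), the following sketch evaluates F(σ) by nested quadrature under the illustrative assumption α = a · N(0, 1), so that integrals against α(dx) are a times standard normal expectations; the truncation points are arbitrary numerical choices, and the whole block is a sketch rather than production code.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a = 1.0  # total mass alpha(R); illustrative choice

def alpha_integral(f):
    """Approximate \\int f(x) alpha(dx) with alpha = a * N(0,1)."""
    val, _ = quad(lambda x: f(x) * norm.pdf(x), -8.0, 8.0)
    return a * val

def F(sigma):
    def outer(t):
        r = lambda x: (1.0 + 4.0 * t**2 * (x - sigma)**2) ** 0.25
        ang = lambda x: 0.5 * np.arctan(2.0 * t * (x - sigma))
        re = alpha_integral(lambda x: r(x) * np.cos(ang(x)))
        im = alpha_integral(lambda x: r(x) * np.sin(ang(x)))
        return np.exp(-re) * np.sin(im) / t
    val, _ = quad(outer, 1e-8, 50.0, limit=200)
    return 0.5 - (np.exp(a) / np.pi) * val

# With alpha symmetric about 0, the inner sine integral vanishes at sigma = 0,
# so F(0) should be close to 1/2.
print(F(0.0))
```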

Table 5. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes in the Galaxy Data Example

n = 82                           k ≤ 3     k = 4     k = 5     k = 6     k = 7     k = 8     k = 9     k = 10    k ≥ 11
Dirichlet (a = 1)     Prior      .2140     .2060     .214      .169      .106      .0548     .0237     .0088     .0038
                      Posterior  —         .0465     .125      .2524     .2484     .2011     .08       .019      .0276
N–IG (a = 0.829)      Prior      .1309     .0617     .0627     .0621     .0607     .0586     .0562     .0534     .4537
                      Posterior  .003      .0754     .168      .2301     .2338     .1225     .0941     .0352     .0379


Figure 6. Posterior Density Estimates for the Galaxy Dataset [N–IG (a = 0.829); Dirichlet (a = 1)].

Before providing the posterior density of a mean of an N–IG process, it is useful to introduce the Liouville–Weyl fractional integral, defined as

$$ I^{\,n}_{c+} h(\sigma) = \int_c^{\sigma} \frac{(\sigma - u)^{n-1}}{(n-1)!}\, h(u)\, du $$

for n ≥ 1, and set equal to the identity operator for n = 0, for any c < σ. Moreover, let Re z and Im z denote the real and imaginary parts of z ∈ ℂ.
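Numerically, this operator is just an iterated integral; a direct quadrature rendering (illustrative only, with the constant function as a test case whose answer is known in closed form) is:

```python
import numpy as np
from scipy.integrate import quad
from math import factorial

def liouville_weyl(h, n, c, sigma):
    """Liouville-Weyl fractional integral I^n_{c+} h(sigma) by quadrature."""
    if n == 0:
        return h(sigma)  # identity operator by convention
    val, _ = quad(lambda u: (sigma - u) ** (n - 1) * h(u) / factorial(n - 1),
                  c, sigma)
    return val

# For h = 1, I^n_{c+} h(sigma) = (sigma - c)^n / n!; check n = 3, c = 0, sigma = 2:
print(liouville_weyl(lambda u: 1.0, 3, 0.0, 2.0), 2.0**3 / factorial(3))
```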

Proposition 5. If P̃ is an N–IG process with diffuse parameter α and −∞ ≤ c = inf supp(α), then the posterior density of the mean is of the form

$$ \rho_{x^{(n)}}(\sigma) = \frac{1}{\pi}\, I^{\,n-1}_{c+} \operatorname{Im} \psi(\sigma) \quad \text{if } n \text{ is even} \qquad (14) $$

and

$$ \rho_{x^{(n)}}(\sigma) = -\frac{1}{\pi}\, I^{\,n-1}_{c+} \operatorname{Re} \psi(\sigma) \quad \text{if } n \text{ is odd,} \qquad (15) $$

with

$$ \psi(\sigma) = \frac{(n-1)!\, 2^{n-1}}{a^{2n-(2+k)} \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n;\, a)} \int_0^{\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x-\sigma)}\, \alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x^{*}_j - \sigma)\bigr)^{-n_j + 1/2}\, dt, $$

where Γ(·; a) denotes the incomplete gamma function Γ(s; a) = ∫_a^∞ t^{s−1} e^{−t} dt, a convention used throughout the Appendix as well.

6. CONCLUDING REMARKS

In this article we have studied some of the statistical implications of the use of the N–IG process as an alternative to the Dirichlet process. Both priors are almost surely discrete, allow for explicit derivation of their finite-dimensional distributions, and lead to tractable expressions of relevant quantities. Their natural use is in the context of mixture modeling, where they present remarkably different behaviors. Indeed, it has been shown that the prior distribution on the number of components induced by the N–IG process is wider than that induced by the Dirichlet process. This seems to mitigate the influence of the choice of the total mass a of the parameter measure. Moreover, the mixture of N–IG process seems to be more aggressive in reducing/detecting clusters on the basis of ties present in the data. Such conclusions are, however, preliminary and still need to be validated by more extensive applications, which will constitute the object of our future research.

APPENDIX: PROOFS

A.1 Proof of Proposition 1

Having (3) at hand, the computation of (4) is as follows. One operates the transformation W_i = V_i (Σ_{i=1}^n V_i)^{−1} for i = 1, …, n − 1 and W_n = Σ_{j=1}^n V_j. Some algebra and formula 3.471.12 of Gradshteyn and Ryzhik (2000) lead to the desired result.

A.2 Proof of Proposition 2

In proving this result, we do not use the distribution of an N–IG random variable directly. The key point is exploiting the independence of the variables (V_1, …, V_n) used for defining (W_1, …, W_n). Set V = Σ_{j=1}^n V_j and V_{−i} = Σ_{j≠i} V_j, and recall that the moment-generating function of an IG(α, γ)-distributed random variable is given by E[e^{−λX}] = e^{−α(√(2λ+γ²)−γ)}. As far as the mean is concerned, for any i = 1, …, n we have

$$ E[W_i] = E\bigl[V_i V^{-1}\bigr] = \int_0^{+\infty} E\bigl[e^{-uV_{-i}}\bigr]\, E\bigl[e^{-uV_i} V_i\bigr]\, du = \int_0^{+\infty} E\bigl[e^{-uV_{-i}}\bigr]\, E\left[-\frac{d}{du}\, e^{-uV_i}\right] du $$

$$ = -\int_0^{+\infty} e^{-(a-\alpha_i)(\sqrt{2u+1}-1)}\, \frac{d}{du}\, e^{-\alpha_i(\sqrt{2u+1}-1)}\, du = \alpha_i \int_0^{+\infty} (2u+1)^{-1/2}\, e^{-a(\sqrt{2u+1}-1)}\, du $$

$$ = \frac{\alpha_i}{a} \int_0^{+\infty} -\frac{d}{du}\, e^{-a(\sqrt{2u+1}-1)}\, du = \frac{\alpha_i}{a} = p_i, $$

having applied Fubini's theorem and theorem 16.8 of Billingsley (1995). The variance and covariance can be deduced by similar arguments, combined with integration by parts and some tedious algebra.
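The mean formula can also be checked by simulation. The sketch below assumes that the IG(α, γ) law with γ = 1 corresponds, in mean/scale form, to numpy's Wald sampler with mean α and scale α² (this parameterization mapping is our assumption for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.5, 1.0, 2.5])  # alpha_i = alpha(A_i); total mass a below
a = alphas.sum()

# V_i ~ IG(alpha_i, 1), assumed equivalent to Wald(mean=alpha_i, scale=alpha_i^2)
V = np.column_stack([rng.wald(al, al**2, size=200_000) for al in alphas])
W = V / V.sum(axis=1, keepdims=True)  # normalized weights (W_1, ..., W_n)

print(W.mean(axis=0))  # should be close to alphas / a, i.e. E[W_i] = p_i
print(alphas / a)
```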

A.3 Proof of Proposition 3

From Pitman (2003) (see also James 2002; Prünster 2002), we can deduce that the predictive distribution associated with an N–IG process with diffuse parameter α is of the form

$$ P\bigl(X_{n+1} \in \cdot \mid X^{(n)}\bigr) = w^{(n)}_0\, \frac{\alpha(\cdot)}{a} + \frac{1}{n} \sum_{j=1}^{k} w^{(n)}_j\, \delta_{X^{*}_j}(\cdot), $$

with weights equal to

$$ w^{(n)}_0 = \frac{a \int_{\mathbb{R}^+} u^{n}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, \mu_{1}(u)\, du}{n \int_{\mathbb{R}^+} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du} $$

and

$$ w^{(n)}_j = \frac{\int_{\mathbb{R}^+} u^{n}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_j+1}(u) \cdots \mu_{n_k}(u)\, du}{\int_{\mathbb{R}^+} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du}, $$

where μ_n(u) = ∫_{ℝ⁺} v^n e^{−uv} ν(dv) for any positive u and n = 1, 2, …, and ν(dv) = (2πv³)^{−1/2} e^{−v/2} dv. We can easily verify that

$$ \mu_n(u) = 2^{n-1}\, \Gamma\!\left(n - \tfrac{1}{2}\right) \left(\sqrt{\pi}\, [2u+1]^{n-1/2}\right)^{-1} $$

and, moreover, that

$$ \int_{\mathbb{R}^+} u^{n-1}\, e^{-a(\sqrt{2u+1}-1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du = \frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \left[\prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \int_0^{+\infty} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{[2u+1]^{\,n-k/2}}\, du. $$

By the change of variable √(2u+1) = y, the latter expression is equal to

$$ \frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \left[\prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \frac{1}{2^{n-1}} \int_1^{+\infty} \frac{(y^2 - 1)^{n-1}\, e^{-ay}}{y^{2n-k-1}}\, dy $$

$$ = \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \left[\prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} \int_1^{+\infty} \frac{e^{-ay}}{y^{2n-k-2r-1}}\, dy $$

$$ = \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \left[\prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r}\, a^{2n-2-k-2r}\, \Gamma(k + 2r - 2n + 2;\, a). $$

Now the result follows by rearranging the terms appropriately, combined with some algebra.
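As a numerical sanity check on the closed form for μ_n(u) used above, one can compare it with its defining integral (illustrative sketch only):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def mu_integral(n, u):
    """mu_n(u) = \\int_0^inf v^n e^{-uv} nu(dv), nu(dv) = (2 pi v^3)^{-1/2} e^{-v/2} dv."""
    f = lambda v: v**n * (2.0 * np.pi * v**3) ** -0.5 * np.exp(-v / 2.0 - u * v)
    val, _ = quad(f, 0.0, np.inf)  # integrable v^{n-3/2} singularity at 0
    return val

def mu_closed(n, u):
    """Closed form 2^{n-1} Gamma(n-1/2) / (sqrt(pi) (2u+1)^{n-1/2})."""
    return 2.0 ** (n - 1) * gamma(n - 0.5) / (np.sqrt(np.pi) * (2.0 * u + 1.0) ** (n - 0.5))

for n in (1, 2, 5):
    for u in (0.1, 1.0, 3.0):
        print(n, u, mu_integral(n, u), mu_closed(n, u))  # pairs should agree
```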

A.4 Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations (X*_1, …, X*_k) and the random partition coincides with

$$ \alpha(dx_1) \cdots \alpha(dx_k)\, \frac{e^{a} (-a^2)^{n-1}}{a^{k}\, 2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \left[\prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n;\, a). $$

The conditional distribution of the vector (k, n_1, …, n_k), given n, can be obtained by marginalizing with respect to (X*_1, …, X*_k), thus yielding

$$ \frac{e^{a} (-a^2)^{n-1}}{2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \left[\prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k+2+2r-2n;\, a). \qquad (A.1) $$

To determine p(k|n), one needs to compute

$$ \sum_{(*)} \prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right), $$

where (*) means that the sum extends over all partitions of the set of integers {1, …, n} into k components:

$$ \sum_{(*)} \prod_{j=1}^{k} \Gamma\!\left(n_j - \tfrac{1}{2}\right) = \pi^{k/2} \sum_{(*)} \prod_{j=1}^{k} \left(\tfrac{1}{2}\right)_{n_j - 1} = \pi^{k/2} \binom{2n-k-1}{n-1} \frac{\Gamma(n)}{\Gamma(k)}\, 2^{2k-2n}, $$

where, for any positive b, (b)_n = Γ(b + n)/Γ(b), and the result follows.
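Combining (A.1) with this identity (the π^{k/2} and Γ(n) factors cancel) gives, in our reassembly, p(k|n) = e^a (−a²)^{n−1} binom(2n−k−1, n−1) Γ(k)^{−1} 2^{−(2n−k−1)} Σ_{r=0}^{n−1} binom(n−1, r) (−a²)^{−r} Γ(k+2+2r−2n; a). The sketch below evaluates this expression with mpmath, whose incomplete gamma accepts the negative first arguments that occur, and checks that the probabilities sum to 1 for a small n; high working precision is used because the sum alternates in sign.

```python
from mpmath import mp, mpf, binomial, gamma, gammainc, exp, fsum

mp.dps = 50  # high precision to tame the alternating sum

def p_k_given_n(k, n, a):
    """Prior probability of k distinct values among n observations from an
    N-IG process with total mass a (our reassembly of Proposition 4)."""
    a = mpf(a)
    s = fsum(binomial(n - 1, r) * (-a**2) ** (-r)
             * gammainc(k + 2 + 2 * r - 2 * n, a)  # upper incomplete gamma
             for r in range(n))
    return (exp(a) * (-a**2) ** (n - 1) * binomial(2 * n - k - 1, n - 1)
            / (gamma(k) * 2 ** (2 * n - k - 1)) * s)

n, a = 10, 1.0
probs = [p_k_given_n(k, n, a) for k in range(1, n + 1)]
print(fsum(probs))  # should be ~1
```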

A.5 Proof of Proposition 5

Let {Π_m}, with Π_m = {A_{m,i}: i = 1, …, k_m}, be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize α by setting α_m = Σ_{i=1}^{k_m} α_{m,i} δ_{s_{m,i}}, where α_{m,i} = α(A_{m,i}) and s_{m,i} is a point in A_{m,i}. Hence, whenever the jth element in the sample, X_j, belongs to A_{m,i_j}, it is as if we observed s_{m,i_j}. Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to express the posterior density function for the discretized mean as (14) and (15), with

$$ \psi_m(\sigma) = -C_m\bigl(x^{(n)}\bigr)\, \frac{e^{a}\, 2^{n-k}}{(\sqrt{\pi})^{k}} \left[\prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \int_0^{+\infty} t^{n-1}\, e^{-\sum_{j=1}^{k_m} \sqrt{1 - 2it(s_j - \sigma)}\, \alpha_{m,j}} \left( \prod_{r=1}^{k} \left[ \bigl(1 - 2it(s_{i_r} - \sigma)\bigr)^{n_r - 1/2} + g_m\bigl(\alpha_{m,i_r}, s_{m,i_r}, t\bigr) \right] \right)^{-1} dt, \qquad (A.2) $$

where C_m(x^{(n)})^{−1} is the marginal distribution of the discretized sample and g_m(α_{m,i_r}, s_{m,i_r}, t) = O(α²_{m,i_r}) as m → +∞ for any t. On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions {Π_m}. But in the particular N–IG case, we are able to compute C_m(x^{(n)})^{−1} by mimicking the technique used in Proposition 1 and exploiting the diffuseness of α. Thus we have

$$ C_m\bigl(x^{(n)}\bigr)^{-1} = \frac{e^{a}\, 2^{n-k}}{\Gamma(n)\, (\sqrt{\pi})^{k}} \left[\prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\!\left(n_j - \tfrac{1}{2}\right)\right] \int_{(0,+\infty)} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{\prod_{r=1}^{k} \bigl[ [1+2u]^{n_r - 1/2} + h_m(\alpha_{m,i_r}, u) \bigr]}\, du, $$

where h_m(α_{m,i_r}, u) = O(α²_{m,i_r}) as m → +∞ for any u. Insert the previous expression in (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15), with

$$ \psi(\sigma) = -\frac{(n-1)!\, \int_0^{\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x-\sigma)}\, \alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x^{*}_j - \sigma)\bigr)^{-n_j + 1/2}\, dt}{\int_0^{+\infty} u^{n-1}\, e^{-a\sqrt{2u+1}}\, (2u+1)^{-n+k/2}\, du}. $$

Now, arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.
Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.
Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.
Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.
Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.
Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.
Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.
Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.
Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.
Escobar, M. D. (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.
Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.
Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.
Ferguson, T. S. (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.
Ferguson, T. S. (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.
Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.
Gradshteyn, I. S., and Ryzhik, I. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.
Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.
Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.
Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.
Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.
Ishwaran, H., and James, L. F. (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.
James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.
James, L. F. (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.
Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.
Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.
Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates: I. Density Estimates," The Annals of Statistics, 12, 351–357.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.
MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.
MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.
Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.
Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.
Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.
Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.
Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.
Pitman, J. (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.
Pitman, J. (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.
Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.
Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.
Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.
Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.
Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.
Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.
Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.
Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.
Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.

Page 8: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1285

Figure 3 Prior Probabilities for k (n = 100) Corresponding to the NndashIG Process and the Dirichlet Process for Different Values of a ( a) Thesevalues were chosen to match the modes at 6 14 and 28 [ Dirichlet ( a = 12 42 126) (left a = 12 to right a = 126) NndashIG (a = 1 1 5)(left a = 1 to right a = 5)]

where ki represents the number of distinct observations de-noted by Xlowast

i1 Xlowastiki

in X(iminus1) and w(iminus1)0 and w(iminus1)

1 are asgiven in Proposition 3

Moreover the result in Proposition 4 can be interpreted asproviding the prior distribution of the number k of componentsin the mixture for a given sample size n As in the Dirichletcase a smaller total mass a yields a p(middot|n) more concentratedon smaller values of k This can be explained by the fact that asmaller a gives rise to a smaller w(n)

0 that is it generates newdata with lower probability However the p(middot|n) induced by theNndashIG prior is apparently less informative than that correspond-ing to the Dirichlet process prior and thus is more robust withrespect to changes in a A qualitative illustration is given in Fig-ure 3 where the distribution of k given n = 100 observations isdepicted for the NndashIG and Dirichlet processes We connectedthe probability points by straight lines only for visual simplifi-cation Note that the mode of the distribution of k correspondingto the NndashIG never has probability larger than 07

Recently new algorithms have been proposed for dealingwith mixtures Extending the work of Brunner Chan Jamesand Lo (2001) on an iid sequential importance sampling pro-cedure for fitting MDPs Ishwaran and James (2003) proposeda generalized weighted Chinese restaurant algorithm that cov-ers species-sampling mixture models They formally derivedthe posterior distribution of a species-sampling mixture Todraw approximate samples from this distribution they deviseda computational scheme that requires knowledge of the condi-tional distribution of the species-sampling model P given themissing values X1 Xn When feasible such an algorithmhas the merit of reducing the Monte Carlo error (see Ishawaranand James 2003 for details and further references) In the caseof an NndashIG process sampling from its posterior law is notstraightforward because it is characterized in terms of a latent

variable (see James 2002) Further investigations are needed toimplement these interesting sampling procedures efficiently inthe NndashIG case and more generally for normalized RMI Prob-lems of the same type arise if one is willing to implement thescheme proposed by Nieto-Barajas et al (2004)

To carry out a more detailed comparison between Dirichletand NndashIG mixtures we next consider two illustrative examples

41 Simulated Data

Here we consider simulated data sampled from uniform mix-tures of normal distributions with different number of compo-nents and compare the behavior of the MDP and mixture ofNndashIG process for different choices of priors on the number ofcomponents We then evaluate the performance in terms of pos-terior probabilities on the correct number of components and interms of log-Bayes factors for testing k = klowast against k = klowastwhere klowast represents the correct number of components

We first analyze a dataset Y(100) simulated from a mixtureof three normal distributions with means minus40 and 8 variance1 and corresponding weights 33 34 and 33 Let N(middot|m v)denote the normal distribution with mean m and variance v gt 0We consider the following hierarchical model to represent suchdata

(Yi|Xi)indsim N(Yi|Xi1) i = 1 100

(Xi|P)iidsim P

P sim P

where P refers to either an NndashIG process or a Dirichletprocess Both are centered at the same prior guess N(middot|Y t2)where t is the data range that is t = max Yi minus min Yi To fixthe total masses a and a we no longer consider the issue of

1286 Journal of the American Statistical Association December 2005

Figure 4 Prior Probabilities for k (n = 100) Corresponding to the Dirichlet and NndashIG Processes [ Dirichlet ( a = 2) NndashIG (a = 2365)]

matching variances For comparative purposes in this mixture-modeling framework it seems more plausible to start with sim-ilar priors for the number of components k Hence a and a aresuch that the mode of p(middot|n) is the same in both the Dirichletand NndashIG cases We choose to set the mode equal to k = 8 inboth cases yielding a = 2 and a = 2365 The correspondingdistributions are plotted in Figure 4

In this setup we draw a sample from the posterior distribu-tion L (X(100)|Y(100)) by iteratively drawing samples from thedistributions of (Xi|X(100)

minusi Y(100)) for i = 1 100 given by

P(Xi isin middot∣∣X(100)

minusi Y(100))

= qlowasti0N

(

Xi isin middot∣∣∣t2Yi + Y

1 + t2

t2

t2 + 1

)

+kisum

j=1

qlowastijδXlowast

j(middot) (12)

where X(100)minusi is the vector X(100) with the ith coordinate deleted

and ki refers to the number of Xlowastj rsquos in X(100)

minusi The mixing pro-portions are given by

qlowasti0 prop w(99)

i0 N(middot|Y t2 + 1)

and

qlowastij prop

(nj minus 12)w(99)i1

N(middot|Xlowast

j 1)

subject to the constraintsumki

j=0 qlowastij = 1 The values for w(99)

i0and w(99)

i1 are given as in Proposition 3 when applied for de-termining the ldquopredictiverdquo distribution of Xi given X(100)

minusi theyare computed with a precision of 14 decimal places Note thatin general w(99)

i0 and w(99)i1 depend only on a n and ki and not

on the different partition ξ of n = nξ1 + middot middot middot + nξk This implies

that a table containing the values for w(99)i0 and w(99)

i1 for a givena n and k = 1 99 can be generated in advance for use inthe Poacutelya urn Gibbs sampler Further details are available onrequest from the authors

To obtain our estimates we resorted to the Poacutelya urn Gibbssampler such as the one set forth by Escobar and West (1995)As pointed out by Ishwaran and James (2001) such a gener-alized Poacutelya urn characterization also can be considered forthe NndashIG case the only ingredients being the weights of thepredictive distribution determined in Proposition 3 All of thefollowing results were obtained by implementing the MCMCalgorithm using 10000 iterations after 2000 burn-in sweepsFigure 5 depicts the posterior density estimates and the truedensity that generated the data

The predictive density corresponding to the mixture of NndashIGprocess clearly represents a more accurate estimate Table 3provides the prior and posterior probabilities of the number ofcomponents The prior probabilities are computed according toProposition 4 whereas the posterior probabilities follow fromthe MCMC output One may note that the posterior mass as-signed to k = 3 by the NndashIG process is significantly higher thanthat corresponding to the MDP

As far as the log-Bayes factors are concerned one obtainsBFDα

= 546 and BFNndashIGα = 497 thus favoring the MDP Thereason of this outcome seems to be due to the fact that the priorprobability on the correct number of components is much lowerfor the MDP

It is interesting to look also at the case in which the modeof p(k|n) corresponds to the correct number of componentsklowast = 3 In such a case the prior probability on klowast = 3 is muchhigher for the MDP being 285 against 0567 of the mixture ofNndashIG process This results in a posterior probability on klowast = 3of 9168 for the MDP and 86007 for the mixture of NndashIGprocess whereas the log-Bayes factors favor the NndashIG mixturebeing BFDα

= 332 and BFNminusIGα = 463 This again seems tobe due to the fact that the prior probability of one process onthe correct number of components is much lower than the onecorresponding to the other

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1287

Figure 5 Posterior Density Estimates for the Mixture of Three Normals With Means minus4 0 8 Variance 1 and Mixing Proportions 33 34and 33 [ Dirichlet ( a = 2) model NndashIG (a = 2365)]

The next dataset Y(120) that we deal with in this frameworkis drawn from a uniform mixture of six normal distributionswith means minus10minus6minus226 and 10 and variance 7 Both theDirichlet and NndashIG processes are again centered at the sameprior guess N(middot|Y t2) where t represents for the data rangeSet a = 01 and a = 5 such that p(k|n) have mode at k = 3for both the MDP and mixture of NndashIG process Such an ex-ample is meant to shed some light on the ability of the mix-ture of NndashIG process to detect clusters when starting from aprior with mode at a small number of clusters and data comingfrom a mixture with more components In this case the mix-ture of NndashIG process performs better than the MDP both interms of posterior probabilities as shown in Table 4 and alsoin terms of log-Bayes factors because BFDα

= 313 whereasBFNminusIGα = 378

If one considers data drawn from mixtures with more com-ponents then the mixture of NndashIG process does systematicallybetter in terms of posterior probabilities but the phenom-enon of ldquorelatively low probability on the correct number ofcomponentsrdquo of the MDP reappears leading it to be favoredin terms of Bayes factors For instance we have taken 120data from an uniform mixture of eight normals with meansminus14minus10minus6minus22610 and 14 and variance 7 and seta = 01 and a = 5 so the p(k|120) have mode at k = 3 Thisyields a priori probabilities on klowast = 8 of 00568 and 0489 and

a posteriori probabilities of 0937 and 4338 for the MDP andmixture of NndashIG process In contrast the log-Bayes factorsare BFDα

= 290 and BFNminusIGα = 273 One can alternativelymatch the a priori probability on the correct number of compo-nents such that p(8|120) = 04765 for both priors This happensif we set a = 86 which results in the MDP having mode atk = 4 closer to the correct number klowast = 8 whereas the mixtureof NndashIG process remains unchanged This results in a poste-rior probability on klowast = 8 for the MDP of 1134 and havingremoved the influence of the low probability p(8|120) in a log-Bayes factor of 94

42 Galaxy Data

In this section we reanalyze the galaxy dataset popularizedby Roeder (1990) These data consisting of the relative ve-locities (in kmsec) of 82 galaxies have been widely studiedwithin the framework of Bayesian statistics (see eg Roederand Wasserman 1997 Green and Richardson 2001 Petrone andVeronese 2002) For comparison purposes we use the follow-ing semiparametric setting also analyzed by Escobar and West(1995)

(Yi|miVi)indsim N(Yi|miVi) i = 1 82

(miVi|P)iidsim P

P sim P

Table 3 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|100) Has Mode at k = 8

n = 100 k le 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 00225 00997 03057 06692 11198 14974 16501 46356a = 2 Posterior 7038 2509 0406 0046 0001

NndashIG Prior 02404 0311 0419 04937 05386 05611 05677 68685a = 2365 Posterior 8223 1612 0160 0005

1288 Journal of the American Statistical Association December 2005

Table 4 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|120) Has Mode at 3

n = 120 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 08099 21706 27433 21954 12583 05532 01949 005678 00176a = 5 Posterior 0148 3455 5722 064 0033 0002

NndashIG Prior 04389 05096 05169 05142 05082 04997 0489 04765 6047a = 01 Posterior 0085 6981 2395 0479 006

where as before P represents either the NndashIG process or theDirichlet process Again the prior guess P0 is the same for bothnonparametric priors and is characterized by

P0(A) =int

AN(x|microτvminus1)Ga(v|s2S2)dx dv

where A isin B(RtimesR+) and Ga(middot|cd) is the density correspond-

ing to a gamma distribution with mean cd Similar to Escobarand West (1995) we assume a further hierarchy for micro and τ namely micro sim N(middot|gh) and τminus1 sim Ga(middot|wW) We choose thehyperparameters for this illustration to fit those used by Escobarand West (1995) namely g = 0h = 001w = 1 and W = 100For comparative purposes we again choose a to achieve themode match if the mode of p(middot|82) is in k = 5 as it is for theDirichlet process with a = 1 then a = 0829 From Table 5 it isapparent that the NndashIG process provides a noninformative priorfor k compared with the distribution induced by the Dirichletprocess

The details of the Poacutelya urn Gibbs sampler are as provided byEscobar and West (1995) with the only difference being the re-placement of the weights with those corresponding to the NndashIGprocess given in Proposition 3 Figure 6 shows the posteriordensity estimates based on 10000 iterations considered after aburnndashin period of 2000 sweeps Table 5 displays the prior andthe posterior probabilities of the number of components in themixture

Some comments on the results in Table 5 are in orderEscobar and West (1995) extensively discussed the issue oflearning about the total mass a This is motivated by the sen-sitivity of posterior inferences on the choice of the parameter aHence they randomized a and specified a prior distributionfor it It seems worth noting that the posterior distributionof the number of components for the NndashIG process with afixed a essentially coincides with the corresponding distribu-tion arising from an MDP with random a given in table 6 ofEscobar and West (1995) Some explanations of this phenom-enon might be that the prior p(k|n) is essentially noninforma-tive in a broad neighborhood of its mode thus mitigating theinfluence of a on posterior inferences and the aggressivenessin reducingdetecting clusters of the NndashIG mixture shown inthe previous example seems also to be a plausible reason ofsuch a gain in ldquoparsimonyrdquo However for a mixture of NndashIGprocess it may be of interest to have a random a To achievethis a Metropolis step must be added in the algorithm In prin-ciple this is straightforward The only drawback would be an

increase in the computational burden because computing theweights for the NndashIG process is after all not as quick as com-puting those corresponding to the Dirichlet process

5 THE MEAN OF A NORMALIZEDINVERSEndashGAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for infer-ence with continuous data is represented by histogram smooth-ing Such a problem can be handled by exploiting the so-calledldquofiltered-variaterdquo random probability measures as defined byDickey and Jiang (1998) These quantities essentially coincidewith suitable means of random probability measures Here wefocus attention on means of NndashIG processes After the recentbreakthrough achieved by Cifarelli and Regazzini (1990) thishas become a very active area of research in Bayesian nonpara-metrics (see eg Diaconis and Kemperman 1996 RegazziniGuglielmi and Di Nunno 2002 for the Dirichlet case andEpifani Lijoi and Pruumlnster 2003 Hjort 2003 James 2002for results beyond the Dirichlet case) In particular Regazziniet al (2003) dealt with means of normalized priors providingconditions for their existence their exact prior and approximateposterior distribution Here we specialize their general results tothe NndashIG process and derive the exact posterior density

Letting X = R and X = B(R) we study the distributionof the mean

intR

xP(dx) The first issue to face is its finitenessA necessary and sufficient condition for this to hold can be de-rived from proposition 1 of Regazzini et al (2003) and is givenby

intR

radic2λx + 1α(dx) lt +infin for every λ gt 0

As far as the problem of the distribution of a mean is con-cerned proposition 2 of Regazzini et al (2003) and some cal-culations lead to an expression of the prior distribution functionof the mean as

F(σ ) = 1

2minus 1

πea

int +infin

0

1

texp

minusint

R

4radic

1 + 4t2(x minus σ)2

times cos

[1

2arctan(2t(x minus σ))

]

α(dx)

times sin

int

R

4radic

1 + 4t2(x minus σ)2

times sin

[1

2arctan(2t(x minus σ))

]

α(dx)

dt

(13)

Table 5 Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes in the Galaxy Data Example

n = 82 k le 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10 k ge 11

Dirichlet Prior 2140 2060 214 169 106 0548 0237 0088 0038a = 1 Posterior 0465 125 2524 2484 2011 08 019 0276

NndashIG Prior 1309 0617 0627 0621 0607 0586 0562 0534 4537a = 0829 Posterior 003 0754 168 2301 2338 1225 0941 0352 0379

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1289

Figure 6 Posterior Density Estimates for the Galaxy Dataset [ NndashIG (a = 0829) Dirichlet (a = 2)]

for any σ isin R Before providing the posterior density of a meanof an NndashIG process it is useful to introduce the LiouvillendashWeylfractional integral defined as

Inc+h(σ ) =

int σ

c

(σ minus u)nminus1

(n minus 1) h(u)du

for n ge 1 and set equal to the identity operator for n = 0 for anyc lt σ Moreover let Re z and Im z denote the real and imagi-nary part of z isin C

Proposition 5 If P is an NndashIG process with diffuse parame-ter α and minusinfin le c = inf supp(α) then the posterior density ofthe mean is of the form

ρx(n) (σ ) = 1

πInminus1c+ Imψ(σ) if n is even (14)

and

ρx(n) (σ ) = minus1

πInminus1c+ Reψ(σ) if n is odd (15)

with

ψ(σ) = (n minus 1)2nminus1int infin

0 tnminus1eminus intR

radic1minusit2(xminusσ)α(dx)

a2nminus(2+k)sumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

timeskprod

j=1

(1 minus it2(xlowast

j minus σ))minusnj+12 dt

6 CONCLUDING REMARKS

In this article we have studied some of the statistical im-plications of the use of the NndashIG process as an alternative tothe Dirichlet process Both priors are almost surely discreteallow for explicit derivation of their finite-dimensional distri-bution and lead to tractable expressions of relevant quantitiesTheir natural use is in the context of mixture modeling where

they present remarkably different behaviors Indeed it has beenshown that the prior distribution on the number of componentsinduced by the NndashIG process is wider than that induced by theDirichlet process This seems to mitigate the influence of thechoice of the total mass a of the parameter measure Moreoverthe mixture of NndashIG process seems to be more aggressive in re-ducingdetecting clusters on the basis of ties present in the dataSuch conclusions are however preliminary and still need to bevalidated by more extensive applications which will constitutethe object of our future research

APPENDIX PROOFS

A1 Proof of Proposition 1

Having (3) at hand the computation of (4) is as follows One oper-ates the transformation Wi = Vi(

sumni=1 Vi)

minus1 for i = 1 n minus 1 andWn = sumn

j=1 Vj Some algebra and formula 347112 of Gradshteyn andRhyzik (2000) leads to the desired result

A2 Proof of Proposition 2

In proving this result we do not use the distribution of an NndashIGrandom variable directly The key point is exploiting the indepen-dence of the variables (V1 Vn) used for defining (W1 Wn)Set V = sumn

j=1 Vj and Vminusi = sumj =i Vj and recall that the moment-

generating function of an IG(α γ ) distributed random variable is given

by E[eminusλX] = eminusa(radic

2λ+γ 2minusγ ) As far as the mean is concerned forany i = 1 n we have

E[Wi] = E[ViVminus1]

=int +infin

0E[eminusuVminusi ]E[eminusuVi Vi]du

=int +infin

0E[eminusuVminusi ]E

[

minus d

dueminusuVi

]

du

= minusint +infin

0eminus(aminusαi)(

radic2u+1minus1) d

dueminusαi(

radic2u+1minus1) du

1290 Journal of the American Statistical Association December 2005

= αi

int +infin0

(2u + 1)minus12eminusa(radic

2u+1minus1) du

= αi

a

int +infin0

minus d

dueminusa(

radic2u+1minus1) du = αi

a= pi

having applied Fubinirsquos theorem and theorem 168 of Billingsley(1995) The variance and covariance can be deduced by similar ar-guments combined with integration by parts and some tedious algebra

A3 Proof of Proposition 3

From Pitman (2003) (see also James 2002 Pruumlnster 2002) we candeduce that the predictive distribution associated with an NndashIG processwith diffuse parameter α is of the form

P(Xn+1 isin middot∣∣X(n)

) = w(n) α(middot)a

+ 1

n

ksum

j=1

w(n)j δXlowast

j(middot)

with weights equal to

w(n) = aintR+ uneminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronk (u)micro1(u)du

nintR+ unminus1eminusa(

radic2u+1minus1)micron1 (u) middot middot middotmicronk (u)du

and

w(n)j =

intR+ uneminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronj+1(u) middot middot middotmicronk (u)du

intR+ unminus1eminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronk (u)du

where micron(u) = intR+ vneminusuvν(dv) for any positive u and n = 12

and ν(dv) = (2πv3)minus12eminusv2 dv We can easily verify that micron(u) =2nminus1(n minus 1

2 )(radic

π [2u + 1]nminus12)minus1 and moreover that

int

R+unminus1eminusa(

radic2u+1minus1)micron1 (u) middot middot middotmicronk (u)du

= 2nminuskea

(radic

π)k

kprod

j=1

(nj minus 12)

int +infin0

unminus1eminusaradic

2u+1

[2u + 1]nminusk2du

By the change of variableradic

2u + 1 = y the latter expression is equalto

2nminuskea

(radic

π)k

kprod

j=1

(nj minus 12)

1

2nminus1

int +infin1

(y2 minus 1)nminus1eminusay

y2nminuskminus1dy

= ea

2kminus1(radic

π)k

kprod

j=1

(nj minus 12)

timesnminus1sum

r=0

(n minus 1

r

)

(minus1)nminus1minusrint +infin

1

eminusay

y2nminuskminus2rminus1dy

= ea

2kminus1(radic

π )k

kprod

j=1

(nj minus 12)

timesnminus1sum

r=0

(n minus 1

r

)

(minus1)nminus1minusra2nminus2minuskminus2r(k + 2r minus 2n + 2a)

Now the result follows by rearranging the terms appropriately com-bined with some algebra

A4 Proof of Proposition 4

According to the proof of Proposition 3 the joint distribution of thedistinct observations (Xlowast

1 Xlowastk ) and the random partition coincides

with

α(dx1) middot middot middotα(dxk)ea(minusa2)nminus1

ak2kminus1 πk2(n)

kprod

j=1

(

nj minus 1

2

)

timesnminus1sum

r=0

(n minus 1

r

)

(minusa2)minusr(k + 2 + 2r minus 2na)

The conditional distribution of the vector (kn1 nk) given n canbe obtained by marginalizing with respect to (Xlowast

1 Xlowastk ) thus yield-

ing

ea(minusa2)nminus1

2kminus1πk2(n)

kprod

j=1

(

nj minus 1

2

)

timesnminus1sum

r=0

(n minus 1

r

)

(minusa2)minusr(k + 2 + 2r minus 2na) (A1)

To determine p(k|n) one needs to compute

sum

(lowast)

kprod

j=1

(

nj minus 1

2

)

where (lowast) means that the sum extends over all partitions of the set ofintegers 1 n into k komponents

sum

(lowast)

kprod

j=1

(

nj minus 1

2

)

= πk2sum

(lowast)

kprod

j=1

(1

2

)

njminus1

=(

2n minus k minus 1n minus 1

)(n)

(k)22kminus2n

where for any positive b (b)n = (b + n)(b) and the result fol-lows

A5 Proof of Proposition 5

Let m = Ami i = 1 km be a sequence of partitions con-structed in a similar fashion as was done in section 4 of Regazzini et al(2003) We accordingly discretize α by setting αm = sumkm

i=1 αmi δsmi where αmi = α(Ami) and smi is a point in Ami Hence wheneverthe jth element in the sample Xj belongs to Amij it is as if we ob-served smij Then if we apply proposition 3 of Regazzini et al (2003)some algebra leads to express the posterior density function for the dis-cretized mean as (14) and (15) with

ψm(σ ) = minusCm(x(n)

)ea2nminusk

radicπ

k

( kprod

j=1

αmij(nj minus 12)

)

timesint +infin

0

(tnminus1eminussumkm

j=1

radic1minusit2(sjminusσ)αmj

)

times( kprod

r=1

[(1 minus it2

(sir minus σ

))nrminus12

+ gm(αmir smir t

)])minus1

dt (A2)

where Cm(x(n))minus1 is the marginal distribution of the discretized sam-ple and gm(αmir smir t) = O(α2

mir) as m rarr +infin for any t On the

basis of proposition 4 of Regazzini et al (2003) one could use the

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1291

previous expression as an approximation to the actual distribution be-cause it converges almost surely in the weak sense along the tree ofpartitions m But in the particular NndashIG case we are able to com-pute Cm(x(n))minus1 by mimiking the technique used in Proposition 1 andexploiting the diffuseness of α Thus we have

Cm(x(n)

)minus1 = ea2nminusk

(n minus 1)radicπk

( kprod

j=1

αmij(nj minus 12)

)

timesint

(0+infin)

unminus1eminusaradic

2u+1

prodkr=1[[1 + 2u]nrminus12 + hm(αmir u)] du

where hm(αmir u) = O(α2mir

) as m rarr +infin for any u Insert the pre-vious expression in (A2) and simplify Then apply theorem 357 ofBillingsley (1995) and dominated convergence to obtain that the limit-ing posterior density is given by (14) and (15) with

ψ(σ) = minus (n minus 1) int infin0 tnminus1eminus int

R

radic1minusit2(xminusσ)α(dx)

int +infin0 unminus1eminusa

radic2u+1(

radic2u + 1 )minusn+k2 du

timeskprod

j=1

(1 minus it2(xlowast

j minus σ))minusnj+12 dt

Now arguments similar to those in the proof of Proposition 3 lead tothe desired result

[Received February 2004 Revised December 2004]

REFERENCES

Antoniak C E (1974) ldquoMixtures of Dirichlet Processes With Applications toBayesian Nonparametric Problemsrdquo The Annals of Statistics 2 1152ndash1174

Billingsley P (1995) Probability and Measure (3rd ed) New York WileyBilodeau M and Brenner D (1999) Theory of Multivariate Statistics New

York Springer-VerlagBlackwell D (1973) ldquoDiscreteness of Ferguson Selectionsrdquo The Annals of

Statistics 1 356ndash358Brunner L J Chan A T James L F and Lo A Y (2001) ldquoWeighted

Chinese Restaurant Processes and Bayesian Mixture Modelsrdquo unpublishedmanuscript

Cifarelli D M and Regazzini E (1990) ldquoDistribution Functions of Means ofa Dirichlet Processrdquo The Annals of Statistics 18 429ndash442

Dey D Muumlller P and Sinha D (eds) (1998) Practical Nonparametric andSemiparametric Bayesian Statistics Lecture Notes in Statistics 133 NewYork Springer-Verlag

Diaconis P and Kemperman J (1996) ldquoSome New Tools for Dirichlet Pri-orsrdquo in Bayesian Statistics 5 eds J M Bernardo J O Berger A P Dawidand A F M Smith New York Oxford University Press pp 97ndash106

Dickey J M and Jiang T J (1998) ldquoFiltered-Variate Prior Distributions forHistogram Smoothingrdquo Journal of the American Statistical Association 93651ndash662

Epifani I Lijoi A and Pruumlnster I (2003) ldquoExponential Functionals andMeans of Neutral-to-the-Right Priorsrdquo Biometrika 90 791ndash808

Escobar M D (1988) ldquoEstimating the Means of Several Normal Populationsby Nonparametric Estimation of the Distribution of the Meansrdquo unpublisheddoctoral dissertation Yale University Dept of Statistics

(1994) ldquoEstimating Normal Means With a Dirichlet Process PriorrdquoJournal of the American Statistical Association 89 268ndash277

Escobar M D and West M (1995) ldquoBayesian Density Estimation and In-ference Using Mixturesrdquo Journal of the American Statistical Association 90577ndash588

Ferguson T S (1973) ldquoA Bayesian Analysis of Some Nonparametric Prob-lemsrdquo The Annals of Statistics 1 209ndash230

(1974) ldquoPrior Distributions on Spaces of Probability Measuresrdquo TheAnnals of Statistics 2 615ndash629

(1983) ldquoBayesian Density Estimation by Mixtures of Normal Distrib-utionsrdquo in Recent Advances in Statistics eds M H Rizvi J S Rustagi andD Siegmund New York Academic Press pp 287ndash302

Freedman D A (1963) ldquoOn the Asymptotic Behavior of Bayesrsquo Estimates inthe Discrete Caserdquo The Annals of Mathematical Statistics 34 1386ndash1403

Gradshteyn I S and Ryzhik J M (2000) Table of Integrals Series andProducts (6th ed) New York Academic Press

Green P J and Richardson S (2001) ldquoModelling Heterogeneity Withand Without the Dirichlet Processrdquo Scandinavian Journal of Statistics 28355ndash375

Halmos P (1944) ldquoRandom Almsrdquo The Annals of Mathematical Statistics 15182ndash189

Hjort N L (2003) ldquoTopics in Non-Parametric Bayesian Statisticsrdquo in HighlyStructured Stochastic Systems eds P J Green et al Oxford UK OxfordUniversity Press pp 455ndash478

Ishwaran H and James L F (2001) ldquoGibbs Sampling Methods for Stick-Breaking Priorsrdquo Journal of American Statistical Association 96 161ndash173

(2003) ldquoGeneralized Weighted Chinese Restaurant Processes forSpecies Sampling Mixture Modelsrdquo Statistica Sinica 13 1211ndash1235

James L F (2002) ldquoPoisson Process Partition Calculus With Applications toExchangeable Models and Bayesian Nonparametricsrdquo Mathematics ArXivmathPR0205093

(2003) ldquoA Simple Proof of the Almost Sure Discreteness of a Class ofRandom Measuresrdquo Statistics amp Probability Letters 65 363ndash368

Kingman J F C (1975) ldquoRandom Discrete Distributionsrdquo Journal of theRoyal Statistical Society Ser B 37 1ndash22

Lavine M (1992) ldquoSome Aspects of Poacutelya Tree Distributions for StatisticalModellingrdquo The Annals of Statistics 20 1222ndash1235

Lo A Y (1984) ldquoOn a Class of Bayesian Nonparametric Estimates I DensityEstimatesrdquo The Annals of Statistics 12 351ndash357

Mauldin R D Sudderth W D and Williams S C (1992) ldquoPoacutelya Trees andRandom Distributionsrdquo The Annals of Statistics 20 1203ndash1221

MacEachern S N (1994) ldquoEstimating Normal Means With a Conjugate StyleDirichlet Process Priorrdquo Communications in Statistics Simulation and Com-putation 23 727ndash741

MacEachern S N and Muumlller P (1998) ldquoEstimating Mixture of Dirich-let Process Modelsrdquo Journal of Computational and Graphical Statistics 7223ndash239

Nieto-Barajas L E Pruumlnster I and Walker S G (2004) ldquoNormalized Ran-dom Measures Driven by Increasing Additive Processesrdquo The Annals of Sta-tistics 32 2343ndash2360

Paddock S M Ruggeri F Lavine M and West M (2003) ldquoRandomizedPolya Tree Models for Nonparametric Bayesian Inferencerdquo Statistica Sinica13 443ndash460

Perman M Pitman J and Yor M (1992) ldquoSize-Biased Sampling of PoissonPoint Processes and Excursionsrdquo Probability Theory and Related Fields 9221ndash39

Petrone S and Veronese P (2002) ldquoNonparametric Mixture Priors Based onan Exponential Random Schemerdquo Statistical Methods and Applications 111ndash20

Pitman J (1995) ldquoExchangeable and Partially Exchangeable Random Parti-tionsrdquo Probability Theory and Related Fields 102 145ndash158

(1996) ldquoSome Developments of the BlackwellndashMacQueen UrnSchemerdquo in Statistics Probability and Game Theory Papers in Honor ofDavid Blackwell eds T S Ferguson et al Hayward CA Institute of Math-ematical Statistics pp 245ndash267

(2003) ldquoPoissonndashKingman Partitionsrdquo in Science and StatisticsA Festschrift for Terry Speed ed D R Goldstein Hayward CA Instituteof Mathematical Statistics pp 1ndash35

Pruumlnster I (2002) ldquoRandom Probability Measures Derived From IncreasingAdditive Processes and Their Application to Bayesian Statisticsrdquo unpub-lished doctoral thesis University of Pavia

Regazzini E (2001) ldquoFoundations of Bayesian Statistics and Some Theory ofBayesian Nonparametric Methodsrdquo lecture notes Stanford University

Regazzini E Guglielmi A and Di Nunno G (2002) ldquoTheory and NumericalAnalysis for Exact Distribution of Functionals of a Dirichlet Processrdquo TheAnnals of Statistics 30 1376ndash1411

Regazzini E Lijoi A and Pruumlnster I (2003) ldquoDistributional Results forMeans of Random Measures With Independent Incrementsrdquo The Annals ofStatistics 31 560ndash585

Roeder K (1990) ldquoDensity Estimation With Confidence Sets Exemplified bySuperclusters and Voids in the Galaxiesrdquo Journal of the American StatisticalAssociation 85 617ndash624

Roeder K and Wasserman L (1997) ldquoPractical Bayesian Density EstimationUsing Mixtures of Normalsrdquo Journal of the American Statistical Association92 894ndash902

Seshadri V (1993) The Inverse Gaussian Distribution New York Oxford Uni-versity Press

Walker S and Damien P (1998) ldquoA Full Bayesian Non-Parametric AnalysisInvolving a Neutral to the Right Processrdquo Scandinavia Journal of Statistics25 669ndash680

Walker S Damien P Laud P W and Smith A F M (1999) ldquoBayesianNonparametric Inference for Random Distributions and Related Functionsrdquo(with discussion) Journal of the Royal Statistical Society Ser B 61485ndash527

Page 9: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

1286 Journal of the American Statistical Association December 2005

Figure 4 Prior Probabilities for k (n = 100) Corresponding to the Dirichlet and NndashIG Processes [ Dirichlet ( a = 2) NndashIG (a = 2365)]

matching variances For comparative purposes in this mixture-modeling framework it seems more plausible to start with sim-ilar priors for the number of components k Hence a and a aresuch that the mode of p(middot|n) is the same in both the Dirichletand NndashIG cases We choose to set the mode equal to k = 8 inboth cases yielding a = 2 and a = 2365 The correspondingdistributions are plotted in Figure 4

In this setup we draw a sample from the posterior distribu-tion L (X(100)|Y(100)) by iteratively drawing samples from thedistributions of (Xi|X(100)

minusi Y(100)) for i = 1 100 given by

P(Xi isin middot∣∣X(100)

minusi Y(100))

= qlowasti0N

(

Xi isin middot∣∣∣t2Yi + Y

1 + t2

t2

t2 + 1

)

+kisum

j=1

qlowastijδXlowast

j(middot) (12)

where X(100)minusi is the vector X(100) with the ith coordinate deleted

and ki refers to the number of Xlowastj rsquos in X(100)

minusi The mixing pro-portions are given by

qlowasti0 prop w(99)

i0 N(middot|Y t2 + 1)

and

qlowastij prop

(nj minus 12)w(99)i1

N(middot|Xlowast

j 1)

subject to the constraintsumki

j=0 qlowastij = 1 The values for w(99)

i0and w(99)

i1 are given as in Proposition 3 when applied for de-termining the ldquopredictiverdquo distribution of Xi given X(100)

minusi theyare computed with a precision of 14 decimal places Note thatin general w(99)

i0 and w(99)i1 depend only on a n and ki and not

on the different partition ξ of n = nξ1 + middot middot middot + nξk This implies

that a table containing the values for w(99)i0 and w(99)

i1 for a givena n and k = 1 99 can be generated in advance for use inthe Poacutelya urn Gibbs sampler Further details are available onrequest from the authors

To obtain our estimates we resorted to the Poacutelya urn Gibbssampler such as the one set forth by Escobar and West (1995)As pointed out by Ishwaran and James (2001) such a gener-alized Poacutelya urn characterization also can be considered forthe NndashIG case the only ingredients being the weights of thepredictive distribution determined in Proposition 3 All of thefollowing results were obtained by implementing the MCMCalgorithm using 10000 iterations after 2000 burn-in sweepsFigure 5 depicts the posterior density estimates and the truedensity that generated the data

The predictive density corresponding to the mixture of NndashIGprocess clearly represents a more accurate estimate Table 3provides the prior and posterior probabilities of the number ofcomponents The prior probabilities are computed according toProposition 4 whereas the posterior probabilities follow fromthe MCMC output One may note that the posterior mass as-signed to k = 3 by the NndashIG process is significantly higher thanthat corresponding to the MDP

As far as the log-Bayes factors are concerned one obtainsBFDα

= 546 and BFNndashIGα = 497 thus favoring the MDP Thereason of this outcome seems to be due to the fact that the priorprobability on the correct number of components is much lowerfor the MDP

It is interesting to look also at the case in which the modeof p(k|n) corresponds to the correct number of componentsklowast = 3 In such a case the prior probability on klowast = 3 is muchhigher for the MDP being 285 against 0567 of the mixture ofNndashIG process This results in a posterior probability on klowast = 3of 9168 for the MDP and 86007 for the mixture of NndashIGprocess whereas the log-Bayes factors favor the NndashIG mixturebeing BFDα

= 332 and BFNminusIGα = 463 This again seems tobe due to the fact that the prior probability of one process onthe correct number of components is much lower than the onecorresponding to the other

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1287

Figure 5 Posterior Density Estimates for the Mixture of Three Normals With Means minus4 0 8 Variance 1 and Mixing Proportions 33 34and 33 [ Dirichlet ( a = 2) model NndashIG (a = 2365)]

The next dataset Y(120) that we deal with in this frameworkis drawn from a uniform mixture of six normal distributionswith means minus10minus6minus226 and 10 and variance 7 Both theDirichlet and NndashIG processes are again centered at the sameprior guess N(middot|Y t2) where t represents for the data rangeSet a = 01 and a = 5 such that p(k|n) have mode at k = 3for both the MDP and mixture of NndashIG process Such an ex-ample is meant to shed some light on the ability of the mix-ture of NndashIG process to detect clusters when starting from aprior with mode at a small number of clusters and data comingfrom a mixture with more components In this case the mix-ture of NndashIG process performs better than the MDP both interms of posterior probabilities as shown in Table 4 and alsoin terms of log-Bayes factors because BFDα

= 313 whereasBFNminusIGα = 378

If one considers data drawn from mixtures with more com-ponents then the mixture of NndashIG process does systematicallybetter in terms of posterior probabilities but the phenom-enon of ldquorelatively low probability on the correct number ofcomponentsrdquo of the MDP reappears leading it to be favoredin terms of Bayes factors For instance we have taken 120data from an uniform mixture of eight normals with meansminus14minus10minus6minus22610 and 14 and variance 7 and seta = 01 and a = 5 so the p(k|120) have mode at k = 3 Thisyields a priori probabilities on klowast = 8 of 00568 and 0489 and

a posteriori probabilities of 0937 and 4338 for the MDP andmixture of NndashIG process In contrast the log-Bayes factorsare BFDα

= 290 and BFNminusIGα = 273 One can alternativelymatch the a priori probability on the correct number of compo-nents such that p(8|120) = 04765 for both priors This happensif we set a = 86 which results in the MDP having mode atk = 4 closer to the correct number klowast = 8 whereas the mixtureof NndashIG process remains unchanged This results in a poste-rior probability on klowast = 8 for the MDP of 1134 and havingremoved the influence of the low probability p(8|120) in a log-Bayes factor of 94

4.2 Galaxy Data

In this section we reanalyze the galaxy dataset popularized by Roeder (1990). These data, consisting of the relative velocities (in km/sec) of 82 galaxies, have been widely studied within the framework of Bayesian statistics (see, e.g., Roeder and Wasserman 1997; Green and Richardson 2001; Petrone and Veronese 2002). For comparison purposes, we use the following semiparametric setting, also analyzed by Escobar and West (1995):

(Y_i | m_i, V_i)  ind∼  N(Y_i | m_i, V_i),   i = 1, …, 82,
(m_i, V_i | P̃)  iid∼  P̃,
P̃ ∼ 𝒫,

Table 3. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|100) Has Mode at k = 8

n = 100                   k ≤ 2    k = 3    k = 4    k = 5    k = 6    k = 7    k = 8    k ≥ 9
Dirichlet    Prior       .00225   .00997   .03057   .06692   .11198   .14974   .16501   .46356
(a = 2)      Posterior            .7038    .2509    .0406    .0046    .0001
N–IG         Prior       .02404   .0311    .0419    .04937   .05386   .05611   .05677   .68685
(a = 2.365)  Posterior            .8223    .1612    .0160    .0005


Table 4. Prior and Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes Such That p(k|120) Has Mode at k = 3

n = 120                   k = 1    k = 2    k = 3    k = 4    k = 5    k = 6    k = 7    k = 8     k ≥ 9
Dirichlet    Prior       .08099   .21706   .27433   .21954   .12583   .05532   .01949   .005678   .00176
(a = 0.5)    Posterior                              .0148    .3455    .5722    .064     .0033     .0002
N–IG         Prior       .04389   .05096   .05169   .05142   .05082   .04997   .0489    .04765    .6047
(a = 0.1)    Posterior                                       .0085    .6981    .2395    .0479     .006

where, as before, 𝒫 represents either the N–IG process or the Dirichlet process. Again the prior guess P₀ is the same for both nonparametric priors and is characterized by

P₀(A) = ∫_A N(x | μ, τ v⁻¹) Ga(v | s/2, S/2) dx dv,

where A ∈ B(ℝ × ℝ⁺) and Ga(·|c, d) is the density corresponding to a gamma distribution with mean c/d. Similar to Escobar and West (1995), we assume a further hierarchy for μ and τ, namely μ ∼ N(·|g, h) and τ⁻¹ ∼ Ga(·|w, W). We choose the hyperparameters for this illustration to fit those used by Escobar and West (1995), namely g = 0, h = 0.01, w = 1, and W = 100. For comparative purposes, we again choose a to achieve the mode match: the mode of p(·|82) is at k = 5 for the Dirichlet process with a = 1, which leads to a = 0.829 for the N–IG process. From Table 5 it is apparent that the N–IG process provides a noninformative prior for k compared with the distribution induced by the Dirichlet process.

The details of the Pólya urn Gibbs sampler are as provided by Escobar and West (1995), with the only difference being the replacement of the weights with those corresponding to the N–IG process, given in Proposition 3; a numerical sketch of these weights follows below. Figure 6 shows the posterior density estimates based on 10,000 iterations, considered after a burn-in period of 2,000 sweeps. Table 5 displays the prior and the posterior probabilities of the number of components in the mixture.
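Because both the weight attached to a new value and those attached to the already observed distinct values are ratios of one-dimensional integrals, they are simple to evaluate numerically. The following is a minimal sketch (ours, not the authors' code), based on the integral representations recorded in Appendix A.3 with μ_n(u) in closed form; all function names are ours.

    # Sketch (ours): N-IG predictive weights of Proposition 3, evaluated by
    # quadrature from the integral representations recorded in Appendix A.3.
    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gammaln

    def log_mu(n, u):
        # mu_n(u) = 2^{n-1} Gamma(n - 1/2) / (sqrt(pi) (2u+1)^{n-1/2})
        return ((n - 1) * np.log(2.0) + gammaln(n - 0.5)
                - 0.5 * np.log(np.pi) - (n - 0.5) * np.log(2.0 * u + 1.0))

    def integral(a, counts, power):
        # int_0^inf u^power e^{-a(sqrt(2u+1)-1)} prod_j mu_{n_j}(u) du
        def f(u):
            logf = power * np.log(u) - a * (np.sqrt(2.0 * u + 1.0) - 1.0)
            return np.exp(logf + sum(log_mu(nj, u) for nj in counts))
        return quad(f, 0.0, np.inf, limit=200)[0]

    def nig_weights(a, counts):
        # counts = (n_1, ..., n_k): sizes of the k clusters among n observations
        n = sum(counts)
        den = n * integral(a, counts, n - 1)
        w_new = a * integral(a, list(counts) + [1], n) / den
        w_old = [n * integral(a, [c + (i == j) for i, c in enumerate(counts)], n) / den
                 for j in range(len(counts))]
        return w_new, w_old

    w_new, w_old = nig_weights(a=1.0, counts=[3, 2, 1])
    print(w_new + sum(w_old) / 6.0)   # total predictive mass; must equal 1

The printed quantity is the total predictive mass, w^(n) + (1/n) Σ_j w_j^(n), which must equal 1 and thus provides a quick check of the quadrature.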

Some comments on the results in Table 5 are in order. Escobar and West (1995) extensively discussed the issue of learning about the total mass a. This is motivated by the sensitivity of posterior inferences to the choice of the parameter a. Hence they randomized a and specified a prior distribution for it. It seems worth noting that the posterior distribution of the number of components for the N–IG process with a fixed a essentially coincides with the corresponding distribution arising from an MDP with random a, given in table 6 of Escobar and West (1995). Some explanations of this phenomenon might be that the prior p(k|n) is essentially noninformative in a broad neighborhood of its mode, thus mitigating the influence of a on posterior inferences, and that the aggressiveness in reducing/detecting clusters of the N–IG mixture, shown in the previous example, is also a plausible reason for such a gain in "parsimony." However, for a mixture of N–IG process it may be of interest to have a random a. To achieve this, a Metropolis step must be added to the algorithm; one possible form is sketched below. In principle this is straightforward. The only drawback would be an increase in the computational burden, because computing the weights for the N–IG process is, after all, not as quick as computing those corresponding to the Dirichlet process.
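To fix ideas, the following sketch shows one possible form of such a Metropolis step; this is our illustration, not the authors' algorithm. It exploits the fact that, given the configuration of ties, a enters the model only through the distribution of the random partition, whose a-dependent factor can be read off (A.1) in the Appendix; the prior log density log_prior and the step size are left unspecified.

    # Sketch (ours): Metropolis update for a random total mass a, using the
    # a-dependent factor of the partition distribution (A.1).
    import math, random
    from mpmath import mp, mpf, binomial, gammainc, log as mplog

    mp.dps = 50

    def log_partition_factor(a, n, k):
        # log of e^a sum_r C(n-1,r)(-1)^{n-1-r} a^{2(n-1-r)} Gamma(k+2+2r-2n, a)
        a = mpf(a)
        s = sum(binomial(n - 1, r) * (-1)**(n - 1 - r) * a**(2 * (n - 1 - r))
                * gammainc(k + 2 + 2 * r - 2 * n, a) for r in range(n))
        return float(a + mplog(s))

    def metropolis_a(a, n, k, log_prior, step=0.2):
        prop = a * math.exp(random.gauss(0.0, step))   # log-scale random walk
        log_ratio = (log_partition_factor(prop, n, k) + log_prior(prop)
                     - log_partition_factor(a, n, k) - log_prior(a)
                     + math.log(prop) - math.log(a))   # Jacobian of the walk
        return prop if math.log(random.random()) < log_ratio else a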

5. THE MEAN OF A NORMALIZED INVERSE–GAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for inference with continuous data is represented by histogram smoothing. Such a problem can be handled by exploiting the so-called "filtered-variate" random probability measures, as defined by Dickey and Jiang (1998). These quantities essentially coincide with suitable means of random probability measures. Here we focus attention on means of N–IG processes. After the recent breakthrough achieved by Cifarelli and Regazzini (1990), this has become a very active area of research in Bayesian nonparametrics (see, e.g., Diaconis and Kemperman 1996 and Regazzini, Guglielmi, and Di Nunno 2002 for the Dirichlet case, and Epifani, Lijoi, and Prünster 2003, Hjort 2003, and James 2002 for results beyond the Dirichlet case). In particular, Regazzini et al. (2003) dealt with means of normalized priors, providing conditions for their existence and their exact prior and approximate posterior distributions. Here we specialize their general results to the N–IG process and derive the exact posterior density.

Letting X = ℝ and 𝒳 = B(ℝ), we study the distribution of the mean ∫_ℝ x P̃(dx). The first issue to face is its finiteness. A necessary and sufficient condition for this to hold can be derived from proposition 1 of Regazzini et al. (2003) and is given by

∫_ℝ √(2λ|x| + 1) α(dx) < +∞   for every λ > 0.

As far as the problem of the distribution of a mean is concerned, proposition 2 of Regazzini et al. (2003) and some calculations lead to an expression of the prior distribution function of the mean as

F(σ) = 1/2 − (e^a/π) ∫₀^{+∞} t⁻¹ exp{ −∫_ℝ [1 + 4t²(x − σ)²]^{1/4} cos[(1/2) arctan(2t(x − σ))] α(dx) }
          × sin{ ∫_ℝ [1 + 4t²(x − σ)²]^{1/4} sin[(1/2) arctan(2t(x − σ))] α(dx) } dt   (13)

for any σ ∈ ℝ.
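For a concrete sense of (13), the following sketch evaluates it by nested quadrature under the illustrative assumption α = a N(·|0, 1); names, cutoff, and tolerances are ours, and the formula is the reconstruction displayed above. Because this α is symmetric around 0, the inner sine integral vanishes at σ = 0, so F(0) = 1/2 provides an immediate check.

    # Numerical sketch of (13), assuming alpha = a N(.|0, 1).
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    a = 1.0  # total mass, illustrative

    def F(sigma):
        def inner(t, trig):
            # int [1+4t^2(x-s)^2]^{1/4} trig((1/2) arctan(2t(x-s))) alpha(dx)
            g = lambda x: ((1.0 + 4.0 * t**2 * (x - sigma)**2) ** 0.25
                           * trig(0.5 * np.arctan(2.0 * t * (x - sigma)))
                           * a * norm.pdf(x))
            return quad(g, -np.inf, np.inf, limit=200)[0]
        def outer(t):
            return np.exp(-inner(t, np.cos)) * np.sin(inner(t, np.sin)) / t
        # the integrand decays like exp(-c sqrt(t)), so a finite cutoff suffices
        val = quad(outer, 0.0, 50.0, limit=200)[0]
        return 0.5 - np.exp(a) * val / np.pi

    print(F(0.0))  # symmetry of alpha around 0 forces F(0) = 1/2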

Table 5. Posterior Probabilities for k Corresponding to the Dirichlet and N–IG Processes in the Galaxy Data Example

n = 82                    k ≤ 3    k = 4    k = 5    k = 6    k = 7    k = 8    k = 9    k = 10   k ≥ 11
Dirichlet    Prior       .2140    .2060    .214     .169     .106     .0548    .0237    .0088    .0038
(a = 1)      Posterior   .0465    .125     .2524    .2484    .2011    .08      .019     .0276
N–IG         Prior       .1309    .0617    .0627    .0621    .0607    .0586    .0562    .0534    .4537
(a = 0.829)  Posterior   .003     .0754    .168     .2301    .2338    .1225    .0941    .0352    .0379


Figure 6. Posterior Density Estimates for the Galaxy Dataset [N–IG (a = 0.829); Dirichlet (a = 2)].

Before providing the posterior density of a mean of an N–IG process, it is useful to introduce the Liouville–Weyl fractional integral, defined as

Iⁿ_{c+} h(σ) = ∫_c^σ [(σ − u)^{n−1}/(n − 1)!] h(u) du

for n ≥ 1 and set equal to the identity operator for n = 0, for any c < σ. Moreover, let Re z and Im z denote the real and imaginary parts of z ∈ ℂ.

Proposition 5. If P̃ is an N–IG process with diffuse parameter α and −∞ ≤ c = inf supp(α), then the posterior density of the mean is of the form

ρ_{x^(n)}(σ) = (1/π) I^{n−1}_{c+} Im ψ(σ)   if n is even   (14)

and

ρ_{x^(n)}(σ) = −(1/π) I^{n−1}_{c+} Re ψ(σ)   if n is odd,   (15)

with

ψ(σ) = (n − 1)! 2^{n−1} ∫₀^{+∞} [ t^{n−1} e^{−∫_ℝ √(1 − 2it(x − σ)) α(dx)} / ( a^{2n−(2+k)} Σ_{r=0}^{n−1} C(n − 1, r) (−a²)^{−r} Γ(k + 2 + 2r − 2n, a) ) ] ∏_{j=1}^k (1 − 2it(x*_j − σ))^{−n_j+1/2} dt.

6. CONCLUDING REMARKS

In this article we have studied some of the statistical implications of the use of the N–IG process as an alternative to the Dirichlet process. Both priors are almost surely discrete, allow for explicit derivation of their finite-dimensional distributions, and lead to tractable expressions of relevant quantities. Their natural use is in the context of mixture modeling, where they present remarkably different behaviors. Indeed, it has been shown that the prior distribution on the number of components induced by the N–IG process is wider than that induced by the Dirichlet process. This seems to mitigate the influence of the choice of the total mass a of the parameter measure. Moreover, the mixture of N–IG process seems to be more aggressive in reducing/detecting clusters on the basis of ties present in the data. Such conclusions are, however, preliminary and still need to be validated by more extensive applications, which will constitute the object of our future research.

APPENDIX: PROOFS

A.1 Proof of Proposition 1

Having (3) at hand, the computation of (4) is as follows. One operates the transformation W_i = V_i (Σ_{i=1}^n V_i)⁻¹ for i = 1, …, n − 1 and W_n = Σ_{j=1}^n V_j. Some algebra and formula 3.471.12 of Gradshteyn and Ryzhik (2000) lead to the desired result.

A.2 Proof of Proposition 2

In proving this result we do not use the distribution of an N–IG random variable directly. The key point is exploiting the independence of the variables (V₁, …, V_n) used for defining (W₁, …, W_n). Set V = Σ_{j=1}^n V_j and V_{−i} = Σ_{j≠i} V_j, and recall that the moment-generating function of an IG(α, γ) distributed random variable is given by E[e^{−λX}] = e^{−α(√(2λ+γ²)−γ)}. As far as the mean is concerned, for any i = 1, …, n we have

E[W_i] = E[V_i V⁻¹]
       = ∫₀^{+∞} E[e^{−uV_{−i}}] E[e^{−uV_i} V_i] du
       = ∫₀^{+∞} E[e^{−uV_{−i}}] E[−(d/du) e^{−uV_i}] du
       = −∫₀^{+∞} e^{−(a−α_i)(√(2u+1)−1)} (d/du) e^{−α_i(√(2u+1)−1)} du
       = α_i ∫₀^{+∞} (2u + 1)^{−1/2} e^{−a(√(2u+1)−1)} du
       = (α_i/a) ∫₀^{+∞} −(d/du) e^{−a(√(2u+1)−1)} du = α_i/a = p_i,

having applied Fubini's theorem and theorem 16.8 of Billingsley (1995). The variance and covariance can be deduced by similar arguments, combined with integration by parts and some tedious algebra.
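The identity E[W_i] = α_i/a can also be checked by simulation, because (W₁, …, W_n) is obtained by normalizing independent IG draws, exactly as in the proof of Proposition 1. A minimal sketch follows (ours); it assumes that IG(α, 1) coincides with numpy's Wald distribution with mean α and scale α², which agrees with the moment-generating function above.

    # Sketch: Monte Carlo check of E[W_i] = alpha_i / a (Proposition 2), using
    # the normalization of independent IG draws from the proof of Proposition 1.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([0.5, 1.5, 2.0])   # alpha_i = alpha(A_i); here a = 4.0
    a = alpha.sum()

    # IG(alpha, 1) = Wald(mean=alpha, scale=alpha^2)
    V = np.column_stack([rng.wald(al, al**2, size=200_000) for al in alpha])
    W = V / V.sum(axis=1, keepdims=True)   # draws of (W_1, ..., W_n)

    print(W.mean(axis=0))   # approximately alpha / a
    print(alpha / a)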

A.3 Proof of Proposition 3

From Pitman (2003) (see also James 2002; Prünster 2002), we can deduce that the predictive distribution associated with an N–IG process with diffuse parameter α is of the form

P(X_{n+1} ∈ · | X^(n)) = w^(n) α(·)/a + (1/n) Σ_{j=1}^k w_j^(n) δ_{X*_j}(·),

with weights equal to

w^(n) = a ∫_{ℝ⁺} u^n e^{−a(√(2u+1)−1)} μ_{n₁}(u) ··· μ_{n_k}(u) μ₁(u) du / [ n ∫_{ℝ⁺} u^{n−1} e^{−a(√(2u+1)−1)} μ_{n₁}(u) ··· μ_{n_k}(u) du ]

and

w_j^(n) = ∫_{ℝ⁺} u^n e^{−a(√(2u+1)−1)} μ_{n₁}(u) ··· μ_{n_j+1}(u) ··· μ_{n_k}(u) du / ∫_{ℝ⁺} u^{n−1} e^{−a(√(2u+1)−1)} μ_{n₁}(u) ··· μ_{n_k}(u) du,

where μ_n(u) = ∫_{ℝ⁺} vⁿ e^{−uv} ν(dv) for any positive u and n = 1, 2, …, and ν(dv) = (2πv³)^{−1/2} e^{−v/2} dv. We can easily verify that μ_n(u) = 2^{n−1} Γ(n − 1/2)(√π [2u + 1]^{n−1/2})⁻¹ and, moreover, that

∫_{ℝ⁺} u^{n−1} e^{−a(√(2u+1)−1)} μ_{n₁}(u) ··· μ_{n_k}(u) du
  = (2^{n−k} e^a/(√π)^k) ∏_{j=1}^k Γ(n_j − 1/2) ∫₀^{+∞} u^{n−1} e^{−a√(2u+1)} [2u + 1]^{−(n−k/2)} du.

By the change of variable √(2u + 1) = y, the latter expression is equal to

(2^{n−k} e^a/(√π)^k) ∏_{j=1}^k Γ(n_j − 1/2) (1/2^{n−1}) ∫₁^{+∞} (y² − 1)^{n−1} e^{−ay} y^{−(2n−k−1)} dy
  = (e^a/(2^{k−1}(√π)^k)) ∏_{j=1}^k Γ(n_j − 1/2) Σ_{r=0}^{n−1} C(n − 1, r) (−1)^{n−1−r} ∫₁^{+∞} e^{−ay} y^{−(2n−k−2r−1)} dy
  = (e^a/(2^{k−1}(√π)^k)) ∏_{j=1}^k Γ(n_j − 1/2) Σ_{r=0}^{n−1} C(n − 1, r) (−1)^{n−1−r} a^{2n−2−k−2r} Γ(k + 2r − 2n + 2, a).

Now the result follows by rearranging the terms appropriately, combined with some algebra.

A.4 Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations (X*₁, …, X*_k) and the random partition coincides with

α(dx₁) ··· α(dx_k) [e^a (−a²)^{n−1} / (a^k 2^{k−1} π^{k/2} Γ(n))] ∏_{j=1}^k Γ(n_j − 1/2) Σ_{r=0}^{n−1} C(n − 1, r) (−a²)^{−r} Γ(k + 2 + 2r − 2n, a).

The conditional distribution of the vector (k, n₁, …, n_k), given n, can be obtained by marginalizing with respect to (X*₁, …, X*_k), thus yielding

[e^a (−a²)^{n−1} / (2^{k−1} π^{k/2} Γ(n))] ∏_{j=1}^k Γ(n_j − 1/2) Σ_{r=0}^{n−1} C(n − 1, r) (−a²)^{−r} Γ(k + 2 + 2r − 2n, a).   (A.1)

To determine p(k|n), one needs to compute

Σ_(*) ∏_{j=1}^k Γ(n_j − 1/2),

where (*) means that the sum extends over all partitions of the set of integers {1, …, n} into k components:

Σ_(*) ∏_{j=1}^k Γ(n_j − 1/2) = π^{k/2} Σ_(*) ∏_{j=1}^k (1/2)_{n_j−1} = π^{k/2} C(2n − k − 1, n − 1) [Γ(n)/Γ(k)] 2^{2k−2n},

where, for any positive b, (b)_n = Γ(b + n)/Γ(b), and the result follows.
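Combining (A.1) with the displayed partition sum yields the closed form

p(k|n) = e^a 2^{k−2n+1} C(2n − k − 1, n − 1) Γ(k)⁻¹ Σ_{r=0}^{n−1} C(n − 1, r) (−1)^{n−1−r} a^{2(n−1−r)} Γ(k + 2 + 2r − 2n, a),

which is the distribution used for the mode matching in Section 4. A sketch of its evaluation (ours, not the authors' code) is given below; mpmath's incomplete gamma function accepts the nonpositive first arguments that arise here, and the check that the probabilities sum to 1 guards against the heavy cancellation in the alternating sum.

    # Sketch (ours): the prior p(k|n) of the N-IG process, assembled from
    # (A.1) and the partition sum above.
    from mpmath import mp, mpf, binomial, gamma, gammainc, exp, power

    mp.dps = 50   # the alternating sum cancels heavily; use high precision

    def p_k_given_n(k, n, a):
        a = mpf(a)
        s = sum(binomial(n - 1, r) * (-1)**(n - 1 - r) * a**(2 * (n - 1 - r))
                * gammainc(k + 2 + 2 * r - 2 * n, a)   # upper Gamma(., a)
                for r in range(n))
        return exp(a) * power(2, k - 2 * n + 1) \
               * binomial(2 * n - k - 1, n - 1) / gamma(k) * s

    n, a = 10, 1.0
    probs = [p_k_given_n(k, n, a) for k in range(1, n + 1)]
    print(sum(probs))                                        # must equal 1
    print(max(range(1, n + 1), key=lambda k: probs[k - 1]))  # prior mode of k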

A.5 Proof of Proposition 5

Let 𝒫_m = {A_{m,i}: i = 1, …, k_m} be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize α by setting α_m = Σ_{i=1}^{k_m} α_{m,i} δ_{s_{m,i}}, where α_{m,i} = α(A_{m,i}) and s_{m,i} is a point in A_{m,i}. Hence, whenever the jth element in the sample, X_j, belongs to A_{m,i_j}, it is as if we observed s_{m,i_j}. Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to express the posterior density function for the discretized mean as (14) and (15) with

ψ_m(σ) = −C_m(x^(n)) e^a 2^{n−k} (√π)^{−k} ( ∏_{j=1}^k α_{m,i_j} Γ(n_j − 1/2) )
    × ∫₀^{+∞} t^{n−1} e^{−Σ_{j=1}^{k_m} √(1 − 2it(s_{m,j} − σ)) α_{m,j}}
    × ( ∏_{r=1}^k [ (1 − 2it(s_{m,i_r} − σ))^{n_r−1/2} + g_m(α_{m,i_r}, s_{m,i_r}, t) ] )⁻¹ dt,   (A.2)

where C_m(x^(n))⁻¹ is the marginal distribution of the discretized sample and g_m(α_{m,i_r}, s_{m,i_r}, t) = O(α²_{m,i_r}) as m → +∞ for any t. On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions {𝒫_m}. But in the particular N–IG case we are able to compute C_m(x^(n))⁻¹ by mimicking the technique used in Proposition 1 and exploiting the diffuseness of α. Thus we have

C_m(x^(n))⁻¹ = [e^a 2^{n−k} / ((n − 1)! (√π)^k)] ( ∏_{j=1}^k α_{m,i_j} Γ(n_j − 1/2) )
    × ∫_{(0,+∞)} u^{n−1} e^{−a√(2u+1)} / ∏_{r=1}^k [ (1 + 2u)^{n_r−1/2} + h_m(α_{m,i_r}, u) ] du,

where h_m(α_{m,i_r}, u) = O(α²_{m,i_r}) as m → +∞ for any u. Insert the previous expression in (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15) with

ψ(σ) = −[(n − 1)! / ∫₀^{+∞} u^{n−1} e^{−a√(2u+1)} (2u + 1)^{k/2−n} du] ∫₀^{+∞} t^{n−1} e^{−∫_ℝ √(1 − 2it(x − σ)) α(dx)} ∏_{j=1}^k (1 − 2it(x*_j − σ))^{−n_j+1/2} dt.

Now arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.
Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.
Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.
Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.
Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.
Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.
Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.
Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.
Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.
——— (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.
Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.
Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.
——— (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.
——— (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.
Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.
Gradshteyn, I. S., and Ryzhik, I. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.
Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.
Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.
Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.
Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.
——— (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.
James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.
——— (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.
Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.
Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.
Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates: I. Density Estimates," The Annals of Statistics, 12, 351–357.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.
MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.
MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.
Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.
Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.
Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.
Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.
Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.
——— (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.
——— (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.
Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.
Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.
Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.
Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.
Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.
Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.
Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.
Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.
Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.

Page 10: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1287

Figure 5 Posterior Density Estimates for the Mixture of Three Normals With Means minus4 0 8 Variance 1 and Mixing Proportions 33 34and 33 [ Dirichlet ( a = 2) model NndashIG (a = 2365)]

The next dataset Y(120) that we deal with in this frameworkis drawn from a uniform mixture of six normal distributionswith means minus10minus6minus226 and 10 and variance 7 Both theDirichlet and NndashIG processes are again centered at the sameprior guess N(middot|Y t2) where t represents for the data rangeSet a = 01 and a = 5 such that p(k|n) have mode at k = 3for both the MDP and mixture of NndashIG process Such an ex-ample is meant to shed some light on the ability of the mix-ture of NndashIG process to detect clusters when starting from aprior with mode at a small number of clusters and data comingfrom a mixture with more components In this case the mix-ture of NndashIG process performs better than the MDP both interms of posterior probabilities as shown in Table 4 and alsoin terms of log-Bayes factors because BFDα

= 313 whereasBFNminusIGα = 378

If one considers data drawn from mixtures with more com-ponents then the mixture of NndashIG process does systematicallybetter in terms of posterior probabilities but the phenom-enon of ldquorelatively low probability on the correct number ofcomponentsrdquo of the MDP reappears leading it to be favoredin terms of Bayes factors For instance we have taken 120data from an uniform mixture of eight normals with meansminus14minus10minus6minus22610 and 14 and variance 7 and seta = 01 and a = 5 so the p(k|120) have mode at k = 3 Thisyields a priori probabilities on klowast = 8 of 00568 and 0489 and

a posteriori probabilities of 0937 and 4338 for the MDP andmixture of NndashIG process In contrast the log-Bayes factorsare BFDα

= 290 and BFNminusIGα = 273 One can alternativelymatch the a priori probability on the correct number of compo-nents such that p(8|120) = 04765 for both priors This happensif we set a = 86 which results in the MDP having mode atk = 4 closer to the correct number klowast = 8 whereas the mixtureof NndashIG process remains unchanged This results in a poste-rior probability on klowast = 8 for the MDP of 1134 and havingremoved the influence of the low probability p(8|120) in a log-Bayes factor of 94

42 Galaxy Data

In this section we reanalyze the galaxy dataset popularizedby Roeder (1990) These data consisting of the relative ve-locities (in kmsec) of 82 galaxies have been widely studiedwithin the framework of Bayesian statistics (see eg Roederand Wasserman 1997 Green and Richardson 2001 Petrone andVeronese 2002) For comparison purposes we use the follow-ing semiparametric setting also analyzed by Escobar and West(1995)

(Yi|miVi)indsim N(Yi|miVi) i = 1 82

(miVi|P)iidsim P

P sim P

Table 3 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|100) Has Mode at k = 8

n = 100 k le 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 00225 00997 03057 06692 11198 14974 16501 46356a = 2 Posterior 7038 2509 0406 0046 0001

NndashIG Prior 02404 0311 0419 04937 05386 05611 05677 68685a = 2365 Posterior 8223 1612 0160 0005

1288 Journal of the American Statistical Association December 2005

Table 4 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|120) Has Mode at 3

n = 120 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 08099 21706 27433 21954 12583 05532 01949 005678 00176a = 5 Posterior 0148 3455 5722 064 0033 0002

NndashIG Prior 04389 05096 05169 05142 05082 04997 0489 04765 6047a = 01 Posterior 0085 6981 2395 0479 006

where as before P represents either the NndashIG process or theDirichlet process Again the prior guess P0 is the same for bothnonparametric priors and is characterized by

P0(A) =int

AN(x|microτvminus1)Ga(v|s2S2)dx dv

where A isin B(RtimesR+) and Ga(middot|cd) is the density correspond-

ing to a gamma distribution with mean cd Similar to Escobarand West (1995) we assume a further hierarchy for micro and τ namely micro sim N(middot|gh) and τminus1 sim Ga(middot|wW) We choose thehyperparameters for this illustration to fit those used by Escobarand West (1995) namely g = 0h = 001w = 1 and W = 100For comparative purposes we again choose a to achieve themode match if the mode of p(middot|82) is in k = 5 as it is for theDirichlet process with a = 1 then a = 0829 From Table 5 it isapparent that the NndashIG process provides a noninformative priorfor k compared with the distribution induced by the Dirichletprocess

The details of the Poacutelya urn Gibbs sampler are as provided byEscobar and West (1995) with the only difference being the re-placement of the weights with those corresponding to the NndashIGprocess given in Proposition 3 Figure 6 shows the posteriordensity estimates based on 10000 iterations considered after aburnndashin period of 2000 sweeps Table 5 displays the prior andthe posterior probabilities of the number of components in themixture

Some comments on the results in Table 5 are in orderEscobar and West (1995) extensively discussed the issue oflearning about the total mass a This is motivated by the sen-sitivity of posterior inferences on the choice of the parameter aHence they randomized a and specified a prior distributionfor it It seems worth noting that the posterior distributionof the number of components for the NndashIG process with afixed a essentially coincides with the corresponding distribu-tion arising from an MDP with random a given in table 6 ofEscobar and West (1995) Some explanations of this phenom-enon might be that the prior p(k|n) is essentially noninforma-tive in a broad neighborhood of its mode thus mitigating theinfluence of a on posterior inferences and the aggressivenessin reducingdetecting clusters of the NndashIG mixture shown inthe previous example seems also to be a plausible reason ofsuch a gain in ldquoparsimonyrdquo However for a mixture of NndashIGprocess it may be of interest to have a random a To achievethis a Metropolis step must be added in the algorithm In prin-ciple this is straightforward The only drawback would be an

increase in the computational burden because computing theweights for the NndashIG process is after all not as quick as com-puting those corresponding to the Dirichlet process

5 THE MEAN OF A NORMALIZEDINVERSEndashGAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for infer-ence with continuous data is represented by histogram smooth-ing Such a problem can be handled by exploiting the so-calledldquofiltered-variaterdquo random probability measures as defined byDickey and Jiang (1998) These quantities essentially coincidewith suitable means of random probability measures Here wefocus attention on means of NndashIG processes After the recentbreakthrough achieved by Cifarelli and Regazzini (1990) thishas become a very active area of research in Bayesian nonpara-metrics (see eg Diaconis and Kemperman 1996 RegazziniGuglielmi and Di Nunno 2002 for the Dirichlet case andEpifani Lijoi and Pruumlnster 2003 Hjort 2003 James 2002for results beyond the Dirichlet case) In particular Regazziniet al (2003) dealt with means of normalized priors providingconditions for their existence their exact prior and approximateposterior distribution Here we specialize their general results tothe NndashIG process and derive the exact posterior density

Letting X = R and X = B(R) we study the distributionof the mean

intR

xP(dx) The first issue to face is its finitenessA necessary and sufficient condition for this to hold can be de-rived from proposition 1 of Regazzini et al (2003) and is givenby

intR

radic2λx + 1α(dx) lt +infin for every λ gt 0

As far as the problem of the distribution of a mean is con-cerned proposition 2 of Regazzini et al (2003) and some cal-culations lead to an expression of the prior distribution functionof the mean as

F(σ ) = 1

2minus 1

πea

int +infin

0

1

texp

minusint

R

4radic

1 + 4t2(x minus σ)2

times cos

[1

2arctan(2t(x minus σ))

]

α(dx)

times sin

int

R

4radic

1 + 4t2(x minus σ)2

times sin

[1

2arctan(2t(x minus σ))

]

α(dx)

dt

(13)

Table 5 Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes in the Galaxy Data Example

n = 82 k le 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10 k ge 11

Dirichlet Prior 2140 2060 214 169 106 0548 0237 0088 0038a = 1 Posterior 0465 125 2524 2484 2011 08 019 0276

NndashIG Prior 1309 0617 0627 0621 0607 0586 0562 0534 4537a = 0829 Posterior 003 0754 168 2301 2338 1225 0941 0352 0379

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1289

Figure 6 Posterior Density Estimates for the Galaxy Dataset [ NndashIG (a = 0829) Dirichlet (a = 2)]

for any σ isin R Before providing the posterior density of a meanof an NndashIG process it is useful to introduce the LiouvillendashWeylfractional integral defined as

Inc+h(σ ) =

int σ

c

(σ minus u)nminus1

(n minus 1) h(u)du

for n ge 1 and set equal to the identity operator for n = 0 for anyc lt σ Moreover let Re z and Im z denote the real and imagi-nary part of z isin C

Proposition 5 If P is an NndashIG process with diffuse parame-ter α and minusinfin le c = inf supp(α) then the posterior density ofthe mean is of the form

ρx(n) (σ ) = 1

πInminus1c+ Imψ(σ) if n is even (14)

and

ρx(n) (σ ) = minus1

πInminus1c+ Reψ(σ) if n is odd (15)

with

ψ(σ) = (n minus 1)2nminus1int infin

0 tnminus1eminus intR

radic1minusit2(xminusσ)α(dx)

a2nminus(2+k)sumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

timeskprod

j=1

(1 minus it2(xlowast

j minus σ))minusnj+12 dt

6 CONCLUDING REMARKS

In this article we have studied some of the statistical im-plications of the use of the NndashIG process as an alternative tothe Dirichlet process Both priors are almost surely discreteallow for explicit derivation of their finite-dimensional distri-bution and lead to tractable expressions of relevant quantitiesTheir natural use is in the context of mixture modeling where

they present remarkably different behaviors Indeed it has beenshown that the prior distribution on the number of componentsinduced by the NndashIG process is wider than that induced by theDirichlet process This seems to mitigate the influence of thechoice of the total mass a of the parameter measure Moreoverthe mixture of NndashIG process seems to be more aggressive in re-ducingdetecting clusters on the basis of ties present in the dataSuch conclusions are however preliminary and still need to bevalidated by more extensive applications which will constitutethe object of our future research

APPENDIX PROOFS

A1 Proof of Proposition 1

Having (3) at hand the computation of (4) is as follows One oper-ates the transformation Wi = Vi(

sumni=1 Vi)

minus1 for i = 1 n minus 1 andWn = sumn

j=1 Vj Some algebra and formula 347112 of Gradshteyn andRhyzik (2000) leads to the desired result

A2 Proof of Proposition 2

In proving this result we do not use the distribution of an NndashIGrandom variable directly The key point is exploiting the indepen-dence of the variables (V1 Vn) used for defining (W1 Wn)Set V = sumn

j=1 Vj and Vminusi = sumj =i Vj and recall that the moment-

generating function of an IG(α γ ) distributed random variable is given

by E[eminusλX] = eminusa(radic

2λ+γ 2minusγ ) As far as the mean is concerned forany i = 1 n we have

E[Wi] = E[ViVminus1]

=int +infin

0E[eminusuVminusi ]E[eminusuVi Vi]du

=int +infin

0E[eminusuVminusi ]E

[

minus d

dueminusuVi

]

du

= minusint +infin

0eminus(aminusαi)(

radic2u+1minus1) d

dueminusαi(

radic2u+1minus1) du

1290 Journal of the American Statistical Association December 2005

= αi

int +infin0

(2u + 1)minus12eminusa(radic

2u+1minus1) du

= αi

a

int +infin0

minus d

dueminusa(

radic2u+1minus1) du = αi

a= pi

having applied Fubinirsquos theorem and theorem 168 of Billingsley(1995) The variance and covariance can be deduced by similar ar-guments combined with integration by parts and some tedious algebra

A3 Proof of Proposition 3

From Pitman (2003) (see also James 2002 Pruumlnster 2002) we candeduce that the predictive distribution associated with an NndashIG processwith diffuse parameter α is of the form

P(Xn+1 isin middot∣∣X(n)

) = w(n) α(middot)a

+ 1

n

ksum

j=1

w(n)j δXlowast

j(middot)

with weights equal to

w(n) = aintR+ uneminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronk (u)micro1(u)du

nintR+ unminus1eminusa(

radic2u+1minus1)micron1 (u) middot middot middotmicronk (u)du

and

w(n)j =

intR+ uneminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronj+1(u) middot middot middotmicronk (u)du

intR+ unminus1eminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronk (u)du

where micron(u) = intR+ vneminusuvν(dv) for any positive u and n = 12

and ν(dv) = (2πv3)minus12eminusv2 dv We can easily verify that micron(u) =2nminus1(n minus 1

2 )(radic

π [2u + 1]nminus12)minus1 and moreover that

int

R+unminus1eminusa(

radic2u+1minus1)micron1 (u) middot middot middotmicronk (u)du

= 2nminuskea

(radic

π)k

kprod

j=1

(nj minus 12)

int +infin0

unminus1eminusaradic

2u+1

[2u + 1]nminusk2du

By the change of variableradic

2u + 1 = y the latter expression is equalto

2nminuskea

(radic

π)k

kprod

j=1

(nj minus 12)

1

2nminus1

int +infin1

(y2 minus 1)nminus1eminusay

y2nminuskminus1dy

= ea

2kminus1(radic

π)k

kprod

j=1

(nj minus 12)

timesnminus1sum

r=0

(n minus 1

r

)

(minus1)nminus1minusrint +infin

1

eminusay

y2nminuskminus2rminus1dy

= ea

2kminus1(radic

π )k

kprod

j=1

(nj minus 12)

timesnminus1sum

r=0

(n minus 1

r

)

(minus1)nminus1minusra2nminus2minuskminus2r(k + 2r minus 2n + 2a)

Now the result follows by rearranging the terms appropriately com-bined with some algebra

A4 Proof of Proposition 4

According to the proof of Proposition 3 the joint distribution of thedistinct observations (Xlowast

1 Xlowastk ) and the random partition coincides

with

α(dx1) middot middot middotα(dxk)ea(minusa2)nminus1

ak2kminus1 πk2(n)

kprod

j=1

(

nj minus 1

2

)

timesnminus1sum

r=0

(n minus 1

r

)

(minusa2)minusr(k + 2 + 2r minus 2na)

The conditional distribution of the vector (kn1 nk) given n canbe obtained by marginalizing with respect to (Xlowast

1 Xlowastk ) thus yield-

ing

ea(minusa2)nminus1

2kminus1πk2(n)

kprod

j=1

(

nj minus 1

2

)

timesnminus1sum

r=0

(n minus 1

r

)

(minusa2)minusr(k + 2 + 2r minus 2na) (A1)

To determine p(k|n) one needs to compute

sum

(lowast)

kprod

j=1

(

nj minus 1

2

)

where (lowast) means that the sum extends over all partitions of the set ofintegers 1 n into k komponents

sum

(lowast)

kprod

j=1

(

nj minus 1

2

)

= πk2sum

(lowast)

kprod

j=1

(1

2

)

njminus1

=(

2n minus k minus 1n minus 1

)(n)

(k)22kminus2n

where for any positive b (b)n = (b + n)(b) and the result fol-lows

A5 Proof of Proposition 5

Let m = Ami i = 1 km be a sequence of partitions con-structed in a similar fashion as was done in section 4 of Regazzini et al(2003) We accordingly discretize α by setting αm = sumkm

i=1 αmi δsmi where αmi = α(Ami) and smi is a point in Ami Hence wheneverthe jth element in the sample Xj belongs to Amij it is as if we ob-served smij Then if we apply proposition 3 of Regazzini et al (2003)some algebra leads to express the posterior density function for the dis-cretized mean as (14) and (15) with

ψm(σ ) = minusCm(x(n)

)ea2nminusk

radicπ

k

( kprod

j=1

αmij(nj minus 12)

)

timesint +infin

0

(tnminus1eminussumkm

j=1

radic1minusit2(sjminusσ)αmj

)

times( kprod

r=1

[(1 minus it2

(sir minus σ

))nrminus12

+ gm(αmir smir t

)])minus1

dt (A2)

where Cm(x(n))minus1 is the marginal distribution of the discretized sam-ple and gm(αmir smir t) = O(α2

mir) as m rarr +infin for any t On the

basis of proposition 4 of Regazzini et al (2003) one could use the

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1291

previous expression as an approximation to the actual distribution be-cause it converges almost surely in the weak sense along the tree ofpartitions m But in the particular NndashIG case we are able to com-pute Cm(x(n))minus1 by mimiking the technique used in Proposition 1 andexploiting the diffuseness of α Thus we have

Cm(x(n)

)minus1 = ea2nminusk

(n minus 1)radicπk

( kprod

j=1

αmij(nj minus 12)

)

timesint

(0+infin)

unminus1eminusaradic

2u+1

prodkr=1[[1 + 2u]nrminus12 + hm(αmir u)] du

where hm(αmir u) = O(α2mir

) as m rarr +infin for any u Insert the pre-vious expression in (A2) and simplify Then apply theorem 357 ofBillingsley (1995) and dominated convergence to obtain that the limit-ing posterior density is given by (14) and (15) with

ψ(σ) = minus (n minus 1) int infin0 tnminus1eminus int

R

radic1minusit2(xminusσ)α(dx)

int +infin0 unminus1eminusa

radic2u+1(

radic2u + 1 )minusn+k2 du

timeskprod

j=1

(1 minus it2(xlowast

j minus σ))minusnj+12 dt

Now arguments similar to those in the proof of Proposition 3 lead tothe desired result

[Received February 2004 Revised December 2004]

REFERENCES

Antoniak C E (1974) ldquoMixtures of Dirichlet Processes With Applications toBayesian Nonparametric Problemsrdquo The Annals of Statistics 2 1152ndash1174

Billingsley P (1995) Probability and Measure (3rd ed) New York WileyBilodeau M and Brenner D (1999) Theory of Multivariate Statistics New

York Springer-VerlagBlackwell D (1973) ldquoDiscreteness of Ferguson Selectionsrdquo The Annals of

Statistics 1 356ndash358Brunner L J Chan A T James L F and Lo A Y (2001) ldquoWeighted

Chinese Restaurant Processes and Bayesian Mixture Modelsrdquo unpublishedmanuscript

Cifarelli D M and Regazzini E (1990) ldquoDistribution Functions of Means ofa Dirichlet Processrdquo The Annals of Statistics 18 429ndash442

Dey D Muumlller P and Sinha D (eds) (1998) Practical Nonparametric andSemiparametric Bayesian Statistics Lecture Notes in Statistics 133 NewYork Springer-Verlag

Diaconis P and Kemperman J (1996) ldquoSome New Tools for Dirichlet Pri-orsrdquo in Bayesian Statistics 5 eds J M Bernardo J O Berger A P Dawidand A F M Smith New York Oxford University Press pp 97ndash106

Dickey J M and Jiang T J (1998) ldquoFiltered-Variate Prior Distributions forHistogram Smoothingrdquo Journal of the American Statistical Association 93651ndash662

Epifani I Lijoi A and Pruumlnster I (2003) ldquoExponential Functionals andMeans of Neutral-to-the-Right Priorsrdquo Biometrika 90 791ndash808

Escobar M D (1988) ldquoEstimating the Means of Several Normal Populationsby Nonparametric Estimation of the Distribution of the Meansrdquo unpublisheddoctoral dissertation Yale University Dept of Statistics

(1994) ldquoEstimating Normal Means With a Dirichlet Process PriorrdquoJournal of the American Statistical Association 89 268ndash277

Escobar M D and West M (1995) ldquoBayesian Density Estimation and In-ference Using Mixturesrdquo Journal of the American Statistical Association 90577ndash588

Ferguson T S (1973) ldquoA Bayesian Analysis of Some Nonparametric Prob-lemsrdquo The Annals of Statistics 1 209ndash230

(1974) ldquoPrior Distributions on Spaces of Probability Measuresrdquo TheAnnals of Statistics 2 615ndash629

(1983) ldquoBayesian Density Estimation by Mixtures of Normal Distrib-utionsrdquo in Recent Advances in Statistics eds M H Rizvi J S Rustagi andD Siegmund New York Academic Press pp 287ndash302

Freedman D A (1963) ldquoOn the Asymptotic Behavior of Bayesrsquo Estimates inthe Discrete Caserdquo The Annals of Mathematical Statistics 34 1386ndash1403

Gradshteyn I S and Ryzhik J M (2000) Table of Integrals Series andProducts (6th ed) New York Academic Press

Green P J and Richardson S (2001) ldquoModelling Heterogeneity Withand Without the Dirichlet Processrdquo Scandinavian Journal of Statistics 28355ndash375

Halmos P (1944) ldquoRandom Almsrdquo The Annals of Mathematical Statistics 15182ndash189

Hjort N L (2003) ldquoTopics in Non-Parametric Bayesian Statisticsrdquo in HighlyStructured Stochastic Systems eds P J Green et al Oxford UK OxfordUniversity Press pp 455ndash478

Ishwaran H and James L F (2001) ldquoGibbs Sampling Methods for Stick-Breaking Priorsrdquo Journal of American Statistical Association 96 161ndash173

(2003) ldquoGeneralized Weighted Chinese Restaurant Processes forSpecies Sampling Mixture Modelsrdquo Statistica Sinica 13 1211ndash1235

James L F (2002) ldquoPoisson Process Partition Calculus With Applications toExchangeable Models and Bayesian Nonparametricsrdquo Mathematics ArXivmathPR0205093

(2003) ldquoA Simple Proof of the Almost Sure Discreteness of a Class ofRandom Measuresrdquo Statistics amp Probability Letters 65 363ndash368

Kingman J F C (1975) ldquoRandom Discrete Distributionsrdquo Journal of theRoyal Statistical Society Ser B 37 1ndash22

Lavine M (1992) ldquoSome Aspects of Poacutelya Tree Distributions for StatisticalModellingrdquo The Annals of Statistics 20 1222ndash1235

Lo A Y (1984) ldquoOn a Class of Bayesian Nonparametric Estimates I DensityEstimatesrdquo The Annals of Statistics 12 351ndash357

Mauldin R D Sudderth W D and Williams S C (1992) ldquoPoacutelya Trees andRandom Distributionsrdquo The Annals of Statistics 20 1203ndash1221

MacEachern S N (1994) ldquoEstimating Normal Means With a Conjugate StyleDirichlet Process Priorrdquo Communications in Statistics Simulation and Com-putation 23 727ndash741

MacEachern S N and Muumlller P (1998) ldquoEstimating Mixture of Dirich-let Process Modelsrdquo Journal of Computational and Graphical Statistics 7223ndash239

Nieto-Barajas L E Pruumlnster I and Walker S G (2004) ldquoNormalized Ran-dom Measures Driven by Increasing Additive Processesrdquo The Annals of Sta-tistics 32 2343ndash2360

Paddock S M Ruggeri F Lavine M and West M (2003) ldquoRandomizedPolya Tree Models for Nonparametric Bayesian Inferencerdquo Statistica Sinica13 443ndash460

Perman M Pitman J and Yor M (1992) ldquoSize-Biased Sampling of PoissonPoint Processes and Excursionsrdquo Probability Theory and Related Fields 9221ndash39

Petrone S and Veronese P (2002) ldquoNonparametric Mixture Priors Based onan Exponential Random Schemerdquo Statistical Methods and Applications 111ndash20

Pitman J (1995) ldquoExchangeable and Partially Exchangeable Random Parti-tionsrdquo Probability Theory and Related Fields 102 145ndash158

(1996) ldquoSome Developments of the BlackwellndashMacQueen UrnSchemerdquo in Statistics Probability and Game Theory Papers in Honor ofDavid Blackwell eds T S Ferguson et al Hayward CA Institute of Math-ematical Statistics pp 245ndash267

(2003) ldquoPoissonndashKingman Partitionsrdquo in Science and StatisticsA Festschrift for Terry Speed ed D R Goldstein Hayward CA Instituteof Mathematical Statistics pp 1ndash35

Pruumlnster I (2002) ldquoRandom Probability Measures Derived From IncreasingAdditive Processes and Their Application to Bayesian Statisticsrdquo unpub-lished doctoral thesis University of Pavia

Regazzini E (2001) ldquoFoundations of Bayesian Statistics and Some Theory ofBayesian Nonparametric Methodsrdquo lecture notes Stanford University

Regazzini E Guglielmi A and Di Nunno G (2002) ldquoTheory and NumericalAnalysis for Exact Distribution of Functionals of a Dirichlet Processrdquo TheAnnals of Statistics 30 1376ndash1411

Regazzini E Lijoi A and Pruumlnster I (2003) ldquoDistributional Results forMeans of Random Measures With Independent Incrementsrdquo The Annals ofStatistics 31 560ndash585

Roeder K (1990) ldquoDensity Estimation With Confidence Sets Exemplified bySuperclusters and Voids in the Galaxiesrdquo Journal of the American StatisticalAssociation 85 617ndash624

Roeder K and Wasserman L (1997) ldquoPractical Bayesian Density EstimationUsing Mixtures of Normalsrdquo Journal of the American Statistical Association92 894ndash902

Seshadri V (1993) The Inverse Gaussian Distribution New York Oxford Uni-versity Press

Walker S and Damien P (1998) ldquoA Full Bayesian Non-Parametric AnalysisInvolving a Neutral to the Right Processrdquo Scandinavia Journal of Statistics25 669ndash680

Walker S Damien P Laud P W and Smith A F M (1999) ldquoBayesianNonparametric Inference for Random Distributions and Related Functionsrdquo(with discussion) Journal of the Royal Statistical Society Ser B 61485ndash527

Page 11: Hierarchical Mixture Modeling With Normalized Inverse ... · Various extensions of the Dirichlet process have been pro-posed to overcome this drawback. For instance, Mauldin, Sudderth,

1288 Journal of the American Statistical Association December 2005

Table 4 Prior and Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes Such That p(k|120) Has Mode at 3

n = 120 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k ge 9

Dirichlet Prior 08099 21706 27433 21954 12583 05532 01949 005678 00176a = 5 Posterior 0148 3455 5722 064 0033 0002

NndashIG Prior 04389 05096 05169 05142 05082 04997 0489 04765 6047a = 01 Posterior 0085 6981 2395 0479 006

where as before P represents either the NndashIG process or theDirichlet process Again the prior guess P0 is the same for bothnonparametric priors and is characterized by

P0(A) =int

AN(x|microτvminus1)Ga(v|s2S2)dx dv

where A isin B(RtimesR+) and Ga(middot|cd) is the density correspond-

ing to a gamma distribution with mean cd Similar to Escobarand West (1995) we assume a further hierarchy for micro and τ namely micro sim N(middot|gh) and τminus1 sim Ga(middot|wW) We choose thehyperparameters for this illustration to fit those used by Escobarand West (1995) namely g = 0h = 001w = 1 and W = 100For comparative purposes we again choose a to achieve themode match if the mode of p(middot|82) is in k = 5 as it is for theDirichlet process with a = 1 then a = 0829 From Table 5 it isapparent that the NndashIG process provides a noninformative priorfor k compared with the distribution induced by the Dirichletprocess

The details of the Poacutelya urn Gibbs sampler are as provided byEscobar and West (1995) with the only difference being the re-placement of the weights with those corresponding to the NndashIGprocess given in Proposition 3 Figure 6 shows the posteriordensity estimates based on 10000 iterations considered after aburnndashin period of 2000 sweeps Table 5 displays the prior andthe posterior probabilities of the number of components in themixture

Some comments on the results in Table 5 are in orderEscobar and West (1995) extensively discussed the issue oflearning about the total mass a This is motivated by the sen-sitivity of posterior inferences on the choice of the parameter aHence they randomized a and specified a prior distributionfor it It seems worth noting that the posterior distributionof the number of components for the NndashIG process with afixed a essentially coincides with the corresponding distribu-tion arising from an MDP with random a given in table 6 ofEscobar and West (1995) Some explanations of this phenom-enon might be that the prior p(k|n) is essentially noninforma-tive in a broad neighborhood of its mode thus mitigating theinfluence of a on posterior inferences and the aggressivenessin reducingdetecting clusters of the NndashIG mixture shown inthe previous example seems also to be a plausible reason ofsuch a gain in ldquoparsimonyrdquo However for a mixture of NndashIGprocess it may be of interest to have a random a To achievethis a Metropolis step must be added in the algorithm In prin-ciple this is straightforward The only drawback would be an

increase in the computational burden because computing theweights for the NndashIG process is after all not as quick as com-puting those corresponding to the Dirichlet process

5 THE MEAN OF A NORMALIZEDINVERSEndashGAUSSIAN PROCESS

An alternative use of discrete nonparametric priors for infer-ence with continuous data is represented by histogram smooth-ing Such a problem can be handled by exploiting the so-calledldquofiltered-variaterdquo random probability measures as defined byDickey and Jiang (1998) These quantities essentially coincidewith suitable means of random probability measures Here wefocus attention on means of NndashIG processes After the recentbreakthrough achieved by Cifarelli and Regazzini (1990) thishas become a very active area of research in Bayesian nonpara-metrics (see eg Diaconis and Kemperman 1996 RegazziniGuglielmi and Di Nunno 2002 for the Dirichlet case andEpifani Lijoi and Pruumlnster 2003 Hjort 2003 James 2002for results beyond the Dirichlet case) In particular Regazziniet al (2003) dealt with means of normalized priors providingconditions for their existence their exact prior and approximateposterior distribution Here we specialize their general results tothe NndashIG process and derive the exact posterior density

Letting X = R and X = B(R) we study the distributionof the mean

intR

xP(dx) The first issue to face is its finitenessA necessary and sufficient condition for this to hold can be de-rived from proposition 1 of Regazzini et al (2003) and is givenby

intR

radic2λx + 1α(dx) lt +infin for every λ gt 0

As far as the problem of the distribution of a mean is con-cerned proposition 2 of Regazzini et al (2003) and some cal-culations lead to an expression of the prior distribution functionof the mean as

F(σ ) = 1

2minus 1

πea

int +infin

0

1

texp

minusint

R

4radic

1 + 4t2(x minus σ)2

times cos

[1

2arctan(2t(x minus σ))

]

α(dx)

times sin

int

R

4radic

1 + 4t2(x minus σ)2

times sin

[1

2arctan(2t(x minus σ))

]

α(dx)

dt

(13)

Table 5 Posterior Probabilities for k Corresponding to the Dirichlet and NndashIG Processes in the Galaxy Data Example

n = 82 k le 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10 k ge 11

Dirichlet Prior 2140 2060 214 169 106 0548 0237 0088 0038a = 1 Posterior 0465 125 2524 2484 2011 08 019 0276

NndashIG Prior 1309 0617 0627 0621 0607 0586 0562 0534 4537a = 0829 Posterior 003 0754 168 2301 2338 1225 0941 0352 0379

Lijoi Mena and Pruumlnster Normalized Inverse-Gaussian Process 1289

Figure 6 Posterior Density Estimates for the Galaxy Dataset [ NndashIG (a = 0829) Dirichlet (a = 2)]

for any σ isin R Before providing the posterior density of a meanof an NndashIG process it is useful to introduce the LiouvillendashWeylfractional integral defined as

Inc+h(σ ) =

int σ

c

(σ minus u)nminus1

(n minus 1) h(u)du

for n ge 1 and set equal to the identity operator for n = 0 for anyc lt σ Moreover let Re z and Im z denote the real and imagi-nary part of z isin C

Proposition 5 If P is an NndashIG process with diffuse parame-ter α and minusinfin le c = inf supp(α) then the posterior density ofthe mean is of the form

ρx(n) (σ ) = 1

πInminus1c+ Imψ(σ) if n is even (14)

and

ρx(n) (σ ) = minus1

πInminus1c+ Reψ(σ) if n is odd (15)

with

ψ(σ) = (n minus 1)2nminus1int infin

0 tnminus1eminus intR

radic1minusit2(xminusσ)α(dx)

a2nminus(2+k)sumnminus1

r=0

(nminus1r

)(minusa2)minusr(k + 2 + 2r minus 2na)

timeskprod

j=1

(1 minus it2(xlowast

j minus σ))minusnj+12 dt

6 CONCLUDING REMARKS

In this article we have studied some of the statistical im-plications of the use of the NndashIG process as an alternative tothe Dirichlet process Both priors are almost surely discreteallow for explicit derivation of their finite-dimensional distri-bution and lead to tractable expressions of relevant quantitiesTheir natural use is in the context of mixture modeling where

they present remarkably different behaviors Indeed it has beenshown that the prior distribution on the number of componentsinduced by the NndashIG process is wider than that induced by theDirichlet process This seems to mitigate the influence of thechoice of the total mass a of the parameter measure Moreoverthe mixture of NndashIG process seems to be more aggressive in re-ducingdetecting clusters on the basis of ties present in the dataSuch conclusions are however preliminary and still need to bevalidated by more extensive applications which will constitutethe object of our future research

APPENDIX PROOFS

A1 Proof of Proposition 1

Having (3) at hand the computation of (4) is as follows One oper-ates the transformation Wi = Vi(

sumni=1 Vi)

minus1 for i = 1 n minus 1 andWn = sumn

j=1 Vj Some algebra and formula 347112 of Gradshteyn andRhyzik (2000) leads to the desired result

A2 Proof of Proposition 2

In proving this result we do not use the distribution of an NndashIGrandom variable directly The key point is exploiting the indepen-dence of the variables (V1 Vn) used for defining (W1 Wn)Set V = sumn

j=1 Vj and Vminusi = sumj =i Vj and recall that the moment-

generating function of an IG(α γ ) distributed random variable is given

by E[eminusλX] = eminusa(radic

2λ+γ 2minusγ ) As far as the mean is concerned forany i = 1 n we have

E[Wi] = E[ViVminus1]

=int +infin

0E[eminusuVminusi ]E[eminusuVi Vi]du

=int +infin

0E[eminusuVminusi ]E

[

minus d

dueminusuVi

]

du

= minusint +infin

0eminus(aminusαi)(

radic2u+1minus1) d

dueminusαi(

radic2u+1minus1) du

1290 Journal of the American Statistical Association December 2005

= αi

int +infin0

(2u + 1)minus12eminusa(radic

2u+1minus1) du

= αi

a

int +infin0

minus d

dueminusa(

radic2u+1minus1) du = αi

a= pi

having applied Fubinirsquos theorem and theorem 168 of Billingsley(1995) The variance and covariance can be deduced by similar ar-guments combined with integration by parts and some tedious algebra

A3 Proof of Proposition 3

From Pitman (2003) (see also James 2002 Pruumlnster 2002) we candeduce that the predictive distribution associated with an NndashIG processwith diffuse parameter α is of the form

P(Xn+1 isin middot∣∣X(n)

) = w(n) α(middot)a

+ 1

n

ksum

j=1

w(n)j δXlowast

j(middot)

with weights equal to

w(n) = aintR+ uneminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronk (u)micro1(u)du

nintR+ unminus1eminusa(

radic2u+1minus1)micron1 (u) middot middot middotmicronk (u)du

and

w(n)j =

intR+ uneminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronj+1(u) middot middot middotmicronk (u)du

intR+ unminus1eminusa(

radic2u+1minus1)micron1(u) middot middot middotmicronk (u)du

where micron(u) = intR+ vneminusuvν(dv) for any positive u and n = 12

and ν(dv) = (2πv3)minus12eminusv2 dv We can easily verify that micron(u) =2nminus1(n minus 1

2 )(radic

π [2u + 1]nminus12)minus1 and moreover that

\[
\begin{aligned}
&\int_{\mathbb{R}^+} u^{n-1}\, e^{-a(\sqrt{2u+1} - 1)}\, \mu_{n_1}(u) \cdots \mu_{n_k}(u)\, du \\
&\qquad = \frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \int_0^{+\infty} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{[2u+1]^{n - k/2}}\, du.
\end{aligned}
\]

By the change of variable \(\sqrt{2u+1} = y\), the latter expression is equal to
\[
\frac{2^{n-k} e^{a}}{(\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr)\, \frac{1}{2^{n-1}} \int_1^{+\infty} \frac{(y^2 - 1)^{n-1}\, e^{-ay}}{y^{2n-k-1}}\, dy
\]

\[
\begin{aligned}
&= \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r} \int_1^{+\infty} \frac{e^{-ay}}{y^{2n-k-2r-1}}\, dy \\
&= \frac{e^{a}}{2^{k-1} (\sqrt{\pi})^{k}} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \sum_{r=0}^{n-1} \binom{n-1}{r} (-1)^{n-1-r}\, a^{2n-2-k-2r}\, \Gamma(k + 2r - 2n + 2;\, a).
\end{aligned}
\]

Now the result follows by rearranging the terms appropriately, combined with some algebra.
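The pieces assembled in this proof can be checked numerically. The following sketch is our addition (all names are ours): it evaluates the predictive weights by quadrature, using the closed form for \(\mu_n\) given above, and verifies that the total predictive mass \(w^{(n)} + n^{-1} \sum_j w_j^{(n)}\) equals 1, as it must for a predictive distribution.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def mu(n, u):
    """Closed form mu_n(u) = 2^(n-1) Gamma(n - 1/2) / (sqrt(pi) (2u+1)^(n-1/2))."""
    return 2.0**(n - 1) * gamma(n - 0.5) / (np.sqrt(np.pi) * (2*u + 1)**(n - 0.5))

def predictive_weights(a, counts):
    """Predictive weights (w^(n), [w_1^(n), ..., w_k^(n)]) of the N-IG
    process for cluster sizes counts = (n_1, ..., n_k); a = total mass."""
    n = sum(counts)
    def integrand(u, cts, power):
        val = u**power * np.exp(-a*(np.sqrt(2*u + 1) - 1))
        for nj in cts:
            val *= mu(nj, u)
        return val
    denom, _ = quad(lambda u: integrand(u, list(counts), n - 1), 0, np.inf)
    num_new, _ = quad(lambda u: integrand(u, list(counts) + [1], n), 0, np.inf)
    w_new = a * num_new / (n * denom)
    w_old = []
    for j in range(len(counts)):
        cts_j = list(counts); cts_j[j] += 1
        num_j, _ = quad(lambda u: integrand(u, cts_j, n), 0, np.inf)
        w_old.append(num_j / denom)
    return w_new, w_old

w_new, w_old = predictive_weights(a=1.0, counts=(3, 1))
print(w_new + sum(w_old) / 4)          # total predictive mass, ~ 1.0
```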

A.4 Proof of Proposition 4

According to the proof of Proposition 3, the joint distribution of the distinct observations \((X_1^{*}, \ldots, X_k^{*})\) and the random partition coincides with

\[
\alpha(dx_1) \cdots \alpha(dx_k)\, \frac{e^{a} (-a^2)^{n-1}}{a^{k}\, 2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \times \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k + 2 + 2r - 2n;\, a).
\]

The conditional distribution of the vector \((k, n_1, \ldots, n_k)\), given \(n\), can be obtained by marginalizing with respect to \((X_1^{*}, \ldots, X_k^{*})\), thus yielding

\[
\frac{e^{a} (-a^2)^{n-1}}{2^{k-1}\, \pi^{k/2}\, \Gamma(n)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \times \sum_{r=0}^{n-1} \binom{n-1}{r} (-a^2)^{-r}\, \Gamma(k + 2 + 2r - 2n;\, a). \tag{A.1}
\]

To determine \(p(k|n)\), one needs to compute

\[
\sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr),
\]

where \((*)\) means that the sum extends over all partitions of the set of integers \(\{1, \ldots, n\}\) into \(k\) components:

\[
\sum_{(*)} \prod_{j=1}^{k} \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) = \pi^{k/2} \sum_{(*)} \prod_{j=1}^{k} \Bigl(\frac{1}{2}\Bigr)_{n_j - 1} = \binom{2n-k-1}{n-1} \frac{\Gamma(n)}{\Gamma(k)}\, 2^{2k-2n},
\]

where, for any positive \(b\), \((b)_n = \Gamma(b + n)/\Gamma(b)\), and the result follows.
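Both the combinatorial identity and the resulting distribution of the number of distinct observations can be checked numerically. The sketch below is our addition, with hypothetical helper names: it enumerates set partitions by brute force for a small \(n\), and then assembles \(p(k|n)\) from (A.1) together with the identity just derived, verifying that it sums to 1 over \(k\); mpmath's `gammainc` supplies \(\Gamma(s;a)\) for the negative shape values that occur.

```python
import mpmath as mp

def set_partitions(elems):
    """Yield every partition of the list `elems` as a list of blocks."""
    if len(elems) == 1:
        yield [elems]
        return
    head, rest = elems[0], elems[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        yield [[head]] + part

n, k, a = 6, 3, 1.0

# Brute-force check of sum_(*) prod_j Gamma(n_j - 1/2) against the identity.
lhs = mp.fsum(mp.fprod(mp.gamma(len(B) - mp.mpf(1)/2) for B in part)
              for part in set_partitions(list(range(n))) if len(part) == k)
rhs = (mp.pi**(mp.mpf(k)/2) * mp.binomial(2*n - k - 1, n - 1)
       * mp.gamma(n) / mp.gamma(k) * mp.mpf(2)**(2*k - 2*n))
print(lhs, rhs)                        # should agree

def p_k_given_n(k, n, a):
    """p(k|n) for the N-IG process: (A.1) combined with the identity above."""
    s = mp.fsum(mp.binomial(n - 1, r) * (-a**2)**(-r)
                * mp.gammainc(k + 2 + 2*r - 2*n, a) for r in range(n))
    return (mp.e**a * (-a**2)**(n - 1) * mp.mpf(2)**(k - 2*n + 1)
            / mp.gamma(k) * mp.binomial(2*n - k - 1, n - 1) * s)

print(mp.fsum(p_k_given_n(k, n, a) for k in range(1, n + 1)))  # ~ 1.0
```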

A.5 Proof of Proposition 5

Let \(\mathcal{P}_m = \{A_{m,i} : i = 1, \ldots, k_m\}\) be a sequence of partitions constructed in a similar fashion as was done in section 4 of Regazzini et al. (2003). We accordingly discretize \(\alpha\) by setting \(\alpha_m = \sum_{i=1}^{k_m} \alpha_{m,i}\, \delta_{s_{m,i}}\), where \(\alpha_{m,i} = \alpha(A_{m,i})\) and \(s_{m,i}\) is a point in \(A_{m,i}\). Hence, whenever the \(j\)th element in the sample, \(X_j\), belongs to \(A_{m,i_j}\), it is as if we observed \(s_{m,i_j}\). Then, if we apply proposition 3 of Regazzini et al. (2003), some algebra leads to expressing the posterior density function of the discretized mean as (14) and (15) with

\[
\begin{aligned}
\psi_m(\sigma) = {} & -C_m\bigl(x^{(n)}\bigr)\, \frac{e^{a}\, 2^{n-k}}{(\sqrt{\pi})^{k}} \Biggl(\, \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \Biggr) \\
& \times \int_0^{+\infty} t^{n-1}\, e^{-\sum_{j=1}^{k_m} \sqrt{1 - 2it(s_{m,j} - \sigma)}\, \alpha_{m,j}} \\
& \times \Biggl(\, \prod_{r=1}^{k} \Bigl[\bigl(1 - 2it(s_{m,i_r} - \sigma)\bigr)^{n_r - 1/2} + g_m\bigl(\alpha_{m,i_r}, s_{m,i_r}, t\bigr)\Bigr] \Biggr)^{-1} dt, \tag{A.2}
\end{aligned}
\]

where \(C_m(x^{(n)})^{-1}\) is the marginal distribution of the discretized sample and \(g_m(\alpha_{m,i_r}, s_{m,i_r}, t) = O(\alpha_{m,i_r}^2)\) as \(m \to +\infty\) for any \(t\). On the basis of proposition 4 of Regazzini et al. (2003), one could use the previous expression as an approximation to the actual distribution, because it converges almost surely in the weak sense along the tree of partitions \(\{\mathcal{P}_m\}\). But in the particular N–IG case we are able to compute \(C_m(x^{(n)})^{-1}\) by mimicking the technique used in Proposition 1 and exploiting the diffuseness of \(\alpha\). Thus we have

\[
C_m\bigl(x^{(n)}\bigr)^{-1} = \frac{e^{a}\, 2^{n-k}}{(n-1)!\, (\sqrt{\pi})^{k}} \Biggl(\, \prod_{j=1}^{k} \alpha_{m,i_j}\, \Gamma\Bigl(n_j - \frac{1}{2}\Bigr) \Biggr) \times \int_{(0,+\infty)} \frac{u^{n-1}\, e^{-a\sqrt{2u+1}}}{\prod_{r=1}^{k} \bigl[[1 + 2u]^{n_r - 1/2} + h_m(\alpha_{m,i_r}, u)\bigr]}\, du,
\]

where \(h_m(\alpha_{m,i_r}, u) = O(\alpha_{m,i_r}^2)\) as \(m \to +\infty\) for any \(u\). Insert the previous expression in (A.2) and simplify. Then apply theorem 35.7 of Billingsley (1995) and dominated convergence to obtain that the limiting posterior density is given by (14) and (15) with

\[
\psi(\sigma) = -\,(n-1)!\; \frac{\int_0^{\infty} t^{n-1}\, e^{-\int_{\mathbb{R}} \sqrt{1 - 2it(x - \sigma)}\, \alpha(dx)} \prod_{j=1}^{k} \bigl(1 - 2it(x_j^{*} - \sigma)\bigr)^{-n_j + 1/2}\, dt}{\int_0^{+\infty} u^{n-1}\, e^{-a\sqrt{2u+1}}\, (2u+1)^{-n + k/2}\, du}.
\]

Now arguments similar to those in the proof of Proposition 3 lead to the desired result.

[Received February 2004. Revised December 2004.]

REFERENCES

Antoniak, C. E. (1974), "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems," The Annals of Statistics, 2, 1152–1174.

Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.

Bilodeau, M., and Brenner, D. (1999), Theory of Multivariate Statistics, New York: Springer-Verlag.

Blackwell, D. (1973), "Discreteness of Ferguson Selections," The Annals of Statistics, 1, 356–358.

Brunner, L. J., Chan, A. T., James, L. F., and Lo, A. Y. (2001), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," unpublished manuscript.

Cifarelli, D. M., and Regazzini, E. (1990), "Distribution Functions of Means of a Dirichlet Process," The Annals of Statistics, 18, 429–442.

Dey, D., Müller, P., and Sinha, D. (eds.) (1998), Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics 133, New York: Springer-Verlag.

Diaconis, P., and Kemperman, J. (1996), "Some New Tools for Dirichlet Priors," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 97–106.

Dickey, J. M., and Jiang, T. J. (1998), "Filtered-Variate Prior Distributions for Histogram Smoothing," Journal of the American Statistical Association, 93, 651–662.

Epifani, I., Lijoi, A., and Prünster, I. (2003), "Exponential Functionals and Means of Neutral-to-the-Right Priors," Biometrika, 90, 791–808.

Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University, Dept. of Statistics.

——— (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.

Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.

Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.

——— (1974), "Prior Distributions on Spaces of Probability Measures," The Annals of Statistics, 2, 615–629.

——— (1983), "Bayesian Density Estimation by Mixtures of Normal Distributions," in Recent Advances in Statistics, eds. M. H. Rizvi, J. S. Rustagi, and D. Siegmund, New York: Academic Press, pp. 287–302.

Freedman, D. A. (1963), "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case," The Annals of Mathematical Statistics, 34, 1386–1403.

Gradshteyn, I. S., and Ryzhik, I. M. (2000), Table of Integrals, Series, and Products (6th ed.), New York: Academic Press.

Green, P. J., and Richardson, S. (2001), "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, 28, 355–375.

Halmos, P. (1944), "Random Alms," The Annals of Mathematical Statistics, 15, 182–189.

Hjort, N. L. (2003), "Topics in Non-Parametric Bayesian Statistics," in Highly Structured Stochastic Systems, eds. P. J. Green et al., Oxford, U.K.: Oxford University Press, pp. 455–478.

Ishwaran, H., and James, L. F. (2001), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.

——— (2003), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," Statistica Sinica, 13, 1211–1235.

James, L. F. (2002), "Poisson Process Partition Calculus With Applications to Exchangeable Models and Bayesian Nonparametrics," Mathematics ArXiv, math.PR/0205093.

——— (2003), "A Simple Proof of the Almost Sure Discreteness of a Class of Random Measures," Statistics & Probability Letters, 65, 363–368.

Kingman, J. F. C. (1975), "Random Discrete Distributions," Journal of the Royal Statistical Society, Ser. B, 37, 1–22.

Lavine, M. (1992), "Some Aspects of Pólya Tree Distributions for Statistical Modelling," The Annals of Statistics, 20, 1222–1235.

Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates: I. Density Estimates," The Annals of Statistics, 12, 351–357.

Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992), "Pólya Trees and Random Distributions," The Annals of Statistics, 20, 1203–1221.

MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.

MacEachern, S. N., and Müller, P. (1998), "Estimating Mixture of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–239.

Nieto-Barajas, L. E., Prünster, I., and Walker, S. G. (2004), "Normalized Random Measures Driven by Increasing Additive Processes," The Annals of Statistics, 32, 2343–2360.

Paddock, S. M., Ruggeri, F., Lavine, M., and West, M. (2003), "Randomized Polya Tree Models for Nonparametric Bayesian Inference," Statistica Sinica, 13, 443–460.

Perman, M., Pitman, J., and Yor, M. (1992), "Size-Biased Sampling of Poisson Point Processes and Excursions," Probability Theory and Related Fields, 92, 21–39.

Petrone, S., and Veronese, P. (2002), "Nonparametric Mixture Priors Based on an Exponential Random Scheme," Statistical Methods and Applications, 11, 1–20.

Pitman, J. (1995), "Exchangeable and Partially Exchangeable Random Partitions," Probability Theory and Related Fields, 102, 145–158.

——— (1996), "Some Developments of the Blackwell–MacQueen Urn Scheme," in Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, eds. T. S. Ferguson et al., Hayward, CA: Institute of Mathematical Statistics, pp. 245–267.

——— (2003), "Poisson–Kingman Partitions," in Science and Statistics: A Festschrift for Terry Speed, ed. D. R. Goldstein, Hayward, CA: Institute of Mathematical Statistics, pp. 1–35.

Prünster, I. (2002), "Random Probability Measures Derived From Increasing Additive Processes and Their Application to Bayesian Statistics," unpublished doctoral thesis, University of Pavia.

Regazzini, E. (2001), "Foundations of Bayesian Statistics and Some Theory of Bayesian Nonparametric Methods," lecture notes, Stanford University.

Regazzini, E., Guglielmi, A., and Di Nunno, G. (2002), "Theory and Numerical Analysis for Exact Distribution of Functionals of a Dirichlet Process," The Annals of Statistics, 30, 1376–1411.

Regazzini, E., Lijoi, A., and Prünster, I. (2003), "Distributional Results for Means of Random Measures With Independent Increments," The Annals of Statistics, 31, 560–585.

Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies," Journal of the American Statistical Association, 85, 617–624.

Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.

Seshadri, V. (1993), The Inverse Gaussian Distribution, New York: Oxford University Press.

Walker, S., and Damien, P. (1998), "A Full Bayesian Non-Parametric Analysis Involving a Neutral to the Right Process," Scandinavian Journal of Statistics, 25, 669–680.

Walker, S., Damien, P., Laud, P. W., and Smith, A. F. M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 61, 485–527.
