On Bayesian nonparametric regression

Sonia Petrone

Bocconi University, Milano, Italy

with Sara Wade (University of Warwick, UK) and Michele Peruzzi (Bocconi University)

Building Bridges, Oslo, 22-14 May 2017

motivation

1. Regression is possibly the most important problem in Statistics!

2. bridging parametric & nonparametric: Bayesian nonparametrics somehow extends parametric constructions to processes..

3. explosion of methods for Bayesian nonparametric regression. However:

• still, less theory than for density estimation

• rich but fragmented literature

• little “ready-to-go” software

→ aim: a brief overview, having these issues in mind


motivation - 1. regression

Regression is possibly the most important problem in Statistics

Classical approaches now ‘compete’ with machine learning methods, deep learning, and more....

more bridges? frequentist/Bayesian, parametric/nonparametric, statistics/machine learning, probabilistic/non-probabilistic?

Anyway, the basic issue of quantifying the uncertainty remains crucial. The output of regression is the basis for risk evaluation and decision. A poor quantification of uncertainty may be a disaster...

..and the Bayesian approach is based on quantifying information and uncertainty!


deep learning, deep learning, deep learning....

motivation - 2. parametric & nonpar conjugate priors

Bayesian nonparametrics

X_i | F ∼ F i.i.d., F ∼ prior distribution.

The most popular nonparametric prior, the Dirichlet process, is the extension to a process of the Dirichlet conjugate prior for (parametric) multinomial sampling.

motivation - 2. parametric conjugate priors

X | ξ ∼ p(x | ξ) in the NEF, ξ ∼ standard conjugate prior (Diaconis & Ylvisaker, 1979).

Bernoulli        Beta
Multinomial      Dirichlet
N(0, Σ)          Inverse-Wishart
...

But, conjugate priors for a multivariate NEF are restrictive.

Generalized Dirichlet        Connor & Mosimann (1969)
Generalized Wishart          Brown, Le, Zidek (1994)
Enriched conjugate priors    Consonni and Veronese (2001)
...

If f(x_1, . . . , x_k; ξ) is in the NEF and can be decomposed into a product of NEF densities, each having its own parameters, the enriched conjugate prior on ξ is obtained by giving a standard conjugate prior to each of those densities.


examples

• (X, Y) | Σ ∼ N_k(0, Σ). Decompose f(x, y; Σ) = f_x(x; φ) f(y | x; θ) (both Gaussian), and assign conjugate priors to φ and θ → Generalized Wishart on Σ^{-1}.

• (N_1, . . . , N_k) | (p_1, . . . , p_k) ∼ Multinomial(N; p_1, . . . , p_k). Rewrite the multinomial as a product, by suitable reparametrization → Generalized Dirichlet distribution for (p_1, . . . , p_k).

motivation - 2. nonparametric conjugate priors

The DP extends this to a process. Let X_i | F ∼ F i.i.d.

A Dirichlet process prior for F, F ∼ DP(αF_0), is such that, for any finite partition (A_1, . . . , A_k), the random vector of probabilities (p_1, . . . , p_k) = (F(A_1), . . . , F(A_k)) ∼ Dir(αF_0(A_1), . . . , αF_0(A_k)) (Ferguson, 1973).

The DP inherits conjugacy from the conjugacy of the Dirichlet distribution for multinomial sampling, but also its lack of flexibility. This is all the more true when F is on ℝ^k.

...An enriched conjugate nonparametric prior?

Enriched conjugate prior distributions: non-parametric

Doksum’s Neutral to the Right process (Doksum, 1974) extends the enriched conjugate Dirichlet distribution to a process.

However, NTR processes are limited to univariate random distributions.

The Dirichlet distribution implies that any permutation of (p_1, . . . , p_k) is completely neutral (that is, p_1, p_2/(1 − p_1), . . . , p_k/(1 − ∑_{j=1}^{k−1} p_j) are independent).

The Generalized Dirichlet only assumes that one ordered vector (p_1, . . . , p_k) is completely neutral. In ℝ, there is a natural ordering. In more dimensions (e.g., contingency tables p(x, y)), there is no natural ordering.


Enriched Dirichlet Process

(Wade, Mongelluzzo, P., 2011).

Finite case: we define an Enriched Dirichlet distribution for [p(x, y)] by choosing the ordering through p(x, y) = p_{y|x}(y | x) p_x(x), and assuming that the vectors of the marginal and conditional probabilities have independent Dirichlet distributions.

Nonparametric case: P(x, y) is a random distribution on ℝ^k. Assume:

• P_x ∼ DP(α_x P_{0,x})

• for any x, P_{y|x} ∼ DP(α_y(x) P_{0,y|x})

all independent.

This defines a probability law for the random P(x, y), called the Enriched Dirichlet Process (EDP).


motivation - 3. Bayesian nonparametric regression

Focus on mixture models for
1. conditional density estimation: estimate f(y | x), and
2. density regression: how the density f_x(y) of Y varies with x.

1 Details of the ‘nonparametric’ prior matter, not only for asymptotics, but for finite-sample prediction. We show the case of two different BNP priors, both consistent, but one of them is clearly better: it better exploits the information. Bayesian criteria to formalize such improvement?

2 Explosion of dependent Dirichlet process (DDP) models. How can we compare them, and offer a ‘default choice’ to practitioners, possibly in an R package?


Outline

1 Preliminaries on BNP regression

2 Random design: DP mixtures for f (x , y)Example: Improving prediction by Enriched DP mixtures

3 Fixed design: Dependent stick-breaking mixture models

4 discussion


Density regression

X predictor, p-dimensional; Y response.

∗ Bayesian nonparametric (mean) regression:

x → m(x) = E(Y | x)

with a flexible model/prior on m(x): basis expansions, Gaussian processes and splines (Wahba, 1990; Denison, Holmes, Mallick, Smith, 2002), wavelets (Vidakovic, 2009), neural networks (Neal, 1996), ..

∗ Yet, the mean may be too poor a summary of the relationship between x and y. ⇒ median, quantile regression, .. density regression:

x → f(y | x)

Limited literature on optimal estimators of f(y | x) (Efromovich, Ann. Stat. 2007). How about Bayesian methods?

Random or fixed design

• regression: random design
(X_i, Y_i), i = 1, . . . , n, are a random sample from f(x, y). Then, estimate the joint density f(x, y) and from this the conditional f(y | x) = f(x, y) / f_x(x).

• fixed design
x_1, . . . , x_n is a deterministic sequence. If the predictor is x, Y ∼ f(y | x); in fact, f_x(y).

• replicates of y values at a given x (x typically categorical or ordinal), e.g. ANOVA

• no replicates (x typically continuous): still interest in f_x(y), borrowing strength through smoothness conditions along x.

contents

1 Preliminaries on BNP regression

2 Random design: DP mixtures for f (x , y)Example: Improving prediction by Enriched DP mixtures

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

Random design: DP mixtures for f (x , y)

(X_i, Y_i) | f ∼ f(x, y) i.i.d., f ∼ prior probability law

∗ Model f as a mixture of kernels:

(X_i, Y_i) | G ∼ f_G(x, y) = ∫ K(x, y | θ) dG(θ) i.i.d., G ∼ DP(αG_0)

Usually, K(x, y | θ) = N_{p+1}(x, y | µ, Σ), with θ = (µ, Σ), and G_0(θ) a conjugate prior.

Since, a.s., G = ∑_{j=1}^∞ w_j δ_{θ*_j}, where (w_j) ∼ stick-breaking(α) independently of θ*_j ∼ G_0 i.i.d., the mixture above reduces to

f_G(x, y) = ∑_{j=1}^∞ w_j K(x, y | θ*_j).
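To make the stick-breaking representation concrete, here is a minimal Python sketch that draws a truncated realization of G and simulates (X_i, Y_i) from the resulting mixture of bivariate Gaussian kernels. The truncation level, α, and the simple base measure G_0 below are illustrative choices, not the ones used in the talk.

```python
# Minimal sketch: one truncated realization of G from a DP via stick-breaking,
# then a sample (X_i, Y_i) from the implied mixture of bivariate Gaussian kernels.
import numpy as np

rng = np.random.default_rng(0)
alpha, truncation = 1.0, 50

# Stick-breaking weights: v_j ~ Beta(1, alpha), w_j = v_j * prod_{l<j} (1 - v_l).
v = rng.beta(1.0, alpha, size=truncation)
w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

# Atoms theta*_j = (mu_j, Sigma_j) drawn i.i.d. from a simple illustrative base
# measure: mu_j ~ N(0, 4 I) and a fixed Sigma (in the talk, G_0 is a conjugate prior).
mu = rng.normal(0.0, 2.0, size=(truncation, 2))
Sigma = np.array([[0.5, 0.0], [0.0, 0.5]])

# Simulate n observations from f_G by first picking a mixture component.
n = 200
z = rng.choice(truncation, size=n, p=w / w.sum())   # renormalize the truncated weights
xy = np.array([rng.multivariate_normal(mu[j], Sigma) for j in z])
x, y = xy[:, 0], xy[:, 1]
```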


mixture of Gaussian kernels

[Figure slides illustrating a mixture of Gaussian kernels.]

Latent variable representation

The DP mixture model is equivalently expressed as

(X_i, Y_i) | θ_i ∼ K(x, y | θ_i) independently,
θ_i | G ∼ G i.i.d.,
G ∼ DP(αG_0).

Integrating the θ_i out, one gets back the countable mixture model

(X_i, Y_i) | G ∼ f_G(x, y) = ∑_{j=1}^∞ w_j K(x, y | θ*_j) i.i.d.

Then the conditional density f_G(y | x) is obtained as

f_G(y | x) = ∑_j w_j K(x, y | θ*_j) / ∑_j w_j K(x | θ*_j) = ∑_j w_j(x) K(y | x, θ*_j),

where w_j(x) = w_j K(x | θ*_j) / ∑_{j′} w_{j′} K(x | θ*_{j′}).
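A minimal sketch of the conditional density formula above, for a truncated mixture of bivariate Gaussian kernels: the weights w_j(x) are proportional to w_j K(x | θ*_j), and K(y | x, θ*_j) is the Gaussian conditional. The component values are illustrative, not fitted to any data.

```python
# Conditional density f_G(y | x) implied by a truncated DP mixture of bivariate
# Gaussian kernels, via covariate-dependent weights w_j(x).
import numpy as np
from scipy.stats import norm

w = np.array([0.5, 0.3, 0.2])                          # truncated mixture weights w_j
mu = np.array([[0.0, 0.0], [3.0, 2.0], [5.0, 5.0]])    # component means (mu_x, mu_y)
Sigma = np.array([[[1.0, 0.6], [0.6, 1.0]]] * 3)       # component covariances

def conditional_density(y, x):
    """f_G(y | x) = sum_j w_j(x) K(y | x, theta*_j) for bivariate Gaussian kernels."""
    marg_x = norm.pdf(x, mu[:, 0], np.sqrt(Sigma[:, 0, 0]))   # K(x | theta*_j)
    w_x = w * marg_x
    w_x = w_x / w_x.sum()                                     # w_j(x)
    cond_mean = mu[:, 1] + Sigma[:, 0, 1] / Sigma[:, 0, 0] * (x - mu[:, 0])
    cond_var = Sigma[:, 1, 1] - Sigma[:, 0, 1] ** 2 / Sigma[:, 0, 0]
    return np.sum(w_x * norm.pdf(y, cond_mean, np.sqrt(cond_var)))

print(conditional_density(y=1.0, x=2.0))
```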


Random partition

Since G is a.s. discrete, ties in a sample (θ_1, . . . , θ_n) from G have positive probability, so that (θ_1, . . . , θ_n) is described by (ρ_n; θ*_1, . . . , θ*_{k(ρ_n)}):

• a random partition ρ_n = (s_1, . . . , s_n)

• the cluster-specific parameters θ*_j

Ex: for n = 5, ρ_n = (1, 1, 2, 2, 1) gives (θ_1, . . . , θ_n) = (θ*_1, θ*_1, θ*_2, θ*_2, θ*_1): k_n = 2 clusters, of sizes n_1 = 3 and n_2 = 2 respectively, with cluster-specific parameters θ*_1, θ*_2.

• The DP induces a probability law on the random partition,

p(ρ_n) = Γ(α) / Γ(α + n) · α^{k_n} ∏_{j=1}^{k_n} Γ(n_j)

(a computational sketch follows below);

• Given the partition ρ_n, the cluster-specific parameters θ*_j are i.i.d. ∼ G_0.
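A short sketch of the partition probability above, computed on the log scale for numerical stability; it reproduces the example ρ_n = (1, 1, 2, 2, 1), with an illustrative value of α.

```python
# Log-probability of a partition rho_n under the DP-induced law
# p(rho_n) = Gamma(alpha) / Gamma(alpha + n) * alpha^{k_n} * prod_j Gamma(n_j).
import numpy as np
from scipy.special import gammaln

def log_p_partition(labels, alpha):
    """labels: cluster labels, e.g. (1, 1, 2, 2, 1) as in the example above."""
    labels = np.asarray(labels)
    n = labels.size
    _, counts = np.unique(labels, return_counts=True)   # cluster sizes n_j
    k = counts.size                                      # number of clusters k_n
    return (gammaln(alpha) - gammaln(alpha + n)
            + k * np.log(alpha) + gammaln(counts).sum())

# Example: n = 5, rho_n = (1, 1, 2, 2, 1), two clusters of sizes 3 and 2.
print(np.exp(log_p_partition([1, 1, 2, 2, 1], alpha=1.0)))
```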

mixture of Gaussian kernels

[Figure slides: scatter plots of (x_1, y) data illustrating the mixture of Gaussian kernels.]

Inference

∗ posterior on ρ_n:

p(ρ_n | x_{1:n}, y_{1:n}) ∝ p(ρ_n) ∏_{j=1}^{k_n} ∫ ∏_{i: (x_i, y_i) ∈ C_j} K(x_i, y_i | θ*_j) dG_0(θ*_j)

                          ∝ p(ρ_n) ∏_{j=1}^{k_n} m({(x_i, y_i) ∈ C_j} | ρ_n),

i.e., prior × independent marginal likelihoods in each cluster.

∗ posterior on (θ*_j): given the partition, clusters are independent, and inference on θ*_j is based only on the observations in cluster C_j:

p(θ*_1, . . . , θ*_{k_n} | x_{1:n}, y_{1:n}, ρ_n) = ∏_{j=1}^{k_n} p(θ*_j | ρ_n, {(x_i, y_i) ∈ C_j}).


Predictive distribution (density estimate)

With quadratic loss, the Bayesian density estimate is the predictive density

(X_{n+1}, Y_{n+1}) | x_{1:n}, y_{1:n} ∼ f̂(x, y) = E(f(x, y) | x_{1:n}, y_{1:n}).

∗ Recalling that G | θ_{1:n}, x_{1:n}, y_{1:n} ∼ DP(αG_0 + ∑_{i=1}^n δ_{θ_i}),

f̂(x, y) = E( E(f_G(x, y) | θ_{1:n}) | x_{1:n}, y_{1:n} )
        = α/(α + n) f_{G_0}(x, y) + n/(α + n) E( (1/n) ∑_{i=1}^n K(x, y | θ_i) | x_{1:n}, y_{1:n} ),

an average of the prior guess f_{G_0} and the expectation of a kernel estimate with kernels centered at the θ_i.

conditional density estimate

∗ Since (θ_1, . . . , θ_n) ↔ (ρ_n, θ*_1, . . . , θ*_k), the joint density estimate is

f̂(x, y) = α/(α + n) f_{G_0}(x, y) + ∑_{ρ_n} ∑_{j=1}^{k(ρ_n)} n_j(ρ_n)/(α + n) f(x, y | {(x_i, y_i) ∈ C_j(ρ_n)}) p(ρ_n | x_{1:n}, y_{1:n}),

an average of the prior guess f_{G_0} and, given the partition ρ_n, of the predictive densities in the clusters C_j(ρ_n).

From f̂(x, y), one can obtain an estimate of f(y | x).

’clustering’ and density estimate

The partition is not of main interest (mixture components just play the role of kernels), but p(ρ_n | x_{1:n}, y_{1:n}) plays a crucial role.

[Figure: scatter plot of the (x_1, y) data.]

This role of the prior and posterior distribution of the random partition is often overlooked.

How can we compare nonparametric priors?

Frequentist properties. There are results on frequentist asymptotic properties for multivariate density estimation, and some results for regression and conditional density estimation (Wu & Ghosal, 2008, 2010; Tokdar, 2011; Norets & Pelenis, 2012; Shen, Tokdar & Ghosal, 2013; Canale & Dunson, 2015; Bhattacharya, Pati, Dunson, 2014, anisotropic; Norets & Pati, 2016+; . . .).

But what about (Bayesian) finite-sample properties and predictive performance?

Different nonparametric priors may all be consistent, yet have quite diverse predictive performance! The implied distribution on the random partition plays a crucial role.

Example 1. Wade, Walker & Petrone (Scand. J. Statist., 2014) introduce a restricted partition model to overcome difficulties of DP mixtures in curve fitting. Their proposal clearly leads to improved prediction.


Problem: anisotropic case

X continuous predictors; Gaussian kernels. The DP mixture of multivariate Gaussian distributions uses joint clusters to fit the density f(x, y).

BUT, the conditional density f_{y|x} and the marginal density f_x might have different smoothness; in regression, typically f_{y|x} is smoother than f_x.

Here, many small clusters (kernels) are needed to fit the density f_x, while many fewer kernels would suffice for f_{y|x}. If the dimension of x is large, the likelihood is dominated by the x component and many small clusters are suggested by the posterior on ρ_n. This impoverishes the performance of the model.


This undesirable behavior does not seem to vanish with increasing sample size.

If f_x(x) requires many clusters, the unappealing behaviour of the random partition could be reflected in worse convergence rates. Efromovich (2007) shows that if the conditional density is smoother than the joint, it can be estimated at a faster rate.

Thus, improving inference on the random partition to take into account the different degrees of smoothness of f_x and f_{y|x} is crucial.

Improving prediction

Consider the Dirichlet mixture of Gaussian kernels

(X_i, Y_i) | G ∼ ∫ N_{p+1}(µ, Σ) dG(µ, Σ),   G ∼ DP(αG_0).

The base measure of the DP, G_0(µ, Σ), is usually Normal-Inverse-Wishart. BUT, this conjugate prior is restrictive if p is large.

Improving prediction

• Write the kernels as

N_{p+1}(x, y | µ, Σ) = N_p(x | µ_x, Σ_x) N(y | x′β, σ²_{y|x})

and use simple x-kernels with diagonal covariance (Shahbaba and Neal, 2009; Hannah et al., 2011). Thus,

f_x(x | G) = ∑_{j=1}^∞ w_j N_p(x | µ*_{x,j}, diag(σ²*_{x1,j}, σ²*_{x2,j}, . . . , σ²*_{xp,j}))

f(y | x, G) = ∑_{j=1}^∞ w_j(x) N(y | x′β*_j, σ²*_{y|x,j}).

• Denote the x-parameters by φ = ((µ_{x1}, . . . , µ_{xp}), (σ²_{x1}, . . . , σ²_{xp})) and the y|x-parameters by θ = (β, σ²_{y|x}). Assign independent conjugate priors G_{0,φ}(φ) and G_{0,θ}(θ) (this leads to an enriched conjugate prior for (µ, Σ)).

• In the DP mixture model, assume individual-specific (φ_i, θ_i) as a random sample from G ∼ DP(α G_{0,φ} × G_{0,θ}).


Joint DP mixture model

Y_i | x_i, β_i, σ²_{y|x,i} ∼ N(β′_i x_i, σ²_{y|x,i}) independently,   θ_i = (β_i, σ²_{y|x,i}),

X_i | µ_{x,i}, σ²_{x,i} ∼ ∏_{h=1}^p N(µ_{x,h,i}, σ²_{x,h,i}) independently,   φ_i = (µ_{x,i}, σ²_{x,i}),

(θ_i, φ_i) | G ∼ G i.i.d.,

G ∼ DP(α G_{0,θ} × G_{0,φ}),

with G_{0,θ} and G_{0,φ} independent conjugate Normal-Inverse-Gamma priors.

behavior for large p

The model is flexible, and MCMC computations are standard.

BUT, if p is large, many (independent) kernels will typically be needed to describe the (dependent) marginal f_x, while the relationship Y | x can be smoother.

However, the DP only allows joint clusters of (φ_i, θ_i), i = 1, . . . , n.

Given its crucial role, difficulties with the random partition have relevant consequences on prediction. We would like a prior on G that allows many φ-clusters, to fit f_x, but fewer θ-clusters, and that is still conjugate, so that computations remain simple.

⇒ Enriched Dirichlet Process (Wade, Mongelluzzo, P., 2011)


Enriched Dirichlet Process

Extends the idea that leads from the Dirichlet distribution (conjugate to the multinomial) to the enriched (or generalized) Dirichlet distribution (Connor & Mosimann, 1969), and from the DP to Doksum's (1974) neutral-to-the-right priors for random probability measures on the real line.

EDP: Assign a prior for the random probability measure G(θ, φ) by assuming:

• P_θ ∼ DP(α_θ P_{0,θ})

• for any θ, P_{φ|θ} ∼ DP(α_φ(θ) P_{0,φ|θ})

all independent.

The EDP gives a nested random partition, ρ_n = (ρ_{n,θ}, ρ_{n,φ}), which allows many φ-clusters inside each θ-cluster.

EDP: nested random partition

ρn = (ρn,θ, ρn,φ) : many φ-clusters inside each θ-cluster.

• P_θ ∼ DP(α_θ P_{0,θ}) gives a Chinese restaurant process: customers choose restaurants, and restaurants are colored with colors θ*_h ∼ P_{0,θ} i.i.d. (nonatomic);

• P_{φ|θ} ∼ DP(α_φ(θ) P_{0,φ|θ}) gives a nested CRP: within each restaurant, customers sit at tables as in the CRP. Tables in restaurant θ*_h are colored with colors φ*_{j|h} ∼ P_{0,φ|θ}(φ | θ*_h) i.i.d. (A simulation sketch follows.)
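A minimal simulation sketch of this nested Chinese restaurant process: customers first choose a restaurant (θ-cluster), then a table (φ-cluster) within it. The values of α_θ, α_φ, and n are illustrative.

```python
# Nested CRP: a CRP over restaurants (theta-clusters) with mass alpha_theta,
# then a CRP over tables (phi-clusters) within each restaurant with mass alpha_phi.
import numpy as np

rng = np.random.default_rng(1)
alpha_theta, alpha_phi, n = 1.0, 2.0, 20

restaurant = []                 # theta-cluster label of each customer
table = []                      # phi-cluster (table) label within the chosen restaurant
tables_per_restaurant = {}      # customer counts at each table, per restaurant

for i in range(n):
    # Restaurant choice: P(existing h) proportional to n_h, P(new) proportional to alpha_theta.
    counts = np.bincount(restaurant) if restaurant else np.array([])
    probs = np.append(counts, alpha_theta).astype(float)
    h = rng.choice(probs.size, p=probs / probs.sum())
    restaurant.append(h)

    # Table choice within restaurant h: P(existing t) proportional to its count,
    # P(new table) proportional to alpha_phi.
    tcounts = tables_per_restaurant.setdefault(h, [])
    tprobs = np.append(np.array(tcounts, dtype=float), alpha_phi)
    t = rng.choice(tprobs.size, p=tprobs / tprobs.sum())
    if t == len(tcounts):
        tcounts.append(1)
    else:
        tcounts[t] += 1
    table.append(t)

print("theta-partition:", restaurant)
print("phi-tables within each theta-cluster:", tables_per_restaurant)
```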


Joint EDP mixture model

• Model (replace the DP with an EDP):

Y_i | x_i, β_i, σ²_{y|x,i} ∼ N(β′_i x_i, σ²_{y|x,i}) independently,   θ_i = (β_i, σ²_{y|x,i}),

X_i | µ_{x,i}, σ²_{x,i} ∼ ∏_{h=1}^p N(µ_{x,h,i}, σ²_{x,h,i}) independently,   φ_i = (µ_{x,i}, σ²_{x,i}),

(θ_i, φ_i) | G ∼ G i.i.d.,

G ∼ EDP(α_θ, α_φ(·), G_{0,θ} × G_{0,φ|θ}).

• Computations remain simple, as the EDP is a conjugate prior;

• Inference on a cluster-specific θ*_j = (β*_j, σ²*_{y|x,j}) (thus, ultimately, on the conditional density f(y | x, θ)) exploits the information from the observations in all the φ*-clusters that share the same θ*_j ⇒ much improved inference and prediction

(Wade, Dunson, Petrone, Trippa, JMLR (2014))

Simulation study

Toy example to demonstrate two key advantages of the EDP model

• it can recover the true coarser θ-partition

• improved prediction and smaller credible intervals result.

Data: 200 observations (x_i, y_i). The covariates X_i are sampled from a p-variate normal,

X_i = (X_{i,1}, . . . , X_{i,p})′ ∼ N(µ, Σ) i.i.d.,

centered at µ = (4, . . . , 4)′ with Σ_{h,h} = 4 for h = 1, . . . , p, and with covariances that model two groups of covariates with different correlation structure.

The true regression only depends on the first covariate; it is a nonlinear regression obtained as a mixture

Y_i | x_i ∼ p(x_{i,1}) N(y_i | β_{1,0} + β_{1,1} x_{i,1}, σ²_1) + (1 − p(x_{i,1})) N(y_i | β_{2,0} + β_{2,1} x_{i,1}, σ²_2) independently.
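A sketch of a data-generating mechanism of this form. The regression coefficients, the mixing function p(x_1), and the correlation structure of Σ below are illustrative placeholders; the exact values used in the study are not reported above.

```python
# Synthetic data in the spirit of the simulation study: p-variate normal covariates
# centered at 4 with variance 4, and a response that depends only on x_1 through a
# two-component mixture of linear regressions.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5

# Covariates: mean 4, variance 4, mild correlation within two blocks of covariates.
mu = np.full(p, 4.0)
Sigma = np.full((p, p), 0.5)
Sigma[:p // 2, p // 2:] = Sigma[p // 2:, :p // 2] = 0.0
np.fill_diagonal(Sigma, 4.0)
X = rng.multivariate_normal(mu, Sigma, size=n)

# Response: mixture of two linear regressions in x_1, with an x_1-dependent weight.
x1 = X[:, 0]
w = 1.0 / (1.0 + np.exp(-(x1 - 4.0)))                    # illustrative p(x_1)
comp = rng.random(n) < w
y = np.where(comp,
             1.0 + 2.0 * x1 + rng.normal(0.0, 0.5, n),   # beta_{1,0}, beta_{1,1}, sigma_1
             8.0 - 1.0 * x1 + rng.normal(0.0, 0.5, n))   # beta_{2,0}, beta_{2,1}, sigma_2
```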


prediction error

Table: Prediction error (l̂_1 and l̂_2) for both models as p increases.

         p = 1          p = 5          p = 10         p = 15
       l̂_1    l̂_2    l̂_1    l̂_2    l̂_1    l̂_2    l̂_1    l̂_2
DP     0.03   0.05   0.16   0.2    0.25   0.34   0.26   0.34
EDP    0.04   0.05   0.06   0.1    0.09   0.16   0.12   0.21

Real data: Alzheimer disease diagnosis

Data were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, which is publicly accessible at UCLA's Laboratory of Neuroimaging.

covariates: summaries of p = 15 brain structures computed from structural MRI obtained at the first visit for 377 patients, of whom 159 have been diagnosed with AD and 218 are cognitively normal (CN).

response: Y_i = 1 (cognitively normal subject) or 0 (diagnosed with AD).

Aim: Prediction of AD status

Extension of the model to binary response

The model is extended to a local probit model:

Y_i | x_i, β_i ∼ Bern(Φ(x′_i β_i)) independently,   X_i | µ_i, σ²_i ∼ ∏_{h=1}^p N(µ_{i,h}, σ²_{i,h}) independently,

(β_i, µ_i, σ²_i) | G ∼ G i.i.d.,   G ∼ Q.

First, a DP prior for G: G ∼ DP(α, G_{0,β} × G_{0,ψ}), with G_{0,β} = N(0_p, C^{−1}) and G_{0,ψ} a product of p normal-inverse gamma distributions. We let α ∼ Gamma(1, 1).

EDP prior for G: correlation between the measurements of the brain structures and non-normal univariate histograms of the covariates suggest that many Gaussian kernels with local independence will be needed to approximate the density of X. The conditional density of the response, on the other hand, may not be so complicated. This motivates the choice of an EDP prior.
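A sketch of the predictive probability implied by a (truncated) mixture of this form: p(y = 1 | x) = ∑_j w_j(x) Φ(x′β_j), with w_j(x) proportional to w_j ∏_h N(x_h | µ_{j,h}, σ²_{j,h}). All component values below are illustrative.

```python
# Predictive probability of y = 1 under a truncated local probit mixture.
import numpy as np
from scipy.stats import norm

w = np.array([0.6, 0.4])                         # mixture weights w_j
beta = np.array([[0.5, -0.2], [-1.0, 0.3]])      # probit coefficients per component
mu = np.array([[0.0, 1.0], [2.0, -1.0]])         # x-kernel means per component
sd = np.array([[1.0, 1.0], [0.5, 2.0]])          # x-kernel standard deviations

def prob_y1(x):
    x = np.asarray(x, dtype=float)
    local = w * np.prod(norm.pdf(x, mu, sd), axis=1)   # w_j * K(x | phi*_j)
    local = local / local.sum()                        # w_j(x)
    return np.sum(local * norm.cdf(beta @ x))          # sum_j w_j(x) Phi(x' beta_j)

print(prob_y1([0.5, 0.8]))
```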

Prediction for 10 new subjects


Figure: Predicted probability of being healthy against subject index for 10 new subjects, represented with circles ((a) DP in blue and (b) EDP in red), with the true outcome as black stars. The bars about the predictions depict the 95% credible intervals.

contents

1 Preliminaries on BNP regression

2 Random design: DP mixtures for f (x , y)

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

Fixed design: Conditional models

Sample (x_i, Y_{i,ν}), ν = 1, . . . , n_i; x a fixed input. Interest in f_x(y) (no longer a conditional f(y | x)).

Joint DP mixture models are used in this context, too. Yet, they unnecessarily require modeling the marginal density f_x(x).

In a Bayesian approach, one wants to assign a prior on the set of random probability measures {F_x(·), x ∈ X}. The random measures F_x must be dependent, as we want some smoothness along x, for borrowing strength; and they should have a given marginal distribution, e.g. F_x ∼ DP.

Early proposals: Cifarelli & Regazzini (1978), extending mixtures of DPs (Antoniak, 1974).

Influential idea: dependent Dirichlet processes (DDP), MacEachern (1999, 2000), based on dependent stick-breaking constructions.


DDP mixture models

Use a mixture model for f_x(y):

Y_{i,ν} | x, G_x ∼ f_{G_x}(y) = ∫ K(y | x, θ) dG_x(θ) independently,

where the mixing distribution G_x is indexed by x.

A DDP prior on {G_x, x ∈ X} assumes that G_x ∼ DP(α(x) G_{0,x}), and dependence is introduced through dependent stick-breaking constructions:

G_x(·) = ∑_{j=1}^∞ w_j(x) δ_{θ*_j(x)}(·)

where, for each j:

• (w_j(x), x ∈ X) is a stochastic process, with stick-breaking construction w_1(x) = v_1(x), w_2(x) = v_2(x)(1 − v_1(x)), . . . , where (v_j(x), x ∈ X) is a stochastic process with marginals v_j(x) ∼ Beta(1, α(x)), and the v_j(·) are independent across j;

• (θ*_j(x), x ∈ X) is a stochastic process with marginals G_{0,x}. The processes θ*_j(·) are independent across j, and independent of the v_j(·).


Conditional approach: Single weights

The DDP allows both the mixing weights and the atoms to depend on x. But this is redundant, and one typically considers either 1) models with single weights or 2) models with covariate-dependent weights.

Single weights: assume w_j(x) = w_j, with flexible θ_j(x):

f_{G_x}(y | x) = ∑_{j=1}^∞ w_j K(y | θ_j(x)).

• Ex. K(y | θ_j(x)) = N(y | µ_j(x), σ²_j) with µ_j(·) ∼ GP i.i.d. (a sketch follows below).

• Popular because inference relies on established algorithms for BNP mixtures.
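A sketch of a 'single weights' model: global stick-breaking weights w_j and mean curves µ_j(·) drawn from a Gaussian process with a squared-exponential covariance, giving f_x(y) = ∑_j w_j N(y | µ_j(x), σ²_j). The truncation level, α, the GP hyperparameters, and σ_j are illustrative.

```python
# Single-weights DDP mixture: global stick-breaking weights, GP-distributed atoms.
import numpy as np

rng = np.random.default_rng(3)
alpha, truncation = 1.0, 20
xgrid = np.linspace(0.0, 1.0, 50)

# Global (x-free) stick-breaking weights.
v = rng.beta(1.0, alpha, size=truncation)
w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

# GP atoms: mu_j(.) ~ GP(0, k) with a squared-exponential kernel on xgrid.
K = np.exp(-0.5 * (xgrid[:, None] - xgrid[None, :]) ** 2 / 0.1 ** 2)
mu = rng.multivariate_normal(np.zeros(xgrid.size), K + 1e-8 * np.eye(xgrid.size),
                             size=truncation)          # shape (truncation, len(xgrid))
sigma = np.full(truncation, 0.3)

def f_x(y, ix):
    """Density f_x(y) at the ix-th grid point of x, from the truncated mixture."""
    dens = np.exp(-0.5 * ((y - mu[:, ix]) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return np.sum(w / w.sum() * dens)

print(f_x(0.0, ix=25))
```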

Conditional approach: Covariate dependent weights

Covariate-dependent weights: flexible w_j(x), with θ_j(x) = θ_j:

f_{G_x}(y | x) = ∑_{j=1}^∞ w_j(x) K(y | θ_j, x),

e.g. K(y | θ_j, x) = N(y | β′_j x, σ²_j).

• Most techniques to define w_j(x) such that ∑_j w_j(x) = 1 use a stick-breaking approach (see the sketch below):

w_1(x) = v_1(x) and, for j > 1, w_j(x) = v_j(x) ∏_{j′<j} (1 − v_{j′}(x));

• Proposals for v_j(x) include Griffin and Steel (2006), Dunson and Park (2008), the normalized weights model (Antoniano, Wade, Walker, 2014), probit stick-breaking (Rodriguez and Dunson, 2011), logit stick-breaking (Ren, Du, Carin & Dunson, 2011; Rigon & Durante, 2017+), . . .
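A sketch of covariate-dependent stick-breaking weights in the spirit of logit stick-breaking: v_j(x) = logistic(a_j + b_j x), then w_1(x) = v_1(x) and w_j(x) = v_j(x) ∏_{j′<j} (1 − v_{j′}(x)). The linear form of v_j(x) and its coefficients are illustrative, not any specific published specification.

```python
# Covariate-dependent stick-breaking weights w_j(x) from logistic sticks v_j(x).
import numpy as np

rng = np.random.default_rng(4)
truncation = 10
a = rng.normal(0.0, 1.0, truncation)     # intercepts of the stick-breaking logits
b = rng.normal(0.0, 1.0, truncation)     # slopes of the stick-breaking logits

def weights(x):
    """Covariate-dependent weights w_j(x); they sum to at most 1 under truncation."""
    v = 1.0 / (1.0 + np.exp(-(a + b * x)))                 # v_j(x) in (0, 1)
    stick = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    return v * stick

w = weights(x=0.5)
print(w, w.sum())
```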

How can we compare?

Model comparison

The EDP example shows a clear improvement in predictive performance. We would like to formally express the gain of information that a model/prior can provide.

So far:

• careful and detailed study of the information and assumptions introduced through the prior/model, and of the analytic implications for the predictive distribution;

• comparisons on simulated and real data.


comparisons

Some findings (Peruzzi, Petrone & Wade, 2016+)

• The ‘single weights’ mixture model needs flexible atoms to give reasonable prediction. But this has drawbacks, including identifiability.

• Covariate-dependent weights allow local selection of the clustering, and generally better prediction. In particular, Peruzzi & Wade (2016+) compared the kernel stick-breaking model (Dunson & Park, 2008) and the normalized weights model (Antoniano, Wade, Walker, 2014). The latter appears to allow faster computations.

Computations

A substantial challenge for comparison is computational: one needs an algorithm that is fast and can be fairly easily adapted to different models.

Peruzzi & Wade (2016+) suggest a clever algorithm, based on adaptive truncation and Metropolis-Hastings, building on a proposal by Griffin (2016).

This is a necessary basis for providing R packages for density regression and conditional density estimation.

contents

1 Preliminaries on BNP regression

2 Random design: DP mixtures for f (x , y)

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

summary and discussion

I tried to give an overview of proposals for Bayesian density regression, based on mixture models.

Careful study of the finite-sample properties of these models is needed for comparison.

Together with flexible, easily exportable computational tools, these are necessary steps for providing R packages for density regression.

references

• Wade, S., Mongelluzzo, S. & Petrone, S. (2011). An enriched conjugate prior for Bayesian nonparametric inference. Bayesian Analysis, 3, 359–386.

• Wade, S., Dunson, D., Petrone, S. & Trippa, L. (2014). Improving prediction from Dirichlet process mixtures via enrichment. Journal of Machine Learning Research, 1041–1071.

• Wade, S., Walker, S. & Petrone, S. (2014). A predictive study of Dirichlet process mixture models for curve fitting. Scandinavian Journal of Statistics, 41, 580–605.

• Wade, S., Peruzzi, M. & Petrone, S. (2016+). Review of Bayesian nonparametric mixture models for regression. Manuscript.