Page 1:

On Bayesian nonparametric regression

Sonia Petrone

Bocconi University, Milano, Italy

with Sara Wade (University of Warwick, UK) and Michele Peruzzi (Bocconi University)

Building Bridges, Oslo, 22–24 May 2017

Page 2:

motivation

1. Regression is possibly the most important problem in Statistics!

2. bridging parametric & nonparametric: Bayesian nonparametrics somehow extends parametric constructions to processes...

3. explosion of methods for Bayesian nonparametric regression. However:

• still, less theory than for density estimation

• rich but fragmented literature

• little “ready-to-go” software

→ aim: a brief overview, having these issues in mind


Page 5:

motivation - 1. regression

Regression is possibly the most important problem in Statistics.

Classical approaches now 'compete' with machine learning methods, deep learning, and more...

more bridges? frequentist/Bayesian, parametric/nonparametric, statistics/machine learning, probabilistic/non-probabilistic?

Anyway, the basic issue of quantifying the uncertainty remains crucial. The output of regression is the basis for risk evaluation and decision. A poor quantification of uncertainty may be a disaster...

...and the Bayesian approach is based on quantifying information and uncertainty!


Page 7:

deep learning, deep learning, deep learning....

Page 8:

motivation - 2. parametric & nonparametric conjugate priors

Bayesian nonparametrics

$X_i \mid F \overset{iid}{\sim} F$, $F \sim$ prior distribution.

The most popular nonparametric prior, the Dirichlet process, is the extension to a process of the Dirichlet conjugate prior for (parametric) multinomial sampling.

Page 9:

motivation - 2. parametric conjugate priors

$X \mid \xi \sim p(x \mid \xi)$ in the NEF, $\xi \sim$ standard conjugate prior (Diaconis & Ylvisaker, 1979).

Bernoulli → Beta
Multinomial → Dirichlet
$N(0, \Sigma)$ → Inverse-Wishart
...

But conjugate priors for a multivariate NEF are restrictive.

Generalized Dirichlet: Connor & Mosimann (1969)
Generalized Wishart: Brown, Le, Zidek (1994)
Enriched conjugate priors: Consonni and Veronese (2001)
...

If $f(x_1, \ldots, x_k; \xi)$ is NEF and can be decomposed into a product of NEF densities, each having its own parameters, the enriched conjugate prior on $\xi$ is obtained by assigning a standard conjugate prior to each of those densities.


Page 12:

examples

• $(X, Y) \mid \Sigma \sim N_k(0, \Sigma)$. Decompose $f(x, y; \Sigma) = f_x(x; \phi)\, f(y \mid x; \theta)$ (both Gaussian), and assign conjugate priors to $\phi$ and $\theta$. → Generalized Wishart on $\Sigma^{-1}$.

• $(N_1, \ldots, N_k) \mid (p_1, \ldots, p_k) \sim \mathrm{Multinomial}(N, p_1, \ldots, p_k)$. Rewrite the multinomial as a product, by a suitable reparametrization. → Generalized Dirichlet distribution for $(p_1, \ldots, p_k)$.

Page 13:

motivation - 2. nonparametric conjugate priors

The DP extends this to a process. Let $X_i \mid F \overset{iid}{\sim} F$.

A Dirichlet process prior for $F$, $F \sim DP(\alpha F_0)$, is such that, for any finite partition $(A_1, \ldots, A_k)$, the random vector of probabilities $(p_1, \ldots, p_k) = (F(A_1), \ldots, F(A_k)) \sim \mathrm{Dir}(\alpha F_0(A_1), \ldots, \alpha F_0(A_k))$ (Ferguson, 1973).

The DP inherits conjugacy from the conjugacy of the Dirichlet distribution for multinomial sampling, but also its lack of flexibility. This is all the more true when $F$ is on $\mathbb{R}^k$.

...An enriched conjugate nonparametric prior?
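
To make the definition concrete, here is a minimal simulation sketch in Python (the choices $F_0 = N(0,1)$, $\alpha = 2$ and the truncation level are hypothetical, not from the slides) that draws truncated stick-breaking realizations of $F \sim DP(\alpha F_0)$ and checks the Dirichlet-marginal property $E[F(A)] = F_0(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, base_sampler, trunc=500):
    """Draw a truncated realization F = sum_j p_j * delta_{theta_j}
    of DP(alpha * F0), via Sethuraman's stick-breaking construction."""
    v = rng.beta(1.0, alpha, size=trunc)                       # stick fractions
    p = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))  # weights p_j
    theta = base_sampler(trunc)                                # atoms iid from F0
    return p, theta

# Check the defining Dirichlet-marginal property on A = (-inf, 0]:
# E[F(A)] = F0(A) = 0.5 when F0 = N(0, 1).
alpha = 2.0
FA = []
for _ in range(2000):
    p, theta = dp_stick_breaking(alpha, lambda n: rng.normal(size=n))
    FA.append(p[theta <= 0.0].sum())
print(np.mean(FA))  # should be close to 0.5
```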

Page 14:

Enriched conjugate prior distributions: nonparametric

Doksum's Neutral to the Right process (Doksum, 1974) extends the enriched conjugate Dirichlet distribution to a process.

However, NTR processes are limited to univariate random distributions.

The Dirichlet distribution implies that any permutation of $(p_1, \ldots, p_k)$ is completely neutral (that is, $p_1,\ p_2/(1 - p_1),\ \ldots,\ p_k/(1 - \sum_{j=1}^{k-1} p_j)$ are independent).

The Generalized Dirichlet only assumes that one ordered vector $(p_1, \ldots, p_k)$ is completely neutral. In $\mathbb{R}$, there is a natural ordering. In more dimensions (e.g., contingency tables $p(x, y)$), there is no natural ordering.


Page 16:

Enriched Dirichlet Process

(Wade, Mongelluzzo, Petrone, 2011).

Finite case: we define an Enriched Dirichlet distribution for $[p(x, y)]$ by choosing the ordering through $p(x, y) = p_{y|x}(y \mid x)\, p_x(x)$, and assuming that the vectors of the marginal and conditional probabilities have independent Dirichlet distributions.

Nonparametric case: $P(x, y)$ a random distribution on $\mathbb{R}^k$. Assume:

• $P_x \sim DP(\alpha_x P_{0,x})$

• for any $x$, $P_{y|x} \sim DP(\alpha_y(x) P_{0,y|x})$

all independent.

This gives a well-defined probability law for the random $P(x, y)$, named the Enriched Dirichlet Process (EDP).


Page 18:

motivation - 3. Bayesian nonparametric regression

Focus on mixture models for
1. conditional density estimation: estimate $f(y \mid x)$, and
2. density regression: how the density $f_x(y)$ of $Y$ varies with $x$.

1. Details of the 'nonparametric' prior matter, not only for asymptotics, but for finite-sample prediction. We show the case of two different BNP priors, both consistent, but one of them is clearly better: it better exploits the information. Bayesian criteria to formalize such improvement?

2. Explosion of dependent Dirichlet process (DDP) models. How can we compare, and offer a 'default choice' to practitioners, possibly in an R package?


Page 20:

Outline

1 Preliminaries on BNP regression

2 Random design: DP mixtures for $f(x, y)$
  Example: Improving prediction by Enriched DP mixtures

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

Page 21:

contents

1 Preliminaries on BNP regression

2 Random design: DP mixtures for $f(x, y)$

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

Page 22:

Density regression

$X$ predictor, $p$-dimensional; $Y$ response.

* Bayesian nonparametric (mean) regression: $x \to m(x) = E(Y \mid x)$

flexible model/prior on $m(x)$: basis expansions, Gaussian processes and splines (Wahba, 1990; Denison, Holmes, Mallick, Smith, 2002), wavelets (Vidakovic, 2009), neural networks (Neal, 1996), ...

* Yet, the mean may be too poor a summary of the relationship between $x$ and $y$. ⇒ median, quantile regression, ..., density regression: $x \to f(y \mid x)$

Limited literature on optimal estimators of $f(y \mid x)$ (Efromovich, Ann. Statist., 2007). How about Bayesian methods?

Page 23:

Random or fixed design

• regression: random design. $(X_i, Y_i)$, $i = 1, \ldots, n$, are a random sample from $f(x, y)$. Then, estimate the joint density $f(x, y)$ and from this the conditional $f(y \mid x) = f(x, y)/f_x(x)$.

• fixed design: $x_1, \ldots, x_n$ is a deterministic sequence. If the predictor is $x$, $Y \sim f(y \mid x)$; in fact, $f_x(y)$.

  • replicates of $y$ values at a given $x$ ($x$ typically categorical or ordinal), e.g. ANOVA;

  • no replicates ($x$ typically continuous): still interest in $f_x(y)$, borrowing strength through smoothness conditions along $x$.

Page 24:

contents

1 Preliminaries on BNP regression

2 Random design: DP mixtures for $f(x, y)$
  Example: Improving prediction by Enriched DP mixtures

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

Page 25:

Random design: DP mixtures for $f(x, y)$

$(X_i, Y_i) \mid f \overset{iid}{\sim} f(x, y)$, $f \sim$ prior probability law.

* Model $f$ as a mixture of kernels:
$$(X_i, Y_i) \mid G \overset{iid}{\sim} f_G(x, y) = \int K(x, y \mid \theta)\, dG(\theta), \qquad G \sim DP(\alpha G_0).$$

Usually, $K(x, y \mid \theta) = N_{p+1}(x, y \mid \mu, \Sigma)$, with $\theta = (\mu, \Sigma)$ and $G_0(\theta)$ a conjugate prior.

Since, a.s., $G = \sum_{j=1}^\infty p_j \delta_{\theta_j^*}$, where $(p_j) \sim$ stick-breaking$(\alpha)$, independent of $\theta_j^* \overset{iid}{\sim} G_0$, the mixture above reduces to
$$f_G(x, y) = \sum_{j=1}^\infty p_j\, K(x, y \mid \theta_j^*).$$
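
A minimal sampling sketch of this model, with hypothetical kernel and base-measure choices (fixed covariance $0.25 I$ and $\mu \sim N_2(0, 4I)$, neither taken from the slides), truncating the stick-breaking representation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dp_mixture(n, alpha=1.0, trunc=200):
    """Sample (x_i, y_i) from a truncated DP mixture of bivariate Gaussian
    kernels, f_G(x, y) = sum_j p_j N_2((x, y) | mu_j*, Sigma). For simplicity
    the kernel covariance is fixed at 0.25*I and G0 is mu ~ N_2(0, 4*I)."""
    v = rng.beta(1.0, alpha, size=trunc)
    p = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    mu = rng.normal(scale=2.0, size=(trunc, 2))      # atoms mu_j* from G0
    z = rng.choice(trunc, size=n, p=p / p.sum())     # latent component labels
    return mu[z] + 0.5 * rng.normal(size=(n, 2))     # add kernel noise

xy = sample_dp_mixture(500)
print(xy.shape, xy.mean(axis=0))
```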


Page 28:

mixture of Gaussian kernels


Page 30:

Latent variable representation

The DP mixture model is equivalently expressed as
$$(X_i, Y_i) \mid \theta_i \overset{ind}{\sim} K(x, y \mid \theta_i), \qquad \theta_i \mid G \overset{iid}{\sim} G, \qquad G \sim DP(\alpha G_0).$$

Integrating the $\theta_i$ out, one gets back the countable mixture model
$$(X_i, Y_i) \mid G \overset{iid}{\sim} f_G(x, y) = \sum_{j=1}^\infty p_j\, K(x, y \mid \theta_j^*).$$

Then the conditional density $f(y \mid x)$ is obtained as
$$f_G(y \mid x) = \frac{\sum_j p_j\, K(x, y \mid \theta_j^*)}{\sum_j p_j\, K(x \mid \theta_j^*)} = \sum_j p_j(x)\, K(y \mid x, \theta_j^*),$$
where $p_j(x) = p_j K(x \mid \theta_j^*) / \sum_{j'} p_{j'} K(x \mid \theta_{j'}^*)$.
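
A sketch of this conditional-density computation for Gaussian kernels, using a small hypothetical mixture (the weights and atoms below are illustrative placeholders, not fitted values):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture: weights p_j and Gaussian atoms theta_j* = (mu_j, Sigma_j)
p = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [3.0, 2.0], [5.0, 5.0]])
Sigma = np.array([[[1.0, 0.6], [0.6, 1.0]],
                  [[1.0, -0.3], [-0.3, 0.5]],
                  [[0.5, 0.2], [0.2, 1.0]]])

def conditional_density(y, x):
    """f_G(y | x) = sum_j p_j(x) K(y | x, theta_j*), with
    p_j(x) = p_j K(x | theta_j*) / sum_j' p_j' K(x | theta_j'*)."""
    sxx, sxy, syy = Sigma[:, 0, 0], Sigma[:, 0, 1], Sigma[:, 1, 1]
    kx = norm.pdf(x, loc=mu[:, 0], scale=np.sqrt(sxx))  # x-marginal kernels
    w = p * kx / np.sum(p * kx)                         # local weights p_j(x)
    cond_mean = mu[:, 1] + sxy / sxx * (x - mu[:, 0])   # Gaussian conditionals
    cond_sd = np.sqrt(syy - sxy ** 2 / sxx)
    return np.sum(w * norm.pdf(y, loc=cond_mean, scale=cond_sd))

print(conditional_density(y=1.0, x=2.5))
```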


Page 32:

Random partition

Since $G$ is a.s. discrete, ties in a sample $(\theta_1, \ldots, \theta_n)$ from $G$ have positive probability, so that $(\theta_1, \ldots, \theta_n)$ is described by $(\rho_n; \theta_1^*, \ldots, \theta_{k(\rho_n)}^*)$:

• a random partition $\rho_n = (s_1, \ldots, s_n)$

• the cluster-specific parameters $\theta_j^*$

Ex: for $n = 5$, $\rho_n = (1, 1, 2, 2, 1)$ gives $(\theta_1, \ldots, \theta_n) = (\theta_1^*, \theta_1^*, \theta_2^*, \theta_2^*, \theta_1^*)$: $k_n = 2$ clusters of sizes $n_1 = 3$ and $n_2 = 2$, with cluster-specific parameters $\theta_1^*, \theta_2^*$.

• The DP induces a probability law on the random partition:
$$p(\rho_n) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}\, \alpha^{k_n} \prod_{j=1}^{k_n} \Gamma(n_j)$$

• Given the partition $\rho_n$, the cluster-specific parameters $\theta_j^*$ are i.i.d. $\sim G_0$.
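
As a quick numerical check of this partition law, a small Python sketch ($\alpha = 1$ is just an illustrative choice) evaluating $p(\rho_n)$ for the example above:

```python
from math import gamma, prod
from collections import Counter

def dp_partition_prob(labels, alpha):
    """p(rho_n) = Gamma(alpha)/Gamma(alpha + n) * alpha^k_n * prod_j Gamma(n_j),
    the law the DP induces on the random partition."""
    n = len(labels)
    sizes = Counter(labels).values()   # cluster sizes n_j
    k = len(sizes)
    return (gamma(alpha) / gamma(alpha + n)
            * alpha ** k * prod(gamma(nj) for nj in sizes))

# The slide's example: n = 5, rho_n = (1, 1, 2, 2, 1), so k_n = 2, sizes (3, 2):
print(dp_partition_prob([1, 1, 2, 2, 1], alpha=1.0))  # Gamma(3)Gamma(2)/Gamma(6) = 1/60
```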

Page 33:

mixture of Gaussian kernels


Page 35:

mixture of Gaussian kernels

[Figure: scatter plot of the data in the $(x_1, y)$ plane, with points grouped by mixture component; $x_1$ roughly 0 to 6, $y$ roughly -2 to 8.]

Page 36:

Inference

∗ posterior on $\rho_n$:
$$p(\rho_n \mid x_{1:n}, y_{1:n}) \propto p(\rho_n) \prod_{j=1}^{k_n} \int \prod_{i : (x_i, y_i) \in C_j} K(x_i, y_i \mid \theta_j^*)\, dG_0(\theta_j^*) \;\propto\; p(\rho_n) \prod_{j=1}^{k_n} m(\{(x_i, y_i) \in C_j\} \mid \rho_n),$$
i.e. prior × independent marginal likelihoods in each cluster.

∗ posterior on $(\theta_j^*)$: given the partition, clusters are independent, and inference on $\theta_j^*$ is based only on the observations in group $C_j$:
$$p(\theta_1^*, \ldots, \theta_{k_n}^* \mid x_{1:n}, y_{1:n}, \rho_n) = \prod_{j=1}^{k_n} p(\theta_j^* \mid \rho_n, \{(x_i, y_i) \in C_j\}).$$
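
As a minimal illustration of the "prior × marginal likelihoods" factorization, here is a sketch of the log marginal likelihood of one cluster, for one-dimensional observations under a standard Normal-Inverse-Gamma conjugate model (the hyperparameter values are placeholders; the slides use multivariate $(x, y)$ kernels instead). Combined with `dp_partition_prob` above, it gives the unnormalized log posterior of a partition.

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal_nig(y, m0=0.0, k0=1.0, a0=2.0, b0=1.0):
    """Log marginal likelihood m(y) of one cluster of 1-d observations under
    y_i | mu, s2 ~ N(mu, s2), mu | s2 ~ N(m0, s2/k0), s2 ~ InvGamma(a0, b0)."""
    y = np.asarray(y, dtype=float)
    m = len(y)
    ybar = y.mean()
    kn = k0 + m
    an = a0 + m / 2.0
    bn = b0 + 0.5 * np.sum((y - ybar) ** 2) + k0 * m * (ybar - m0) ** 2 / (2.0 * kn)
    return (-0.5 * m * log(2.0 * pi) + 0.5 * (log(k0) - log(kn))
            + lgamma(an) - lgamma(a0) + a0 * log(b0) - an * log(bn))

# Unnormalized log posterior of a partition: log p(rho_n) plus the sum of
# log_marginal_nig over the clusters it defines.
print(log_marginal_nig([0.1, -0.2, 0.3]))
```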


Page 38:

Predictive distribution (density estimate)

With quadratic loss, the Bayesian density estimate is the predictive density
$$(X_{n+1}, Y_{n+1}) \mid x_{1:n}, y_{1:n} \sim \hat f(x, y) = E(f(x, y) \mid x_{1:n}, y_{1:n}).$$

∗ Recalling that $G \mid \theta_{1:n}, x_{1:n}, y_{1:n} \sim DP(\alpha G_0 + \sum_{i=1}^n \delta_{\theta_i})$,
$$\hat f(x, y) = E\big(E(f_G(x, y) \mid \theta_{1:n}) \mid x_{1:n}, y_{1:n}\big) = \frac{\alpha}{\alpha + n}\, f_{G_0}(x, y) + \frac{n}{\alpha + n}\, E\Big(\frac{1}{n}\sum_{i=1}^n K(x, y \mid \theta_i) \,\Big|\, x_{1:n}, y_{1:n}\Big),$$
an average of the prior guess $f_{G_0}$ and the expectation of a kernel estimate with kernels centered at the $\theta_i$.

Page 39:

conditional density estimate

∗ Since $(\theta_1, \ldots, \theta_n) \leftrightarrow (\rho_n, \theta_1^*, \ldots, \theta_k^*)$, the joint density estimate is
$$\hat f(x, y) = \frac{\alpha}{\alpha + n}\, f_{G_0}(x, y) + \sum_{\rho_n} \sum_{j=1}^{k(\rho_n)} \frac{n_j(\rho_n)}{\alpha + n}\, f\big(x, y \mid (x_i, y_i) \in C_j(\rho_n)\big)\, p(\rho_n \mid x_{1:n}, y_{1:n}),$$
an average of the prior guess $f_{G_0}$ and, given the partition $\rho_n$, of the predictive densities in the clusters $C_j(\rho_n)$.

From $\hat f(x, y)$, one can find an estimate of $f(y \mid x)$.

Page 40:

'clustering' and density estimate

The partition is not of main interest (mixture components just play the role of kernels), but $p(\rho_n \mid x_{1:n}, y_{1:n})$ plays a crucial role.

[Figure: the $(x_1, y)$ scatter plot again, with points grouped by the inferred clustering; $x_1$ roughly 0 to 6, $y$ roughly -2 to 8.]

Such a role of the prior and posterior distribution of the random partition is often overlooked.

Page 41:

How to compare nonparametric priors?

Frequentist properties. There are results on frequentist asymptotic properties for multivariate density estimation, and some results for regression and conditional density estimation: Wu & Ghosal (2008; 2010), Tokdar (2011), Norets & Pelenis (2012), Shen, Tokdar & Ghosal (2013), Canale & Dunson (2015), Bhattacharya, Pati, Dunson (2014) (anisotropic), Norets & Pati (2016+), ...

But how about (Bayesian) finite-sample properties and predictive performance?

Different nonparametric priors may all be consistent, yet have quite diverse predictive performance! The implied distribution on the random partition plays a crucial role.

Example 1. Wade, Walker & Petrone (Scand. J. Statist., 2014) introduce a restricted partition model to overcome difficulties of DP mixtures in curve fitting. Their proposal clearly leads to improved prediction.


Page 44:

Problem: anisotropic case

$X$ continuous predictors; Gaussian kernels. The DP mixture of multivariate Gaussian distributions uses joint clusters to fit the density $f(x, y)$.

BUT, the conditional density $f_{y|x}$ and the marginal density $f_x$ might have different smoothness; in regression, typically $f_{y|x}$ is smoother than $f_x$.

Here, many small clusters (kernels) are needed to fit the $f_x$ density, while far fewer kernels would suffice for $f_{y|x}$. If the dimension of $x$ is large, the likelihood is dominated by the $x$ component and many small clusters are suggested by the posterior on $\rho_n$. This impoverishes the performance of the model.


Page 47:

This undesirable behavior does not seem to vanish with increasing sample size.

If $f_x(x)$ requires many clusters, the unappealing behavior of the random partition could be reflected in worse convergence rates. Efromovich (2007) shows that if the conditional density is smoother than the joint, it can be estimated at a faster rate.

Thus, improving inference on the random partition, to take into account the different degrees of smoothness of $f_x$ and $f_{y|x}$, is crucial.

Page 48:

Improving prediction

Consider the Dirichlet mixture of Gaussian kernels
$$(X_i, Y_i) \mid G \sim \int N_{p+1}(\mu, \Sigma)\, dG(\mu, \Sigma), \qquad G \sim DP(\alpha G_0).$$

The base measure of the DP, $G_0(\mu, \Sigma)$, is usually Normal-Inverse-Wishart. BUT, this conjugate prior is restrictive if $p$ is large.

Page 49:

Improving prediction

• Write the kernels as
$$N_{p+1}(x, y \mid \mu, \Sigma) = N_p(x \mid \mu_x, \Sigma_x)\; N(y \mid x'\beta, \sigma^2_{y|x})$$
and use simple spherical $x$-kernels (Shahbaba and Neal, 2009; Hannah et al., 2011). Thus,
$$f_x(x \mid P) = \sum_{j=1}^\infty w_j\, N_p\big(x \mid \mu_{x,j}^*, \mathrm{diag}(\sigma^{2*}_{x_1,j}, \ldots, \sigma^{2*}_{x_p,j})\big),$$
$$f(y \mid x, P) = \sum_{j=1}^\infty w_j(x)\, N(y \mid x'\beta_j^*, \sigma^{2*}_{y|x,j}).$$

• Denote the $x$-parameters by $\phi = ((\mu_{x_1}, \ldots, \mu_{x_p}), (\sigma^2_{x_1}, \ldots, \sigma^2_{x_p}))$ and the $y|x$-parameters by $\theta = (\beta, \sigma^2_{y|x})$. Assign independent conjugate priors $G_{0,\phi}(\phi)$ and $G_{0,\theta}(\theta)$ (this leads to an enriched conjugate prior for $(\mu, \Sigma)$).

• In the DP mixture model, assume the individual-specific $(\phi_i, \theta_i)$ are a random sample from $G \sim DP(\alpha\, G_{0,\phi}\, G_{0,\theta})$ (a sketch of the resulting conditional density follows below).
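
A minimal sketch of how this factorized kernel is evaluated, with hypothetical parameter values standing in for a fitted (truncated) mixture: the local weights $w_j(x)$ come from the spherical $x$-kernels, and each $y$-kernel is a component-specific linear regression.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical truncated mixture with J components and p covariates.
J, p = 10, 3
w = rng.dirichlet(np.ones(J))           # mixture weights w_j
mu_x = rng.normal(size=(J, p))          # spherical x-kernel means
sd_x = np.ones((J, p))                  # x-kernel standard deviations
beta = rng.normal(size=(J, p))          # component regression coefficients
sd_y = np.full(J, 0.5)                  # component residual sd, sigma_{y|x,j}

def f_y_given_x(y, x):
    """f(y | x, P) = sum_j w_j(x) N(y | x' beta_j, sigma_{y|x,j}^2), with
    w_j(x) proportional to w_j * prod_h N(x_h | mu_{x_h,j}, sigma_{x_h,j}^2)."""
    logk = norm.logpdf(x, loc=mu_x, scale=sd_x).sum(axis=1)  # log x-kernels
    wx = w * np.exp(logk - logk.max())
    wx /= wx.sum()                                           # local weights w_j(x)
    return float(np.sum(wx * norm.pdf(y, loc=beta @ x, scale=sd_y)))

print(f_y_given_x(0.3, rng.normal(size=p)))
```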


Page 52:

Joint DP mixture model

$$Y_i \mid x_i, \beta_i, \sigma^2_{y,i} \overset{ind}{\sim} N(\beta_i' x_i, \sigma^2_{y|x,i}), \qquad \theta_i = (\beta_i, \sigma^2_{y|x,i}),$$
$$X_i \mid \mu_i, \sigma^2_{x,i} \overset{ind}{\sim} \prod_{h=1}^p N(\mu_{x,h,i}, \sigma^2_{x,h,i}), \qquad \phi_i = (\mu_{x,i}, \sigma^2_{x,i}),$$
$$(\theta_i, \phi_i) \mid G \overset{iid}{\sim} G, \qquad G \sim DP(\alpha\, G_{0\theta} \times G_{0\phi}),$$
with $G_{0\theta}$ and $G_{0\phi}$ independent conjugate Normal-Inverse-Gamma priors.

Page 53:

behavior for large p

The model is flexible, and MCMC computations are standard.

BUT, if $p$ is large, many (independent) kernels will typically be needed to describe the (dependent) marginal $f_x$, while the relationship $Y \mid x$ can be smoother.

However, the DP only allows joint clusters of $(\phi_i, \theta_i)$, $i = 1, \ldots, n$.

Given its crucial role, difficulties in the random partition have relevant consequences on prediction. We would like a prior on $G$ that allows many $\phi$-clusters, to fit $f_x$, but fewer $\theta$-clusters, and that is still conjugate, so that computations remain simple.

⇒ Enriched Dirichlet Process (Wade, Mongelluzzo, Petrone, 2011)


Page 55:

Enriched Dirichlet Process

Extends the idea that leads from the Dirichlet distribution (conjugate for the multinomial) to the enriched (or generalized) Dirichlet distribution (Connor & Mosimann, 1969), and from the DP to Doksum's (1974) neutral-to-the-right priors for random probability measures on the real line.

EDP: assign a prior for the random probability measure $G(\theta, \phi)$ by assuming:

• $P_\theta \sim DP(\alpha_\theta P_{0,\theta})$

• for any $\theta$, $P_{\phi|\theta} \sim DP(\alpha_\phi(\theta) P_{0,\phi|\theta})$

all independent.

The EDP gives a nested random partition, $\rho_n = (\rho_{n,\theta}, \rho_{n,\phi})$, which allows many $\phi$-clusters inside each $\theta$-cluster.

Page 56:

EDP: nested random partition

$\rho_n = (\rho_{n,\theta}, \rho_{n,\phi})$: many $\phi$-clusters inside each $\theta$-cluster.

• $P_\theta \sim DP(\alpha P_{0\theta})$ gives a Chinese restaurant process: customers choose restaurants, and restaurants are colored with colors $\theta_h^* \overset{iid}{\sim} P_{0\theta}$ (nonatomic);

• $P_{\phi|\theta} \sim DP(\alpha_\phi(\theta) P_{0\phi|\theta})$ gives a nested CRP: within each restaurant, customers sit at tables as in the CRP. Tables in restaurant $\theta_h^*$ are colored with colors $\phi_{j|h}^* \overset{iid}{\sim} P_{0,\phi|\theta}(\phi \mid \theta)$ (a small simulation sketch follows below).
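
A small simulation sketch of this nested restaurant/table metaphor (concentration values are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)

def nested_crp(n, alpha_theta=1.0, alpha_phi=1.0):
    """Simulate the EDP's nested partition: each customer picks a 'restaurant'
    (theta-cluster) by a CRP(alpha_theta), then a 'table' (phi-cluster) by an
    independent CRP(alpha_phi) within that restaurant."""
    rest, table = [], []      # restaurant and (within-restaurant) table labels
    rest_sizes = []           # customers per restaurant
    table_sizes = []          # per-restaurant list of table sizes
    for _ in range(n):
        probs = np.array(rest_sizes + [alpha_theta])
        r = rng.choice(len(probs), p=probs / probs.sum())
        if r == len(rest_sizes):                 # open a new restaurant
            rest_sizes.append(0)
            table_sizes.append([])
        rest_sizes[r] += 1
        tprobs = np.array(table_sizes[r] + [alpha_phi])
        t = rng.choice(len(tprobs), p=tprobs / tprobs.sum())
        if t == len(table_sizes[r]):             # open a new table
            table_sizes[r].append(0)
        table_sizes[r][t] += 1
        rest.append(r)
        table.append(t)
    return rest, table

r, t = nested_crp(200)
print(len(set(r)), "theta-clusters,", len(set(zip(r, t))), "phi-clusters overall")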


Page 58:

Joint EDP mixture model

• Model (replace the DP with an EDP):
$$Y_i \mid x_i, \beta_i, \sigma^2_{y,i} \overset{ind}{\sim} N(\beta_i' x_i, \sigma^2_{y|x,i}), \qquad \theta_i = (\beta_i, \sigma^2_{y|x,i}),$$
$$X_i \mid \mu_i, \sigma^2_{x,i} \overset{ind}{\sim} \prod_{h=1}^p N(\mu_{x,h,i}, \sigma^2_{x,i,h}), \qquad \phi_i = (\mu_{x,i}, \sigma^2_{x,i}),$$
$$(\theta_i, \phi_i) \mid G \overset{iid}{\sim} G, \qquad G \sim EDP(\alpha_\theta, \alpha_\phi(\cdot), G_{0,\theta} \times G_{0,\phi|\theta}).$$

• Computations remain simple, as the EDP is a conjugate prior;

• Inference on a cluster-specific $\theta_j^* = (\beta_j^*, \sigma^*_{y|x,j})$ (thus, ultimately, on the conditional density $f(y \mid x, \theta)$) exploits the information from the observations in all the $\phi^*$-clusters that share the same $\theta_j^*$ ⇒ much improved inference and prediction.

(Wade, Dunson, Petrone, Trippa, JMLR, 2014)

Page 59:

Simulation study

Toy example to demonstrate two key advantages of the EDP model:

• it can recover the true, coarser $\theta$-partition;

• improved prediction and smaller credible intervals result.

Data: 200 observations $(x_i, y_i)$, where the covariates $X_i$ are sampled from a $p$-variate normal,
$$X_i = (X_{i,1}, \ldots, X_{i,p})' \overset{iid}{\sim} N(\mu, \Sigma),$$
centered at $\mu = (4, \ldots, 4)'$ with $\Sigma_{h,h} = 4$ for $h = 1, \ldots, p$, and covariances that model two groups of covariates with different correlation structure.

The true regression only depends on the first covariate; it is a nonlinear regression obtained as a mixture
$$Y_i \mid x_i \overset{ind}{\sim} p(x_{i,1})\, N(y_i \mid \beta_{1,0} + \beta_{1,1} x_{i,1}, \sigma_1^2) + (1 - p(x_{i,1}))\, N(y_i \mid \beta_{2,0} + \beta_{2,1} x_{i,1}, \sigma_2^2)$$
(a data-generating sketch in this spirit follows below).
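
A data-generating sketch matching this description. The covariance blocks, the regression coefficients, and the mixing function $p(x_1)$ below are all hypothetical stand-ins, since the slide does not report their values:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(4)

def simulate(n=200, p=5):
    """Simulate (X, y) in the spirit of the slide's toy example; all specific
    numeric choices here are placeholders."""
    mu = np.full(p, 4.0)
    Sigma = 4.0 * np.eye(p)
    Sigma[:2, :2] += 1.0 - np.eye(2)      # toy correlation in a first group
    X = rng.multivariate_normal(mu, Sigma, size=n)
    px1 = expit(X[:, 0] - 4.0)            # mixing weight p(x_{i,1})
    comp1 = rng.random(n) < px1
    y = np.where(comp1,
                 1.0 + 0.5 * X[:, 0] + 0.3 * rng.normal(size=n),
                 -1.0 + 2.0 * X[:, 0] + 0.5 * rng.normal(size=n))
    return X, y

X, y = simulate()
print(X.shape, round(float(y.mean()), 2))
```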


Page 61:

prediction error

Table: Prediction error for both models as p increases (l̂1, l̂2: estimated ℓ1 and ℓ2 errors).

        p = 1         p = 5         p = 10        p = 15
        l̂1     l̂2     l̂1     l̂2     l̂1     l̂2     l̂1     l̂2
DP      0.03   0.05   0.16   0.20   0.25   0.34   0.26   0.34
EDP     0.04   0.05   0.06   0.10   0.09   0.16   0.12   0.21

Page 62:

Real data: Alzheimer's disease diagnosis

Data were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, which is publicly accessible at UCLA's Laboratory of Neuroimaging.

covariates: summaries of $p = 15$ brain structures computed from structural MRI obtained at the first visit for 377 patients, of whom 159 have been diagnosed with AD and 218 are cognitively normal (CN).

response: $Y_i = 1$ (cognitively normal subject) or $Y_i = 0$ (diagnosed with AD).

Aim: prediction of AD status.

Page 63:

Extension of the model to binary response

The model is extended to a local probit model:
$$Y_i \mid x_i, \beta_i \overset{ind}{\sim} \mathrm{Bern}(\Phi(x_i'\beta_i)), \qquad X_i \mid \mu_i, \sigma_i^2 \overset{ind}{\sim} \prod_{h=1}^p N(\mu_{i,h}, \sigma^2_{i,h}),$$
$$(\beta_i, \mu_i, \sigma_i^2) \mid G \overset{iid}{\sim} G, \qquad G \sim Q.$$

First, a DP prior for $G$: $G \sim DP(\alpha, G_{0\beta} \times G_{0\psi})$, with $G_{0\beta} = N(0_p, C^{-1})$ and $G_{0\psi}$ a product of $p$ normal-inverse gammas. We let $\alpha \sim \mathrm{Gamma}(1, 1)$.

EDP prior for $G$: correlation between the measurements of the brain structures and non-normal univariate histograms of the covariates suggest that many Gaussian kernels with local independence will be needed to approximate the density of $X$. The conditional density of the response, on the other hand, may not be so complicated. This motivates the choice of an EDP prior. (A sketch of the local-probit predictive probability follows below.)
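
A sketch of the predictive probability under this local probit mixture. The mixture below is a hypothetical stand-in for one posterior draw (all numeric values are placeholders):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# Hypothetical mixture standing in for one posterior draw:
# J components, p = 15 covariates as in the ADNI example.
J, p = 8, 15
w = rng.dirichlet(np.ones(J))
mu = rng.normal(size=(J, p))
sd = np.ones((J, p))
beta = rng.normal(scale=0.3, size=(J, p))

def prob_y1(x):
    """p(y = 1 | x) = sum_j w_j(x) Phi(x' beta_j): the local probit mixture,
    with w_j(x) proportional to w_j * prod_h N(x_h | mu_{j,h}, sd_{j,h}^2)."""
    logk = norm.logpdf(x, loc=mu, scale=sd).sum(axis=1)
    wx = w * np.exp(logk - logk.max())
    wx /= wx.sum()
    return float(np.sum(wx * norm.cdf(x @ beta.T)))

print(prob_y1(rng.normal(size=p)))
```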

Page 64:

Prediction for 10 new subjects

[Figure: two panels, (a) DP and (b) EDP, plotting $p(y = 1 \mid x)$ against subject index for 10 new subjects.]

Figure: Predicted probability of being healthy against subject index for 10 new subjects, represented by circles (DP in blue and EDP in red), with the true outcome as black stars. The bars about the prediction depict the 95% credible intervals.

Page 65:

contents

1 Preliminaries on BNP regression

2 Random design: DP mixtures for $f(x, y)$

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

Page 66:

Fixed design: Conditional models

Sample $(x_i, Y_{i,\nu})$, $\nu = 1, \ldots, n_i$; $x$ a fixed input. Interest is in $f_x(y)$ (no longer a conditional $f(y \mid x)$).

Joint DP mixture models are used in this context, too. Yet, they unnecessarily require modeling the marginal density $f_x(x)$.

In a Bayesian approach, one wants to assign a prior on the set of random probability measures $\{F_x(\cdot), x \in \mathcal{X}\}$. The random measures $F_x$ must be dependent, as we want some smoothness along $x$, for borrowing strength; and have a given marginal distribution, e.g. $F_x \sim DP$.

Early proposals: Cifarelli & Regazzini (1978), extending mixtures of DP (Antoniak, 1974).

Influential idea: dependent Dirichlet processes (DDP), MacEachern (1999; 2000), based on dependent stick-breaking constructions.


Page 69:

DDP mixture models

Use a mixture model for $f_x(y)$:
$$Y_{i,\nu} \mid x, G_x \overset{ind}{\sim} f_{G_x}(y) = \int K(y \mid x, \theta)\, dG_x(\theta),$$
where the mixing distribution $G_x$ is indexed by $x$.

A DDP prior on $\{G_x, x \in \mathcal{X}\}$ assumes that $G_x \sim DP(\alpha(x) G_{0,x})$, with dependence introduced through a dependent stick-breaking construction:
$$G_x(\cdot) = \sum_{j=1}^\infty w_j(x)\, \delta_{\theta_j^*(x)}(\cdot)$$
where, for each $j$:

• $(w_j(x), x \in \mathcal{X})$ is a stochastic process with the stick-breaking construction $w_1(x) = v_1(x)$, $w_2(x) = v_2(x)(1 - v_1(x))$, ..., where $(v_j(x), x \in \mathcal{X})$ is a stochastic process with marginals $v_j(x) \sim \mathrm{Beta}(1, \alpha(x))$, and the $v_j(\cdot)$ are independent across $j$;

• $(\theta_j^*(x), x \in \mathcal{X})$ is a stochastic process with marginals $G_{0,x}$. The processes $\theta_j^*(\cdot)$ are independent across $j$, and independent of the $v_j(\cdot)$.


Page 71:

Conditional approach: Single weights

The DDP allows both the mixing weights and the atoms to depend on $x$. But this is redundant, and one typically considers either 1) models with single weights or 2) models with covariate-dependent weights.

Single weights: assume $w_j(x) = w_j$, with flexible $\theta_j(x)$:
$$f_{G_x}(y \mid x) = \sum_{j=1}^\infty w_j\, K(y \mid \theta_j(x)).$$

• Ex. $K(y \mid \theta_j(x)) = N(y \mid \mu_j(x), \sigma_j^2)$ with $\mu_j \overset{iid}{\sim}$ GP.

• Popular because inference relies on established algorithms for BNP mixtures.

Page 72:

Conditional approach: Covariate-dependent weights

Covariate-dependent weights: flexible $w_j(x)$, with $\theta_j(x) = \theta_j$:
$$f_{P_x}(y \mid x) = \sum_{j=1}^\infty w_j(x)\, K(y \mid \theta_j, x),$$
for example $K(y \mid \theta_j, x) = N(y \mid \beta_j' x, \sigma_j^2)$.

• Most techniques to define $w_j(x)$ such that $\sum_j w_j(x) = 1$ use a stick-breaking approach:
$$w_1(x) = v_1(x), \quad\text{and for } j > 1,\quad w_j(x) = v_j(x) \prod_{j' < j} (1 - v_{j'}(x));$$

• Proposals for $v_j(x)$ include Griffin and Steel (2006), Dunson and Park (2008), the normalized weights model (Antoniano, Wade, Walker, 2014), probit stick-breaking (Rodriguez and Dunson, 2011), logit stick-breaking (Ren, Du, Carin & Dunson, 2011; Rigon & Durante, 2017+), ... (a logit stick-breaking sketch follows below).

How can we compare?
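
A minimal sketch of covariate-dependent stick-breaking of the logit type, with a scalar covariate and hypothetical coefficients $a_j, b_j$ (a truncated toy, not any particular fitted model):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(6)

# Hypothetical logit stick-breaking: v_j(x) = sigmoid(a_j + b_j * x).
J = 6
a = rng.normal(size=J)
b = rng.normal(size=J)

def sb_weights(x):
    """w_1(x) = v_1(x); w_j(x) = v_j(x) * prod_{l<j} (1 - v_l(x))."""
    v = expit(a + b * x)
    v[-1] = 1.0                    # close the last stick (finite truncation)
    return v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

for x in (-2.0, 0.0, 2.0):
    print(x, np.round(sb_weights(x), 3))   # the weights shift with the covariate
```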

Page 73:

Model comparison

The EDP example shows a clear improvement in predictive performance. We would like to formally express the gain of information that a model/prior can provide.

So far:

• Careful and detailed study of the information and assumptions introduced through the prior/model, and of their analytic implications on the predictive distribution;

• comparisons on simulated and real data.


Page 75:

comparisons

Some findings (Peruzzi, Petrone & Wade, 2016+):

• The 'single weights' mixture model needs flexible atoms to give reasonable prediction. But this has drawbacks, including identifiability issues.

• Covariate-dependent weights allow local selection of the clustering, and generally better prediction. In particular, Peruzzi & Wade (2016+) compared the kernel stick-breaking model (Dunson & Park, 2008) and the normalized weights model (Antoniano, Wade, Walker, 2014). The latter appears to allow faster computations.

Page 76:

Computations

A substantial challenge for comparison is computation: one needs an algorithm that is fast and can be fairly easily adapted to different models.

Peruzzi & Wade (2016+) suggest a clever algorithm, based on adaptive truncation and Metropolis-Hastings, building on a proposal by Griffin (2016).

This is a necessary basis for providing R packages for density regression and conditional density estimation.

Page 77:

contents

1 Preliminaries on BNP regression

2 Random design: DP mixtures for $f(x, y)$

3 Fixed design: Dependent stick-breaking mixture models

4 discussion

Page 78:

summary and discussion

I tried to give an overview of proposals for Bayesian density regression, based on mixture models.

Careful study of the finite-sample properties of these models is needed for comparison.

Together with flexible, easily exportable computational tools, these are necessary steps toward providing R packages for density regression.

Page 79:

references

• Wade, S., Mongelluzzo, S. & Petrone, S. (2011). An enriched conjugate prior for Bayesian nonparametric inference. Bayesian Analysis, 3, 359–386.

• Wade, S., Dunson, D., Petrone, S. & Trippa, L. (2014). Improving prediction from Dirichlet process mixtures via enrichment. Journal of Machine Learning Research, 1041–1071.

• Wade, S., Walker, S. & Petrone, S. (2014). A predictive study of Dirichlet process mixture models for curve fitting. Scandinavian Journal of Statistics, 41, 580–605.

• Wade, S., Peruzzi, M. & Petrone, S. (2016+). Review of Bayesian nonparametric mixture models for regression. Manuscript.