
Bayesian Nonparametrics: a Soft Introduction

Vanda Inácio

Pontificia Universidad Católica de Chile, Chile

XIII EBEB, February 22, 2016


Outline

1 Introduction/motivation

2 Finite mixture models

3 Dirichlet process mixture models

4 Mixtures of finite Polya trees


Some references (covers from Amazon.com)


Some references

Hanson, Branscum and Johnson (2005). Bayesian nonparametric modeling and data analysis: an introduction. In Handbook of Statistics, Elsevier.

Hanson and Jara (2013). Surviving fully Bayesian nonparametric regression models. In Bayesian Theory and Applications, Oxford University Press, UK.

Müller and Rodriguez (2013). Nonparametric Bayesian Inference. NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 9.

http://projecteuclid.org/euclid.cbms/1362163742

Kottas and Rodriguez (2014). Unpublished lecture notes.

https://users.soe.ucsc.edu/~thanos/notes-1.pdf


Some references

Müller and Quintana (2004). Nonparametric Bayesian data analysis. Statistical Science 19, 95–110.

Müller and Mitra (2013). Bayesian nonparametric inference – why and how. Bayesian Analysis 8, 269–302.


Available software

DPpackage: Bayesian Semi- and Nonparametric Modeling in R

https://cran.r-project.org/web/packages/DPpackage/

Bayesian Regression: Nonparametric and Parametric Models

http://georgek.people.uic.edu/BayesSoftware.html


Why Bayesian nonparametrics? Motivation

The motivation is twofold:

1 Allowing for model flexibility.

2 Safeguarding against model misspecification.

Bayesian nonparametrics rely on parametric baseline models while allowing for data-driven deviations.

So that we can see if parametric models might actually fit by embedding them in nonparametric families (sensitivity analysis).

Because Bayesian nonparametric modeling is feasible due to modern MCMC methods.


Why Bayesian nonparametrics? Motivation

The goal here is to provide a rich class of statistical models for data analysis.

Data distributions can easily be multimodal or have severe skewness.

The normal distribution is still widely used in practice.

But the normal family is symmetric and has only two parameters, one corresponding to location and the other corresponding to dispersion.

Common practice is to log-transform the data. But if, after the log, the data still have multimodality and/or skewness, the normal distribution would not be adequate.

Other parametric distributions are similarly constrained.


Why Bayesian nonparametrics? Motivation

We thus proceed to discuss broader families that allow for flexibility and robustness beyond what is achievable using parametric models.

This goal is accomplished by setting a particular parametric class and expanding it so that there are many more possibilities included.

In the parametric Bayesian approach, given a particular dataset, a family of models is selected.

In the normal case, this involves location and scale parameters.

We then select prior distributions for these two parameters. This induces a prior on the family of distributions.


Why Bayesian nonparametrics? Motivation

In the nonparametric case, we do the same, except that, instead of having a small number of parameters that characterize the family of distributions for the data, conceptually there is an infinite number of parameters.

But since we live in a finite world, practicality dictates that these models must be truncated to have a possibly large but finite number of parameters.

We call them richly parametric as a result.

Thus, nonparametric Bayesian modeling requires a prior distribution on the (potentially) infinite-dimensional parameters.

Technically, this amounts to placing a prior distribution on the space of all distribution functions.


Why Bayesian nonparametrics? Motivation

We consider two approaches:

1 The first involves the specification of a Dirichlet process mixture model for the data distribution.

2 The second approach involves the specification of a mixture of finite Polya trees prior on the space of all distribution functions.


Finite mixture models: Motivation

We start with finite mixture models since they, to a certain extent, provide the background for Dirichlet process mixture models.

The natural question is: why mixture models?

In a diversity of situations, the complexity of the observed data may render the use of a single parametric distribution insufficient for data modeling.

For example, in the context of medical tests, biomarker outcomes for some specific disease may consist of, at least, two subgroups, corresponding to mildly diseased and severely diseased individuals.


Finite mixture models: Motivation

A mixture model assumes that data can be represented by a weighted sum of distributions, with each distribution representing a proportion of the data.

It is worth noting that multimodality is not the sole motivation for the use of mixture models.

For instance, skewed data can also be handled by mixtures.

Of course, for this latter case, one could use a skew distribution, but this would imply that we expect skewness in advance. With the more general mixture model, we can handle this and other nonstandard features of the data without the need to know them in advance.


Finite mixture models: Formulation

Throughout, we consider the particular case of mixtures of normal distributions.

General mixtures of normal densities can approximate any continuous density (e.g., Lo 1984).

We assume thus that the data y = (y1, . . . , yn) follow a mixture of K normal distributions, whose density is given by

f(y | ω, θ) = ∑_{k=1}^{K} ωk φ(y | µk, σk²),  (1)

with ω = (ω1, . . . , ωK) and θ = (µ1, . . . , µK, σ1², . . . , σK²).

Hence, each data point would arise from one of K mixture components, with each component having its own mean and variance.
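As a quick illustration, a minimal R sketch that simulates from and evaluates a density of the form (1), reusing the three-component mixture of the example later in this section:

# R sketch: simulate from and evaluate the mixture density (1).
w  <- c(0.3, 0.3, 0.4); mu <- c(-6, 0, 6); s2 <- c(1, 1, 1)
n  <- 500
z  <- sample(1:3, n, replace = TRUE, prob = w)  # latent component labels
y  <- rnorm(n, mean = mu[z], sd = sqrt(s2[z]))  # observations
grid <- seq(-10, 10, length.out = 200)
f <- sapply(grid, function(u) sum(w * dnorm(u, mu, sqrt(s2))))  # f on a grid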


Finite mixture models: Formulation

Model (1) is called a location-scale mixture of normals, since both the mean (location) and variance (scale) vary across components.

An alternative model would treat the variance as constant across components.

Generally, location-scale mixtures of normals produce more accurate results and also correspond to more realistic representations of the data.


Finite mixture models: Formulation

Model (1) can be equivalently written as

f(y | ω, θ) = ∫ φ(y | µ, σ²) dG(µ, σ²).

Here G is a discrete finite mixing distribution given by

G(·) = ∑_{k=1}^{K} ωk δ_{(µk, σk²)}(·),

where δa denotes a point mass at a.

The likelihood of the data is then

L(ω, θ; y) = ∏_{i=1}^{n} ∑_{k=1}^{K} ωk φ(yi | µk, σk²),

which is not analytically tractable.


Finite mixture models: Data augmentation

This issue is overcome by data augmentation.

To this end, consider the latent variable zi ∈ {1, . . . , K}, with zi = k indicating that observation yi comes from component k, i.e., from the normal component with mean µk and variance σk².

The mixture model can then be viewed hierarchically; the observed data yi is modeled conditionally on zi, and zi is also given a probabilistic specification.

It can then be written as

yi | zi, θ ind.∼ φ(yi | µ_{zi}, σ_{zi}²),
zi | ω iid∼ Mult(ω).

If we marginalize over zi we recover the original mixture formulation.
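Indeed, summing over the K possible values of zi recovers (1):

f(yi | ω, θ) = ∑_{k=1}^{K} Pr(zi = k | ω) φ(yi | µk, σk²) = ∑_{k=1}^{K} ωk φ(yi | µk, σk²).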


Finite mixture models: Likelihood

By introducing z = (z1, . . . , zn), we obtain the complete/augmented-data likelihood, which is given by

L(ω, θ; y, z) = ∏_{i=1}^{n} ω_{zi} φ(yi | µ_{zi}, σ_{zi}²)
             = ∏_{k=1}^{K} ∏_{i: zi=k} ωk φ(yi | µk, σk²)
             = ∏_{k=1}^{K} ωk^{nk} ∏_{i: zi=k} φ(yi | µk, σk²).  (2)

Here nk = ∑_{i=1}^{n} I(zi = k) counts the number of observations allocated to component k.


Finite mixture models: Prior distributions

The model specification is completed by specifying prior distributions for the weights, means, and variances

(ω, µ, σ²) ∼ p(ω, µ, σ²).

We consider prior independence and thus

p(ω, µ, σ²) = p(ω) p(µ) p(σ²).

We will make use of conjugate priors. Specifically, the weights are assigned a Dirichlet distribution

(ω1, . . . , ωK) ∼ Dir(α1, . . . , αK).

In turn, the means and variances are assigned normal and inverse-gamma prior distributions, respectively. That is,

µk ∼ N(aµ, bµ²),  σk² ∼ IG(aσ², bσ²),  k = 1, . . . , K.

Note that placing a prior on the collection {(ωk, µk, σk²)} is equivalent to placing a prior on the finite mixing distribution G.


Finite mixture models: Posterior inference

By combining the likelihood in (2) with this prior specification, the joint posterior distribution is

p(ω, θ | y, z) ∝ L(ω, θ; y, z) p(ω) ∏_{k=1}^{K} p(µk) p(σk²).  (3)

Although the posterior distribution in (3) does not have a recognizable form, all full conditional distributions have simple conjugate forms.

Specifically, for the mean and variance of the components we have

µk | else ∼ N( (aµ/bµ² + ∑_{i: zi=k} yi/σk²) / (1/bµ² + nk/σk²), 1/(1/bµ² + nk/σk²) ),  (4)

σk² | else ∼ IG( aσ² + nk/2, bσ² + ∑_{i: zi=k} (yi − µk)²/2 ),  (5)

for k = 1, . . . , K.


Finite mixture models: Posterior inference

For i = 1, . . . , n, the full conditional distribution for zi is

zi | else ∼ Mult(pi).

Here, pi = (pi1, . . . , piK), with

pik = Pr(zi = k | ω, θ, yi) = ωk φ(yi | µk, σk²) / ∑_{l=1}^{K} ωl φ(yi | µl, σl²).  (6)

Finally, the full conditional of the weights is given by

ω | else ∼ Dir(α1 + n1, . . . , αK + nK).


Finite mixture models: Gibbs sampler algorithm

Algorithm

Set initial values for ω and θ, say ω(0) and θ(0).

for t = 1, . . . ,T , do:

1 for i = 1, . . . , n, and k = 1, . . . , K, compute posterior probabilities of membership using equation (6).

2 for i = 1, . . . , n, sample the latent component indicator,

zi(t) ∼ Mult(pi(t)).

3 Conditional on z(t), update ω,

ω(t) ∼ Dir(α1 + n1(t), . . . , αK + nK(t)),  where nk(t) = ∑_{i=1}^{n} I(zi(t) = k).

4 Conditional on z(t), update µk(t) and (σk(t))², for k = 1, . . . , K, from (4) and (5), respectively.
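A minimal R implementation of the full sweep, on data simulated from the three-component example used later in this section (hyperparameter values illustrative):

# R sketch: Gibbs sampler for a K-component normal mixture, steps 1-4 above.
set.seed(1)
y <- c(rnorm(150, -6, 1), rnorm(150, 0, 1), rnorm(200, 6, 1))
n <- length(y); K <- 3; n.iter <- 2000
a.mu <- 0; b2.mu <- 100  # N(a.mu, b2.mu) prior for the means (illustrative)
a.s2 <- 2; b.s2 <- 1     # IG(a.s2, b.s2) prior for the variances (illustrative)
a.dir <- rep(1, K)       # Dirichlet prior for the weights
mu <- rnorm(K, mean(y), sd(y)); s2 <- rep(var(y), K); w <- rep(1 / K, K)
draws <- matrix(NA, n.iter, 3 * K)
for (t in 1:n.iter) {
  # steps 1-2: membership probabilities (6) and latent indicators
  p <- sapply(1:K, function(k) w[k] * dnorm(y, mu[k], sqrt(s2[k])))
  p <- p / rowSums(p)
  z <- apply(p, 1, function(pr) sample(1:K, 1, prob = pr))
  nk <- tabulate(z, nbins = K)
  # step 3: weights from Dir(a.dir + nk), drawn via normalized gammas
  g <- rgamma(K, a.dir + nk, 1); w <- g / sum(g)
  # step 4: means (4) and variances (5)
  for (k in 1:K) {
    yk <- y[z == k]
    prec <- 1 / b2.mu + nk[k] / s2[k]
    mu[k] <- rnorm(1, (a.mu / b2.mu + sum(yk) / s2[k]) / prec, sqrt(1 / prec))
    s2[k] <- 1 / rgamma(1, a.s2 + nk[k] / 2, b.s2 + sum((yk - mu[k])^2) / 2)
  }
  draws[t, ] <- c(w, mu, s2)
}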


Finite mixture models: Identifiability issues

Due to identifiability issues, such as the so-called label switching problem (Jasra et al. 2005), it makes a difference whether there is interest in making inferences about the mixture component-specific parameters and clustering.

The label switching problem (also known as label ambiguity) refers to the fact that there is nothing in the likelihood to distinguish mixture component k from mixture component k′.

Permuting the K labels in any of the K! ways results in the same model for the data.


Finite mixture models: Identifiability issues

As a concrete example, in the K = 2 case, consider

p1 = 0.3, µ1 = 1, p2 = 0.7, µ2 = 1.5, σ1² = σ2² = 1  (Scenario A).

Then, the model is equivalent to one with

p1 = 0.7, µ1 = 1.5, p2 = 0.3, µ2 = 1, σ1² = σ2² = 1  (Scenario B).

If one is only interested in density estimation, then everything is fine, because, as illustrated in the example,

fA(y) = 0.3 φ(y | 1, 1) + 0.7 φ(y | 1.5, 1) = 0.7 φ(y | 1.5, 1) + 0.3 φ(y | 1, 1) = fB(y).

Usually, imposing µ1 < µ2 < · · · < µK and using informative priors helps to mitigate the identifiability issues.


Finite mixture models: How to choose K?

A drawback of finite mixture models is that we must choose the number of components K, which is a non-trivial task in general.

One possibility is placing a prior on K, which leads to a model that changes dimension (number of parameters) with K.

One approach to fitting such a model is to use reversible-jump-type algorithms, but these tend to be difficult to implement efficiently in practice.


Finite mixture models: How to choose K?

Another possibility is to fit the model for different values of K and assess the adequacy of the fit, through the LPML criterion, for instance.

Choosing K too small is much worse than choosing K too large.

For instance, one cannot get a trimodal density out of a mixture of two components, but one can essentially ignore extra mixture terms and get bimodal densities from mixtures of three or more components.

Marin and Robert (2013) contains a good discussion of Bayesian finite mixture models.


Finite mixture models: Example

True data generating distribution: 0.3φ(y | −6, 1) + 0.3φ(y | 0, 1) + 0.4φ(y | 6, 1), n = 500.

[Figure: histograms of the data overlaid with fitted mixture densities for K = 2, 3, 4, and 5.]


Finite mixture models: Example

[Figure: histogram of the data with an overlaid density estimate.]


Dirichlet process mixtures: DP prior

Another alternative is to use a DP prior (Ferguson 1973, 1974) for the mixing distribution G, resulting in a Dirichlet process mixture (DPM) model.

The DP prior for G, on one hand, offers the theoretical advantage of having full weak support on all mixing distributions and, on the other hand, the practical advantage of automatically determining the number of components that best fits a given dataset.

The resulting model is

f(y) ≡ f(y | G) = ∫ k(y | θ) dG(θ),  G ∼ DP(α, G0).

In the context of DPM of normals, k(y | θ) = φ(y | µ, σ²).

The full weak support means that, under mild conditions, any distribution with the same support as G0 can be well approximated weakly by a DP.


Dirichlet process mixtures: DP prior

But what does G ∼ DP(α, G0) mean?

The DP was originally defined by Ferguson (1973, 1974).

The DP generates random probability measures (random distributions) G, defined on some sigma-algebra (collection of subsets) of a sample space Ω, such that the probabilities assigned to any finite partition of Ω follow a Dirichlet distribution.

That is, G ∼ DP(α, G0) if, for any k and any partition B1, . . . , Bk of Ω,

[G(B1), . . . , G(Bk)] ∼ Dir[αG0(B1), . . . , αG0(Bk)].


Dirichlet process mixtures: DP prior

The definition of the Dirichlet process and the properties of the Dirichlet distribution imply that, for any subset B of Ω,

G(B) ∼ Beta(αG0(B), α(1 − G0(B))).

Thus,

E[G(B)] = G0(B),  Var[G(B)] = G0(B)(1 − G0(B)) / (α + 1).

Hence, from the above, we have

G is centered on G0 (G0 is also referred to as the baseline distribution).

α > 0 is a precision parameter controlling the variance of the process. If α is large, G is highly concentrated around G0.

Therefore, the DP prior is centered on a parametric model through the specification of G0, while allowing α to control uncertainty in this choice.


Dirichlet process mixtures: Stick-breaking representation

Although useful, the definition of the DP does not provide an intuition for what realizations of a DP actually look like.

Undoubtedly, the most useful definition of the DP is its constructive definition (Sethuraman, 1994), according to which G has an almost sure representation of the form

G(·) = ∑_{k=1}^{∞} ωk δ_{θk}(·).  (7)

In (7), we have

θk iid∼ G0, k = 1, 2, . . .

ω1 = v1 and, for k ≥ 2, ωk = vk ∏_{l<k} (1 − vl), with vk iid∼ Beta(1, α) (independently of the θk) ⇒ stick-breaking construction.
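To see what a draw of G looks like, here is a minimal R sketch of a truncated version of (7); G0 = N(0, 1), α, and the truncation level are illustrative choices:

# R sketch: approximate draw from DP(alpha, G0) via truncated stick-breaking (7).
alpha <- 1; N <- 200               # illustrative precision and truncation level
v <- rbeta(N, 1, alpha)            # stick-breaking fractions v_k
w <- v * cumprod(c(1, 1 - v[-N]))  # weights w_k = v_k * prod_{l<k}(1 - v_l)
theta <- rnorm(N)                  # atoms theta_k ~ G0 = N(0, 1)
# G puts mass w[k] at theta[k]; evaluate its cdf on a grid:
grid <- seq(-3, 3, length.out = 100)
G <- sapply(grid, function(t) sum(w[theta <= t]))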


Dirichlet process mixtures: Stick-breaking representation

As the stick-breaking construction proceeds, the stick gets shorter and shorter, and the lengths allocated to higher-index atoms decrease stochastically, with the rate of decrease depending on α.

Since vk ∼ Beta(1, α), we have

E(vk) = 1 / (1 + α),

so that values of α close to zero lead to high weight on the first few atoms, with the remaining atoms being assigned small probabilities.


Dirichlet process mixtures: Stick-breaking representation

[Figure: stick-breaking weights ω plotted against the atoms θ for α = 0.5, 1, 5, and 10.]


Dirichlet process mixtures: DP prior trajectories

30 trajectories from a DP with G0 = N(0, 1) (based on 1000 draws).

[Figure: simulated cdf trajectories for α = 1, 5, 50, and 100.]


Dirichlet process mixtures: DP prior trajectories

30 trajectories from a DP with G0 = Beta(2, 10) (based on 1000 draws).

[Figure: simulated cdf trajectories for α = 1, 5, 50, and 100.]


Dirichlet process mixtures: DP prior trajectories

30 trajectories from a DP with G0 = Pois(1) (based on 1000 draws).

[Figure: simulated cdf trajectories for α = 1, 5, 50, and 100.]


Dirichlet process mixtures: Conjugacy of the DP prior

The DP prior is closed under sampling. That is, the posterior distribution is also a DP.

Let y1, . . . , yn | G iid∼ G and G ∼ DP(α, G0). Then

G | y1, . . . , yn ∼ DP( α + n, (α/(α + n)) G0 + (1/(α + n)) ∑_{i=1}^{n} δ_{yi} ).

The posterior mean is then

E(G | y1, . . . , yn) = (α/(α + n)) G0 + (n/(α + n)) ∑_{i=1}^{n} (1/n) δ_{yi}.

Thus, the posterior mean is a weighted average between the centering distribution and the empirical cdf.
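A minimal R sketch of this weighted average, using the setup of the figures that follow (N(3, 1) data, N(0, 1) centering; n and α illustrative):

# R sketch: DP posterior mean cdf = weighted average of G0 and the ecdf.
y <- rnorm(50, 3, 1); n <- length(y); alpha <- 5
grid <- seq(-2, 6, length.out = 200)
post.mean <- alpha / (alpha + n) * pnorm(grid) +  # centering G0 = N(0, 1)
             n / (alpha + n) * ecdf(y)(grid)      # empirical cdf part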


Dirichlet process mixtures: Posterior inference using a DP prior / Bayesian bootstrap

When α → 0, the limiting posterior distribution is

G | y1, . . . , yn ∼ DP( n, (1/n) ∑_{i=1}^{n} δ_{yi} ),

and thus

E(G | y1, . . . , yn) = (1/n) ∑_{i=1}^{n} δ_{yi}.

This limiting posterior distribution is known as the Bayesian bootstrap (Rubin 1981, Gasparini 1995).

A Bayesian bootstrap estimator for G can be computed as

G = ∑_{i=1}^{n} pi δ_{yi},  (p1, . . . , pn) ∼ Dir(n; 1, . . . , 1).  (8)

Again, from (8) it follows that

E(G | y1, . . . , yn) = E( ∑_{i=1}^{n} pi δ_{yi} ) = ∑_{i=1}^{n} E(pi) δ_{yi} = (1/n) ∑_{i=1}^{n} δ_{yi}.


Dirichlet process mixtures: Posterior inference using a DP prior / Bayesian bootstrap

Recall that in the frequentist bootstrap (Efron, 1979), pi ∈ {0, 1/n, . . . , n/n}, corresponding to the number of times yi appears in a bootstrap sample.

Thus, the weights in the Bayesian bootstrap are smoother than those from Efron's frequentist bootstrap.

This is justified by the fact that in the BB the weights arise from a Dirichlet (continuous) distribution, while the weights in the classical bootstrap have a discrete distribution.

Note that in the BB the data should be regarded as fixed, so that we do not resample from it.
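A minimal R sketch of the Bayesian bootstrap (8), here used to obtain the posterior of the mean (data illustrative):

# R sketch: Bayesian bootstrap; Dir(1, ..., 1) weights via normalized Exp(1).
y <- rnorm(50, 3, 1)  # illustrative data, regarded as fixed
B <- 5000
bb.mean <- replicate(B, {
  g <- rexp(length(y))
  p <- g / sum(g)     # (p1, ..., pn) ~ Dir(1, ..., 1)
  sum(p * y)          # the mean of G = sum_i p_i delta_{y_i}
})
quantile(bb.mean, c(0.025, 0.975))  # 95% interval for the mean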


Dirichlet process mixtures: Posterior inference using a DP prior

Estimating the posterior mean under a DP prior using simulated data. The true distribution

generating the data is N(3, 1), while the centering is a N(0, 1) distribution.

[Figure: true cdf, ecdf, posterior mean, and G0 for n = 50 and α = 1, 50, 100.]


Dirichlet process mixtures: Posterior inference using a DP prior

Estimating the posterior mean under a DP prior using simulated data. The true distribution

generating the data is N(3, 1), while the centering is a N(0, 1) distribution.

[Figure: true cdf, ecdf, posterior mean, and G0 for n = 500 and α = 1, 50, 100.]


Dirichlet process mixtures: DPM model

The DP generates flexible albeit discrete distributions.

Due to this fact, and as we have already anticipated, it is commonly used as a prior in mixture models.

The Dirichlet process mixture model can be specified as

f(y) = ∫ k(y | θ) dG(θ),  G ∼ DP(α, G0).

Because G is random, the mixture density f is also random.

Further, the mixture density can be discrete or continuous, univariate or multivariate, depending on the nature of k(y | θ).


Dirichlet process mixtures: DPM model

Here, we focus on Dirichlet process mixtures of normals

f(y) = ∫ φ(y | µ, σ²) dG(µ, σ²),  G ∼ DP(α, G0).  (9)

The centering distribution G0 is defined on R × R+.

For conjugacy reasons, G0 is usually taken to be the normal-inverse-gamma distribution, that is,

G0 ≡ N(aµ, bµ²) IG(aσ², bσ²).

To allow for extra flexibility, hyperpriors can be placed on aµ, bµ², aσ², bσ².

The stick-breaking representation of the DP allows us to write (9) as the following countably infinite mixture of normals

f(y) = ∑_{k=1}^{∞} ωk φ(y | µk, σk²),  (10)

Note that equation (10) resembles the finite mixture model considered earlier, but with the important difference that the number of mixture components is infinite and the weights now follow a stick-breaking construction.


Dirichlet process mixtures: MCMC

Posterior inference can be conducted using (at least) two different kinds of MCMC strategies:

1 Employing a truncation of the stick-breaking representation (Ishwaran and James, 2001, 2002).

2 Using a marginal/collapsed Gibbs sampler where the mixing distribution is integrated out of the model (MacEachern 1998, Neal 2000).

Throughout, approach 1 will be detailed.


Dirichlet process mixtures: Blocked Gibbs sampler

The blocked Gibbs sampler relies on truncating the stick-breaking representation to a finite number of components.

Hence

G_N(·) = ∑_{k=1}^{N} ωk δ_{(µk, σk²)}(·).

The atoms (µk, σk²) are iid G0, i.e., (µk, σk²) iid∼ N(aµ, bµ²) IG(aσ², bσ²), k = 1, . . . , N.

The weights arise through a truncated stick-breaking construction

ω1 = v1 and, for k ≥ 2, ωk = vk ∏_{l<k} (1 − vl), with vk iid∼ Beta(1, α), k = 1, . . . , N − 1,

vN = 1, so that ωN = ∏_{l=1}^{N−1} (1 − vl).


Dirichlet process mixtures: Blocked Gibbs sampler

Ishwaran and James (2001) showed the following bound for the truncation error

4 [ 1 − E( (∑_{k=1}^{N−1} ωk)^n ) ] ≈ 4n exp( (1 − N)/α ).

For instance, if n = 500 and we use a truncation level of N = 20, then for α = 1 we get a bound of ≈ 1.1 × 10⁻⁵.

In practice, N = 20 or N = 50 is commonly chosen as a default.
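The bound is simple to evaluate; a quick R check of the numbers above:

# R sketch: truncation error bound 4 * n * exp((1 - N) / alpha).
trunc.bound <- function(n, N, alpha) 4 * n * exp((1 - N) / alpha)
trunc.bound(n = 500, N = 20, alpha = 1)  # approx 1.1e-05, as quoted above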


Dirichlet process mixtures: Blocked Gibbs sampler

Using the truncated version GN of G, the normal mixture density can be expressed as

f(y) = ∑_{k=1}^{N} ωk φ(y | µk, σk²),

with ωk generated from the truncated stick-breaking representation, whereas µk iid∼ N(aµ, bµ²) and σk² iid∼ IG(aσ², bσ²).

As was the case for the finite mixture model, derivation of the full conditionals for Gibbs sampling involves the data-augmented likelihood.

The means, variances, and latent component indicators are sampled in an identical manner to the finite mixture model.

The main difference is that, unlike in the finite mixture model, uncertainty in the component weights is shifted to v, the inputs into the construction of the stick-breaking weights.


Dirichlet process mixtures: Blocked Gibbs sampler

The full conditional distributions are

µk | else ∼ N( (aµ/bµ² + ∑_{i: zi=k} yi/σk²) / (1/bµ² + nk/σk²), 1/(1/bµ² + nk/σk²) ),  (11)

σk² | else ∼ IG( aσ² + nk/2, bσ² + ∑_{i: zi=k} (yi − µk)²/2 ),  (12)

for k = 1, . . . , N.

For i = 1, . . . , n, the full conditional distribution for zi is

zi | else ∼ Mult(pi),

with pi = (pi1, . . . , piN) and pik = ωk φ(yi | µk, σk²) / ∑_{l=1}^{N} ωl φ(yi | µl, σl²), k = 1, . . . , N.


Dirichlet process mixtures: Blocked Gibbs sampler

For k = 1, . . . , N − 1, the inputs of the stick-breaking weights are updated from

vk | else ∼ Beta( nk + 1, α + ∑_{l=k+1}^{N} nl ).  (13)

Regarding the precision parameter α, it can be fixed at a small value; for instance, α = 1 is widely used in applications.

Alternatively, one can place a prior on α and allow the data to inform about the appropriate value of α.

Letting α ∼ Gamma(aα, bα), the resulting full conditional for α is

α | else ∼ Gamma( aα + N − 1, bα − ∑_{k=1}^{N−1} log(1 − vk) ).  (14)


Dirichlet process mixtures: Blocked Gibbs sampler

Algorithm

Set initial values for α, ω, and θ, say α(0), ω(0), and θ(0).

for t = 1, . . . ,T , do:

1 for i = 1, . . . , n, and k = 1, . . . , N, compute posterior probabilities of membership using equation (6).

2 for i = 1, . . . , n, sample the latent component indicator,

zi(t) ∼ Mult(pi(t)).

3 Simulate stick-breaking inputs v(t) from equation (13).

4 Given v(t), compute ω(t) using the stick-breaking construction.

5 Conditional on z(t), update µk(t) and (σk(t))², for k = 1, . . . , N, from (11) and (12), respectively.

6 Update the precision parameter α from (14).
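Relative to the finite mixture sampler, steps 3, 4, and 6 are new; a minimal R sketch of updates (13) and (14), with hyperparameters aα = bα = 1 as illustrative choices:

# R sketch: one update of the stick-breaking inputs v (13), the weights,
# and the precision parameter alpha (14), given the component counts nk.
update.v.alpha <- function(nk, alpha, a.alpha = 1, b.alpha = 1) {
  N <- length(nk)
  tail.counts <- rev(cumsum(rev(nk)))[-1]  # sum_{l=k+1}^{N} n_l, k = 1..N-1
  v <- c(rbeta(N - 1, nk[-N] + 1, alpha + tail.counts), 1)  # (13), v_N = 1
  w <- v * cumprod(c(1, 1 - v[-N]))        # truncated stick-breaking weights
  alpha <- rgamma(1, a.alpha + N - 1, b.alpha - sum(log(1 - v[-N])))  # (14)
  list(v = v, w = w, alpha = alpha)
}
# e.g., step <- update.v.alpha(nk = tabulate(z, nbins = 20), alpha = 1)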


Dirichlet process mixtures: Blocked Gibbs sampler

A valid concern with the blocked Gibbs sampler is that, by truncating the stick-breaking representation, we are effectively fitting a finite (and hence parametric) mixture model.

Quoting Dunson (2011):

“For example, if we let N = 25 as a truncation level, a natural question is how this is better or intrinsically different than fitting a finite mixture model with 25 components. One answer is that N is not the number of components occupied by the subjects in your sample but is instead an upper bound on the number of components.”


Dirichlet process mixtures: Collapsed Gibbs sampler / Polya urn scheme

The function DPdensity from the R package DPpackage can also be used to fit a DPM of normals.

The MCMC scheme behind this function is a marginalized/collapsed Gibbs sampler.

The collapsed Gibbs sampler avoids specifying a truncation level by marginalizing out G and relying on the Polya urn scheme of Blackwell and MacQueen (1973).

Letting

yi ∼ φ(θi),  θi = (µi, σi²) ∼ G,  G ∼ DP(α, G0),

and marginalizing out G, we obtain the Polya urn predictive rule

p(θi | θ1, . . . , θi−1) ∝ (α/(α + i − 1)) G0(θi) + (1/(α + i − 1)) ∑_{j=1}^{i−1} δ_{θj}(θi).

The Polya urn rule forms the basis of the collapsed Gibbs sampler. For those interested in further details, see Escobar and West (1995), MacEachern (1998), and Neal (2000).
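A minimal R sketch of sequential draws from this predictive rule (G0 = N(0, 1) and α = 1 are illustrative choices); ties among the draws reveal the clustering the DP induces:

# R sketch: sequential draws from the Polya urn predictive rule.
alpha <- 1; n <- 100
theta <- numeric(n)
theta[1] <- rnorm(1)                         # theta_1 ~ G0 = N(0, 1)
for (i in 2:n) {
  if (runif(1) < alpha / (alpha + i - 1)) {
    theta[i] <- rnorm(1)                     # fresh draw from G0
  } else {
    theta[i] <- sample(theta[1:(i - 1)], 1)  # copy an earlier theta_j
  }
}
table(round(theta, 3))                       # repeated values = clusters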


Dirichlet process mixtures: Example

Same data from the finite mixture example. DPM fit (left) and comparison against the fit of a 3-component mixture model, which corresponds to the true data generating mechanism (right).

[Figure: DPM density estimate (left) and 3-component mixture fit (right).]


Dirichlet process mixtures: Example

True data generating mechanism: φ_SN(y | µ = 0, σ² = 1, λ = 8), a skew-normal density.

[Figure: DPM density estimates for n = 150 and n = 500.]


Dirichlet process mixtures: Example

True data generating mechanism: φ(y | 0, 1). DPM fit (blue) against normal fit (red).

[Figure: DPM fit (blue) against normal fit (red) for n = 150 and n = 500.]


Dirichlet process mixtures: Censored responses

The DPM can easily handle censored responses (left, right, or interval).

Remember that, under the hierarchical formulation,

yi | µ, σ², z ∼ φ(yi | µ_{zi}, σ_{zi}²),
...

For instance, if yi is right censored, we know that its true value, say yi∗, is greater than yi, that is, yi∗ > yi.

We can take care of these censored observations by simply adding an extra step in the blocked Gibbs algorithm.

In fact, we can simulate those observations from

yi∗ | yi, zi, µ, σ² ∼ φ(yi∗ | µ_{zi}, σ_{zi}²) I(yi∗ > yi).

This can be accomplished by simulating yi∗ from a truncated normal distribution with lower limit equal to yi.
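A minimal R sketch of this extra step via the inverse-cdf method (censoring time and parameters illustrative):

# R sketch: draw y* from N(mu, s2) truncated to (y, Inf).
rtnorm.lower <- function(y, mu, s2) {
  s <- sqrt(s2)
  u <- runif(1, pnorm(y, mu, s), 1)  # uniform on (F(y), 1)
  qnorm(u, mu, s)                    # invert the normal cdf
}
rtnorm.lower(y = 2, mu = 0, s2 = 1)  # a draw exceeding the censoring time 2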


Dirichlet process mixtures: Censored responses

DPM fit (blue line) against Kaplan–Meier fit (black line). The censored observations are represented by crosses.

[Figure: survival curve estimates.]


Dependent Dirichlet process mixtures: Density regression

So far we have focused on the problem of density estimation.

We will now move to the problem of density regression.

Traditional regression models allow just one or two characteristics (e.g., mean and/or variance) to change as a function of covariates.

Here we will explore tools that allow the entire density/distribution to change as a function of covariates.


Dependent Dirichlet process mixtures: Density regression

Let (x1, y1), . . . , (xn, yn) be regression data, where xi ∈ X ⊆ R^p.

It is assumed that

yi | xi ind.∼ f(· | xi), i = 1, . . . , n.

We specify a probability model for the entire collection of densities F = {f(· | x) : x ∈ X}.

Further, one possibility is to model the conditional density using covariate-dependent mixtures of normal models

f(y | x) = ∫ φ(y | µ, σ²) dGx(µ, σ²).

The probability model for the conditional densities is induced by specifying a prior for the collection of mixing distributions

G_X = {Gx : x ∈ X} ∼ G.

Gx is the random mixing distribution at covariate x, and G is the prior for the collection G_X.


Dependent Dirichlet process mixtures: DDP prior

One possibility for G is the dependent DP (DDP) proposed by MacEachern (1999, 2000), which is built upon the constructive definition of the DP.

In its full generality, the DDP is specified as follows:

Gx(·) = ∑_{k=1}^{∞} ω_{k,x} δ_{θ_{k,x}}(·),  ω_{1,x} = v_{1,x},  ω_{k,x} = v_{k,x} ∏_{l<k} (1 − v_{l,x}).

Here

θ_{1,x}, θ_{2,x}, . . . are realizations of a stochastic process (e.g., a Gaussian process) over X.

v_{1,x}, v_{2,x}, . . . are realizations from a stochastic process on X such that v_{k,x} ∼ Beta(1, αx).


Dependent Dirichlet process mixtures: DDP prior

Because of complications involved in allowing the weights to depend on covariates, the 'single weights' DDP, which assumes fixed weights, is commonly used.

Following De Iorio et al. (2009), a possibility for Gx is

Gx(·) = ∑_{k=1}^{∞} ωk δ_{θ_{k,x}}(·),

where the weights match those from a standard DP and θ_{k,x} = (µ_{k,x}, σk²), with µ_{k,x} = xᵀβk.

Thus, under this formulation, the base stochastic processes are replaced with a base distribution G0 that generates the component-specific regression coefficients and variances.


Dependent Dirichlet process mixtures: DPM of Gaussian regression models

The conditional density can therefore be represented as a DP mixture of Gaussian regression models

f(y | x) = ∫ φ(y | xᵀβ, σ²) dG(β, σ²),  G ∼ DP(α, G0).  (15)

For example, G0 could be Np(µβ, Sβ) IG(aσ², bσ²).

This model is known as the linear dependent Dirichlet process (LDDP) (De Iorio 2004, 2009).

Note that Eq. (15) can be equivalently written as

f(y | x) = ∑_{k=1}^{∞} ωk φ(y | xᵀβk, σk²).


Dependent Dirichlet process mixtures: DPM of Gaussian regression models, spline-based version

The LDDP, although flexible, does not allow for nonlinear effects of the covariates.

An alternative is, instead of considering µ_{k,x} = xᵀβk, to consider an additive formulation based on B-splines, namely

µ_{k,x} = βk0 + ∑_{r=1}^{p} ∑_{s=1}^{Lr} β_{krs} Ψs(xr, dr),

where Ψs(x, d) corresponds to the sth B-spline basis function of degree d evaluated at x.
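A minimal R sketch of such a basis expansion for a single covariate, using splines::bs (degree, number of basis functions, and coefficients are illustrative):

# R sketch: B-spline design matrix and one component's mean function.
library(splines)
x <- runif(200)
Psi <- bs(x, df = 6, degree = 3)  # 200 x 6 matrix of cubic B-spline bases
beta.k <- rnorm(6)                # illustrative component-specific coefficients
mu.kx <- 1 + Psi %*% beta.k       # mu_{k,x} = beta_{k0} + sum_s beta_{ks} Psi_s(x)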

As in the density estimation setup, posterior inference can be conducted using a blocked Gibbs sampler or a collapsed Gibbs sampler.


Dependent Dirichlet process mixtures: Example

Data generating mechanism: yi | xi ∼ 0.5 φ(yi | −3.5 − 1.5xi, 1²) + 0.5 φ(yi | 2 + 3.5xi, 1.25²), xi ∼ U(0, 1), for i = 1, . . . , 500. Results from the fit of an LDDP.

[Figure: estimated conditional densities at x = 0.2, 0.4, and 0.7.]


Mixtures of finite Polya trees: PT prior

We now focus on another popular nonparametric prior for density/distribution estimation. The discussion closely follows Branscum, Johnson, and Baron (2013).

Polya tree priors have been discussed as early as Freedman (1963), Fabius (1964), and Ferguson (1974).

However, the natural starting point for understanding their potential use in modeling data is Lavine (1992, 1994), while Hanson (2006) considered some computational details.

Polya trees are far less popular than DPs, but they are a very powerful tool as well.


Mixtures of finite Polya trees: PT prior

Suppose that a random sample y1, . . . , yn is obtained from an unknown/random distribution F.

Polya tree priors, like Dirichlet process priors, place a distribution on a collection of distributions.

A Polya tree for a distribution F is constructed by dividing the sample space into finer-and-finer disjoint sets using successive binary partitioning.

For instance, the first partition splits the sample space into two non-overlapping intervals.

In the second partition, those two intervals are each split, yielding a finer partition that contains four intervals.

Then, these four intervals are each split to give an eight-interval, third-level partition of the sample space.

At level j of the tree, the sample space is partitioned into 2^j intervals, j = 1, . . .


Mixtures of finite Polya trees: PT prior

Let Bj,k denote the kth interval at level j of the tree, for j = 1, . . ., and k = 1, . . . , 2^j.

Observe that the partitions are nested within one another, starting at the top of the tree and working down.

For example, by definition, B1,1 = B2,1 ∪ B2,2 and Bj−1,1 = Bj,1 ∪ Bj,2, etc.

These intervals can be thought of as the bins of a histogram.


Mixtures of finite Polya trees: Finite PT prior

For a full tree the splitting continues ad infinitum.

However, in practice, we truncate to a fixed level J, hence the term finite Polya tree.

Generally, setting J equal to 4, 5, or 6 often works well in practice.

Another option is to select J so that roughly J = log2 n (Hanson, 2006).


Mixtures of finite Polya trees: Finite PT prior

Informally speaking, the unknown distribution F assigns the data points yi to the intervals at level J of the tree, and the task is to use the observed distribution of the data across the intervals to estimate F.

Although all levels of the tree are important for the purpose of estimating F, of primary importance is level J.

The goal here is to produce a data-driven estimate of F that assigns high probability to intervals that contain lots of data, assigns low probability to empty intervals, and assigns midrange probability to intervals that contain some (but not a lot) of the data.


Mixtures of finite Polya trees: Finite PT prior

Let us consider first the simplest case J = 1.

Then data are assigned to either B1,1 or B1,2.

The probability of a data point yi being assigned to B1,1 is F(B1,1) = Pr(yi ∈ B1,1), which, since F is a cdf and B1,1 is an interval of the form (L, U), is to be interpreted as F(B1,1) = F(U) − F(L).

Denote this unknown probability by π1,1.

Then, by the complement rule, π1,2 = 1 − π1,1 is the probability assigned to set B1,2.

In notation, Pr(yi ∈ B1,1) = F(B1,1) = π1,1 and Pr(yi ∈ B1,2) = F(B1,2) = π1,2.

Since F is unknown, π1,1 and π1,2 are also unknown.


Mixtures of finite Polya trees: Finite PT prior

To help interpret these parameters, suppose the data arise from a right-skewed distribution (lots more data in B1,1 than in B1,2); then π1,1 would be large and hence π1,2 would be small, and vice versa for left-skewed data.

Obviously, in practice, J = 1 is never used because it would lead to a crude estimate of the density function f, much like estimating a density using a relative frequency histogram that contains only two bins.


Mixtures of finite Polya trees: Finite PT prior

Let us now consider, again for simplicity, the case of J = 2. We now have four sets.

Data assignment is based on whether the data point was in B1,1 or B1,2 at the previous level j = 1.

If yi ∈ B1,1, then yi is assigned to interval B2,1 with unknown probability π2,1 or to interval B2,2 with probability π2,2 = 1 − π2,1.

Similarly, define π2,3 and π2,4 (= 1 − π2,3) to be the probabilities of yi being assigned to set B2,3 or B2,4, respectively, given that yi was in B1,2.

The πj,k are conditional parameters, since π2,1 = Pr(yi ∈ B2,1 | yi ∈ B1,1) and π2,3 = Pr(yi ∈ B2,3 | yi ∈ B1,2).

To relate the πj,k to F, we must determine the marginal probability of assignment to the various intervals at level J (= 2).


Mixtures of finite Polya trees: Finite PT prior

Observe that interval B2,1 is nested in interval B1,1, so the marginal probability of interval B2,1 is

F(B2,1) = Pr(yi ∈ B2,1)
        = Pr(yi ∈ B2,1 ∩ B1,1)
        = Pr(yi ∈ B2,1 | yi ∈ B1,1) Pr(yi ∈ B1,1)
        = π2,1 π1,1.

Similar steps lead to F(B2,2) = (1 − π2,1) π1,1, F(B2,3) = π2,3 π1,2, and F(B2,4) = (1 − π2,3) π1,2.

Suppose again that F is right skewed. Then the data will estimate π1,1 to be large, and it will estimate π2,1 to be large, since most of the n data points will be assigned to set B2,1.

Therefore, the estimate of F(B2,1) will be (relatively) large.


Mixtures of finite Polya trees: Finite PT prior

Notice that level 1 has only one unique parameter, π1,1, associated with it, because π1,2 is completely determined by π1,1.

Similarly, level 2 has two unique parameters, π2,1 and π2,3, associated with it.

We can continue the partitioning to any level J.

For J = 3, we add eight conditional probability parameters, π3,1, π3,2, . . . , π3,8, but only four of these are unique.

For instance, π3,1 is the probability that yi is in interval B3,1 given that it is in interval B2,1, and

F (B3,1) = Pr(yi ∈ B3,1)

= Pr(yi ∈ B3,1 ∩ B2,1)

= Pr(yi ∈ B3,1 | yi ∈ B2,1) Pr(yi ∈ B2,1)

= π3,1π2,1π1,1.

In general, we have

F(Bj,k) = ∏_{l=1}^{j} π_{l, Int((k−1)2^{l−j})+1}, j = 1, . . . , J, k = 1, . . . , 2^j.
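To make this bookkeeping concrete, here is a minimal R sketch (helper and variable names are ours and hypothetical, not from DPpackage) that computes F(Bj,k) from the conditional probabilities via the formula above.

## Minimal sketch: `pi` is a list whose j-th element holds the 2^j
## conditional probabilities at level j (even entries = 1 - odd entries).
marginal_prob <- function(pi, j, k) {
  prob <- 1
  for (l in 1:j) {
    idx <- floor((k - 1) * 2^(l - j)) + 1  # index of the level-l ancestor of B_{j,k}
    prob <- prob * pi[[l]][idx]
  }
  prob
}

## Toy check with J = 2: F(B_{2,1}) = pi_{2,1} * pi_{1,1}
pi <- list(c(0.7, 0.3), c(0.8, 0.2, 0.5, 0.5))
marginal_prob(pi, 2, 1)  # 0.8 * 0.7 = 0.56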


Mixtures of finite Polya trees: Finite PT prior

The key point is that if we can estimate all of the πj,k s, then we can estimate the probability that F allocates to each interval at level J.

So far, we have modeled the probability of assignment to each set at level J, but we have not modeled how probability mass is distributed within each interval at level J.

For instance, all the yi s could be clumped together in the center of an interval.

Alternatively, the data could be uniformly distributed, clumped to the right or left side of the interval, or have any other dispersion pattern within each interval.

To address this issue, we model the data according to how a user-specified parametric distribution F0 allocates probability mass within the intervals at level J.

So, as with the DP (or DPM), here with finite Polya trees the user also needs to specify a parametric distribution (and we will see in a few slides that F0 is also a centering distribution).


Mixtures of finite Polya trees: Finite PT prior

The distribution F0 is also used to determine the lower and upper endpoints of all intervals in the tree.

The median of F0 is used to split the sample space into two intervals at level 1 of the tree.

The quartiles of F0 define cut points for intervals at level 2.

Writing the 25th percentile as F0⁻¹(1/4), the median as F0⁻¹(2/4), and the 75th percentile as F0⁻¹(3/4), we have, for a sample space that covers the real line,

B2,1 = (−∞, F0⁻¹(1/4)), B2,2 = (F0⁻¹(1/4), F0⁻¹(2/4)),
B2,3 = (F0⁻¹(2/4), F0⁻¹(3/4)), B2,4 = (F0⁻¹(3/4), ∞).

In general, the (j, k)th interval is

Bj,k = ( F0⁻¹((k − 1)/2^j), F0⁻¹(k/2^j) ), j = 1, . . . , J, k = 1, . . . , 2^j.
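As a quick illustration, a minimal R sketch (assuming F0 = N(0, 1); the function and argument names are hypothetical) reads the endpoints off the quantile function:

## Sketch: endpoints of B_{j,k} from the quantile function of F0.
interval_endpoints <- function(j, k, qF0 = qnorm) {
  c(lower = qF0((k - 1) / 2^j), upper = qF0(k / 2^j))
}

interval_endpoints(2, 1)  # (-Inf, qnorm(1/4))
interval_endpoints(2, 4)  # (qnorm(3/4), Inf)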


Mixtures of finite Polya trees: Finite PT prior

The collection Π = {πj,k : j = 1, . . . , J, k = 1, . . . , 2^j} constitutes the unknown parameters corresponding to F.

The probabilities in Π are assumed mutually independent. That is, for instance, (π2,1, π2,2) and (π2,3, π2,4) are independent.

We thus need to specify a prior distribution over this collection.

Due to the fact that when k is an even number between 2 and 2^j, πj,k = 1 − πj,k−1, priors are needed only on πj,k when k is odd.

Since the πj,k s are probabilities, it is standard to use independent beta priors, specifically

πj,2k−1 ∼ Beta(cρ(j), cρ(j)), j = 1, . . . , J, k = 1, . . . , 2^(j−1).

In most applications ρ(j) = j², as this guarantees an absolutely continuous F (in an infinite tree) (Ferguson, 1974).


Mixtures of finite Polya trees: Finite PT prior

Before proceeding to further considerations about the role of c, let us note that under this parametrization

E[F(Bj,k)] = 1/2^j = F0(Bj,k).

Thus, F0 is the prior expectation of the unknown distribution function F .

F0 is usually selected based on our best prior assessment of the data-generating distribution F.

The parameter c > 0, also referred to as the weight parameter, acts much like the precision parameter α in the Dirichlet process.

As in the Dirichlet process, large values of c lead to realizations of F that are close to F0.

A very low value of c (e.g., c = 0.1) will often lead to an estimate of F that is similar to the empirical cdf. Usually c = 1 works well in practice.

Just like α, c can be regarded as random, with a prior placed on it.


Mixtures of finite Polya trees: Finite PT prior

Once we have selected J, F0, and c, we have all the elements needed to specify a finite Polya tree for F.

The formula for the density function f (our interest here) is given by

f(y) = 2^J p(BJ,k(y)) f0(y). (16)

Here k(y) ∈ {1, . . . , 2^J} identifies the interval at level J containing y, and p(BJ,k(y)) is the probability of that interval (the product of J of the πj,k s) (Hanson, 2006).

The interval that contains y at level J can be determined using the formula k(y) = Int(2^J F0(y) + 1).

Note that the density at stage J is just the product of a weighting factor, 2^J p(BJ,k(y)), and the original user-specified parametric density f0.

Prior or posterior distributions that place high probability on regions around πj,k = 0.5 for all j and k will behave very much like the f0 density.
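Putting (16) and the formula for k(y) together, a minimal R sketch of the density evaluation (assuming F0 = N(µ, σ²) and reusing the hypothetical marginal_prob() helper from before) is:

## Sketch: evaluate f(y) = 2^J p(B_{J,k(y)}) f0(y) for F0 = N(mu, sigma^2).
fpt_density <- function(y, pi, J, mu = 0, sigma = 1) {
  k <- floor(2^J * pnorm(y, mu, sigma) + 1)  # k(y) = Int(2^J F0(y) + 1)
  k <- pmin(pmax(k, 1), 2^J)                 # guard boundary cases F0(y) = 0 or 1
  p <- vapply(k, function(kk) marginal_prob(pi, J, kk), numeric(1))
  2^J * p * dnorm(y, mu, sigma)
}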


Mixtures of finite Polya trees: Finite PT prior

Given Π, F is known.

So, in order to compute posterior estimates of F (f ), we only need to know how to update the πj,k s.

Fortunately, just like the DP, Polya trees also enjoy a simple conjugacy result.

Specifically, if

y1, . . . , yn | F iid∼ F, F ∼ FPTJ(c, F0),

then F | y is updated through

πj,2k−1 | y ind.∼ Beta( cj² + ∑_{i=1}^{n} I(yi ∈ Bj,2k−1), cj² + ∑_{i=1}^{n} I(yi ∈ Bj,2k) ), (17)

for j = 1, . . . , J and k = 1, . . . , 2^(j−1).

In words, we update the Beta parameters by counting the number of observations that fall in each set at each level of the tree.

That is, F | y is a PT with Beta parameters updated through (17).
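In code, the whole conjugate update is a tabulation followed by independent beta draws; a sketch under the same assumptions as before (F0 = N(µ, σ²), ρ(j) = j², hypothetical names):

## Sketch of update (17): count observations per set at each level and
## draw each odd-indexed pi_{j,2k-1} from its Beta full conditional.
update_pi <- function(y, J, c = 1, mu = 0, sigma = 1) {
  pi <- vector("list", J)
  for (j in 1:J) {
    ## which level-j set each observation falls in: Int(2^j F0(y) + 1)
    sets <- pmin(pmax(floor(2^j * pnorm(y, mu, sigma) + 1), 1), 2^j)
    counts <- tabulate(sets, nbins = 2^j)
    pi[[j]] <- numeric(2^j)
    for (k in 1:2^(j - 1)) {
      odd <- rbeta(1, c * j^2 + counts[2 * k - 1], c * j^2 + counts[2 * k])
      pi[[j]][2 * k - 1] <- odd
      pi[[j]][2 * k]     <- 1 - odd
    }
  }
  pi
}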


Mixtures of finite Polya trees: Finite PT prior (example)

[Figure: four panels, x-axis marked at µ − 2σ, µ, µ + 2σ.] FPT density estimates considering F0 = N(µ, σ²) and J = 3. (a) N(µ, σ²) density; (b) j = 1, π1,1 = 0.45; (c) j = 2, π2,1 = 0.4, π2,3 = 0.6; (d) J = 3, π3,1 = 0.3, π3,3 = 0.3, π3,5 = 0.6, π3,7 = 0.3.


Mixtures of finite Polya trees

All densities in the previous figure are too jagged, which turns out to be the result of using a fixed F0.

In fact, one of the major criticisms of Polya trees is that, unlike the DP, inferences are somewhat sensitive to the choice of a fixed partition.

A remedy is to place a prior distribution on the parameters of F0, say θ; we denote the resulting centering distribution by F0,θ to emphasize the dependence on θ.

A prior on θ implies that the start and end points of the sets of the tree are uncertain/random.

This has the effect of smoothing out the abrupt jumps at these points that are noticeable in the previous figure.

In fact, panel (d) of the previous figure also shows the estimate obtained by treating θ = (µ, σ²) as random (dashed line).


Mixtures of finite Polya trees

So, the final model is

y1, . . . , yn | F iid∼ F,

F | c, θ ∼ FPTJ(F0,θ, c),

θ ∼ p(θ),

and it is known as a mixture of finite Polya trees.

It can alternatively be written as

F ∼ ∫ FPTJ(F0,θ, c) p(θ) dθ.

The formula for the density function f is identical to that given in (16).

To conduct posterior inference we will now need to know how to sample θ | y, Π (we already know how to sample Π | y, θ).

We will make this concrete by considering the particular case F0,θ = N(µ, σ²).


Mixtures of finite Polya trees

We center the random F at F0,θ = N(µ, σ²), where θ = (µ, σ²).

Using (16), the likelihood L(Π, θ; y) is

∏_{i=1}^{n} 2^J φ(yi | µ, σ²) p(kθ(J, yi)).

We write p(kθ(J, yi)) instead of p(BJ,k(yi)) to alleviate notation and to make clear the dependence on θ.

As in Branscum et al. (2008), we assume µ ∼ N(aµ, b²µ) and σ ∼ Γ(aσ, bσ).

Assuming further that θ and Π are a priori independent, the joint posterior density is proportional to

p(θ, Π | y) ∝ L(Π, θ; y) p(θ) p(Π).

The full conditionals for µ and σ are not recognizable as belonging to a parametric family; thus these parameters are updated through Metropolis–Hastings steps.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 87 / 95

Page 88: Bayesian Nonparametrics: a Soft Introduction › ~lsanchez › ebeb2016 › slides2vanda.pdf · Bayesian nonparametrics rely on parametric baseline models while allowing for data-driven

Mixtures of finite Polya trees: MCMC

Algorithm

1. µ is updated by sampling µ∗ ∼ N(µ, s1) and accepted with probability

min{ 1, [exp{−0.5 bµ⁻² (µ∗ − aµ)²} / exp{−0.5 bµ⁻² (µ − aµ)²}]
× [∏_{i=1}^{n} p(kµ∗,σ(J, yi)) / ∏_{i=1}^{n} p(kµ,σ(J, yi))]
× [exp{−0.5 σ⁻² ∑_{i=1}^{n} (yi − µ∗)²} / exp{−0.5 σ⁻² ∑_{i=1}^{n} (yi − µ)²}] }.

Here s1 is a tuning parameter that needs to be calibrated to achieve a desirable acceptance rate.

2. σ is updated by sampling σ∗ ∼ Γ(σs2, s2) and accepted with probability

min{ 1, [fΓ(σ∗; aσ, bσ) / fΓ(σ; aσ, bσ)]
× [∏_{i=1}^{n} p(kµ,σ∗(J, yi)) / ∏_{i=1}^{n} p(kµ,σ(J, yi))]
× [σⁿ exp{−0.5 (σ∗)⁻² ∑_{i=1}^{n} (yi − µ)²} / (σ∗)ⁿ exp{−0.5 σ⁻² ∑_{i=1}^{n} (yi − µ)²}]
× [fΓ(σ; σ∗s2, s2) / fΓ(σ∗; σs2, s2)] },

where s2 has the same meaning as s1.

3. Use (17) to update πj,k, for k = 1, . . . , 2^(j−1) and j = 1, . . . , J.
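As an illustration of step 1 only, here is a sketch in R reusing the hypothetical fpt_density() helper; note that the 2^J factors cancel in the ratio, so working with the full density (16) is equivalent to multiplying the p-ratio by the normal likelihood ratio above.

## Sketch of the MH update for mu; s1 is the proposal standard deviation.
mh_update_mu <- function(y, pi, J, mu, sigma, a_mu, b_mu, s1) {
  mu_star <- rnorm(1, mu, s1)
  ## log acceptance ratio = log prior ratio + log likelihood ratio under (16)
  log_ratio <- dnorm(mu_star, a_mu, b_mu, log = TRUE) -
               dnorm(mu,      a_mu, b_mu, log = TRUE) +
               sum(log(fpt_density(y, pi, J, mu_star, sigma))) -
               sum(log(fpt_density(y, pi, J, mu,      sigma)))
  if (log(runif(1)) < log_ratio) mu_star else mu
}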


Mixtures of finite Polya trees: Example

Data generated from 0.5φ(y | −3, 1) + 0.5φ(y | 3, 1), n = 500.

[Figure: MFPT density estimates for J = 3, 4, 5, 6, 7, 8; each panel plots the estimated density over y ∈ (−6, 6).]


Mixtures of finite Polya trees: Example

True data-generating mechanism: 0.3φSN(y | µ = −2, σ² = 1.5², λ = 6) + 0.7φSN(y | µ = 2, σ² = 1.5², λ = −3). J = 4 was considered.

[Figure: MFPT density estimates for n = 150 and n = 500, plotted over y ∈ (−3, 3).]


Mixtures of finite Polya trees: Example

True data-generating mechanism: φSN(y | µ = 0, σ² = 1, λ = 10). MFPT estimate (solid blue line) against DPM estimate (dashed red line).

[Figure: MFPT versus DPM density estimates for n = 150 and n = 500, plotted over y ∈ (0, 2.5).]


Linear dependent tailfree process

A Polya tree models the conditional probabilities (πj+1,2k−1, πj+1,2k) through beta distributions.

To accommodate covariates, and in the spirit of density regression, Jara and Hanson (2011) proposed to model these probabilities through logistic regression.

Specifically, given covariates x, the probabilities (πj+1,2k−1, πj+1,2k) are defined by

log( πj+1,2k−1 / πj+1,2k ) = xᵀτj,k.
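Inverting the logit gives the conditional split probabilities at each node; a small R sketch (names hypothetical, not the DPpackage internals):

## Sketch: split probabilities at one node from the logistic link.
split_probs <- function(x, tau_jk) {
  eta <- sum(x * tau_jk)               # x^T tau_{j,k}
  p_left <- exp(eta) / (1 + exp(eta))  # pi_{j+1,2k-1}
  c(left = p_left, right = 1 - p_left) # pi_{j+1,2k} = 1 - pi_{j+1,2k-1}
}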

The resulting model is known as the linear dependent tailfree process; for further details, see Jara and Hanson (2011).

The function LDTFPdensity in DPpackage implements this model.


Other prior distributions

Throughout this presentation we have focused on Dirichlet process mixtures and mixtures of finite Polya trees.

We did not mean to be exhaustive. The aim was to provide, as the name says, an introduction.

Other popular Bayesian nonparametric models include:

Gaussian processes,

Bernstein polynomials,

Splines, wavelets, neural networks, etc.


Acknowledgements: I am happily indebted to...


Thank you...

...for your attention!!!
