Bayesian Nonparametrics, Applications to biology, ecology, and marketing
Antonio Canale
Università di Torino & Collegio Carlo Alberto
StaTalk, 19 February 2016
Toxicology Ecology Marketing Human fertility More applications
Developmental toxicity studies
• Developmental toxicity is any alteration that interferes with normal growth and is caused by environmental factors
• environmental factors include drugs, lifestyle factors such as alcohol and smoking, and toxic environmental chemicals or physical factors
• typical settings involve animal experiments
Ethylene glycol
• Ethylene glycol is used in many industrial processes, e.g. as an antifreeze, an industrial humectant, and a solvent in the paint and plastics industry.
• We consider data from a developmental toxicity study of ethylene glycol in mice conducted by the National Toxicology Program (Price et al. 1985).
• Pregnant mice were assigned to dose groups of 0, 750, 1500, or 3000 mg/kg/day, and the number of implants was measured for each mouse at the end of the experiment.
• The scientific interest lies in studying a dose-response trend in the distribution of the number of implants.
Ethylene glycol data (control group, mean 13.32, variance 4.89)
[Figure: histogram of the control-group data; y-axis: frequencies.]
Let’s go nonparametric!
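Clearly we cannot estimate the pmf of the number of implants with
$$y_i \sim \mathrm{Pois}(\lambda), \qquad \lambda \sim \mathrm{Ga}(a, b),$$
since the sampling model is too restrictive: a Poisson forces the variance to equal the mean, whereas the control group has variance 4.89 against a mean of 13.32.
Hence we have a good reason to go nonparametric.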
Simple approach
• A draw from a Dirichlet process (DP) is almost surely a discrete distribution.
• We might therefore assume
$$y_i \mid P \sim P, \qquad P \sim \mathrm{DP}(\alpha, P_0).$$
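• The posterior is available in closed form, i.e.
$$(P \mid y_1, \dots, y_n) \sim \mathrm{DP}\left(\alpha + n,\ \frac{\alpha P_0 + \sum_{i=1}^{n} \delta_{y_i}}{\alpha + n}\right),$$
• which is actually quite unappealing, since it does not allow borrowing of information about local deviations from $P_0$.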
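To make the lack of local borrowing concrete, here is a minimal Python sketch of the posterior mean of P under the simple DP model, E[P(j) | y] = (α P0(j) + n_j) / (α + n). The Poisson base measure, the value of α, and the toy counts are assumptions chosen for illustration only; they are not the ethylene glycol data or the prior used in the talk.

```python
import math
from collections import Counter


def poisson_pmf(j, lam):
    """Poisson pmf, used here only as a hypothetical base measure P0."""
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))


def dp_posterior_mean_pmf(data, alpha, base_pmf, support):
    """Posterior mean of P under y_i ~ P, P ~ DP(alpha, P0):
    E[P(j) | y] = (alpha * P0(j) + n_j) / (alpha + n)."""
    n = len(data)
    counts = Counter(data)
    return {j: (alpha * base_pmf(j) + counts.get(j, 0)) / (alpha + n)
            for j in support}


# Hypothetical toy counts (NOT the ethylene glycol data).
toy = [12, 13, 13, 14, 14, 14, 15, 16, 19]
pmf = dp_posterior_mean_pmf(
    toy, alpha=1.0,
    base_pmf=lambda j: poisson_pmf(j, 13.0),
    support=range(0, 60),
)

# Unobserved counts such as 17 and 18 receive only the prior term
# alpha * P0(j) / (alpha + n), regardless of how many observations sit
# right next to them.
for j in (14, 17, 18, 19):
    print(j, round(pmf[j], 4))
```

The estimate is a weighted average of the base measure and the empirical pmf, so mass at an observed value never spills over to neighbouring counts.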
[Figure: pmf for the simple approach; y-axis: pmf.]
Mixture of Poisson
• An alternative is
$$\Pr(Y = j) = \int \mathrm{Poi}(j; \lambda)\, dP(\lambda), \qquad P \sim \mathrm{DP}(\alpha P_0);$$
• a DPM of Poissons seems extremely flexible and a natural modification of the DPM of Gaussians;
• however, the resulting prior on the count distribution is actually quite inflexible;
• in particular, under-dispersed distributions cannot be approximated, because a mixture of Poissons always has variance at least as large as its mean.
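The under-dispersion limitation can be checked numerically. The sketch below computes the mean and variance of a finite mixture of Poissons directly from its pmf; the weights and rates are hypothetical values chosen so that the mixture mean is close to the control-group mean. Since Var(Y) = E[λ] + Var(λ), the dispersion index cannot fall below 1, while the control group reported in the slides has index 4.89 / 13.32.

```python
import math


def poisson_pmf(j, lam):
    """Poisson pmf evaluated on the log scale for numerical stability."""
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))


def mixture_mean_var(weights, rates, upper=400):
    """Mean and variance of a finite Poisson mixture, from its pmf on 0..upper-1."""
    pmf = [sum(w * poisson_pmf(j, lam) for w, lam in zip(weights, rates))
           for j in range(upper)]
    mean = sum(j * p for j, p in enumerate(pmf))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var


# Hypothetical two-component mixture with mean near 13.3.
mean, var = mixture_mean_var([0.3, 0.7], [10.0, 14.7])
mixture_index = var / mean

# Dispersion index of the control group, from the moments quoted in the slides.
target_index = 4.89 / 13.32

print("mixture dispersion index:", round(mixture_index, 3))
print("control-group dispersion index:", round(target_index, 3))
```

Whatever mixing distribution is used, the first index stays at or above 1, so no choice of P can reach the second.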
Round a continuous distribution
• Take a continuous density $f$
• Define $a_0 = 0, a_1 = 1, \dots$
• Calculate $p(j) = \int_{a_j}^{a_{j+1}} f(x)\, dx$
• Obtain the discrete count distribution
[Figure: continuous density f(x) against x, for x from 0 to 5.]
[Figure: resulting count pmf p(y) against y, for y = 0, ..., 4.]
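The rounding recipe (take a density, fix thresholds a_j = j, integrate over each interval) can be sketched in a few lines of Python. The Gamma(shape 2, rate 1) density is an assumption made here because its CDF has a simple closed form; the slides do not say which density is plotted.

```python
import math


def erlang2_cdf(x):
    """CDF of a Gamma(shape=2, rate=1) density f(x) = x * exp(-x), x >= 0.
    This particular density is a hypothetical choice for illustration."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x) * (1.0 + x)


def rounded_pmf(cdf, max_count):
    """p(j) = integral of f over [a_j, a_{j+1}) = F(j + 1) - F(j), with a_j = j."""
    return [cdf(j + 1) - cdf(j) for j in range(max_count + 1)]


pmf = rounded_pmf(erlang2_cdf, max_count=50)

for j in range(5):
    print(j, round(pmf[j], 4))
```

Any continuous density on the positive half line can be plugged in through its CDF, which is what makes the construction so flexible.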
Rounded Gaussian Mixture (Canale and Dunson, 2011)
$$p(\cdot; P) = \int \mathrm{RG}(\cdot; \mu, \tau^{-1})\, dP(\mu, \tau^{-1}), \qquad P \sim \mathrm{DP}(\alpha P_0).$$
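A single rounded Gaussian kernel already escapes the Poisson constraint, as the following Python sketch suggests. The thresholds are an assumption: a_0 = -inf and a_j = j for j >= 1, so that count 0 absorbs all latent mass below 1; the paper's exact convention may differ. The latent mean and variance are hypothetical values picked so that the rounded kernel roughly reproduces the control-group moments quoted earlier.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def rounded_gaussian_pmf(j, mu, sigma):
    """RG(j; mu, sigma^2): latent Gaussian mass on [a_j, a_{j+1}).
    Assumed thresholds: a_0 = -inf, a_j = j for j >= 1."""
    upper = normal_cdf(j + 1, mu, sigma)
    lower = 0.0 if j == 0 else normal_cdf(j, mu, sigma)
    return upper - lower


def mean_and_variance(pmf):
    """Mean and variance of a pmf given as a list indexed by the count."""
    mean = sum(j * p for j, p in enumerate(pmf))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var


# Hypothetical latent parameters (not estimates from the talk).
mu, sigma = 13.82, math.sqrt(4.89 - 1.0 / 12.0)
rg_pmf = [rounded_gaussian_pmf(j, mu, sigma) for j in range(61)]
rg_mean, rg_var = mean_and_variance(rg_pmf)

print("mean:", round(rg_mean, 2), "variance:", round(rg_var, 2))
```

Because the latent variance is a free parameter, the resulting count distribution can have variance well below its mean, and mixing over (μ, τ⁻¹) then adds flexibility in shape.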
[Figure: left panel, estimated pmf (blue) and empirical pmf (black); right panel, change in number of implants (y-axis) by quantile (x-axis).]
Animal abundance
[Figure: histogram of animal abundance, x-axis from 0 to 250.]
Another reason to avoid Poisson mixtures
• We compare
$$p(\cdot; P) = \int \mathrm{RG}(\cdot; \mu, \tau^{-1})\, dP(\mu, \tau^{-1}) \qquad \text{and} \qquad p(\cdot; P) = \int \mathrm{Poi}(\cdot; \lambda)\, dP(\lambda)$$
• the DP is highly sensitive to the prior specification of α, which has a major impact on the total number of clusters
• a more general nonparametric prior can lead to more accurate estimates, especially for the number of mixture components (Ishwaran and James, 2001; Lijoi et al. 2005, 2007):
$$P \sim \mathrm{PY}(\theta, \sigma, P_0)$$
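The role of α and σ can be illustrated through the prior expected number of clusters E(K_n). The Python sketch below uses the standard predictive rule of the Pitman-Yor process, under which a new cluster is opened at step i + 1 with probability (θ + σ K_i) / (θ + i); since this is linear in K_i, the recursion on the expectation is exact. The sample size n = 100 and the parameter grid are hypothetical.

```python
def expected_clusters(n, theta, sigma=0.0):
    """Prior E[K_n] under PY(theta, sigma, P0) with a diffuse P0.
    sigma = 0 recovers the DP with alpha = theta.
    Recursion: E[K_{i+1}] = E[K_i] + (theta + sigma * E[K_i]) / (theta + i)."""
    ek = 1.0  # the first observation always opens the first cluster
    for i in range(1, n):
        ek += (theta + sigma * ek) / (theta + i)
    return ek


n = 100  # hypothetical sample size

# DP: E(K_n) moves a lot with alpha.
for alpha in (0.1, 1.0, 10.0):
    print("DP alpha =", alpha, "E(K_n) =", round(expected_clusters(n, alpha), 2))

# PY: for a fixed theta, sigma gives a second handle on E(K_n).
for sigma in (0.0, 0.25, 0.5, 0.75):
    print("PY sigma =", sigma, "E(K_n) =", round(expected_clusters(n, 1.0, sigma), 2))
```

In practice one would solve for (θ, σ) pairs that give the same prior E(K_n), so that models with different σ can be compared on equal footing.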
Improving rounded mixtures (Canale and Prünster, 2016)
Figure: Posterior mean number of distinct clusters E[K_n | −] for the Okaloosa darters dataset: Poisson mixture and RG mixture for different σ = 0, 0.25, 0.5, 0.75 and prior expected number of components E(K_n). [Plots omitted; x-axis: σ, y-axis: E(K_n | −).]
Marketing application
• we focus on data from 2,050 SIM cards of customers with a prepaid contract in a single period;
• $y_i = (y_{i1}, \dots, y_{i5})$, with the number of outgoing calls to fixed numbers ($y_{i1}$), to mobile numbers of competing operators ($y_{i2}$), and to mobile numbers of the same operator ($y_{i3}$), and the total number of MMS ($y_{i4}$) and SMS ($y_{i5}$) sent;
• the rounded kernel (RK) method can be adapted to the multivariate context;
• it is able to characterize the entire joint distribution;
• the use of underlying Gaussian mixtures allows the joint modeling of variables on different measurement scales (continuous, categorical, binary, and counts); see also Canale and Dunson (2015);
• we can do inference on different objects: the whole multivariate density, the marginals, the conditionals;
• there are not many alternatives for modeling a multivariate count distribution!
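Each of the previous concepts generalizes to its multivariate counterpart:
$$\Pr(y = J) = \int \mathrm{RK}_p(J; \Theta)\, dP(\Theta), \qquad P \sim \mathrm{DP}(\alpha P_0),$$
with $J \in \mathbb{N}^p$ and
$$\mathrm{RK}_p(J; \Theta) = \int_{A_J} K(y^*; \Theta)\, dy^*,$$
where the sets $A_J = \{y^* : a_{1,J_1} \le y^*_1 < a_{1,J_1+1}, \dots, a_{p,J_p} \le y^*_p < a_{p,J_p+1}\}$ form a disjoint partition of the sample space.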
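For intuition, the rectangle probabilities RK_p(J; Θ) of a single kernel can be approximated by simulation: draw latent vectors y*, round each coordinate, and tabulate. The Python sketch below does this for a bivariate Gaussian kernel. The thresholds (a_{k,0} = -inf, a_{k,j} = j for j >= 1) and all kernel parameters are assumptions for illustration; Monte Carlo is used here only to avoid a multivariate normal CDF routine and is not the authors' algorithm.

```python
import math
import random
from collections import Counter


def round_to_counts(latent):
    """Map latent y* in R^p to J in N^p.
    Assumed thresholds: a_{k,0} = -inf, a_{k,j} = j for j >= 1,
    so each coordinate is floored and negative values map to 0."""
    return tuple(max(0, math.floor(v)) for v in latent)


def sample_bivariate_normal(rng, mean, sd, rho):
    """One draw from a bivariate Gaussian with correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y1 = mean[0] + sd[0] * z1
    y2 = mean[1] + sd[1] * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return (y1, y2)


def rounded_kernel_pmf_mc(n_draws, mean, sd, rho, seed=0):
    """Monte Carlo estimate of the rounded-kernel pmf over count vectors J."""
    rng = random.Random(seed)
    counts = Counter(
        round_to_counts(sample_bivariate_normal(rng, mean, sd, rho))
        for _ in range(n_draws)
    )
    return {J: c / n_draws for J, c in counts.items()}


# Hypothetical kernel parameters.
pmf_mc = rounded_kernel_pmf_mc(20000, mean=(2.0, 3.5), sd=(1.5, 2.0), rho=0.6)

# Joint, marginal and conditional quantities can all be read off the same table.
pr_first_is_zero = sum(p for J, p in pmf_mc.items() if J[0] == 0)
print("Pr(y1 = 0) is roughly", round(pr_first_is_zero, 3))
```

The dependence between the counts is inherited from the correlation of the latent Gaussian, which is how the joint distribution is captured.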
Marketing application
• we focused on forecasting $y_{i1}$ using data on $y_{i2}, \dots, y_{i5}$;
• we split the dataset into a training and a test subset;
• the approach is compared with prediction under a generalized additive model (GAM) with spline smoothing functions;
• the proposed approach gives a smaller out-of-sample MAD (8.08 vs 8.76);
• side predictions, e.g. $\Pr(y_1 = 0)$ or $\Pr(y_1 > T)$, are accommodated automatically.
Human reproductive functioning
• we now focus on female reproductive functioning
• data refer to the basal body temperature (bbt) across the menstrual cycle.
• bbt curves follow a characteristic trajectory: during the follicular phase of the cycle leading up to ovulation, the bbt values tend to be low, while after ovulation bbt rises progressively before dropping prior to the next cycle.
bbt curves model
• we model the data assuming
$$f_{ij}(t) = \eta_{ij}(t) + \varepsilon_{ijt},$$
where $f_{ij}(t)$ is the bbt observed in cycle $j$ of woman $i$ at day $t$ and $\eta_{ij}$ is the underlying bbt curve, which is observed with random noise $\varepsilon_{ijt}$.
bbt curves mixture model
• we use the mixture model
$$\eta_{ij} \sim P, \qquad P = \sum_{h=1}^{\infty} \pi_h \delta_{\eta_h},$$
with a stick-breaking prior on the weights $\pi$ and a suitable base measure $P_0$ for the atoms, $\eta_h \sim P_0$ (note that the atoms here are curves)
• but the regular shape of the bbt curve of a healthy woman is well known a priori
• we are Bayesians, so we can include this prior information!
Atomic base measure
• it is sufficient to assume that
$$P = w\,\delta_{\eta_0} + (1 - w) \sum_{h=1}^{\infty} \pi_h \delta_{\eta_h},$$
with $\eta_0$ representing the S-shaped trajectory known a priori.
• there are technical challenges in assuming an atomic base measure that we are trying to solve (Canale, Nipoti, Lijoi and Prünster, 20??)
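The structure of the spiked mixing measure can be sketched as follows in Python. The assumptions are: DP-type stick-breaking with V_h ~ Beta(1, α), a truncation at H atoms, and a fixed spike weight w; the slides only say that a stick-breaking prior is used, so these choices are illustrative rather than the authors' specification. The code handles only the weights and the cluster labels; the curves η_h themselves are left abstract.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking: V_h ~ Beta(1, alpha),
    pi_h = V_h * prod_{l<h} (1 - V_l).
    The last weight takes the remaining stick so the weights sum to 1."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def spiked_weights(w, alpha, truncation, rng):
    """Weights of P = w * delta_{eta_0} + (1 - w) * sum_h pi_h * delta_{eta_h}.
    Index 0 is the known curve eta_0; indices 1..H are the random atoms."""
    pis = stick_breaking_weights(alpha, truncation, rng)
    return [w] + [(1.0 - w) * p for p in pis]


rng = random.Random(1)
# Hypothetical values for w, alpha and the truncation level.
weights = spiked_weights(w=0.6, alpha=1.0, truncation=25, rng=rng)

# Allocate 200 hypothetical cycles to atoms; label 0 means the cycle follows
# the a priori known S-shaped trajectory eta_0.
labels = rng.choices(range(len(weights)), weights=weights, k=200)
share_on_known_curve = labels.count(0) / len(labels)
print("share of cycles on the known curve:", share_on_known_curve)
```

With positive probability many cycles share exactly the reference curve η_0, which is what distinguishes this prior from one with a diffuse base measure.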
Image reconstruction
(Wang, Canale, and Dunson 2016)
Brain-network data analysis
(Durante, Canale, and Dunson 201?)
Demand-supply model
(Canale and Ruggiero 2016)
To conclude
• BNP provides a set of useful tools for challenging applications
• the Bayesian approach allows us to include prior information
• the nonparametric approach lets the data speak
• today there are thousands of situations generating interesting data with complex structures
• “The best thing about being a statistician is that you get to play in everyone else’s backyard.” (John Tukey)