Stan: A platform for Bayesian inference


Transcript of Stan: A platform for Bayesian inference


Stan: A platform for Bayesian inference

Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell

Department of Statistics, Columbia University, New York (and other places)

5 Mar 2014


Stan code for linear regression

data {
  int N;
  int K;
  vector[N] y;
  matrix[N,K] X;
}
parameters {
  vector[K] b;
  real<lower=0> sigma;
}
model {
  y ~ normal(X*b, sigma);
}
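A minimal sketch of fitting this model from R with rstan, mirroring the call used later in the talk for the Kumaraswamy example; the file name linreg.stan and the simulated data are assumptions for illustration, not from the talk:

library(rstan)

# Simulate fake data (illustrative values)
N <- 100
K <- 3
X <- matrix(rnorm(N*K), N, K)
y <- as.vector(X %*% c(1, -2, 0.5) + rnorm(N, 0, 1.5))

# Assumes the model above is saved as linreg.stan
fit <- stan(file="linreg.stan", data=c("N","K","X","y"))
print(fit)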


Stan code for an item response model: 1

data {
  int<lower=1> I;              // # questions
  int<lower=1> J;              // # students
  int<lower=1> K;              // # schools
  int<lower=1> N;              // # observations
  int<lower=1,upper=I> ii[N];  // question for n
  int<lower=1,upper=J> jj[N];  // student for n
  int<lower=1,upper=K> kk[N];  // school for n
  int<lower=0,upper=1> y[N];   // correctness for n
}
parameters {
  vector[J] alpha;                     // ability for student j
  vector[I] beta;                      // difficulty for item i
  vector[K] gamma;                     // ability for school k
  vector<lower=0>[I] delta;            // discrimination for item i
  real<lower=0,upper=20> sigma_gamma;  // school ability scale
}


Stan code for an item response model: 2

model {
  alpha ~ normal(0,1);  // priors
  beta ~ normal(0,10);
  gamma ~ normal(0,sigma_gamma);
  delta ~ lognormal(-0.5,1);
  for (n in 1:N)        // likelihood
    y[n] ~ bernoulli_logit(delta[ii[n]]
             * (alpha[jj[n]] + gamma[kk[n]] - beta[ii[n]]));
}
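Written out, the sampling statement in the loop encodes a two-parameter logistic item response model with an additive school effect:

$$\Pr(y_n = 1) = \operatorname{logit}^{-1}\!\big(\delta_{ii[n]}\,(\alpha_{jj[n]} + \gamma_{kk[n]} - \beta_{ii[n]})\big).$$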


Stan overview

- Fit open-ended Bayesian models
- Specify log posterior density in C++
- Code a distribution once, then use it everywhere
- Hamiltonian No-U-Turn sampler
- Autodiff
- Runs from R or Python; postprocessing


People

- Core:
  - Andrew Gelman (stats/poli-sci): R, proto-user, . . .
  - Bob Carpenter (comp sci/linguistics): C++, language design, . . .
  - Matt Hoffman (comp sci/acoustics): NUTS, optimization, C++, . . .
  - Daniel Lee (comp sci/stats): C++, make, integration, . . .
  - Ben Goodrich (stats/poli-sci): linear algebra, R, C++, . . .
  - Michael Betancourt (physics): geometric samplers, C++, . . .
  - Marcus Brubaker (comp sci): linear algebra, optimization, C++, . . .
  - Jiqiang Guo (stats): R, C++, . . .
  - Peter Li (stats/math): C++, . . .
  - Allen Riddell (comp sci): Python, C++, . . .
- Research collaborators
- Users


Funding

- U.S. National Science Foundation
- Novartis
- Columbia University
- U.S. Department of Energy


Roles of Stan

- Bayesian inference for unsophisticated users (alternative to BUGS)
- Bayesian inference for sophisticated users (alternative to programming it yourself)
- Fast and scalable gradient computation
- Environment for developing new algorithms


Example from toxicology


Sparse data


Validation using predictive simulations


Public opinion: Health care reform


Public opinion: School vouchers


Public opinion: Death penalty

- Hierarchical time series model
- Open-ended


Tree rings


Speed dating

- Each person meets 10–20 “dates”
- Rate each date on Attractiveness, Sincerity, Intelligence, Ambition, Fun to be with, Shared interests
- Outcomes:
  - Do you want to see this person again? (Yes/No)
  - How much do you like this person? (1–10)
- How important are each of the 6 attributes?
- Logistic or linear regression
- Hierarchical model: coefficients vary by person (see the sketch below)
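The talk shows no code for this model, but a minimal sketch of such a hierarchical logistic regression in Stan might look as follows; the variable names (jj, x, mu, tau) and the implicit flat priors on the population parameters are assumptions for illustration:

data {
  int<lower=1> N;               // ratings (one per person-date pair)
  int<lower=1> J;               // persons
  int<lower=1,upper=J> jj[N];   // person for rating n
  matrix[N,6] x;                // ratings on the 6 attributes
  int<lower=0,upper=1> y[N];    // want to see this person again?
}
parameters {
  matrix[J,6] beta;             // coefficients vary by person
  vector[6] mu;                 // population means of the coefficients
  vector<lower=0>[6] tau;       // population scales of the coefficients
}
model {
  for (j in 1:J)
    for (k in 1:6)
      beta[j,k] ~ normal(mu[k], tau[k]);          // hierarchical prior
  for (n in 1:N)
    y[n] ~ bernoulli_logit(x[n] * beta[jj[n]]');  // per-person logistic regression
}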


Steps of Bayesian data analysis

- Model building
- Inference
- Model checking
- Model understanding and improvement


Background on Bayesian computation

- Point estimates and standard errors
- Hierarchical models
- Posterior simulation
- Markov chain Monte Carlo (Gibbs sampler and Metropolis algorithm)
- Hamiltonian Monte Carlo


Solving problems

- Problem: Gibbs too slow, Metropolis too problem-specific
  Solution: Hamiltonian Monte Carlo
- Problem: Interpreters too slow, won’t scale
  Solution: Compilation
- Problem: Need gradients of log posterior for HMC
  Solution: Reverse-mode algorithmic differentiation
- Problem: Existing algo-diff slow, limited, unextensible
  Solution: Our own algo-diff
- Problem: Algo-diff requires fully templated functions
  Solution: Our own density library, Eigen linear algebra


Radford Neal (2011) on Hamiltonian Monte Carlo

“One practical impediment to the use of Hamiltonian Monte Carlo is the need to select suitable values for the leapfrog stepsize, ε, and the number of leapfrog steps L . . . Tuning HMC will usually require preliminary runs with trial values for ε and L . . . Unfortunately, preliminary runs can be misleading . . . ”
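For reference (standard HMC background, not part of the quote): one leapfrog step of size ε updates the position θ and momentum r by

$$r \leftarrow r + \tfrac{\varepsilon}{2}\,\nabla_\theta \log p(\theta), \qquad \theta \leftarrow \theta + \varepsilon\, r, \qquad r \leftarrow r + \tfrac{\varepsilon}{2}\,\nabla_\theta \log p(\theta),$$

and HMC takes L such steps per iteration, so the total simulation length is roughly εL.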


The No U-Turn Sampler

- Created by Matt Hoffman
- Run the HMC steps until they start to turn around (bend with an angle > 180°; made precise below)
- Computationally efficient
- Requires no tuning of #steps
- Complications to preserve detailed balance
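Concretely, in Hoffman and Gelman's paper the trajectory doubling stops when the endpoints θ⁻, θ⁺ and their momenta r⁻, r⁺ satisfy the U-turn condition

$$(\theta^{+} - \theta^{-}) \cdot r^{-} < 0 \quad\text{or}\quad (\theta^{+} - \theta^{-}) \cdot r^{+} < 0,$$

that is, when further leapfrog steps at either end would begin to shrink the distance between the two ends of the trajectory.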


NUTS Example Trajectory

[Figure 2 from Hoffman and Gelman: an example NUTS trajectory in two dimensions; the paper's caption follows.]

Figure 2: Example of a trajectory generated during one iteration of NUTS. The blue ellipse is a contour of the target distribution, the black open circles are the positions θ traced out by the leapfrog integrator and associated with elements of the set of visited states B, the black solid circle is the starting position, the red solid circles are positions associated with states that must be excluded from the set C of possible next samples because their joint probability is below the slice variable u, and the positions with a red “x” through them correspond to states that must be excluded from C to satisfy detailed balance. The blue arrow is the vector from the positions associated with the leftmost to the rightmost leaf nodes in the rightmost height-3 subtree, and the magenta arrow is the (normalized) momentum vector at the final state in the trajectory. The doubling process stops here, since the blue and magenta arrows make an angle of more than 90 degrees. The crossed-out nodes with a red “x” are in the right half-tree, and must be ignored when choosing the next sample.

- Blue ellipse is contour of target distribution
- Initial position at black solid circle
- Arrows indicate a U-turn in momentum


NUTS vs. Gibbs and Metropolis

[Figure 7 from Hoffman and Gelman; the paper's caption and a short excerpt follow.]

Figure 7: Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. The plots compare 1,000 independent draws from a highly correlated 250-dimensional distribution (right) with 1,000,000 samples (thinned to 1,000 samples for display) generated by random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 samples for display) generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS (second from right). Only the first two dimensions are shown here.

4.4 Comparing the Efficiency of HMC and NUTS

Figure 6 compares the efficiency of HMC (with various simulation lengths λ ≈ εL) and NUTS (which chooses simulation lengths automatically). The x-axis in each plot is the target δ used by the dual averaging algorithm from section 3.2 to automatically tune the step size ε. The y-axis is the effective sample size (ESS) generated by each sampler, normalized by the number of gradient evaluations used in generating the samples. HMC’s best performance seems to occur around δ = 0.65, suggesting that this is indeed a reasonable default value for a variety of problems. NUTS’s best performance seems to occur around δ = 0.6, but does not seem to depend strongly on δ within the range δ ∈ [0.45, 0.65]. δ = 0.6 therefore seems like a reasonable default value for NUTS.

On the two logistic regression problems NUTS is able to produce effectively independent samples about as efficiently as HMC can. On the multivariate normal and stochastic volatility problems, NUTS with δ = 0.6 outperforms HMC’s best ESS by about a factor of three.

As expected, HMC’s performance degrades if an inappropriate simulation length is chosen. Across the four target distributions we tested, the best simulation lengths λ for HMC varied by about a factor of 100, with the longest optimal λ being 17.62 (for the multivariate normal) and the shortest optimal λ being 0.17 (for the simple logistic regression). In practice, finding a good simulation length for HMC will usually require some number of preliminary runs. The results in Figure 6 suggest that NUTS can generate samples at least as efficiently as HMC, even discounting the cost of any preliminary runs needed to tune HMC’s simulation length.


- Two dimensions of highly correlated 250-dim distribution
- 1M samples from Metropolis, 1M from Gibbs (thinned to 1K)
- 1K samples from NUTS, 1K independent draws


NUTS vs. Basic HMC

- 250-D normal and logistic regression models
- Vertical axis shows effective #sims (big is good)
- (Left) NUTS; (Right) HMC with increasing t = εL


NUTS vs. Basic HMC II

- Hierarchical logistic regression and stochastic volatility
- Simulation time is step size ε times #steps L
- NUTS can beat optimally tuned HMC


Solving more problems in Stan

- Problem: Need ease of use of BUGS
  Solution: Compile directed graphical model language
- Problem: Need to tune parameters for HMC
  Solution: Auto tuning, adaptation
- Problem: Efficient up-to-proportion density calcs (illustrated below)
  Solution: Density template metaprogramming
- Problem: Limited error checking, recovery
  Solution: Static model typing, informative exceptions
- Problem: Poor boundary behavior
  Solution: Calculate limits (e.g., lim_{x→0} x log x = 0)
- Problem: Restrictive licensing (e.g., closed, GPL, etc.)
  Solution: Open-source, BSD license
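To illustrate the up-to-proportion point (my example, not the talk's): with σ fixed, the normal log density

$$\log p(y \mid \mu, \sigma) = -\tfrac{1}{2}\log(2\pi) - \log\sigma - \frac{(y-\mu)^2}{2\sigma^2}$$

needs only its final term for MCMC, since the other terms do not involve the parameter μ; the template metaprogramming mentioned above is what lets Stan drop such constant terms automatically.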


Example: the “Kumaraswamy distribution”

$$p(\theta \mid a, b) = a\,b\,\theta^{a-1}\,(1 - \theta^{a})^{b-1} \qquad \text{for } a, b > 0 \text{ and } \theta \in (0, 1)$$
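For N observations θ_1, . . . , θ_N, taking logs gives the log likelihood that the code below accumulates:

$$\log p(\theta \mid a, b) = N(\log a + \log b) + (a-1)\sum_{n=1}^{N} \log \theta_n + (b-1)\sum_{n=1}^{N} \log\!\left(1 - \theta_n^{a}\right).$$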

model {
  // Put priors on a and b here if you want

  // Put in the rest of your model

  // Kumaraswamy log-likelihood
  increment_log_prob(N*(log(a)+log(b)) + (a-1)*sum_log_theta);
  for (n in 1:N)
    increment_log_prob((b-1)*log1m(pow(theta[n],a)));
}
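The model block above assumes declarations made elsewhere in the file. A minimal complete Kumaraswamy.stan consistent with it (my reconstruction; the talk shows only the model block) might be:

data {
  int<lower=1> N;
  vector<lower=0,upper=1>[N] theta;
}
transformed data {
  real sum_log_theta;
  sum_log_theta <- sum(log(theta));  // data-dependent constant, computed once
}
parameters {
  real<lower=0> a;
  real<lower=0> b;
}
model {
  // Kumaraswamy log-likelihood (implicit flat priors on a and b)
  increment_log_prob(N*(log(a)+log(b)) + (a-1)*sum_log_theta);
  for (n in 1:N)
    increment_log_prob((b-1)*log1m(pow(theta[n],a)));
}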


Check that it worked

R code:

N <- 1000
a <- 3
b <- 2
theta <- rbeta(N,1,b)^(1/a)
Kumaraswamy <- stan(file="Kumaraswamy.stan", data=c("N","theta"))
print(Kumaraswamy)

(If X ~ Beta(1, b), then X^(1/a) has the Kumaraswamy(a, b) distribution, so this simulates fake data with a = 3 and b = 2.)

Result:

Inference for Stan model: Kumaraswamy.
4 chains: each with iter=2000; warmup=1000; thin=1; 2000 saved.

      mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
a      2.9 0.1   2.7   2.9   2.9   3.0   3.1   766    1
b      1.9 0.1   1.8   1.9   1.9   2.0   2.1   787    1
lp__ 282.4 1.0 279.7 282.1 282.7 283.1 283.4   683    1


Big Data, Big Model, Scalable Computing

[Figure: seconds per sample vs. total ratings, both on log scales (y: 10^-3 to 10^0; x: 10^2 to 10^6), with separate curves for 10, 100, and 1000 items.]


Thinking about scalability

- Hierarchical item response model:

  # items  # raters  # groups  # data      Stan time  Stan memory  JAGS time  JAGS memory
  20       2,000     100       40,000      :02m       16MB         :03m       220MB
  40       8,000     200       320,000     :16m       92MB         :40m       1400MB
  80       32,000    400       2,560,000   4h:10m     580MB        :??m       ?MB

- Also, Stan generated 4x the effective sample size per iteration


Successes and struggles

- Simple linear and logistic regression
  - Simple, extendable, full Bayes
- Hierarchical item response model
  - Scalable, fast
- Discrete parameters
  - Mixture formulation
- Cormack-Jolly-Seber mark-recapture model
  - 2 minutes for Stan vs. overnight for Bugs
- Diff eq models in pharmacology
  - Stability, priors, multilevel, full Bayes
- Age-period-cohort model of voting
  - Modular model building
- 5000 downloads, 800 people on users list
  - Existing programs don’t run, don’t converge, or are slow


Future work

- Programming
  - Faster gradients and higher-order derivatives
  - Functions
- Statistical algorithms
  - Riemannian Hamiltonian Monte Carlo
  - (Penalized) MLE
  - (Penalized) marginal MLE
  - Black-box variational Bayes
  - Data partitioning and expectation propagation
