Probabilistic Programming Language - Machine learninghunch.net/~nyoml/stan-current-future.pdf ·...

Stan Probabilistic Programming Language Core Development Team: Andrew Gelman, Bob Carpenter, Matt Hoffman Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell, Marco Inacio, Jeff Arnold, Mitzi Morris Microsoft Research NYC 2014 mc-stan.org


Stan: Probabilistic Programming Language

Core Development Team:

Andrew Gelman, Bob Carpenter, Matt Hoffman

Daniel Lee, Ben Goodrich, Michael Betancourt,

Marcus Brubaker, Jiqiang Guo, Peter Li,

Allen Riddell, Marco Inacio, Jeff Arnold,

Mitzi Morris

Microsoft Research NYC 2014 mc-stan.org


Current:

Stan 2.4


What is Stan?

• Stan is an imperative probabilistic programming language

– cf., BUGS: declarative; Church: functional; Figaro: object-oriented

• Stan program

– declares variables

– codes log posterior (or penalized likelihood)

• Stan inference

– Sampling for full Bayesian inference

– Optimization + curvature for maximum likelihood estimates


Example: Bernoulli

data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  y ~ bernoulli(theta);
}

notes: theta uniform on [0,1], y vectorized


RStan Execution

> N <- 5; y <- c(0,1,1,0,0);
> fit <- stan("bernoulli.stan", data = c("N", "y"));
> print(fit, digits=2)

Inference for Stan model: bernoulli.
4 chains, each with iter=2000; warmup=1000; thin=1;

       mean se_mean   sd  2.5%   50% 97.5% n_eff Rhat
theta  0.43    0.01 0.18  0.11  0.42  0.78  1229    1
lp__  -5.33    0.02 0.80 -7.46 -5.04 -4.78  1201    1

> hist( extract(fit)$theta )

[Figure: Histogram of extract(fit)$theta; x-axis: extract(fit)$theta, from 0.0 to 1.0; y-axis: Frequency]
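As a cross-check on the RStan summary above: with a uniform prior on theta, the Bernoulli model has a closed-form posterior, Beta(1 + sum(y), 1 + N - sum(y)). The Python sketch below is not part of the original slides, and the function name is illustrative. It computes the analytic posterior mean and standard deviation for the same data.

```python
import math

def beta_posterior_summary(y):
    """Posterior of theta under a Uniform(0, 1) prior and Bernoulli likelihood."""
    successes = sum(y)
    a = 1 + successes            # Beta shape: prior (1) plus observed successes
    b = 1 + len(y) - successes   # Beta shape: prior (1) plus observed failures
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return a, b, mean, sd

a, b, mean, sd = beta_posterior_summary([0, 1, 1, 0, 0])
print(f"Beta({a}, {b}): mean={mean:.2f}, sd={sd:.2f}")
```

The analytic mean is 3/7, about 0.43, and the analytic sd is about 0.17, which agrees with the sampled values (0.43 and 0.18) up to Monte Carlo error.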


Stan is Open

• Stan is open source

– Stan Core C++, CmdStan interface: new BSD

– PyStan, RStan interfaces: GPLv3

– Dependencies: Eigen, Boost, (googletest)

• GitHub hosted publicly (stan-dev)

– issue tracking (bug reports, feature requests)

– pull requests, code review, continuous integration hooks

• Google Groups public mailing lists (stan-dev, stan-users)


Platforms and Interfaces

• Platforms: Linux, Mac OS X, Windows

• C++ API: portable, standards compliant (C++03 now, moving to C++11)

• Interfaces

– CmdStan: Command-line or shell interface (direct executable)

– RStan: R interface (Rcpp in memory)

– PyStan: Python interface (Cython in memory)

– MStan∗: MATLAB interface (lightweight external process)

– JuliaStan∗: Julia interface (lightweight external process)

∗ User contributed


Who’s Using Stan?

• 830+ user mailing list; 100+ citations

– physical, biomedical, and social sciences

– plus engineering, education, finance, and marketing

• Application areas

clinical drug trials, general computational statistics, entomology, ophthalmology, neurology, sociology and population dynamics, genomics, agriculture, psycholinguistics, molecular biology, materials engineering, botany, astrophysics, oceanography, election prediction, fisheries, cancer biology, public health and epidemiology, population ecology, collaborative filtering, climatology, educational testing, natural language processing, econometrics


Documentation and Examples

• Documentation

– 400+ page user’s guide and reference

– For each interface: installation, getting started, reference

• Examples

– BUGS and JAGS examples (all 3 volumes)

– Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models

– Wagenmakers and Lee, Bayesian Cognitive Modeling

– two books in progress

– user-contributed examples in group


Scaling and Evaluation

• Type of scaling

– more data (e.g., observations)

– more parameters (e.g., regression coefficients)

– more complex models (e.g., multilevel priors)

• Metrics

– time to convergence

– time per effective sample after convergence

– memory usage

– 0 to ∞ orders of magnitude improvement on black-box state-of-the-art (e.g., BUGS, JAGS)

– more improvement with more complex models


Basic Program Blocks

• data (once)

– content: declare data types, sizes, and constraints

– execute: read from data source, validate constraints

• parameters (every log prob eval)

– content: declare parameter types, sizes, and constraints

– execute: transform to constrained, optional Jacobian

• model (every log prob eval)

– content: define posterior density (up to constant)

– execute: execute statements


Derived Variable Blocks

• transformed data (once after data)

– content: declare and define transformed data variables

– execute: execute definition statements, validate constraints

• transformed parameters (every log prob eval)

– content: declare and define transformed parameter vars

– execute: execute definition statements, validate constraints

• generated quantities (once per draw, double type)

– content: declare and define generated quantity variables; includes pseudo-random number generators (for posterior predictions, event probabilities, decision making)

– execute: execute definition statements, validate constraints


User-Defined Functions

• functions (compiled with model)

– content: declare and define general (recursive) functions (use them elsewhere in program)

– execute: compile with model

• Example

functions {
  real relative_difference(real u, real v) {
    return 2 * fabs(u - v) / (fabs(u) + fabs(v));
  }
}


Variable and Expression Types

Variables and expressions are strongly, statically typed.

• Primitive: int, real

• Matrix: matrix[M,N], vector[M], row_vector[N]

• Bounded: primitive or matrix, with <lower=L>, <upper=U>, <lower=L,upper=U>

• Constrained Vectors: simplex[K], ordered[N], positive_ordered[N], unit_length[N]

• Constrained Matrices: cov_matrix[K], corr_matrix[K], cholesky_factor_cov[M,N], cholesky_factor_corr[K]

• Arrays: of any type (and dimensionality)


Arithmetic and Matrix Operators

Op.  Prec.  Assoc.  Placement      Description
+    5      left    binary infix   addition
-    5      left    binary infix   subtraction
*    4      left    binary infix   multiplication
/    4      left    binary infix   (right) division
\    3      left    binary infix   left division
.*   2      left    binary infix   elementwise multiplication
./   2      left    binary infix   elementwise division
!    1      n/a     unary prefix   logical negation
-    1      n/a     unary prefix   negation
+    1      n/a     unary prefix   promotion (no-op in Stan)
^    2      right   binary infix   exponentiation
'    0      n/a     unary postfix  transposition
()   0      n/a     prefix, wrap   function application
[]   0      left    prefix, wrap   array, matrix indexing


Logical Operators

Op.  Prec.  Assoc.  Placement      Description
||   9      left    binary infix   logical or
&&   8      left    binary infix   logical and
==   7      left    binary infix   equality
!=   7      left    binary infix   inequality
<    6      left    binary infix   less than
<=   6      left    binary infix   less than or equal
>    6      left    binary infix   greater than
>=   6      left    binary infix   greater than or equal


Built-in Math Functions

• All built-in C++ functions and operators
  C math, TR1, C++11, including all trig, pow, and special log1m, erf, erfc, fma, atan2, etc.

• Extensive library of statistical functions
  e.g., softmax, log gamma and digamma functions, beta functions, Bessel functions of first and second kind, etc.

• Efficient, arithmetically stable compound functions
  e.g., multiply log, log sum of exponentials, log inverse logit
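To illustrate what "arithmetically stable" means for two of the compound functions named above, here is a minimal Python sketch. It is not Stan's implementation; the function names simply mirror the slide's wording. Each function rearranges the arithmetic so that no intermediate exponential overflows.

```python
import math

def log_sum_exp(xs):
    """log(sum(exp(x) for x in xs)), computed without overflow."""
    m = max(xs)
    if math.isinf(m):
        return m  # all -inf, or some +inf: avoids nan from inf - inf
    # Subtracting the maximum keeps every exponent at or below zero.
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_inv_logit(x):
    """log(1 / (1 + exp(-x))), stable for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

print(log_sum_exp([1000.0, 1000.0]))   # naive exp(1000.0) would overflow
print(log_inv_logit(-800.0))           # naive exp(800.0) would overflow
```

A naive evaluation of either expression raises an overflow error in floating point for these inputs, whereas the rearranged forms return finite values.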


Built-in Matrix Functions

• Type inference, e.g., row vector times vector is scalar

• Basic and elementwise arithmetic: all ops

• Solvers: matrix division, (log) determinant, inverse

• Decompositions: QR, Eigenvalues and Eigenvectors, Cholesky factorization, singular value decomposition

• Ordering, Slicing, Broadcasting: sort, rank, block, rep

• Reductions: sum, product, norms

• Compound Operations: quadratic forms, variance scaling

• Specializations: triangular, positive-definite, etc.


Distribution Library

• Each distribution has

– log density or mass function

– cumulative distribution functions, plus complementary versions, plus log scale

– pseudorandom number generators

• Alternative parameterizations
  (e.g., Cholesky-based multi-normal, log-scale Poisson, logit-scale Bernoulli)

• New multivariate correlation matrix density: LKJ
  degrees of freedom controls shrinkage to (expansion from) unit matrix; independently scale by parameter


Statements

• Sampling: y ~ normal(mu,sigma) (increments log probability)

• Log probability: increment_log_prob(lp);

• Assignment: y_hat <- x * beta;

• For loop: for (n in 1:N) ...

• While loop: while (cond) ...

• Conditional: if (cond) ...; else if (cond) ...; else ...;

• Block: { ... } (allows local variables)

• Print: print("theta=",theta);

• Throw: raise_exception("x must be positive; found x=", x);


Full Bayes with MCMC

• Adaptive Hamiltonian Monte Carlo (HMC)

• Adaptation during warmup

– step size adapted to target Metropolis acceptance rate

– mass matrix estimated with regularization
  sample covariance of second half of warmup iterations (assumes constant posterior curvature)

• Adaptation during sampling

– number of steps
  aka no-U-turn sampler (NUTS)

• Initialization user-specified or random unconstrained


Posterior Inference

• Generated quantities block for inference
  (predictions, decisions, and event probabilities)

• Extractors for samples in RStan and PyStan

• Coda-like posterior summary

– posterior mean w. standard error, standard deviation, quantiles

– split-R multi-chain convergence diagnostic (Gelman and ...)

– multi-chain effective sample size estimation (FFT algorithm)

• Model comparison with WAIC or cross-validation
  (internal log likelihoods; external cross-sample statistics)
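The split-R diagnostic listed above can be sketched in a few lines. The Python below is an illustrative version of the textbook split-R-hat formula (within-chain variance W, between-chain variance B), not Stan's own code. It assumes equal-length chains of scalar draws, and the function name is hypothetical.

```python
import random
import statistics

def split_rhat(chains):
    """Split-R-hat for a list of equal-length chains of scalar draws."""
    halves = []
    for chain in chains:
        n = len(chain) // 2
        halves.append(chain[:n])
        halves.append(chain[n:2 * n])   # drops one trailing draw if length is odd
    n = len(halves[0])
    means = [statistics.fmean(h) for h in halves]
    within = statistics.fmean([statistics.variance(h) for h in halves])   # W
    between = n * statistics.variance(means)                              # B
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5

rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 3.0, 3.0)]
print(split_rhat(mixed), split_rhat(stuck))
```

Chains that agree give a value close to 1. Chains centered in different places give a value well above 1, and splitting each chain in half also flags a single chain that drifts over time.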


Penalized MLE

• Posterior mode finding via L-BFGS optimization
  (uses model gradient, efficiently approximates Hessian)

• Disables Jacobians for parameter inverse transforms

• Models, data, initialization as in MCMC

• Curvature (Hessian) used to estimate posterior covariance
  (or standard errors on unconstrained scale)


Stan as a Research Tool

• Stan can be used to explore algorithms

• Models transformed to unconstrained support on $\mathbb{R}^n$

• Once a model is compiled, have

– log probability, gradient, and Hessian

– data I/O and parameter initialization

– model provides variable names and dimensionalities

– transforms to and from constrained representation (with or without Jacobian)

• Very Near Future:

– second- and higher-order derivatives via auto-diff


Future:

Stan 3 & Beyond


Stan 3: The Refactoring

• Third time’s the charm in software (re)design

• Goal of removing duplicated code in interfaces

• Minimizing dependencies between interfaces and core

• More modular and flexible

– user-facing C++ API

– interface API

• Design and some coding underway


Differential Equation Solver

• Auto-diff solutions w.r.t. parameters & initial states

• Integrate coupled system for solution with partials

• Auto-diff coupled Jacobian for stiff systems

• C++ prototype integrated for large PK/PD models

– Project with Novartis: longitudinal clinical trial w. multiple drugs, dosings, placebo control, hierarchical model of patient-level effects, meta-analysis

• Generalized code complete, under testing

• With: Frederic Bois, Amy Racine-Poon, Sebastian Weber


Higher-Order Auto-diff

• Finish higher-order auto-diff for probability functions

• May punt some cumulative distribution functions
  (black-art iterative algorithms required)

• Code complete; under testing


Riemannian Manifold HMC

• Supports posteriors with position-dependent curvature with local mass matrix estimation (e.g., hierarchical models)

• NUTS generalized to RHMC

• SoftAbs metric

– Eigendecompose Hessian

– positive definite with positive eigenvalues

– condition by narrowing eigenvalue range

• Code complete; awaiting full higher-order auto-diff

• (Betancourt arXiv papers)


Thermodynamic Sampler

• Enhances posterior mode-finding with multiple modes

• Physically motivated alternative to “simulated” annealing and tempering (not really simulated!)

• Supplies external heat bath

• Operates through contact manifold

• System relaxes more naturally between energy levels

• Prototype complete

• (Betancourt arXiv paper)


Ensemble MCMC

• Markov chain is ensemble of parameter vectors

• Interpolate/extrapolate in ensemble for Metropolis proposal

– Walk & stretch moves (Goodman & Weare)

– Differential evolution (ter Braak)

• Highly parallelizable

• Derivative free

• Code complete; awaiting integration


Marginal Maximum Likelihood (MML)

• Enable MLE-like estimate with infinite likelihoods
  (e.g., hierarchical models, mixture models / clustering)

• Marginalize out lower-level parameters to estimate hierarchical parameters

• Estimate lower-level parameters based on hierarchical estimates

• Gradient-based nested optimization algorithm

• Errors / posterior variance estimated as in MLE

• Design complete; awaiting parameter tagging


MLE & MML Errors

• Simple posterior approximation

• Standard errors for estimators

• Approximate posterior covariance with multivariate normal

• Estimate with Laplace approximation (based on curvature)

• Sample and transform to constrained scale

• Design complete


Variational Bayes (VB)

• Black box to scale arbitrary models

• Batch or stochastic (data streaming)

• Compute expectations via approximations (e.g., Laplace)

• Point estimate parametric approximations to posterior

• Optimize parameters to minimize KL divergence

• Prototype stage

• With: Dave Blei, Alp Kucukelbir, and Rajesh Ranganath


Expectation Propagation (EP)

• Black box to scale arbitrary models

• Data-parallelization

– cavity distributions guide shard combination

• Point estimate parametric approximations to posterior

• Optimize parameters to minimize KL divergence

• Prototype stage

• With: Aki Vehtari, Nicolas Chopin, Christian Robert, John Cunningham


The End


Stan’s Namesake

• Stanislaw Ulam (1909–1984)

• Co-inventor of Monte Carlo method (and hydrogen bomb)

• Ulam holding the Fermiac, Enrico Fermi’s physical Monte Carlo simulator for random neutron diffusion


Appendix I

Under the Hood


Euclidean Hamiltonian

• Phase space: $q$ position (parameters); $p$ momentum

• Posterior density: $\pi(q)$

• Mass matrix: $M$

• Potential energy: $V(q) = -\log \pi(q)$

• Kinetic energy: $T(p) = \frac{1}{2} p^{\top} M^{-1} p$

• Hamiltonian: $H(p, q) = V(q) + T(p)$

• Diff eqs: $\frac{dq}{dt} = +\frac{\partial H}{\partial p}$, $\frac{dp}{dt} = -\frac{\partial H}{\partial q}$


Leapfrog Integrator Steps

• Solves Hamilton’s equations by simulating dynamics
  (symplectic [volume preserving]; $\epsilon^3$ error per step, $\epsilon^2$ total error)

• Given: step size $\epsilon$, mass matrix $M$, parameters $q$

• Initialize kinetic energy, $p \sim \mathrm{Normal}(0, I)$

• Repeat for $L$ leapfrog steps:

  $p \leftarrow p - \frac{\epsilon}{2} \frac{\partial V(q)}{\partial q}$ [half step in momentum]

  $q \leftarrow q + \epsilon M^{-1} p$ [full step in position]

  $p \leftarrow p - \frac{\epsilon}{2} \frac{\partial V(q)}{\partial q}$ [half step in momentum]
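The three updates above translate directly into code. The following Python sketch is illustrative rather than Stan's implementation. It assumes a scalar parameter and a standard normal target, so that V(q) = q^2 / 2, and all function names are hypothetical.

```python
def leapfrog(q, p, grad_V, eps, L, inv_mass=1.0):
    """Run L leapfrog steps for a scalar position q and momentum p."""
    for _ in range(L):
        p = p - 0.5 * eps * grad_V(q)   # half step in momentum
        q = q + eps * inv_mass * p      # full step in position
        p = p - 0.5 * eps * grad_V(q)   # half step in momentum
    return q, p

def grad_V(q):
    """Gradient of V(q) = q^2 / 2 (standard normal target)."""
    return q

def hamiltonian(q, p):
    """H = V(q) + T(p) with unit mass."""
    return 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, grad_V, eps=0.1, L=20)
energy_error = abs(hamiltonian(q1, p1) - hamiltonian(q0, p0))

# Reversibility: negate the momentum, integrate again, and recover the start.
q2, p2 = leapfrog(q1, -p1, grad_V, eps=0.1, L=20)
print(energy_error, q2, -p2)
```

The energy error stays small even after 20 steps, and rerunning from the end point with negated momentum returns to the starting state. These are the two properties that make leapfrog proposals suitable for a Metropolis accept step.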


Standard HMC

• Initialize parameters diffusely
  Stan’s default: q ∼ Uniform(−2,2) on unconstrained scale

• For each draw

– leapfrog integrator generates proposal

– Metropolis accept step ensures detailed balance

• Balancing act: small ε has low error, requires many steps

• Results highly sensitive to step size ε and mass matrix M
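A minimal Python sketch of the loop described above: a leapfrog proposal followed by a Metropolis accept step. It assumes a scalar parameter, a unit mass matrix, and a fixed step size and number of steps (no adaptation). The function and argument names are illustrative, and this is not Stan's sampler.

```python
import math
import random
import statistics

def hmc_draws(log_density, grad_log_density, q, eps, L, num_draws, rng):
    """Basic HMC with unit mass matrix for a scalar parameter q."""
    draws = []
    accepted = 0
    for _ in range(num_draws):
        p0 = rng.gauss(0.0, 1.0)                  # resample momentum
        q_new, p = q, p0
        for _ in range(L):                        # leapfrog proposal
            p += 0.5 * eps * grad_log_density(q_new)
            q_new += eps * p
            p += 0.5 * eps * grad_log_density(q_new)
        # H = -log_density(q) + p^2 / 2; accept with probability
        # min(1, exp(H_old - H_new)).
        h_old = -log_density(q) + 0.5 * p0 * p0
        h_new = -log_density(q_new) + 0.5 * p * p
        if math.log(1.0 - rng.random()) < h_old - h_new:
            q = q_new
            accepted += 1
        draws.append(q)
    return draws, accepted / num_draws

# Standard normal target, used here only as a test case.
draws, accept_rate = hmc_draws(
    log_density=lambda q: -0.5 * q * q,
    grad_log_density=lambda q: -q,
    q=0.0, eps=0.2, L=10, num_draws=5000, rng=random.Random(42),
)
print(statistics.fmean(draws), statistics.pvariance(draws), accept_rate)
```

Shrinking eps raises the acceptance rate but requires more leapfrog steps to travel the same distance, which is the balancing act the slide refers to.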


Tuning HMC During Warmup

• Chicken-and-egg problem

– convergence to high mass volume requires adaptation

– adaptation requires convergence

• During warmup, tune

– step size: line search to achieve target acceptance rate

– mass matrix: estimate with second half of warmup

• Use exponentially growing adaptation block sizes


Position-Independent Curvature

• Euclidean HMC uses global mass matrix M

• Works for densities with position-independent curvature

• Counterexample: hierarchical model

– hierarchical variance parameter controls lower-level scale

– mitigate by reducing target acceptance rate

• Riemannian-manifold HMC (coming soon)

– automatically adapts to varying curvature

– no need to estimate mass matrix

– need to regularize Hessian-based curvature estimate
  (Betancourt arXiv; SoftAbs metric)


Adapting HMC During Sampling

• No-U-turn sampler (NUTS)

• Subtle algorithm to maintain detailed balance

• Move randomly forward or backward in time

• Double number of leapfrog steps each move (binary tree)

• Stop when a subtree makes a U-turn
  (rare: throw away second half if not end to end U-turn)

• Slice sample points along last branch of tree

• Generalized to Riemannian-manifold HMC
  (Betancourt arXiv paper)


Reverse-Mode Auto Diff

• Eval gradient in small multiple of function eval time
  (independent of dimensionality)

• Templated C++ overload for all functions

• Code partial derivatives for basic operations

• Function evaluation builds up expression tree

• Dynamic program propagates chain rule in reverse pass

• Extensible w. object-oriented custom partial propagation

• Arena-based memory management
  (customize operator new)
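The idea of building an expression graph during function evaluation and then propagating the chain rule in a reverse pass can be shown with a toy Python sketch. It is only an illustration of the technique: Stan's version is templated C++ with arena-allocated nodes, whereas the class and function names below are hypothetical and only addition, multiplication, and log are supported.

```python
import math

class Var:
    """Node in the expression graph: a value, an adjoint, and parent links."""
    def __init__(self, value, parents=()):
        self.value = value
        self.adjoint = 0.0
        self.parents = parents      # tuple of (parent node, local partial)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def log(x):
    return Var(math.log(x.value), ((x, 1.0 / x.value),))

def gradient(output, inputs):
    """Reverse pass: visit nodes in reverse topological order, applying the chain rule."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial
    return [x.adjoint for x in inputs]

x, y = Var(2.0), Var(3.0)
f = x * y + log(x)          # df/dx = y + 1/x, df/dy = x
grad = gradient(f, [x, y])
print(f.value, grad)
```

One forward evaluation plus one reverse sweep yields every partial derivative at once, which is why the gradient cost is a small multiple of the function evaluation cost regardless of the number of inputs.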


Forward-Mode Auto Diff

• Templated C++ overload for all functions

• Code partial derivatives for basic operations

• Function evaluation propagates chain rule forward

• Nest reverse-mode in forward for higher-order

• Jacobians

– Rerun propagation pass in reverse mode

– Rerun forward construction with forward mode

• Faster autodiff rewrite coming in six months to one year
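For contrast with reverse mode, forward mode carries a derivative alongside each value as the function is evaluated. The Python sketch below uses dual numbers to show this. It is an illustration under simplifying assumptions (scalar input, only addition, multiplication, and exp), with hypothetical names, and is not Stan's forward-mode implementation.

```python
import math

class Dual:
    """Value paired with its tangent (derivative with respect to the seeded input)."""
    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # Product rule, applied as the expression is evaluated.
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

def exp(x):
    e = math.exp(x.value)
    return Dual(e, e * x.tangent)

def derivative(f, x):
    """df/dx at x, obtained by seeding the input tangent with 1."""
    return f(Dual(x, 1.0)).tangent

def f(x):
    return x * x * x + exp(x)

slope = derivative(f, 2.0)   # analytically 3 * 2^2 + e^2
print(slope)
```

Each forward pass gives the derivative along one input direction, so forward mode suits functions with few inputs, while reverse mode suits functions with many inputs and one output.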


Autodiff Functionals

• Fully encapsulates autodiff in C++

• Autodiff operations are functionals (higher-order functions)

– gradients, Jacobians, gradient-vector product

– directional derivative

– Hessian-vector product

– Hessian

– gradient of trace of matrix-Hessian product
  (for SoftAbs RHMC)

• Functions to differentiate coded as functors (or pointers)
  (enables dynamic C++ bind or lambda)


Variable Transforms

• Code HMC and optimization with $\mathbb{R}^n$ support

• Transform constrained parameters to unconstrained

– lower (upper) bound: offset (negated) log transform

– lower and upper bound: scaled, offset logit transform

– simplex: centered, stick-breaking logit transform

– ordered: free first element, log transform offsets

– unit length: spherical coordinates

– covariance matrix: Cholesky factor positive diagonal

– correlation matrix: rows unit length via quadratic stick-breaking


Variable Transforms (cont.)

• Inverse transform from unconstrained $\mathbb{R}^n$

• Evaluate log probability in model block on natural scale

• Optionally adjust log probability for change of variables
  (add log determinant of inverse transform Jacobian)
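Two of the scalar transforms listed earlier, together with the log Jacobian term mentioned above, can be sketched as follows. This Python is a minimal illustration with hypothetical function names, not Stan's code. Each "constrain" function maps an unconstrained value to the constrained scale and returns the log absolute derivative of that inverse transform.

```python
import math

def lower_bound_constrain(u, lower):
    """Map unconstrained u to x > lower; return x and log |dx/du|."""
    x = lower + math.exp(u)
    return x, u                      # dx/du = exp(u), so the log Jacobian is u

def lower_upper_constrain(u, lower, upper):
    """Map unconstrained u to lower < x < upper via a scaled, offset inverse logit."""
    s = 1.0 / (1.0 + math.exp(-u))   # inverse logit
    x = lower + (upper - lower) * s
    # dx/du = (upper - lower) * s * (1 - s)
    log_jacobian = math.log(upper - lower) + math.log(s) + math.log1p(-s)
    return x, log_jacobian

def lower_upper_unconstrain(x, lower, upper):
    """Inverse of lower_upper_constrain: the scaled, offset logit."""
    z = (x - lower) / (upper - lower)
    return math.log(z) - math.log1p(-z)

x, log_jac = lower_upper_constrain(0.3, -1.0, 2.0)
print(x, log_jac, lower_upper_unconstrain(x, -1.0, 2.0))
```

Adding the log Jacobian to the log density keeps the density correct on the unconstrained scale for sampling. Omitting it corresponds to the "disables Jacobians" option described for penalized MLE.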


Parsing and Compilation

• Stan code parsed to abstract syntax tree (AST)
  (Boost Spirit Qi, recursive descent, lazy semantic actions)

• C++ model class code generation from AST
  (Boost Variant)

• C++ code compilation

• Dynamic linking for RStan, PyStan


Coding Probability Functions

• Vectorized to allow scalar or container arguments
  (containers all same shape; scalars broadcast as necessary)

• Avoid repeated computations, e.g. $\log \sigma$ in

  $\log \mathrm{Normal}(y \mid \mu, \sigma) = \sum_{n=1}^{N} \log \mathrm{Normal}(y_n \mid \mu, \sigma) = \sum_{n=1}^{N} \left( -\log \sqrt{2\pi} - \log \sigma - \frac{(y_n - \mu)^2}{2\sigma^2} \right)$

• recursive expression templates to broadcast and cache scalars, generalize containers (arrays, matrices, vectors)

• traits metaprogram to drop constants (e.g., $-\log \sqrt{2\pi}$) and calculate intermediate and return types
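The saving from computing log σ once, and from dropping the constant term, can be seen in a small Python sketch of the vectorized normal log density. It is an illustration only: Stan does this with C++ templates at compile time, whereas the sketch uses a runtime flag, and the function and argument names are hypothetical.

```python
import math

def normal_log(ys, mu, sigma, drop_constants=False):
    """Sum of log Normal(y_n | mu, sigma) over ys, computing log(sigma) once."""
    n = len(ys)
    sum_sq = sum((y - mu) ** 2 for y in ys)
    total = -n * math.log(sigma) - sum_sq / (2.0 * sigma ** 2)
    if not drop_constants:
        total -= n * 0.5 * math.log(2.0 * math.pi)   # the -log sqrt(2 pi) terms
    return total

ys = [0.2, -1.3, 0.7, 2.1]
full = normal_log(ys, mu=0.5, sigma=1.5)
dropped = normal_log(ys, mu=0.5, sigma=1.5, drop_constants=True)
print(full, dropped)
```

The two results differ only by a constant that does not depend on mu or sigma, so sampling and optimization are unaffected when it is dropped.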


Models with Discrete Parameters

• e.g., simple mixture models, survival models, HMMs, discrete measurement error models, missing data

• Marginalize out discrete parameters

• Efficient sampling due to Rao-Blackwellization

• Inference straightforward with expectations

• Too difficult for many of our users
  (exploring encapsulation options)
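As a concrete case of marginalizing out a discrete parameter, consider a two-component normal mixture in which each observation has a hidden component indicator. The Python sketch below sums the indicator out on the log scale and then recovers its posterior probability as an expectation. The model choice and all names are assumptions made for illustration, not taken from the slides.

```python
import math

def normal_lpdf(y, mu, sigma):
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - (y - mu) ** 2 / (2.0 * sigma ** 2))

def log_sum_exp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mixture_log_lik(ys, lam, mu, sigma):
    """log p(y | lam, mu, sigma) with the component indicator summed out."""
    total = 0.0
    for y in ys:
        total += log_sum_exp(math.log(lam) + normal_lpdf(y, mu[0], sigma[0]),
                             math.log1p(-lam) + normal_lpdf(y, mu[1], sigma[1]))
    return total

def responsibility(y, lam, mu, sigma):
    """Pr[indicator = first component | y], recovered after marginalizing."""
    a = math.log(lam) + normal_lpdf(y, mu[0], sigma[0])
    b = math.log1p(-lam) + normal_lpdf(y, mu[1], sigma[1])
    return math.exp(a - log_sum_exp(a, b))

ys = [-2.1, -1.8, 3.2, 2.9]
print(mixture_log_lik(ys, 0.5, (-2.0, 3.0), (1.0, 1.0)))
print(responsibility(-2.0, 0.5, (-2.0, 3.0), (1.0, 1.0)))
```

The marginal likelihood depends only on continuous parameters, so gradient-based samplers apply, and the indicator probabilities are still available afterwards.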


Models with Missing Data

• In principle, missing data just additional parameters

• In practice, how to declare?

– observed data as data variables

– missing data as parameters

– combine into single vector
  (in transformed parameters or local in model)


Appendix II

Bayesian Data Analysis


Bayesian Data Analysis

• “By Bayesian data analysis, we mean practical methods for making inferences from data using probability models for quantities we observe and about which we wish to learn.”

• “The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis.”

Gelman et al., Bayesian Data Analysis, 3rd edition, 2013


Bayesian Mechanics

1. Set up full probability model

• for all observable & unobservable quantities

• consistent w. problem knowledge & data collection

2. Condition on observed data

• calculate posterior probability of unobserved quantities conditional on observed quantities

3. Evaluate

• model fit

• implications of posterior

Ibid.


Basic Quantities

• Basic Quantities

– y: observed data

– $\tilde{y}$: unknown, potentially observable quantities

– θ: parameters (and other unobserved quantities)

– x: constants, predictors for conditional models

• Random models for things that could’ve been otherwise

– Everyone: Model data y as random

– Bayesians: Model parameters θ as random


Distribution Naming Conventions

• Joint: p(y, θ)

• Sampling / Likelihood: p(y|θ)

• Prior: p(θ)

• Posterior: p(θ|y)

• Data Marginal: p(y)

• Posterior Predictive: p(ỹ|y)

y modeled data, θ parameters, ỹ predictions,

implicit: x, x̃ unmodeled data (for y, ỹ), size constants


Bayes’s Rule for the Posterior

• Suppose the data y is fixed (i.e., observed). Then

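$$
\begin{aligned}
p(\theta \mid y) &= \frac{p(y, \theta)}{p(y)} \\
&= \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \\
&= \frac{p(y \mid \theta)\, p(\theta)}{\int p(y, \theta)\, d\theta} \\
&= \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \\
&\propto p(y \mid \theta)\, p(\theta) = p(y, \theta)
\end{aligned}
$$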

• Posterior proportional to likelihood times prior (i.e., joint)
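A minimal sketch of this proportionality, assuming a Bernoulli likelihood with a uniform prior on θ: the unnormalized posterior p(y|θ)p(θ) is evaluated on a grid and then divided by a numerical approximation of p(y). The data, the grid size, and the name `grid_posterior` are illustrative assumptions, not part of the slides.

```python
def grid_posterior(y, grid_size=1001):
    """Grid approximation to p(theta | y) for Bernoulli data with a uniform prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    successes = sum(y)
    failures = len(y) - successes
    # Unnormalized posterior = likelihood * prior (the uniform prior equals 1).
    joint = [t ** successes * (1.0 - t) ** failures for t in thetas]
    # p(y) = integral of p(y, theta) d theta, approximated by the trapezoid rule.
    width = 1.0 / (grid_size - 1)
    evidence = width * (sum(joint) - 0.5 * (joint[0] + joint[-1]))
    density = [j / evidence for j in joint]
    return thetas, density, evidence

y = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical data: 7 successes in 10 trials
thetas, density, evidence = grid_posterior(y)
width = thetas[1] - thetas[0]
# Posterior mean by numerical integration of theta * p(theta | y).
post_mean = width * sum(t * d for t, d in zip(thetas, density))
```

For this conjugate case the exact posterior is Beta(8, 4), so the grid result can be checked against the closed-form mean 8/12. Grid evaluation only scales to a handful of parameters, which is one motivation for the sampling methods that follow.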


Monte Carlo Methods

• For integrals that are impossible to solve analytically

• But for which sampling and evaluation is tractable

• Compute plug-in estimates of statistics based on randomly generated variates (e.g., means, variances, quantiles/intervals, comparisons)

• Estimation error with M (independent) samples proportional to 1/√M

e.g., 100 times more samples per decimal place!

(Metropolis and Ulam 1949)
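The 1/√M scaling can be checked empirically. The sketch below, under the assumption of a toy target (the mean of a Uniform(0, 1) variate, chosen only because its true value 0.5 is known), compares the root-mean-square error of the Monte Carlo estimate at M = 100 and M = 10,000. The function name `rms_error` and the seed are illustrative.

```python
import math
import random

def rms_error(num_samples, num_repeats, rng):
    """RMS error of the Monte Carlo estimate of E[U] for U ~ Uniform(0, 1)."""
    true_mean = 0.5
    total_sq_err = 0.0
    for _ in range(num_repeats):
        estimate = sum(rng.random() for _ in range(num_samples)) / num_samples
        total_sq_err += (estimate - true_mean) ** 2
    return math.sqrt(total_sq_err / num_repeats)

rng = random.Random(1234)  # fixed seed so the run is reproducible
err_small = rms_error(100, 400, rng)      # M = 100
err_large = rms_error(10_000, 400, rng)   # M = 10,000
# Theory predicts a ratio of sqrt(10_000 / 100) = 10: 100x the samples
# buys one extra decimal place of accuracy.
ratio = err_small / err_large
```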


Monte Carlo Example

• Posterior expectation of θ:

$$ E[\theta \mid y] = \int \theta \, p(\theta \mid y) \, d\theta $$

• Bayesian estimate minimizing expected square error:

$$ \hat{\theta} = \arg\min_{\theta'} E[(\theta - \theta')^2 \mid y] = E[\theta \mid y] $$

• Generate samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}$ drawn from p(θ|y)

• Monte Carlo Estimator plugs in average for expectation:

$$ E[\theta \mid y] \approx \frac{1}{M} \sum_{m=1}^{M} \theta^{(m)} $$
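As a hedged illustration of the plug-in estimator, the sketch below assumes the posterior is a Beta(8, 4) distribution (a hypothetical choice, made because its exact mean is available for comparison) and averages independent draws generated with Python's standard `random.betavariate`.

```python
import random

rng = random.Random(42)   # fixed seed for reproducibility
alpha, beta = 8.0, 4.0    # hypothetical posterior: theta | y ~ Beta(8, 4)
M = 20_000

# Independent draws theta^(1), ..., theta^(M) from p(theta | y).
draws = [rng.betavariate(alpha, beta) for _ in range(M)]

# Monte Carlo estimator: the sample average stands in for the integral.
theta_hat = sum(draws) / M

# Closed-form posterior mean, available only because the example is conjugate.
exact = alpha / (alpha + beta)
```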


Monte Carlo Example II

• Bayesian alternative to frequentist hypothesis testing

• Use probability to summarize results

• Bayesian comparison: probability θ1 > θ2 given data y?

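$$
\begin{aligned}
\Pr[\theta_1 > \theta_2 \mid y] &= \int\!\!\int I(\theta_1 > \theta_2)\, p(\theta_1 \mid y)\, p(\theta_2 \mid y)\, d\theta_1\, d\theta_2 \\
&\approx \frac{1}{M} \sum_{m=1}^{M} I(\theta_1^{(m)} > \theta_2^{(m)})
\end{aligned}
$$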

• (Bayesian hierarchical model “adjusts” for multiple comparisons)
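The comparison above reduces to averaging an indicator over paired draws. A minimal sketch, assuming two independent Beta posteriors (Beta(8, 4) and Beta(5, 7), both hypothetical) in line with the product form of the integrand; the helper name `prob_greater` is illustrative.

```python
import random

def prob_greater(draws1, draws2):
    """Monte Carlo estimate of Pr[theta1 > theta2 | y] from paired draws."""
    indicators = [1 if a > b else 0 for a, b in zip(draws1, draws2)]
    return sum(indicators) / len(indicators)

rng = random.Random(99)  # fixed seed for reproducibility
M = 20_000
theta1 = [rng.betavariate(8.0, 4.0) for _ in range(M)]  # hypothetical posterior for theta1
theta2 = [rng.betavariate(5.0, 7.0) for _ in range(M)]  # hypothetical posterior for theta2

# Fraction of paired draws in which theta1 exceeds theta2.
prob = prob_greater(theta1, theta2)
```

With draws from a joint posterior (for example, from a single MCMC run), the same function applies unchanged as long as the m-th elements of both lists come from the same draw.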


Markov Chain Monte Carlo

• When sampling independently from p(θ|y) impossible

• $\theta^{(m)}$ drawn via a Markov chain $p(\theta^{(m)} \mid y, \theta^{(m-1)})$

• Require MCMC marginal $p(\theta^{(m)} \mid y)$ equal to true posterior marginal

• Leads to auto-correlation in samples θ(1), . . . , θ(m)

• Effective sample size $N_{\mathrm{eff}}$ divides out autocorrelation (must be estimated)

• Estimation error proportional to $1/\sqrt{N_{\mathrm{eff}}}$
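A rough sketch of how an effective sample size might be estimated from a single chain, using M / (1 + 2 Σ ρ_t) with the sum of sample autocorrelations ρ_t truncated at the first non-positive value. This is a simplified estimator chosen for illustration and is not claimed to be the one Stan implements. The test chain is an AR(1) process with coefficient 0.5 (an assumption), for which theory gives N_eff = M (1 − 0.5) / (1 + 0.5).

```python
import random

def autocorrelation(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(chain, max_lag=200):
    """N_eff = M / (1 + 2 * sum of autocorrelations), truncated early."""
    rho_sum = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorrelation(chain, lag)
        if rho <= 0.0:  # stop once the estimates are dominated by noise
            break
        rho_sum += rho
    return len(chain) / (1.0 + 2.0 * rho_sum)

rng = random.Random(7)  # fixed seed for reproducibility
phi = 0.5               # AR(1) coefficient (assumed for the demo)
M = 20_000
chain = [rng.gauss(0.0, 1.0)]
for _ in range(M - 1):
    chain.append(phi * chain[-1] + rng.gauss(0.0, 1.0))

n_eff = effective_sample_size(chain)
theory = M * (1.0 - phi) / (1.0 + phi)  # about 6667 for phi = 0.5
```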


Gibbs Sampling

• Samples a parameter given data and other parameters

• Requires conditional posterior $p(\theta_n \mid y, \theta_{-n})$

• Conditional posterior easy in directed graphical model

• Requires general unidimensional sampler for non-conjugacy

– JAGS uses slice sampler

– BUGS uses adaptive rejection sampler

• Conditional sampling and general unidimensional sampler can both lead to slow convergence and mixing

(Geman and Geman 1984)
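A minimal Gibbs sampler sketch, assuming a toy target in place of a real posterior: a standard bivariate normal with correlation ρ, whose conditionals are known in closed form (θ1 | θ2 ~ Normal(ρ θ2, 1 − ρ²), and symmetrically for θ2). Because the conditionals are conjugate here, no general unidimensional sampler is needed. The function name and the value ρ = 0.8 are illustrative.

```python
import math
import random

def gibbs_bivariate_normal(rho, num_draws, rng):
    """Gibbs sampling for a standard bivariate normal with correlation rho."""
    cond_sd = math.sqrt(1.0 - rho ** 2)
    theta1, theta2 = 0.0, 0.0
    draws = []
    for _ in range(num_draws):
        # Update each parameter from its conditional given the other.
        theta1 = rng.gauss(rho * theta2, cond_sd)
        theta2 = rng.gauss(rho * theta1, cond_sd)
        draws.append((theta1, theta2))
    return draws

rng = random.Random(2014)  # fixed seed for reproducibility
draws = gibbs_bivariate_normal(rho=0.8, num_draws=20_000, rng=rng)

n = len(draws)
mean1 = sum(d[0] for d in draws) / n
mean2 = sum(d[1] for d in draws) / n
var1 = sum((d[0] - mean1) ** 2 for d in draws) / n
var2 = sum((d[1] - mean2) ** 2 for d in draws) / n
cov = sum((d[0] - mean1) * (d[1] - mean2) for d in draws) / n
corr = cov / math.sqrt(var1 * var2)  # should be close to rho
```

As ρ approaches 1 the conditional standard deviation shrinks and successive draws move in ever smaller steps, which is the slow mixing the last bullet refers to.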


Metropolis-Hastings Sampling

• Proposes new point by changing all parameters randomly

• Computes accept probability of new point based on ratio of new to old log probability (and proposal density)

• Only requires evaluation of p(θ|y)

• Requires good proposal mechanism to be effective

• Acceptance requires small changes in log probability

• But small step sizes lead to random walks and slow convergence and mixing

(Metropolis et al. 1953; Hastings 1970)
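A sketch of random-walk Metropolis under stated assumptions: the target is the unnormalized log posterior 7 log θ + 3 log(1 − θ) (a hypothetical Beta(8, 4) posterior), and the proposal is a symmetric normal perturbation, so the proposal-density term cancels and only the difference in log probability enters the acceptance step. The step size, warmup length, and function names are illustrative choices.

```python
import math
import random

def log_post(theta):
    """Unnormalized log posterior for a hypothetical Beta(8, 4) target."""
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf  # outside the support
    return 7.0 * math.log(theta) + 3.0 * math.log(1.0 - theta)

def metropolis(log_density, init, step_size, num_draws, rng):
    """Random-walk Metropolis with a symmetric normal proposal."""
    theta = init
    logp = log_density(theta)
    draws = []
    accepted = 0
    for _ in range(num_draws):
        proposal = theta + rng.gauss(0.0, step_size)
        logp_prop = log_density(proposal)
        # Accept with probability min(1, p(proposal | y) / p(theta | y)).
        accept_prob = math.exp(min(0.0, logp_prop - logp))
        if rng.random() < accept_prob:
            theta, logp = proposal, logp_prop
            accepted += 1
        draws.append(theta)  # a rejected proposal repeats the current point
    return draws, accepted / num_draws

rng = random.Random(1953)  # fixed seed for reproducibility
draws, accept_rate = metropolis(log_post, init=0.5, step_size=0.3,
                                num_draws=50_000, rng=rng)
kept = draws[1000:]  # discard warmup draws
post_mean = sum(kept) / len(kept)  # exact value is 8 / 12
```

Shrinking `step_size` raises the acceptance rate but makes the chain move as a slow random walk; growing it too far causes most proposals to be rejected.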
