MCMC Using Parallel Computation
Andrew Beam
ST 790 – Advanced Bayesian Inference
02/21/2013
Outline
• Serial vs. parallel computation
• Chain-level parallelization in MCMC
• R code and example
• Kernel speed-up and parallelization for larger data sets
• Calling C/Java from R
• GPU example
• MCMC in pure Java
• Conclusions
Talk focus: a grab bag of ways to speed up MCMC in R by moving work onto alternative platforms.
MCMC Methods
• MCMC methods are great!
• They are general: they work for essentially any family of distributions.
• They are exact (in the limit), unlike approximate methods such as variational inference.
• They are flexible.
• They do have some drawbacks…
• They can be painfully slow for complex problems, and complex problems are exactly when we would like to use MCMC.
(Comic: a grad student exclaims "Simulating!" while the simulation is running; Dr. Advisor replies, "Your funding has been revoked.")
Serial vs. Parallel Computation
• Computers have used roughly the same architecture for nearly 70 years, proposed by John von Neumann in 1945.
• Programs are stored on disk and loaded into system memory (RAM) by a control unit.
• Instructions are executed one at a time by an arithmetic logic unit (ALU).
• A Central Processing Unit (CPU) consists of one ALU, control units, communication lines, and devices for I/O. The ALU is what executes instructions.
Moore's law
• "The number of transistors on integrated circuits doubles approximately every two years." – Intel co-founder Gordon E. Moore
• Around 2002, chip manufacturers began to shift from increasing the speed of single processors to providing multiple computational units on each processor.
• This is consistent with Moore's law, but it changes the way programmers (i.e. us) have to use chips.
• The CPU still consists of the same components, but now has multiple ALUs capable of executing instructions at the same time.
Multicore architectures
• Each central processing unit (CPU) now contains multiple ALUs and can execute multiple instructions simultaneously.
• Communication between cores happens through shared memory.
• Programs can create multiple execution "threads" to take advantage of this.
• The speed-up requires explicit parallel programming: running a program designed for a single-core processor on a multicore processor will not make it faster.
Program execution hierarchy
• The atomic unit of execution is a thread: a serial sequence of instructions, e.g. a = 2+2; b = 3*4; c = (b/a)^2.
– Threads communicate with each other using a variety of system-dependent protocols.
• A collection of threads is known as a process, usually representing a common computational task.
• A collection of processes is known as an application.
• Example: the Google Chrome browser.
– The entire browser is an application.
– Each tab is a process controlling multiple threads.
– A thread handles different elements of each page (e.g. music, chat, etc.).
Decomposing the MCMC problem
• Given that our computers are capable of doing multiple things at once, we should use all the resources available to us.
• If a problem has independent parts, we should try to execute them in parallel. Such problems are often called "embarrassingly" parallel.
• Let's go through the general Metropolis-Hastings algorithm and look for parts we can speed up using parallel computation.
General MH MCMC Algorithm
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
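The loop above can be sketched in a few lines of R. This is a minimal illustration, not the talk's source code: the names run_chain, log_kernel, and prop_sd are illustrative.

```r
# A sketch of one random-walk Metropolis-Hastings chain (steps 2-7 above).
# log_kernel(theta) must return the log of the posterior kernel, log K(theta).
run_chain <- function(log_kernel, theta0, n_iter, prop_sd = 0.1) {
  p <- length(theta0)
  samples <- matrix(NA_real_, nrow = n_iter, ncol = p)  # preallocate storage
  theta <- theta0
  log_k_old <- log_kernel(theta)
  for (i in seq_len(n_iter)) {
    theta_new <- theta + rnorm(p, 0, prop_sd)   # step 3: draw from the proposal
    log_k_new <- log_kernel(theta_new)          # step 4: evaluate the kernel
    rho <- min(1, exp(log_k_new - log_k_old))   # step 5: acceptance ratio
    if (runif(1) < rho) {                       # step 6: accept with probability rho
      theta <- theta_new
      log_k_old <- log_k_new
    }
    samples[i, ] <- theta
  }
  samples
}
```

For example, run_chain(function(t) -sum(t^2) / 2, 0, 10000, prop_sd = 1) samples from a standard normal.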
How can we make this faster?
These are Nc independent operations; we can run them concurrently on a multicore machine.
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
Speeding up multi-chain simulations
• Without parallel computing, the chains must be completed one at a time, even on a machine with 4 available cores: the first chain finishes and the second one starts; the second chain is done and the third one starts; then all three are done.
• If one chain takes T seconds, the entire simulation takes 3*T seconds.
(Figure sequence: Chains 1-3 completing one after another on a machine with 4 available cores.)
• Now consider a parallel version: each chain is assigned to its own core.
• The entire simulation takes T + ε, where ε is the thread communication penalty and is usually << T.
(Figure: Chains 1-3 running simultaneously, one per core.)
Multiple Chains vs. One Long Chain
• It has been claimed that a single chain of length N*L is preferable to N chains of length L, due to the better mixing properties of the longer chain.
• Robert and Casella agree (text, pg. 464).
• However, for moderately sized problems on a multicore machine you should never use just one chain, because you can get additional samples for essentially no extra time.
• Buy one, get two free.
• You can use multiple chains to assess convergence, e.g. with the Gelman and Rubin diagnostic.
• More samples are always better, even if they are spread over a few chains.
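As a concrete illustration of the multi-chain convergence check, the Gelman and Rubin diagnostic is implemented in the coda package (the package choice is an assumption here; the slides do not name one). The chains below are stand-in matrices of draws; in practice they would come from the sampler.

```r
# Gelman-Rubin diagnostic across two chains using the coda package.
# Rows are iterations, columns are parameters.
library(coda)
set.seed(42)
chain1 <- mcmc(matrix(rnorm(1000), ncol = 2))
chain2 <- mcmc(matrix(rnorm(1000), ncol = 2))
gd <- gelman.diag(mcmc.list(chain1, chain2))
gd$psrf  # point estimates near 1 suggest the chains have converged
```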
Chain-level parallelization in R
• For small to moderately sized problems, it's very easy to do this in R with minor changes to existing code. (Small to moderate size ≈ 100-1000 observations, 10-100 predictors.)
• Revolution Analytics has developed several packages to parallelize independent for loops: doSNOW on Windows, doMC on Linux/OS X.
• These use the foreach paradigm to split up the work across several cores and then collect the results.
• The special %dopar% operator indicates which tasks should run in parallel.
• This is more of a "hack" than a fully concurrent library.
• It is a good fit if your simulation currently takes a reasonable amount of time and doesn't use too much memory.
doMC/doSNOW Paradigm
For N parallel tasks (Task 1, Task 2, …, Task N):
1. Create N copies of the entire R workspace.
2. Run N R sessions in parallel, one per task.
3. Collect the results and return control to the original R session.
Parallel Template (Windows)
• Load the doSNOW package.
• Register the parallel backend.
• Indicate how the results should be collected.
• Indicate that the loop should be run in parallel.
• Assign the return value of foreach to a variable.
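The template itself is shown as a screenshot in the talk; a minimal sketch of what it might look like, following the steps above (the cluster size and the loop body are placeholders):

```r
# Parallel foreach template using doSNOW (Windows).
library(doSNOW)                        # load the doSNOW package
cl <- makeCluster(2)                   # create a cluster of workers
registerDoSNOW(cl)                     # register the parallel backend
# .combine says how results are collected; %dopar% marks the loop parallel
results <- foreach(i = 1:4, .combine = rbind) %dopar% {
  sqrt(i)                              # placeholder for the real task
}
stopCluster(cl)                        # shut the workers down when done
```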
Example: Penalized Regression via MCMC
Prior for β
(Figure: the prior density for β at τ = 1, τ = 0.5, and τ = 0.1.)
Simulation Description
• N = 100, p = 10.
• Xij ~ N(0,1).
• True model: Yi = Xi2 + Xi6 + ei, where ei ~ N(0,1).
• Full penalty -> τ = 1.
• Sample using RW-MH with proposals βi = βi-1 + εi, where εi ~ N(0, (1/2p)^2).
• Simulate initially for Nc = 1 chain and Ns = 100,000 iterations, keeping 10,000 for inference.
• Hardware: MacBook Pro, quad-core i7 with hyperthreading, 8 GB RAM.
• Sampler code is available in chain-parallel-lasso.R; demonstration code in mcmc-lasso.R.
• N and p are adjustable so you can see how many chains your system can handle for a given problem size.
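The data-generation step can be sketched directly from the description above (variable names here are illustrative, not taken from mcmc-lasso.R):

```r
# Simulated data matching the description: N = 100, p = 10,
# Xij ~ N(0,1), Y = X2 + X6 + e with e ~ N(0,1).
set.seed(790)
N <- 100
p <- 10
X <- matrix(rnorm(N * p), nrow = N, ncol = p)
e <- rnorm(N)
Y <- X[, 2] + X[, 6] + e
prop_sd <- 1 / (2 * p)   # RW-MH proposal sd, per the slide: N(0, (1/2p)^2)
```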
Simulation Results
(Posterior summaries: β1 ≈ 0, β2 ≈ 1, β3 ≈ 0, β4 ≈ 0, β5 ≈ 0, β6 ≈ 1, β7 ≈ 0, β8 ≈ 0, β9 ≈ 0, β10 ≈ 0.)
• The sampler appears to be working: it finds the correct variables.
• Execution time: 3.02 s.
• RAM used: 30 MB.
• Returns an Nk × p matrix of samples.
Sampler Code
(Annotated code, not reproduced: propose, evaluate, accept/reject.)
• Avoid using rbind at all costs.
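The reason to avoid rbind: growing the sample matrix copies it on every iteration, making the loop quadratic in the number of samples, while filling a preallocated matrix is linear. A small sketch of the two patterns (function names are illustrative):

```r
# Growing with rbind copies the whole matrix each iteration;
# preallocating the matrix once and filling rows does not.
n_iter <- 2000
p <- 10

grow_with_rbind <- function() {
  samples <- NULL
  for (i in 1:n_iter) samples <- rbind(samples, rnorm(p))  # copies every time
  samples
}

preallocate <- function() {
  samples <- matrix(NA_real_, n_iter, p)   # allocate once
  for (i in 1:n_iter) samples[i, ] <- rnorm(p)
  samples
}
```

At larger n_iter, comparing system.time(grow_with_rbind()) to system.time(preallocate()) makes the gap obvious.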
Multiple Chain Penalized Regression
• Wrap the single-chain sampler into a function and call it multiple times from within the foreach loop.
(Annotated code, not reproduced: register the parallel backend; call foreach with %dopar%; handle a Windows quirk; call the single-chain sampler; return the single-chain results to be collected.)
Multiple Chain Penalized Regression
Example call using 2 chains:
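The example call is shown as a screenshot in the talk; a sketch of the same pattern follows, with run_one_chain standing in for the single-chain sampler (stubbed here so the snippet is self-contained):

```r
# Run 2 chains in parallel and stack their kept samples with rbind.
library(doSNOW)
# Stub for the single-chain sampler: returns an n_keep x p matrix of draws.
run_one_chain <- function(n_keep, p) matrix(rnorm(n_keep * p), n_keep, p)

cl <- makeCluster(2)
registerDoSNOW(cl)
all_samples <- foreach(chain = 1:2, .combine = rbind) %dopar% {
  run_one_chain(n_keep = 100, p = 10)   # each worker runs one full chain
}
stopCluster(cl)
dim(all_samples)   # 200 x 10: both chains' kept samples, stacked
```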
Multiple Chain Execution Times
(Figure: execution time by number of chains, with the single-chain execution time shown for reference.)
doSNOW/doMC not just for MCMC
• It is worth noting that this paradigm is useful for many other tasks in R: anything with "embarrassingly" parallel portions, such as the bootstrap, permutation tests, cross-validation, bagging, etc.
• Be aware that every parallel R session uses the same amount of memory as the parent: computation happens 5x faster, but uses 5x as much memory.
• This is application-level parallelism.
Back to the MH Algorithm
Unfortunately, the iteration loop is inherently serial: each iteration depends on the previous state θold, so we can't speed it up through parallelization.
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
Can we speed up the inner loop?
What is the most expensive part of each iteration?
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
• In each iteration we have to evaluate the kernel for N samples and p parameters; this is the most expensive step.
• As N gets bigger, we can use parallel devices to speed this up.
• Communication penalties preclude multicore CPUs from speeding this up.
Kernel Evaluations
• For the penalized regression problem we have three main kernel computations:
– Evaluate Xβ: could be expensive for large N or large p.
– Evaluate the likelihood of Y given Xβ: N evaluations of the normal pdf.
– Evaluate the prior for β: cheap for small p, potentially expensive for large p.
• For demonstration, let's assume we are in a scenario with large N and reasonable p:
– Xβ: not expensive. Likelihood: expensive. Prior: very cheap.
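Under those assumptions, the whole kernel fits in a few lines of R. This is a sketch, not the talk's chain-parallel-lasso.R code: the likelihood fixes σ = 1 and the penalty is written as τ·Σ|βj|, which may differ from the talk's exact parameterization.

```r
# Log posterior kernel for the penalized regression example.
log_kernel <- function(beta, X, Y, tau = 1) {
  eta <- X %*% beta                                        # X*beta (cheap here)
  loglik <- sum(dnorm(Y, mean = eta, sd = 1, log = TRUE))  # N normal pdf evaluations
  logprior <- -tau * sum(abs(beta))                        # penalty/prior, up to a constant
  loglik + logprior
}
```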
Kernel Evaluations in Java and C
• If evaluating our kernel is too slow in R, we can drop down to another language from within R and evaluate it there. This is actually what dnorm() in R does.
• R has native support for calling C functions; the rJava package provides an interface for calling Java from R.
• For small problems (e.g. N ~ 100), C will be faster. The reason is a communication penalty: R can't call Java natively, so it has to transfer data to Java, start execution, and then retrieve the results.
• Why Java?
– A rich set of external libraries for scientific computing.
– Portable: write once, run anywhere.
– No pointers, and built-in memory management (no malloc).
– A unified concurrency library: multithreaded programs run on any platform.
– Thread-level parallelism saves memory.
Example Java Code for Kernel Evaluation
(Annotated Java source, not reproduced: a function called from R that evaluates the likelihood and then the prior, a function to evaluate the normal PDF, and a function to evaluate the prior.)
Calling Java from R
• The rJava package has R-Java bindings that allow R to use Java objects.
• A Java-based kernel for penalized regression is available in java_based_sampler.R, with Java source code in dnorm.java.
• Compiled Java code is in dnorm.class. To compile the source, use the javac command: javac dnorm.java. This is not necessary unless you modify the source, in which case you will have to recompile.
• Performance is slightly worse than the C code used by dnorm() in R.
• Abstracting out the kernel evaluation is a good idea if you have complicated kernels not available in R, or large values of N or p.
• What if you have so much data that Java or C still isn't enough?
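The calling pattern rJava uses can be sketched with a JDK class, so the snippet runs without dnorm.class on the classpath; the talk's compiled kernel would be invoked the same way (its method name and signature live in dnorm.java and are not reproduced here):

```r
# Calling Java from R with rJava: .jcall(class, returnSignature, method, args).
# "D" is the JNI signature for a double return value.
library(rJava)
.jinit()                                            # start the JVM
val <- .jcall("java/lang/Math", "D", "sqrt", 2.0)   # Java's Math.sqrt(2.0)
```

With dnorm.class on the classpath (e.g. .jinit(classpath = ".")), the kernel method would be called with the same .jcall pattern.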
GPU Computing
(Figure: CPU vs. GPU architecture.)
• GPU: Graphics Processing Unit.
– Designed to do floating-point math very quickly for image processing.
– Cheap: usually a few hundred dollars.
– Highly parallel: 100s to 1000s of processors per unit.
– Very little control logic: designed for floating-point operations.
– Good for doing many simultaneous calculations (e.g. + - / *).
• CUDA: Compute Unified Device Architecture.
– An SDK and driver set that gives access to NVIDIA floating-point units.
– Provides a C-level interface and compiler.
• We write the kernel evaluation in CUDA C.
CUDA Programming Paradigm
• CUDA is based on the notions of the host (i.e. the CPU) and the device (i.e. the GPU).
• Create data objects (arrays, matrices, etc.) on the host, transfer them to the device, and then retrieve the results.
• Computation is accomplished via "kernels": small computational programs. GPU threads are each given a block of data on which they execute the kernel.
• Compile C programs using NVIDIA's nvcc compiler.
• We can use this to do the expensive normal PDF evaluation.
GPU Simulation Setup
• Do N normal PDF evaluations for N ranging from 100 to 100,000,000.
• Compare to R's native dnorm().
• Hardware: GeForce GTX 650 with 2 GB RAM and 384 CUDA processors.
• Requires the CUDA SDK 3.0, a compatible GPU, and CUDA drivers.
• Code available in kernel.cu. Compile on Linux using: nvcc kernel.cu -o kernel
Execution Times (seconds)

N            GPU        dnorm
100          0.000348   0.0001
1,000        0.000524   0.0004
10,000       0.000733   0.001
100,000      0.00124    0.007
1,000,000    0.006677   0.097
5,000,000    0.0235     0.385
10,000,000   0.053936   0.772
50,000,000   0.22657    3.75
100,000,000  0.542329   7.567

(Figure: speed-up of the GPU over dnorm() as N grows.)
With a sample size of 5 million, running a simulation for 100,000 iterations would take about 10.7 hours using dnorm() (100,000 × 0.385 s), but about 40 minutes on a GPU (100,000 × 0.0235 s).
GPUs and MCMC – Not just about speed
Pierre Jacob, Christian P. Robert, Murray H. Smith
Using parallel computation to improve Independent Metropolis--Hastings based estimation
In this paper, we consider the implications of the fact that parallel raw-power can be exploited by a generic Metropolis--Hastings algorithm if the proposed values are independent. In particular, we present improvements to the independent Metropolis--Hastings algorithm that significantly decrease the variance of any estimator derived from the MCMC output, for a null computing cost since those improvements are based on a fixed number of target density evaluations. Furthermore, the techniques developed in this paper do not jeopardize the Markovian convergence properties of the algorithm, since they are based on the Rao--Blackwell principles of Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchade and Perron (2005) and Douc and Robert (2010). We illustrate those improvements both on a toy normal example and on a classical probit regression model, but stress the fact that they are applicable in any case where the independent Metropolis-Hastings is applicable.
http://arxiv.org/abs/1010.1595
Non-R-based solutions
• If you want your simulation to just "go faster", you will probably need to abandon R altogether.
• Implement everything in another language like C/C++ or Java and it will (most likely) be orders of magnitude faster.
– Compiled code is in general much faster.
– You can stop having wacky R memory issues.
– You control the level of parallelization (thread vs. process vs. application).
– You can interface more easily with devices like GPUs.
– Pain now will be rewarded later.
BAJA – Bayesian Analysis in JAva
• ST 790 project.
• A pure-Java, general MH-MCMC sampler.
• Adaptive sampling options available.
• Model specification via a directed acyclic graph (DAG).
• Parallel chain execution (multithreaded).
• Stochastic thinning via resolvant.
• Memory efficient (relatively).
• Automatic generation of trace plots, histograms, and ACF plots.
• Export samples to .csv or R.
BAJA Parallel Performance
(Figure: parallel performance for chains of length L = 1,000,000.)

Speed comparison with several platforms
(Figure: execution-time comparison across several platforms.)
Thank you!