MCMC Using Parallel Computation
Andrew Beam
ST 790 – Advanced Bayesian Inference
02/21/2013
Outline
• Serial vs. parallel computation
• Chain-level parallelization in MCMC
• R code and example
• Kernel speed-up and parallelization for larger data sets
• Calling C/Java from R
• GPU example
• MCMC in pure Java
• Conclusions
Talk focus: a grab bag of ways to speed up MCMC in R by moving work onto alternative platforms.
MCMC Methods
• MCMC methods are great!
• They are general: they work for essentially any family of distributions.
• They are exact (in the limit), unlike approximate methods such as variational inference.
• They are flexible.
• They do have some drawbacks…
• They can be painfully slow for complex problems, and complex problems are exactly when we would like to use MCMC.
(Comic: a grad student exclaims "Simulating!" while the simulation is running; Dr. Advisor replies, "Your funding has been revoked.")
Serial vs. Parallel Computation
• Computers have used roughly the same architecture for nearly 70 years, proposed by John von Neumann in 1945.
• Programs are stored on disk and loaded into system memory (RAM) by a control unit.
• Instructions are executed one at a time by an arithmetic logic unit (ALU).
• A Central Processing Unit (CPU) consists of one ALU, control units, communication lines, and devices for I/O. The ALU is what executes instructions.
Moore's law
• "The number of transistors on integrated circuits doubles approximately every two years." – Intel co-founder Gordon E. Moore
• Around 2002, chip manufacturers began to shift from increasing the speed of single processors to providing multiple computational units on each processor.
• This is consistent with Moore's law, but it changes the way programmers (i.e. us) have to use chips.
• The CPU still consists of the same components, but now has multiple ALUs capable of executing instructions at the same time.
Multicore architectures
• Each central processing unit (CPU) now contains multiple ALUs and can execute multiple instructions simultaneously.
• Communication between cores happens through shared memory.
• Programs can create multiple execution "threads" to take advantage of this.
• The speed-up requires explicit parallel programming: running a program designed for a single-core processor on a multicore processor will not make it faster.
Program execution hierarchy
• The atomic unit of execution is a thread: a serial sequence of instructions, e.g. a = 2+2; b = 3*4; c = (b/a)^2.
– Threads communicate with each other using a variety of system-dependent protocols.
• A collection of threads is known as a process, usually representing a common computational task.
• A collection of processes is known as an application.
• Example: the Google Chrome browser.
– The entire browser is an application.
– Each tab is a process controlling multiple threads.
– A thread handles different elements of each page (e.g. music, chat, etc.).
Decomposing the MCMC problem
• Given that our computers are capable of doing multiple things at once, we should use all the resources available to us.
• If a problem has independent parts, we should try to execute them in parallel. Such problems are often called "embarrassingly" parallel.
• Let's go through the general Metropolis-Hastings algorithm and look for parts we can speed up using parallel computation.
General MH MCMC Algorithm
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
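The loop above can be sketched in a few lines of R. This is a minimal illustration, not the talk's source code: the names run_chain, log_kernel, and prop_sd are illustrative.

```r
# A sketch of one random-walk Metropolis-Hastings chain (steps 2-7 above).
# log_kernel(theta) must return the log of the posterior kernel, log K(theta).
run_chain <- function(log_kernel, theta0, n_iter, prop_sd = 0.1) {
  p <- length(theta0)
  samples <- matrix(NA_real_, nrow = n_iter, ncol = p)  # preallocate storage
  theta <- theta0
  log_k_old <- log_kernel(theta)
  for (i in seq_len(n_iter)) {
    theta_new <- theta + rnorm(p, 0, prop_sd)   # step 3: draw from the proposal
    log_k_new <- log_kernel(theta_new)          # step 4: evaluate the kernel
    rho <- min(1, exp(log_k_new - log_k_old))   # step 5: acceptance ratio
    if (runif(1) < rho) {                       # step 6: accept with probability rho
      theta <- theta_new
      log_k_old <- log_k_new
    }
    samples[i, ] <- theta
  }
  samples
}
```

For example, run_chain(function(t) -sum(t^2) / 2, 0, 10000, prop_sd = 1) samples from a standard normal.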
How can we make this faster?
These are Nc independent operations; we can run them concurrently on a multicore machine.
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
Speeding up multi-chain simulations
• Without parallel computing, the chains must be completed one at a time, even on a machine with 4 available cores: the first chain finishes and the second one starts; the second chain is done and the third one starts; then all three are done.
• If one chain takes T seconds, the entire simulation takes 3*T seconds.
(Figure sequence: Chains 1-3 completing one after another on a machine with 4 available cores.)
• Now consider a parallel version: each chain is assigned to its own core.
• The entire simulation takes T + ε, where ε is the thread communication penalty and is usually << T.
(Figure: Chains 1-3 running simultaneously, one per core.)
Multiple Chains vs. One Long Chain
• It has been claimed that a single chain of length N*L is preferable to N chains of length L, due to the better mixing properties of the longer chain.
• Robert and Casella agree (text, pg. 464).
• However, for moderately sized problems on a multicore machine you should never use just one chain, because you can get additional samples for essentially no extra time.
• Buy one, get two free.
• You can use multiple chains to assess convergence, e.g. with the Gelman and Rubin diagnostic.
• More samples are always better, even if they are spread over a few chains.
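As a concrete illustration of the multi-chain convergence check, the Gelman and Rubin diagnostic is implemented in the coda package (the package choice is an assumption here; the slides do not name one). The chains below are stand-in matrices of draws; in practice they would come from the sampler.

```r
# Gelman-Rubin diagnostic across two chains using the coda package.
# Rows are iterations, columns are parameters.
library(coda)
set.seed(42)
chain1 <- mcmc(matrix(rnorm(1000), ncol = 2))
chain2 <- mcmc(matrix(rnorm(1000), ncol = 2))
gd <- gelman.diag(mcmc.list(chain1, chain2))
gd$psrf  # point estimates near 1 suggest the chains have converged
```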
Chain-level parallelization in R
• For small to moderately sized problems, it's very easy to do this in R with minor changes to existing code. (Small to moderate size ≈ 100-1000 observations, 10-100 predictors.)
• Revolution Analytics has developed several packages to parallelize independent for loops: doSNOW on Windows, doMC on Linux/OS X.
• These use the foreach paradigm to split up the work across several cores and then collect the results.
• The special %dopar% operator indicates which tasks should run in parallel.
• This is more of a "hack" than a fully concurrent library.
• It is a good fit if your simulation currently takes a reasonable amount of time and doesn't use too much memory.
doMC/doSNOW Paradigm
For N parallel tasks (Task 1, Task 2, …, Task N):
1. Create N copies of the entire R workspace.
2. Run N R sessions in parallel, one per task.
3. Collect the results and return control to the original R session.
Parallel Template (Windows)
• Load the doSNOW package.
• Register the parallel backend.
• Indicate how the results should be collected.
• Indicate that the loop should be run in parallel.
• Assign the return value of foreach to a variable.
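The template itself is shown as a screenshot in the talk; a minimal sketch of what it might look like, following the steps above (the cluster size and the loop body are placeholders):

```r
# Parallel foreach template using doSNOW (Windows).
library(doSNOW)                        # load the doSNOW package
cl <- makeCluster(2)                   # create a cluster of workers
registerDoSNOW(cl)                     # register the parallel backend
# .combine says how results are collected; %dopar% marks the loop parallel
results <- foreach(i = 1:4, .combine = rbind) %dopar% {
  sqrt(i)                              # placeholder for the real task
}
stopCluster(cl)                        # shut the workers down when done
```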
Example: Penalized Regression via MCMC
Prior for β
(Figure: the prior density for β at τ = 1, τ = 0.5, and τ = 0.1.)
Simulation Description
• N = 100, p = 10.
• Xij ~ N(0,1).
• True model: Yi = Xi2 + Xi6 + ei, where ei ~ N(0,1).
• Full penalty -> τ = 1.
• Sample using RW-MH with proposals βi = βi-1 + εi, where εi ~ N(0, (1/2p)^2).
• Simulate initially for Nc = 1 chain and Ns = 100,000 iterations, keeping 10,000 for inference.
• Hardware: MacBook Pro, quad-core i7 with hyperthreading, 8 GB RAM.
• Sampler code is available in chain-parallel-lasso.R; demonstration code in mcmc-lasso.R.
• N and p are adjustable so you can see how many chains your system can handle for a given problem size.
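The data-generation step can be sketched directly from the description above (variable names here are illustrative, not taken from mcmc-lasso.R):

```r
# Simulated data matching the description: N = 100, p = 10,
# Xij ~ N(0,1), Y = X2 + X6 + e with e ~ N(0,1).
set.seed(790)
N <- 100
p <- 10
X <- matrix(rnorm(N * p), nrow = N, ncol = p)
e <- rnorm(N)
Y <- X[, 2] + X[, 6] + e
prop_sd <- 1 / (2 * p)   # RW-MH proposal sd, per the slide: N(0, (1/2p)^2)
```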
Simulation Results
(Posterior summaries: β1 ≈ 0, β2 ≈ 1, β3 ≈ 0, β4 ≈ 0, β5 ≈ 0, β6 ≈ 1, β7 ≈ 0, β8 ≈ 0, β9 ≈ 0, β10 ≈ 0.)
• The sampler appears to be working: it finds the correct variables.
• Execution time: 3.02 s.
• RAM used: 30 MB.
• Returns an Nk × p matrix of samples.
Sampler Code
(Annotated code, not reproduced: propose, evaluate, accept/reject.)
• Avoid using rbind at all costs.
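The reason to avoid rbind: growing the sample matrix copies it on every iteration, making the loop quadratic in the number of samples, while filling a preallocated matrix is linear. A small sketch of the two patterns (function names are illustrative):

```r
# Growing with rbind copies the whole matrix each iteration;
# preallocating the matrix once and filling rows does not.
n_iter <- 2000
p <- 10

grow_with_rbind <- function() {
  samples <- NULL
  for (i in 1:n_iter) samples <- rbind(samples, rnorm(p))  # copies every time
  samples
}

preallocate <- function() {
  samples <- matrix(NA_real_, n_iter, p)   # allocate once
  for (i in 1:n_iter) samples[i, ] <- rnorm(p)
  samples
}
```

At larger n_iter, comparing system.time(grow_with_rbind()) to system.time(preallocate()) makes the gap obvious.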
Multiple Chain Penalized Regression
• Wrap the single-chain sampler into a function and call it multiple times from within the foreach loop.
(Annotated code, not reproduced: register the parallel backend; call foreach with %dopar%; handle a Windows quirk; call the single-chain sampler; return the single-chain results to be collected.)
Multiple Chain Penalized Regression
Example call using 2 chains:
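The example call is shown as a screenshot in the talk; a sketch of the same pattern follows, with run_one_chain standing in for the single-chain sampler (stubbed here so the snippet is self-contained):

```r
# Run 2 chains in parallel and stack their kept samples with rbind.
library(doSNOW)
# Stub for the single-chain sampler: returns an n_keep x p matrix of draws.
run_one_chain <- function(n_keep, p) matrix(rnorm(n_keep * p), n_keep, p)

cl <- makeCluster(2)
registerDoSNOW(cl)
all_samples <- foreach(chain = 1:2, .combine = rbind) %dopar% {
  run_one_chain(n_keep = 100, p = 10)   # each worker runs one full chain
}
stopCluster(cl)
dim(all_samples)   # 200 x 10: both chains' kept samples, stacked
```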
Multiple Chain Execution Times
(Figure: execution time by number of chains, with the single-chain execution time shown for reference.)
doSNOW/doMC not just for MCMC
• It is worth noting that this paradigm is useful for many other tasks in R: anything with "embarrassingly" parallel portions, such as the bootstrap, permutation tests, cross-validation, bagging, etc.
• Be aware that every parallel R session uses the same amount of memory as the parent: computation happens 5x faster, but uses 5x as much memory.
• This is application-level parallelism.
Back to the MH Algorithm
Unfortunately, the iteration loop is inherently serial: each iteration depends on the previous state θold, so we can't speed it up through parallelization.
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
Can we speed up the inner loop?
What is the most expensive part of each iteration?
1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel for this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
• In each iteration we have to evaluate the kernel for N samples and p parameters; this is the most expensive step.
• As N gets bigger, we can use parallel devices to speed this up.
• Communication penalties preclude multicore CPUs from speeding this up.
Kernel Evaluations
• For the penalized regression problem we have three main kernel computations:
– Evaluate Xβ: could be expensive for large N or large p.
– Evaluate the likelihood of Y given Xβ: N evaluations of the normal pdf.
– Evaluate the prior for β: cheap for small p, potentially expensive for large p.
• For demonstration, let's assume we are in a scenario with large N and reasonable p:
– Xβ: not expensive. Likelihood: expensive. Prior: very cheap.
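Under those assumptions, the whole kernel fits in a few lines of R. This is a sketch, not the talk's chain-parallel-lasso.R code: the likelihood fixes σ = 1 and the penalty is written as τ·Σ|βj|, which may differ from the talk's exact parameterization.

```r
# Log posterior kernel for the penalized regression example.
log_kernel <- function(beta, X, Y, tau = 1) {
  eta <- X %*% beta                                        # X*beta (cheap here)
  loglik <- sum(dnorm(Y, mean = eta, sd = 1, log = TRUE))  # N normal pdf evaluations
  logprior <- -tau * sum(abs(beta))                        # penalty/prior, up to a constant
  loglik + logprior
}
```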
Kernel Evaluations in Java and C
• If evaluating our kernel is too slow in R, we can drop down to another language from within R and evaluate it there. This is actually what dnorm() in R does.
• R has native support for calling C functions; the rJava package provides an interface for calling Java from R.
• For small problems (e.g. N ~ 100), C will be faster. The reason is a communication penalty: R can't call Java natively, so it has to transfer data to Java, start execution, and then retrieve the results.
• Why Java?
– A rich set of external libraries for scientific computing.
– Portable: write once, run anywhere.
– No pointers, and built-in memory management (no malloc).
– A unified concurrency library: multithreaded programs run on any platform.
– Thread-level parallelism saves memory.
Example Java Code for Kernel Evaluation
(Annotated Java source, not reproduced: a function called from R that evaluates the likelihood and then the prior, a function to evaluate the normal PDF, and a function to evaluate the prior.)
Calling Java from R
• The rJava package has R-Java bindings that allow R to use Java objects.
• A Java-based kernel for penalized regression is available in java_based_sampler.R, with Java source code in dnorm.java.
• Compiled Java code is in dnorm.class. To compile the source, use the javac command: javac dnorm.java. This is not necessary unless you modify the source, in which case you will have to recompile.
• Performance is slightly worse than the C code used by dnorm() in R.
• Abstracting out the kernel evaluation is a good idea if you have complicated kernels not available in R, or large values of N or p.
• What if you have so much data that Java or C still isn't enough?
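The calling pattern rJava uses can be sketched with a JDK class, so the snippet runs without dnorm.class on the classpath; the talk's compiled kernel would be invoked the same way (its method name and signature live in dnorm.java and are not reproduced here):

```r
# Calling Java from R with rJava: .jcall(class, returnSignature, method, args).
# "D" is the JNI signature for a double return value.
library(rJava)
.jinit()                                            # start the JVM
val <- .jcall("java/lang/Math", "D", "sqrt", 2.0)   # Java's Math.sqrt(2.0)
```

With dnorm.class on the classpath (e.g. .jinit(classpath = ".")), the kernel method would be called with the same .jcall pattern.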
GPU Computing
(Figure: CPU vs. GPU architecture.)
• GPU: Graphics Processing Unit.
– Designed to do floating-point math very quickly for image processing.
– Cheap: usually a few hundred dollars.
– Highly parallel: 100s to 1000s of processors per unit.
– Very little control logic: designed for floating-point operations.
– Good for doing many simultaneous calculations (e.g. + - / *).
• CUDA: Compute Unified Device Architecture.
– An SDK and driver set that gives access to NVIDIA floating-point units.
– Provides a C-level interface and compiler.
• We write the kernel evaluation in CUDA C.
CUDA Programming Paradigm
• CUDA is based on the notions of the host (i.e. the CPU) and the device (i.e. the GPU).
• Create data objects (arrays, matrices, etc.) on the host, transfer them to the device, and then retrieve the results.
• Computation is accomplished via "kernels": small computational programs. GPU threads are each given a block of data on which they execute the kernel.
• Compile C programs using NVIDIA's nvcc compiler.
• We can use this to do the expensive normal PDF evaluation.
GPU Simulation Setup
• Do N normal PDF evaluations for N ranging from 100 to 100,000,000.
• Compare to R's native dnorm().
• Hardware: GeForce GTX 650 with 2 GB RAM and 384 CUDA processors.
• Requires the CUDA SDK 3.0, a compatible GPU, and CUDA drivers.
• Code available in kernel.cu. Compile on Linux using: nvcc kernel.cu -o kernel
Execution Times (seconds)

N            GPU        dnorm
100          0.000348   0.0001
1,000        0.000524   0.0004
10,000       0.000733   0.001
100,000      0.00124    0.007
1,000,000    0.006677   0.097
5,000,000    0.0235     0.385
10,000,000   0.053936   0.772
50,000,000   0.22657    3.75
100,000,000  0.542329   7.567

(Figure: speed-up of the GPU over dnorm() as N grows.)
With a sample size of 5 million, running a simulation for 100,000 iterations would take about 10.7 hours using dnorm() (100,000 × 0.385 s), but about 40 minutes on a GPU (100,000 × 0.0235 s).
GPUs and MCMC – Not just about speed
Pierre Jacob, Christian P. Robert, Murray H. Smith
Using parallel computation to improve Independent Metropolis--Hastings based estimation
In this paper, we consider the implications of the fact that parallel raw-power can be exploited by a generic Metropolis--Hastings algorithm if the proposed values are independent. In particular, we present improvements to the independent Metropolis--Hastings algorithm that significantly decrease the variance of any estimator derived from the MCMC output, for a null computing cost since those improvements are based on a fixed number of target density evaluations. Furthermore, the techniques developed in this paper do not jeopardize the Markovian convergence properties of the algorithm, since they are based on the Rao--Blackwell principles of Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchade and Perron (2005) and Douc and Robert (2010). We illustrate those improvements both on a toy normal example and on a classical probit regression model, but stress the fact that they are applicable in any case where the independent Metropolis-Hastings is applicable.
http://arxiv.org/abs/1010.1595
Non-R-based solutions
• If you want your simulation to just "go faster", you will probably need to abandon R altogether.
• Implement everything in another language like C/C++ or Java and it will (most likely) be orders of magnitude faster.
– Compiled code is in general much faster.
– You can stop having wacky R memory issues.
– You control the level of parallelization (thread vs. process vs. application).
– You can interface more easily with devices like GPUs.
– Pain now will be rewarded later.
BAJA – Bayesian Analysis in JAva
• ST 790 project.
• A pure-Java, general MH-MCMC sampler.
• Adaptive sampling options available.
• Model specification via a directed acyclic graph (DAG).
• Parallel chain execution (multithreaded).
• Stochastic thinning via resolvant.
• Memory efficient (relatively).
• Automatic generation of trace plots, histograms, and ACF plots.
• Export samples to .csv or R.
BAJA Parallel Performance
(Figure: parallel performance for chains of length L = 1,000,000.)

Speed comparison with several platforms
(Figure: execution-time comparison across several platforms.)
Thank you!