
Page 1: R on BioHPC – Rstudio, Parallel R and BioconductoR

Updated for 2015-07-15

Page 2: Today we’ll be looking at…

Page 3: Why R?

• The dominant statistics environment in academia

• Large number of packages to do a lot of different analyses

• Excellent uptake in Bioinformatics – specialist packages

• (Relatively) easy to accomplish complex stats work

• Very active development right now: R Foundation, R Consortium, Revolution Analytics, RStudio, Microsoft…

Page 4: Why not R?

• Quirky language – painful for e.g. Python programmers

• Generally thought to be quite slow – except for optimized linear algebra

• Complex ‘old-fashioned’ documentation

• Parallelization packages can be complex / outdated

… but it’s getting better quickly…

Page 5: Exciting Recent Developments in R

Page 6: RStudio – An IDE for R, on the web

http://rstudio.biohpc.swmed.edu

BioHPC optimized R, access to cluster storage, persistent sessions

Page 7: When to use RStudio

• Development work with small datasets

• Creating R Markdown documents

• Working with Shiny for dataset visualizations

• Any small, short-running data analysis tasks

Large datasets, very long-running jobs, or parallel code?

Must use R on the cluster…

Page 8: Using R on the cluster / clients

module load R/3.2.1-intel

Latest version, optimized, same as used by rstudio.biohpc.swmed.edu

Page 9: Installing Packages

We have a set of common packages pre-installed in the R module.

You can install your own into your home directory (~/R)

install.packages(c("microbenchmark", "data.table"))

Some packages need additional libraries and won’t compile successfully – ask us to install them for you ([email protected]).

This is for packages from CRAN – BioconductoR packages install differently (see later!).
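As a quick sanity check that user installs are going to your home directory, you can look at the library search path before installing (a minimal sketch; the exact ~/R sub-directory depends on the R version):

# The user library under ~/R should appear in the search path and be writable
.libPaths()

# Installs into the user library when the module's library is read-only
install.packages("data.table")
library(data.table)   # confirm the package loads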

Page 10: Our R is faster than standard downloads

Compiled using Intel compiler and Intel Math Kernel Library

Task                     Standard R   BioHPC R   Speedup
Matrix Multiplication        139.15       1.80       77x
Cholesky Decomposition        19.53       0.32       61x
SVD                           45.66       1.95       23x
PCA                          201.30       6.25       32x
LDA                          135.37      17.60        7x

This is on a cluster node – speedup is less on clients with fewer CPU cores

For your own Mac or PC see http://www.revolutionanalytics.com/revolution-r-open
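mkl_test.R itself isn't reproduced on the slide; a minimal stand-in (an assumption, not the original script) that exercises the MKL-accelerated matrix multiply looks like this:

# Time a large matrix multiplication, which MKL parallelizes automatically
n <- 4000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
system.time( C <- A %*% B )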

mkl_test.R

Page 11: Benchmarking functions in R (and compiling them)

Compiling a function that is called often can increase speed. The microbenchmark package allows you to benchmark functions.

library(compiler)
f <- function(n, x) for (i in 1:n) x = (1 + sin(x))^(cos(x))
g <- cmpfun(f)

library(microbenchmark)
compare <- microbenchmark(f(1000, 1), g(1000, 1), times = 1000)

library(ggplot2)
autoplot(compare)
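Beyond the plot, the timings can also be inspected numerically; this is standard microbenchmark usage (not shown on the slide):

# Median and quartile timings for the plain vs. compiled function
summary(compare)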

functions.R

Page 12: For speed – always vectorize!

54x speedup!

Compiling the function improved the median time somewhat (< 2x); using the vectorized form was much faster.

distnorm <- function(){
  x <- seq(-5, 5, 0.01)
  y <- rep(NA, length(x))
  for (i in 1:length(x)) {
    y[i] <- stdnorm(x[i])
  }
  return(list(x=x, y=y))
}

vdistnorm <- function(){
  x <- seq(-5, 5, 0.01)
  y <- stdnorm(x)
  return(list(x=x, y=y))
}
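stdnorm() is not defined on this slide – it presumably lives in functions.R. A plausible stand-in, assumed here only so the example runs, is the standard normal density:

# Assumed helper: standard normal probability density (equivalent to dnorm(x))
stdnorm <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)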

functions.R

Page 13: Explicit Parallelization in R

Our optimized R automatically parallelizes linear algebra on a single machine – enough in a lot of cases!

Always prefer using vector/matrix form over for loops and apply functions to get the most out of these optimizations.

If you need more options you can control the parallelization:

library(parallel)     # Single-node and cluster parallelization; apply functions and explicit execution

library(doParallel) # Simple parallel foreach loops

Can run parallel code on a single node (multicore) or across nodes (MPI)

Page 14: Our Example Application

# Define a function that performs a random walk with a
# specified bias that decays
rw2d <- function(n, mu, sigma){
  steps <- matrix(, nrow=n, ncol=2)
  for (i in 1:n){
    steps[i,1] <- rnorm(1, mean=mu, sd=sigma)
    steps[i,2] <- rnorm(1, mean=mu, sd=sigma)
    mu <- mu/2
  }
  return( apply(steps, 2, cumsum) )
}

mc_parallel.R

Page 15: A bigger task…

library(foreach)   # needed for %do%; also loaded automatically by doParallel

# Generate random walks of lengths between 1000 and 5000
# foreach loop
system.time(
  results <- foreach(l=1000:5000) %do% rw2d(l, 3, 1)
)
#    user  system elapsed
#  85.872   0.145  86.242

# Apply
system.time(
  results <- lapply( 1000:5000, rw2d, 3, 1)
)
#    user  system elapsed
#  81.175   0.114  81.511

mc_parallel.R

Page 16: Start a cluster (of R slave workers on a single machine)

Single node, multiple cores running multiple R slaves

# Parallel – single node
library(parallel)
library(doParallel)

# Create a cluster of workers using all cores
cl <- makeCluster( detectCores() )
# Tell foreach with %dopar% to use this cluster
registerDoParallel(cl)

# ... run parallel work here (see the next slides) ...

stopCluster(cl)

mc_parallel.R

Page 17: R parallel vs MKL conflict

Intel MKL tries to use all cores for every linear algebra operation. R is running multiple iterations of a loop in parallel, also using all cores.

If used together, too many threads/processes are launched – far more than there are cores!

export OMP_NUM_THREADS=1            # in the terminal, before running R

Sys.setenv(OMP_NUM_THREADS="1")     # within R

~ 5% improvement by disabling MKL multi-threading

Page 18: This time in parallel!

cl <- makeCluster( detectCores() )
registerDoParallel(cl)
Sys.setenv(OMP_NUM_THREADS="1")

# Generate random walks of lengths between 1000 and 5000
# Parallel foreach loop
system.time(
  results <- foreach(l=1000:5000) %dopar% rw2d(l, 3, 1)
)
#    user  system elapsed
#   2.928   0.441  17.374     (5x speedup)

# Parallel apply
system.time(
  results <- parLapply( cl, 1000:5000, rw2d, 3, 1)
)
#    user  system elapsed
#   0.339   0.171   8.460     (9x speedup)

stopCluster(cl)

mc_parallel.sh

Page 19: MPI parallelization – for really big jobs

MPI is available on R/3.1.2-intel only

We will continue to use the simple parallel and doParallel packages

Lots online about ‘snow’ – this is now behind the scenes in new versions of R

Please join us for coffee to discuss MPI projects using R

Work in progress optimizations with your help

Page 20: MPI parallelization – easy!

cl <- makeCluster( 128, type="MPI" )

Just one change in R code!

The 128 is the number of MPI tasks: cores per node × number of nodes (or fewer if RAM-limited).

48 cores per node for the 256GB partition; 32 cores per node for other partitions.
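Putting the pieces together, here is a minimal sketch of how mpi_parallel.R might be structured (an assumption based on the earlier slides; type="MPI" relies on the snow/Rmpi back-end being available):

library(parallel)
library(doParallel)

# rw2d() from the earlier slide must also be defined in this script

# One R worker per MPI task: 128 = 4 nodes x 32 cores in this example
cl <- makeCluster( 128, type="MPI" )
registerDoParallel(cl)

# Avoid MKL over-subscription inside each worker
Sys.setenv(OMP_NUM_THREADS="1")

results <- parLapply( cl, 1000:10000, rw2d, 3, 1 )

stopCluster(cl)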

mpi_parallel.R

Page 21: MPI parallelization – submitting the job

#!/bin/bash
#SBATCH --job-name R_MPI_TEST
# Number of nodes required to run this job
#SBATCH -N 4
# Distribute n tasks per node
#SBATCH --ntasks-per-node=32
#SBATCH -t 0-2:0:0
#SBATCH -o job_%j.out
#SBATCH -e job_%j.err
#SBATCH --mail-type ALL
#SBATCH --mail-user [email protected]

module load R/3.2.1-intel

ulimit -l unlimited
R --vanilla < mpi_parallel.R

# END OF SCRIPT

No mpirun!

mpi_parallel.sh

Page 22: MPI Performance

# Sequential (with MKL multi-threading)
system.time(
  results <- lapply( 1000:10000, rw2d, 3, 1)
)
#     user  system elapsed
#  329.173   0.610 330.607

# Parallel apply, 4 nodes, 128 MPI tasks
system.time(
  results <- parLapply( cl, 1000:10000, rw2d, 3, 1)
)
#     user  system elapsed
#   18.815   0.951  19.848     (16x speedup)

Page 23: BioconductoR

A comprehensive set of Bioinformatics related packages for R

Software and datasets

Page 24: BioconductoR

Base packages installed, plus some commonly used extras

Install additional packages to home directory:

source("http://bioconductor.org/biocLite.R")biocLite('limma')

Ask [email protected] for packages that fail to compile

Page 25: BioconductoR

Bioconductor workflows are fantastic tutorials

http://www.bioconductor.org/help/workflows/

Page 26: BioconductoR Example

DEMO

RNA-Seq Analysis & UCSC Genome Browser

See bioconductor.Rmd

Page 27: Dallas R Users Group

http://www.meetup.com/Dallas-R-Users-Group/

University of Dallas, Irving, Saturdays (accessible by DART Orange Line)