Lrz kurs: big data analysis
Uploaded by: ferdinand-jamitzky
![Page 4: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/4.jpg)
Contents
1. A short introduction to big data
2. Parallel programming is hard
3. Hardware @LRZ
4. Functional Programming
5. Available packages for R
6. Parallel Programming Tools
7. SMP Programming
8. Cluster Programming
9. Job Scheduler
10. Calling external binary code
![Page 5: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/5.jpg)
big data
a short introduction
![Page 6: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/6.jpg)
What is Big Data?
In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools (from Wikipedia)
● Buzz Word
● High dimensional data
● Memory intensive data and/or algorithms
![Page 7: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/7.jpg)
Who does Big Data?
● Bioinformatics
● Genomics and other "Omics"
● Astronomy
● Meteorology
● Environmental Research
● Multiscale physics simulations
● Economic and financial simulations
● Social Networks
● Text Mining
● Large Hadron Collider
![Page 8: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/8.jpg)
Hardware for Big Data
● Large Arrays of Harddisks
● Solid State Disks as temp storage
● Large RAM
● Manycore
● Multicore
● Accelerators
● Tape Archives
![Page 9: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/9.jpg)
Software Middleware for Big Data
● MapReduce
● Distributed File Systems
● Parallel File Systems
● Distributed Databases
● Task Queues
● Memory Attached Files
![Page 10: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/10.jpg)
Supercomputer for Big Data
(Flash) Gordon: Data-Intensive Supercomputing at the San Diego Supercomputer Center
● 1,024 dual-socket Intel Sandy Bridge nodes,
each with 64 GB DDR3 1333 memory
● Over 300 TB of high performance Intel flash
memory SSDs via 64 dual-socket Intel
Westmere I/O nodes
● Large memory supernodes capable of
presenting over 2 TB of cache coherent
memory
● Dual rail QDR InfiniBand network
http://www.sdsc.edu/supercomputing/gordon/
![Page 11: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/11.jpg)
SuperMUC as Big Data System
SuperMUC
● 9,216 dual-socket Intel Sandy Bridge nodes,
each with 32 GB DDR3 1333 memory
● Parallel File System GPFS
● FDR10 InfiniBand network
● Bandwidth to GPFS 200 GByte/s
● No Flash :-(
![Page 12: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/12.jpg)
parallel programming is hard
![Page 13: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/13.jpg)
Why parallel programming?
End of the free lunch
Moore's law means
no longer faster
processors, only more
of them. But beware!
2 x 3 GHz < 6 GHz
(cache consistency,
multi-threading, etc)
![Page 14: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/14.jpg)
The future is parallel
● Moore's law is still valid
● Number of transistors doubles every 2 years
● Clock speed saturates at 3 to 4 GHz
● Multi-core processors vs many-core processors
● Grid/cloud computing
● Clusters
● GPGPUs
(Intel 2000)
![Page 15: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/15.jpg)
The future is massively parallel
Connection Machine
CM-1 (1983)
12-D Hypercube
65536 1-bit cores
(AND, OR, NOT)
Rmax: 20 GFLOP/s
![Page 16: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/16.jpg)
The future is massively parallel
JUGENE
Blue Gene/P (2007)
3-D Torus or Tree
65536 64-bit cores
(PowerPC 450)
Rmax: 222 TFLOP/s
now: 1 PFLOP/s
294912 cores
![Page 17: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/17.jpg)
Supercomputer: SMP
SMP Machine:
shared memory
typically 10s of cores
threaded programs
bus interconnect
in R:
library(multicore)
and inlined code
Example: gvs1
128 GB RAM
16 cores
Example: uv3.cos.lrz.de
2000 GB RAM
1120 cores
![Page 18: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/18.jpg)
Supercomputer: MPI
Cluster of machines:
distributed memory
typically 100s of cores
message passing interface
infiniband interconnect
in R:
library(Rmpi)
and inlined code
Example: coolMUC
4700 GB RAM
2030 cores
Example: superMUC
320,000 GB RAM
160,000 cores
![Page 19: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/19.jpg)
Levels of Parallelism
● Node Level (e.g. SuperMUC has approx. 10000 nodes)
  each node has 2 sockets
● Socket Level
  each socket contains 8 cores
● Core Level
  each core has 16 vector registers
● Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
● Pipeline Level (how many simultaneous pipelines)
  hyperthreading
● Instruction Level (instructions per cycle)
  out of order execution, branch prediction
![Page 20: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/20.jpg)
Problems: Access Times
Getting data from:
  CPU register           1 ns
  L2 cache               10 ns
  memory                 80 ns
  network (IB)           200 ns
  GPU (PCIe)             50,000 ns
  hard disk              500,000 ns
Getting some food from:
  fridge                 10 s
  microwave              100 s ~ 2 min
  pizza service          800 s ~ 15 min
  city mall              2000 s ~ 0.5 h
  mum sends cake         500,000 s ~ 1 week
  grown in own garden    5 Ms ~ 2 months
![Page 21: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/21.jpg)
Computing MFlop/s
mflops.internal <- function(np) {
a=matrix(runif(np**2),np,np)
b=matrix(runif(np**2),np,np)
nflops=np**2*(2*np-1)
time=system.time(a %*% b)[[3]]
nflops/time/1000000}
This function computes a matrix-matrix multiplication using np x np random matrices.
The number of floating point operations is:
● np x np matrix elements
● np multiplications and (np-1) additions
resulting in
np x np x (np+np-1) = np**2*(2*np-1) FLOPS
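A usage sketch for the function above, assuming a few arbitrary matrix sizes (the vector `sizes` and the variable `nflops_2` are illustrative names, not from the course). The measured rates depend strongly on the BLAS library R is linked against, and a very small matrix can finish within the timer resolution, in which case the rate is reported as Inf.

```r
# Function from the slide, repeated so the example is self-contained
mflops.internal <- function(np) {
  a <- matrix(runif(np**2), np, np)
  b <- matrix(runif(np**2), np, np)
  nflops <- np**2 * (2 * np - 1)
  time <- system.time(a %*% b)[[3]]   # elapsed wall-clock time in seconds
  nflops / time / 1000000
}

# Operation count for a 2 x 2 product: 4 elements x (2 multiplications + 1 addition)
nflops_2 <- 2**2 * (2 * 2 - 1)

# Measure the rate for a few matrix sizes (sizes chosen arbitrarily)
sizes <- c(250, 500, 1000)
rates <- sapply(sizes, mflops.internal)
print(data.frame(np = sizes, MFlops = rates))
```

Plotting `rates` against `sizes` typically shows how the rate changes once the matrices no longer fit into cache.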
![Page 22: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/22.jpg)
Amdahl's law
Computing time for N processors
T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor (speedup):
T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!
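The position of the saturation point follows from the speedup formula above. Writing a = Tserial/T(1) and b = Tcomm/T(1) (shorthand introduced here, not on the slide), a short derivation is:

```latex
S(N) = \frac{T(1)}{T(N)} = \frac{N}{1 + aN + bN^2}, \qquad
\frac{dS}{dN} = \frac{1 - bN^2}{(1 + aN + bN^2)^2} = 0
\;\Rightarrow\;
N^{*} = \frac{1}{\sqrt{b}} = \sqrt{\frac{T(1)}{T_{\mathrm{comm}}}}, \qquad
S(N^{*}) = \frac{1}{a + 2\sqrt{b}}
```

For a = 0.01 and b = 0.001 this gives N* of roughly 32 processors and a maximum speedup of about 13.7, so adding processors beyond that point makes the program slower in this model.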
![Page 23: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/23.jpg)
Amdahl's Law II
Acceleration factor for
Tserial/T(1)=0.01
![Page 24: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/24.jpg)
Amdahl's law III
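> N <- 1:100   # assumed range; the slide does not show how N was defined
> plot(N,type="l")                                    # ideal speedup
> lines(N/(1+0.01*N),col="red")                       # with serial part
> lines(N/(1+0.01*N+0.001*N**2),col="green")          # with serial part and communication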
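The green curve has a maximum. A small sketch that locates it numerically, assuming processor counts from 1 to 100 (the function name `speedup` and the variables `N_best` and `N_star` are illustrative); setting the derivative of the speedup to zero gives the continuous optimum 1/sqrt(b), which is used for comparison:

```r
# Speedup model from the slides; a = Tserial/T(1), b = Tcomm/T(1)
speedup <- function(N, a = 0.01, b = 0.001) N / (1 + a * N + b * N^2)

N <- 1:100                      # assumed range of processor counts
s <- speedup(N)
N_best <- N[which.max(s)]       # integer N with the largest modelled speedup
N_star <- 1 / sqrt(0.001)       # continuous optimum, 1/sqrt(b)

cat("best integer N:", N_best, " continuous optimum:", round(N_star, 1), "\n")
cat("maximum speedup:", round(max(s), 2), "\n")
```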
![Page 25: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/25.jpg)
R on the HLRB-II
Strong scaling holds for up to 120 cores;
beyond that, the computing time is too low.
![Page 26: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/26.jpg)
Leibniz Supercomputing Centre
Hardware @ LRZ
![Page 27: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/27.jpg)
The Leibniz Supercomputing Centre is…
● Computer Centre (~175 employees) for all Munich Universities with
  o more than 80,000 students and
  o more than 26,000 employees
  o including 8,500 scientists
● Regional Computer Centre for all Bavarian Universities
  o Capacity computing
  o Special equipment
  o Backup and Archiving Centre (10 petabyte, more than 6 billion files)
  o Distributed File Systems
  o Competence centre (Networks, HPC, IT Management)
● National Supercomputing Centre
  o Gauss Centre for Supercomputing
  o Integrated in European HPC and Grid projects
![Page 28: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/28.jpg)
Hardware @ LRZ
http://www.lrz.de/services/compute/linux-cluster/overview/
The LRZ Linux Cluster:
Heterogeneous Cluster of Intel-compatible systems
● lx64ia, lx64ia2, lx64ia3 (login nodes)
● gvs1, gvs2, gvs3, gvs4 (remote visualisation nodes, 8 GPUs)
● uv2, uv3 (SMP nodes, 1,040 cores)
● ice1-login (cluster)
● lxa1 (coolMUC, MPP cluster)
The SuperMUC
● superMIG (migration system and fat island, 8,200 cores)
● superMUC (cluster of thin islands, 147,456 cores available in Sept 2012)
![Page 29: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/29.jpg)
SuperMUC and Linux Cluster: Hardware@LRZ (new Sept 2012)
[Diagram; legend: ia64, x86_64, GPU]
SuperMUC        147456 cores
SuperMIG        8200 cores
CoolMUC         4300 cores
SGI UV          2080 cores
SGI ICE         512 cores
gvs1...4        64 cores
supzero         80 cores
supermuc        16 cores
lx64ia2         8 cores
lx64ia3         8 cores
(the diagram carries two "login" labels)
![Page 30: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/30.jpg)
File space @ LRZ
http://www.lrz.de/services/compute/backup/
$HOME
  25 GB per group, with backup and snapshots
  cd $HOME/.snapshot
$OPT_TMP
  temporary scratch space (beware!)
  High Watermark Deletion: When the filling of the file system exceeds some limit (typically between 80% and 90%), files will be deleted starting with the oldest and largest files until a filling of between 60% and 75% is reached. The precise values may vary.
$PROJECT
  project space (max 1TB), no automatic backup, use dsmc
![Page 31: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/31.jpg)
module system@LRZ
http://www.lrz.de/services/software/utilities/modules/
module avail
module list
module load <name>
e.g. module load matlab
module unload <name>
module show <name>
insert module system into qsub job:
. /etc/profile
or
. /etc/profile.d/modules.sh
![Page 32: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/32.jpg)
What our users do: Usage 2010 by Research Area
![Page 33: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/33.jpg)
Performance per core by Research area
![Page 34: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/34.jpg)
batch system@LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel
simple slurm script:
#!/bin/bash
# the line above is ignored by SGE, but could be used if executed normally
# (Placeholder) name of job
#SBATCH -J myjob
# (Placeholder) e-mail address (don't forget!)
#SBATCH --mail-user=me@my_domain
# maximum run time; this may be increased up to the queue limit
#SBATCH --time=00:05:00
# load the standard environment (module system)
. /etc/profile
# change to working directory
cd mydir
# start executable
./myprog.exe
# print the job id (SLURM sets SLURM_JOB_ID; JOB_ID is the SGE variable)
echo $SLURM_JOB_ID
ls -al
pwd
![Page 35: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/35.jpg)
batch system@LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel
sbatch jobfile.sh        submit job to SLURM
squeue -u <userid>       get status of my job
scancel <jobid>          delete my job
Start interactive shell:
srun --ntasks=32 --partition=uv2_batch xterm
![Page 36: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/36.jpg)
R makes life easier
functional programming matters
![Page 37: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/37.jpg)
How are High-Performance Codes constructed?
● “Traditional” Construction of High-Performance Codes:
  o C/C++/Fortran
  o Libraries
● “Alternative” Construction of High-Performance Codes:
  o Scripting for ‘brains’
  o GPUs/multicore for ‘inner loops’
● Play to the strengths of each programming environment.
● Hybrid programming:
  o use cluster and task parallelism at the same time
  o cluster parallelism: separated memory
  o task parallelism: shared memory
![Page 38: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/38.jpg)
Why scripting?
A scripting language. . .
● is discoverable and interactive.
● has comprehensive built-in functionality.
● manages resources automatically.
● is dynamically typed.
● works well for “gluing” lower-level blocks together.
● examples: tcl/tk, perl, python, ruby, R, MATLAB
![Page 39: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/39.jpg)
Why functional matters...
● for parallel programming:
  o no side effects
  o code as data
● for structured programming:
  o late binding
  o recursion
  o lazy evaluation
  o very high abstraction
![Page 40: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/40.jpg)
R functions
● R can define named and anonymous functions
● Define a (named or anonymous) function:
todB <- function(X) {10*log10(X)}
● Functions can even return (anonymous) functions
● The last value evaluated is the return value
● Variables from the enclosing (defining) environment are visible (lexical scoping)
● All other variables are local unless specified
● Variable number of inputs:
myfunc <- function(...) list(...)
● Argument names and default values:
myfunc <- function(a,b=1,c=a*b) c+1
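The points above can be tried out in a few lines. This is a small sketch; the names `squares`, `make_adder`, `add2` and `collect_args` are made up for illustration.

```r
# Named function from the slide
todB <- function(X) {10*log10(X)}

# Anonymous function, used directly as an argument
squares <- sapply(1:3, function(x) x^2)

# A function that returns an (anonymous) function
make_adder <- function(n) function(x) x + n
add2 <- make_adder(2)

# Variable number of inputs
collect_args <- function(...) list(...)

# Default values may refer to other arguments; they are evaluated lazily
myfunc <- function(a, b = 1, c = a * b) c + 1

print(todB(100))                            # 10*log10(100)
print(squares)
print(add2(3))                              # n = 2 is remembered by the closure
print(length(collect_args(1, "a", TRUE)))   # three arguments collected
print(myfunc(2))                            # c defaults to 2*1
print(myfunc(2, 3))                         # c defaults to 2*3
```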
![Page 41: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/41.jpg)
Available packages for R
![Page 42: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/42.jpg)
How to use multiple cores with R
● R provides modularization
● R provides high level abstractions
● R provides mixing of programming paradigms
● R provides dynamic libraries
● R provides vector expressions
Use It!
You can write multi-machine, multi-core, GPGPU accelerated, client-server based, web-enabled applications using R
![Page 43: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/43.jpg)
Parallel R Packages
● foreach           parallel abstraction
● pnmath/MKL        parallel intrinsic functions
● multicore         SMP programming
● snow              Simple Network of Workers
● Rmpi              Message Passing Interface
● rgpu, gputools    GPGPU programming
● R webservices     client/server webservices
● sqldf             SQL server for R
● rredis            noSQL server for R
● mapReduce         large scale parallelization
![Page 44: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/44.jpg)
Parallel programming with R
● Parallel APIs:
  o SMP - multicore
  o MPP/MPI - mpi
  o ssh/sockets - snow
● Abstraction:
  o foreach package
    doMC
    doMPI
    doSNOW
    doREDIS
Example:
library(doMC)
registerDoMC(cores=5)
foreach(i=1:10) %dopar% sqrt(i)
roots <- foreach(i=1:10) %dopar% sqrt(i)
![Page 45: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/45.jpg)
SMP programming
![Page 46: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/46.jpg)
library(multicore)
● send tasks into the background with parallel
● wait for completion and gather results with collect
library(multicore)
# spawn two tasks
p1 <- parallel(sum(runif(10000000)))
p2 <- parallel(sum(runif(10000000)))
# gather results blocking
collect(list(p1,p2))
# gather results non-blocking
collect(list(p1,p2),wait=F)
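On current R installations the same pattern is available in the `parallel` package that ships with R, where the two functions are called `mcparallel()` and `mccollect()`. A minimal sketch, assuming a Unix-like system (the tasks are forked child processes) and a smaller vector length than on the slide:

```r
library(parallel)

# spawn two background tasks (forked child processes)
p1 <- mcparallel(sum(runif(1000000)))
p2 <- mcparallel(sum(runif(1000000)))

# block until both tasks have finished and gather the results
res <- mccollect(list(p1, p2))
str(res)
```

The result is a list with one entry per task; each sum should be close to 500000, the expected value for one million uniform random numbers.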
![Page 47: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/47.jpg)
library(multicore)
● Extension of the apply function family in R
● mclapply takes a function as argument (a higher-order function, or "functional")
● utilizes SMP:
library(multicore)
doit <- function(x,np)sum(sort(runif(np)))
# single call
system.time( doit(0,10000000) )
# serial loop
system.time( lapply(1:16,doit,10000000))
# parallel loop
system.time( mclapply(1:16,doit,10000000,mc.cores=4 ))
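A deterministic variant makes it easy to check that the parallel loop returns the same values as the serial one. This sketch uses `mclapply()` from the `parallel` package shipped with R and assumes a Unix-like system; `slow_square` is an illustrative stand-in for real work.

```r
library(parallel)

# deterministic toy task so serial and parallel results can be compared
slow_square <- function(x) { Sys.sleep(0.1); x^2 }

t_serial   <- system.time(r_serial   <- lapply(1:8, slow_square))[["elapsed"]]
t_parallel <- system.time(r_parallel <- mclapply(1:8, slow_square, mc.cores = 2))[["elapsed"]]

print(identical(r_serial, r_parallel))       # same values, same order
print(c(serial = t_serial, parallel = t_parallel))
```

The results come back in the order of the input, regardless of which core finished first.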
![Page 48: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/48.jpg)
doMC
# R
> library(foreach)
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
  9.352   2.652  12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  7.228   7.216   3.296
![Page 49: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/49.jpg)
multithreading with R
# serial execution
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# thread execution
library(foreach)
library(doMC)
registerDoMC()
foreach(i=1:N) %dopar%
{
mmult.f()
}
![Page 50: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/50.jpg)
Cluster Programming
![Page 51: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/51.jpg)
doSNOW
# R
> library(doSNOW)
> registerDoSNOW(makeSOCKcluster(4))
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
 15.377   0.928  16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  4.864   0.000   4.865
![Page 52: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/52.jpg)
SNOW with R
# serial execution
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# cluster execution
library(foreach)
library(doSNOW)
cl <- makeSOCKcluster(4)
registerDoSNOW(cl)
foreach(i=1:N) %dopar%
{
mmult.f()
}
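The same cluster idea can be tried without extra packages, since the `parallel` package shipped with R includes the socket-cluster functions from snow. A minimal sketch, assuming two local worker processes; the argument `offset` is illustrative and shows that data must be sent to the workers explicitly because they do not share memory:

```r
library(parallel)

cl <- makeCluster(2)                  # two local worker processes (socket cluster)

# extra arguments are serialized and sent to the workers together with the function
res <- parLapply(cl, 1:6, function(i, offset) sqrt(i) + offset, offset = 10)

stopCluster(cl)                       # always shut the workers down

print(unlist(res))
```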
![Page 53: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/53.jpg)
Job Scheduler
![Page 54: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/54.jpg)
noSQL databases
Redis is an open source, advanced key-value store. It is often referred
to as a data structure server since keys can contain strings, hashes,
lists, sets and sorted sets.
http://www.redis.io
Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl
![Page 55: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/55.jpg)
doRedis / workers
start redis worker (from the shell):
$ echo "require('doRedis');redisWorker('jobs')" | R --no-save
The workers can be distributed over the internet
> startRedisWorkers(100)
![Page 56: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/56.jpg)
doRedis
# R
> library(doRedis)
> registerDoRedis("jobs")
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
![Page 57: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/57.jpg)
doMC
# R
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
9.352 2.652 12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
7.228 7.216 3.296
![Page 58: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/58.jpg)
doSNOW
# R
> library(doSNOW)
> cl <- makeSOCKcluster(4)
> registerDoSNOW(cl)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
![Page 59: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/59.jpg)
redis and R: rredis, doREDIS
redisConnect()            # connect to redis store
redisSet('x',runif(5))    # store a value
redisGet('x')             # retrieve value from store
redisClose()              # close connection
redisAuth(pwd)            # simple authentication
redisConnect()
redisLPush('x',1)         # push numbers into list
redisLPush('x',2)
redisLPush('x',3)
redisLRange('x',0,2)      # retrieve list
![Page 60: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/60.jpg)
Calling external binary code
![Page 61: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/61.jpg)
One R to rule them all
● C/C++/objectiveC
● Fortran
● Java
● MPI
● Threads
● OpenGL
● ssh
● web server/client
● linux mac mswin
● R shell
● R gui
● math notebook
● automatic LaTeX/pdf
● vtk
![Page 62: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/62.jpg)
One R to bind them
● C/C++/objectiveC    .C("funcname", args...)
● Fortran             .Fortran("test", args...)
● java                .jcall("class", args...)
● R objects           .Call
● R objects           .External
![Page 63: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/63.jpg)
Use R as scripting language
R can dynamically load shared objects:
dyn.load("lib.so")
these functions can then be called via
.C("fname", args)
.Fortran("fname", args)
![Page 64: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/64.jpg)
C integration
● shared object libraries can be used in R out of the box
● R arrays are mapped to C pointers
  R            C
  integer      int*
  numeric      double*
  character    char**
Example:
R CMD SHLIB -o test.so test.c
use in R:
> dyn.load("test.so")
> .C("test", args)
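A complete round trip can be sketched from within R. The example below writes a tiny C function to a temporary directory, compiles it with R CMD SHLIB, loads it and calls it through `.C`. The file name `vecscale.c` and the function `vecscale` are made up for this illustration, and a working C compiler is assumed.

```r
# Work in a temporary directory so no files are left behind in the project
owd <- setwd(tempdir())

# A small C function: all arguments arrive as pointers, results are written in place
writeLines(c(
  "void vecscale(double *x, int *n, double *factor) {",
  "    int i;",
  "    for (i = 0; i < *n; i++) x[i] *= *factor;",
  "}"), "vecscale.c")

# Compile to a shared object, as on the slide
so <- paste0("vecscale", .Platform$dynlib.ext)
system2(file.path(R.home("bin"), "R"), c("CMD", "SHLIB", "-o", so, "vecscale.c"))

# Load the library and call the function; .C returns a list of the (modified) arguments
so_path <- file.path(getwd(), so)
dyn.load(so_path)
out <- .C("vecscale", x = as.double(1:5), n = as.integer(5), factor = as.double(2))
dyn.unload(so_path)
setwd(owd)

print(out$x)
```

The explicit `as.double()` and `as.integer()` conversions matter: `.C` passes the memory of the R vector as is, so a mismatch between the R type and the C pointer type gives wrong results rather than an error.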
![Page 65: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/65.jpg)
Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
print*, " total energy: ",sum(x**2+v**2)
end program
![Page 66: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/66.jpg)
Fortran Compiler
use Intel fortran compiler
$ ifort -o myprog.exe myprog.f90
$ time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: -fast, -O3)
![Page 67: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/67.jpg)
R subroutine
subroutine mysub(x,v,nstep)
! simulate harmonic oscillator
integer, parameter :: np=1000000
real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
integer :: i,j, nstep
forall(i=1:np) x(i)=real(i)/np
forall(i=1:np) v(i)=real(i)/np
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
return
end subroutine
![Page 68: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/68.jpg)
Matrix Multipl. in FORTRAN
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j, k
do k=1, np
forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
end do
return
end subroutine
![Page 69: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/69.jpg)
Call FORTRAN from R
# compile f90 to shared object library
system("ifort -shared -fPIC -o mmult.so mmult.f90");
# dynamically load library
dyn.load("mmult.so")
# define multiplication function
mmult.f <- function(a,b,c)
.Fortran("mmult",a=a,b=b,c=c,np=as.integer(dim(a)[1]))
![Page 70: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/70.jpg)
Call FORTRAN binary
np=100
system.time(
mmult.f(
a = matrix(numeric(np*np),np,np),
b = matrix(numeric(np*np)+1.,np,np),
c = matrix(numeric(np*np)+1.,np,np)
)
)
Exercise: make a plot system-time vs matrix-dimension
![Page 71: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/71.jpg)
Big Memory
[Diagram: four logical setups of a node]
● without shared memory: each R process has its own MEM
● with shared memory: the R processes share one MEM
● with file-backed memory: the R processes share one MEM that is backed by a disk
● with network attached file-backed memory: the shared MEM is backed by a disk reached over the network
![Page 72: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/72.jpg)
library(bigmemory)
● shared memory regions for several processes in SMP
● file backed arrays for several nodes over network file systems
library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
sum(x[1,1:1000])
![Page 73: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/73.jpg)
Part II
Applications
![Page 74: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/74.jpg)
Potential Problems on Big Data Sets
1. many small tasks have to be performed for
each of many thousands of variables (long
run time)
2. analysis/ processing needs more main
memory than available
3. several R processes on a node need to
process the same big data set and each
process creates its own big R-object
4. data set cannot be loaded into R because
the R-object representing it would be too big
for the main memory available (worst case)
![Page 75: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/75.jpg)
Approaches for Big Data Problems
1. C-function (shared library)
2. Accelerators (gpgpu, MICs)
3. SMP parallelisation
4. Cluster parallelisation
5. distributed data
6. in memory data files (arrays as big as
available memory)
7. parallel file systems (file backed arrays, no
size limit)
8. hierarchical and heterogeneous file systems
![Page 76: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/76.jpg)
Problem 1: Example (Microarray Data)
● gene expressions for approximately 20000 genes
● influence of each variable on a survival response shall be tested
Compute a Cox survival model for each variable:
S(t|x) = S_0(t)^{\exp(bx)}
● In R: function coxph() in package survival (a recommended package shipped with R)
● even more challenging problem: test all second order interactions (all pairs, 20000 choose 2)
![Page 77: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/77.jpg)
Problem 1: Example (Microarray Data)
First approach: for-loop in R using function coxph() [which actually calls a C-function using dyn.load to compute the Cox-Model]:
library(survHD)
data(beer.survival)
data(beer.exprs)
set.seed(123)
X<-t(as.matrix(beer.exprs))
y<-Surv(beer.survival[,2],beer.survival[,1])
coefs<-c()
system.time(
for(j in 1:ncol(X)){
fit <- coxph( y ~ X[,j])
coefs<-rbind(coefs,summary(fit)$coefficients[ 1 , c(1, 3, 5) ])})
   User  System elapsed
 34.635   0.002  34.686
Second Approach: using apply
system.time(output <- apply(t(X),1,function(xrow){
fit <- coxph( y ~ xrow )
summary(fit)$coefficients[ 1 , c(1, 3, 5) ]
}))
   User  System elapsed
 26.531   0.020  26.676
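The survHD package and the beer data set may not be installed everywhere. The following sketch reproduces the per-variable pattern on synthetic data, using only the `survival` package; the sizes `n` and `p`, the random data and the helper name `fit_one` are assumptions made for illustration.

```r
library(survival)

set.seed(123)
n <- 80; p <- 25                               # synthetic sizes, not the beer data
X <- matrix(rnorm(n * p), n, p)                # rows: samples, columns: variables
y <- Surv(rexp(n), rbinom(n, 1, 0.7))          # random survival times and event status

# one univariate Cox model per column; keep coef, se(coef) and p-value
fit_one <- function(xcol) {
  fit <- coxph(y ~ xcol)
  summary(fit)$coefficients[1, c(1, 3, 5)]
}
coefs <- t(apply(X, 2, fit_one))
print(head(coefs))
```

Each row of `coefs` belongs to one variable. Because every call to coxph() repeats the same formula parsing and object construction, the overhead per variable is large compared to the actual model fit.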
![Page 78: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/78.jpg)
Problem 1: Example (Microarray Data)
Third approach:
● Passing a matrix to C and performing the for-loop inside C
● only coefficients and corresponding p-values are returned for each variable
● function rowCoxTests in R-package survHD
time <- y[,1]
status <- y[,2]
sorted <- order(time)
time <- time[sorted]
status <- status[sorted]
X <- X[sorted,]
##compute columnwise coxmodels
#dynload not necessary, because 'coxmat.so' is integrated into survHD
system.time(out<-.C('coxmat',regmat=as.double(X),ncolmat=as.integer(ncol(X)),nrowmat=as.integer(nrow(X)),reg=as.double(X[,1]),zscores=as.double(numeric(ncol(X))),coefs=as.double(numeric(ncol(X))),maxiter=as.integer(20),...))
   User  System elapsed
  0.229   0.000   0.229
max(abs(out$coefs-coefs[,1]))
[1] 1.004459e-07
● performing computations in C/Fortran, i.e. optimizing sequential code, often yields significant speed-up
● in principle difficult to program and quite error prone
● C-functions for single variables are usually available and wrappers are usually easy to program
![Page 79: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/79.jpg)
Comparison to parallel programming:
Parallelization of for-loop using snow:
#create cluster
library(snow)
cl<-makeSOCKcluster(10)
#broadcast X
Z<-X
clusterExport(cl=cl,list=list('Z'))
#function to be applied in parallel
parcoxph<-function(ind,y){
require(survHD)
zcol<-Z[,ind]
fit<-coxph( y ~ zcol )
summary(fit)$coefficients[ 1 , c(1, 3, 5) ]}
#run function on 10 cores
system.time(result <- parLapply(cl=cl,x=1:ncol(Z),fun=parcoxph,y=y))
   User  System elapsed
  0.031   0.003   3.474
● parallelization of very small and short tasks is usually not efficient
● possible improvement: rewrite the code so that batches of tests are performed per task
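The batching idea can be sketched with the `parallel` package shipped with R. The example compares sending every task separately with sending one batch of indices per worker; `small_task` is a stand-in for one cheap test, and the names `run_chunk`, `all_tasks` and `task_fun` are illustrative.

```r
library(parallel)

cl <- makeCluster(2)                         # two local workers; adjust as needed
tasks <- 1:200                               # stand-in for column indices
small_task <- function(i) i^2                # stand-in for one cheap test

# one task per message: every element is sent to a worker separately
res_single <- clusterApply(cl, tasks, small_task)

# batches: split the indices into one chunk per worker and loop inside each worker
chunks <- splitIndices(length(tasks), length(cl))
run_chunk <- function(idx, all_tasks, task_fun) lapply(all_tasks[idx], task_fun)
res_batched <- clusterApply(cl, chunks, run_chunk,
                            all_tasks = tasks, task_fun = small_task)
res_batched <- unlist(res_batched, recursive = FALSE)

stopCluster(cl)
print(identical(res_single, res_batched))    # same results, far fewer messages
```

With batches, the communication cost is paid once per worker instead of once per task, which matters most when each individual task is very short.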
![Page 80: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/80.jpg)
Combining both approaches:
For really big data sets (>100000 variables) one can combine both approaches:
X2<-X
# note: as written, the loop body runs only once (with i = 30), so X2 holds two copies of X
for(i in 30){
X2<-cbind(X2,X)}
colnames(X2) <- 1:ncol(X2)
system.time(tt<-rowCoxTests(t(X2),y,option='fast'))
   User  System elapsed
  0.593   0.010   0.606
system.time(rowCoxTests(t(X),y,option='fast'))
   User  System elapsed
  0.303   0.000   0.303
##using snow
#create cluster
library(snow)
cl<-makeSOCKcluster(10)
#function to be applied in parallel
parfun<-function(ind,Z,y){
require(survHD)
rowCoxTests(X=t(Z),y=y,option='fast')}
#run function on 10 cores
system.time(result<-parLapply(cl=cl,x=1:30,fun=parfun,Z=X,y=y))
   User  System elapsed
  1.825   0.291   7.215
X2<-cbind(X,X,X)
system.time(result<-parLapply(cl=cl,x=1:10,fun=parfun,Z=X,y=y))
   User  System elapsed
  2.255   0.206   3.436
![Page 81: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/81.jpg)
Combining both approaches: Exercise
In the current example, however, parallel computing is less effective anyway
Exercise:
1. Create a large data set by concatenating the gene-expression matrix 20 times
(use cbind)
2. apply the function rowCoxTests() and measure the runtime.
3. use snow in order to send the expression matrix to 20 cores and let each core
perform rowCoxTests() on its own matrix.
4. Measure the runtime.
![Page 82: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/82.jpg)
Problem 2: Example
Normalization of Gene-Expression-Microarrays:
● approximately 500k measurements per array
● background correction has to be performed
● ca. 50 measurements have to be summarized to a single value representing one gene expression
(summarization step)
● R functions: rma() in Bioconductor package affy, or vsn() in Bioconductor package vsn
● high memory requirements as soon as number of observations exceeds 100 arrays (>10GB RAM)
Distributed Data Approach (Bioconductor Package affyPara)
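The distributed data approach can be illustrated with a small sketch: every worker reads and corrects only its own share of the files, so no single process holds all arrays in memory. Everything here is an assumption for illustration: the `parallel` package replaces snow/affyPara, small CSV files stand in for cel files, and `correct_files` with its crude quantile-based subtraction is a hypothetical stand-in for a real background correction.

```r
library(parallel)  # ships with R; stand-in for snow/affyPara (assumption)

# toy stand-in for cel files: 6 small CSV files in a temporary directory
dir <- tempfile("arrays")
dir.create(dir)
set.seed(42)
files <- file.path(dir, sprintf("array%02d.csv", 1:6))
for (f in files) {
  write.csv(data.frame(intensity = rexp(1000, 1 / 200)), f, row.names = FALSE)
}

# each worker reads only the files it is given and corrects them locally
correct_files <- function(fs) {
  vapply(fs, function(f) {
    x <- utils::read.csv(f)$intensity
    bg <- stats::quantile(x, 0.05)        # hypothetical background estimate
    mean(pmax(x - bg, 0))                 # summary of the corrected intensities
  }, numeric(1))
}

cl <- makeCluster(2)
parts <- clusterSplit(cl, files)          # one block of file names per worker
res <- unlist(clusterApply(cl, parts, correct_files))
stopCluster(cl)
```

Only file names travel to the workers and only small summaries travel back, which is what keeps the per-process memory footprint low.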
![Page 83: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/83.jpg)
Problem 2: Example
source: Markus Schmidberger (): Parallel Computing for Biological Data, Dissertation
Distributed Data Approach for background correction
![Page 84: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/84.jpg)
AffyPara: Code Example
#load packages and initialize snow-cluster (for affyPara)
library(snow) #parallelization
library(affyPara) #parallel preprocessing
library(affy) #for reading in affy batches
ncpusaffy<-7 #number of cpus
cl<-makeSOCKcluster(ncpusaffy) #create cluster
#reading AffyBatch from cel-files
setwd('~/dataCEL/wang05/cel') #directory containing cel files
aboall<-ReadAffy() #reading
#create subcluster of length ncores
ncores<-7
cll<-cl[1:ncores]
#perform preprocessing using subcluster cll
res<-system.time(arrs.out<-preproPara(aboall,bgcorrect=T,bgcorrect.method='rma',
normalize=T,normalize.method='quantiles',pmcorrect.method='pmonly',
summary.method='avgdiff',cluster=cll))
###stop cluster/ finalize MPI
stopCluster(cl)
● single core: RAM > 6GB
● 7 cores: ca. 1.5GB/core, minor speedup
![Page 85: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/85.jpg)
Problem 2: Exercise
Exercise for you:
1. Perform a microarray background correction using serial code (ReadAffy() ,bg.correct() in package
affy)
2. use top to observe the memory consumption of the process.
3. Additionally, measure its runtime.
4. Perform the background correction as a distributed data approach using snow
(you can pass a character vector of filenames to ReadAffy() in order to load specific cel-files)
5. Compare memory consumption and runtime to the sequential code
![Page 86: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/86.jpg)
Problem 3/4: Data set too large for RAM
● R cannot handle data indices which are larger than 2 Billion (16GB double, 4GB in Windows XP)
● modern biological data can have several dozen GB (e.g. Next Generation Sequencing)
● If the R object representing the data set grows larger than the available RAM, R stops with an
error such as "cannot allocate vector of size xx".
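The 16GB figure for doubles follows directly from the index limit. A quick check in R, assuming only that a double occupies 8 bytes:

```r
max_index <- .Machine$integer.max        # largest integer index: 2^31 - 1
bytes_per_double <- 8

# size of a double vector that uses every available index, in GiB
gib <- max_index * bytes_per_double / 2^30
print(gib)
```

The result is just under 16, i.e. a double vector that exhausts the index range needs about 16GB of memory.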
Possible solution: R package bigmemory (based on C++-libraries for big data objects)
2 areas of usage:
● if several processes operate on the same big matrix
● file-backed-matrices if data sets are larger than available main memory
and the combination of both situations
![Page 87: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/87.jpg)
R-Package bigmemory
Essential functions:
● big.matrix(): creates a big matrix (useful if RAM is large enough but several processes have to
access the matrix)
● filebacked.big.matrix(): creates a file-backed matrix (necessary if main memory is too small)
● describe(): creates a descriptor for an existing (file-backed) big.matrix object
● bigmatrix[i1,i2]: big.matrix objects can be handled in R code like normal matrix objects, i.e. their
elements can be accessed using brackets
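The idea behind a file-backed matrix can be illustrated without bigmemory. The sketch below is only a stand-in under stated assumptions, not how bigmemory is implemented: it stores a matrix column by column in a binary file and reads back a single column with `seek()` and `readBin()`, so the full matrix never has to be in memory when reading. `read_column` is a hypothetical helper.

```r
# write a 1000 x 20 double matrix to a binary file, column by column
path <- tempfile(fileext = ".bin")
nr <- 1000L
nc <- 20L
set.seed(7)
M <- matrix(rnorm(nr * nc), nr, nc)
writeBin(as.vector(M), path, size = 8)   # R matrices are stored column-major

# read column j directly from the file without loading the other columns
read_column <- function(path, j, nr) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = (j - 1) * nr * 8)    # byte offset of the first element of column j
  readBin(con, what = "double", n = nr, size = 8)
}

col7 <- read_column(path, 7, nr)
```

bigmemory hides this bookkeeping behind the bracket syntax and additionally lets several processes attach to the same backing file through the descriptor.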
![Page 88: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/88.jpg)
bigmemory: code example
###write
library(CMA) #provides the golub data set
data(golub)
library(bigmemory)
setwd('~/tmp/bigmem')
X<-as.matrix(golub[,-1])
#create filebacked.bigmatrix and write data into its elements
z<-filebacked.big.matrix(nrow=30*5000,ncol=ncol(X),type='double',
backingfile="magolub.bin",descriptorfile="magolub.desc")
k<-0
for(i in 1:5000){
inds<-sample(1:nrow(X),30)
z[(1:30)+(k*30),]<-X[inds,]
k<-k+1}
#create and save descriptorfile for later usage
desc<-describe(z)
save(desc,file='desc_z.RData')
![Page 89: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/89.jpg)
bigmemory: code example
###read
library(bigmemory)
setwd('~/tmp/bigmem') #same directory as used when writing
#load descriptorfile
load('desc_z.RData')
#attach bigmatrix object using the descriptor file
y<-attach.big.matrix(desc)
#access elements
y[1:10,7]
#read element 7 in the 5th row
b<-y[5,7]
#compute sum of a submatrix
(sum1<-sum(y[1:10,5:20]))
![Page 90: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/90.jpg)
bigmemory: exercise
Exercise for you:
1. create a bigmatrix object using big.matrix()
2. create a descriptor and save it
3. start another R-session on the same node
4. load the descriptor file and attach the bigmatrix
5. use the bigmatrix object for communication between both R processes
![Page 91: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/91.jpg)
Gaining Flexibility: doRedis
● separates job administration and execution
● subtasks are stored in a redis database
o master process sends subtasks of a computation to the server
o workers can connect and request the tasks
o all necessary R objects are stored in the redis server, too
● necessary software:
o R-packages: rredis, doRedis
o database: redis-server (debian-package)
![Page 92: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/92.jpg)
doRedis: essential functionality
● Master process:
o registerDoRedis(jobqueue,host): connects to the redis-server at 'host' and specifies a jobqueue
for the tasks to come
o foreach(j=1:n) %dopar% {FUN(j)}: sends subtasks to the redis database
o redisFlushAll(): clears the database
o removeQueue(): removes a queue from the database
● Worker process:
o registerDoRedis(jobqueue,host): registers a jobqueue whose tasks shall be processed
o startLocalWorkers(n,jobqueue,host): starts n local worker processes which process the tasks
specified in jobqueue (uses multicore)
o redisWorker(jobqueue,host): useful in MPI environments
Users usually do not need to read or set database values directly.
Parallelization works as known from other "Do-packages".
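The pull-based scheduling described above can be mimicked in a single R process. This toy model involves no redis server and none of the doRedis functions; all names (`queue`, `pull_task`, the worker labels) are hypothetical. It only shows why workers that join late are still useful: tasks stay in the queue until some worker asks for them.

```r
# single-process toy model of a pull-based job queue (no redis involved)
queue   <- as.list(1:10)           # tasks submitted by the master
results <- list()

pull_task <- function() {          # a worker asks the queue for the next task
  if (length(queue) == 0) return(NULL)
  task <- queue[[1]]
  queue <<- queue[-1]
  task
}

workers <- c("w1", "w2")
step <- 0
while (length(queue) > 0) {
  step <- step + 1
  if (step == 3) workers <- c(workers, "w3")   # a third worker connects later
  for (w in workers) {
    task <- pull_task()
    if (is.null(task)) break
    results[[length(results) + 1]] <- list(worker = w, task = task, value = task^2)
  }
}

done_by <- vapply(results, function(r) r$worker, character(1))
print(table(done_by))              # how many tasks each worker processed
```

In the real setup the queue lives in the redis server and the workers are separate R processes, possibly on other machines.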
![Page 93: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/93.jpg)
Worker processes can run on any R-compatible hardware and can connect at any time
[Diagram: the master (doRedis) sends jobs and objects to the redis-server; the redis-server distributes jobs and objects to workers 1a...1z, 2a...2z, 3a...3z and 4a...4z on NODE 1 to NODE 4 and eventually returns the results]
● robust
● flexible
● dynamic
![Page 94: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/94.jpg)
doRedis: code example
Master (sends subtasks to the redis-server and waits for results):
#redis-server ~/redis/redis-2.2.14/redis.conf (in linux shell, starts the redis-server)
#cross-validation of classification on microarray data
library(CMA)
X <- as.matrix(golub[,-1])
y <- golub[,1]
ls <- GenerateLearningsets(y=y,method='CV',
fold=10,niter=10000)
#function to be applied on each node
cl2 <- function(j){
require(CMA)
ttt<-system.time(cl<-svmCMA(y=y,X=X,learnind=ls@learnmatrix[j,],cost=10))
list(cl,ttt,Sys.info())}
#connect to redis-server, send subtasks and wait for results
library(doRedis)
redisFlushAll()
registerDoRedis('jobscmanew')
numtodo<-nrow(ls@learnmatrix)
lll3<-foreach(j=1:numtodo) %dopar% {cl2(j)}
![Page 95: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/95.jpg)
doRedis: code example
Worker processes (connect to server, receive subtasks and objects, return results):
###using multicore (just two lines)
#register jobqueue from redis-server
registerDoRedis('jobscmanew',host='bernau1.ibe.med.uni-muenchen.de')
#start 10 local workers
startLocalWorkers(n=10, queue='jobscmanew')
###using MPI
#function to be run by each mpi-process
startdr<-function(ll){
library(doRedis)
redisWorker('jobscmanew',host='bernau1.ibe.med.uni-muenchen.de')
}
#start rmpi
library(Rmpi)
numworker<-mpi.universe.size()
mpi.spawn.Rslaves()
#let each mpi-process connect to redis-server and perform subtasks
mpi.apply(1:numworker,startdr)
![Page 96: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/96.jpg)
doRedis: exercise
1. connect to the redis server in R
2. submit a job queue
3. start workers to perform the subtasks
4. set a value for variable xnewinteger (use)
5. request the value of variable xnewinteger (use)
![Page 97: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/97.jpg)
Combining doRedis and bigmemory
● redis and doRedis provide high flexibility for performing independent subtasks
o worker processes can connect at any time
o errors in individual processes do not stop the entire computation (robustness)
o worker processes can run on totally different architectures
o worker processes can run all around the world
● disadvantage: the database can become a bottleneck if large R objects have to be stored/sent
solution: separation of large data objects (bigmemory) and job tasks (redis)
![Page 98: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/98.jpg)
Combining doRedis and bigmemory
Separate task and data channel
![Page 99: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/99.jpg)
doredis/bigmemory: Code Example
worker process:
redisbigreadwrite<-function(procind){
require(CMA)
require(bigmemory)
j<-procind
setwd('~/tmp/bigmemlrz')
load('desc_z.RData') #big data object containing many gene expression sets
load('desc_out.RData') #big data file for misclassification rates
z<-attach.big.matrix(desc)
out<-attach.big.matrix(descout)
load('descresmat.RData')
resmat<-attach.big.matrix(descresmat) #big data object for simulating large writing operation
for(iter in 1:10){
start<-(j-1)*30*10*10+(iter-1)*30*10+1
X<-z[start:(start+299),] #read gene expression matrix
cl<-svmCMA(y=sample(c(1,2),nrow(X),replace=T),X=X,learnind=1:25,cost=10) #construct classifier
out[(j-1)*10+iter]<-mean(abs(cl@y-cl@yhat)) #compute misclassification rate
resmat[start:(start+299),]<-X #write X
}
#flush
flush(resmat);flush(out)}
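The start index in the worker function assigns each task j ten consecutive blocks of 300 rows. A small check of that arithmetic (the helper name `block_start` is hypothetical; the formula is copied from the code above):

```r
# start row of block 'iter' within task 'j', as in the worker function
block_start <- function(j, iter) (j - 1) * 30 * 10 * 10 + (iter - 1) * 30 * 10 + 1

grid   <- expand.grid(iter = 1:10, j = 1:3)   # iter varies fastest
starts <- block_start(grid$j, grid$iter)
ends   <- starts + 299                        # each block covers 300 rows

print(head(cbind(grid, start = starts, end = ends)))
print(max(ends))                              # rows needed for 3 tasks
```

The blocks are contiguous and do not overlap, and every task consumes 3000 rows, so the input big.matrix needs at least 3000 rows per task.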
![Page 100: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/100.jpg)
doredis/bigmemory: Code Example
master process:
###create bigmatrix (gene expressions)
library(bigmemory)
setwd('~/tmp/bigmemlrz')
X<-as.matrix(golub[,-1])
z<-filebacked.big.matrix(nrow=30*1500,ncol=ncol(X),type='double',
backingfile="magolub.bin",descriptorfile="magolub.desc")
for(i in 1:1500){
inds<-sample(1:nrow(X),30)
z[(1:30)+((i-1)*30),]<-X[inds,]} #(i-1): start at row 1 and stay within nrow(z)
#create descriptor file and save it for other processes
desc<-describe(z)
save(desc,file='desc_z.RData')
###doredis part
library(doRedis)
registerDoRedis('rwbigmem')
lll3<-foreach(j=1:1500) %dopar% {redisbigreadwrite(j)}
results are returned in a file-backed object, so the master could quit
![Page 101: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/101.jpg)
doredis/bigmemory: code example
main difference: underlying network and network file system
[Figure panels: IBE (NFS) | LRZ (NAS)]
![Page 102: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/102.jpg)
Comparison to standard MPI-IO approach
Difference: MPI is less flexible
● not robust
● collective open/close calls
[Figure panels: Fortran90 MPI-IO implementation | R bigmemory implementation]
![Page 103: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/103.jpg)
doRedis/bigmemory: Exercise
Exercise:
1. run the previous example using only two doredis-workers which perform only a single task
2. rewrite the previous example such that the proportion of class 1 predictions is returned
3. try to rewrite the previous example such that each worker process reads 10 subdatasets at a time
and then constructs a classifier for each of the ten subdatasets read in
4. create a larger bigmemory matrix of gene expression data (e.g. 1500 matrices of dimension
200x10000) using random numbers and run the previous example using that input 'bigmatrix'
![Page 104: Lrz kurs: big data analysis](https://reader034.fdocuments.in/reader034/viewer/2022042615/55a6950b1a28ab5c148b45fe/html5/thumbnails/104.jpg)
Thanks for your attention.
Further questions?
The End