Seattle useR Group - R + Scala

Shouheng Yi Data Scientist
Seattle useR Group 05/05/2015


R is Hard to Scale

• Architectural Parallelism: most of R's parallelism is done at the CPU level using MPI

• Data Parallelism: data must be fully present in RAM during an R session

• Why?

R, C and Fortran

Debugging Deadlocks - Good Times

Scientists vs. Developers

• Scientists and researchers love R, because most of their computing tasks are iterative/procedural

• Software engineers are less impressed, because they need to develop concurrent, reactive and robust applications

To be exact: Akka + Rserve

Why I Found Scala Useful

• Lives on the JVM (most devs are comfortable with the JVM)

• Great distributed frameworks - Akka, Slick, Spark, etc.

• Syntactic sugar (less typing) -> easier to debug -> rapid development


vec <- 1:100

sum <- 0

for(i in vec){ sum <- sum + i }


val vec = 1 to 100

val sum = (0 /: vec)((a, b) => a + b)
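The `/:` operator in the Scala line above is shorthand for `foldLeft`, and it is deprecated as of Scala 2.13. A minimal sketch of the same sum written with `foldLeft` and with the built-in `sum` follows; the object and method names (`FoldDemo`, `sumWithFold`, `sumBuiltIn`) are made up for illustration and are not from the talk.

```scala
object FoldDemo {
  // Explicit left fold: equivalent to (0 /: vec)((a, b) => a + b)
  def sumWithFold(vec: Seq[Int]): Int = vec.foldLeft(0)((a, b) => a + b)

  // Built-in shortcut available on numeric collections
  def sumBuiltIn(vec: Seq[Int]): Int = vec.sum

  def main(args: Array[String]): Unit = {
    val vec = 1 to 100
    println(sumWithFold(vec)) // prints 5050
    println(sumBuiltIn(vec))  // prints 5050
  }
}
```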

Intro to Akka’s Actor Model






Therefore the form of parallelism is not limited

Code Dump!

A Simple Task

• Step 1: read from a CSV file that has 100,000,000 double elements (~1.7 GB).

read.csv() freaked out on my MacBook Air. It had been stuck on this call for 20+ hours:

> vector <- read.csv("./vector.csv", quote = F, row.names = F)

• Step 2: calculate its sum

There are existing R packages like ff and bigmemory that address these out-of-memory issues, but I want to demonstrate an alternative method that is much more generic, robust and scalable.
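The core idea behind the task, before any actors or Rserve are involved, is to stream the file in fixed-size chunks and add up partial sums so the whole vector never has to sit in RAM. A minimal single-threaded sketch using only the Scala standard library is below. It assumes one number per line with no header row, and `ChunkedSum`, `chunkedSum` and `sumFile` are hypothetical names, not code from the talk.

```scala
import scala.io.Source

object ChunkedSum {
  // Sum an iterator of numeric strings chunk by chunk, so that only one
  // chunk of parsed values is materialized in memory at a time.
  def chunkedSum(lines: Iterator[String], chunkSize: Int): Double =
    lines
      .grouped(chunkSize)                           // lazily cut into chunks
      .map(chunk => chunk.map(_.trim.toDouble).sum) // partial sum per chunk
      .sum                                          // combine partial sums

  // Hypothetical helper: stream a file that holds one number per line
  // (assumption: no header row).
  def sumFile(path: String, chunkSize: Int): Double = {
    val source = Source.fromFile(path)
    try chunkedSum(source.getLines(), chunkSize)
    finally source.close()
  }
}
```

Calling `ChunkedSum.sumFile("./vector.csv", 1000000)` would mirror the chunk size used later in the deck. Because floating-point addition is not associative, a chunked sum can differ from a single-pass sum in the last few digits.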

case ProcessData(sum: Double, isEnd: Boolean)



case DoWork(ind: Int, size: Int)


sender ! doWork(ind, size - 1)

sender ! processData(sum, isEnd)

Producer Class

class Producer extends Actor with ActorLogging {
  // Some inputs
  var (size, nworker) = (1000000, 10)
  // Some counters and result holder
  var (ind, ncorpse, sum_total): (Int, Int, Double) = (0, 0, 0.0)
  // Create the router
  val workerRouter = context.actorOf(
    Props(new Worker(self, sum_total)).withRouter(RoundRobinRouter(nworker)),
    name = "workerRouter"
  )
  // Read the file and chop it into pieces
  val iterator = Source.fromFile("./vector.csv").getLines.grouped(size)
  // What to do when it enters
  override def preStart() = println(s"Producer $self is alive")
  // What to do when it exits
  override def postStop() = println(s"Producer $self is dead. The sum is $sum_total")
  // What messages to receive
  override def receive = {
    case ProcessData(sum) =>
      sum_total += sum
      if (iterator.hasNext) {
        sender ! DoWork(
      } else {
        ncorpse += 1
        context.stop(sender)
      }
      if (ncorpse == nworker) context.stop(self)
  }
}

Worker Class

class Worker(master: ActorRef, sum_total: Double) extends Actor with ActorLogging {
  override def preStart() = {
    println(s"Worker $self is alive!!!")
    master ! ProcessData(sum_total)
  }
  override def receive = {
    case DoWork(iter) =>
      // Rserve
      val c: RConnection = new RConnection()
      c.assign("x", iter.toArray)
      val sum: Double = c.eval("sum(as.numeric(x))").asDouble()
      c.close()
      // Asking for more
      println(s"$self => Partial Sum: $sum, Size: ${iter.length}")
      sender ! ProcessData(sum)
  }
}

Main

object Application extends App {
  override def main(arg: Array[String]) {
    val system = ActorSystem("ClusterSystem")
    system.actorOf(Props[Producer], name = "producer")
  }
}

object ClusterMessageProtocol {
  sealed trait Message
  // Producer side
  case class InitiateWorker(worker: ActorRef) extends Message
  case class ProcessData(sum: Double) extends Message
  // Actor side
  case class DoWork(iter: List[String]) extends Message
}
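One reason to declare the messages under a `sealed trait` is that the compiler can then warn when a pattern match forgets a case. The sketch below shows this with a simplified copy of the protocol that drops `InitiateWorker` so it compiles without Akka. `ProtocolSketch` and `describe` are hypothetical names used only for illustration.

```scala
object ProtocolSketch {
  // Simplified protocol: same shape as the deck's messages, minus ActorRef.
  sealed trait Message
  final case class ProcessData(sum: Double) extends Message
  final case class DoWork(iter: List[String]) extends Message

  // Because Message is sealed, the compiler can check that this match
  // handles every message type.
  def describe(msg: Message): String = msg match {
    case ProcessData(sum) => s"partial sum $sum"
    case DoWork(iter)     => s"chunk of ${iter.length} lines"
  }
}
```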

Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$h#504275836] is alive!!!

Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1071584906] is alive!!!

Producer Actor[akka://ClusterSystem/user/producer#1272599354] is alive

Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -964.3282348781046, Size: 1000000

Actor[akka://ClusterSystem/user/producer/workerRouter/$f#500982456] => Partial Sum: -177.85266733478048, Size: 1000000

Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1850062035] => Partial Sum: -547.8233029081448, Size: 1000000

Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -660.0674912837135, Size: 1000000

Producer Actor[akka://ClusterSystem/user/producer#1420020857] is dead. The sum is -13615.40143829277

> sum(vector)
[1] -13615.4

Applications

1. Optimization Problems

Evaluating objective function, simulation in parallel (Differential Evolution!)

2. Distributed Matrix Operations

Product, transpose, inverse of distributed matrices, quadratic programming in large dimensional space

3. Real-time machine learning

Linear/logistic regression (see 2), Random Forest, Neural network

4. Statistical Inference

Bootstrap, sampling, log-likelihood estimation, Bayesian
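As a rough illustration of the first item in the list above, evaluating an objective function over many candidates in parallel can be sketched with standard-library `Future`s. This is not the Akka + Rserve setup from the talk, and the objective (`sphere`, a sum of squares) and the names `ParallelEval` and `evaluateAll` are assumptions chosen only to keep the example self-contained.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelEval {
  // Hypothetical objective function: sum of squares of the coordinates.
  def sphere(x: Vector[Double]): Double = x.map(v => v * v).sum

  // Evaluate every candidate on the global thread pool; results come back
  // in the same order as the input candidates.
  def evaluateAll(candidates: Seq[Vector[Double]]): Seq[Double] = {
    val futures = candidates.map(c => Future(sphere(c)))
    Await.result(Future.sequence(futures), 30.seconds)
  }
}
```

In an actor-based version, each candidate would instead be sent to a worker as a message, in the same way chunks of the CSV are dispatched earlier in the deck.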

Thank You! Any Questions?

Email: [email protected] LinkedIn:

Zhihu: 伊首衡