Seattle useR Group - R + Scala

Shouheng Yi Data Scientist [email protected] www.linkedin.com/in/shouhengyi + Seattle useR Group 05/05/2015

R is Hard to Scale

• Architectural Parallelism: most of R’s parallelism is done at the CPU level using MPI

• Data Parallelism: data must be fully present in RAM during an R session

• Why?

R is built on C and Fortran

Debugging Deadlocks - Good Times

Scientists vs. Developers

• Scientists and researchers love R, because most of their computing tasks are iterative/procedural

• Software engineers are less impressed, because they need to develop concurrent, reactive and robust applications

To be exact: Akka + Rserve

Why I Found Scala Useful

• Lives on JVM (most devs are comfortable with JVM)

• Great frameworks and libraries - Akka, Slick, Spark, etc.

• Syntactic sugar (less typing) -> easier to debug -> rapid development

R

vec <- 1:100

sum <- 0

for(i in vec){ sum <- sum + i }

Scala

val vec = 1 to 100

val sum = (0 /: vec)((a, b) => a + b)
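The `/:` operator in the Scala line above is shorthand for `foldLeft`, and it is deprecated as of Scala 2.13. A minimal sketch of the same sum written two other ways, using only the standard library (the object and value names here are illustrative, not from the talk):

```scala
object FoldDemo {
  val vec = 1 to 100

  // Same fold as (0 /: vec)((a, b) => a + b), spelled out with foldLeft
  val viaFold: Int = vec.foldLeft(0)((acc, x) => acc + x)

  // The built-in shortcut for numeric collections
  val viaSum: Int = vec.sum

  def main(args: Array[String]): Unit =
    println(s"foldLeft: $viaFold, sum: $viaSum")
}
```

Both values come out to 5050, matching the R loop.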

Intro to Akka’s Actor Model

[Diagram: two actors, each with its own inbox]

Eventually…

Therefore the form of parallelism is not limited

Code Dump!

A Simple Task

• Step 1: read from a CSV file that has 100,000,000 double elements (~1.7 GB).

read.csv() freaked out on my MacBook Air. This call had been running for 20+ hours:

> vector <- read.csv("./vector.csv", quote = F, row.names = F)

• Step 2: calculate its sum

Existing R packages such as ff and bigmemory already address these out-of-memory issues, but I want to demonstrate an alternative method that is more generic, robust and scalable
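The core idea behind the alternative is that a sum never needs the whole vector in RAM: the file can be streamed in fixed-size chunks and the partial sums added up. A minimal single-threaded sketch of that idea in plain Scala, assuming the file holds one number per line with no header; `ChunkedSum`, `chunkedSum` and `sumFile` are hypothetical names, and this is not the Akka + Rserve code from the talk:

```scala
import scala.io.Source

object ChunkedSum {
  // Sums one number per line, holding at most `chunkSize` lines in memory at a time.
  def chunkedSum(lines: Iterator[String], chunkSize: Int): Double =
    lines
      .grouped(chunkSize)                           // lazily cut the stream into chunks
      .map(chunk => chunk.map(_.trim.toDouble).sum) // partial sum of each chunk
      .sum                                          // combine the partial sums

  // Hypothetical helper: stream a file such as ./vector.csv
  // (assumption: one value per line, no header row).
  def sumFile(path: String, chunkSize: Int = 1000000): Double = {
    val source = Source.fromFile(path)
    try chunkedSum(source.getLines(), chunkSize)
    finally source.close()
  }
}
```

For example, `ChunkedSum.chunkedSum(Iterator("1.5", "2.5", "3.0"), 2)` processes two chunks and returns 7.0.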

Rserve

> library(Rserve)

> Rserve()

Starting Rserve: /Library/Frameworks/R.framework/Resources/bin/R CMD /Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rserve/libs//Rserve

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

Rserv started in daemon mode.

Producer (Inbox): case ProcessData(sum: Double, isEnd: Boolean) => sender ! DoWork(ind, size - 1)

Worker (Inbox): case DoWork(ind: Int, size: Int) => sender ! ProcessData(sum, isEnd)

Producer Class

import akka.actor.{Actor, ActorLogging, Props}
import akka.routing.RoundRobinRouter
import scala.io.Source
import ClusterMessageProtocol._

class Producer extends Actor with ActorLogging {
  // Some inputs: lines per chunk and number of workers
  var (size, nworker) = (1000000, 10)
  // Some counters and result holder
  var (ind, ncorpse, sum_total): (Int, Int, Double) = (0, 0, 0.0)
  // Create the router
  val workerRouter = context.actorOf(
    Props(new Worker(self, sum_total)).withRouter(RoundRobinRouter(nworker)),
    name = "workerRouter"
  )
  // Read file and chop it into pieces
  val iterator = Source.fromFile("./vector.csv").getLines.grouped(size)
  // What to do when it enters
  override def preStart() = println(s"Producer $self is alive")
  // What to do when it exits
  override def postStop() = println(s"Producer $self is dead. The sum is $sum_total")
  // What messages to receive
  override def receive = {
    case ProcessData(sum) =>
      sum_total += sum
      if (iterator.hasNext) {
        // More data left: hand the next chunk to the worker that just reported
        sender ! DoWork(iterator.next)
      } else {
        // No data left: retire this worker
        ncorpse += 1
        context.stop(sender)
      }
      // Shut down once every worker has been retired
      if (ncorpse == nworker) context.stop(self)
  }
}

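Worker Class

import akka.actor.{Actor, ActorLogging, ActorRef}
import org.rosuda.REngine.Rserve.RConnection
import ClusterMessageProtocol._

class Worker(master: ActorRef, sum_total: Double) extends Actor with ActorLogging {
  override def preStart() = {
    println(s"Worker $self is alive!!!")
    // Report in to the producer, which replies with the first chunk
    master ! ProcessData(sum_total)
  }
  override def receive = {
    case DoWork(iter) =>
      // Rserve: ship the chunk to R and let R compute the partial sum
      val c: RConnection = new RConnection()
      c.assign("x", iter.toArray)
      val sum: Double = c.eval("sum(as.numeric(x))").asDouble()
      c.close()
      // Asking for more
      println(s"$self => Partial Sum: $sum, Size: ${iter.length}")
      sender ! ProcessData(sum)
  }
}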

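Main

import akka.actor.{ActorSystem, Props}

object Application extends App {
  // The body of an App object already runs as the program's entry point,
  // so main does not need to be overridden
  val system = ActorSystem("ClusterSystem")
  system.actorOf(Props[Producer], name = "producer")
}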

import akka.actor.ActorRef

object ClusterMessageProtocol {
  sealed trait Message
  // Producer side
  case class InitiateWorker(worker: ActorRef) extends Message
  case class ProcessData(sum: Double) extends Message
  // Actor side
  case class DoWork(iter: List[String]) extends Message
}

Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$h#504275836] is alive!!!

Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1071584906] is alive!!!

Producer Actor[akka://ClusterSystem/user/producer#1272599354] is alive

Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -964.3282348781046, Size: 1000000

Actor[akka://ClusterSystem/user/producer/workerRouter/$f#500982456] => Partial Sum: -177.85266733478048, Size: 1000000

Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1850062035] => Partial Sum: -547.8233029081448, Size: 1000000

Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -660.0674912837135, Size: 1000000

Producer Actor[akka://ClusterSystem/user/producer#1420020857] is dead. The sum is -13615.40143829277

> sum(vector)

[1] -13615.4
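The actor result matches R's own sum because addition can be split across chunks and recombined in any order. A minimal sketch of the same split-and-combine pattern using standard-library Futures in place of Akka actors and Rserve (a deliberately swapped-in technique, not the talk's implementation; `ParallelPartialSums` and `parallelSum` are hypothetical names):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelPartialSums {
  // Splits `values` into chunks, sums each chunk in its own Future,
  // then adds the partial sums together.
  def parallelSum(values: Seq[Double], chunkSize: Int): Double = {
    val partials: List[Future[Double]] =
      values.grouped(chunkSize).map(chunk => Future(chunk.sum)).toList
    // Wait for all partial sums; the one-minute timeout is an arbitrary choice
    Await.result(Future.sequence(partials), 1.minute).sum
  }
}
```

With floating-point data the chunked total can differ from a sequential sum in the last few digits, since the additions happen in a different order.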

Applications

1. Optimization Problems

Evaluating objective function, simulation in parallel (Differential Evolution!)

2. Distributed Matrix Operations

Product, transpose, inverse of distributed matrices, quadratic programming in large dimensional space

3. Real-time machine learning

Linear/logistic regression (see 2), Random Forest, Neural network

4. Statistical Inference

Bootstrap, sampling, log-likelihood estimation, Bayesian

Thank You! Any Questions?

Email: [email protected] LinkedIn: www.linkedin.com/in/shouhengyi

Zhihu: 伊首衡