Real time machine learning visualization with spark -- Hadoop Summit 2016

Real Time Machine Learning Visualization with Spark Chester Chen, Ph.D Sr. Manager, Data Science & Engineering GoPro, Inc. Hadoop Summit, San Jose 2016

Transcript of Real time machine learning visualization with spark -- Hadoop Summit 2016

Page 1: Real time machine learning visualization with spark -- Hadoop Summit 2016

Real Time Machine Learning Visualization with Spark
Chester Chen, Ph.D
Sr. Manager, Data Science & Engineering
GoPro, Inc.

Hadoop Summit, San Jose 2016

Page 2: Real time machine learning visualization with spark -- Hadoop Summit 2016

Who am I?
• Sr. Manager of Data Science & Engineering at GoPro
• Founder and Organizer of SF Big Analytics Meetup (4500+ members)
• Previous Employment:
  – Alpine Data, Tinga, Clearwell/Symantec, AltaVista, Ascent Media, ClearStory Systems, WebWare
• Experience with Spark
  – Exposed to Spark since Spark 0.6
  – Architect for Alpine Spark Integration on Spark 1.1, 1.3 and 1.5.x
• Hadoop Distribution
  – CDH, HDP and MapR

Page 3: Real time machine learning visualization with spark -- Hadoop Summit 2016
Page 4: Real time machine learning visualization with spark -- Hadoop Summit 2016

Growing data needs

Page 5: Real time machine learning visualization with spark -- Hadoop Summit 2016

Lightning-fast cluster computing

Real Time ML Visualization with Spark

http://spark.apache.org/

Page 6: Real time machine learning visualization with spark -- Hadoop Summit 2016

Iris data set, K-Means clustering with K=3

[Figure: scatter plot, Sepal width vs Petal length. Legend: Cluster 0, Cluster 1, Cluster 2, Centroids]

Page 7: Real time machine learning visualization with spark -- Hadoop Summit 2016

Iris data set, K-Means clustering with K=3

[Figure: clustering plot with a "distance" annotation]

Page 8: Real time machine learning visualization with spark -- Hadoop Summit 2016

What is K-Means?

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk}.

The clusters are determined by minimizing the within-cluster sum of squares (WCSS), that is, the sum of squared distances from each point in a cluster to that cluster's center. In other words, the objective is to find

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where μi is the mean of the points in Si.

https://en.wikipedia.org/wiki/K-means_clustering
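The objective above can be evaluated for a fixed set of centers, which is the "cost" plotted per iteration later in the deck. The following is a minimal sketch in plain Scala, not the Spark MLlib implementation; the names `KMeansCost`, `squaredDistance`, `closest`, and `cost` are assumptions made for illustration.

```scala
// Sketch only: within-cluster sum of squares for a fixed set of centers.
// Points are plain Array[Double]; this does not use Spark or MLlib.
object KMeansCost {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "dimension mismatch")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Index of the center closest to p (centers must be non-empty).
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Each point contributes its squared distance to its nearest center.
  def cost(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => squaredDistance(p, centers(closest(p, centers)))).sum
}
```

For example, with points (0,0), (2,0), (10,0) and centers (1,0), (10,0), the first two points each contribute 1 and the third contributes 0.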

Page 9: Real time machine learning visualization with spark -- Hadoop Summit 2016

Visualization Cost

[Figure: line chart "Cost vs Iteration". X-axis: iteration, 0 to 25. Y-axis: Cost, 34 to 38.5]

Page 10: Real time machine learning visualization with spark -- Hadoop Summit 2016

Real Time ML Visualization Use Cases

Use visualization to determine whether to end the training early

• Need a way to visualize the training process, including convergence, clustering, or residual plots
• Need a way to stop the training and save the current model
• Need a way to disable or enable the visualization

Page 11: Real time machine learning visualization with spark -- Hadoop Summit 2016

Real Time ML Visualization with Spark

DEMO

Page 12: Real time machine learning visualization with spark -- Hadoop Summit 2016

How to Enable Real Time ML Visualization?

• A callback interface for Spark machine learning algorithms to send messages
  – Algorithms decide when and what message to send
  – Algorithms don’t care how the message is delivered
• A task channel to handle the message delivery from Spark Driver to Spark Client
  – It doesn’t care about the content of the message or who sent the message
• The message is delivered from Spark Client to Browser
  – We use HTML5 Server-Sent Events (SSE) and HTTP Chunked Response (PUSH)
  – Pull is possible, but requires a message queue
• Visualization using JavaScript frameworks Plot.ly and D3

Page 13: Real time machine learning visualization with spark -- Hadoop Summit 2016

Spark Job in Yarn-Cluster mode

[Diagram. Application Host: Spark Client (Command Line, Rest API, Servlet). Hadoop Cluster: Yarn container running the Spark Driver (Spark Job, Spark Context, Spark ML algorithm)]

Page 14: Real time machine learning visualization with spark -- Hadoop Summit 2016

Spark Job in Yarn-Cluster mode

[Diagram. Application Host: Spark Client (Command Line, Rest API, Servlet). Hadoop Cluster: Spark Job (App Context, Spark ML Algorithms, ML Listener, Message Logger)]

Page 15: Real time machine learning visualization with spark -- Hadoop Summit 2016

Spark Job in Yarn-Cluster mode

[Diagram. Browser. Application Host: Web/Rest API Server, Spark Client. Hadoop Cluster: Spark Job (App Context, Spark ML Algorithms, ML Listener, Message Logger). Connector label: Akka]

Page 16: Real time machine learning visualization with spark -- Hadoop Summit 2016

Enable Real Time ML Visualization

[Diagram. Browser (Plotly, D3). Web Server (Rest API Server). Spark Client. Hadoop Cluster: Spark Job (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener). Connector labels: SSE, Chunked Response, Akka]

Page 17: Real time machine learning visualization with spark -- Hadoop Summit 2016

Enable Real Time ML Visualization

[Diagram: same architecture as on Page 16]

Page 18: Real time machine learning visualization with spark -- Hadoop Summit 2016

Machine Learning Listeners

Page 19: Real time machine learning visualization with spark -- Hadoop Summit 2016

Callback Interface: ML Listener

trait MLListener {
  def onMessage(message: => Any): Unit
}

Page 20: Real time machine learning visualization with spark -- Hadoop Summit 2016

Callback Interface: MLListenerSupport

trait MLListenerSupport {
  // rest of code
  def sendMessage(message: => Any): Unit = {
    if (enableListener) {
      listeners.foreach(l => l.onMessage(message))
    }
  }
}
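The slide elides the "rest of code", so here is one possible way to fill it in, as a self-contained sketch. The storage of the listeners, the `enabled` flag, and the `RecordingListener` class are assumptions; only `onMessage`, `sendMessage`, `addListener`, `enableListener`, and `listenerEnabled` are names that appear in the deck.

```scala
import scala.collection.mutable.ArrayBuffer

trait MLListener {
  def onMessage(message: => Any): Unit
}

trait MLListenerSupport {
  // Assumed state: the slides do not show how listeners or the flag are stored.
  private val listeners = ArrayBuffer.empty[MLListener]
  @volatile private var enabled = false

  def addListener(l: MLListener): this.type = { listeners += l; this }
  def enableListener(flag: Boolean): this.type = { enabled = flag; this }
  def listenerEnabled(): Boolean = enabled

  // The message is passed by name, so it is never built while the
  // listener mechanism is disabled.
  def sendMessage(message: => Any): Unit =
    if (enabled) listeners.foreach(l => l.onMessage(message))
}

// Hypothetical listener that simply records what it receives.
class RecordingListener extends MLListener {
  val received = ArrayBuffer.empty[Any]
  override def onMessage(message: => Any): Unit = { received += message; () }
}
```

The by-name parameter is the design point worth noting: an algorithm can call `sendMessage` with an expensive expression on every iteration and pay nothing for it when visualization is turned off.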

Page 21: Real time machine learning visualization with spark -- Hadoop Summit 2016

KMeansExt: KMeans with MLListener

class KMeansExt private (…) extends Serializable with Logging with MLListenerSupport { ... }

Page 22: Real time machine learning visualization with spark -- Hadoop Summit 2016

KMeansExt: KMeans with MLListener

case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)

private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
  ...
  while (!stopIteration && iteration < maxIterations && !activeRuns.isEmpty) {
    ...
    if (listenerEnabled()) {
      sendMessage(KMeansCoreStats(…))
    }
    ...
  }
}
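To make the loop shape concrete without Spark, the sketch below runs a one-dimensional Lloyd's iteration that reports statistics after every pass and checks a stop flag before each pass. It is an illustration under stated assumptions, not the KMeansExt code: `TinyKMeans`, `CoreStats`, and the `onStats` callback are hypothetical, and `Vector` here is Scala's immutable collection, not MLlib's vector type.

```scala
// Sketch only: per-iteration reporting plus an external stop flag.
case class CoreStats(iteration: Int, centers: Vector[Double], cost: Double)

class TinyKMeans(maxIterations: Int, onStats: CoreStats => Unit) {
  // Set from outside (for example by a task observer) to end training early.
  @volatile var stopIteration = false

  def run(data: Seq[Double], initial: Vector[Double]): Vector[Double] = {
    var centers = initial
    var iteration = 0
    while (!stopIteration && iteration < maxIterations) {
      // Assign each point to the index of its nearest center.
      val groups = data.groupBy(x => centers.indices.minBy(i => math.abs(x - centers(i))))
      // Move each center to the mean of its points; keep it if it has none.
      centers = centers.indices.map { i =>
        groups.get(i).map(g => g.sum / g.size).getOrElse(centers(i))
      }.toVector
      // Cost: squared distance of every point to its nearest center.
      val cost = data.map(x => centers.map(c => (x - c) * (x - c)).min).sum
      iteration += 1
      onStats(CoreStats(iteration, centers, cost))
    }
    centers
  }
}
```

Because the flag is read at the top of every pass, a stop request takes effect after the current iteration finishes and the centers computed so far are returned as the model.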

Page 23: Real time machine learning visualization with spark -- Hadoop Summit 2016

KMeans ML Listener

class KMeansListener(columnNames: List[String],
                     data: RDD[Vector],
                     logger: MessageLogger) extends MLListener {
  var sampleDataOpt: Option[Array[Vector]] = None

  override def onMessage(message: => Any): Unit = {
    message match {
      case coreStats: KMeansCoreStats =>
        if (sampleDataOpt.isEmpty)
          sampleDataOpt = Some(data.takeSample(withReplacement = false, num = 100))
        // use the KMeans model of the current iteration to predict sample cluster indexes
        val kMeansModel = new KMeansModel(coreStats.centers)
        val cluster = sampleDataOpt.get.map(vector => (vector.toArray, kMeansModel.predict(vector)))
        val msg = KMeansStats(…)
        logger.sendBroadCastMessage(MLConstants.KMEANS_CENTER, msg)
      case _ =>
        println(" message lost")
    }
  }
}

Page 24: Real time machine learning visualization with spark -- Hadoop Summit 2016

KMeans Spark Job Setup

val appCtxOpt: Option[ApplicationContext] = …
val kMeans = new KMeansExt()
  .setK(numClusters)
  .setEpsilon(epsilon)
  .setMaxIterations(maxIterations)
  .enableListener(enableVisualization)
  .addListener(new KMeansListener(...))

appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))

kMeans.run(vectors)

Page 25: Real time machine learning visualization with spark -- Hadoop Summit 2016

ML Task Observer

• Receives commands from the user to update the running Spark job
• Once it receives an UpdateTask command through the notify call, it performs the necessary update operation

trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}

class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger) extends TaskObserver {
  // implement notify
}
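The body of `notify` is not shown in the deck. A minimal sketch of what it might do follows, assuming two command types and a small control trait; `StopTraining`, `SetVisualization`, `TrainingControl`, and the `log` function parameter are all hypothetical stand-ins, since the contents of `UpdateTaskCmd`, `MLListenerSupport`, and `MessageLogger` are not given.

```scala
// Hypothetical commands; the slides do not show what UpdateTaskCmd contains.
sealed trait UpdateTaskCmd
case object StopTraining extends UpdateTaskCmd
final case class SetVisualization(enabled: Boolean) extends UpdateTaskCmd

// Minimal stand-in for the control surface of MLListenerSupport.
trait TrainingControl {
  @volatile var stopIteration: Boolean = false
  @volatile var visualizationEnabled: Boolean = true
}

trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}

// Translates user commands into flag changes on the running algorithm.
class MLTaskObserver(support: TrainingControl, log: String => Unit) extends TaskObserver {
  override def notify(task: UpdateTaskCmd): Unit = task match {
    case StopTraining =>
      support.stopIteration = true
      log("stop requested")
    case SetVisualization(flag) =>
      support.visualizationEnabled = flag
      log(s"visualization enabled = $flag")
  }
}
```

The observer only flips flags; the training loop notices them on its next iteration, so no Spark job has to be killed to stop early.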

Page 26: Real time machine learning visualization with spark -- Hadoop Summit 2016

Logistic Regression MLListener

class LogisticRegression(…) extends MLListenerSupport {
  def train(data: RDD[(Double, Vector)]): LogisticRegressionModel = {
    // initialization code
    val (rawWeights, loss) = OWLQN.runOWLQN(…)
    generateLORModel(…)
  }
}

Page 27: Real time machine learning visualization with spark -- Hadoop Summit 2016

Logistic Regression MLListener

object OWLQN extends Logging {
  def runOWLQN(/*args*/, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
    val costFun = new CostFun(data, mlSupport, IterationState(), /*other args */)
    val states: Iterator[lbfgs.State] =
      lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector)
    …
  }
}

Page 28: Real time machine learning visualization with spark -- Hadoop Summit 2016

Logistic Regression MLListener

In the cost function:

override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
  val shouldStop = mlSupport.exists(_.stopIteration)
  if (!shouldStop) {
    …
    // send the current iteration state only when a listener is enabled
    mlSupport.filter(_.listenerEnabled()).foreach { s =>
      s.sendMessage((iState.iteration, w, loss))
    }
    …
  } else {
    …
  }
}

Page 29: Real time machine learning visualization with spark -- Hadoop Summit 2016

Task Communication Channel

Page 30: Real time machine learning visualization with spark -- Hadoop Summit 2016

Task Channel: Akka Messaging

[Diagram. Spark Application: Application Context with an Actor System (Messager Actor), SparkContext, Spark tasks. Task Channel Actor. Connector labels: Akka]

Page 31: Real time machine learning visualization with spark -- Hadoop Summit 2016

Task Channel: Akka messaging

[Diagram: same architecture as on Page 16 (Browser, Web Server, Spark Client, Spark Job with Task Channel; connector labels SSE, Chunked Response, Akka)]

Page 32: Real time machine learning visualization with spark -- Hadoop Summit 2016

Push To The Browser

Page 33: Real time machine learning visualization with spark -- Hadoop Summit 2016

HTTP Chunked Response and SSE

[Diagram: same architecture as on Page 16 (Browser, Web Server, Spark Client, Spark Job with Task Channel; connector labels SSE, Chunked Response, Akka)]

Page 34: Real time machine learning visualization with spark -- Hadoop Summit 2016

HTML5 Server-Sent Events (SSE)

• Server-Sent Events (SSE) is one-way messaging from server to browser
• An event is when a web page automatically gets an update from the server
• Register an event source (JavaScript):
    var source = new EventSource(url);
• The callback onmessage(data):
    source.onmessage = function(message){...}
• Data format (each line starts with "data: "; a blank line ends the event):
    data: {\n
    data: "key": "value"\n
    data: }\n\n
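The data format above can be produced and read back with a few lines of code. This is a sketch of the framing rule only (a "data: " prefix per payload line and a blank line to end the event); the `Sse` object and its method names are assumptions, and it ignores the other SSE fields such as `event`, `id`, and `retry`.

```scala
object Sse {
  // Frames a payload as one SSE event: every payload line gets a "data: "
  // prefix, and one extra newline (a blank line) terminates the event.
  def event(payload: String): String =
    payload.split("\n", -1).map(line => s"data: $line\n").mkString + "\n"

  // Reverses the framing: keeps the data lines, strips the field name and
  // one optional leading space, and joins the lines with newlines.
  def data(event: String): String =
    event.split("\n")
      .filter(_.startsWith("data:"))
      .map(_.stripPrefix("data:").stripPrefix(" "))
      .mkString("\n")
}
```

A multi-line JSON payload therefore travels as several `data:` lines inside a single event, and the browser's `onmessage` callback sees the lines joined back together.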

Page 35: Real time machine learning visualization with spark -- Hadoop Summit 2016

HTTP Chunked Response

Spray Rest Server supports Chunked Response

val responseStart = HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n"))
requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)

val nextChunk = MessageChunk(s"data: $r \n\n")
requestCtx.responder ! nextChunk.withAck(Messages.Ack)

requestCtx.responder ! MessageChunk(s"data: Finished \n\n")
requestCtx.responder ! ChunkedMessageEnd

Page 36: Real time machine learning visualization with spark -- Hadoop Summit 2016

Push vs. Pull

Push

Pros
• The data is streamed (pushed) to the browser via chunked response
• There is no need for a data queue, but the data can be lost if not consumed
• Multiple pages can be pushed to at the same time, which allows multiple visualization views

Cons
• With a slow network, a slow browser, and fast data iterations, the data might all show up in the browser at once, rather than as a nice iteration-by-iteration display
• If you control the chunked response by network acknowledgement, the visualization may not show up at all, because data is not pushed while acknowledgements are slow

Page 37: Real time machine learning visualization with spark -- Hadoop Summit 2016

Push vs. Pull

Pull

Pros
• Messages do not get lost, since they can be temporarily stored in the message queue
• The visualization renders at an even pace

Cons
• Need to periodically send server requests for updates
• We need a message queue to hold messages until they are consumed
• Hard to support rendering on multiple pages with a simple message queue
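The pull trade-offs listed above can be seen in a small sketch of a server-side buffer. This is an assumed design for illustration, not the presenter's implementation; `PullBuffer`, `publish`, and `drain` are hypothetical names, and the only library type used is `java.util.concurrent.ConcurrentLinkedQueue`.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical server-side buffer for the pull model.
class PullBuffer[A <: AnyRef] {
  private val queue = new ConcurrentLinkedQueue[A]()

  // Called by the training side for every message; nothing is lost
  // while the browser is slow, because messages wait in the queue.
  def publish(msg: A): Unit = { queue.add(msg); () }

  // Called when a browser polls: returns and removes everything buffered.
  // poll() returns null on an empty queue, which Option turns into None.
  def drain(): List[A] =
    Iterator.continually(Option(queue.poll())).takeWhile(_.isDefined).flatten.toList
}
```

Because `drain` removes what it returns, a second page polling the same buffer receives nothing, which is the multiple-page limitation noted in the cons.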

Page 38: Real time machine learning visualization with spark -- Hadoop Summit 2016

Visualization: Plot.ly + D3

[Figure: Alpine Workflow screenshots with plots titled "Cost vs. Iteration", "ArrTime vs. Distance", and "ArrTime vs. DepTime"]

Page 39: Real time machine learning visualization with spark -- Hadoop Summit 2016

Use Plot.ly to render graph

function showCost(dataParsed) {
  var costTrace = { … };
  var data = [ costTrace ];
  var costLayout = {
    xaxis: {…},
    yaxis: {…},
    title: …
  };
  Plotly.newPlot('cost', data, costLayout);
}

Page 40: Real time machine learning visualization with spark -- Hadoop Summit 2016

Real Time ML Visualization: Summary
• Training a machine learning model involves a lot of experimentation; we need a way to visualize the training process.
• We presented a system to enable real time machine learning visualization with Spark:
  – Gives visibility into the training of a model
  – Allows us to monitor the convergence of the algorithms during training
  – Can stop the iterations when convergence is good enough

Page 41: Real time machine learning visualization with spark -- Hadoop Summit 2016

Thank You

Chester Chen

LinkedIn: https://www.linkedin.com/in/chester-chen-3205992

SlideShare: http://www.slideshare.net/ChesterChen/presentations

Demo video: https://youtu.be/DkbYNYQhrao