Real time machine learning visualization with spark -- Hadoop Summit 2016
Real Time Machine Learning Visualization with Spark
Chester Chen, Ph.D
Sr. Manager, Data Science & Engineering
GoPro, Inc.
Hadoop Summit, San Jose 2016
Who am I?
• Sr. Manager of Data Science & Engineering at GoPro
• Founder and Organizer of SF Big Analytics Meetup (4500+ members)
• Previous Employment:
  – Alpine Data, Tinga, Clearwell/Symantec, AltaVista, Ascent Media, ClearStory Systems, WebWare.
• Experience with Spark
  – Exposed to Spark since Spark 0.6
  – Architect for Alpine Spark Integration on Spark 1.1, 1.3 and 1.5.x
• Hadoop Distribution
  – CDH, HDP and MapR
Growing data needs
Lightning-fast cluster computing
Real Time ML Visualization with Spark
http://spark.apache.org/
Iris data set, K-Means clustering with K=3
[Figure: scatter plot of sepal width vs. petal length, showing Cluster 0, Cluster 1, Cluster 2 and the centroids]
Iris data set, K-Means clustering with K=3
[Figure: the clustering plot, annotated with "distance"]
What is K-Means?
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk}.
The clusters are determined by minimizing the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point in a cluster to that cluster's center. In other words, the objective is to find
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where μi is the mean of points in Si.
https://en.wikipedia.org/wiki/K-means_clustering
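The cost described above can be sketched in a few lines of plain Scala. This is only an illustration and not the Spark MLlib implementation: the `WcssSketch` object and its `Point` alias are names assumed here, and points are plain arrays rather than Spark vectors.

```scala
object WcssSketch {
  // Assumed representation for illustration: a point is an array of coordinates.
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to the point.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Each point contributes its squared distance to its nearest center.
  def cost(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => squaredDistance(p, centers(closest(p, centers)))).sum
}
```

Plotting this value after every iteration gives the kind of cost curve that the rest of the talk visualizes.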
Visualization Cost
[Figure: Cost vs. Iteration. x-axis: iteration, 0 to 25; y-axis: cost, 34 to 38.5]
Real Time ML Visualization: Use Cases
Use visualization to determine whether to end the training early
• Need a way to visualize the training process, including convergence, clustering or residual plots, etc.
• Need a way to stop the training and save the current model
• Need a way to disable or enable the visualization
Real Time ML Visualization with Spark
DEMO
How to Enable Real Time ML Visualization?
• A callback interface for Spark machine learning algorithms to send messages
  – Algorithms decide when and what message to send
  – Algorithms don't care how the message is delivered
• A task channel to handle the message delivery from Spark Driver to Spark Client
  – It doesn't care about the content of the message or who sent the message
• The message is delivered from Spark Client to Browser
  – We use HTML5 Server-Sent Events (SSE) and HTTP Chunked Response (PUSH)
  – Pull is possible, but requires a message queue
• Visualization using JavaScript frameworks Plot.ly and D3
Spark Job in Yarn-Cluster mode
[Diagram. Application Host: Spark Client (Command Line, Rest API, Servlet). Hadoop Cluster: Yarn Container running the Spark Driver (Spark Job, Spark Context, Spark ML algorithm)]
Spark Job in Yarn-Cluster mode
[Diagram. Application Host: Spark Client (Command Line, Rest API, Servlet). Hadoop Cluster: Spark Job (App Context, Spark ML Algorithms, ML Listener, Message Logger)]
Spark Job in Yarn-Cluster mode
[Diagram. Browser; Web/Rest API Server; Application Host: Spark Client; Hadoop Cluster: Spark Job (App Context, Spark ML Algorithms, ML Listener, Message Logger); Akka]
Enable Real Time ML Visualization
[Diagram. Browser (Plotly, D3) receiving SSE; Web Server / Rest API Server (Akka, Chunked Response); Spark Client; Hadoop Cluster: Spark Job (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener); Akka]
Machine Learning Listeners
Callback Interface: ML Listener
trait MLListener {
  def onMessage(message: => Any): Unit
}
Callback Interface: MLListenerSupport
trait MLListenerSupport {
  // rest of code
  def sendMessage(message: => Any): Unit = {
    if (enableListener) {
      listeners.foreach(l => l.onMessage(message))
    }
  }
}
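The slide elides the rest of the support trait. A minimal self-contained sketch of how it could look is given below; the listener buffer, the `enabled` flag, the fluent `addListener` / `enableListener` methods and the `RecordingListener` class are assumptions made for illustration, not the speaker's actual code.

```scala
import scala.collection.mutable.ListBuffer

object ListenerSketch {
  trait MLListener {
    def onMessage(message: => Any): Unit
  }

  trait MLListenerSupport {
    // Assumed internal state; the slide does not show these fields.
    private val listeners = ListBuffer.empty[MLListener]
    private var enabled = false

    def addListener(l: MLListener): this.type = { listeners += l; this }
    def enableListener(flag: Boolean): this.type = { enabled = flag; this }
    def listenerEnabled(): Boolean = enabled

    // By-name parameter: the message expression is evaluated only when a listener reads it.
    def sendMessage(message: => Any): Unit =
      if (enabled) listeners.foreach(l => l.onMessage(message))
  }

  // Hypothetical listener that simply records what it receives.
  class RecordingListener extends MLListener {
    val received = ListBuffer.empty[Any]
    override def onMessage(message: => Any): Unit = received += message
  }
}
```

Because `message` is passed by name, an algorithm can build an expensive statistics object inside the `sendMessage(...)` call and pay nothing for it while visualization is disabled.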
KMeansExt: KMeans with MLListener
class KMeansExt private (…) extends Serializable with Logging with MLListenerSupport { ... }
KMeansExt: KMeans with MLListener
case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)

private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
  ...
  while (!stopIteration && iteration < maxIterations && !activeRuns.isEmpty) {
    ...
    if (listenerEnabled()) {
      sendMessage(KMeansCoreStats(…))
    }
    ...
  }
}
KMeans ML Listener
class KMeansListener(columnNames: List[String],
                     data: RDD[Vector],
                     logger: MessageLogger) extends MLListener {
  var sampleDataOpt: Option[Array[Vector]] = None

  override def onMessage(message: => Any): Unit = {
    message match {
      case coreStats: KMeansCoreStats =>
        // take the sample once and reuse it for every iteration
        if (sampleDataOpt.isEmpty)
          sampleDataOpt = Some(data.takeSample(withReplacement = false, num = 100))
        // use the KMeans model of the current iteration to predict sample cluster indexes
        val kMeansModel = new KMeansModel(coreStats.centers)
        val cluster = sampleDataOpt.get.map(vector => (vector.toArray, kMeansModel.predict(vector)))
        val msg = KMeansStats(…)
        logger.sendBroadCastMessage(MLConstants.KMEANS_CENTER, msg)
      case _ => println(" message lost")
    }
  }
}
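The listener above samples the data once and then relabels the same sample with each iteration's centers. That pattern can be sketched without Spark as follows; `SampleLabeler`, its `takeSample` function argument and the array-based `Point` type are hypothetical stand-ins for the RDD sampling and `KMeansModel.predict` calls on the slide.

```scala
object SampleLabelSketch {
  type Point = Array[Double]

  private def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the nearest center, standing in for a k-means model's predict.
  def predict(centers: Array[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // takeSample is a hypothetical stand-in for the (expensive) RDD sampling call.
  class SampleLabeler(takeSample: () => Array[Point]) {
    private var sampleOpt: Option[Array[Point]] = None

    // Labels the cached sample with the centers of the current iteration.
    def label(centers: Array[Point]): Array[Int] = {
      if (sampleOpt.isEmpty) sampleOpt = Some(takeSample())
      sampleOpt.get.map(p => predict(centers, p))
    }
  }
}
```

Keeping the sample fixed means the browser sees the same points move between clusters from one iteration to the next, instead of a new random scatter each time.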
KMeans Spark Job Setup
val appCtxOpt: Option[ApplicationContext] = …
val kMeans = new KMeansExt().setK(numClusters)
  .setEpsilon(epsilon)
  .setMaxIterations(maxIterations)
  .enableListener(enableVisualization)
  .addListener(new KMeansListener(...))

appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))

kMeans.run(vectors)
ML Task Observer
• Receives commands from the user to update a running Spark job
• Once it receives an UpdateTaskCmd through the notify call, it performs the necessary update operation

trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}

class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger) extends TaskObserver {
  // implement notify
}
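One way the observer and the training loop could fit together is sketched below. The slides do not show the fields of `UpdateTaskCmd` or the body of `notify`, so the `stop` field, the `ToyTrainer` class and the `StopObserver` class are all assumptions for illustration only.

```scala
object EarlyStopSketch {
  // Hypothetical command shape; the slides do not show UpdateTaskCmd's fields.
  final case class UpdateTaskCmd(stop: Boolean)

  trait TaskObserver {
    def notify(task: UpdateTaskCmd): Unit
  }

  // Toy iterative "algorithm" exposing the stop flag that its loop checks.
  class ToyTrainer(maxIterations: Int) {
    @volatile var stopIteration = false
    var iteration = 0

    // onIteration stands in for the per-iteration work and message sending.
    def run(onIteration: Int => Unit): Int = {
      while (!stopIteration && iteration < maxIterations) {
        iteration += 1
        onIteration(iteration)
      }
      iteration
    }
  }

  // Observer that turns a stop command into the flag the loop reads.
  class StopObserver(trainer: ToyTrainer) extends TaskObserver {
    override def notify(task: UpdateTaskCmd): Unit =
      if (task.stop) trainer.stopIteration = true
  }
}
```

The loop only checks the flag between iterations, so a stop request takes effect once the current iteration finishes, leaving a consistent model to save.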
Logistic Regression MLListener
class LogisticRegression(…) extends MLListenerSupport {
  def train(data: RDD[(Double, Vector)]): LogisticRegressionModel = {
    // initialization code
    val (rawWeights, loss) = OWLQN.runOWLQN(…)
    generateLORModel(…)
  }
}
Logistic Regression MLListener
object OWLQN extends Logging {
  def runOWLQN(/*args*/, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
    val costFun = new CostFun(data, mlSupport, IterationState(), /*other args */)
    val states: Iterator[lbfgs.State] =
      lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector)
    …
  }
}
Logistic Regression MLListener
In the cost function:
override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
  // the user can request a stop through the listener support
  val shouldStop = mlSupport.exists(_.stopIteration)
  if (!shouldStop) {
    …
    mlSupport.filter(_.listenerEnabled()).foreach { s =>
      s.sendMessage((iState.iteration, w, loss))
    }
    …
  } else {
    …
  }
}
Task Communication Channel
Task Channel: Akka Messaging
[Diagram. Spark Application: Application Context with an Actor System (Messager Actor, Task Channel Actor); SparkContext; Spark tasks; Akka]
Task Channel: Akka Messaging
[Diagram: same architecture as in "Enable Real Time ML Visualization"]
Push To The Browser
HTTP Chunked Response and SSE
[Diagram: same architecture as in "Enable Real Time ML Visualization"]
HTML5 Server-Sent Events (SSE)
• Server-Sent Events (SSE) is one-way messaging
  – An event is when a web page automatically gets updates from the server
• Register an event source (JavaScript)
  var source = new EventSource(url);
• The callback onMessage(data)
  source.onmessage = function(message){...}
• Data format: each line of the payload is prefixed with "data: " and a blank line ends the event
  data: {\n
  data: "key": "value"\n
  data: }\n\n
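The SSE data format can be produced by a small helper. The sketch below is illustrative only; `SseSketch.formatEvent` is a name assumed here and is not part of Spray or of the speaker's code.

```scala
object SseSketch {
  // Formats a payload as one SSE event: every payload line gets a "data: " prefix,
  // and the extra trailing newline produces the blank line that terminates the event.
  def formatEvent(payload: String): String =
    payload.split("\n", -1).map(line => s"data: $line\n").mkString + "\n"
}
```

In the browser, the `data:` lines of one event are joined with newlines before `onmessage` fires, so a multi-line JSON payload sent this way arrives as a single string that can be passed to `JSON.parse`.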
HTTP Chunked Response
Spray Rest Server supports Chunked Response

val responseStart = HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n"))
requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)

val nextChunk = MessageChunk(s"data: $r \n\n")
requestCtx.responder ! nextChunk.withAck(Messages.Ack)

requestCtx.responder ! MessageChunk(s"data: Finished \n\n")
requestCtx.responder ! ChunkedMessageEnd
Push vs. Pull: Push
Pros
• The data is streamed (pushed) to the browser via chunked response
• There is no need for a data queue, but the data can be lost if not consumed
• Multiple pages can be pushed at the same time, which allows multiple visualization views
Cons
• With a slow network, a slow browser and fast data iterations, the data might all show up in the browser at once, rather than as a nice iteration-by-iteration display
• If you control the chunked response by network acknowledgement, the visualization may not show up at all, because the data is not pushed while acknowledgements are slow
Push vs. Pull: Pull
Pros
• Messages do not get lost, since they can be temporarily stored in the message queue
• The visualization renders at an even pace
Cons
• Need to periodically send server requests for updates
• Need a message queue to hold messages until they are consumed
• Hard to support rendering on multiple pages with a simple message queue
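The pull trade-offs can be made concrete with a toy server-side buffer. This is a sketch under assumptions, not the speaker's design: `MessageBuffer`, `publish` and `poll` are hypothetical names, and the only library call used is the JDK's `ConcurrentLinkedQueue`.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

object PullQueueSketch {
  // Hypothetical server-side buffer for the pull approach.
  class MessageBuffer[A <: AnyRef] {
    private val queue = new ConcurrentLinkedQueue[A]()

    // Called by the training side for every iteration message.
    def publish(message: A): Unit = { queue.add(message); () }

    // Called on each browser request: drains everything queued since the previous poll.
    def poll(): List[A] =
      Iterator.continually(queue.poll()).takeWhile(_ != null).toList
  }
}
```

Nothing is lost between polls, but a poll drains the queue, so a second page polling the same buffer receives nothing. That is the multi-page limitation listed under the cons.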
Visualization: Plot.ly + D3
[Figures: Cost vs. Iteration; ArrTime vs. Distance; ArrTime vs. DepTime]
Alpine Workflow
Use Plot.ly to render graph
function showCost(dataParsed) {
  var costTrace = { … };
  var data = [ costTrace ];
  var costLayout = { xaxis: {…}, yaxis: {…}, title: … };
  Plotly.newPlot('cost', data, costLayout);
}
Real Time ML Visualization: Summary
• Training a machine learning model involves a lot of experimentation; we need a way to visualize the training process.
• We presented a system to enable real time machine learning visualization with Spark:
  – Gives visibility into the training of a model
  – Allows us to monitor the convergence of the algorithms during training
  – Can stop the iterations when convergence is good enough.
Thank You
Chester Chen [email protected]
LinkedIn: https://www.linkedin.com/in/chester-chen-3205992
SlideShare: http://www.slideshare.net/ChesterChen/presentations
Demo video: https://youtu.be/DkbYNYQhrao