Spark + H2O = Machine Learning at scale
![Page 1: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/1.jpg)
Spark + H2O = Machine Learning at scale
Mateusz Dymczyk Software Engineer
Machine Learning with Spark Tokyo 30.06.2016
![Page 2: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/2.jpg)
Agenda
• Spark introduction
• H2O introduction
• Spark + H2O = Sparkling Water
• Demos
![Page 3: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/3.jpg)
Spark
![Page 4: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/4.jpg)
What is Spark?
• Fast and general engine for large-scale data processing
• APIs in Java, Scala, Python and R
• Batch and streaming APIs
• Based on immutable data structures
*http://spark.apache.org/
![Page 5: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/5.jpg)
Architecture
*http://spark.apache.org/docs/latest/cluster-overview.html
![Page 6: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/6.jpg)
Why Spark?
• In-memory computation (fast)
• Ability to cache (intermediate) results in memory (or on disk)
• Easy API
• Plenty of out-of-the-box libraries
*http://spark.apache.org/docs/latest/mllib-guide.html
![Page 7: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/7.jpg)
MLlib
• Spark’s machine learning library
• Supports:
  • basic statistics
  • classification and regression
  • clustering
  • dimensionality reduction
  • evaluations
  • …
*http://spark.apache.org/docs/latest/mllib-guide.html
![Page 8: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/8.jpg)
Linear regression demo
// imports
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Input CSV layout (three features, then the response):
// V1,V2,V3,R
// 1,1,1,0.1
// 1,0,1,0.5
val sc: SparkContext = initContext()
val data = sc.textFile(...)
val parsedData: RDD[LabeledPoint] = data.map { line =>
  // parsing
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction: Double = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
*http://spark.apache.org/docs/latest/mllib-linear-methods.html
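To make the demo's train-then-evaluate steps concrete without a cluster, here is a minimal plain-Scala sketch. It is not Spark's implementation: it uses full-batch gradient descent instead of SGD, and the `Point` type, the helper names and the two extra data rows are assumptions made for illustration. Only the first two rows and the `V1,V2,V3,R` layout come from the slide.

```scala
object LinearRegressionSketch {
  // Hypothetical stand-in for MLlib's LabeledPoint: a label plus its feature values.
  final case class Point(label: Double, features: Array[Double])

  // Parse one CSV line in the slide's layout "V1,V2,V3,R" (last column is the label).
  def parse(line: String): Point = {
    val cols = line.split(',').map(_.trim.toDouble)
    Point(cols.last, cols.init)
  }

  // Linear prediction: dot product of weights and features (no separate intercept).
  def predict(weights: Array[Double], features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum

  // Mean squared error over the data, as in the slide's evaluation step.
  def mse(weights: Array[Double], data: Seq[Point]): Double =
    data.map(p => math.pow(p.label - predict(weights, p.features), 2)).sum / data.size

  // Full-batch gradient descent on the mean squared error (assumption: not SGD).
  def train(data: Seq[Point], numIterations: Int, stepSize: Double): Array[Double] = {
    val dim = data.head.features.length
    var weights = Array.fill(dim)(0.0)
    for (_ <- 0 until numIterations) {
      val gradient = Array.fill(dim)(0.0)
      for (p <- data) {
        val error = predict(weights, p.features) - p.label
        for (j <- 0 until dim) gradient(j) += 2.0 * error * p.features(j) / data.size
      }
      weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g }
    }
    weights
  }

  def main(args: Array[String]): Unit = {
    // First two rows are from the slide; the last two are made up so the fit is well determined.
    val data = Seq("1,1,1,0.1", "1,0,1,0.5", "0,1,1,0.2", "0,0,1,0.6").map(parse)
    val model = train(data, numIterations = 1000, stepSize = 0.1)
    println(s"weights = ${model.mkString(", ")}")
    println(s"training MSE = ${mse(model, data)}")
  }
}
```

The step size matters: with a value that is too small for the data's scale, the weights barely move within the iteration budget, which is worth checking when the training error looks poor.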
![Page 12: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/12.jpg)
But…
• Are the implementations fast enough?
• Are the implementations accurate enough?
• What about other algorithms (e.g. where’s my Deep Learning)?
• What about visualisations?
*http://spark.apache.org/docs/latest/mllib-guide.html
![Page 13: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/13.jpg)
H2O
![Page 14: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/14.jpg)
What is H2O?
Math platform
• Open source
• Set of math and predictive algorithms
  • GLM, Random Forest, GBM, Deep Learning etc.
![Page 15: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/15.jpg)
API
• Written in high-performance Java, with a native Java API
• Drivers for R, Python, Excel, Tableau
• REST API
![Page 16: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/16.jpg)
Big data focused
• Highly parallelized and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Based on mutable data structures
![Page 17: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/17.jpg)
![Page 18: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/18.jpg)
![Page 19: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/19.jpg)
![Page 20: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/20.jpg)
FlowUI
• Notebook style open source interface for H2O
• Allows you to combine code execution, text, mathematics, plots, and rich media in a single document
![Page 21: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/21.jpg)
Why H2O?
• Speed and accuracy
• Algorithms/functionality not present in MLlib
• Access to FlowUI
• Possibility to generate dependency-free (Java) models
• Option to checkpoint models (though not all) and continue learning in the future
![Page 22: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/22.jpg)
Sparkling Water
![Page 23: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/23.jpg)
What is Sparkling Water?
• Framework integrating Spark and H2O
• Transparent use of H2O data structures and algorithms with Spark API and vice versa
![Page 24: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/24.jpg)
![Page 25: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/25.jpg)
![Page 26: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/26.jpg)
![Page 27: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/27.jpg)
Common use-cases
![Page 28: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/28.jpg)
Modelling
[Diagram: Data Source, ETL, Modelling, Predictions. Modelling: Deep learning, GBM, DRF, GLM, PCA, Ensembles etc.]
![Page 29: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/29.jpg)
ETL
[Diagram: Data Source, ETL, Modelling, Predictions]
![Page 30: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/30.jpg)
Stream Processing
[Diagram: Data Source, ETL, Modelling, Predictions; Data Stream handled by Spark Streaming/Storm/Flink etc.]
![Page 31: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/31.jpg)
Demo #1 Sparkling Shell
![Page 32: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/32.jpg)
REQUIREMENTS
• Windows/Linux/MacOS
• Java 1.7+
• Spark 1.3+
• SPARK_HOME set
INSTALLATION
1. http://www.h2o.ai/download
2. set MASTER env
3. unzip
4. run bin/sparkling-shell
![Page 33: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/33.jpg)
DEV FLOW
1. create a script file containing application code
2. run with bin/sparkling-shell -i script_name.script.scala
OR
1. run bin/sparkling-shell and simply use the REPL

import org.apache.spark.h2o._
// sc - SparkContext already provided by the shell
val h2oContext = new H2OContext(sc).start()
import h2oContext._
// Application logic
![Page 34: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/34.jpg)
Airline delay classification
Model predicting flight delays
[Diagram: ETL, Modelling, Predictions]
ETL:
• load data from CSVs
• use Spark APIs to filter and join data
Modelling:
• model using H2O’s GBM
*https://github.com/h2oai/sparkling-water/tree/master/examples/scripts
![Page 35: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/35.jpg)
Gradient Boosting Machines
• Classification and regression predictive modelling
• Ensemble of multiple weak models (usually decision trees)
• Iteratively fits new models to the residuals of the current ensemble (gradient boosting)
• Stochastic
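The "fit weak models to residuals" idea can be sketched in a few lines of plain Scala. This is an illustration under stated assumptions, not H2O's GBM: it handles one numeric feature, uses decision stumps as the weak models, squared-error loss and a fixed learning rate. The stochastic part (training each round on a random row subsample) is left out so the result is deterministic. All names and the toy data are hypothetical.

```scala
object GradientBoostingSketch {
  // Weak model: a decision stump on one feature. Predicts `left` if x <= threshold, else `right`.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(values: Seq[Double]): Double =
    if (values.isEmpty) 0.0 else values.sum / values.size

  // Try every observed x as a threshold and keep the stump with the lowest squared error.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val pairs = xs.zip(targets)
    val candidates = xs.distinct.sorted.map { t =>
      val (l, r) = pairs.partition { case (x, _) => x <= t }
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val sse = pairs.map { case (x, y) => math.pow(y - stump.predict(x), 2) }.sum
      (sse, stump)
    }
    candidates.minBy(_._1)._2
  }

  // The ensemble: a constant base prediction plus shrunken stump predictions.
  final case class Model(base: Double, learningRate: Double, stumps: Seq[Stump]) {
    def predict(x: Double): Double = base + stumps.map(s => learningRate * s.predict(x)).sum
  }

  // Each round fits a stump to the current residuals, which are the negative
  // gradient of the squared-error loss, and appends it to the ensemble.
  def train(xs: Seq[Double], ys: Seq[Double], rounds: Int, learningRate: Double): Model =
    (0 until rounds).foldLeft(Model(mean(ys), learningRate, Vector.empty)) { (model, _) =>
      val residuals = xs.zip(ys).map { case (x, y) => y - model.predict(x) }
      model.copy(stumps = model.stumps :+ fitStump(xs, residuals))
    }

  def mse(model: Model, xs: Seq[Double], ys: Seq[Double]): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - model.predict(x), 2) })

  def main(args: Array[String]): Unit = {
    // Made-up toy data with a step-like shape.
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    val ys = Seq(1.0, 1.2, 2.9, 3.1, 5.0, 5.2)
    for (rounds <- Seq(0, 1, 10, 50)) {
      val model = train(xs, ys, rounds, learningRate = 0.5)
      println(s"rounds = $rounds, training MSE = ${mse(model, xs, ys)}")
    }
  }
}
```

Training error shrinks as rounds are added because each stump corrects part of what the ensemble still gets wrong. The learning rate trades speed of fit against the risk of overfitting.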
![Page 36: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/36.jpg)
Demo #2 FlowUI
![Page 37: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/37.jpg)
Demo #3 Standalone app
![Page 38: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/38.jpg)
REQUIREMENTS
• git
• editor of choice (IntelliJ/Eclipse support)
![Page 39: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/39.jpg)
BOOTSTRAP
1. git clone https://github.com/h2oai/h2o-droplets.git
2. cd h2o-droplets/sparkling-water-droplet
3. if using IntelliJ or Eclipse:
   – ./gradlew idea
   – ./gradlew eclipse
   – import project in the IDE
4. develop your app
![Page 40: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/40.jpg)
DEPLOYMENT
1. build: ./gradlew build shadowJar
2. submit with:
$SPARK_HOME/bin/spark-submit \
  --class water.droplets.SWTokyoDemo \
  --master local[*] \
  --packages ai.h2o:sparkling-water-core_2.10:1.6.5 \
  build/libs/sparkling-water-droplet-app.jar
![Page 41: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/41.jpg)
Open Source
• Github:
https://github.com/h2oai/sparkling-water
• JIRA:
http://jira.h2o.ai
• Google groups:
https://groups.google.com/forum/?hl=en#!forum/h2ostream
![Page 42: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/42.jpg)
More Info
• Documentation and booklets:
http://www.h2o.ai/docs/
• H2O.ai blog:
http://h2o.ai/blog
• H2O.ai YouTube channel:
https://www.youtube.com/user/0xdata
@h2oai
http://www.h2o.ai
![Page 44: Spark + H20 = Machine Learning at scale](https://reader034.fdocuments.in/reader034/viewer/2022052514/586f78591a28ab10258b6b65/html5/thumbnails/44.jpg)
Q&A