Sparkling Water Webinar October 29th, 2014
Uploaded by sri-ambati
![Page 1: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/1.jpg)
OCT 29th, 2014 WEBINAR
H2O
Fast. Scalable. Machine Learning For Smarter Applications
“Fluids are In, Animals are Out.”
~ Svetlana Sicular, Gartner
![Page 2: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/2.jpg)
Speakers
Joel Horwitz
Joel is a caffeine-, data-, and laughter-driven product strategist. He is an active community member, having founded Bay Area Analytics. He tweets regularly as @JSHorwitz, blogs at joelshorwitz.com, and speaks at industry events. His eagerness to learn and lend a helping hand makes him an invaluable asset to 0xdata.
Michal Malohlava
Michal is a geek, a developer, and a Java, Linux, and programming-languages enthusiast who has been developing software for over 10 years. He obtained his PhD from Charles University in Prague in 2012 and was a post-doc at Purdue University.
H2O World: Register at http://www.0xdata.com/h2o-world
![Page 3: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/3.jpg)
Time is the only non-renewable resource.
Speed Matters!
![Page 4: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/4.jpg)
Today
• Why Are We Here
• Who We Are
• How Do We Do It
• Who We Work With
• What We Believe
• Demo and Q&A
![Page 5: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/5.jpg)
A New Interpretation of Moore’s Law
“Like the physical universe, the digital universe is large - by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe - the data we create and copy annually - will reach 44 zettabytes, or 44 trillion gigabytes.”
- IDC 2014
![Page 6: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/6.jpg)
An Evolving Market to Meet the Demand
[Diagram labels: RDBMS, MPP, Business Intelligence, Data Science, Machine Learning, Distributed File System, H2O]
![Page 7: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/7.jpg)
Decreasing Cost of Data is Driving Demand
[Chart: $/GB (-50% every 18 months) and Algorithm Sophistication, plotted over 1970 to 2020; H2O is labeled on the chart]
![Page 8: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/8.jpg)
H2O is the First Dedicated Machine Learning Open Source Platform
H2O is for application developers and analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math, a high performance parallel architecture, and unrivaled ease of use.
![Page 9: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/9.jpg)
Who Are We?
H2O
![Page 10: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/10.jpg)
H2O Awards and Accolades
• Top R Project of useR! Conference 2014
• Fortune Big Data All-Stars 2014, Arno Candel
• 100+ Meetups
• 6000+ Users
![Page 11: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/11.jpg)
H2O is Built for Speed and Scale
• Open Source
• REST API
• Native R Support
• NanoFast™ Scoring Engine
• Sophisticated Algorithms
![Page 12: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/12.jpg)
H2O Seamlessly Integrates with Your Workflow
• 20X Faster Imports and 3X Compression w/ .hex format.
• 4 Billion Row Regression in Seconds.
• Deploy in POJO or with our REST API
![Page 13: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/13.jpg)
Code is incomplete without Community!
Open Source Drives Innovation.
![Page 14: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/14.jpg)
Law of Large Numbers Triumphs!
![Page 15: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/15.jpg)
Every Generation Needs to Invent Its Own Math.
ML is the new SQL!
![Page 16: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/16.jpg)
What do our customers say about us?
“The platform can generate Jar files to deploy models into production. This alone is a milestone!” - Hassan Namarvar, ShareThis
“I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.” - Nachum Shacham, PayPal
“I analyzed 1 million rows training set, fitting a logistic regression with elastic penalty, and doing a grid search on parameters with 10-fold cross validation for each parameter combination… doing this analysis was a breeze… orders of magnitude faster than R.” - Antonio Molins, Netflix
“Never have we had such a quick, simple, scalable and cost effective deployment solution for predictive modeling.” - Lou Carvalheira, Cisco
![Page 17: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/17.jpg)
Advertising
Better Conversions
Conversion, Reach, Brand, ROI
Overall, I would say that the H2O platform is the most elegant open source in-memory ML engine.
~ Hassan Namarvar, Principal Data Scientist
![Page 18: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/18.jpg)
Fraud
Better Detections
Purchase
I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.
~ Nachum Shacham, Principal Data Scientist
Shopping Theft Passwords
![Page 19: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/19.jpg)
Marketing
Better Spend
ROI
H2O has established a new equilibrium point for performance, accuracy and cost for statistics and machine learning.
~ Lou Carvalheira, Principal Data Scientist
Network Segments Measure
![Page 20: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/20.jpg)
Select Customers Powered by H2O
![Page 21: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/21.jpg)
Sparkling Water
“Killer App for Spark”
@hexadata & @mmalohlava present
![Page 22: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/22.jpg)
Memory efficient
Performance of computation
Machine learning algorithms
Parser, GUI, R-interface
User-friendly API
Large and active community
Platform components - SQL
Multitenancy
![Page 23: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/23.jpg)
Sparkling Water
RDD (immutable world) + DataFrame (mutable world)
![Page 24: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/24.jpg)
Sparkling Water
RDD DataFrame
![Page 25: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/25.jpg)
Sparkling Water
Provides
Transparent integration into Spark ecosystem
Pure H2ORDD encapsulating H2O DataFrame
Transparent use of H2O data structures and algorithms with Spark API
Excels in Spark workflows requiring advanced Machine Learning algorithms
![Page 26: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/26.jpg)
Sparkling Water Design
[Diagram: spark-submit; Sparkling App jar file (contains application and Sparkling Water classes); Spark Master JVM; three Spark Worker JVMs; Sparkling Water Cluster of three Spark Executor JVMs, each running H2O]
![Page 27: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/27.jpg)
Data Distribution
[Diagram: data source (e.g. HDFS); Sparkling Water Cluster of three Spark Executor JVMs, each running H2O; H2ORDD; Spark RDD]
RDDs and DataFrames share the same memory space
![Page 28: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/28.jpg)
Demo Time
![Page 29: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/29.jpg)
Flight delay prediction
“Build a model using weather and flight data to predict
delays of flights arriving to Chicago O’Hare International
Airport”
![Page 30: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/30.jpg)
Example Outline
Load & Parse CSV data from 2 data sources
Use Spark API to filter data, do SQL query for join
Create a regression model
Use model for delay prediction
Plot residual plot from R
![Page 31: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/31.jpg)
Sparkling Water Requirements
Linux or Mac OS X
Oracle Java 1.7
![Page 32: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/32.jpg)
Download
http://0xdata.com/download/
![Page 33: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/33.jpg)
Install and Launch
Unpack zip file
and
Point SPARK_HOME to your Spark installation
and
Launch h2o-examples/sparkling-shell
![Page 34: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/34.jpg)
What is Sparkling Shell?
Standard spark-shell
With additional Sparkling Water classes
export MASTER="local-cluster[3,2,1024]"   # Spark Master address
spark-shell --jars sparkling-water.jar    # JAR containing Sparkling Water
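In Spark, a master of the form `local-cluster[N,cores,memory]` asks for a local pseudo-cluster of N workers, each with the given number of cores and memory in MB, so the `MASTER` value above requests 3 workers with 2 cores and 1024 MB each. The sketch below only decodes such a string to make the three fields explicit; `LocalCluster` and `MasterUrl` are hypothetical helper names for illustration and are not part of Spark or Sparkling Water.

```scala
// Hypothetical holder for the three fields of a local-cluster master string.
case class LocalCluster(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int)

object MasterUrl {
  // Matches "local-cluster[N,cores,memory]" with optional spaces around the numbers.
  private val Pattern =
    """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

  // Returns None for any other master form (e.g. "local[4]" or "spark://host:7077").
  def parse(master: String): Option[LocalCluster] = master match {
    case Pattern(n, c, m) => Some(LocalCluster(n.toInt, c.toInt, m.toInt))
    case _                => None
  }
}
```

For the value used in this deck, `MasterUrl.parse("local-cluster[3,2,1024]")` yields three workers, which lines up with the H2O cloud size of 3 requested later in the demo.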
![Page 35: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/35.jpg)
Let's play with Sparkling Shell…
![Page 36: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/36.jpg)
Create H2O Client
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._   // Demo specific classes
// sc: regular Spark context provided by Spark shell; 3: size of demanded H2O cloud
val h2oContext = new H2OContext(sc).start(3)
import h2oContext._   // Contains implicit utility functions
![Page 37: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/37.jpg)
Is Spark Running?
Go to http://localhost:4040
![Page 38: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/38.jpg)
Is H2O running?
Go to http://localhost:54321/steam/index.html
![Page 39: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/39.jpg)
Load Data #1
Load weather data into RDD
val weatherDataFile = "examples/smalldata/weather.csv"
// Regular Spark API
val wrawdata = sc.textFile(weatherDataFile,3).cache()
val weatherTable = wrawdata.map(_.split(","))
  .map(row => WeatherParse(row))   // Ad-hoc parser
  .filter(!_.isWrongRow())
![Page 40: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/40.jpg)
Weather Data
case class Weather( val Year   : Option[Int],
                    val Month  : Option[Int],
                    val Day    : Option[Int],
                    val TmaxF  : Option[Int],    // Max temperature in F
                    val TminF  : Option[Int],    // Min temperature in F
                    val TmeanF : Option[Float],  // Mean temperature in F
                    val PrcpIn : Option[Float],  // Precipitation (inches)
                    val SnowIn : Option[Float],  // Snow (inches)
                    val CDD    : Option[Float],  // Cooling Degree Day
                    val HDD    : Option[Float],  // Heating Degree Day
                    val GDD    : Option[Float])  // Growing Degree Day
Simple POSO to hold one row of weather data
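The deck names an ad-hoc parser (`WeatherParse`) and an `isWrongRow` check but does not show them. A minimal sketch of how a split CSV row could be mapped onto `Option` fields follows. `WeatherRow` and `WeatherRowParser` are hypothetical stand-ins, not the classes shipped with the Sparkling Water examples, only the first four columns are modeled, and the rule that a row is "wrong" when its date is incomplete is an assumption.

```scala
import scala.util.Try

// Hypothetical, trimmed-down row type (the demo's Weather class has 11 fields).
case class WeatherRow(year: Option[Int], month: Option[Int],
                      day: Option[Int], tmaxF: Option[Int]) {
  // Assumed rule: a row without a complete date cannot be joined, so it is rejected.
  def isWrongRow: Boolean = year.isEmpty || month.isEmpty || day.isEmpty
}

object WeatherRowParser {
  // Missing columns and empty or non-numeric text all become None instead of throwing.
  private def intAt(row: Array[String], i: Int): Option[Int] =
    if (i < row.length) Try(row(i).trim.toInt).toOption else None

  def parse(row: Array[String]): WeatherRow =
    WeatherRow(intAt(row, 0), intAt(row, 1), intAt(row, 2), intAt(row, 3))
}
```

Using `Option` for every field lets a header line or a malformed record flow through `map` without an exception and be dropped afterwards by the `filter` step.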
![Page 41: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/41.jpg)
Load Data #2
Load flights data into H2O frame
import java.io.File
val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))
![Page 42: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/42.jpg)
Where is the data?
Go to http://localhost:54321/steam/index.html
![Page 43: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/43.jpg)
Use Spark API for Data Filtering
// Create RDD wrapper around DataFrame
// (a cheap wrapper around the H2O DataFrame)
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
// And use Spark RDD API directly (regular Spark RDD call)
val flightsToORD = airlinesTable
  .filter( f => f.Dest == Some("ORD") )
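The filter compares against `Some("ORD")` because the destination column is an `Option`, so rows with a missing destination never match. The behavior can be sketched on a plain Scala collection, without Spark; `Flight` and `FlightFilter` are hypothetical two-field stand-ins for the demo's `Airlines` type.

```scala
// Hypothetical stand-in for the demo's Airlines row, reduced to two fields.
case class Flight(dest: Option[String], arrDelay: Option[Int])

object FlightFilter {
  // Same predicate shape as on the slide: None is never equal to Some("ORD").
  def toORD(flights: Seq[Flight]): Seq[Flight] =
    flights.filter(f => f.dest == Some("ORD"))
}
```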
![Page 44: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/44.jpg)
Use Spark SQL to Join Data
import org.apache.spark.sql.SQLContext
// We need to create SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
flightsToORD.registerTempTable("FlightsToORD")
weatherTable.registerTempTable("WeatherORD")
![Page 45: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/45.jpg)
Join Data based on Flight Date
val bigTable = sql(
  """SELECT
    | f.Year,f.Month,f.DayofMonth,
    | f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
    | f.UniqueCarrier,f.FlightNum,f.TailNum,
    | f.Origin,f.Distance,
    | w.TmaxF,w.TminF,w.TmeanF,
    | w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
    | f.ArrDelay
    | FROM FlightsToORD f
    | JOIN WeatherORD w
    | ON f.Year=w.Year AND f.Month=w.Month
    | AND f.DayofMonth=w.Day""".stripMargin)
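What the query computes is an inner join on the date: each flight is paired with the weather record for the same year, month, and day, and flights without a matching weather record are dropped. A minimal plain-Scala sketch of that logic, assuming simplified hypothetical row types (`FlightRow`, `WeatherDay`) with non-optional fields:

```scala
// Hypothetical minimal rows; the real tables carry many more columns.
case class FlightRow(year: Int, month: Int, dayOfMonth: Int, arrDelay: Int)
case class WeatherDay(year: Int, month: Int, day: Int, tmaxF: Int)

object DateJoin {
  // Inner join on (year, month, day), mirroring the ON clause of the SQL query.
  def join(flights: Seq[FlightRow],
           weather: Seq[WeatherDay]): Seq[(FlightRow, WeatherDay)] = {
    // Index the weather rows by date so each flight needs only one lookup.
    val byDate = weather.groupBy(w => (w.year, w.month, w.day))
    for {
      f <- flights
      w <- byDate.getOrElse((f.year, f.month, f.dayOfMonth), Seq.empty)
    } yield (f, w)
  }
}
```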
![Page 46: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/46.jpg)
Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._train = bigTable   // Result of SQL query
dlParams._response_column = 'ArrDelay
dlParams._classification = false
// Create a new model builder
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get   // Blocking call
![Page 47: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/47.jpg)
Make a prediction
// Use model to score data
val prediction = dlModel.score(result)('predict)
// Collect predicted values via RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map ( _.result.getOrElse("NaN") )
![Page 48: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/48.jpg)
Generate Residuals Plot
# Import H2O library and initialize H2O client
library(h2o)
h = h2o.init()
# Fetch prediction and actual data, use remembered keys (references of data)
pred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")
act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select right columns
predDelay = pred$predict
actDelay = act$ArrDelay
# Make sure that number of rows is same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals
compare = cbind(
  as.data.frame(actDelay$ArrDelay),
  as.data.frame(residuals$predict))
plot( compare[,1:2] )
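The residual step of the R snippet is simple arithmetic: predicted delay minus actual delay, row by row, after checking that both columns have the same length. The same computation can be sketched in Scala, the language of the rest of the demo, on plain local sequences. `Residuals` is a hypothetical helper name, and the sketch assumes the values have already been collected to the driver.

```scala
object Residuals {
  // residual = predicted - actual, in the same order as the R snippet.
  def compute(predicted: Seq[Double], actual: Seq[Double]): Seq[Double] = {
    // Mirrors the nrow(actDelay) == nrow(predDelay) check from the R code.
    require(predicted.length == actual.length, "row counts must match")
    predicted.zip(actual).map { case (p, a) => p - a }
  }
}
```

A positive residual means the model predicted a longer delay than was observed, and a negative one means it underestimated the delay.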
![Page 49: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/49.jpg)
More info
Check out the 0xdata Blog for Sparkling Water tutorials
http://0xdata.com/blog/
Check out the 0xdata YouTube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/0xdata/sparkling-water
![Page 50: Sparkling Water Webinar October 29th, 2014](https://reader038.fdocuments.in/reader038/viewer/2022102817/55d578d7bb61ebb32f8b45fd/html5/thumbnails/50.jpg)
Learn more about H2O at 0xdata.com
or
neo> for r in sparkling-water; do git clone "git@github.com:0xdata/$r.git"; done
Thank you!
Follow us at @hexadata