Sri Ambati – CEO, 0xdata at MLconf ATL

H2O.ai: Open Source Machine Learning for Intelligent Applications. H2O.ai Machine Intelligence.

description

"Comparing Variable Importance from Ensemble and Deep Learning Methods for AdTech Data" Variable Importance brings interpretability to popular black box modeling techniques. In this talk we study performance of popular ensemble techniques like Random Forest, Gradient Boosting with GLM. We observe certain traits that get magnified by non-linear techniques like Deep Learning that are otherwise missed by GBM or Random Forest. We describe Open Source Scalable Machine Learning package, H2O which through ease-of-use and speed makes comparisons and picking best-of-breed and ensembles more natural. H2O's implementation of these algorithms tracks popular open source and text book implementations closely.

Transcript of Sri Ambati – CEO, 0xdata at MLconf ATL

Page 1: Sri Ambati – CEO, 0xdata at MLconf ATL

H2O.ai

Open Source Machine Learning for Intelligent Applications

H2O.ai Machine Intelligence

Page 2: Sri Ambati – CEO, 0xdata at MLconf ATL

Time is the only non-renewable resource

Speed Matters!

Page 3: Sri Ambati – CEO, 0xdata at MLconf ATL

Law of Large Numbers

Sampling

Page 4: Sri Ambati – CEO, 0xdata at MLconf ATL

Per Node: 2M Row ingest/sec, 50M Row Regression/sec, 750M Row Aggregates/sec

On Premise, On / Off Hadoop, On EC2

Tableau, R, JSON, Scala, Java, Python, Excel

SDK / API

H2O Prediction Engine

Ensembles, Deep learning, Cluster, Regression, Classify, Trees, Boosting, Forests, Solvers, Gradients

Nano Fast Scoring Engine

Memory Manager, Columnar Compression

Query Processor, R-engine

In-Mem Map Reduce, Distributed fork/join

HDFS, S3, SQL, NoSQL

Page 5: Sri Ambati – CEO, 0xdata at MLconf ATL

Infrastructure

Parallelism

Data Parallel: Chunking Express!

Algorithm Parallel: Parallel code blocks

Math Parallelism: ADMM, HogWild

Distribution

Zero-Serialization: endian wars have ended

Page 6: Sri Ambati – CEO, 0xdata at MLconf ATL

Scalable Machine Learning for Smarter Applications

H2O.ai

Page 7: Sri Ambati – CEO, 0xdata at MLconf ATL

Programmable Internet

Page 8: Sri Ambati – CEO, 0xdata at MLconf ATL

Programmable Devices

Page 9: Sri Ambati – CEO, 0xdata at MLconf ATL

AdSense Sense

Page 10: Sri Ambati – CEO, 0xdata at MLconf ATL

Correlation Causality

Page 11: Sri Ambati – CEO, 0xdata at MLconf ATL

Data

Sensors, Devices

Semi-structured data. JSON. High velocity. High dimensions.

Events. Signals. Time series.

Page 12: Sri Ambati – CEO, 0xdata at MLconf ATL

Streaming Data

Scoring from prediction

Anomaly and Outliers Detection

Unsupervised Learning

Historical Data

Page 13: Sri Ambati – CEO, 0xdata at MLconf ATL

Streaming Data

Anomaly and Outliers Detection

Historical Data

model

Scoring from prediction

Page 14: Sri Ambati – CEO, 0xdata at MLconf ATL

Streaming Data

Clustering / Unsupervised Learning

Historical Data

model

Scoring from prediction
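
The flow sketched on Pages 12 to 14 (fit a model on historical data, then score streaming events against it) might look roughly like the following. This is a minimal sketch, not H2O code: the z-score "model", the function names, the threshold, and the thermostat-style readings are all assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_model(historical):
    """Learn a very simple 'model' (mean and standard deviation) from historical data."""
    return {"mean": mean(historical), "std": stdev(historical)}


def score(model, value):
    """Score one event: its distance from the historical mean, in standard deviations."""
    return abs(value - model["mean"]) / model["std"]


def detect_anomalies(model, stream, threshold=3.0):
    """Yield (value, score) for every streamed event whose score exceeds the threshold."""
    for value in stream:
        s = score(model, value)
        if s > threshold:
            yield value, s


# Hypothetical sensor readings: train offline, then score events as they arrive.
historical = [20.1, 19.8, 20.4, 20.0, 19.9, 20.3, 20.2, 19.7]
model = fit_model(historical)
stream = [20.0, 20.5, 35.0, 19.9]
anomalies = list(detect_anomalies(model, stream))
```

The point of the split is that the expensive step (fitting) happens once on historical data, while scoring each streamed event is a cheap lookup against the fitted model.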

Page 15: Sri Ambati – CEO, 0xdata at MLconf ATL

https://developer.nest.com/documentation/api-reference/devices

Page 16: Sri Ambati – CEO, 0xdata at MLconf ATL

Take Models to Production in Java

Page 17: Sri Ambati – CEO, 0xdata at MLconf ATL

Onset of Rita

Page 18: Sri Ambati – CEO, 0xdata at MLconf ATL

Common ensemble techniques

Bayesian Classifiers: ensembles of all hypotheses in the hypothesis space.

Bagging: each model votes with equal weight, and each model is trained on a randomly drawn subset of the training set.

Boosting: incrementally builds an ensemble, training each new model to emphasize the instances that earlier models misclassified.
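
To make the difference between bagging and boosting concrete, here is a small sketch using one-dimensional decision stumps. It is a toy under stated assumptions, not H2O's implementation: the stump learner, the AdaBoost-style reweighting chosen for the boosting half, and the eight-point dataset are all illustrative.

```python
import math
import random


def train_stump(xs, ys, weights=None):
    """Fit a 1-D decision stump minimizing weighted error. Labels are -1 or +1."""
    n = len(xs)
    weights = weights or [1.0 / n] * n
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (polarity if x >= t else -polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best  # (weighted error, threshold, polarity)


def stump_predict(stump, x):
    _, t, polarity = stump
    return polarity if x >= t else -polarity


def bagging(xs, ys, n_models=25, seed=0):
    """Bagging: each stump sees a bootstrap sample; all stumps vote with equal weight."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: 1 if sum(stump_predict(m, x) for m in models) >= 0 else -1


def boosting(xs, ys, n_rounds=10):
    """AdaBoost-style boosting: each round upweights the points the last stump got wrong."""
    n = len(xs)
    weights = [1.0 / n] * n
    models = []
    for _ in range(n_rounds):
        stump = train_stump(xs, ys, weights)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clip to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # this stump's vote weight
        models.append((alpha, stump))
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return lambda x: 1 if sum(a * stump_predict(s, x) for a, s in models) >= 0 else -1


# Toy data: negative class below 5, positive class from 5 upward.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
bagged = bagging(xs, ys)
boosted = boosting(xs, ys)
```

In the bagged ensemble every stump counts the same; in the boosted ensemble each stump carries its own weight `alpha`, and the training weights shift toward whatever the previous stump got wrong.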

Page 19: Sri Ambati – CEO, 0xdata at MLconf ATL

Page 20: Sri Ambati – CEO, 0xdata at MLconf ATL

Page 21: Sri Ambati – CEO, 0xdata at MLconf ATL

Gradient Boosting Machine

Page 22: Sri Ambati – CEO, 0xdata at MLconf ATL

Page 23: Sri Ambati – CEO, 0xdata at MLconf ATL

Page 24: Sri Ambati – CEO, 0xdata at MLconf ATL

Variable Importance Comparison

Random Forest, 50 trees

Gradient Boosting Machine, 50 trees
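
One way to put variable importance from different model families on a common footing is permutation importance: shuffle one column and measure how much accuracy drops. The sketch below is a model-agnostic stand-in, not necessarily how the importances on this slide were computed. The function names, the toy click data, and the hand-written rule standing in for a trained model are all assumptions.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Importance of feature j = average drop in accuracy after shuffling column j."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical ad-click rows: [age, hour_of_day]; label 1 means "clicked".
rows = [[22, 9], [25, 14], [27, 20], [29, 23], [31, 8], [38, 13], [45, 19], [52, 22]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def model(row):
    """Stand-in for a trained classifier: predicts a click when age >= 30."""
    return int(row[0] >= 30)


importances = permutation_importance(model, rows, labels)
```

Because the same function accepts any callable `model`, the same measurement could be applied to a forest, a boosted ensemble, a GLM, or a neural network, which is what makes side-by-side comparison possible.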

Page 25: Sri Ambati – CEO, 0xdata at MLconf ATL

Generalized Linear Modeling – Variable Importance

GLM, Elastic Net (Binomial), Categorical expansion on Age

GLM, Elastic Net (Binomial)
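
Categorical expansion means treating Age as a set of levels rather than a single number, so a linear model gets one coefficient per level instead of one slope. A minimal sketch of that expansion follows; the helper name and the sample ages are assumptions, and this is not H2O's own encoding routine.

```python
def one_hot(values):
    """Expand a column into indicator columns, one per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical Age column with three distinct levels.
ages = [25, 40, 25, 33]
levels, expanded = one_hot(ages)
```

With the expanded columns, a GLM can assign a different effect to each age level, which a single numeric Age coefficient cannot express.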

Page 26: Sri Ambati – CEO, 0xdata at MLconf ATL

Variable Importance Comparison

Deep Learning (Tanh / 4-layer)

Deep Learning (Tanh / 3-layer)

Page 27: Sri Ambati – CEO, 0xdata at MLconf ATL

Every generation needs to invent its math.

Our data, our tools!

Page 28: Sri Ambati – CEO, 0xdata at MLconf ATL

Power-Law

Page 29: Sri Ambati – CEO, 0xdata at MLconf ATL

Code is incomplete without Community!

Open Source Matters!

Page 30: Sri Ambati – CEO, 0xdata at MLconf ATL
Page 31: Sri Ambati – CEO, 0xdata at MLconf ATL

Community

Committers: 30

Meetups: 90

in 12 months

Coverage

Conference Speakers

Curriculum: Stanford, MIT, CSU, SUNY, SJSU, Purdue

Page 32: Sri Ambati – CEO, 0xdata at MLconf ATL

Data Driven Decision Making is hard!

Courage Matters!

Page 33: Sri Ambati – CEO, 0xdata at MLconf ATL

Thanks Courtney, Nick & MLConf for bringing us to ATL

Page 34: Sri Ambati – CEO, 0xdata at MLconf ATL

Sparkling Water Application Life Cycle

Sparkling App jar file

spark-submit

Spark Master JVM

Spark Worker JVM (x3)

(1) User submits App to Spark cluster Master node

(2) App distributed to Spark cluster Worker nodes

(3) Spark Executor JVMs start for App

(4) H2O instance starts within each Executor JVM

(5) App’s Scala main program runs

Sparkling Water Cluster: three Spark Executor JVMs, each running an H2O instance

Page 35: Sri Ambati – CEO, 0xdata at MLconf ATL

Sparkling Water Data Distribution

Sparkling Water Cluster: three Spark Executor JVMs, each running H2O

Data Source (e.g. HDFS)

(1) Use Spark SQL to read data into a Spark RDD

(2) Convert Spark RDD to H2O RDD; H2O RDD is column-based and highly compressed

(Not shown) Run modeling and prediction workflows with H2O

(3) Convert H2O RDD (e.g. predictions) back to Spark RDD
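
The round trip in steps (1) to (3) turns row-oriented records into a compressed column-based form and back. The toy below illustrates why a columnar layout compresses well, using run-length encoding. It uses no Spark or H2O API, and the encoding, helper names, and sample rows are assumptions chosen for illustration rather than H2O's actual storage format.

```python
def to_columns(rows):
    """Transpose row-oriented records into one list per column."""
    return [list(col) for col in zip(*rows)]


def to_rows(columns):
    """Transpose columns back into row-oriented tuples."""
    return [tuple(r) for r in zip(*columns)]


def rle_encode(column):
    """Run-length encode a column as [value, count] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_decode(runs):
    """Expand [value, count] pairs back into the original column."""
    return [v for v, count in runs for _ in range(count)]


# Hypothetical click-stream rows: (country, action, bought).
rows = [("US", "click", 1), ("US", "click", 1), ("US", "browse", 0), ("DE", "browse", 0)]
columns = to_columns(rows)                      # row-based -> column-based
compressed = [rle_encode(c) for c in columns]   # compress each column separately
restored = to_rows([rle_decode(c) for c in compressed])  # back to rows
```

Values within a single column tend to repeat far more than values across a row, which is why compressing column by column pays off.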

H2O RDD

Spark RDD

Page 36: Sri Ambati – CEO, 0xdata at MLconf ATL

Standalone: H2O, HDFS

YARN: H2O, YARN, HDFS

H2O in MR: Hadoop MR, H2O, HDFS

Hortonworks, Cloudera, MapR, Intel

Page 37: Sri Ambati – CEO, 0xdata at MLconf ATL

H2O – The Killer-App for Spark

Sparkling Water

HDFS=DATA

MLlib, H2O, SQL, H2O RDD

In-Memory: Big Data, Columnar

ML: 100x faster Algos

R: CRAN, API, fast engine

API: Spark API, Java MM

Community: Devs, Data Science

Page 38: Sri Ambati – CEO, 0xdata at MLconf ATL

examples

Page 39: Sri Ambati – CEO, 0xdata at MLconf ATL
Page 40: Sri Ambati – CEO, 0xdata at MLconf ATL

Fraud / No-fraud: 1/1000 unbalanced

Click-Stream: Browse / Click / Buy

Page 41: Sri Ambati – CEO, 0xdata at MLconf ATL

Propensity Models: Merchants-to-Users

Lifetime Value of Customer

Pricing Engines