Sri Ambati – CEO, 0xdata at MLconf ATL

Posted on 18-Nov-2014

description

"Comparing Variable Importance from Ensemble and Deep Learning Methods for AdTech Data": Variable importance brings interpretability to popular black-box modeling techniques. In this talk we study the performance of popular ensemble techniques such as Random Forest and Gradient Boosting alongside GLM. We observe certain traits that are magnified by non-linear techniques like Deep Learning but are otherwise missed by GBM or Random Forest. We describe H2O, an open source scalable machine learning package whose ease of use and speed make comparing models, picking best-of-breed, and building ensembles more natural. H2O's implementations of these algorithms track popular open source and textbook implementations closely.

Transcript of Sri Ambati – CEO, 0xdata at MLconf ATL

H2O.ai: Open Source Machine Learning for Intelligent Applications

H2O.ai Machine Intelligence

Time is the only non-renewable resource

Speed Matters!

Law of Large Numbers

Sampling

Per node:

2M rows ingested / sec

50M rows regression / sec

750M rows aggregated / sec

On premise. On / off Hadoop. On EC2.

[Architecture diagram]

SDK / API: Tableau, R, JSON, Scala, Java, Python, Excel

H2O Prediction Engine: Regression, Classify, Trees, Boosting, Forests, Solvers, Gradients, Cluster, Deep learning, ensembles

Nano Fast Scoring Engine

Memory Manager, Columnar Compression

Query Processor, R-engine

In-Mem Map Reduce, Distributed fork/join

HDFS, S3, SQL, NoSQL

Infrastructure

Parallelism

Data Parallel: Chunking Express!

Algorithm Parallel: Parallel Code blocks

Math Parallelism: ADMM, HogWild

Distribution

Zero-Serialization: endian wars have ended

Scalable Machine Learning for Smarter Applications

H2O.ai

Programmable Internet

Programmable Devices

AdSense Sense

Correlation Causality

Data

Sensors, Devices

Semi-structured data. JSON. High velocity. High dimensions.

Events. Signals. Time series.

Streaming Data

Scoring from prediction

Anomaly and Outliers Detection

Unsupervised Learning

Historical Data

Streaming Data

Anomaly and Outliers Detection

Historical Data

model

Scoring from prediction

Streaming Data

Clustering / Unsupervised Learning

Historical Data

model

Scoring from prediction
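The flow sketched in these slides (fit a model on historical data, then score streaming events and flag outliers) can be illustrated with a minimal Python sketch. The z-score model below is an assumption chosen for brevity, not the method used in the talk or in H2O, and the names `fit_model`, `score`, and `detect_anomalies` are hypothetical.

```python
from statistics import mean, stdev

def fit_model(historical):
    """Summarize historical readings as a (mean, standard deviation) pair."""
    return mean(historical), stdev(historical)

def score(model, value):
    """Return how many standard deviations the value lies from the mean."""
    mu, sigma = model
    return abs(value - mu) / sigma

def detect_anomalies(model, stream, threshold=3.0):
    """Yield (value, is_anomaly) for each event as it arrives."""
    for value in stream:
        yield value, score(model, value) > threshold

# Historical data trains the model once; streaming events are scored one by one.
historical = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
model = fit_model(historical)
flags = list(detect_anomalies(model, [20.0, 20.4, 35.0]))
```

Only the last streaming value is flagged, since it lies far outside the spread seen in the historical readings. A real deployment would swap the z-score for a trained model and keep the same train-once, score-per-event structure.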

https://developer.nest.com/documentation/api-reference/devices

Take Models to Production in Java

Onset of Rita

Common ensemble techniques

Bayesian Classifiers: Ensembles of all hypotheses in hypothesis-space.

Bagging: Each model votes with equal weight. Bagging trains each model on a randomly drawn subset of the training data.

Boosting: Incrementally builds an ensemble, training each new model to emphasize the instances that earlier models misclassified.
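As a rough illustration of bagging as described on this slide, the Python sketch below trains several simple models on bootstrap samples and lets them vote with equal weight. The decision-stump learner and the names `train_stump` and `bagging` are assumptions made for the example. This is not H2O's Random Forest implementation.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-feature threshold classifier on (x, label) rows."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(rows, n_models=25, seed=0):
    """Train each model on a bootstrap sample; predict by equal-weight vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Randomly drawn subset: sample rows with replacement.
        sample = [rng.choice(rows) for _ in rows]
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Toy data: the label is 1 when x is greater than 5.
data = [(x, int(x > 5)) for x in range(11)]
predict = bagging(data)
```

Boosting differs in that models are trained one after another rather than independently, with each new model weighted toward the examples the ensemble so far gets wrong.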

Gradient Boosting Machine

Variable Importance Comparison

Random Forest, 50 trees

Gradient Boosting Machine, 50 trees

Generalized Linear Modeling – Variable Importance

GLM, Elastic Net (Binomial): Categorical expansion on Age

GLM, Elastic Net (Binomial)

Variable Importance Comparison

Deep Learning (Tanh / 4-layer)

Deep Learning (Tanh / 3-layer)
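The charts on these slides are not reproduced in the transcript, and the slides do not say how each algorithm's importance scores were computed. One model-agnostic way to compare variable importance across different model types is permutation importance: shuffle one column at a time and measure how much accuracy drops. The Python sketch below is a swapped-in illustration under that assumption, with hypothetical names and toy data, not the procedure used in the talk.

```python
import random

def accuracy(model, rows):
    """Fraction of (features, label) rows the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def permutation_importance(model, rows, n_features, seed=0):
    """Accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows)
    importances = []
    for j in range(n_features):
        column = [x[j] for x, _ in rows]
        rng.shuffle(column)
        shuffled = [(x[:j] + (value,) + x[j + 1:], y)
                    for (x, y), value in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled))
    return importances

# Toy data: the label depends only on feature 0; feature 1 is noise.
rng = random.Random(1)
rows = []
for _ in range(200):
    signal, noise = rng.random(), rng.random()
    rows.append(((signal, noise), int(signal > 0.5)))

# Stand-in for any fitted model: a rule that reads feature 0 only.
model = lambda x: int(x[0] > 0.5)
imp = permutation_importance(model, rows, n_features=2)
```

Because the same procedure works for any predictor, the resulting scores can be laid side by side for tree ensembles, GLM, and neural networks, which is the kind of comparison these slides present.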

Every generation needs to invent its math.

Our data, our tools!

Power-Law

Code is incomplete without Community!

Open Source Matters!

Community

Committers: 30

Meetups: 90

in 12 months

Coverage

Conference Speakers

Curriculum: Stanford, MIT, CSU, SUNY, SJSU, Purdue

Data Driven Decision Making is hard!

Courage Matters!

Thanks Courtney, Nick & MLConf for bringing us to ATL

Sparkling Water Application Life Cycle

[Diagram: a Sparkling App jar file is passed via spark-submit to the Spark Master JVM, then to the Spark Worker JVMs; the Sparkling Water Cluster consists of Spark Executor JVMs, each running H2O]

(1) User submits App to Spark cluster Master node

(2) App distributed to Spark cluster Worker nodes

(3) Spark Executor JVMs start for App

(4) H2O instance starts within each Executor JVM

(5) App’s Scala main program runs

Sparkling Water Data Distribution

[Diagram: Data Source (e.g. HDFS), Spark RDD and H2O RDD inside the Spark Executor JVMs of the Sparkling Water Cluster, each running H2O]

(1) Use Spark SQL to read data into a Spark RDD

(2) Convert Spark RDD to H2O RDD; H2O RDD is column-based and highly compressed

(Not shown) Run modeling and prediction workflows with H2O

(3) Convert H2O RDD (e.g. predictions) back to Spark RDD

[Diagram: three stacks. Standalone: H2O on HDFS. YARN: H2O on YARN on HDFS. H2O in MR: H2O on Hadoop MR on HDFS.]

HortonWorks, Cloudera, MapR, Intel

H2O – The Killer-App for Spark

Sparkling Water

HDFS=DATA

MLlib, H2O, SQL, H2ORDD

In-Memory: Big Data, Columnar

ML: 100x faster Algos

R: CRAN, API, fast engine

API: Spark API, Java MM

Community: Devs, Data Science

examples

Fraud / No-fraud: 1/1000 unbalanced

Click-Stream Browse / Click / Buy
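To see why a 1/1000 class imbalance like the fraud example needs care, consider a baseline that always predicts "no fraud". The Python sketch below uses made-up labels and hypothetical helper names. The inverse-frequency weighting shown is one common heuristic, not something taken from the talk.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, recall on the fraud class, labeled 1)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    fraud_total = sum(y_true)
    fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return correct / len(y_true), fraud_caught / fraud_total

def balanced_class_weights(y_true):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    n = len(y_true)
    counts = {label: y_true.count(label) for label in set(y_true)}
    return {label: n / (len(counts) * count) for label, count in counts.items()}

# One fraud case in a thousand transactions.
y_true = [1] + [0] * 999
always_no_fraud = [0] * 1000

accuracy, recall = evaluate(y_true, always_no_fraud)
weights = balanced_class_weights(y_true)
```

The baseline scores 99.9% accuracy while catching no fraud at all, so accuracy alone says little on data this skewed. Weighting the rare class more heavily, or sampling to rebalance, pushes a model to pay attention to the fraud cases.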

Propensity Models: Merchants-to-Users

Lifetime Value of Customer

Pricing Engines