Welcome to H2O World
Sri & H2O Team
Data Science is a Team Sport!
Culture Matters!
Open Source Breeds Courage!
Community Matters!
Every generation needs to make its own history!
Code is conversation with Customer!
Great Product Matters!
Accuracy with Speed and Scale
HDFS
S3
SQL
NoSQL
CLASSIFICATION REGRESSION
FEATURE ENGINEERING
IN-MEMORY
MAP REDUCE / FORK JOIN
COLUMNAR COMPRESSION
DEEP LEARNING
PCA, GLM, COX
RANDOM FOREST / GBM ENSEMBLES
FAST MODELING ENGINE
STREAMING NANO FAST JAVA SCORING ENGINES
MATRIX FACTORIZATION CLUSTERING
MUNGING
What's New in H2O-3
H2O-3 vs H2O-2:
• Total rewrite of the core in Java: built for data scientists AND developers!
• Unique Flow GUI (Notebook and more)
• REST Schemas for self-describing API for all methods/algos
• New R client: cleaner, faster
• Sparkling Water: H2O is the Killer App on Spark
• Fully featured Python client (incl. Pipelines, scikit-learn look and feel)
• New expression parser & backend execution engine for R, Py, Flow
• New Algo: GLRM - Generalized Low Rank Modeling (unifies PCA, K-Means, Matrix Factorization, Imputation, etc.)
• New Solvers for GLM: Coordinate Descent and L-BFGS
continued…
What's New in H2O-3
Additional New Features:
• Grid Search for all Algorithms (R/Py/Flow)
• N-fold Cross-Validation for all Algorithms
• Early Stopping (check for convergence) for GBM/DRF/DL
• Stochastic GBM (row/col sampling)
• Distributions (Gaussian, Laplace, Poisson, Gamma, Tweedie) for GBM/DL
• Improved sparse data handling for DL
• Multi-node auto-tuning for DL
• Multinomial GLM
• Scalable Scatter Plots for numeric and categorical data
• Big-Big Joins ("distributed data.table") - in QA
…and many more!
Convergence-Based Early Stopping in H2O
Before: training runs too long, but at least overwrite_with_best_model=true prevents overfitting (it returns the model with the lowest validation error)
Now: specify an additional convergence criterion, e.g. stopping_rounds=5, stopping_metric="MSE", stopping_tolerance=1e-3, to stop as soon as the moving average (length 5) of the validation MSE does not improve by at least 0.1% for 5 consecutive scoring events
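The stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration of the idea on the slide, not H2O's actual implementation: the function names `should_stop` and `first_stop_event` are hypothetical, the metric is assumed to be positive and lower-is-better (like MSE), and the choice to compare the best of the last `stopping_rounds` moving averages against the moving average just before them is an assumption.

```python
def should_stop(history, stopping_rounds=5, stopping_tolerance=1e-3):
    """Return True once the moving average of a lower-is-better metric has stalled.

    history: validation metric values, one per scoring event (oldest first).
    Hypothetical helper; parameter names mirror the slide, the logic is a sketch.
    """
    k = stopping_rounds
    # Need one reference moving average plus k newer ones: 2k scoring events.
    if k <= 0 or len(history) < 2 * k:
        return False
    # Moving averages of length k over the scoring history.
    avgs = [sum(history[i:i + k]) / k for i in range(len(history) - k + 1)]
    reference = avgs[-(k + 1)]      # moving average just before the last k
    best_recent = min(avgs[-k:])    # best of the last k moving averages
    # Stop if the relative improvement over the reference is below the tolerance.
    return best_recent > reference * (1.0 - stopping_tolerance)


def first_stop_event(history, **kwargs):
    """Return the 1-based scoring event at which training would stop, or None."""
    for n in range(1, len(history) + 1):
        if should_stop(history[:n], **kwargs):
            return n
    return None


if __name__ == "__main__":
    # Validation MSE that drops quickly and then plateaus.
    curve = [1.0, 0.5, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2]
    print(first_stop_event(curve, stopping_rounds=2, stopping_tolerance=1e-3))
```

With a smaller `stopping_rounds` the rule reacts sooner to a plateau but is more sensitive to noise in the validation metric; a larger value smooths the noise at the cost of extra scoring events before stopping.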
[Figure: training error and validation error vs. training time / epochs for Deep Learning with Higgs data; the best model, as returned by overwrite_with_best_model=true, is marked.]
Use Flow to inspect the model
Early stopping saves tons of time
What do these stickers mean?
I have H2O installed
I have Python installed
I have R installed
I have the H2O World data sets
Pick up stickers or get install help at the information booth