Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Post on 16-Apr-2017


Ken Sanford and Fonda Ingram
July 25, 2016

Open Source or Closed Source

Trending Now?

2015 SAS vs. R Survey Results – Burtch Works

Trending Now?

• Buy vs. Build
  o Businesses are “Engineering/Technology/Innovation companies”
• Build yourself and potentially sell
• Companies are hesitant to make a LARGE upfront capital investment before they see proven value

Analytics POV Lifecycle

Discovery
• Determine Business Objective
• Determine Modeling Goal

Data Prep - Understanding
• Data Collection
• Data Exploration
• Data Quality
• Data Transformation

Model Building
• Build Models
• Model Assessment

Evaluation
• Model Performance
• Success criteria

Deployment
• Monitoring and maintenance
• Model Management

Why Open Source?

Why use Open Source?

• Reduce vendor dependency
  o Run the program for any purpose
  o Customize the program - use cutting edge analytics NOW
• Reduce cost
  o Freedom does not imply FREE
• Responsive and Competitive
  o Innovate in Real Time
  o Rebuild in-house expertise and regain control

Why use H2O?

• Capital Investment upfront is minimal
  o Download H2O – use it and continue to learn; once you mature, we can help you
• Algorithms and Accuracy
  o Distributed implementation of cutting edge ML algorithms
• Building components that touch all facets of the Analytics POV Lifecycle
• Flexible API available in R, Python, Scala, REST/JSON
• Community driven

Why do customers hesitate?

• Difficult to convert all SAS software to open source and keep my sanity..
• I have been a SAS programmer for years..
• What I have is working – why change..
• I need a throat to choke if something goes wrong..
• I like long product install times..
• No one gets fired for buying SAS..

What do I need to start?

• Migration Strategy
  o Analytical Tool? (R or Python..)
  o Analytics Platform? (Hadoop, S3, etc.)
• Start small and get your feet wet (with H2O)
• May need to create a hybrid environment

How to get started?

• Existing Use Case
  o Review data requirements
• Get your data into H2O
  o Select existing model to migrate
• Identify algorithms – start small
• Transition should be gradual

Language Translation

SAS | H2O-R
dataset | dataframe
observation | row
variable | column
BY-Group | By function
if else | h2o.ifelse
. (missing value) | NA (missing value)
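The vocabulary mapping above can be sketched in H2O-R as follows. This is a minimal illustration and not part of the original slides: it assumes a local H2O cluster can be started with h2o.init(), it loads R's built-in iris data with as.h2o() (so the columns are named Sepal.Length, Species, and so on, unlike the deck's CSV file), and it assumes h2o.group_by is the "By function" the table has in mind.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data stands in for the deck's CSV file.
fr <- as.h2o(iris)

# SAS dataset -> H2O dataframe: observations are rows, variables are columns.
print(dim(fr))        # number of rows and columns
print(colnames(fr))   # variable (column) names

# SAS BY-group processing -> group-by: mean sepal length per species.
by_species <- h2o.group_by(data = fr, by = "Species", mean("Sepal.Length"))
print(by_species)

# SAS if/else -> h2o.ifelse: flag long sepals with 1, all others with 0.
# The 5.8 threshold is arbitrary and only for illustration.
fr$long_sepal <- h2o.ifelse(fr[, "Sepal.Length"] > 5.8, 1, 0)

# SAS "." (missing) -> NA: count the missing values in one column.
n_missing <- sum(is.na(fr[, "Sepal.Length"]))
print(n_missing)
```

The built-in iris data has no missing values, so the last count is only a demonstration of the is.na idiom.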

How to Import data?

Export SAS dataset to CSV file

    proc export data=work.Wheaderdataset
      outfile='/folders/myfolders/wheader.csv'
      dbms=csv
      replace;
    run;

Import to H2O

    library(h2o)
    h2o.init()
    h2odf <- h2o.importFile(path = "h2o/data/iris_wheader.csv")
    stopifnot(nrow(h2odf) == 150)
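After an import it is worth checking that the frame arrived with the expected shape and column types, much as one would inspect a SAS dataset before using it. The sketch below is an illustration only and not from the slides: it assumes a local H2O cluster, and it writes R's built-in iris data to a temporary CSV file as a stand-in for a file exported from SAS.

```r
library(h2o)
h2o.init()

# Assumption: a temporary CSV built from R's iris data stands in for
# a file exported from SAS.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

fr <- h2o.importFile(path = csv_path)

# Shape and names of the imported frame.
stopifnot(nrow(fr) == 150, ncol(fr) == 5)
print(colnames(fr))

# Per-column type, missing count and basic statistics.
print(h2o.describe(fr))

# Make sure the label column is categorical before modeling.
fr$Species <- as.factor(fr$Species)
species_levels <- h2o.levels(fr$Species)
print(species_levels)
```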

Munging – How to slice columns?

Slicing Columns in a SAS dataset

    /* Slice 1 column SepalLength - keep or drop */
    data iris2;
      set sashelp.iris;
      keep SepalLength;
    run;

    /* Slice all columns except Species – keep or drop */
    data iris2;
      set sashelp.iris;
      keep PetalLength PetalWidth SepalLength SepalWidth;
    run;

Slicing Columns in an H2O dataframe

    # slice 1 column by name
    c1_1 <- h2odf[, "sepal_len"]

    # slice cols by vector of names
    cols_1 <- h2odf[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]

Munging – How to slice rows?

Slicing Rows in a SAS dataset

    /* Slicing obs 15 from a SAS dataset */
    data subset1;
      set sashelp.iris (firstobs=15 obs=15);
    run;

    /* Slicing a range of obs from a SAS dataset */
    data subset2;
      set sashelp.iris (firstobs=25 obs=49);
    run;

Slicing Rows from an H2O dataframe

    # slice 1 row by index
    c1 <- h2odf[15,]

    # slice a range of rows
    c1_1 <- h2odf[25:49,]

Munging – How to slice rows?

Slicing Rows in a SAS dataset

    /* Slicing with a value */
    data subset3;
      set sashelp.iris;
      if SepalLength > 50;
    run;

    /* Filter out missing values from a SAS dataset */
    data subset3;
      set sashelp.iris;
      if SepalLength = . then delete;
    run;

Slicing Rows from an H2O dataframe

    # slice with a boolean mask
    mask <- h2odf[,"sepal_len"] < 4.4
    cols <- h2odf[mask,]

    # filter out missing values
    mask <- is.na(h2odf[,"sepal_len"])
    cols <- h2odf[!mask,]

Munging – How to replace values?

Replacing values in a SAS dataset

    /* Replace a single numerical datum */
    data iris;
      obsnum = 15;
      modify iris point=obsnum;
      SepalWidth = 2;
      replace;
      stop;
    run;

    /* Replace a whole column */
    data iris;
      modify iris;
      SepalWidth = SepalWidth * 3;
      replace;
    run;

Replacing values in an H2O dataframe

    # replace a single numerical datum
    h2odf[15,3] <- 2

    # replace a whole column
    h2odf[,1] <- 3*h2odf[,1]

Munging – How to replace values?

Replacing values in a SAS dataset

    /* replacement with if */
    data iris1;
      modify iris1;
      if SepalLength < 4.4 then SepalLength = 22;
      replace;
    run;

    /* Replace missing values with 0 */
    data iris1;
      modify iris1;
      if SepalLength = . then SepalLength = 0;
      replace;
    run;

Replacing values in an H2O dataframe

    # replacement with ifelse
    h2odf[,"sepal_len"] <- h2o.ifelse(h2odf[,"sepal_len"] < 4.4, 22, h2odf[,"sepal_len"])

    # replace missing values with 0
    h2odf[is.na(h2odf[,"sepal_len"]), "sepal_len"] <- 0

Algorithms on H2O

Supervised Learning

Statistical Analysis
• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes

Ensembles
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations

Deep Neural Networks
• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
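As a rough sketch of how one of the supervised algorithms above is called from R, the example below trains a small Gradient Boosting Machine. It is an illustration rather than anything shown in the slides: it assumes a local H2O cluster, uses R's built-in iris data in place of a real training set, and picks ntrees, the split ratio and the seed arbitrarily. h2o.glm, h2o.randomForest and h2o.deeplearning accept the same x, y and training_frame arguments.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data stands in for a real training set.
fr <- as.h2o(iris)

# Hold out roughly 20% of the rows for evaluation.
splits <- h2o.splitFrame(data = fr, ratios = 0.8, seed = 42)
train <- splits[[1]]
test  <- splits[[2]]

# Fit a small gradient boosting model that predicts Species.
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
model <- h2o.gbm(x = predictors, y = "Species",
                 training_frame = train, ntrees = 20, seed = 42)

# Score the held-out rows and inspect the multiclass metrics.
perf <- h2o.performance(model, newdata = test)
print(perf)

# One prediction row per held-out observation.
pred <- h2o.predict(model, newdata = test)
print(head(pred))
```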

Algorithms on H2O

Unsupervised Learning

Clustering
• K-means: Partitions observations into k clusters/groups of the same spatial size

Dimensionality Reduction
• Principal Component Analysis: Linearly transforms correlated variables to independent components
• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Anomaly Detection
• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning
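The unsupervised algorithms listed above follow the same calling pattern, except that no response column is given. The sketch below is an illustration only: it assumes a local H2O cluster and R's built-in iris data, and the choices of k = 3 clusters and 2 principal components are arbitrary.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data, numeric columns only.
fr <- as.h2o(iris)
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# K-means: partition the rows into k = 3 clusters.
km <- h2o.kmeans(training_frame = fr, x = features, k = 3,
                 standardize = TRUE, seed = 42)
clusters <- h2o.predict(km, newdata = fr)   # one cluster id per row
print(h2o.table(clusters))                  # rows per cluster

# PCA: project the standardized features onto 2 components.
pca <- h2o.prcomp(training_frame = fr, x = features, k = 2,
                  transform = "STANDARDIZE")
scores <- h2o.predict(pca, newdata = fr)    # component scores per row
print(head(scores))
```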

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm | SAS | R | H2O
GLM | proc glm, proc reg, proc logistic, proc genmod | glmnet, lm | h2o.glm
PCA | proc princomp | princomp | h2o.prcomp
Factor Analysis | proc factor | factanal, factor.pa |
SVD | proc hptmine, proc hpsvm | svd | h2o.svd
Clustering | proc fastclus, proc hpclus | kmeans | h2o.kmeans
Random Forest | proc hpforest (EM Node) | randomForest | h2o.randomForest

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm | SAS | R | H2O
Gradient Boosting | proc arboretum | gbm | h2o.gbm
Neural Networks | proc hpneural (EM Node), autoneural (EM node), proc neural, proc dmneural | nnet | h2o.deeplearning
Ensemble (Stacking) | proc ensemble (EM Node) | | h2o.ensemble, h2o.metalearn, h2o.stack (in dev)
GLRM (Cluster Analysis, Recommendation Engines) | NA | NA | h2o.glrm
Kernel Density Estimation | proc kde | density |

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm | SAS | R | H2O
Variable Clustering | proc varclus | varclus |
ARIMA | proc arima | arima |
Autoregressive Models | proc autoreg | ar |
Correlation | proc corr | cor |
Survival Models | proc phreg | coxph | Not currently available -- h2o.coxph
Linear Mixed Effects Models | proc mixed, glimmix | lme |

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm | SAS | R | H2O
Summary | proc summary, proc means | summary, mean, median, max/min, quantile, variance |
Grouping/ Sort/ Rank | proc sort, proc rank | aggregate, ddply, order, data.table |
Exploratory Data Analysis | proc univariate, proc hpbin | moments (package), hist, ecdf, qqnorm, pnorm |

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm | SAS | R | H2O
Plots | gplot, sgplot | ggplot2, ggvis, rgl, htmlwidgets |
Sampling | proc surveyselect | runif, sample | h2o.runif
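One common use of h2o.runif from the Sampling row is a random train/test split. The sketch below is an illustration under stated assumptions and not from the slides: it assumes a local H2O cluster, uses R's built-in iris data, and the 0.7 cut-off and seed are arbitrary.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data stands in for a real dataset.
fr <- as.h2o(iris)

# One uniform random number in [0, 1] per row; the seed makes it repeatable.
u <- h2o.runif(fr, seed = 42)

# Roughly 70% of the rows go to train, the rest to test.
train <- fr[u < 0.7, ]
test  <- fr[u >= 0.7, ]

print(c(train = nrow(train), test = nrow(test)))
```

Because the split depends on random draws, the two parts only approximate the 70/30 ratio, but every row lands in exactly one of them.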

Questions