From H2O to Steam - Dr. Bingwei Liu, Sr. Data Engineer, Aetna

Sr. Data Engineer

Bingwei Liu

Proprietary and Confidential

Engineering for H2O at Aetna

From H2O to Steam

©2017 Aetna Inc.

Agenda

H2O at Aetna

Enterprise Steam

Packrat

Demo

Q&A

History

2015: A small group of employees went to H2O World 2015.

2016: Enterprise support. Early adoption in various projects. A couple of production models.

2017: Enterprise Steam in production. Regular training and webinar series. A boost in usage.

2018: The journey continues...

We learn and grow together with H2O team

Training

• In person

• In-depth

Webinar

• Frequent

• On demand

Contribution

• Add new features to H2O and Steam

Improvement

• Secure impersonation in Kerberized cluster

Analytics Pipeline

Credit: Aleksandar Lazarevic

Engineering Pipeline

ETL → Dataset → Modeling → Export Model → Production

Hive

Spark

Pig

Hive Table

H2O Cluster

YARN

RStudio

Java App

Hive UDF

REST API

Streaming

Java POJO

Use Cases

Problems: ER revisit, Overpayment, Customer Experience

Algorithms: GBM, RF

Dataset Size: > 50x

To Use H2O on YARN

RStudio: Connect to H2O Cluster

Browser: Access Flow URL

Linux CLI: Download H2O, Create H2O cluster

Danger!

A Simple Fix

• Create a username and password for Flow

• Create a properties file with the username and the MD5 hash of the password

• Pass the properties file to the hadoop jar command
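
As a rough illustration of the steps above, the following Python sketch builds a one-line properties file. The `username: MD5:<hex digest>` layout is assumed to be the Jetty-style realm format used by H2O's hash-file login. The helper names, the file name `realm.properties`, and the `-hash_login` / `-login_conf` flags mentioned in the comment are assumptions that should be checked against the H2O version in use.

```python
import hashlib
import tempfile
from pathlib import Path


def realm_entry(username: str, password: str) -> str:
    """Return one properties-file line: the username and the MD5 of the password."""
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    # Assumed Jetty-style layout: "<user>: MD5:<hex digest>"
    return f"{username}: MD5:{digest}"


def write_realm_file(path: Path, username: str, password: str) -> Path:
    """Write a single-user properties file (hypothetical helper)."""
    path.write_text(realm_entry(username, password) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Write into a temporary directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        out = write_realm_file(Path(tmp) / "realm.properties", "alice", "password")
        print(out.read_text(encoding="utf-8"), end="")
    # The file would then be passed to the hadoop jar command, for example
    # (flags assumed; node count, memory, and output directory are placeholders):
    #   hadoop jar h2odriver.jar -n 3 -mapperXmx 10g \
    #       -hash_login -login_conf realm.properties -output <hdfs_output_dir>
```

Only the hash is stored in the file, so the plain-text password never has to appear on the command line or in the cluster launch script.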

Enterprise Steam

Customized Profile

• YARN queue

• Resource limitations

User Experience

• Easy to use Web UI

• Don’t need UI if you don’t want to

User Identities

• Integrate with Active Directory Service

Security

• Secure impersonation in Hadoop

Enterprise Steam Web UI

User Authentication with AD

Launching a Cluster

Limit the size of H2O Cluster

How much data can each node fit?

YARN Queue integration

Support multiple versions of H2O

Secured H2O Flow

• Steam uses a proxy to secure Flow for each cluster

• https://steam.server.com:9999/username_clustername/flow/index.html

• Only the user who created the cluster is allowed to open Flow
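
The URL pattern above can be sketched as a small Python helper. The host and port are the placeholder values from the slide. The function name `flow_url` and the choice to percent-encode the path segment are illustrative assumptions, not part of Steam itself.

```python
from urllib.parse import quote

STEAM_BASE = "https://steam.server.com:9999"  # placeholder host and port from the slide


def flow_url(username: str, cluster_name: str, base: str = STEAM_BASE) -> str:
    """Build the proxied Flow URL following the username_clustername pattern."""
    # Percent-encode the segment so unusual characters cannot alter the path.
    segment = quote(f"{username}_{cluster_name}", safe="")
    return f"{base.rstrip('/')}/{segment}/flow/index.html"


if __name__ == "__main__":
    print(flow_url("alice", "demo"))
```

Because the username is part of the path, the proxy can compare it against the authenticated user before forwarding the request to the cluster.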

Connecting to an H2O Cluster

• After an H2O Cluster is created from Steam UI

Danger!

Encrypt User Password

• Use the digest package to encrypt/decrypt passwords

• The encrypted password is saved in a fixed location under the user’s home folder

• Load the encrypted password and decrypt it

User Experience

RStudio: Connect to H2O Cluster

Browser: Access Flow URL

Browser: Create H2O cluster

RStudio: Connect to H2O Cluster

Browser: Access Flow URL

Linux CLI: Download H2O, Create H2O cluster

RStudio: Connect to H2O Cluster

Browser: Access Flow URL

RStudio: Create H2O Cluster

RStudio: Login to Steam

No Browser Needed!

About Packrat

• Dependency management system for R

R Project: Isolated, Portable, Reproducible

Use Packrat

• Create an RStudio project with Packrat enabled

• Install the h2o and h2osteam packages, likely from source

• Take a snapshot using Packrat

• Add any R scripts the users need

• Bundle the R project

• Users:

o Run one shell script to unpack the project

o No more manual installation of packages

Use Packrat

• But…

o What if you need a specific version of h2o?

• Workaround:

o Create a local repository under the project folder

o Modify packrat.lock file for destination location

DEMO

Enterprise Steam

R Project

Questions
