Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC

44

Transcript of Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Cost Effectively Scaling Machine Learning Systems in the Cloud

Agenda:

● Background on me, Sailthru & Sightlines (mercifully short)

● Cost effective resources in the AWS cloud

● Efficient(ish) application design

● Easy maintenance and evolution

● Machine learning details

Online, Offline, Mobile, Email, Socialwww.sailthru.com

@jeremystanCapitalism

Idealism

IndirectValue

DirectValue

Graduate studentMath2000

ConsultantFinance2005 CTO

Ad Tech2010

Chief Data ScientistMar Tech2015

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Sailthru

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Sightlines

Analytics- Segmentation- Forecasting

Personalization- Recommendations- Discounting

Optimization- Frequency- Channel

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Requirements

1. ~5 million users per client

2. JSON formatted user data, siloed across clients

3. Predict varying outcomesnormal, poisson, binomial, quantile, ...

4. Update models & predictions daily

5. Only really care about predictive performance

6. Scale to 1,000+ clients

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Our Cost Effective Scaling Strategy

1. Get really cheap computing power

2. Make it work really, really hard

3. Optimize apps for ease of evolution

4. Setup identical A/B environments

Iterate aggressively based on data:✓ Features

✓ Efficiency

✓ Scale

10x

3x

0.6x =

0.5x

= 9x

JSON to Features

GBM in Memory

1 x0.2xHalf our

processingHalf our

processing

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Cost Effective Resources in the AWS Cloud

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Cost Effective r3.8xlarge32 vCPU, 244GB RAM

Resource Utilization

30%(typical cloud)

10%(data center)

90%(highly efficient)

CostPer

Hour

$2.80(on demand)

$1.76(reserved 1yr)

$1.05(reserved 3yr)

$0.28(spot instance)

Cloud$9.80

Data Center$10.50

Spot + Mesos + Relay$0.30

30x more cost efficient!

($10.50 = $1.05 / 10%)

Online, Offline, Mobile, Email, Socialwww.sailthru.com

AWS Spot Instances

Your bid

What you pay

All instances died!

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Mesos

81 “slaves” 4 availability zones2 instance types

1,360 CPUs10TB of RAM

94% utilized

$11.90 per hour$104,244 per year

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Mesos + Marathon

Zone 1 Zone 2 Zone 3 Zone 4

Mesos Slave(16 CPU)

Mesos Slave(8 CPU)

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Mesos + Marathon

Zone 1 Zone 2 Zone 3 Zone 4

Mesos Slave(16 CPU)

Mesos Slave(8 CPU)

Mesos Master

App AApp BApp C

Queue Size

Applications must be:

● Distributed to be scheduled wherever Mesos wants

● Fine Grained to maximize utilization in Mesos

● Idempotent to handle duplicate runs in case network

is partitioned

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Mesos + Marathon

Zone 1 Zone 2 Zone 3 Zone 4

Mesos Slave(16 CPU)

Mesos Slave(8 CPU)

Mesos Master

App AApp BApp C

Queue Size

Time

Available Mesos

CPUJiffies

Doesn’t work for apps with highly variable load

Idle

User

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Mesos + Relay

Available Mesos

CPUJiffies

User

Idle

Available Mesos

CPUJiffies

User

Idle

Relay.MesosAuto-scaler for distributed applicationsgithub.com/sailthru/relay.mesos

● Allocates resources based on queue size● Wraps applications inside Mesos slaves● Can significantly improve cluster utilization

Before Relay

AfterRelay

App AApp BApp C

Queue Size

Mesos Master

Time

After Relay

Relay.Mesos

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Efficient(ish) Application Design

Online, Offline, Mobile, Email, Socialwww.sailthru.com

StolosDistributed task dependency managergithub.com/sailthru/stolos

● Directed acyclic graph● Parameterizable templates● Handles queueing● Ensures idempotent

Application Pipeline (simplified)

Assembly GBMs

Analyze

Models

JSON

SailthruUserAPI

Predict Upload Mongo

Reports

Actually much more complex● ~1,000 clients● ~10 models● ~10 steps● ~100 sub-tasks

ETL

Mongo

Online, Offline, Mobile, Email, Socialwww.sailthru.com

shard 1

shard 1,000

Sampling Strategy

JSON

Day1

Mongo

S3JSON sharded on hash(user)

Online, Offline, Mobile, Email, Socialwww.sailthru.com

shard 1

shard 1,000

Sampling Strategy

JSON

DayN

Mongo

Day1

S3

Online, Offline, Mobile, Email, Socialwww.sailthru.com

DayN

Day1

shard 1

shard 1,000

Sampling Strategy

JSON

Consistent 0.1% of data to a Mesos Slave CPU

Mongo

S3

Online, Offline, Mobile, Email, Socialwww.sailthru.com

DayN

Day1

shard 1

shard 1,000

Sampling Strategy

JSON

Apps sample more as needed

Mongo

S3

Online, Offline, Mobile, Email, Socialwww.sailthru.com

User Profile JSON Data

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Each User Radically Different

User

Feature

???

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Each User Radically Different

User

Feature

tidyjsonTurn JSON into data framesgithub.com/sailthru/tidyjson

● Arbitrary JSON into R data.frames● Guarantees deterministic structure● Seamless with dplyr and %>%

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Why GBMs?

● Predict varying outcomes normal, poisson, binomial, quantile, …

● Flexible enough to capture non-linearity & complex interactionsno need to feature engineer for each client

● Minimal number of hyper-parametersdepth, shrinkage, number of trees

● Robust to missing valuesno need to impute

Online, Offline, Mobile, Email, Socialwww.sailthru.com

+ … + αK *

Distributing a GBM

α1 *

tree 1 tree 2 tree 3 tree K

+ α2 * + α3 *

Online, Offline, Mobile, Email, Socialwww.sailthru.com

+ … + αK *

Distributing a GBM

α1 *

tree 1 tree 2 tree 3 tree K

1. Across the sumGives bagging, not boosting (iterative) => less accurate

+ α2 * + α3 *

Zone 1 Zone 2 Zone 3 Zone 4

Mesos Slaves

Online, Offline, Mobile, Email, Socialwww.sailthru.com

+ … + αK *

Distributing a GBM

α1 *

tree 1 tree 2 tree 3 tree K

1. Across the sumGives bagging, not boosting (iterative) => less accurate

2. Within each tree (Spark MLLib, H20)A lot of overhead and coordination=> not efficient for many small GBMs

+ α2 * + α3 *

Zone 1 Zone 2 Zone 3 Zone 4

Mesos Slaves

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Distributing a GBM

1. Across the sumGives bagging, not boosting (iterative) => less accurate

2. Within each tree (Spark MLLib, H20)A lot of overhead and coordination=> not efficient for many small GBMs

3. Across the GBMs50,000 GBMs to build=> each can be built independently

Zone 1 Zone 2 Zone 3 Zone 4

Mesos Slaves

+ … + αK *α1 *

tree 1 tree 2 tree 3 tree K

+ α2 * + α3 * + … + αK *α1 *

tree 1 tree 2 tree 3 tree K

+ α2 * + α3 *…

GBM 1 GBM 50,00050,000 = 1,000 clients * 10 models * 5-fold CV

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Grid Search

+ … + αK *α1 *

tree 1 tree 2 tree 3 tree K

+ α2 * + α3 *

For each client & model:

1. Grid search over:

a. Depth: size of trees

b. Shrinkage: λ “learning rate” for {αi}

2. Cross-validate for optimal # of trees

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Easy Maintenance & Evolution

Online, Offline, Mobile, Email, Socialwww.sailthru.com

Tools Used

RModeling

PythonETL

AWS S3Batch

Applications

State

Frameworks

ZookeeperCoordination

SparkMap Reduce

MarathonRunning Apps

Cluster

MesosSharing

Maintenance

ELKLog Mgmt

ConsulDiscovery

Configuration

ChefAutomation

LibratoMonitoring

SensuAlerting

AsgardAuto Scaling

AWS SpotCompute

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

JSON

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

JSON

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

JSON

v1.0.0

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

JSON

v1.0.0

v1.0.1

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

JSON

v1.0.0

v1.0.1

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

JSON

v1.0.0

v1.0.1

v1.0.2

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

✓ Check monitoring

JSON

v1.0.0

v1.0.1

v1.0.2

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

✓ Check monitoring✓ Check logging

JSON

v1.0.0

v1.0.1

v1.0.2

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

✓ Check monitoring✓ Check logging✓ Check performance

JSON

v1.0.0

v1.0.1

v1.0.2

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

✓ Check monitoring✓ Check logging✓ Check performance

JSON

v1.0.0

v1.0.1

v1.0.2

Online, Offline, Mobile, Email, Socialwww.sailthru.com

How we Iterate A

B

SailthruUserAPI

Mongo

● Tools● Configuration● Applications

✓ Check monitoring✓ Check logging✓ Check performance

JSON

v1.0.0

v1.0.1

v1.0.2

Thank You! Our team:

Divyanshu Vats Alex Gaudio Andras Kerekes Jeremy Stanley