Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC
-
Upload
sessionsevents -
Category
Technology
-
view
959 -
download
1
Transcript of Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Cost Effectively Scaling Machine Learning Systems in the Cloud
Agenda:
● Background on me, Sailthru & Sightlines (mercifully short)
● Cost effective resources in the AWS cloud
● Efficient(ish) application design
● Easy maintenance and evolution
● Machine learning details
Online, Offline, Mobile, Email, Socialwww.sailthru.com
@jeremystanCapitalism
Idealism
IndirectValue
DirectValue
Graduate studentMath2000
ConsultantFinance2005 CTO
Ad Tech2010
Chief Data ScientistMar Tech2015
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Sailthru
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Sightlines
Analytics- Segmentation- Forecasting
Personalization- Recommendations- Discounting
Optimization- Frequency- Channel
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Requirements
1. ~5 million users per client
2. JSON formatted user data, siloed across clients
3. Predict varying outcomesnormal, poisson, binomial, quantile, ...
4. Update models & predictions daily
5. Only really care about predictive performance
6. Scale to 1,000+ clients
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Our Cost Effective Scaling Strategy
1. Get really cheap computing power
2. Make it work really, really hard
3. Optimize apps for ease of evolution
4. Setup identical A/B environments
Iterate aggressively based on data:✓ Features
✓ Efficiency
✓ Scale
10x
3x
0.6x =
0.5x
= 9x
JSON to Features
GBM in Memory
1 x0.2xHalf our
processingHalf our
processing
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Cost Effective Resources in the AWS Cloud
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Cost Effective r3.8xlarge32 vCPU, 244GB RAM
Resource Utilization
30%(typical cloud)
10%(data center)
90%(highly efficient)
CostPer
Hour
$2.80(on demand)
$1.76(reserved 1yr)
$1.05(reserved 3yr)
$0.28(spot instance)
Cloud$9.80
Data Center$10.50
Spot + Mesos + Relay$0.30
30x more cost efficient!
($10.50 = $1.05 / 10%)
Online, Offline, Mobile, Email, Socialwww.sailthru.com
AWS Spot Instances
Your bid
What you pay
All instances died!
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Mesos
81 “slaves” 4 availability zones2 instance types
1,360 CPUs10TB of RAM
94% utilized
$11.90 per hour$104,244 per year
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Mesos + Marathon
Zone 1 Zone 2 Zone 3 Zone 4
Mesos Slave(16 CPU)
Mesos Slave(8 CPU)
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Mesos + Marathon
Zone 1 Zone 2 Zone 3 Zone 4
Mesos Slave(16 CPU)
Mesos Slave(8 CPU)
Mesos Master
App AApp BApp C
Queue Size
Applications must be:
● Distributed to be scheduled wherever Mesos wants
● Fine Grained to maximize utilization in Mesos
● Idempotent to handle duplicate runs in case network
is partitioned
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Mesos + Marathon
Zone 1 Zone 2 Zone 3 Zone 4
Mesos Slave(16 CPU)
Mesos Slave(8 CPU)
Mesos Master
App AApp BApp C
Queue Size
Time
Available Mesos
CPUJiffies
Doesn’t work for apps with highly variable load
Idle
User
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Mesos + Relay
Available Mesos
CPUJiffies
User
Idle
Available Mesos
CPUJiffies
User
Idle
Relay.MesosAuto-scaler for distributed applicationsgithub.com/sailthru/relay.mesos
● Allocates resources based on queue size● Wraps applications inside Mesos slaves● Can significantly improve cluster utilization
Before Relay
AfterRelay
App AApp BApp C
Queue Size
Mesos Master
Time
After Relay
Relay.Mesos
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Efficient(ish) Application Design
Online, Offline, Mobile, Email, Socialwww.sailthru.com
StolosDistributed task dependency managergithub.com/sailthru/stolos
● Directed acyclic graph● Parameterizable templates● Handles queueing● Ensures idempotent
Application Pipeline (simplified)
Assembly GBMs
Analyze
Models
JSON
SailthruUserAPI
Predict Upload Mongo
Reports
Actually much more complex● ~1,000 clients● ~10 models● ~10 steps● ~100 sub-tasks
ETL
Mongo
Online, Offline, Mobile, Email, Socialwww.sailthru.com
shard 1
shard 1,000
Sampling Strategy
JSON
Day1
Mongo
S3JSON sharded on hash(user)
Online, Offline, Mobile, Email, Socialwww.sailthru.com
shard 1
shard 1,000
Sampling Strategy
JSON
DayN
Mongo
Day1
S3
Online, Offline, Mobile, Email, Socialwww.sailthru.com
DayN
Day1
shard 1
shard 1,000
Sampling Strategy
JSON
Consistent 0.1% of data to a Mesos Slave CPU
Mongo
S3
Online, Offline, Mobile, Email, Socialwww.sailthru.com
DayN
Day1
shard 1
shard 1,000
Sampling Strategy
JSON
Apps sample more as needed
Mongo
S3
Online, Offline, Mobile, Email, Socialwww.sailthru.com
User Profile JSON Data
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Each User Radically Different
User
Feature
???
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Each User Radically Different
User
Feature
tidyjsonTurn JSON into data framesgithub.com/sailthru/tidyjson
● Arbitrary JSON into R data.frames● Guarantees deterministic structure● Seamless with dplyr and %>%
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Why GBMs?
● Predict varying outcomes normal, poisson, binomial, quantile, …
● Flexible enough to capture non-linearity & complex interactionsno need to feature engineer for each client
● Minimal number of hyper-parametersdepth, shrinkage, number of trees
● Robust to missing valuesno need to impute
Online, Offline, Mobile, Email, Socialwww.sailthru.com
+ … + αK *
Distributing a GBM
α1 *
tree 1 tree 2 tree 3 tree K
+ α2 * + α3 *
Online, Offline, Mobile, Email, Socialwww.sailthru.com
+ … + αK *
Distributing a GBM
α1 *
tree 1 tree 2 tree 3 tree K
1. Across the sumGives bagging, not boosting (iterative) => less accurate
+ α2 * + α3 *
Zone 1 Zone 2 Zone 3 Zone 4
Mesos Slaves
Online, Offline, Mobile, Email, Socialwww.sailthru.com
+ … + αK *
Distributing a GBM
α1 *
tree 1 tree 2 tree 3 tree K
1. Across the sumGives bagging, not boosting (iterative) => less accurate
2. Within each tree (Spark MLLib, H20)A lot of overhead and coordination=> not efficient for many small GBMs
+ α2 * + α3 *
Zone 1 Zone 2 Zone 3 Zone 4
Mesos Slaves
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Distributing a GBM
1. Across the sumGives bagging, not boosting (iterative) => less accurate
2. Within each tree (Spark MLLib, H20)A lot of overhead and coordination=> not efficient for many small GBMs
3. Across the GBMs50,000 GBMs to build=> each can be built independently
Zone 1 Zone 2 Zone 3 Zone 4
Mesos Slaves
+ … + αK *α1 *
tree 1 tree 2 tree 3 tree K
+ α2 * + α3 * + … + αK *α1 *
tree 1 tree 2 tree 3 tree K
+ α2 * + α3 *…
GBM 1 GBM 50,00050,000 = 1,000 clients * 10 models * 5-fold CV
✓
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Grid Search
+ … + αK *α1 *
tree 1 tree 2 tree 3 tree K
+ α2 * + α3 *
For each client & model:
1. Grid search over:
a. Depth: size of trees
b. Shrinkage: λ “learning rate” for {αi}
2. Cross-validate for optimal # of trees
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Easy Maintenance & Evolution
Online, Offline, Mobile, Email, Socialwww.sailthru.com
Tools Used
RModeling
PythonETL
AWS S3Batch
Applications
State
Frameworks
ZookeeperCoordination
SparkMap Reduce
MarathonRunning Apps
Cluster
MesosSharing
Maintenance
ELKLog Mgmt
ConsulDiscovery
Configuration
ChefAutomation
LibratoMonitoring
SensuAlerting
AsgardAuto Scaling
AWS SpotCompute
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
JSON
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
JSON
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
JSON
v1.0.0
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
JSON
v1.0.0
v1.0.1
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
JSON
v1.0.0
v1.0.1
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
JSON
v1.0.0
v1.0.1
v1.0.2
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
✓ Check monitoring
JSON
v1.0.0
v1.0.1
v1.0.2
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
✓ Check monitoring✓ Check logging
JSON
v1.0.0
v1.0.1
v1.0.2
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
✓ Check monitoring✓ Check logging✓ Check performance
JSON
v1.0.0
v1.0.1
v1.0.2
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
✓ Check monitoring✓ Check logging✓ Check performance
JSON
v1.0.0
v1.0.1
v1.0.2
Online, Offline, Mobile, Email, Socialwww.sailthru.com
How we Iterate A
B
SailthruUserAPI
Mongo
● Tools● Configuration● Applications
✓ Check monitoring✓ Check logging✓ Check performance
JSON
v1.0.0
v1.0.1
v1.0.2