A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel...
Transcript of A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel...
![Page 1: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/1.jpg)
Presenter: Alexey TumanovDaniel Crankshaw
Corey Zumar, Guilio Zhou, Xin WangMichael Franklin, Joseph Gonzalez, Ion Stoica
BDD/RISE Mini-Retreat 2017May 2, 2017
A Low-Latency Online Prediction Serving System
Clipper
![Page 2: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/2.jpg)
BigData
Complex Model
Training
Learning
![Page 3: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/3.jpg)
BigData
Complex Model
Training
Learning
![Page 4: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/4.jpg)
Learning Produces a Trained Model
“CAT”
Query Decision
Model
![Page 5: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/5.jpg)
BigData
Training
Learning
Application
Decision
Query
?
Serving
Model
![Page 6: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/6.jpg)
BigData
Training
Learning
Application
Decision
Query
Model
Prediction-Serving for interactive applicationsTimescale: ~10s of milliseconds
Serving
![Page 7: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/7.jpg)
Google Translate
Serving
82,000 GPUs running 24/7
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
140 billion words a day1Invented New Hardware!Tensor Processing Unit
(TPU)
![Page 8: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/8.jpg)
Prediction-Serving Raises New Challenges
![Page 9: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/9.jpg)
Prediction-Serving Challenges
Support low-latency, high-throughput serving workloads
???Create VWCaffe 9
Large and growing ecosystem of ML models and frameworks
![Page 10: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/10.jpg)
Models getting more complexØ 10s of GFLOPs [1]
Support low-latency, high-throughput serving workloads
Using specialized hardware for predictions
Deployed on critical pathØ Maintain SLOs under heavy load
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.
![Page 11: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/11.jpg)
Prediction-Serving Challenges
Support low-latency, high-throughput serving workloads
???Create VWCaffe 11
Large and growing ecosystem of ML models and frameworks
![Page 12: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/12.jpg)
Large and growing ecosystem of ML models and frameworks
???
ContentRec.
FraudDetection
PersonalAsst.
RoboticControl
MachineTranslation
Create VWCaffe 12
![Page 13: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/13.jpg)
Big Companies Build One-Off Systems
Problems:Ø Expensive to build and maintain
Ø Highly specialized and require ML and systems expertise
Ø Tightly-coupled model and applicationØ Difficult to change or update model
Ø Only supports single ML framework
![Page 14: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/14.jpg)
???
ContentRec.
FraudDetection
PersonalAsst.
RoboticControl
MachineTranslation
Create VWCaffe
Varying physicalresource requirements
Difficult to deploy andbrittle to manage
Large and growing ecosystem of ML models and frameworks
![Page 15: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/15.jpg)
But most companies can’t build new
serving systems…
![Page 16: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/16.jpg)
BigData
Batch Analytics
Model
Training
Use existing systems: Offline Scoring
![Page 17: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/17.jpg)
BigData
Model
Training
Batch Analytics
ScoringX Y
DatastoreUse existing systems: Offline Scoring
![Page 18: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/18.jpg)
Application
Decision
Query
Look up decision in datastore
Low-Latency Serving
X Y
Use existing systems: Offline Scoring
![Page 19: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/19.jpg)
Application
Decision
Query
Look up decision in datastore
Low-Latency Serving
X Y
Problems:Ø Requires full set of queries ahead of time
Ø Small and bounded input domainØ Wasted computation and space
Ø Can render and store unneeded predictionsØ Costly to update
Ø Re-run batch job
Use existing systems: Offline Scoring
![Page 20: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/20.jpg)
Prediction-Serving Challenges
Support low-latency, high-throughput serving workloads
???Create VWCaffe 20
Large and growing ecosystem of ML models and frameworks
![Page 21: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/21.jpg)
How does Clipper address these challenges?
![Page 22: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/22.jpg)
q Simplifies deployment through layered architecture
q Serves many models across ML frameworks concurrently
q Employs caching, batching, scale-out for high-performance serving
Clipper Solutions
![Page 23: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/23.jpg)
Clipper
Predict FeedbackRPC/REST Interface
CaffeMC MC MC
RPC RPC RPC RPC
Clipper Decouples Applications and Models
Applications
Model Container (MC)
![Page 24: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/24.jpg)
Clipper Architecture
Clipper
Caffe
ApplicationsPredict ObserveRPC/REST Interface
MC MC MCRPC RPC RPC RPC
Model Abstraction LayerProvide a common interface to modelswhile bounding latency and maximizing throughput.
Model Selection LayerImprove accuracy through bandit methods and ensembles, online learning, and personalization
Model Container (MC)
![Page 25: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/25.jpg)
Clipper Architecture
Clipper
Caffe
ApplicationsPredict ObserveRPC/REST Interface
MC MC MCRPC RPC RPC RPC
Model Selection LayerSelection Policy
Model Abstraction LayerCaching
Adaptive Batching
Model Container (MC)
![Page 26: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/26.jpg)
ClipperModel Selection LayerSelection Policy
Selection policies supported by ClipperØExploit multiple models to estimate confidenceØ Use multi-armed bandit algorithms to learn
optimal model-selection onlineØ Online personalization across ML frameworks
*See paper for details [NSDI 2017]
![Page 27: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/27.jpg)
A single page load may generatemany queries
Batching to Improve ThroughputØ Optimal batch depends on:
Ø hardware configurationØ model and frameworkØ system load
Ø Why batching helps:
HardwareAcceleration
Helps amortizesystem overhead
![Page 28: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/28.jpg)
A single page load may generatemany queries
Clipper Solution:
Adaptively tradeoff latency and throughput…
Ø Inc. batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds SLO cut batch size by a fraction (Multiplicative Decrease)
Ø Why batching helps:
HardwareAcceleration
Helps amortizesystem overhead
Ø Optimal batch depends on:Ø hardware configurationØ model and frameworkØ system load
Batching to Improve ThroughputAdaptive
![Page 29: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion](https://reader034.fdocuments.in/reader034/viewer/2022042214/5eb989eab8b8c65a3c4a3276/html5/thumbnails/29.jpg)
ConclusionØ Prediction-serving is an important and challenging area for systems
researchØ Support low-latency, high-throughput serving workloadsØ Serve large and growing ecosystem of ML frameworks
Ø Clipper is a first step towards addressing these challengesØ Simplifies deployment through layered architectureØ Serves many models across ML frameworks concurrentlyØ Employs caching, adaptive batching, container scale-out to meet interactive
serving workload demandsØ Beyond academic prototype to build a real, open-source system
https://github.com/ucbrise/[email protected]