Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging...
Transcript of Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging...
![Page 1: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/1.jpg)
May 20, 2019
Machine Learning Models as a Service
Tobias WenzelVigith Maurice
![Page 2: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/2.jpg)
![Page 3: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/3.jpg)
“We spend more time bringing the model to production than developing and training it”
Data Science Operations is not easy
![Page 4: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/4.jpg)
“We spend more time bringing the model to production than developing and training it”
Data Science Operations is not easy
![Page 5: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/5.jpg)
Let's build a platform
Centralized or Decentralized?
![Page 6: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/6.jpg)
Let's build a platform
Centralized!
![Page 7: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/7.jpg)
Let's build a centralized platform
The platform controls the environment in which models are deployed
![Page 8: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/8.jpg)
Let's build a centralized platform
The platform can implement actions of the machine learning model life cycle centrally
![Page 9: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/9.jpg)
Let's build a centralized platform
Create self service APIs for every interaction with the platform
![Page 10: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/10.jpg)
Security for an ML Platform
https://news.ycombinator.com/item?id=15256121
![Page 11: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/11.jpg)
Security for an ML Platform
https://medium.com/@bertusk/detecting-cyber-attacks-in-the-python-package-index-pypi-61ab2b585c67
![Page 12: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/12.jpg)
Let's build a centralized platform
Runtime security can be taken care of by the platform
![Page 13: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/13.jpg)
Let's build a centralized platform
Updates can be rolled out to every running model by updating the platform
![Page 14: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/14.jpg)
Machine Learning Platform
Model Training Model Execution
Secure Data Access (Intuit Data Lake)
ContinuousTraining(Argo)
Access Control(Authn, Authz)
LoggingTracing(Splunk)
Monitoring(Wavefront)
ML Compute (via Sagemaker)(GPUs etc.)
Notebooks (via Sagemaker)(Data Exploration, Visualization)
DataAggregation(Pre-fetch, Cache, Optimize)
Auto Scaling(Cost Control)
Billing(Chargeback)
Prediction Quality Feedback(Beacons)
Self Service(Model Lifecycle Management)
Storage(Training Data, Model Artifacts)
User Interfaces (Web, API, CLI)
Intuit’s ML Platform
![Page 15: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/15.jpg)
And Now for Something Completely Different
![Page 16: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/16.jpg)
ObservabilityBeaconing / Monitoring / Logging
Let's build a centralized platform
![Page 17: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/17.jpg)
IntrospectionInspect prediction response (Monitoring Contd.)
Let's build a centralized platform
![Page 18: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/18.jpg)
Meaning of life = 43I expected 42!
Introspection
![Page 19: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/19.jpg)
BeaconingRe-training and Monitoring
Introspection
![Page 20: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/20.jpg)
What is a beacon?
{"modelName": "meaning-of-life","modelVersion": "1","environment": "PROD","requestBody": "What is the meaning of life?","responseBody": "42"
… Metadata …}
![Page 21: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/21.jpg)
Realtime Beacons for Predict Request and Responses
KafkaData Lake
Hive, MR, Spark, Flink
Kafka Consumer
Realtim
e
Micro Batchingencrypt
Beaconing - BYOC
Labelling / Re-training
Anomaly / Alerting
Near Real Time
Long Term
Monitoring Feedback Loop
{"modelName": "meaning-of-life","modelVersion": "1","environment": "PROD","requestBody": "What is the meaning of life?","responseBody": "42"}.encode()
![Page 22: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/22.jpg)
Monitoringthe bottom layer of the Hierarchy of Production Needs, is fundamental to
running a stable service[1].
Observability
![Page 23: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/23.jpg)
Core Building Blocks
Model Hosting Service
HostedModels
Store
Beacon
Cache
Counters Timers Distribution
Requests Core Process Time Request Size
HTTP Status Total Response Time Response Size
Store Requests Store Process Time Store get/put size
Cache Hit/Miss Cache Process Time Cache get/put size
Beacons Beacon Emit Time Beacon size
Authentication LMA Process Time
● Cardinality of each is increased by tagging● Availability is formulated from granular metrics● Most alerting is based on P99
Monitoring - Platform on Platform
![Page 24: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/24.jpg)
Monitoring - Request Latency
![Page 25: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/25.jpg)
Logging
Let's build a centralized platform
![Page 26: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/26.jpg)
● Accessibility (easy to access)
● Compartmentalized (logs are grouped by Model)
● Traceability (transactional id flows from end user app all the way to Model)
● Near real time
● Log Retention
Centralized Logging
![Page 27: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/27.jpg)
Machine Learning Models as a Service
![Page 28: Machine Learning Models as a Service · Training (Argo) Access Control (Authn, Authz) Logging Tracing (Splunk) Monitoring (Wavefront) ML Compute (via Sagemaker) (GPUs etc.) Notebooks](https://reader030.fdocuments.in/reader030/viewer/2022040619/5f2b3bf599b9e372b17049db/html5/thumbnails/28.jpg)
1. Make actions in ML lifecycle self service
2. Take the operations burden off of the data scientist
3. Make sure models are run securely
4. Provide common functionalities to all models at scale
5. Provide logging, tracing and monitoring out of the box
Running ML models is just like running a Service