Machine Learning Lifecycle - boss-workshop.github.io
Simplifying the Machine Learning Lifecycle
Agenda
/ Broad Adoption of ML … and its issues
/ The need for standardization
/ ML development challenges
/ How MLflow tackles these
Log in to Databricks Community Edition
• Sign up for Databricks Community Edition for free
• We will use this for the tutorial
• Once you sign up, you can continue to use it to learn and experiment in a dedicated data science and engineering environment
https://databricks.com/try
Go to databricks.com/try
Sign up for Community Edition
Log in to DBCE
Create a Cluster on DBCE
Attach a Notebook to your Cluster
Broad Adoption of ML
and many many more customers in different industries and segments
Internet of Things, Digital Personalization, Healthcare and Genomics, Fraud Prevention
Huge disruptive innovations are affecting most enterprises on the planet
ML Code
Configuration
Data Collection
Data Verification
Feature Extraction
Machine Resource Management
Analysis Tools
Process Management Tools
Serving Infrastructure
Monitoring
“Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015
Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.
Hardest Part of ML isn’t ML, it’s Data
Data & ML Tech and People are in Silos
DATA ENGINEERS
DATA SCIENTISTS
ML Lifecycle is Manual, Inconsistent and Disconnected
Prep Data
● Low level integrations for Data and ML
● Difficult to track data used for a model
Build Model
● Ad hoc approach to track experiments
● Very hard to reproduce experiments
Deploy Model
● Multiple tightly coupled deployment options
● Different monitoring approach for each framework
The need for standardization
Day in the life of a data scientist (tracking edition)
Elasticnet model (alpha=0.01, l1_ratio=1.0): RMSE: ?? MAE: 51.051828604086325 R2: 0.3951809598912357
Elasticnet model (alpha=?, l1_ratio=0.75): RMSE: 65.28994906390733 MAE: 53.759148284349266 R2: ??
Elasticnet model (alpha=0.01, l1_ratio=?): RMSE: 71.40362571026475 MAE: ?? R2: 0.2291130640003659
Did anything change in the feature engineering?
How did the hyperparameters change?
What data was this model trained on?
How did the offline metrics change?
What else am I missing?
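Questions such as "How did the hyperparameters change?" and "How did the offline metrics change?" become mechanical once every run is stored as a complete record. A minimal sketch in plain Python, assuming each run is kept as a dict with "params" and "metrics"; the helper name diff_runs and all numbers below are illustrative, not results from the talk:

```python
from typing import Any, Dict


def diff_runs(old: Dict[str, Dict[str, Any]],
              new: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Compare two run records of the form {"params": {...}, "metrics": {...}}."""
    # Parameters whose value differs (or that exist in only one run).
    changed_params = {
        key: (old["params"].get(key), new["params"].get(key))
        for key in sorted(set(old["params"]) | set(new["params"]))
        if old["params"].get(key) != new["params"].get(key)
    }
    # Metric deltas (new minus old) for metrics present in both runs.
    metric_deltas = {
        key: new["metrics"][key] - old["metrics"][key]
        for key in sorted(set(old["metrics"]) & set(new["metrics"]))
    }
    return {"params": changed_params, "metrics": metric_deltas}


# Illustrative values only.
run_a = {"params": {"alpha": 0.01, "l1_ratio": 1.0},
         "metrics": {"rmse": 70.0, "mae": 51.0}}
run_b = {"params": {"alpha": 0.01, "l1_ratio": 0.75},
         "metrics": {"rmse": 65.0, "mae": 53.5}}

report = diff_runs(run_a, run_b)
print(report)
```

The comparison only works if no field was left as "??" at logging time, which is the point of the slides above.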
The difference between releasing Software and deploying ML Models
Software
  Write code
  Write unit tests
  Send for review
  Get approvals
  Commit
  Release testing
  Release
ML Models
  Analyze data
  Put data into the right format
  Write model code
  Train and evaluate model
  Experiment with params, model structure
  Deploy … by email?
  Monitor performance and trigger retraining
Goal
  Software: Meet a functional specification
  ML Models: Optimize a metric, e.g. CTR
Quality
  Software: Depends on code
  ML Models: Depends on data, code, model, params, ...
Tools
  Software: Typically one software stack
  ML Models: Combination of many libraries, tools, ...
Outcome
  Software: Works deterministically
  ML Models: Keeps changing with data, etc.
In summary, deploying ML Models is hard!
ML Lifecycle and Challenges
Delta
Tuning, Model Mgmt
Raw Data, ETL, Featurize, Train, Score/Serve (Batch + Realtime)
Monitor: Alert, Debug
Deploy
AutoML, Hyper-p. search
Experiment Tracking
Remote Cloud Execution
Project Mgmt (scale teams)
Model Exchange
Data Drift
Model Drift
Orchestration (Airflow, Jobs)
A/B Testing
CI/CD/Jenkins push to prod
Feature Repository
Lifecycle mgmt.
Retrain, Update Features, Production Logs
Zoo of Ecosystem Frameworks
Collaboration, Scale, Governance
An open source platform for the machine learning lifecycle
Introducing MLflow
Unveiled in June 2018, MLflow is the only open source framework designed to manage the complete Machine Learning Lifecycle.
Model Registry
[Figure: number of contributors (y-axis, 0 to 140) versus months since project launch (x-axis, 0 to 45)]
Components
Tracking
Record and query experiments: code, data, config, results
Projects
Packaging format for reproducible runs on any platform
Models
General format that standardizes deployment paths
Model Registry (new)
Centralized and collaborative model lifecycle management
mlflow.org github.com/mlflow twitter.com/MLflow
Tracking
Notebooks
Local Apps
Cloud Jobs
UI
API
Tracking Server
Parameters Metrics Artifacts
Models, Metadata, Spark Data Source
Key Concepts in Tracking
Parameters: key-value inputs to your code
Metrics: numeric values (can update over time)
Artifacts: arbitrary files, including models
Source: what code ran?
# Scikit-learn linear regression via ElasticNet
lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
lr.fit(train_x, train_y)

# Predict
predicted_qualities = lr.predict(test_x)

# Evaluate metrics
(rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
import mlflow
from sklearn.linear_model import ElasticNet

with mlflow.start_run() as run:
    # Scikit-learn linear regression via ElasticNet
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    # Predict
    predicted_qualities = lr.predict(test_x)

    # Evaluate metrics
    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    # Log
    mlflow.log_param("alpha", alpha)
    ...
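The snippet calls eval_metrics, which the slides do not show. A minimal sketch, assuming it returns RMSE, MAE and R2 in that order; it is written in plain Python here so it runs without extra libraries, and the real demo may well compute these with a library instead:

```python
import math
from typing import Sequence, Tuple


def eval_metrics(actual: Sequence[float],
                 pred: Sequence[float]) -> Tuple[float, float, float]:
    """Return (rmse, mae, r2) for paired actual and predicted values."""
    n = len(actual)
    if n == 0 or n != len(pred):
        raise ValueError("actual and pred must be non-empty and equally long")

    errors = [a - p for a, p in zip(actual, pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)                 # root mean squared error
    mae = sum(abs(e) for e in errors) / n        # mean absolute error

    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R2 is undefined when the actual values are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


rmse, mae, r2 = eval_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(rmse, mae, r2)
```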
GitHub Demo
https://github.com/dennyglee/mlflow-diabetes-example
Comparing Runs Contour Plot
Projects
Project Spec
Code, Metadata, Config
Local Execution
Remote Execution
Example MLflow Project
my_project/
├── MLproject
├── conda.yaml
├── main.py
└── model.py
...
conda_env: conda.yaml

entry_points:
  main:
    parameters:
      training_data: path
      lambda: {type: float, default: 0.1}
    command: python main.py {training_data} {lambda}
$ mlflow run git://<my_project>
mlflow.run("git://<my_project>", ...)
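The entry point's command template passes training_data and lambda to main.py as positional arguments. The slides do not show main.py, so the following is a hypothetical sketch of how it could read them; parse_args and the attribute name lambda_ are assumptions (lambda is a reserved word in Python, so it cannot be used as an attribute name directly).

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the arguments filled into: python main.py {training_data} {lambda}."""
    parser = argparse.ArgumentParser(description="Hypothetical main.py entry point")
    parser.add_argument("training_data", help="path to the training data")
    parser.add_argument(
        "lambda_",
        metavar="lambda",
        type=float,
        nargs="?",
        default=0.1,  # mirrors the default declared in the MLproject file
        help="regularization strength",
    )
    return parser.parse_args(argv)


# Simulate the command line that the project runner would construct.
args = parse_args(["data/train.csv", "0.25"])
print(args.training_data, args.lambda_)
```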
Model Format
Flavor 1, Flavor 2
Simple model flavors usable by many tools
Containers
Batch & Stream Scoring
Cloud Inference Services
In-Line Code
Models
Example MLflow Model
my_model/
├── MLmodel
└── estimator/
    ├── saved_model.pb
    └── variables/
...

run_id: 769915006efd4c4bbd662461
time_created: 2018-06-28T12:34
flavors:
  # Usable by tools that understand TensorFlow model format
  tensorflow:
    saved_model_dir: estimator
    signature_def_key: predict
  # Usable by any tool that can run Python (Docker, Spark, etc!)
  python_function:
    loader_module: mlflow.tensorflow
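The idea behind flavors is that each downstream tool picks the one it understands and ignores the rest. A minimal sketch of that selection, with the MLmodel metadata mirrored as a plain dict; pick_flavor is a hypothetical helper for illustration, not part of MLflow:

```python
from typing import Any, Dict, Sequence

# The MLmodel metadata from the example model, written as a plain dict.
MLMODEL: Dict[str, Any] = {
    "run_id": "769915006efd4c4bbd662461",
    "time_created": "2018-06-28T12:34",
    "flavors": {
        "tensorflow": {
            "saved_model_dir": "estimator",
            "signature_def_key": "predict",
        },
        "python_function": {"loader_module": "mlflow.tensorflow"},
    },
}


def pick_flavor(mlmodel: Dict[str, Any], supported: Sequence[str]) -> str:
    """Return the first flavor in `supported` that the model provides."""
    available = mlmodel.get("flavors", {})
    for name in supported:
        if name in available:
            return name
    raise LookupError(f"none of {list(supported)} found in {sorted(available)}")


# A TensorFlow-aware tool prefers the native flavor; a generic Python tool
# falls back to python_function.
print(pick_flavor(MLMODEL, ["tensorflow", "python_function"]))
print(pick_flavor(MLMODEL, ["spark", "python_function"]))
```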
Automated Jobs
REST Serving
Downstream Users
Reviewers + CI/CD Tools
Model Registry
Experimental Staging A/B Tests Production
Model Registry
Data Scientists
Deployment Engineers
Model Registry: Benefits
One Collaborative Hub
● Central Model Repository
● Overview of versions in Staging/Production/etc.
● Search/filter/pagination
Management of the entire ML Lifecycle (MLOps)
● Overview of active model versions and their deployment stage
● Request/Approval workflow for transitioning deployment stages
Visibility
● Full activity log of stage transition requests, approvals, etc.
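The request/approval workflow and the activity log can be pictured with a small in-memory model. This is a conceptual sketch only: the stage names come from the registry diagram in this deck, while ModelVersion, request_transition and approve_transition are hypothetical names and are not the MLflow Model Registry API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Stage names as drawn in the registry diagram (an assumption for this sketch).
STAGES = ("Experimental", "Staging", "A/B Tests", "Production")


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "Experimental"
    pending_stage: Optional[str] = None
    # Each entry is (user, action, target stage).
    activity_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def request_transition(self, user: str, target: str) -> None:
        """Record a request to move this version to another stage."""
        if target not in STAGES:
            raise ValueError(f"unknown stage: {target}")
        self.pending_stage = target
        self.activity_log.append((user, "requested", target))

    def approve_transition(self, approver: str) -> None:
        """Apply the pending request and record who approved it."""
        if self.pending_stage is None:
            raise RuntimeError("no pending transition request")
        self.stage = self.pending_stage
        self.pending_stage = None
        self.activity_log.append((approver, "approved", self.stage))


mv = ModelVersion(name="example-model", version=1)
mv.request_transition("data_scientist", "Staging")
mv.approve_transition("deployment_engineer")
print(mv.stage)
print(mv.activity_log)
```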
Governance and Auditability
● Full provenance from Model marked production in the Registry to …
  ○ Run that produced the model
  ○ Notebook that produced the run
  ○ Exact revision history of the notebook that produced the run
Notebook Demo
https://github.com/dennyglee/tech-talks/blob/master/samples/MLflow%20Diabetes%20Example%20(with%20MLflow%20Registry).ipynb
Towards more principled Data Science and ML
MLflow: An Open Source ML Platform
mlflow.org github.com/mlflow twitter.com/MLflow
Hands-on Workshop
bit.ly/mlflow-boss-2020