Machine Learning Lifecycle - boss-workshop.github.io

Simplifying the Machine Learning Lifecycle

Agenda/ Broad Adoption of ML … and its issues

/ The need for standardization

/ ML development challenges

/ How MLflow tackles these

Login to Databricks Community Edition

• Sign up for Databricks Community Edition for free • We will use this for the tutorial• Once you sign up, you can continue to use it to learn and

experiment on a dedicated data sciences engineering environment

https://databricks.com/try

Go to databricks.com/try

Sign up for Community Edition

Log into DBCE

Create a Cluster on DBCE

Attach a Notebook to your Cluster

Broad Adoption of ML

and many many more customers in different industries and segments

Internet of ThingsDigital PersonalizationHealthcare and Genomics Fraud Prevention

Huge disruptive innovations are affecting most enterprises on the planet

MLCode

ConfigurationData Collection

Data Verification

Feature Extraction

Machine Resource

Management

Analysis Tools

ProcessManagement Tools

ServingInfrastructure

Monitoring

“Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015

Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.

Hardest Part of ML isn’t ML, it’s Data

Data & ML Tech and People are in Silos

DATA ENGINEERS

xDATA SCIENTISTS

ML Lifecycle is Manual, Inconsistent and Disconnected

● Ad hoc approach to track experiments

● Very hard to reproduce experiments

Prep Data● Multiple tightly coupled

deployment options ● Different monitoring approach

for each framework

Build Model Deploy Model● Low level integrations for

Data and ML● Difficult to track data used

for a model

The need for standardization

Day in the life of a data scientist (tracking edition)

Elasticnet model (alpha=0.01, l1_ratio=1.0): RMSE: ?? MAE: 51.051828604086325 R2: 0.3951809598912357

Elasticnet model (alpha=?, l1_ratio=0.75): RMSE: 65.28994906390733 MAE: 53.759148284349266 R2: ??

Elasticnet model (alpha=0.01, l1_ratio=?): RMSE: 71.40362571026475 MAE: ?? R2: 0.2291130640003659





Did anything change in the feature engineering?





How did the hyperparameters change?





What data was this model trained on?





How did the offline metrics change?





What else am I missing?

The difference between releasing Software and deploying ML Models

Write code

Software

Write unit tests

Send for review

Get approvals

Commit

Release testing

Release

Write code

Software ML Models

Write unit tests

Send for review

Get approvals

Commit

Release testing

Release

Analyze data

Put data into the right format

Write model code

Train and evaluate model

Experiment with params, model structure

Deploy … by email?

Monitor performance and trigger retraining

Write code

Software ML Models

Write unit tests

Send for review

Get approvals

Commit

Release testing

Release

Analyze data


Write model code





Meet a functional specification

Optimize a metric, e.g. CTR

Goal

Write code

Software ML Models

Write unit tests

Send for review

Get approvals

Commit

Release testing

Release

Analyze data


Write model code







Goal

Depends on code Depends on data, code, model, params,

...

Quality

Write code

Software ML Models

Write unit tests

Send for review

Get approvals

Commit

Release testing

Release

Analyze data


Write model code







Goal


...

Quality

Typically one software stack

Combination of many libraries, tools,

...

Tools

Write code

Software ML Models

Write unit tests

Send for review

Get approvals

Commit

Release testing

Release

Analyze data


Write model code







Goal


...

Quality

Typically one software stack

Combination of many libraries, tools,

...

Tools

Works deterministically

Keeps changing with data, etc.

Outcome

In summary, deploying ML Models is hard!

ML Lifecycle and Challenges

Delta

Tuning Model Mgmt

Raw Data ETL TrainFeaturize Score/ServeBatch + Realtime

Monitor Alert, Debug

Deploy

AutoML, Hyper-p. search

Experiment Tracking

Remote Cloud Execution

Project Mgmt(scale teams)

Model Exchange

DataDrift

ModelDrift

Orchestration (Airflow, Jobs)

A/BTesting

CI/CD/Jenkins push to prod

Feature Repository

Lifecycle mgmt.

RetrainUpdate FeaturesProduction Logs

Zoo of Ecosystem Frameworks

Collaboration Scale Governance

An open source platform for the machine learning lifecycle

Introducing MLflowUnveiled in June 2018, MLflow is the only open source framework designed to manage the complete Machine Learning Lifecycle.

ModelRegistry

Introducing MLflowUnveiled in June 2018, MLflow is the only open source framework designed to manage the complete Machine Learning Lifecycle.

140120100806040200

0 5 10 15 20 25 30 35 40 45

Months since Project Launch

# of

Con

trib

utor

s

Components

Tracking

Record and queryexperiments: code,

data, config, results

Projects

Packaging formatfor reproducible

runs on any platform

Models

General format that standardizes

deployment paths

Model Registry

Centralized and collaborative

model lifecycle management

mlflow.org github.com/mlflow twitter.com/MLflow

Components

Tracking

Record and queryexperiments: code,

data, config, results

Projects

Packaging formatfor reproducible

runs on any platform

Models

General format that standardizes

deployment paths

Model Registry

Centralized and collaborative

model lifecycle management

new


Tracking

Notebooks

Local Apps

Cloud Jobs

UI

API

Tracking Server

Parameters Metrics Artifacts

ModelsMetadata Spark Data Source

Key Concepts in Tracking

Parameters: key-value inputs to your codeMetrics: numeric values (can update over time)Artifacts: arbitrary files, including modelsSource: what code ran?

# Scikit Learn Linear Regression via ElasticNet lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42) lr.fit(train_x, train_y)

# Predict predicted_qualities = lr.predict(test_x)

# Evaluate Metrics (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

with mlflow.start_run() as run:

# Scikit Learn Linear Regression via ElasticNet lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42) lr.fit(train_x, train_y)

# Predict predicted_qualities = lr.predict(test_x)

# Evaluate Metrics (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

# Log mlflow.log_param("alpha", alpha) ...

GitHub Demohttps://github.com/dennyglee/mlflow-diabetes-example

Comparing Runs Contour Plot

Projects

Project Spec

Code MetadataConfig

Local Execution

Remote Execution

Example MLflow Projectmy_project/├── MLproject│ │ │ │ │├── conda.yaml├── main.py└── model.py ...

conda_env: conda.yaml

entry_points: main: parameters: training_data: path lambda: {type: float, default: 0.1} command: python main.py {training_data} {lambda}

$ mlflow run git://<my_project>

mlflow.run(“git://<my_project>”, ...)

Model Format

Flavor 2Flavor 1

Simple model flavors usable by many tools

Containers

Batch & Stream Scoring

Cloud Inference Services

In-Line Code

Models

Example MLflow Modelmy_model/├── MLmodel│ │ │ │ │└── estimator/ ├── saved_model.pb └── variables/ ...

Usable by tools that understandTensorFlow model format

Usable by any tool that can runPython (Docker, Spark, etc!)

run_id: 769915006efd4c4bbd662461time_created: 2018-06-28T12:34flavors: tensorflow: saved_model_dir: estimator signature_def_key: predict python_function: loader_module: mlflow.tensorflow

Automated Jobs

REST Serving

Downstream Users

Reviewers + CI/CD Tools

Model Registry

Experimental Staging A/B Tests Production

Model RegistryData Scientists Deployment Engineers

Model Registry: Benefits

One Collaborative Hub

● Central Model Repository

● Overview of versions in Staging/Production/etc.

● Search/filter/pagination1

2

3

3

1

2

3


2

1

Management of the entire ML Lifecycle (MLOps)

● Overview of active model versions and their deployment stage

● Request/Approval workflow for transitioning deployment stages

1

2

1

Visibility

● Full activity log of stage transition requests, approvals, etc.

1


1.a

1.b

1.c

Governance and Auditability

● Full provenance from Model marked production in the Registry to …

●○ Run that produced the model○ Notebook that produced the

run○ Exact revision history of the

notebook that produced the run

1.a

1.b

1.c


Notebook Demohttps://github.com/dennyglee/tech-talks/blob/master/sa

mples/MLflow%20Diabetes%20Example%20(with%20MLflow%20Registry).ipynb

Towards more principled Data Science and ML

: An Open Source ML Platform


Hands-on Workshop

bit.ly/mlflow-boss-2020

Machine Learning Lifecycle - boss-workshop.github.io

Documents

Transcript of Machine Learning Lifecycle - boss-workshop.github.io