Data Versioning and Reproducible ML with DVC and MLflow

15
Data Versioning and Reproducible ML with DVC and MLflow Petra Kaferle Devisschere Data Scientist, Adaltas

Transcript of Data Versioning and Reproducible ML with DVC and MLflow

Page 1: Data Versioning and Reproducible ML with DVC and MLflow

Data Versioning and Reproducible ML with DVC and MLflowPetra Kaferle DevisschereData Scientist, Adaltas

Page 2: Data Versioning and Reproducible ML with DVC and MLflow

IntroductionAbout me:

▪ Data Scientist

▪ 13 years of experience in Data Analytics in Life Sciences

▪ Developed solutions for automating frequently used protocols

About Adaltas:▪ Involved with Big Data since 2010

▪ 16 experts in France, 2 in Morocco

▪ Partner with Cloudera, Databricks, D2IQ

▪ Actor in Open Source community

▪ Teaching Big Data at 6 universities

Page 3: Data Versioning and Reproducible ML with DVC and MLflow

Agenda

▪ Versioning in Data Science

▪ Data Versioning

▪ Intro to DVC and MLflow

▪ Demo with DVC and MLflow

Page 4: Data Versioning and Reproducible ML with DVC and MLflow

Problem definition

▪ Dataset▪ Tabular▪ Images, video, sound, text

▪ Code▪ Functionality, Hyper-parameters▪ Data pre-processing, Modeling

▪ Results▪ Numeric (metrics)▪ Plots

▪ Environment▪ Dependencies

What do we need to track in Data Science project to be able to reproduce the results?

Page 5: Data Versioning and Reproducible ML with DVC and MLflow

Data versioning use cases

▪ Data engineering during data cleaning and dataset preparation

▪ Test data in software engineering to assure the functionality of the software

▪ Development and testing of database applications

▪ Ontology versioning for Semantic Web

▪ ...

Data beyond Data Science

Page 6: Data Versioning and Reproducible ML with DVC and MLflow

Data is in the heart of any ML project

▪ Data quality management▪ Reproducible training and re-training▪ Automation of testing or deployment▪ Use in applications and web services▪ Audit

Why is the exact version of a dataset important?

Page 7: Data Versioning and Reproducible ML with DVC and MLflow

Classical DV solutions and their downsides

▪ Git ▪ Not suitable for big files

▪ Difficulties with binary files

▪ Git-LFS▪ File size limitation (ex.: 2 GB for GitHub free account and 5 GB for GitHub Enterprise Cloud)▪ Data needs to be stored on servers of Git provider

▪ External storage▪ Checksum calculation for any change in the dataset and placing them under version control

Classical Software Engineering tools are not enough to solve ML reproducibility crisis

Page 8: Data Versioning and Reproducible ML with DVC and MLflow

Our proposed solutionCombining Data Version Control (DVC) and MLflow

&- solves Git’s limitations- easy to learn

- extensive and easy tracking- user-friendly UI

Page 9: Data Versioning and Reproducible ML with DVC and MLflow

What is DVC?https://dvc.org/

▪ Tracks datasets and ML projects▪ Works with many types of storages (cloud,

local, HDFS, HTTP…)▪ Runs on top of a Git repository▪ Supports building and running pipelines

We will focus on dataset tracking.

Page 10: Data Versioning and Reproducible ML with DVC and MLflow

What is MLflow?https://mlflow.org/

We will focus on experiment tracking.

Page 11: Data Versioning and Reproducible ML with DVC and MLflow

Let’s put them together!Best of both worlds

Page 12: Data Versioning and Reproducible ML with DVC and MLflow

Demo

Page 13: Data Versioning and Reproducible ML with DVC and MLflow

RecapLet’s look again at what we just did

Page 14: Data Versioning and Reproducible ML with DVC and MLflow

Thank you for your attention!

Page 15: Data Versioning and Reproducible ML with DVC and MLflow

Feedback

Your feedback is important to us.

Don’t forget to rateand review the sessions.