Infrastructure Agnostic Machine Learning Workload Deployment

Abi Akogun Data Science Consultant (MavenCode)

Charles Adetiloye ML Platforms Engineer (MavenCode)

About MavenCode

MavenCode is an Artificial Intelligence solutions company located in Dallas, Texas. We provide training, product development, and consulting services in the following areas:

● Provisioning Scalable Data Processing Pipelines on Cloud Infrastructure

● Development & Deployment of Machine Learning and Artificial Intelligence Platforms

● Streaming and Big Data Analytics Edge-IoT and Sensors

About The Presenters

Charles Adetiloye is an ML Platforms Engineer at MavenCode. He has well over 15 years of experience building large-scale, distributed applications. He has extensive experience working and consulting with several companies implementing production-grade ML and AI platforms.

twitter.com/cadetiloye

Abiodun Akogun is a Machine Learning and Data Science Consultant at MavenCode. He has extensive experience building and deploying large-scale Machine Learning applications in industries that include Healthcare, Finance, Telecommunications, and Insurance. He has solved business problems using Data Analytics, Sentiment Analysis, Topic Modelling, Named Entity Recognition (NER), Opinion Mining, Data Mining, Time Series, Spatial Statistics, and Marketing Analytics.

twitter.com/akogz

Agenda

▪ Overview of Machine Learning Model Deployment Workflow

▪ Various Approaches to model training, management, and serving in the Cloud

▪ Deploying Machine Learning Workloads in the Cloud

▪ Implementing Feature Storage backend for ML model training

▪ Running Spark Workloads for ML training on Kubernetes with Kubeflow

Overview of Machine Learning Deployment Workflow

Data Sourcing

Pre Processing

Feature Engineering

Model Training /

Evaluation

Model Scoring /Management

Model Inferencing

Machine Learning Workload Deployment

Data Sourcing

Pre Processing

Feature Engineering

Model Training /

Evaluation


Model Inferencing

Google Cloud AWS Azure On Prem

Machine Learning Deployment Effort

Data Verification

Configuration

Feature Extraction

Data Validation

Machine Resource Management

Serving Infrastructure

Monitoring

Analysis Tool

Machine Learning Code

Data Preparation +Storage

Efficient Compute Resource Management

Overview of Machine Learning Deployment Workflow

Data Sourcing

Pre Processing

Feature Engineering

Model Training /

Evaluation


Model Inferencing

32%

10%

36%

2% 4%

16%

A Typical Machine Learning Developer Workflow

Data Sourcing

Pre Processing

Feature Engineering

Model Training /

Evaluation

Model Scoring

/Management

Model Inferencing

Azure Storage

Google Storage

AWS S3 Storage

Raw Data Transformation Processed Data

Storage Compute1 2

Google Cloud AI, AWS SageMaker, Azure ML

Data Scientists / ML Engineers pull and process data first, before starting ML training on a Managed Cloud Service

Raw Data Processing and Transformation Pipeline

Cloud Training Platforms

What Enterprise Machine Learning Workflow In the Cloud Looks Like!

Data Sourcing

Pre Processing

Feature Engineering

Azure Storage

Google Storage

AWS S3 Storage

Raw Data Transformation Processed Data

Storage Compute1 2

Team A

Team B

Team C

Team D

Google Cloud AI

AWS SageMaker

AWS SageMaker

Azure ML

Running ML workflow across the enterprise with multiple teams using different Cloud Provider technology stacks

Implementing Machine Learning solutions in the cloud comes at a cost, with Compute and Storage at the top of the list.

If we plan to be Cloud Neutral, can we abstract our

● Machine Learning Compute Workload → Kubernetes?

● Machine Learning Storage → Feature Store?

Google Cloud AI, AWS SageMaker, Azure ML

A Typical Machine Learning Developer Workflow

Data Sourcing

Pre Processing

Feature Engineering

Model Training /

Evaluation


Model Inferencing

Azure Storage

Google Storage

AWS S3 Storage

Data Source Transformation Processed Data

Storage Compute1 2

Towards A Cloud Neutral ML Deployment Environment

Data Sourcing

Pre Processing

Feature Engineering

Model Training / Evaluation


Model Inferencing

Storage Compute1 2

Feature Store

Kubernetes

Why the need for Cloud Agnostic Deployment Infrastructure?

● Makes it easier to migrate workloads in a Hybrid Cloud Environment

● We are not tied to a particular Cloud Infrastructure technology stack

● It’s easier to implement best-practice patterns and solutions

● Your team will have a common base denominator for all Enterprise ML workloads

● Easy to control cost, manage utilization and forecast demand

Cloud Agnostic Machine Learning Development




Model Inferencing

Storage Compute1 2

Feature Store

Kubernetes

Azure Storage

Google Storage

AWS S3 Storage

What’s Feature Store All about?

A Feature is a measurable, observable attribute that is part of the input to a Machine Learning Model.

Model Training

X1

X2

X3

Xn

[Feature Vector]

Model

What’s Feature Store All about?

Model Training

X1

X2

X3

Xn

[Feature Vector]

Model

Model 1

Features are derived from

● Raw Datastore

● Streaming Datasource

● Aggregates of Raw Inputs

● Windows (mins, hourly, daily, weekly)
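As a minimal sketch of how windowed features can be derived from raw inputs, the snippet below averages raw events per entity over an hourly and a daily window and assembles one feature vector per entity. The event layout, the entity names, and the `windowed_mean` helper are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def windowed_mean(events, window, now):
    """Average each entity's values over events with now - window < ts <= now."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    cutoff = now - window
    for entity_id, ts, value in events:
        if cutoff < ts <= now:
            totals[entity_id] += value
            counts[entity_id] += 1
    return {entity: totals[entity] / counts[entity] for entity in totals}


# Hypothetical raw events: (entity id, event time, measured value).
now = datetime(2021, 5, 1, 12, 0)
events = [
    ("sensor-a", now - timedelta(minutes=30), 10.0),
    ("sensor-a", now - timedelta(hours=5), 20.0),
    ("sensor-b", now - timedelta(hours=2), 4.0),
    ("sensor-a", now - timedelta(days=3), 90.0),  # outside both windows
]

hourly = windowed_mean(events, timedelta(hours=1), now)
daily = windowed_mean(events, timedelta(days=1), now)

# One feature vector per entity: [hourly mean, daily mean].
# An entity with no events in a window gets 0.0 for that feature.
feature_vectors = {
    entity: [hourly.get(entity, 0.0), daily.get(entity, 0.0)]
    for entity in daily
}
```

Recomputing the same windows at a later `now` yields different vectors for the same entities, which is why features need to be versioned over time.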

Features Change Over time!

Model Training

X1

X2

X3

Xn

X1

X2

X3

Xn

X1

X2

X3

Xn

Time

Machine Learning Feature Store

● Makes it easy to operationalize our ML workloads, most importantly Data Management and Storage for Model training

● Features can be shared easily among teams running different Model training pipelines

● We can version datasets and track changes easily

● Consistency in Feature input attributes between Model Training and Serving

Types Of Feature Store

● Offline Feature Store → Batch Training

● Online Feature Store → Inferencing / Serving

Implementing Offline Feature Storage with Apache Hudi

Azure Storage

Google Storage

AWS S3 Storage

Streaming Source

Batch Job Operations

Data sources with streaming inputs such as MQTT, Kafka, Pub/Sub, etc.

Batch Operations on Databases, File Storage, Distributed Storage, etc.

Feature Store

Workflow Scheduling Orchestration with Kubeflow Pipelines or Airflow Dags on Kubernetes

Feature Store Implementation on any of the Major Cloud Storage

Why did we use Apache Hudi?

● We needed a Unified Platform where new data becomes available alongside historical data within minutes.

● We needed quick computation (or derivation) of Feature vectors in order to make them available as model input.

● Incremental Versioning of our Feature collections lets us time-travel and use a particular set of features for Model training.

● Our Hudi dataset can be stored in the Azure, Google Cloud, or AWS cloud storage layer.

● It is easy to implement everything we need with Spark and PySpark.
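To make the incremental versioning point concrete, here is a hedged sketch of reading one slice of a Hudi table's commit timeline from PySpark. The option keys are Hudi's incremental-query read options. The helper names, the base path, and the instant values are assumptions made for illustration and do not come from the talk.

```python
def hudi_incremental_read_options(begin_instant, end_instant=None):
    """Build Hudi read options selecting commits after begin_instant.

    Instants are Hudi commit timestamps written as strings, e.g. "20210501120000".
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Bounding the range pins training to a fixed version of the features.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def read_feature_version(spark, base_path, begin_instant, end_instant=None):
    """Load the selected commit range; `spark` is an existing SparkSession."""
    options = hudi_incremental_read_options(begin_instant, end_instant)
    return spark.read.format("hudi").options(**options).load(base_path)


# Hypothetical instants: everything committed during May 2021.
may_options = hudi_incremental_read_options("20210501000000", "20210531235959")
```

The same base path can point at Azure, Google Cloud, or AWS storage, provided the Spark session has the matching filesystem connector configured.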

Getting Data into Hudi Feature Store with Kubeflow Pipeline

import kfp
from kfp import components

# Wrap each Python function as a pipeline component that runs in the given base image.
KafkaDatastreamer_op = kfp.components.create_component_from_func(KafkaDatastreamer, base_image="python:3.7.1")

ValidatorOnSchema_op = kfp.components.create_component_from_func(ValidatorOnSchema, base_image="python:3.7.1")

PreProcessor_op = kfp.components.create_component_from_func(PreProcessor, base_image="python:3.7.1")

# The Hudi writer needs Spark, so it runs in a Spark image instead of plain Python.
HudiTableWriter_op = kfp.components.create_component_from_func(HudiTableWriter, base_image="mavencode.io/spark:v3.1.1")
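The bodies of the wrapped functions are not shown on the slides. As a purely hypothetical sketch of what a function like `ValidatorOnSchema` might look like, the example below filters JSON records against an assumed schema. The field names and the string-in, string-out signature are illustrative assumptions, not the presenters' implementation.

```python
def ValidatorOnSchema(records_json: str) -> str:
    """Keep only records whose fields match a (hypothetical) feature schema."""
    # Imported inside the function: function-based components must be
    # self-contained, because only the function source is shipped to the container.
    import json

    required = {"entity_id": str, "event_time": str, "value": float}
    valid = []
    for record in json.loads(records_json):
        if all(isinstance(record.get(key), kind) for key, kind in required.items()):
            valid.append(record)
    return json.dumps(valid)


# Hypothetical input: the second record is rejected because "value" is a string.
sample = (
    '[{"entity_id": "a1", "event_time": "2021-05-01T12:00:00", "value": 1.5},'
    ' {"entity_id": "b2", "event_time": "2021-05-01T12:00:05", "value": "n/a"}]'
)
result = ValidatorOnSchema(sample)
```

Passing data between steps as typed strings keeps each step independent of the others' images, at the cost of serializing records at every hop.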

The Hudi Data Store writer

Configure the Spark Session with the packages needed to run hudi and avro

Hudi configuration Options

Writing the data into our Hudi data store in the right format
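The three steps named above (session configuration, Hudi options, and the write itself) can be sketched roughly as follows. This is a minimal sketch, not the presenters' code: the Maven coordinates are passed in because the Hudi and Avro package versions must match the cluster's Spark and Scala versions, and the table and column names in the usage lines are hypothetical.

```python
def spark_session_conf(hudi_bundle, avro_package):
    """Spark settings for Hudi: extra packages plus the Kryo serializer."""
    return {
        "spark.jars.packages": ",".join([hudi_bundle, avro_package]),
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given settings (needs pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def hudi_write_options(table_name, record_key, precombine_field, partition_field):
    """Hudi write options for upserting rows into a feature table."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
    }


def write_features(df, base_path, options):
    """Append/upsert a Spark DataFrame into the Hudi table at base_path."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table layout: one row per entity and event time.
options = hudi_write_options("features", "entity_id", "event_time", "event_date")
```

With upsert, rows sharing a record key are deduplicated by keeping the one with the largest precombine value, so re-running a batch does not duplicate features.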





Model Inferencing

Storage Compute1 2

Feature Store

Kubernetes

Cloud Native ML Workload Deployment with Operators on Kubeflow

Cloud Native ML Training Deployment

● Containerized Workload

● Scalable + Can Run in Distributed Mode

● Efficient Compute Utilization

● Language Agnostic!

Machine Learning Operators with Kubeflow on Kubernetes

● A Machine Learning Operator helps with the deployment, monitoring, and management of a model training life-cycle

● Some ML Operators found in Kubeflow are:

○ TF-operator → TensorFlow Job

○ Pytorch-operator → PyTorch Job

○ Xgboost-operator → XGBoost Job

○ Spark-operator → Spark and Spark ML Jobs


MLOps Model Training and Deployment Platform

Kubeflow Jupyter NoteBook Kubeflow Jupyter NoteBook Kubeflow Jupyter NoteBook Kubeflow Jupyter NoteBook

Namespace Namespace Namespace Namespace

Auto-Scalable CPU Node Pool Auto-Scalable GPU Node Pool

Spark Operator Spark Operator TensorFlow Operator Tensorflow Operator

Cloud Infrastructure Layer Running

Auto Scaling Node Pools Running Kubernetes

Machine Learning Operators running with Kubeflow

Feature Store

Using Spark Operator for Training ML Steps

PySpark ML Code

Containerize the Python Code

Create SparkApplication Kubernetes YAML Deployment

Apply Deployment to Kubernetes

Spark Operator on Kubernetes

API

Scheduler

OR OR OR

Spark Driver

Executors

Elastic Compute Resource ML Jobs

API

Scheduler

OR OR OR

kubectl apply -f ...

Deployment Configuration YAML

Spark Application Config that describes the job and the namespace where the job will run

Container that will run our Spark ML Code

Spark Driver and Executor Configuration
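The three parts of the deployment configuration described above can be sketched as a Python dictionary that mirrors the Spark Operator's `SparkApplication` resource. The job name, namespace, application file path, service account, and resource sizes below are hypothetical. Only the image name is taken from the earlier slide. The manifest is printed as JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def spark_application(name, namespace, image, main_file, spark_version,
                      executor_instances=2):
    """Build a SparkApplication manifest for the Spark Operator."""
    return {
        # 1. The job description and the namespace where it will run.
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "pythonVersion": "3",
            "mode": "cluster",
            # 2. The container that runs our Spark ML code.
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            # 3. Driver and executor sizing (hypothetical values).
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {
                "cores": 1,
                "instances": executor_instances,
                "memory": "2g",
            },
        },
    }


manifest = spark_application(
    name="feature-training",                 # hypothetical job name
    namespace="team-a",                      # hypothetical namespace
    image="mavencode.io/spark:v3.1.1",
    main_file="local:///opt/app/train.py",   # hypothetical path inside the image
    spark_version="3.1.1",
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code makes it straightforward to stamp out one job per team namespace while keeping the driver and executor settings in one place.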

Connecting to Feature Store with Kubeflow Pipeline

Cost comparison with Managed Cloud service on AWS

30%

100%

15s

66s

Compute Utilization Cost Compute Startup Uptime Team Agility & Productivity

6x Productivity

Managed Services Running on AWS

Kubeflow + S3 Feast Storage ML workload

Summary

● Implementing a Cloud-neutral ML deployment approach simplifies most of the complexities of a Multi-Cloud environment

● After the initial learning-curve hump, overall team efficiency improves significantly

● Teams are not locked in to a particular Cloud Infrastructure stack

● Easy to control cost and forecast future capacity demands

Thank You!

If you are interested in learning more about how to run your Machine Learning Workloads on any Cloud Infrastructure or On Prem, reach out to us.

Drop us a mail [email protected]

Visit Us Online: https://www.mavencode.com

Follow Us: https://www.twitter.com/mavencode
