Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter...

13
Open Data Hub - A Deep Dive OpenShift Commons Gathering San Francisco Sherard Griffin Senior Manager AI CoE Václav Pavlin Principal Software Engineer AI CoE Peter MacKinnon Principal Software Engineer AI CoE 1 Sherard Griffin Senior Manager, Red Hat AI Center of Excellence

Transcript of Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter...

Page 1: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Open Data Hub - A Deep DiveOpenShift Commons Gathering San Francisco

Sherard GriffinSenior ManagerAI CoE

Václav PavlinPrincipal Software EngineerAI CoE

Peter MacKinnonPrincipal Software EngineerAI CoE

1

Sherard GriffinSenior Manager, Red Hat AI Center of Excellence

Page 2: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

What Does a Data Scientist Want?

As a Data Scientist, I want a

“self-service cloud like” experience

for my Machine Learning projects,

where I can access a rich set of

modelling frameworks, data, and

computational resources, share

and collaborate with colleagues,

and deliver my work into

production with speed, agility and

repeatability to drive business

value!

Self service portal to select ML frameworks, data access

Perform ML Modelling

Inferencing w/ hardware acceleration

ML model deployment in app dev

2

Page 3: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Current Situation and Challenges

DATA SCIENTIST DATA SCIENTIST DATA SCIENTIST

➔ Team(s) of Data Scientists and Developers

➔ Sharing and collaboration, if any, is difficult,

manual, error prone and take time

➔ Access to limited non-shared resources

means modeling takes long times or can’t

achieve desired accuracy

➔ Delivering into models into production is a

challenge

3

Page 4: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Why OpenShift And Cloud Platforms for ML Workloads?

EXISTING AUTOMATION

TOOLSETS

SCM(GIT)

CI/CD

SERVICE LAYER

PERSISTENTSTORAGE

REGISTRY

RHEL

NODE

c

RHEL

NODE

RHEL

NODE

RHEL

NODE

RHEL

NODE

RHEL

NODE

C

C

C C

C

C

C CC C

RED HATENTERPRISE LINUX

MASTER

API/AUTHENTICATION

DATA STORE

SCHEDULER

HEALTH/SCALING

PHYSICAL VIRTUAL PRIVATE PUBLIC HYBRID

DATA SCIENTIST

ML deployed across clouds, data center,

and edge

ML services, load balanced

and scaled

ML microservices scheduled and

orchestrated on shared resources

ML in Production

4

Page 5: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Connected Drive & Autonomous Driving

Autonomous Driving

Data driven diagnosis of “Sepsis”

Digital banking with personalized services

Oil & Gas Exploration

5

AI/ML on OpenShift Momentum is Strong

Page 6: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Data Acquisition & Preparation

ML Modelling (Selection, Training,

Testing)

ML Model Deployment in

App. Dev. Process

Data Engineer

Data Scientists

App Developer

IT Operations

BusinessObjectives

Data

Business Leadership

Business Leadership

Intelligent applicationsto achieve

business outcomes

6

Machine Learning Pipeline

Page 7: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Open Data Hub Project - OpenDataHub.io● Community meta-operator that integrates best open source AI/ML/Data projects● Blueprint architecture for end to end AI/ML on OpenShift ● Used as Red Hat’s internal data science and AI platform● Open Data Hub Architecture: https://opendatahub.io/docs/architecture.html

Data Acquisition & Preparation

ML Model Selection, Training, Testing

ML Model Deployment in App. Dev. Process

Page 8: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Upstream/Community Projects in the AI/ML Space

ML toolkit and lifecycle Kube

ML-as-a-service platform based on OpenShift, Ceph, Kafka, JupyterHub,

Spark,

Home for k8s community to share operators for various apps/tools

NVIDIA NGC GPU optimized

and curated

Tensorflow

Others

8

Page 9: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Open Data Hub v0.4 OperatorAvailable Now at OpenDataHub.io

● Unified analytics engine

● Large-scale data access

● Multi-user Jupyter● Used for data

science and research

● Monitoring and alerting toolkit

● Used to diagnose problems

● Analytics platform for all metrics

● Query, visualize and alert on metrics

● Deploying machine learning models as micro-services

● Full model lifecycle management

● Distributed Object Store● S3 Interface

● Distributed event streaming

● Pub/Sub Messaging

● Container-native workflow engine

● Declaratively deploy ML pipelines and models

9

Page 10: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Open Data Hub Blueprint

10

Page 11: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

Pooling and Sharing Resources with JupyterHub on OpenShift

Jupyter Hub DATA SCIENTIST

DATA SCIENTIST

DATA SCIENTIST

DATA SCIENTIST DATA SCIENTIST DATA SCIENTIST

11

Page 13: Open Data Hub - A Deep Dive - assets.openshift.com · Large-scale data access Multi-user Jupyter Used for data science and research Monitoring and alerting toolkit Used to diagnose

OpenDataHub.io

Thank you

13