Transcript of “OpenShift and Machine Learning at ExxonMobil” · OpenShift Commons SF, Oct 2019
OpenShift and Machine Learning at ExxonMobil
October 28, 2019
Cory Latschkowski
UIS Technology Enablement – Team Lead
OpenShift Commons SF, Oct 2019 – Delivering agile data science solutions with OpenShift
What can you envision and share?
We could use a notebook to combine Python code, a GUI, and documentation for sharing with customers.
Jupyter on OCP PoC at ExxonMobil: could this new technology help us create a reproducible and interactive data science environment?
Prize: this would enable the team not only to obtain customer feedback quickly, but also to apply Agile methodology easily, thereby delivering MVPs quickly.
Drawback: how does one avoid the setup/configuration issues and reliably deploy the notebook?

PC setup:
• Dependencies: pip, Anaconda, libraries, etc.
• Jupyter Notebook, Python 3.x (load onto a PC, or set up a server)
• Local admin access
• Source code: latest
• OS / SQL Server

Goal: a data science environment that is interactive, reproducible, and collaborative
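The setup/configuration drawback above is largely a dependency-pinning problem. As a minimal sketch (the package names and the injectable lookup hook are illustrative assumptions, not from the talk), one can verify that a notebook host matches a pinned dependency list before trusting its results:

```python
# Sketch: check that the local environment matches a pinned dependency list.
# Package names and the injectable lookup hook are illustrative assumptions.
from importlib import metadata

def check_environment(pins, get_version=metadata.version):
    """Return {package: (found_version, ok)} for each pinned package.

    A pin of None means "any installed version is acceptable".
    """
    report = {}
    for pkg, wanted in pins.items():
        try:
            found = get_version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = (None, False)
            continue
        report[pkg] = (found, wanted is None or found == wanted)
    return report

if __name__ == "__main__":
    for pkg, (ver, ok) in check_environment({"pip": None}).items():
        print(f"{pkg}: {ver} {'OK' if ok else 'MISSING/MISMATCH'}")
```

Baking such a check into the image build is one way to make "works on my PC" failures visible early.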
Jupyter Notebooks
OpenShift setup:
• S2I build
• Jupyter on OpenShift
• URL to PoC code

Goal: a data science environment that is interactive, reproducible, and collaborative
Local PC vs OpenShift
OCP Data Science Delivery Model
1. Understand the Problem
2. Suggest Solutions / Deliver PoC
3. Refine the Problem (Agile Development)

How to deploy? git push → OpenShift S2I build (Jupyter) → URL to PoC code → “interactive” feedback!
Supporting pieces: Nexus images, Python (PyPI) security.

As a user, I want to provide frequent feedback!
OpenShift Environment
• Reusable data sources: data location
• Reusable data science images: can they be re-consumed or modified for particular use cases? We have a base Python image that has been modified to provide TensorFlow and scikit-learn for data science projects.
• Reusable data access containers: SQL Server, Oracle
• Train models during the (S2I) build process
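The "train models during the build" idea can be sketched with a tiny stdlib-only training step that an S2I assemble script might invoke so the model ships inside the image. The least-squares model and the `model.pkl` file name are illustrative assumptions, not ExxonMobil's actual code:

```python
# Sketch: a tiny "train during the build" step that an S2I assemble script
# could invoke so the model ships inside the image. The least-squares model
# and the model.pkl file name are illustrative assumptions.
import json
import pickle
import statistics

def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (stdlib only)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return {"slope": slope, "intercept": my - slope * mx}

if __name__ == "__main__":
    model = train([1, 2, 3, 4], [2, 4, 6, 8])
    with open("model.pkl", "wb") as f:  # baked into the image at build time
        pickle.dump(model, f)
    print(json.dumps(model))
```

Because the artifact is produced at build time, every deployed container starts from the same trained model, which is the reproducibility property the slide is after.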
S2I workflow: the developer pushes code to a Git repository (BUILD APP); OpenShift Source-to-Image (S2I) combines the source with a builder image and pushes the result to the image registry (BUILD IMAGE); OpenShift then deploys the application container (DEPLOY).
Source 2 Image (S2I)
We are seeing an emerging notion of Data Science Ops workflows: production CI/CD taking form in reusable templates, existing processes, and common practices.
Pipeline: git push → Jenkins build and package → Jenkins archives artifacts in Nexus → OCP builds the image and deploys to TEST → test the build package → OCP builds the image and deploys to PROD.
Maturing the CI/CD Pipeline
Challenges experienced include:
1. On-prem databases in different countries
2. Development/deployment in Jupyter notebooks (without foundational development practices)
3. One size does not fit all – focus on simple solutions (basic webhook integrations, OCP Jenkins)
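A "basic webhook integration" usually needs payload verification before triggering a build. The sketch below shows GitHub-style HMAC-SHA256 signature checking as one plausible approach; the talk does not specify a scheme, so the header format and secret handling here are assumptions:

```python
# Sketch: GitHub-style HMAC-SHA256 webhook signature verification.
# The header format and secret handling are assumptions for illustration;
# the talk only mentions "basic web hook integrations".
import hashlib
import hmac

def signature_ok(secret: bytes, payload: bytes, header_sig: str) -> bool:
    """Compare the received signature header against a locally computed HMAC."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)
```

Using a constant-time comparison (`hmac.compare_digest`) avoids leaking signature bytes through timing differences.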
Figure 1. Liquid estimates. Marco De Mattia
Unique performance computing requirements for Artificial Intelligence, Machine Learning, Neural Networks and GPUs
Reusable data science container images:
• TensorFlow
• PyTorch
• Scikit-learn
In process: testing a GPU (NVIDIA V100) cluster on OCP; additional services internal to HPC
Next steps:
• Explore RAPIDS
• Explore AI
• Execute end-to-end pipelines w/ GPUs
Machine Learning on OpenShift
Petro-physical PoC: read and analyze petro-physical data; use ML algorithms to generate analyses/models on a GPU cluster; vetted models can be pushed to Azure/OCP for deployment.
NLP PoC: text summarization for technical texts; take algorithms and summarize them into a repository (library).
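The summarization PoC's core idea can be approximated with a word-frequency extractive baseline, a common textbook technique; it is assumed here purely for illustration and is not the team's actual NLP algorithm:

```python
# Sketch: a word-frequency extractive summarizer, a common baseline for
# text summarization. This is an assumed illustration, not the team's
# actual NLP algorithm.
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 1) -> str:
    """Return the n highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    top = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )[:n_sentences]
    return " ".join(s for s in sentences if s in top)
```

Sentences containing the document's most frequent terms score highest, which tends to surface the topical "core" of a technical text.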
Figure 2. GPU PoC workflow, Audrey Reznik: a data scientist pulls ML algorithms from a Git repo into containers on the GPU/DB cluster; the containers reach on-prem database(s) over internal network resources; the user accesses the resulting ML app via a URL.
OCP GPU Proof of Concept(s) (PoC)
As a data scientist, why do I care about cloud?
GPU – Private or Public Cloud
Compare buying GPU hardware to public cloud:
• Public-cloud GPUs are often billed at a premium
1. Train models on internal GPU resources (if available)
2. Run trained models in the public cloud as spike work
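The premium-billing point invites a back-of-envelope break-even check; all figures below are made-up placeholders, not numbers from the talk:

```python
# Sketch: break-even hours for buying a GPU vs renting one in the public
# cloud. All prices are made-up placeholders, not figures from the talk.
def breakeven_hours(hardware_cost: float, cloud_rate_per_hour: float) -> float:
    """Hours of cloud GPU time that cost as much as buying the hardware."""
    return hardware_cost / cloud_rate_per_hour

# e.g. a $10,000 card vs a $2.50/hour cloud GPU -> 4,000 hours to break even
```

Sustained training loads pass break-even quickly, which is why the slide suggests training internally and reserving the public cloud for bursty spike work.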
Hybrid Cloud
Diagram: internal cloud services (OpenShift), apps, and data sources (DB) sit inside the ExxonMobil network; a firewall separates them from the public clouds (AWS, Azure), where IHS containers run.
Considerations for solution placement:
1. Where is your data? (data sovereignty)
2. Where are your customers? (internal/external)
3. What is the bandwidth/latency between system elements?
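Consideration 3 can be made concrete with a rough transfer-time estimate; the simple bandwidth-plus-round-trip model below is an illustrative assumption, not a formula from the talk:

```python
# Sketch: rough transfer time over a link, for the bandwidth/latency check.
# The single-round-trip model is an illustrative simplification.
def transfer_seconds(payload_mb: float, bandwidth_mbps: float, rtt_ms: float) -> float:
    """Approximate seconds to move payload_mb megabytes across the link."""
    return (payload_mb * 8) / bandwidth_mbps + rtt_ms / 1000.0
```

Even a crude estimate like this helps decide whether an app belongs next to its data source or can sit on the far side of the firewall.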
YOUR DIFFERENTIATION DEPENDS ON YOUR ABILITY TO DELIVER INTELLIGENT APPS FASTER
CONTAINERS, KUBERNETES, DEVOPS & DATAOPS ARE KEY INGREDIENTS
Innovation Culture
Cloud-native Applications
AI & Machine Learning
Internet of Things
Virtual GPU
Personal Focus Area
• Success is not about doing things perfectly; it’s about a willingness to change and being honest about where you are. Ultimately, this is far more important than your current abilities.
As a data scientist, all I care about is that I can now deploy a consistent solution (Jupyter Notebook / Python with all required libraries) in a matter of minutes by using OpenShift.
This frees me (and other data scientists) to perform data science rather than worry about architecture and delivery mechanisms. Now that is democratizing data science!
Team Focus Areas
• Consulting with data scientists to create production-worthy, sustainable solutions
• Education: build success skills, paving a path for data scientists to be good developers
• Collaboration/partnering organically across technology and business domains
• Self-service for accessing data and data science templates
• One-click notebooks – a JupyterHub environment in OCP
• Bringing OCP GPUs to users allows for sharing resources and faster modeling
Upstream Data Science Enablement Team
Delivering Agile Data Science solutions with OpenShift … and providing Business Value!