Transcript of “OpenShift and Machine Learning at ExxonMobil” · OpenShift Commons SF, Oct 2019
OpenShift and Machine Learning at ExxonMobil
October 28, 2019
Cory Latschkowski
UIS Technology Enablement – Team Lead
OpenShift Commons SF, Oct 2019 – Delivering agile data science solutions with OpenShift
What can you envision and share?
We could use a notebook to combine Python code, a GUI, and documentation for sharing with customers.
Jupyter on OCP PoC at ExxonMobil: could this new technology help us create a reproducible and interactive data science environment?
Prize: this would enable the team not only to obtain customer feedback quickly, but also to apply Agile methodology easily, thereby delivering MVPs quickly.
Drawback: how does one avoid the setup/configuration issues and reliably deploy the notebook?

PC setup:
• Dependencies: pip, Anaconda, libraries, etc.
• Jupyter Notebook, Python 3.x (load onto a PC, or set up a server)
• Local admin access
• Source code: latest
• OS / SQL Server

Goal: a data science environment that is interactive, reproducible, and collaborative
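The setup/configuration drawback above is largely a dependency-pinning problem. As a minimal sketch (the package names and the injectable lookup hook are illustrative assumptions, not from the talk), one can verify that a notebook host matches a pinned dependency list before trusting its results:

```python
# Sketch: check that the local environment matches a pinned dependency list.
# Package names and the injectable lookup hook are illustrative assumptions.
from importlib import metadata

def check_environment(pins, get_version=metadata.version):
    """Return {package: (found_version, ok)} for each pinned package.

    A pin of None means "any installed version is acceptable".
    """
    report = {}
    for pkg, wanted in pins.items():
        try:
            found = get_version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = (None, False)
            continue
        report[pkg] = (found, wanted is None or found == wanted)
    return report

if __name__ == "__main__":
    for pkg, (ver, ok) in check_environment({"pip": None}).items():
        print(f"{pkg}: {ver} {'OK' if ok else 'MISSING/MISMATCH'}")
```

Baking such a check into the image build is one way to make "works on my PC" failures visible early.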
Jupyter Notebooks
OpenShift setup:
• S2I build
• Jupyter on OpenShift
• URL to PoC code

Goal: a data science environment that is interactive, reproducible, and collaborative
Local PC vs OpenShift
OCP Data Science Delivery Model
1. Understand the Problem
2. Suggest Solutions / Deliver PoC
3. Refine the Problem (Agile Development)

How to deploy? git push → OpenShift S2I build (Jupyter) → URL to PoC code → “interactive” feedback!
Supporting pieces: Nexus images, Python (PyPI) security.

As a user, I want to provide frequent feedback!
OpenShift Environment
• Reusable data sources: data location
• Reusable data science images: can they be re-consumed or modified for particular use cases? We have a base Python image that has been modified to provide TensorFlow and scikit-learn for data science projects.
• Reusable data access containers: SQL Server, Oracle
• Train models during the (S2I) build process
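The "train models during the build" idea can be sketched with a tiny stdlib-only training step that an S2I assemble script might invoke so the model ships inside the image. The least-squares model and the `model.pkl` file name are illustrative assumptions, not ExxonMobil's actual code:

```python
# Sketch: a tiny "train during the build" step that an S2I assemble script
# could invoke so the model ships inside the image. The least-squares model
# and the model.pkl file name are illustrative assumptions.
import json
import pickle
import statistics

def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (stdlib only)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return {"slope": slope, "intercept": my - slope * mx}

if __name__ == "__main__":
    model = train([1, 2, 3, 4], [2, 4, 6, 8])
    with open("model.pkl", "wb") as f:  # baked into the image at build time
        pickle.dump(model, f)
    print(json.dumps(model))
```

Because the artifact is produced at build time, every deployed container starts from the same trained model, which is the reproducibility property the slide is after.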
S2I workflow: the developer pushes code to a Git repository (BUILD APP); OpenShift Source-to-Image (S2I) combines the source with a builder image and pushes the result to the image registry (BUILD IMAGE); OpenShift then deploys the application container (DEPLOY).
Source 2 Image (S2I)
We are seeing an emerging notion of Data Science Ops workflows: production CI/CD taking form in reusable templates, existing processes, and common practices.
Pipeline: git push → Jenkins build and package → Jenkins archives artifacts in Nexus → OCP builds the image and deploys to TEST → test the build package → OCP builds the image and deploys to PROD.
Maturing the CI/CD Pipeline
Challenges experienced include:
1. On-prem databases in different countries
2. Development/deployment in Jupyter notebooks (without foundational development practices)
3. One size does not fit all – focus on simple solutions (basic webhook integrations, OCP Jenkins)
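A "basic webhook integration" usually needs payload verification before triggering a build. The sketch below shows GitHub-style HMAC-SHA256 signature checking as one plausible approach; the talk does not specify a scheme, so the header format and secret handling here are assumptions:

```python
# Sketch: GitHub-style HMAC-SHA256 webhook signature verification.
# The header format and secret handling are assumptions for illustration;
# the talk only mentions "basic web hook integrations".
import hashlib
import hmac

def signature_ok(secret: bytes, payload: bytes, header_sig: str) -> bool:
    """Compare the received signature header against a locally computed HMAC."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)
```

Using a constant-time comparison (`hmac.compare_digest`) avoids leaking signature bytes through timing differences.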
Figure 1. Liquid estimates. Marco De Mattia
Unique performance computing requirements for Artificial Intelligence, Machine Learning, Neural Networks and GPUs
Reusable data science container images:
• TensorFlow
• PyTorch
• Scikit-learn
In process: testing a GPU (NVIDIA V100) cluster on OCP; additional services internal to HPC
Next steps:
• Explore RAPIDS
• Explore AI
• Execute end-to-end pipelines w/ GPUs
Machine Learning on OpenShift
Petro-physical PoC: read and analyze petro-physical data; use ML algorithms to generate analyses/models on a GPU cluster; vetted models can be pushed to Azure/OCP for deployment.
NLP PoC: text summarization for technical texts; take algorithms and summarize them into a repository (library).
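The summarization PoC's core idea can be approximated with a word-frequency extractive baseline, a common textbook technique; it is assumed here purely for illustration and is not the team's actual NLP algorithm:

```python
# Sketch: a word-frequency extractive summarizer, a common baseline for
# text summarization. This is an assumed illustration, not the team's
# actual NLP algorithm.
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 1) -> str:
    """Return the n highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    top = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )[:n_sentences]
    return " ".join(s for s in sentences if s in top)
```

Sentences containing the document's most frequent terms score highest, which tends to surface the topical "core" of a technical text.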
Figure 2. GPU PoC workflow, Audrey Reznik: a data scientist pulls ML algorithms from a Git repo into containers on the GPU/DB cluster; the containers reach on-prem database(s) over internal network resources; the user accesses the resulting ML app via a URL.
OCP GPU Proof of Concept(s) (PoC)
As a data scientist, why do I care about cloud?
GPU – Private or Public Cloud
Compare buying GPU hardware to public cloud:
• Public-cloud GPUs are often billed at a premium
1. Train models on internal GPU resources (if available)
2. Run trained models in the public cloud as spike work
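The premium-billing point invites a back-of-envelope break-even check; all figures below are made-up placeholders, not numbers from the talk:

```python
# Sketch: break-even hours for buying a GPU vs renting one in the public
# cloud. All prices are made-up placeholders, not figures from the talk.
def breakeven_hours(hardware_cost: float, cloud_rate_per_hour: float) -> float:
    """Hours of cloud GPU time that cost as much as buying the hardware."""
    return hardware_cost / cloud_rate_per_hour

# e.g. a $10,000 card vs a $2.50/hour cloud GPU -> 4,000 hours to break even
```

Sustained training loads pass break-even quickly, which is why the slide suggests training internally and reserving the public cloud for bursty spike work.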
Hybrid Cloud
Diagram: internal cloud services (OpenShift), apps, and data sources (DB) sit inside the ExxonMobil network; a firewall separates them from the public clouds (AWS, Azure), where IHS containers run.
Considerations for solution placement:
1. Where is your data? (data sovereignty)
2. Where are your customers? (internal/external)
3. What is the bandwidth/latency between system elements?
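Consideration 3 can be made concrete with a rough transfer-time estimate; the simple bandwidth-plus-round-trip model below is an illustrative assumption, not a formula from the talk:

```python
# Sketch: rough transfer time over a link, for the bandwidth/latency check.
# The single-round-trip model is an illustrative simplification.
def transfer_seconds(payload_mb: float, bandwidth_mbps: float, rtt_ms: float) -> float:
    """Approximate seconds to move payload_mb megabytes across the link."""
    return (payload_mb * 8) / bandwidth_mbps + rtt_ms / 1000.0
```

Even a crude estimate like this helps decide whether an app belongs next to its data source or can sit on the far side of the firewall.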
YOUR DIFFERENTIATION DEPENDS ON YOUR ABILITY TO DELIVER INTELLIGENT APPS FASTER
CONTAINERS, KUBERNETES, DEVOPS & DATAOPS ARE KEY INGREDIENTS
Innovation Culture
Cloud-native Applications
AI & Machine Learning
Internet of Things
Virtual GPU
Personal Focus Area
• Success is not about doing things perfectly; it’s about a willingness to change and being honest about where you are. Ultimately, this is far more important than your current abilities.
As a data scientist, all I care about is that I can now deploy a consistent solution (Jupyter Notebook / Python with all required libraries) in a matter of minutes by using OpenShift.
This frees me (and other data scientists) to perform data science rather than worry about architecture and delivery mechanisms. Now that is democratizing data science!
Team Focus Areas
• Consulting with data scientists to create production-worthy, sustainable solutions
• Education: build success skills, paving a path for data scientists to be good developers
• Collaboration/partnering organically across technology and business domains
• Self-service for accessing data and data science templates
• One-click notebooks – a JupyterHub environment in OCP
• Bringing OCP GPUs to users allows for sharing resources and faster modeling
Upstream Data Science Enablement Team
Delivering Agile Data Science solutions with OpenShift … and providing Business Value!