Self-Service Data Science for Leveraging ML & AI on All of Your Data
© 2017 MapR Technologies. MapR Confidential.
Self-Service Data Science for Leveraging ML & AI on All of Your Data:
Introducing the MapR Data Science Refinery
Rachel Silver, Product Manager, Data Science & Analytics
11/16/17
Summary
• Why Companies Invest In ML/AI
• Winning With a Data First Approach
• Introducing the MapR Data Science Refinery
• Deep Dive & Demos
– Ease of Deployment
– Data Exploration
– Extensibility & Collaboration
Why Companies Invest In ML/AI
Where AI Creates Value In The Value Chain
• Project: Smarter R&D and forecasting
• Produce: Optimized production & maintenance
• Promote: Targeted sales & marketing
• Provide: Rich, personal, and convenient user experiences
Source: McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier? (2017)
Project Where The Next Threat Will Come From
Deep security analytics and advanced persistent threat (APT) detection
Retail Bank
• Centralization and visibility of all data from an information security perspective
• Reduced risk of data breaches from DDoS and APT attacks
• Real-time insights into what is happening within the environment
OBJECTIVE
• Early detection of data breaches and suspicious activity
• Aggregate and retain all security-related data in a single central store, then build statistical models to detect abnormal activity within the environment
• Get insights into what insiders are doing within the environment
CHALLENGES
• Existing SIEM solution could not scale
• Current solutions do not work well for “unknown” threats
SOLUTION
• Leverage MapR-DB for fast data ingestion and query performance
• MapR provided the deep storage and machine learning algorithms
• NFS enabled easy integration with the IT ecosystem
Produce More Efficiently
ML aggregation and processing at the edge optimizes production
Oil & Gas company
• Before Edge: Sources 1, 2, ... 1000 feed the MapR core cluster in Houston through a manual process. Time to insight: 48 hrs.
• After Edge: 1000s of oil & drill sources feed the MapR core cluster in Houston through an automated process. Pre-processing is done locally and at the core (custom app + down sampling). Time to insight: <2 hrs.
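The "down sampling" step in the After Edge flow can be pictured with a minimal sketch. The slide does not describe the customer's custom app, so the averaging strategy, the window size, and the function name `downsample` are assumptions made purely for illustration: each edge site averages fixed-size windows of raw sensor readings and sends only the averages on to the core cluster.

```python
from statistics import fmean
from typing import List, Sequence


def downsample(readings: Sequence[float], window: int) -> List[float]:
    """Average consecutive fixed-size windows of readings.

    Hypothetical edge-side step: the deck does not say how the customer's
    app down-samples, so simple window averaging is assumed here.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    return [
        fmean(readings[start:start + window])
        for start in range(0, len(readings), window)
    ]


# Six raw readings collapse to two values sent on to the core cluster.
raw = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]
summary = downsample(raw, window=3)
print(summary)
```

A trailing partial window is averaged as-is, so no readings are silently dropped.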
Promote Personalized Offers in Real Time
Targeting credit card customers using a recommendation engine
Leading Credit Card Company
A global financial services company wanted to offer real-time, localized, and personalized recommendations to its credit card holders using ML/AI
OBJECTIVE
• Increase revenue and customer loyalty through real-time personalized offers generated by a recommendation engine
CHALLENGES
• To be accurate, data had to be updated in real time
• As a global company, its platform has to be consistent and 100% available 24x7, with no downtime
• Must be able to simultaneously ingest (stream) and update data in the same cluster
SOLUTION
• MapR was the only distribution that met the mission-critical needs of the customer and also provided the capability to ingest data continuously into the cluster
• Direct NFS allows data to be continuously ingested directly into their cluster
• MapR-XD’s self-healing capability allowed them to go into production safely
Provide Customers With a Customized Experience
Provide customers with a personalized and convenient experience
Using ML/AI to bring customer understanding to the center of business processes
OBJECTIVE
• Use full knowledge of the customer relationship to inform online interactions
CHALLENGES
• Need to store 20 trillion records
• Training sample size is 400 million records
• The decision trees contained 2 million possible pathways
• Every combination must be evaluated every time a model is used (~15 billion combinations)
SOLUTION
• The MapR Converged Data Platform centralizes analytics and operational apps on one platform, allowing Quantium to make one large infrastructure investment instead of many small siloed ones
• The current cluster has 50 TB of memory and 5,000 CPUs to process and store 5 PB of data
A Winning Approach: Data First
Entry Points in the Data Science Journey
• Contemplators (40%): Uncertain about the benefits of Data Science. Desire easy entry.
• Experimenters (41%): Gartner estimates they solve between 3-20 business problems in three to five years.
• Adopters (20%): Gartner estimates they solve between 10-100 business problems in three to five years.
Source: McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier? (2017)
Source: Gartner, Magic Quadrant for Data Science Platforms (2017)
• Contemplators + Experimenters: 80%!
• Investment in AI is growing at a high rate, but adoption in 2017 remains low
• AI adoption outside of the tech sector is stuck at the early stages of this journey, and many firms report they are uncertain of the ROI
• AI is only deployed into production 12% of the time
Source: McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier? (2017)
Key Traits Of A Successful Data Science Approach
• Seamless Data Access
• Technical Capabilities (a strong digital foundation)
• Leadership From The Top
Seamless Data Access
If it is ALL about the data, then it better be about ALL your data.
ML Models Improve when Trained on Larger Datasets
Instead of relying on assumptions and weak correlations, the presence of more data results in better and more accurate models.
Source: A Survey of Applications of AI Algorithms in Eco-environmental Modelling (2009)
Data Growth Puts A Premium on Efficient Leverage
The amount of data is predicted to double every three years
[Chart: data volume over time]
• 1986: 3 Exabytes of Data
• 2011: 300 Exabytes of Data
• 2016: 2 Zettabytes of Data
• 2019: 4 Zettabytes of Data
Data Diversity
Emails, Call Detail Records, Clickstream, CSV, Documents, PDF, Billing Data, Metadata, JSON, Network Data, Mobile Data, XML, Product Catalog, Medical Records, Text Files, Video, Text Messages, Merchant Listings, Sensor Data, Server Logs, Set Top Box, Social Media, Audio
Source: McKinsey Global Institute: “The Age of Analytics”, Dec. 2016
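The "double every three years" claim can be checked with a line of arithmetic. The sketch below assumes 2 zettabytes is the 2016 figure, which is how the chart labels appear to pair up, and uses a hypothetical helper name `projected_volume`; it is an illustration of the growth rule, not a forecast taken from the cited report.

```python
def projected_volume(base_volume: float, base_year: int, year: int,
                     doubling_years: float = 3.0) -> float:
    """Project data volume assuming it doubles every `doubling_years` years."""
    return base_volume * 2 ** ((year - base_year) / doubling_years)


# Starting from an assumed 2 zettabytes in 2016:
print(projected_volume(2.0, 2016, 2019))  # one doubling period later
print(projected_volume(2.0, 2016, 2025))  # three doubling periods later
```

One doubling period from 2016 lands on the 4 zettabytes shown for 2019, so the chart and the stated growth rate agree with each other.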
Hadoop + Vendor Approach to Data Science
Requires yet another cluster
On Premises: Batch Cluster, Streaming Cluster, NoSQL Cluster, plus a separate Data Science cluster
A Capable Platform With a Strong Digital Foundation
[Diagram: MAPR CONVERGED DATA PLATFORM]
• Runs on-premises, multi-cloud, and at the IoT edge
• APIs: NFS, POSIX, REST, HDFS, JSON, HBase, Kafka, SQL
• Platform services: file store, container store, metadata management
• Applications: custom file apps, Hadoop & Spark apps, real-time BI apps, streaming apps, IoT/edge
• Operational data hub: CDC, contextual user experiences, core business apps, single view, IoT
Real-time Machine Learning Pipelines
A Robust Microservices Framework
Event Streams
• Persistent
• Infinitely replicable
• Re-playable
[Diagram: client & application containers run Model A and Model B against the same persistent stream]
Compare model results live!
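Because the event stream is persistent and re-playable, the same events can be fed to two models and their outputs compared. The sketch below is only an illustration under stated assumptions: an in-memory list stands in for a MapR-ES stream, and `model_a`, `model_b`, and `compare_models` are hypothetical names with made-up threshold rules, not anything shown in the deck.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]
Model = Callable[[Event], bool]


def model_a(event: Event) -> bool:
    """Hypothetical model A: flag readings above a fixed threshold."""
    return event["value"] > 100.0


def model_b(event: Event) -> bool:
    """Hypothetical model B: a lower threshold on the same field."""
    return event["value"] > 80.0


def compare_models(events: Iterable[Event], a: Model, b: Model) -> List[Event]:
    """Replay events through both models and return those they disagree on."""
    return [e for e in events if a(e) != b(e)]


# An in-memory list stands in for a persistent, re-playable event stream.
stream = [
    {"id": 1.0, "value": 50.0},
    {"id": 2.0, "value": 90.0},
    {"id": 3.0, "value": 120.0},
]
disagreements = compare_models(stream, model_a, model_b)
print([e["id"] for e in disagreements])
```

Since the events are kept rather than consumed, a third model could later be compared against the same history without re-collecting data.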
Advice For Leadership
Avoid
• Creating new silos
• Looking for a one-trick pony
• Adopting tools that have unwieldy install, integration, and configuration processes
• Tools that don’t scale to broader enterprise use
Important
• Ensure secure role-based access to all data
• Adopt tools that meet the needs of a broad range of Data Science Teams
• Encourage adoption by making things easy, secure, and complete
Data Science @ MapR
The MapR Data Science Vision
A Holistic Approach To Self-Service Data Science
• MAPR DATA SCIENCE REFINERY: An easy-to-deploy, secure, and extensible data science offering that leverages all existing platform assets
• REFINERY DATA SCIENTISTS: Data Scientist-led product-and-services offerings including Quick Start Solutions (QSS) & Training
• REFINERY PARTNERSHIPS: Expand on what we offer in-product to meet the needs of all data science teams
All built on the MAPR CONVERGED DATA PLATFORM
MapR Data Science Refinery
An Enterprise-ready Data Science Notebook
Provides the ability to work across many engines in one visual space
• Apache Spark: Spark Streaming, SparkSQL, SparkR, and PySpark
• Apache Hive
• Apache Pig
• Apache Drill
• Python
• Shell access to MapR-FS
• Programmatic access to MapR-DB and MapR-ES in Spark
Pluggable Visualization Available via Helium!
[Diagram: MapR POSIX Client for Containers; MapR Converged Client for Containers]
MapR Data Science Refinery Benefits
Leverage Locally, On-premises, or in Cloud
Easy to Deploy
• A Docker image includes all the necessary bits, no more and no less, required to leverage MapR as a persistent data store for your data science output.
• Available on DockerHub
Secure
• Authentication occurs at the container level to ensure containerized applications only have access to data for which they are authorized.
• Communications are encrypted to ensure privacy when accessing data in MapR.
Extensible
• A Dockerfile is also available on GitHub, allowing you to further customize the image as needed to support your specific application needs.
• The Helium Framework enables pluggable visualization
MAPR CONVERGED DATA PLATFORM
• MapR-XD: cloud-scale data store
• MapR-DB: operational database
• MapR-ES: event streaming
• High Availability, Real-time, Unified Security, Multi-Tenancy, Disaster Recovery, Global Namespace
Partner Integration: An Example
We’re enabling our partners to integrate with and use this product
[Diagram: DataScience.com Platform Services; MapR DSR: Zeppelin, Livy, JDBC, MapR Clients]
Demo: Ease of Deployment & Data Exploration
Demo: Ease of Deployment
What’s in the command
docker run --rm -it \
  --cap-add SYS_ADMIN --cap-add SYS_RESOURCE \
  --device /dev/fuse \
  --memory 0 \
  -e MAPR_CLUSTER=my.cluster.com \
  -e MAPR_MEMORY=0 \
  -e MAPR_MOUNT_PATH=/mapr \
  -e MAPR_TZ=America/Los_Angeles \
  -e MAPR_CONTAINER_USER=mapr \
  -e MAPR_CONTAINER_UID=5000 \
  -e MAPR_CONTAINER_GROUP=mapr \
  -e MAPR_CONTAINER_GID=5000 \
  -e MAPR_CLDB_HOSTS=172.24.8.195,172.24.11.200,172.24.10.4 \
  -e MAPR_TICKETFILE_LOCATION=/tmp/maprticket_5000 \
  -e ZEPPELIN_SSL_PORT=9995 \
  -e HOST_IP=172.24.11.62 \
  -e MAPR_HS_HOST=172.24.8.195 \
  -p 9995:9995 \
  -p 10000-10010:10000-10010 \
  -v /tmp/maprticket_5000:/tmp/maprticket_5000:ro \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7
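One way to see what is in the command is to group its parts: fixed container flags, environment variables, port mappings, volume mounts, and the image. The sketch below assembles the same kind of argument list from those groups. The helper name `build_docker_run` is hypothetical, and only a subset of the slide's environment variables is shown to keep the example short; the values themselves are copied from the command above.

```python
import shlex
from typing import Dict, List


def build_docker_run(image: str, env: Dict[str, str], ports: List[str],
                     volumes: List[str]) -> List[str]:
    """Assemble a `docker run` argument list from grouped settings."""
    args = [
        "docker", "run", "--rm", "-it",
        # Capabilities and the FUSE device, as passed in the slide's command.
        "--cap-add", "SYS_ADMIN", "--cap-add", "SYS_RESOURCE",
        "--device", "/dev/fuse",
        "--memory", "0",
    ]
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    for mapping in ports:
        args += ["-p", mapping]
    for mount in volumes:
        args += ["-v", mount]
    args.append(image)
    return args


command = build_docker_run(
    image="maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7",
    env={
        "MAPR_CLUSTER": "my.cluster.com",
        "MAPR_CLDB_HOSTS": "172.24.8.195,172.24.11.200,172.24.10.4",
        "MAPR_CONTAINER_USER": "mapr",
        "MAPR_TICKETFILE_LOCATION": "/tmp/maprticket_5000",
        "ZEPPELIN_SSL_PORT": "9995",
    },
    ports=["9995:9995", "10000-10010:10000-10010"],
    volumes=["/tmp/maprticket_5000:/tmp/maprticket_5000:ro"],
)
# Print a shell-quoted command line for review; nothing is executed here.
print(shlex.join(command))
```

Keeping the environment in a dictionary makes it easy to review or diff the settings for different clusters before anything is launched.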
Demo: Ease of Deployment
How is Security Handled?
$ maprlogin password
[Password for user 'jane' at cluster 'my.cluster.com': ]
MapR credentials of user 'jane' for cluster 'my.cluster.com' are written to '/tmp/janes_ticket'
Job submits as ‘jane’
Demo: Ease of Deployment
Why Livy?
[Diagram: HTTP (RPC) connection to the MAPR CONVERGED DATA PLATFORM: MapR-XD (cloud-scale data store), MapR-DB (operational database), MapR-ES (event streaming)]
Advantages over native Spark Interpreter:
• Jobs are submitted in YARN cluster mode
• Spark context can be shared
• Support for Spark Dynamic Resource Allocation
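The HTTP link in the diagram is Livy's REST interface. As a rough sketch of what a client sends, the code below builds, without sending, the two requests used to start an interactive session and run a statement in it. The host name is hypothetical, and port 8998 is Livy's default rather than a value from the deck; in the Data Science Refinery the Zeppelin Livy interpreter makes these calls for you.

```python
import json
import urllib.request

# Assumed Livy endpoint; 8998 is Livy's default port, the host is hypothetical.
LIVY_URL = "http://livy.example.com:8998"


def post_json(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind: str = "pyspark") -> urllib.request.Request:
    """Request for POST /sessions, which starts an interactive Spark session."""
    return post_json(f"{LIVY_URL}/sessions", {"kind": kind})


def run_statement_request(session_id: int, code: str) -> urllib.request.Request:
    """Request for POST /sessions/{id}/statements, which runs code in a session."""
    return post_json(f"{LIVY_URL}/sessions/{session_id}/statements", {"code": code})


# Several clients can send statements to the same session id,
# which is one way a Spark context ends up shared.
req = run_statement_request(0, "spark.range(10).count()")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Because the Spark driver lives on the cluster side of this HTTP boundary, the notebook container only needs network access to Livy.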
Demo: Extensibility & Collaboration
Collaboration
[Diagram: MapR POSIX Client for Containers connecting to the MAPR CONVERGED DATA PLATFORM: MapR-XD (cloud-scale data store), MapR-DB (operational database), MapR-ES (event streaming)]
Demo: Extensibility & Collaboration
What’s in the command
docker run --rm -it \
  --cap-add SYS_ADMIN --cap-add SYS_RESOURCE \
  --device /dev/fuse \
  --memory 0 \
  -e MAPR_CLUSTER=my.cluster.com \
  -e MAPR_MEMORY=0 \
  -e MAPR_MOUNT_PATH=/mapr \
  -e ZEPPELIN_NOTEBOOK_DIR=/mapr/my.cluster.com/user/mapr/zeppelin/shared-notebooks/ \
  -e MAPR_TZ=America/Los_Angeles \
  -e MAPR_CONTAINER_USER=mapr \
  -e MAPR_CONTAINER_UID=5000 \
  -e MAPR_CONTAINER_GROUP=mapr \
  -e MAPR_CONTAINER_GID=5000 \
  -e MAPR_CLDB_HOSTS=172.24.8.195,172.24.11.200,172.24.10.4 \
  -e MAPR_TICKETFILE_LOCATION=/tmp/maprticket_5000 \
  -e ZEPPELIN_SSL_PORT=9995 \
  -e HOST_IP=172.24.11.62 \
  -e MAPR_HS_HOST=172.24.8.195 \
  -p 9995:9995 \
  -p 10000-10010:10000-10010 \
  -v /tmp/maprticket_5000:/tmp/maprticket_5000:ro \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  maprtech/data-science-refinery:v1.0_6.0.0_4.0.0_centos7
Demo: Extensibility
Adding Deep Learning libraries to the container
[Diagram: Compute alongside Persistent Storage on the MAPR CONVERGED DATA PLATFORM: MapR-XD (cloud-scale data store), MapR-DB (operational database), MapR-ES (event streaming)]
What if this was a box of GPUs?
A Final Comparison
Traditional Hadoop Vendor
On Premises: Batch Cluster, Streaming Cluster, NoSQL Cluster, Data Science cluster