Post on 20-May-2020
The Analytics Pipeline and Data Flow
September 20, 2018
Linton Ward, PhD
IBM Distinguished Engineer
OpenPower Cognitive Solutions
Emmanuel Macron Talks to WIRED About France's AI Strategy
EM: I think artificial intelligence will
disrupt all the different business
models and it’s the next disruption to
come. So I want to be part of it.
Otherwise I will just be subjected to
this disruption without creating jobs
in this country. So that’s where we
are. And there is a huge acceleration
and as always the winner takes all in
this field.
https://www.wired.com/story/emmanuel-macron-talks-to-wired-about-frances-ai-strategy/
Nicholas Thompson business 03.31.18 06:00 am
One Month of Civilian Agency news …
HHS CTO Report Calls Data Silos to Task
A new report from the Department of Health and Human Services’ (HHS) CTO calls
out the department and its individual agencies for keeping their data in silos, and
calls for a department-wide data governance framework.
“Whether surveillance, survey, or claims data, HHS expends an enormous amount
of financial resources to report on the state of the health of the population it serves,”
https://www.meritalk.com/articles/hhs-cto-report-calls-data-silos-to-task/
House Bill to Codify CDM Moves to Senate
“Cyberattacks are escalating at an alarming rate, making it vital that our Federal
agencies have access to programs and tools to help mitigate these risks,”
https://www.meritalk.com/articles/house-bill-to-codify-cdm-moves-to-senate/
New DHS S&T Program Targets Internet, Critical Infrastructure Disruption
The new program – the Predict, Assess Risk, Identify (and Mitigate) Disruptive
Internet-scale Network Events (PARIDINE) project – aims to study Network/Internet-scale
Disruptive Events (NIDE), which can cut internet or network connectivity, leading to
disruptions of “energy and water systems, the finance sector, commerce, and public
safety and emergency communications systems, as well as other essential systems.”
https://www.meritalk.com/articles/new-dhs-st-program-paridine/
NIST Wants to Know: Can You Trust Your IoT?
The draft publication outlines 17 trust-related issues “that may negatively impact
the adoption of IoT products and services,” spanning scalability,
predictability, difficulty in measurement, lack of certification criteria, all the way
down to usability, performance, and reliability.
https://www.meritalk.com/articles/can-you-trust-your-iot/
GAO Releases Updated Cyber Risk Report
The Government Accountability Office (GAO) today released an updated version of a
report it issued in July detailing major cybersecurity challenges facing the Federal
government and critical actions needed to address them.
https://www.meritalk.com/articles/gao-releases-updated-cyber-risk-report/
DHS CIO Says Priorities Include Modernization, Workforce, Supply Chain
The Department of Homeland Security (DHS) is focused on modernizing its mindset to
tackle a host of pressing issues including reducing its reliance on legacy systems,
competing to attract cybersecurity talent, and combating supply chain threats, said DHS
CIO John Zangardi today at the Billington Cybersecurity Summit.
“We’re in a very, very different world than we have been in the past,” said Zangardi.
“I’ve been in government for a long time. We’re really good at routine. But cyber threats
are asymmetrical. The adversary’s not thinking about routine, the adversary is thinking
about how to do things differently.”
https://www.meritalk.com/articles/dhs-cio-says-priorities-include-modernization-workforce-supply-chain/
SEC Looking for Social Media Monitoring Tool
The Securities and Exchange Commission (SEC) on Thursday issued a solicitation
for “a web-based subscription to a Commercial-Off the Shelf (COTS) social media
monitoring tool that provides emailed alerts to SEC staff based on keyword searches
for relevant topics with ability to monitor social media sites.”
https://www.meritalk.com/articles/sec-looking-for-social-media-monitoring-tool/
State Department Looking for Platform to Track, Analyze Online Info
The State Department has issued a request for information for systems that collect
relevant online information to “analyze and track global developments in (near)
real-time.” … The State Department listed its needs for a monitoring system, including:
aiding the ability to verify the credibility of a source; ensuring the accuracy of
machine-generated content from different languages; and distributing information quickly.
https://www.meritalk.com/articles/state-department-looking-for-platform-to-track-analyze-online-info/
The Administration is developing a Federal Data Strategy
to leverage data as a strategic asset to grow the economy,
increase the effectiveness of the Federal Government,
facilitate oversight, and promote transparency.
• Strategy 1: Enterprise Data Governance. Set priorities for
managing Government data as a strategic asset, including
establishing data policies, specifying roles and responsibilities
for data privacy, security, and confidentiality protection, and
monitoring compliance with standards and policies …
• Strategy 2: Access, Use, and Augmentation. Develop policies and
procedures and incent investments that
enable stakeholders to effectively and efficiently access and
use data assets by: (1) improving dissemination, making data
available more quickly and in more useful formats; (2)
maximizing the amount of non-sensitive data shared with the
public; and (3) leveraging new technologies and best
practices to increase access to sensitive or restricted data
while protecting the privacy, security, and confidentiality, and
interests of data providers.
• Strategy 3: Decision-Making and Accountability. Improve the use
of data assets for decision-making and
accountability for the Federal Government, including both
internal and external uses. This includes: (1) providing high
quality and timely information to inform evidence-based
decision-making and learning; (2) facilitating external
research on the effectiveness of Government programs and
policies which will inform future policymaking; and (3)
fostering public accountability and transparency ....
• Strategy 4: Commercialization, Innovation, and
Public Use. Facilitate the use of Federal Government data
assets by external stakeholders at the forefront of making
Government data accessible and useful through commercial
ventures, innovation, or for other public uses. This includes use
by the private sector and scientific and research communities;
by states, localities, and tribes for public policy purposes; for
education; and in enabling civic engagement.
PRESIDENT'S MANAGEMENT AGENDA
Application Transformation
Insight
Cognitive Platforms
Analytic data
Platforms
Digital transformation is driving demand for
new applications, new databases, and new insights
Engagement
Mobile, Web,
Call centers,
Edge & IoT
Engaging
partners, clients,
employees &
machines
Record
Business Logic
Operational Data
Platforms
Operating
business process
flows
Insights,
Trained Models
Scores
Insights,
Trained Models
Scores
Customer
context, queries
Machine data
Customer,
Transaction
Data
Enabling
data-driven
Decisions
New:
In-line analytics
Scoring capability
Context Data
New:
Applications
Data types &
representations
Database types
Systems of Insight Landscape
Enterprise
Data Warehouses
Business Intelligence Tools
Data Ingest
Hadoop Data Lakes
Conventional
Emerging
AI Grid
+Open Source
Python, R
Data Science Workbench
Modern Databases
Modern Business
Intelligence
Data Governance
Statistics Tools
Application
Development
Platform
SQL
+
+File Systems
Success with analytics projects (ways to succeed)
How do we derisk analytics projects?
• Clarity on the question. Apply critical thinking techniques with buy-in
• Enable faster exploration. The data science workbench: create an ad hoc
workflow quickly
• Enable quicker win. Data science sandbox: prototype from data scientist
rather than presentation alone
• Scale to production. AI Grid: multi-tenant, high stability, high efficiency
cluster
Cognitive Platform: Analytic Project Lifecycle
Progression from Data Science Workbench to operationalized insights
Prototype Pilot Scale
Highly Stable Highly Agile
Minimize
Investment
Demonstrate
Value
Operationalize
Value
Optimized
Value / $
Sandbox
Production
Model Build
Common Data
Maintain
Model
Currency
Sustain
Value
Dev Ops Stable
Streamlined
Maintenance
Innovation
Early
Libraries
Unstable
Explainable
Mature
Libraries
Stable
© 2015 IBM Corporation
Welcome to the Waitless World
© 2016 IBM Corporation
The Data Science Workbench
© IBM Corporation, 2017
Workload flow and data flow are key to results
Traditional Business
IoT & Sensors
Collaboration Partners
Mobile Apps & Social Media
Legacy
Data Preparation
Pre-Processing
Training
Dataset
Data Source Model Training Inference
AI Deep Learning
Frameworks
(TensorFlow & Caffe)
Monitor
& Advise
Instrumentation
Iterate
Distributed & Elastic Deep
Learning (Fabric)
Parallel Hyper-Parameter
Search & Optimization
Network Models
Hyper-Parameters
Testing
Dataset
Trained Model
Deploy in
Production using
Trained Model
New Data
Years
of Data
Hours of
preparation
Weeks &
months of
training
Seconds
to results
Heavy IO
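The flow on this slide (data preparation, training and testing datasets, parallel hyper-parameter search, trained model, inference on new data) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the PowerAI or TensorFlow/Caffe tooling named on the slide; the function names, the toy linear model, the 80/20 split, and the learning-rate grid are all assumptions made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def prepare(raw):
    """Data preparation: drop incomplete records, then split 80/20."""
    clean = [(x, y) for x, y in raw if x is not None and y is not None]
    cut = int(0.8 * len(clean))
    return clean[:cut], clean[cut:]          # training dataset, testing dataset

def train(train_set, learning_rate, epochs=2000):
    """Model training: fit y ~ w*x + b with batch gradient descent."""
    w = b = 0.0
    n = len(train_set)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in train_set) / n
        grad_b = sum((w * x + b - y) for x, y in train_set) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(model, data):
    """Mean squared error of a (w, b) model on a dataset."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def search(train_set, test_set, rates):
    """Hyper-parameter search: one training run per learning rate.

    Threads stand in here for the cluster workers of a real deployment.
    """
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(lambda lr: train(train_set, lr), rates))
    return min(models, key=lambda m: mse(m, test_set))

def infer(model, x):
    """Inference: score new data with the trained model."""
    w, b = model
    return w * x + b

# Synthetic "data source": y = 3x + 1 plus noise, with one broken record.
random.seed(0)
raw = [(i / 10, 3 * (i / 10) + 1 + random.gauss(0, 0.05)) for i in range(50)]
raw.append((None, 2.0))
random.shuffle(raw)

train_set, test_set = prepare(raw)
model = search(train_set, test_set, [0.001, 0.01, 0.1])
prediction = infer(model, 2.0)               # close to 3*2 + 1 = 7
```

The shape mirrors the slide: preparation runs once, training is repeated per hyper-parameter setting and dominates the cost, and inference is a single cheap function call.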
Cognitive Systems – Capabilities in the Data Science Workbench
Structured
Text
Audio
Image
Video
The Data Science Workbench comprises a set of capabilities
Data Platforms
YARN (MapReduce)
Spark
Streams
Visualization
Exploration
Interpretive
Environments
NLP Text
Analytics
Graph
Analytics
Image
Analytics
Machine Learning
Deep Learning
Analysts
Toolbox
HPDA
HPC
HDFS
Spectrum
Scale
OpenStack Swift
Cloud Object Store
Cassandra Redis Mongo
Geospatial
Analytics
Streaming
Analytics
Statistics &
Classification
Titan
Neo4j
Ingest
Streaming
Message
Batch
Postgres
Execution Frameworks and AI Grid
Data Science Workbench
IBM Spectrum Conductor
AI Grid
PowerAI: Optimized Open Source ML Frameworks
Large Model Support (LMS)
Distributed Deep Learning (DDL)
PowerAI: Open Source ML Frameworks
PowerAI Enterprise
Distribution
Package Manager
Efficient multi-tenant
Resource Scheduler
Python & R Ecosystem
Deep Learning Impact
PowerAI Vision
Productivity & Simplification
Data & Model Management,
Visualize, Advise
Auto-hyperparameter optimization
End to End Image Classification
DRIVERLESS AI Auto ML
Scale DL to Hundreds of GPUs
DL for much higher resolution
ANACONDA Accelerates Adoption of
Open Data Science for Enterprises
• Easy to install
• Agile data exploration
• Powerful data analysis
• Simple to collaborate
• Accessible to everyone
PYTHON & R OPEN SOURCE ANALYTICS
NumPy SciPy Pandas Scikit-learn Jupyter/IPython
Numba Matplotlib Spyder Numexpr Cython Theano
Scikit-image NLTK NetworkX IRKernel dplyr shiny
ggplot2 tidyr caret PySpark & 720+ packages
IBM AI / Data Science Workbench: DSX Local
DSX (Data Science Experience)
IBM ML
Libraries
Jupyter Notebooks & RStudio, Model & Data
Management, Hyper-parameter Tuning, GUI
Spark, Data Lake, Connectors to DBs
Cognitive Systems Data Stores
H2O &
Anaconda
PowerAI DL
Distribution
Legend: Non-IBM Products
Hadoop
Spark
Object Store
PowerAI
Deep Learning Frameworks
DDL: Distributed Deep Learning
Hyper-Parameter Tuning, GUI
Spectrum Conductor
Tape
Servers & Storage
IBM Software Defined Infrastructure
Multi-scale Infrastructure for High Performance Computing & Analytics
Workload Aware Scheduling
Shared Resource Management
High Performance Computing
Design / Simulation / Modeling
Hybrid Cloud Infrastructure
‘New-gen Workloads’
Hadoop, Spark, Containers
Disk Flash Power
Shared Multi-tier Data Management
Cloud
IBM Spectrum Conductor
Faster Time to Results
• Proven High-performance scalable resource and job scheduler
• Multitenant resource sharing
Simplified Deployment & Management
• Complete solution: scheduling, monitoring, alerting, reporting &
diagnostics
• Lifecycle management supporting multiple concurrent and different
versions
Lower Infrastructure Costs with Optimized Resource Sharing
Coming Soon
Secure Multi-tenant, deploy and manage modern computing frameworks & services
Workload Management
Services Management
Resource Management and Orchestration
Services and Support
Monitoring and Reporting
• Enhanced Notebook &
Anaconda Integrations
• Job Dependencies
• DSX Integration
• Fine Grained Resource
Allocation
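To make "multitenant resource sharing" concrete, here is a small sketch of one common sharing policy, weighted max-min fairness. It is an illustration only and is not claimed to be how IBM Spectrum Conductor allocates resources; the function name, the tenant names, the weights, and the demands are hypothetical.

```python
def fair_share(total_slots, tenants):
    """Weighted max-min fair split of total_slots among tenants.

    tenants maps a tenant name to (weight, demand). Weights must be
    positive. Both values are hypothetical inputs for this sketch.
    """
    alloc = {name: 0.0 for name in tenants}
    active = {name for name, (_, demand) in tenants.items() if demand > 0}
    remaining = float(total_slots)
    while active and remaining > 1e-9:
        weight_sum = sum(tenants[n][0] for n in active)
        # Offer each active tenant its weighted share of what is left.
        offer = {n: remaining * tenants[n][0] / weight_sum for n in active}
        # Tenants whose outstanding demand fits in their offer are served fully.
        served = {n for n in active if tenants[n][1] - alloc[n] <= offer[n]}
        if not served:
            # Everyone wants more than their share: hand out offers and stop.
            for n in active:
                alloc[n] += offer[n]
            break
        for n in served:
            need = tenants[n][1] - alloc[n]
            alloc[n] += need
            remaining -= need
        active -= served
    return alloc

# Three hypothetical tenants sharing a 100-slot cluster: (weight, demand).
tenants = {
    "analytics": (1, 10),
    "etl": (1, 60),
    "deep_learning": (2, 100),
}
shares = fair_share(100, tenants)
```

With these inputs the small "analytics" tenant receives its full demand of 10 slots, and the 90 slots it leaves unused are split 1:2 between "etl" (30) and "deep_learning" (60), so no capacity sits idle while another tenant is waiting.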
Cognitive Systems are built with optimized hardware and software
Open Source
Software
Partner Software
Industry Solutions
Dev Ecosystem
Accelerator Roadmaps
Open Accelerator Interfaces
Not Just About Hardware Design
It’s about co-optimized hardware + software
which just work for Machine Learning,
Deep Learning, and AI
Optimized Libraries
Figure labels: 9 Days; Recognition; 54x Learning runs with Power 8
What will you do?
Iterate more and create more accurate models?
Create more models?
Both?
Figure labels: 4 Hours (repeated 12 times)
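One way to read the figure labels, assuming "9 Days" is the original training time and "4 Hours" the accelerated one (an interpretation of the slide, not something it states): 9 days is 216 hours, which holds 54 four-hour runs, matching the "54x" label. The variable names below are illustrative.

```python
from datetime import timedelta

baseline = timedelta(days=9)        # "9 Days" label on the slide
accelerated = timedelta(hours=4)    # "4 Hours" label on the slide

# How many accelerated runs fit in the time of one baseline run.
runs_per_baseline = baseline // accelerated   # 216 h // 4 h
speedup = baseline / accelerated              # same ratio as a float
```

Under that reading, the slide's question "What will you do?" is about how to spend those 54 runs: more iterations on one model, more models, or both.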
Faster Training Time with Distributed Deep Learning
libGLM (C++ / CUDA
Optimized Primitive Lib)
Distributed Training
Logistic Regression Linear Regression
Support Vector
Machines (SVM)
Distributed Hyper-
Parameter Optimization
More Coming Soon
APIs for Popular ML Frameworks
Snap ML
Distributed GPU-Accelerated Machine Learning Library
(coming
soon)
Snap Machine Learning (ML) Library
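The slide pairs logistic regression with distributed training. The sketch below shows the general data-parallel idea in plain Python: each worker computes a gradient on its own shard, and the partial gradients are summed before one shared update. It does not use or imitate the Snap ML API, runs the "workers" sequentially, and every name in it (`local_gradient`, `train_distributed`, the toy data) is an assumption for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_gradient(shard, w, b):
    """Log-loss gradient summed over one worker's shard of (x, y) pairs."""
    gw = gb = 0.0
    for x, y in shard:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(shard)

def train_distributed(shards, learning_rate=0.5, epochs=2000):
    """Data-parallel gradient descent for one-feature logistic regression."""
    w = b = 0.0
    for _ in range(epochs):
        # Each worker would do this step on its own node or GPU.
        parts = [local_gradient(shard, w, b) for shard in shards]
        # Reduce step: sum the partial gradients, average over all examples.
        n = sum(p[2] for p in parts)
        gw = sum(p[0] for p in parts) / n
        gb = sum(p[1] for p in parts) / n
        # Every worker applies the same update, so the models stay in sync.
        w -= learning_rate * gw
        b -= learning_rate * gb
    return w, b

# Toy dataset: label is 1 when x >= 2.5, split round-robin over 4 workers.
data = [(i / 10.0, 1 if i >= 25 else 0) for i in range(50)]
shards = [data[k::4] for k in range(4)]

w, b = train_distributed(shards)

def predict(x):
    return sigmoid(w * x + b) >= 0.5
```

Because the summed shard gradients equal the gradient over the full dataset, training on four shards gives the same model as training on one machine (up to floating-point rounding); the benefit of distribution is that the per-shard work can run at the same time.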
An Optimized AI Infrastructure Stack
Data Platform
Applications and Services
Cognitive APIs (e.g., Watson)
In-House APIs
Machine & Deep Learning Libraries & Frameworks
Distributed Computing
Data Lake & Data Stores
Segment Specific:
Finance, Retail, Healthcare
Speech, Vision,
NLP, Sentiment
TensorFlow, Caffe,
SparkML
Spark, MPI
Hadoop HDFS,
NoSQL DBs
Accelerated Infrastructure
Accelerated Servers
Storage
PowerAI
AI Grid
Open Source and ISV Tools
Function Specific
Finance, Retail, Healthcare
Open Source Programming Ecosystem
Python, R, etc
Languages and
Libs
Data Science
Workbench
Open Source
Software
Partner Software
Industry Solutions
Dev Ecosystem
Accelerator Roadmaps
Open Accelerator Interfaces
Optimized Libraries
Time to value for new intelligence
Data Science Productivity
Data Productivity
AI for the rest of us
“We can do new science”
Solve larger problems
Solve previously intractable problems