Driving Real Insights Through Data Science

The Journey to Becoming a Data-Driven Enterprise

Pivotal Big Data Roadshow 2015

2© Copyright 2015 Pivotal. All rights reserved.

Where we’re going today…

3 Great Keynotes• Journey to a Data-driven Enterprise

• Data Science Use Cases

• Streaming Data and Predictive Analytics

Stock Inference Demo and Architecture Overview

Intensive hands-on training sessions


Today’s Agenda

12:00 PM – 12:45 PM - Check-In & Lunch

12:45 PM – 1:00 PM – Welcome and Agenda Review

1:00 PM – 1:PM AM – How Pivotal’s Tools Help Drive Value from Data Science

1:20 PM – 2:20 PM – Accelerating the Generation of New Insights – R&D Use Case

Review and Demo

2:20 PM – 2:30 PM – Coffee Break

2:30 PM – 3:00 PM – Manufacturing Use Case Review and Demo

3:00 PM – 3:15 PM – Closing Remarks

© Copyright 2015 Pivotal. All rights reserved.

MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS -

AND THE ECONOMY.

JEFF IMMELT, CEO, GE

“

”JEFF IMMELT, CEO, GE


THE POWER OF 1

RX

IncreasingFreight Utilization Rail

PredictiveMaintenance Healthcare

PredictiveDiagnostics Power

Driving Outcomes That Matter

One Percent Improvement Equals$27B

Industry Value byReducing System

Inefficiency

$63BIndustry Value byReducing Process

Inefficiency

$66BIndustry Value with

Efficiency ImprovementsIn Gas-fired Power

Plant FleetsSource: General Electric


DATA-DRIVEN ENTERPRISE JOURNEY

STORE• Structured

• Unstructured

• High Volume

• High Velocity

ANALYZE• Predictive Analytics

• Machine Learning

• Advance Data Science

• Realtime Analytics

DEVELOP• Advanced Analytic Pipelines

• Realtime Analytical Applications

• Global Scale Data-Driven Applications

• Enterprise, Consumer, and Mobile

INNOVATE• Agile Dev Expertise

• DevOps

• Microservice

• Continuous Delivery

• Closed Loop Applications

AGILE DEVELOPMENT

BIG DATA PREDICTIVE ANALYTICS

CLOUD NATIVE PLATFORM


0% of CIOs think their IT infrastructure is fully prepared for big data (3)

30% of companies have deployed advanced analytics, 11% big data analysis (4)

44% of new applications failed to meet performance expectations (5)

2X90% of companies allocate at least 2X more cloud capacity than needed to ensure performance (6)

But…

80% of CEOs thinking data mining and analysis are strategically important (1)

4% of companies use analytics effectively (2)

(1) 2015 PWC CEO Survey; (2)2013 Baine and Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernest and Young - 2014 Enterprise IT Trends and Investments; (5) 2014 Riverbed Tecnologies - The Transformers; (6) 2014 ElasticHosts CIO Study

LARGE ENTERPRISE BIG DATA TROUBLE

http://download.pwc.com/gx/ceo-survey/assets/pdf/pwc-18th-annual-global-ceo-survey-jan-2015.pdf

http://www.bain.com/publications/articles/the-value-of-big-data.aspx

http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?%20subtype=XB&infotype=PM&appname=CHQE_GB_TI_USEN&htmlfid=GBE03612USEN&attachment=GBE03612USEN.PDF%23loaded

http://www.ey.com/Publication/vwLUAssets/EY-CIO-Klub-Enterprise-IT-trends-and-Investment-survey-2014/$FILE/EY-CIO-Klub-Enterprise-IT-trends-and-Investment-survey-2014.pdf

http://media-cms.riverbed.com/documents/Riverbed_Transformers+Report_English_March2014.pdf

http://www.businesscloudnews.com/2014/09/04/server-capacity-over-provisioned-as-cios-bring-legacy-habits-to-cloud-platforms/


BIG DATACHASM

70%of data

generated by customers

80%of data stored

3%prepared for

analysis

0.5%being

analyzed

<0.5%being

operationalized

9

THE DATA DIVIDE


Software Is Eating The World

Data Is Fueling Software

SOFTWARE IS EATING THE WORLD


WE CHOSE PIVOTAL BECAUSE WE BELIEVE IT PROVIDES A 360-DEGREE VIEW OF THE PROCESS.

FROM A DATA SCIENCE AND DATA TECHNOLOGY PERSPECTIVE, IT MEANS DELIVERING BEST-IN-

CLASS DATA TECHNOLOGIES AND ENABLING THEM ON THEIR PLATFORM.

“

”


ACROSS INDUSTRIES


THE NEW DATA IMPERATIVES

ConvergedData & Cloud

OpenData-DrivenApps


THE BIG DATA PROBLEM

Fragmentation ConstraintsComplexity


• Remove Lock-in

• Leverage Ecosystem

• Co-innovate

GUIDING PRINCIPLES IN THE NEW ERA

OPEN AGILE CLOUD-READY

• Shorten innovation cycles

• Reduce TCO

• Improve TTM

• Solve business problems

• Avoid lock-in

• Appropriate security


JOURNEY TO A DATA-DRIVEN ENTERPRISE

Deploy analytic apps and automate at scale

Perform advanced analyticsDiscover insights

Modernize data infrastructure





DATA-DRIVEN COMPANIES:

USE MODERN DATA INFRASTRUCTURE


MODERNIZE DATA INFRASTRUCTURE

Elastic, Scale-outstorage and processing

Flexible data types and pipelining

ETL on demand: low operational costExpanded use cases

Higher quality analyticsLowered storage/processing cost

Less fragmented ecosystemReduced vendor lock-in

REQUIREMENTS BENEFITS

Cloud friendly and open-source based






STRATEGICALLY USE ADVANCED ANALYTICS


ADVANCED ANALYTICS

Leverage existing skills and toolsRapid time to insights

Internet of Things use casesRapid time to insights

Solve business problemsPredictive insights: proactive execution


Machine learning and advanced analytics

010101010101010100101010101010101100101010

SQL- compliant batch and interactive queries

Massive stream processing

0101010101010101001010

1010101010110010101010

10101010





INNOVATE AT SCALE



ANALYTIC APPS AND AUTOMATION AT SCALE

Reduced time to actionLow ‘analytics app-dev’ integration cost

Reduced time to insightsFlexible ingestion: low operating cost

High performance: low operating costTransactional safety: business critical ops


Low-latency, distributed in-memory transactions

Resilient, scale-out messaging and object storage

Agile analytic app-dev with enterprise PaaSPaaS


JOURNEY TO A DATA-DRIVEN ENTERPRISE




Pivotal Data Sciencehelps you move from BI to

Data Science

Pivotal Labs helps you move to an agile development of apps at scale

Pivotal Data Engineeringhelps you move from data administration to data engineering


PIVOTAL BIG DATA SUITE


Open sourcing all Pivotal Big Data Suite components including:

WORLD’S FIRST OPEN SOURCED BIG DATA PORTFOLIOBUILDING ON SUCCESS OF CLOUD FOUNDRY FOUNDATION

BUILT FOR ENTERPRISES

Pivotal GemFireApache Geode

Apache HAWQPivotal HDB

PivotalGreenplum Database


BUILT FOR ENTERPRISES

Value added features: enterprise grade performance + robustness without lock-in• Advanced Query Optimization in analytics

• WAN replication and continuous query in transactional processing

Flexible Deployment models: align to business objectives and needs• Balance cost objectives with policy and compliance requirements

• Leverage Pivotal’s pre-integration + certification on supported configurations

Enterprise grade support: one throat to choke for the suite• Focus on business problems – not on lifecycle management

• Expert support on Big Data Suite means reduced business risk


• Common core for Hadoop ecosystem

• Rapidly accelerated certifications, ecosystem development and enterprise-grade quality

OpenDataPlatform.org

OPEN

http://Opendataplatform.org/


AGILE




Spring XD Spark Pivotal HD & Open Data Platform

Pivotal Greenplum Database

Pivotal HDB Rabbit MQ

Redis

Pivotal GemFire Pivotal BDS on PCF

Pivotal Cloud Foundry


CLOUD-READY

COMMODITY HARDWARE APPLIANCE HYBRID CLOUDCLOUD

IaaS IaaS

PAAS


DATA-DRIVEN ENTERRPRISE JOURNEY WITH PIVOTAL BIG DATA SUITE

STORE• Structured

• Unstructured

• High Volume

• High Velocity

ANALYZE• Predictive Analytics


• Advance Data Science

• Realtime Analytics

DEVELOP• Advanced Analytic Pipelines

• Realtime Analytical Applications

• Global Scale Data-Driven Applications

• Enterprise, Consumer, IoT, and Mobile

INNOVATE• Agile Dev Expertise

• DevOps

• Microservices

• Continuous Delivery

• Closed Loop Applications

AGILE DEVELOPMENT

BIG DATA PREDICTIVE ANALYTICS

CLOUD NATIVE PLATFORM

Spring XD

Spark

Pivotal HD & Open Data Platform

Spring XD


Pivotal HDB

Spring XD

Pivotal GemFire

Rabbit MQ

Spring Cloud

Pivotal BDS on PCF

Pivotal Cloud Foundry

Pivotal LabsData ScienceData Engineering


FOR FURTHER INFO, CHECKOUT…

• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data

• Pivotal Blog @ http://blog.pivotal.io

• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal

• Pivotal Academy @ https://pivotal.biglms.com

Or reach out to your local Pivotal Account Executive…

36© Copyright 2015 Pivotal. All rights reserved. 36© Copyright 2013 Pivotal. All rights reserved.

Pivotal Data Science Overview and Use CasesPivotal Big Data Roadshow


DATA SCIENCE?

App Development

Analytics

Business IntelligenceReporting

Visualization

Dashboards

Insights Big Data

Machine LearningStatistics

MathematicsTime Series

Algorithms

Databases

Software

Modeling

Queries

Real-Time

Sensors

Predictive Models

ETL

Research

Hadoop

Distributed Computing

MapReduce

SQL

In-Memory

OLAP

Text Mining

Unstructured Data

Open Source

Decision Science

Ad Hoc Queries

Hacking

In-Database Analytics

Internet of Things

Data Cleansing

Sentiment


• ETL

• Unstructured

• Data Cleansing

• Sensors

Data Related

• Algorithms

• Mathematics

• Statistics

• Econometrics

• Predictive Modeling


• Text Mining

• Sentiment

• Map Reduce

Fields of Study & Techniques

• Dashboards

• Insights

• Visualization

• Ad Hoc Queries

• Reporting

Business Intelligence

• Software

• In-Database Analysis

• Distributed Computing

• Hadoop

• Open Source

Implementation

• Big Data

• Decision Science

• Internet of Things

• Real-Time

• Hacking

• In-Memory

Industry Buzzwords


What is Data Science?

The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.

DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST


Gene Sequencing

Smart Grids

COST TO SEQUENCE ONE GENOMEHAS FALLEN FROM $100M IN 2001 TO $10K IN 2011TO $1K IN 2014

READING SMART METERSEVERY 15 MINUTES IS

3000X MOREDATA INTENSIVE

Stock Market

Social Media

FACEBOOK UPLOADS250 MILLION

PHOTOS EACH DAY

Billions of Data Points

Oil Exploration

Video Surveillance

OIL RIGS GENERATE

25000DATA POINTS PER SECOND

Medical Imaging

Mobile Sensors


What is Big Data Analytics?

DescriptiveAnalytics

WHAT HAPPENED?DiagnosticAnalytics

WHY DID IT HAPPENED?

BI

Data Science

PredictiveAnalytics

WHAT WILL HAPPEN?

PrescriptiveAnalytics

HOW CAN WE MAKE IT HAPPEN?

Hindsight

Insight

Foresight

Complexity

Value ofAnalytics

($)


P L A T F O R M

Data Science Toolkit

KEY TOOLS KEY LANGUAGES

SQL


Scalable, In-Database ML

• Open Source https://github.com/madlib/madlib• Works on Greenplum DB, HAWQ and PostgreSQL• In active development by Pivotal• Downloads and Docs: http://madlib.net/

https://github.com/madlib/madlib



http://madlib.net/

http://madlib.net/

http://madlib.net/


Functions

Supervised LearningRegression Models• Cox Proportional Hazards Regression• Elastic Net Regularization• Generalized Linear Models• Linear Regression• Logistic Regression• Marginal Effects• Multinomial Regression• Ordinal Regression• Robust Variance, Clustered Variance• Support Vector MachinesTree Methods• Decision Tree• Random ForestOther Methods• Conditional Random Field• Naïve Bayes

Unsupervised Learning• Association Rules (Apriori)• Clustering (K-means) • Topic Modeling (LDA)

StatisticsDescriptive• Cardinality Estimators• Correlation• SummaryInferential• Hypothesis TestsOther Statistics• Probability Functions

Other Modules• Conjugate Gradient• Linear Solvers• PMML Export• Random Sampling• Term Frequency for Text

Time Series• ARIMA

Aug 2015

Data Types and Transformations• Array Operations• Dimensionality Reduction (PCA)• Encoding Categorical Variables• Matrix Operations• Matrix Factorization (SVD, Low Rank)• Norms and Distance Functions• Sparse Vectors

Model Evaluation• Cross Validation

Predictive Analytics Library


A single address for everything analyticsAnalytics with Pivotal

Time-to-Insights

FORECASTING CLUSTERING

REGRESSION

CLASSIFICATION

OPTIMIZATION


Smart Systems = Sensors + Digital Brain + Actuators

Problem Formulation

Modeling Step

Data StepApplication Step

Data Science forBuilding Models

Sensors & Actuators

Data Lake


Data Science Use Cases


Financial Services


Identifying and Pricing Cross-Sell Opportunities

CUSTOMER

A global financial services provider

BUSINESS PROBLEM

Identify cross-sell opportunities between two business arms of a financial institution.

CHALLENGES

Integration of large-scale data originating from multiple data warehouses. Developing predictive models to identify novel cross-sell opportunities within the financial institution. Evaluate the identified cross-sell opportunities by their revenue potential.

SOLUTIONS

Fast integration of data in Pivotal Greenplum Database.

Predictive models and evaluation of profitability:– Association rule. – Logistic regression for each product

offered.– Estimation of revenue opportunity.

On-demand reporting and visualization via custom dashboards connected to in-database models.

Identified multi-million dollar opportunities for the bank.


Financial Compliance

BUSINESS PROBLEMEnsure compliance with Dodd-Frank and Basel Committee regulationsIdentify underlying risk and fraud while reducing the compliance department’s overburdened

Emails Chats Trades

Transactions Policy Securities

Phone Calls Watch Lists …

Financial complianceData Lake

Data integration

Data clean up Modeling Classification

and ranking

Analyst user interfacesFeedback

Analytics

Analyst feedbackData integration: e.g., append trade information with email and chat communications

Data cleanup: e.g., identify newsletters and spam emails

Modeling: • Predictive modeling to flag

messages and trades• Graph and cohort analysis

Analyst feedbackReviewed fraud instances included in periodic model refreshes

SOLUTION A data lake platform coupled with cutting edge data

science techniques Flexible user interface to promote an adaptive,

continuously learning compliance framework


Telco & Mobile


Subscriber Micro-SegmentationCUSTOMER

A major telco with cable & VOD, internet, and phone business unitsBUSINESS PROBLEM

Better understand aggregated subscriber behavior to drive business strategy using newly available data sources

CHALLENGES

▪ Large quantities of deep packet inspection data and set top box data that had not been analyzed before

▪ Needed to incorporate internet usage and TV consumption information into pre-existing subscriber segments

SOLUTION

▪ Generated new subscriber segments that incorporated features based on consumption of TV and internet services across a variety of devices

▪ Crossed new segments with existing segments to generate new micro-segments for cross-sell/upsell and new product development opportunities

Customized Micro-Segments


Newly Identified Behavior-Based Segments S

ubsc

riber

s

Moderates

OTT & Data Heavyweights

Portable OTT Entertainment Seekers

iPhone Heavy

Android Heavy

iPad Heavy

In-Home OTT Entertainment Seekers

In-Home Native Content Seekers

VOD Heavy

TV Heavy

http://images.search.yahoo.com/images/view;_ylt=A0PDoKrx_ZxQMlEARk6JzbkF;_ylu=X3oDMTBlMTQ4cGxyBHNlYwNzcgRzbGsDaW1n?back=http://images.search.yahoo.com/search/images?p=pandora&n=30&ei=utf-8&fr=my-myy&tab=organic&ri=13&w=512&h=512&imgurl=wp.appadvice.com/wp-content/uploads/2010/10/Pandora-RadioLarge.jpg&rurl=http://appadvice.com/appnn/2010/10/jailbreak-pandora-hack-kills-skip-limits&size=33.9+KB&name=Pandora+Radio+is+one+of+the+greatest+apps+out+there.+This+service+allows+you+to+listen+to+an+unlimited&p=pandora&oid=05c6d4baf523c968148384275a002c39&fr2=&fr=my-myy&tt=Pandora+Radio+is+one+of+the+greatest+apps+out+there.+This+service+allows+you+to+listen+to+an+unlimited&b=0&ni=96&no=13&ts=&tab=organic&sigr=12bfksgbn&sigb=130293gs1&sigi=122mjl9qr&.crumb=hxLAwDIjGQy

http://images.search.yahoo.com/images/view;_ylt=A0PDoKrx_ZxQMlEARk6JzbkF;_ylu=X3oDMTBlMTQ4cGxyBHNlYwNzcgRzbGsDaW1n?back=http://images.search.yahoo.com/search/images?p=pandora&n=30&ei=utf-8&fr=my-myy&tab=organic&ri=13&w=512&h=512&imgurl=wp.appadvice.com/wp-content/uploads/2010/10/Pandora-RadioLarge.jpg&rurl=http://appadvice.com/appnn/2010/10/jailbreak-pandora-hack-kills-skip-limits&size=33.9+KB&name=Pandora+Radio+is+one+of+the+greatest+apps+out+there.+This+service+allows+you+to+listen+to+an+unlimited&p=pandora&oid=05c6d4baf523c968148384275a002c39&fr2=&fr=my-myy&tt=Pandora+Radio+is+one+of+the+greatest+apps+out+there.+This+service+allows+you+to+listen+to+an+unlimited&b=0&ni=96&no=13&ts=&tab=organic&sigr=12bfksgbn&sigb=130293gs1&sigi=122mjl9qr&.crumb=hxLAwDIjGQy


Opportunities for Data-Driven Decisions in Pharma


Data driven drugs: From discovery to delivery

RICH DATA SOURCES• Molecular data

• Cellular drug screens• Animal models

• Clinical data including notes, images, markers (e.g. genomics, lab results)

• Sensor and assay data• Internal and partner/purchased

external data• Contact center data• Patient registries, public and federal

data, clinical partnerships

Clinical Trials

Manufacturing

Marketing

Distribution and surveillance

Drug discovery + development


A pipeline of sensors and opportunities for optimizing outputInternet of Things in Manufacturing

Input materials Mix Incubate Filter Centrifuge Final Product

0 20 40 60 80 100 120 140 160 180 200

0

5

10

15

20

25

30

Sensors

High-Content Screens

TEM

PTIME

Abs

orba

nce

Elution volume

Velo

city

TimeAutomated raw

materials mixing


Vaccine Potency PredictionCUSTOMER A major pharmaceutical company

BUSINESS PROBLEMPredict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.

CHALLENGES Customer’s data model was not optimal for

running analytical queries Manual data quality issues Data capture was performed with varying

consistency due to high cost associated with manual data collection

SOLUTION Introduced a new data model to make data

accessible and enable analytics (including LIMS and DeltaV)

Built automated outlier detection/correction methods to address manual data entry quality issues

Devised imputation methods to deal with data completeness issues

Built predictive models with high accuracy


http://blog.pivotal.io/data-science-pivotalCheck out the Pivotal Data Science Blog!


FOR FURTHER INFO…





• Or reach out to your local Pivotal Account Executive…


Pivotal Data Science Labs: Packaged Services

• Analytics Roadmap

• Prioritized Opportunities

• Architectural Recommendations

• Hands-on training

• Hosted data on

Pivotal Data stack

• Results review &assessment

• On-site MPP analytics training

• Analytics tool-kit

• Quick insight (2 weeks)

• Prof. services

• Data science model building

• Ready-to-deploymodel(s)

• Prof. services

• Data sciencemodel building

• Ready-to-deploymodel(s)

LAB PRIMER(2-Week Roadmapping)

LAB 600(6-Week Lab)

LAB 1200(12-Week Lab)

LAB 100(Analytics Bundle)

DATA JAM(Internal DS Contest)


Data Streaming and Predictive AnalyticsUsing Pivotal Big Data Suite


Converging Trends

InnovationNew Data

New Processes

New Insights

The Journey to the Data-Driven Enterprise

Data Scienceand Machine

LearningBig DataIoT, Mobile Apps,

Social Media


HDFS

Data Lake

Ingest Store Analytics

Hard to changeLabor intensive

Inefficient

Coding basedNo real-time informationBased on expensive ETL

Migrating from a Reactive, Static and Constrained Model…


HDFSData Lake Expert System / Machine Learning

In-Memory Real-Time

Data

Continuous LearningContinuous Improvement

Continuous Adapting

Data Stream Pipeline

Multiple Data SourcesReal-Time ProcessingStore Everything

To Pro-Active, Self-Improving, Machine Learning Systems


New York Times Research: http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

“50-80% OF THE TIME ON DATA

SCIENCE PROJECTS IS SPENT ON DATA WRANGLING

”


Data FeedsStream Processing

Expert Systems Machine Learning

Historical Data

Business ValueSmart

Decisions

Still…

HDFS

Data Lake


Ingest Transform SinkSpringXD

GemFire

Data Stream Needs an Agile, Scalable and Fast Solution

HAWQ GPDBDataLake



DistributedComputing

In-Memory Real-Time Data

Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining

Expert System / Machine Learning

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable HAWQ GPDB

DataLake


INGEST / SINK PROCESS ANALYZE

• No coding required

• Dozens of built-in connectors

• Seamless integration with Kafka, Sqoop

• Create new connectors easily using Spring

• Call Spark, Reactor or RxJava

• Built-in configurable filtering, splitting and transformation

• Out-of-box configurable jobs for batch processing

• Import and invoke PMML jobs easily

• Call Python, R, Madlib and other tools

• Built-in configurable counters and gauges

Spring XDState of the Art Data Pipeline Automation




GemFire Provides Scalable, Low-Latency Data Access, Storage and Event Processing


GemFire

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable HAWQ GPDB

DataLake


GemFire

• In-Memory Enterprise Data Grid• Horizontally Scalable, Consistent,

Highly Available • Event handling• Continuous Queries• Enterprise Data Geo Distribution

In-memory Real Time Data




Pivotal Provides SQL Based Advanced Analytics


GemFire

ExtensibleOpen-SourceFault-TolerantHorizontally Scalable

DataLake HAWQ GPDB


HAWQ

• Massively Parallel Processing RDBMS on HADOOP

• ANSI SQL on Hadoop• Extremely high

performance for analytics (not like Hive)

• Stores all data directly on HDFS

• Open-Source

Advanced SQL analytics in Hadoop

Combining SQL with Hadoop is key for analytics

SQL remains #1 choice for Data Science



Developers and Data Scientists Can Focus onthe Business Value of Data

GemFire


DataLake HAWQ GPDB


Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps


DistributedComputing Real-Time Data Expert Systems &

Machine LearningAdvancedAnalytics

HDFSData Lake


Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps


HDFSData Lake

GemFire HAWQ GPDB

SpringXD


“SO WE ARE MOVING TO A WORLD WHERE THE

MACHINES WE WORK WITH ARE NOT JUST INTELLIGENT; THEY ARE BRILLIANT.THEY ARE SELF-

AWARE, THEY ARE PREDICTIVE, REACTIVE AND SOCIAL. IT'S A WORLD WHERE INFORMATION

ITSELF BECOMES INTELLIGENT AND COMES TO US AUTOMATICALLY WHEN WE NEED IT WITHOUT

HAVING TO LOOK FOR IT.

”MARCO ANNUNZIATA, GE


DemoPowered by Pivotal Big Data Suite


It's all about DATA

Data SourcesLook for patterns

Prediction

Transform Sink

SpringXD

ExtensibleOpen-SourceFault-TolerantHorizontally ScalableCloud-Native

Machine Learning

Enrich Filter

Split

Dashboard

Indicators

1

2

Predict

3

Real data

Simulator

/Stocks

/TechIndicators

/Predictions


“THE REAL OPPORTUNITY FOR

CHANGE...SURPASSING THE MAGNITUDE OF THE CONSUMER INTERNET...IS THE INDUSTRIAL

INTERNET, AN OPEN, GLOBAL NETWORK THAT CONNECTS PEOPLE, DATA AND MACHINES.

”JEFF IMMELT, CEO, GE







• Or reach out to your local Pivotal Account Executive…

BUILT FOR THE SPEED OF BUSINESS


Accelerating the Generation of New Insights

October 27, 2015

Sarah Aerni, Data ScienceAntonio Petrole, Data Engineering


Gene Sequencing

Smart Grids

COST TO SEQUENCE ONE GENOMEHAS FALLEN FROM $100M IN 2001 TO $10K IN 2011TO $1K IN 2014

READING SMART METERSEVERY 15 MINUTES IS

3000X MOREDATA INTENSIVE

Stock Market

Social Media

FACEBOOK UPLOADS250 MILLION

PHOTOS EACH DAY

Technology to process and store data is needed in all industries

Oil Exploration

Video Surveillance

OIL RIGS GENERATE

25000DATA POINTS PER SECOND

Medical Imaging

Mobile Sensors


What is Big Data Analytics?

DescriptiveAnalytics

WHAT HAPPENED?DiagnosticAnalytics

WHY DID IT HAPPEN?

BI

Data Science

PredictiveAnalytics

WHAT WILL HAPPEN?

PrescriptiveAnalytics

HOW CAN WE MAKE IT HAPPEN?

Hindsight

Insight

Foresight

Complexity

Value ofAnalytics

($)


Opportunities for Data-Driven Decisions in Pharma


Data driven drugs: From discovery to delivery

RICH DATA SOURCES• Molecular data

• Cellular drug screens• Animal models

• Clinical data including notes, images, markers (e.g. genomics, lab results)

• Sensor and assay data• Internal and partner/purchased

external data• Contact center data• Patient registries, public and federal

data, clinical partnerships

Clinical Trials

Manufacturing

Marketing

Distribution and surveillance

Drug discovery + development


A pipeline of sensors and opportunities for optimizing outputInternet of Things in Manufacturing

Input materials Mix Incubate Filter Centrifuge Final Product

0 20 40 60 80 100 120 140 160 180 200

0

5

10

15

20

25

30

Sensors

High-Content Screens

TEM

PTIME

Abs

orba

nce

Elution volume

Velo

city

TimeAutomated raw

materials mixing


Vaccine Potency PredictionCUSTOMER A major pharmaceutical company

BUSINESS PROBLEMPredict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.

CHALLENGES Customer’s data model was not optimal for

running analytical queries Manual data quality issues Data capture was performed with varying

consistency due to high cost associated with manual data collection

SOLUTION Introduced a new data model to make data

accessible and enable analytics (including LIMS and DeltaV)

Built automated outlier detection/correction methods to address manual data entry quality issues

Devised imputation methods to deal with data completeness issues

Built predictive models with high accuracy


Interpreting the utility of a measure obtained during manufacturing based on model outcomes

Sample model insights

Some features may reveal tunable parameters to alter potency, others may simply be markers

Features consistently absent from models may be uninformative for predicting potency

Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes

Assayed value Duration of a step

Pot

ency

Pot

ency

Correlation=0.45 Correlation=0.38


Need for new environments to process big data?HDFS STORAGE AND MPP

ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT

VARIETY/VELOCITY

DISTRIBUTED COMPUTATION FOR PARALLELIZATION

PETABYTES OF DATA

OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK

TO ACCESS COMMON LANGUAGES

RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND

TOOLS

SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP

MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE

FLEXIBLE

SCALABLE

ENABLING

ACCESSIBLE


Multiple tools with a single, simple goal: Distributed storage with in-place computation

PivotalHadoop


HAWQ



Think of it as multiple PostGreSQL servers

Segments/Workers

Master

Rows are distributed across segments by a particular field (or randomly)

PivotalHadoop


HAWQ


Identifying duplicates: counting with groupingOpportunities for performance improvements

– Sorting and re-sorting is required in many pipelines

– Single-threaded processes create bottlenecks in speed and need to move data

https://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-1-Map_and_Dedup.pdf

Solution: Leverage Pivotal’s distributed MPP environment by using common database functions


Identifying duplicates: counting with grouping

Reference genome

Mapped reads


Identifying duplicates: counting with grouping

Duplicates

13

11115

13

select locus, count(*) from readsgroup by locusReference genome

Mapped reads


Reference genome

Mapped reads

Counting numbers of reads mapped to featuresselect exon, count(*) from reads JOIN refseqON(<reads overlap exon>)group by exon

5

17

12



Think of it as distributed file system with very large blocks of data

Schema on read allows flexibility for a variety of datasetsCompute using a variety of paradigms (e.g. MapReduce)

PivotalHadoop


HAWQ

Name Node

Data Node 1

Data Node 2

Data Node 3

Data Node 4

1 2 3 2 3 1 1 2



SQL compliantWorld-class query optimizerInteractive queryHorizontal scalabilityRobust data managementCommon Hadoop formatsDeep analytics

PivotalHadoop


HAWQ

Think of it as distributed PostGreSQL (GPDB) on Hadoop• SQL compliant• World-class query optimizer• Interactive query• Horizontal scalability• Robust data management• Common Hadoop formats• Deep analytics


A single address for everything analyticsAnalytics with Pivotal

Time-to-Insights

FORECASTING CLUSTERING

REGRESSION

CLASSIFICATION

OPTIMIZATION


P L A T F O R M

Data Science Toolkit

KEY TOOLS KEY LANGUAGES

SQL


Historically data was studied in silosBRCA dataset

Treatments

Protein Assays

Imaging

VariantsPatient History &Follow-Ups

Gene Expression

Copy NumberVariation

miRNA


GenomicsData Center

ResearcherComputing

ClusterUnnecessary data movementNetwork usage

Need for new environment: Data movementComputation and storage in a single location


In-database genome-wide association study

NetworkInterconnect

MasterSevers

SegmentSevers

SQL & RCOVARIATES GENOTYPESIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

SNP1 2 MAA

CC

TT

AT

CG

TT

AA

GG

TC

TT CG

TC



NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

SQL & RIndiv Covariates

1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

COVARIATES GENOTYPES



NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM


1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC




NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM


1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC

SNP P-value1 2.34x10-21

2 0.3953 7.15x10-17

M 0.000142

COVARIATES GENOTYPES RESULTS



NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM


1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC


2 0.3953 7.15x10-17

M 0.000142


• In-database computation of 1 million loci for thousands of individuals in seconds

• Results are easily manipulated and explored



NetworkInterconnect

MasterSevers

SegmentSevers

SNP1 SNP2 SNPM

Pval1 Pval2 PvalM

LOR1 LOR2 LORM


1 2 10

1 F 23 18

2 M 39 41

3 M 50 23

N F 19 24

Indiv SNP

Geno

1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG

N M TC


2 0.3953 7.15x10-17

M 0.000142


• In-database computation of 1 million loci for thousands of individuals in seconds

• Results are easily manipulated and explored


Procedural Languages in Big Data Science HAWQ & PL/X can take advantage of “data

parallel” tasks by performing analyses in parallel – embarrassingly parallel tasks

– Little/no effort required to break up the problem into parallel tasks

– No dependency (or communication) between tasks

Examples of ‘data parallel’ problems:– Counting words in documents– Genome-Wide Association Study– Studying network anomalies

Sample Implementations by PDL– Digital image processing– Bayesian Inference with MCMC– Parallel Bagged Decision Trees

Doc1 Doc2 DocM

Stem1 Stem2 StemM

SQL & R

Count1 Count2 CountM

NetworkInterconnect

MasterSevers

SegmentSevers


Finding Causal Variants in LupusCustomer

Biotech Company

Business Problem

The customer wants to establish internal data science capabilities: building a culture and acquiring hardware and people to support it

Challenges

Customer needs to establish a culture around sharing and analyzing datasets for value

Current in-house technology is unable to support large-scale analysis (e.g. unable to analyze genomics datasets)

Need to learn new paradigms for analyzing data at-scale

Solution

Train customer employees on our solution stack, provide one-on-one consulting and run a hackathon

Greatly reduced computation time enabling implementation and results during a 30 hour period

– Module previously requiring 30min took only 5sec in using SQL in database

Novel scientific discovery on untouched data– Dataset previously untouched for 2 years due to

limited resources– Mine for statistically significant associations

between ~400,000 variants for 1000 phenotypes


Processing images and building integrated models at scale


Image Computation Framework

Hadoop Sequenc

eFile

Thousands of Images

Image Pre-Processing

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS

Map reduce

Map reduce

One sequence file



Hadoop Sequenc

eFile

Thousands of Images


Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Join to additional datasets

ProteomicsMedical HistoryVariants

Additional Datasets

Build Models at-Scale

SQLOne sequence file



Hadoop Sequenc

eFile

Thousands of Images

One sequence file


Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Raw Pixels

img1 [rgb1-rgbK]

img2 [rgb1-rgbK]

imgN [rgb1-rgbK]

Map reduce



Additional Datasets


SQL



Hadoop Sequenc

eFile

Thousands of Images

One sequence file


Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Raw Pixels

img1 [rgb1-rgbK]

img2 [rgb1-rgbK]

imgN [rgb1-rgbK]

Map reduce PL/X

SQL


Feature Generation

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]


Additional Datasets


SQL


Representing an image in HAWQHAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations

Source Image:Col

Row

0 1 2012

0 00 10 21 01 11 22 02 12 2

col

row

intsy

Structured:


Translating image processing to simple SQL

Function Distribution of pixel intensities

SQL SELECT intsy, count(*) FROM tbl GROUP BY intsy

Output 150, 5215, 4

HAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations

No data movement required

Simple SQL queries for data exploration0 00 10 21 01 11 22 02 12 2

Source Image:Col

Row

0 1 2012

col

row

intsy

Structured:


Image Processing PipelineFor Object Counting

Original

Image name # Cells

Tma_001.jpg 359

Tma_002.jpg 1892

Tma_003.jpg 871

… …

SmoothingAverage over

window of pixels

ThresholdingSelect pixels under intensity threshold

CleanupMin/max over

window of pixels

Object DetectionConnected

components

Object CountingSelect components

with size filter



Hadoop Sequenc

eFile

Thousands of Images

One sequence file


Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]

Feature Generation

HDFS HAWQ/GPDB

Map reduce

Map reduce

Raw Pixels

img1 [rgb1-rgbK]

img2 [rgb1-rgbK]

imgN [rgb1-rgbK]

Map reduce PL/X

SQL


Feature Generation

Features

img1 [x1-xM]

img2 [x1-xM]

imgN [x1-xM]


Additional Datasets


SQL


A Drug-Centric Data Lake to Enable Drug DiscoveryCustomer A major pharmaceutical company

Business ProblemIdentifying promising drug targets leveraging and integrating the vast datasets available will reduce time and cost to bring a new product to market

Challenges Data for drugs screens across multiple modalities

cannot be easily integrated Current environments cannot support the growing

data form high-content screens Researchers are unable to leverage the entirety of

datasets, instead work with aggregates or summaries

Solution Proved that current customer models can be ported,

sped up and scaled in the Pivotal environment

Created richer models integrating multiple types of data (genomics, images, etc)

Improve models using the raw, most-granular data

Demonstrate how the availability of tools and data enables scientists to interrogate models and derive a deeper, actionable insight


http://blog.pivotal.io/data-science-pivotalCheck out the Pivotal Data Science Blog!





http://blog.pivotal.io/


Driving Insights from Data Lakes Manufacturing Demo

Matthew Ross & Antonio PetrolePivotal Data Engineering


Internet of What????


Industrial Internet of Things?


IoT Goes MainstreamAccording to Gartner, Inc. (a technology research and advisory corporation), there will be nearly 26 billion devices on the Internet of Things by 2020.

https://en.wikipedia.org/wiki/Gartner


GE Doubles DownGE invests in IIoT cloud and creates Predix cloud built upon Pivotal Cloud Foundry. GE estimates that connecting these industrial machines to the IoT could boost global GDP by $10 trillion to $15 trillion in 20 years. McKinsey Global Institute research holds that IoT in general could add $6.2 billion to the global economy by 2025.

http://www.mckinsey.com/insights/high_tech_telecoms_internet/the_internet_of_things_sizing_up_the_opportunity


Converging Trends

InnovationNew Data New Processes New Insights

The Journey to the Data-Driven Enterprise

Data Scienceand Machine

LearningBig Data

Internet of Things


IoT Key WorkflowsData Flow Management

Reliable Infrastructure

Enterprise Level Tooling

● High Availability

● Fault Tolerant

● Scalable

● Data Overload

● Normalization

● Multiple Sources and Destinations

● Workflow Orchestration

● Admin Tooling

● Developer Enablement






● Fault Tolerant

● Scalable

● Data Overload

● Normalization



● Admin Tooling



Data Flow Management-Data Overload● The ability to stream

and process massive amounts of data

● Must have a platform that can handle that much data without losing or corrupting any of it


Data Flow Management-Data Normalization and Cleansing

● Organizing Fields to fit into a relational structure

● Adding extra fields or removing unneeded ones


Data Flow Management-Multiple Sources and Destinations

● Stream data from multiple different sources

● Persist it so multiple different destinations

● Process multiple different data formats






● Fault Tolerant

● Scalable

● Data Overload

● Normalization



● Admin Tooling



Reliable Infrastructure-High Availability and Fault Tolerance

● Resources are available under any conditions

● System stays up even if some resources go down


Reliable Infrastructure-Scalability

● Must be able to handle more traffic demand at any time

● Easily process big data workloads






● Fault Tolerant

● Scalable

● Data Overload

● Normalization



● Admin Tooling



Enterprise Level Tooling-Workflows and Tooling

● Manage complex flows of data

● Provide rich User Interface Applications


Enterprise Level Tooling-Developer Enablement

● Full Featured APIs

● Extreme module customization if needed


Are you making the most out of your data?


Bringing it all together


Reporting is nice, but being able to take action is what drives the value of a platform


Predictive Analytics

Proactive Monitoring

Reactive Maintenance


ETL vs Streaming● Data is loaded in large

batches● Typically happens once a

day● Analysis can only be done

once data is transformed and persisted to data warehouse


ETL vs Streaming● Continuous, data streams

are “listening” for data being emitted from sensors

● Data can be analyzed in stream

● Can be integrated into data driven applications


Reactive Maintenance● Alert is sent out to someone

on factory floor● Worker must receive alert, and

be able to react● Equipment is down for

however long it takes for worker to receive notification and complete repairs.


Proactive Maintenance Workflow● Manager has Dashboard with

all gauges of the system● Historical Log of Recent Alerts

are visible on the Dashboard● Has the ability to dispatch a

worker to investigate a specific line or robot


Predictive Analytics Workflow● Run Machine Learning and

Data Science models

● Incorporate Business Intelligence Tools

● Persist data to Data Lake and run advanced queries


Data Streaming Needs an Agile, Scalable and Fast Solution

Data Lake

Data Ingestion

Business Intelligence

Real Time Analytics

Mobile App



Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining

HDB PHD

DataLake



INGEST / SINK PROCESS ANALYZE

• No coding required

• Dozens of built-in connectors

• Seamless integration with Kafka, Sqoop

• Create new connectors easily using Spring

• Call Spark, Reactor or RxJava

• Built-in configurable filtering, splitting and transformation

• Out-of-box configurable jobs for batch processing

• Import and invoke PMML jobs easily

• Call Python, R, Madlib and other tools

• Built-in configurable counters and gauges

Spring XDState of the Art Data Pipeline Automation


Pivotal HDBHadoop Native SQL

• Exceptional Hadoop Native SQL Performance

• No compatibility risks to SQL developers or SQL BI tools and applications

• Support query roll-ups, dynamic partitions and joins

• Massive MPP scalability to petabytes

• On premise or on the cloud

• Scale your cluster out, not up

• World class parallel loading and unloading

• Fast performance for complex and advanced data analytics

• Integrated with MADLib for advanced machine learning

• Powerful Cost-based Query Optimizer

Driving Real Insights Through Data Science

Data & Analytics

Transcript of Driving Real Insights Through Data Science