Driving Real Insights Through Data Science
Click here to load reader
-
Upload
pivotal -
Category
Data & Analytics
-
view
1.637 -
download
0
Transcript of Driving Real Insights Through Data Science
The Journey to Becoming a Data-Driven Enterprise
Pivotal Big Data Roadshow 2015
2© Copyright 2015 Pivotal. All rights reserved.
Where we’re going today…
3 Great Keynotes• Journey to a Data-driven Enterprise
• Data Science Use Cases
• Streaming Data and Predictive Analytics
Stock Inference Demo and Architecture Overview
Intensive hands-on training sessions
3© Copyright 2015 Pivotal. All rights reserved.
Today’s Agenda
12:00 PM – 12:45 PM - Check-In & Lunch
12:45 PM – 1:00 PM – Welcome and Agenda Review
1:00 PM – 1:PM AM – How Pivotal’s Tools Help Drive Value from Data Science
1:20 PM – 2:20 PM – Accelerating the Generation of New Insights – R&D Use Case
Review and Demo
2:20 PM – 2:30 PM – Coffee Break
2:30 PM – 3:00 PM – Manufacturing Use Case Review and Demo
3:00 PM – 3:15 PM – Closing Remarks
© Copyright 2015 Pivotal. All rights reserved.
MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS -
AND THE ECONOMY.
JEFF IMMELT, CEO, GE
“
”JEFF IMMELT, CEO, GE
© Copyright 2015 Pivotal. All rights reserved.
THE POWER OF 1
RX
IncreasingFreight Utilization Rail
PredictiveMaintenance Healthcare
PredictiveDiagnostics Power
Driving Outcomes That Matter
One Percent Improvement Equals$27B
Industry Value byReducing System
Inefficiency
$63BIndustry Value byReducing Process
Inefficiency
$66BIndustry Value with
Efficiency ImprovementsIn Gas-fired Power
Plant FleetsSource: General Electric
© Copyright 2015 Pivotal. All rights reserved.
DATA-DRIVEN ENTERPRISE JOURNEY
STORE• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE• Predictive Analytics
• Machine Learning
• Advance Data Science
• Realtime Analytics
DEVELOP• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven Applications
• Enterprise, Consumer, and Mobile
INNOVATE• Agile Dev Expertise
• DevOps
• Microservice
• Continuous Delivery
• Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA PREDICTIVE ANALYTICS
CLOUD NATIVE PLATFORM
8© Copyright 2015 Pivotal. All rights reserved.
0% of CIOs think their IT infrastructure is fully prepared for big data (3)
30% of companies have deployed advanced analytics, 11% big data analysis (4)
44% of new applications failed to meet performance expectations (5)
2X90% of companies allocate at least 2X more cloud capacity than needed to ensure performance (6)
But…
80% of CEOs thinking data mining and analysis are strategically important (1)
4% of companies use analytics effectively (2)
(1) 2015 PWC CEO Survey; (2)2013 Baine and Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernest and Young - 2014 Enterprise IT Trends and Investments; (5) 2014 Riverbed Tecnologies - The Transformers; (6) 2014 ElasticHosts CIO Study
LARGE ENTERPRISE BIG DATA TROUBLE
9© Copyright 2015 Pivotal. All rights reserved.
BIG DATACHASM
70%of data
generated by customers
80%of data stored
3%prepared for
analysis
0.5%being
analyzed
<0.5%being
operationalized
9
THE DATA DIVIDE
10© Copyright 2015 Pivotal. All rights reserved.
Software Is Eating The World
Data Is Fueling Software
SOFTWARE IS EATING THE WORLD
11© Copyright 2015 Pivotal. All rights reserved.
WE CHOSE PIVOTAL BECAUSE WE BELIEVE IT PROVIDES A 360-DEGREE VIEW OF THE PROCESS.
FROM A DATA SCIENCE AND DATA TECHNOLOGY PERSPECTIVE, IT MEANS DELIVERING BEST-IN-
CLASS DATA TECHNOLOGIES AND ENABLING THEM ON THEIR PLATFORM.
“
”
12© Copyright 2015 Pivotal. All rights reserved.
ACROSS INDUSTRIES
13© Copyright 2015 Pivotal. All rights reserved.
THE NEW DATA IMPERATIVES
ConvergedData & Cloud
OpenData-DrivenApps
14© Copyright 2015 Pivotal. All rights reserved.
THE BIG DATA PROBLEM
Fragmentation ConstraintsComplexity
15© Copyright 2015 Pivotal. All rights reserved.
• Remove Lock-in
• Leverage Ecosystem
• Co-innovate
GUIDING PRINCIPLES IN THE NEW ERA
OPEN AGILE CLOUD-READY
• Shorten innovation cycles
• Reduce TCO
• Improve TTM
• Solve business problems
• Avoid lock-in
• Appropriate security
16© Copyright 2015 Pivotal. All rights reserved.
JOURNEY TO A DATA-DRIVEN ENTERPRISE
Deploy analytic apps and automate at scale
Perform advanced analyticsDiscover insights
Modernize data infrastructure
17© Copyright 2015 Pivotal. All rights reserved.
Deploy analytic apps and automate at scale
Perform advanced analyticsDiscover insights
Modernize data infrastructure
DATA-DRIVEN COMPANIES:
USE MODERN DATA INFRASTRUCTURE
18© Copyright 2015 Pivotal. All rights reserved.
MODERNIZE DATA INFRASTRUCTURE
Elastic, Scale-outstorage and processing
Flexible data types and pipelining
ETL on demand: low operational costExpanded use cases
Higher quality analyticsLowered storage/processing cost
Less fragmented ecosystemReduced vendor lock-in
REQUIREMENTS BENEFITS
Cloud friendly and open-source based
19© Copyright 2015 Pivotal. All rights reserved.
Modernize data infrastructure
Deploy analytic apps and automate at scale
Perform advanced analyticsDiscover insights
DATA-DRIVEN COMPANIES:
STRATEGICALLY USE ADVANCED ANALYTICS
20© Copyright 2015 Pivotal. All rights reserved.
ADVANCED ANALYTICS
Leverage existing skills and toolsRapid time to insights
Internet of Things use casesRapid time to insights
Solve business problemsPredictive insights: proactive execution
REQUIREMENTS BENEFITS
Machine learning and advanced analytics
010101010101010100101010101010101100101010
SQL- compliant batch and interactive queries
Massive stream processing
0101010101010101001010
1010101010110010101010
10101010
21© Copyright 2015 Pivotal. All rights reserved.
Modernize data infrastructure
Perform advanced analyticsDiscover insights
DATA-DRIVEN COMPANIES:
INNOVATE AT SCALE
Deploy analytic apps and automate at scale
22© Copyright 2015 Pivotal. All rights reserved.
ANALYTIC APPS AND AUTOMATION AT SCALE
Reduced time to actionLow ‘analytics app-dev’ integration cost
Reduced time to insightsFlexible ingestion: low operating cost
High performance: low operating costTransactional safety: business critical ops
REQUIREMENTS BENEFITS
Low-latency, distributed in-memory transactions
Resilient, scale-out messaging and object storage
Agile analytic app-dev with enterprise PaaSPaaS
23© Copyright 2015 Pivotal. All rights reserved.
JOURNEY TO A DATA-DRIVEN ENTERPRISE
Deploy analytic apps and automate at scale
Perform advanced analyticsDiscover insights
Modernize data infrastructure
Pivotal Data Sciencehelps you move from BI to
Data Science
Pivotal Labs helps you move to an agile development of apps at scale
Pivotal Data Engineeringhelps you move from data administration to data engineering
26© Copyright 2015 Pivotal. All rights reserved.
PIVOTAL BIG DATA SUITE
27© Copyright 2015 Pivotal. All rights reserved.
Open sourcing all Pivotal Big Data Suite components including:
WORLD’S FIRST OPEN SOURCED BIG DATA PORTFOLIOBUILDING ON SUCCESS OF CLOUD FOUNDRY FOUNDATION
BUILT FOR ENTERPRISES
Pivotal GemFireApache Geode
Apache HAWQPivotal HDB
PivotalGreenplum Database
28© Copyright 2015 Pivotal. All rights reserved.
BUILT FOR ENTERPRISES
Value added features: enterprise grade performance + robustness without lock-in• Advanced Query Optimization in analytics
• WAN replication and continuous query in transactional processing
Flexible Deployment models: align to business objectives and needs• Balance cost objectives with policy and compliance requirements
• Leverage Pivotal’s pre-integration + certification on supported configurations
Enterprise grade support: one throat to choke for the suite• Focus on business problems – not on lifecycle management
• Expert support on Big Data Suite means reduced business risk
29© Copyright 2015 Pivotal. All rights reserved.
• Common core for Hadoop ecosystem
• Rapidly accelerated certifications, ecosystem development and enterprise-grade quality
OpenDataPlatform.org
OPEN
30© Copyright 2015 Pivotal. All rights reserved.
AGILE
Deploy analytic apps and automate at scale
Perform advanced analyticsDiscover insights
Modernize data infrastructure
Spring XD Spark Pivotal HD & Open Data Platform
Pivotal Greenplum Database
Pivotal HDB Rabbit MQ
Redis
Pivotal GemFire Pivotal BDS on PCF
Pivotal Cloud Foundry
31© Copyright 2015 Pivotal. All rights reserved.
CLOUD-READY
COMMODITY HARDWARE APPLIANCE HYBRID CLOUDCLOUD
IaaS IaaS
PAAS
32© Copyright 2015 Pivotal. All rights reserved.
DATA-DRIVEN ENTERRPRISE JOURNEY WITH PIVOTAL BIG DATA SUITE
STORE• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE• Predictive Analytics
• Machine Learning
• Advance Data Science
• Realtime Analytics
DEVELOP• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven Applications
• Enterprise, Consumer, IoT, and Mobile
INNOVATE• Agile Dev Expertise
• DevOps
• Microservices
• Continuous Delivery
• Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA PREDICTIVE ANALYTICS
CLOUD NATIVE PLATFORM
Spring XD
Spark
Pivotal HD & Open Data Platform
Spring XD
Pivotal Greenplum Database
Pivotal HDB
Spring XD
Pivotal GemFire
Rabbit MQ
Spring Cloud
Pivotal BDS on PCF
Pivotal Cloud Foundry
Pivotal LabsData ScienceData Engineering
35© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO, CHECKOUT…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
Or reach out to your local Pivotal Account Executive…
36© Copyright 2015 Pivotal. All rights reserved. 36© Copyright 2013 Pivotal. All rights reserved.
Pivotal Data Science Overview and Use CasesPivotal Big Data Roadshow
37© Copyright 2015 Pivotal. All rights reserved.
DATA SCIENCE?
App Development
Analytics
Business IntelligenceReporting
Visualization
Dashboards
Insights Big Data
Machine LearningStatistics
MathematicsTime Series
Algorithms
Databases
Software
Modeling
Queries
Real-Time
Sensors
Predictive Models
ETL
Research
Hadoop
Distributed Computing
MapReduce
SQL
In-Memory
OLAP
Text Mining
Unstructured Data
Open Source
Decision Science
Ad Hoc Queries
Hacking
In-Database Analytics
Internet of Things
Data Cleansing
Sentiment
38© Copyright 2015 Pivotal. All rights reserved.
• ETL
• Unstructured
• Data Cleansing
• Sensors
Data Related
• Algorithms
• Mathematics
• Statistics
• Econometrics
• Predictive Modeling
• Machine Learning
• Text Mining
• Sentiment
• Map Reduce
Fields of Study & Techniques
• Dashboards
• Insights
• Visualization
• Ad Hoc Queries
• Reporting
Business Intelligence
• Software
• In-Database Analysis
• Distributed Computing
• Hadoop
• Open Source
Implementation
• Big Data
• Decision Science
• Internet of Things
• Real-Time
• Hacking
• In-Memory
Industry Buzzwords
39© Copyright 2015 Pivotal. All rights reserved.
What is Data Science?
The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.
DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
40© Copyright 2015 Pivotal. All rights reserved.
Gene Sequencing
Smart Grids
COST TO SEQUENCE ONE GENOMEHAS FALLEN FROM $100M IN 2001 TO $10K IN 2011TO $1K IN 2014
READING SMART METERSEVERY 15 MINUTES IS
3000X MOREDATA INTENSIVE
Stock Market
Social Media
FACEBOOK UPLOADS250 MILLION
PHOTOS EACH DAY
Billions of Data Points
Oil Exploration
Video Surveillance
OIL RIGS GENERATE
25000DATA POINTS PER SECOND
Medical Imaging
Mobile Sensors
41© Copyright 2015 Pivotal. All rights reserved.
What is Big Data Analytics?
DescriptiveAnalytics
WHAT HAPPENED?DiagnosticAnalytics
WHY DID IT HAPPENED?
BI
Data Science
PredictiveAnalytics
WHAT WILL HAPPEN?
PrescriptiveAnalytics
HOW CAN WE MAKE IT HAPPEN?
Hindsight
Insight
Foresight
Complexity
Value ofAnalytics
($)
42© Copyright 2015 Pivotal. All rights reserved.
P L A T F O R M
Data Science Toolkit
KEY TOOLS KEY LANGUAGES
SQL
43© Copyright 2015 Pivotal. All rights reserved.
Scalable, In-Database ML
• Open Source https://github.com/madlib/madlib• Works on Greenplum DB, HAWQ and PostgreSQL• In active development by Pivotal• Downloads and Docs: http://madlib.net/
44© Copyright 2015 Pivotal. All rights reserved.
Functions
Supervised LearningRegression Models• Cox Proportional Hazards Regression• Elastic Net Regularization• Generalized Linear Models• Linear Regression• Logistic Regression• Marginal Effects• Multinomial Regression• Ordinal Regression• Robust Variance, Clustered Variance• Support Vector MachinesTree Methods• Decision Tree• Random ForestOther Methods• Conditional Random Field• Naïve Bayes
Unsupervised Learning• Association Rules (Apriori)• Clustering (K-means) • Topic Modeling (LDA)
StatisticsDescriptive• Cardinality Estimators• Correlation• SummaryInferential• Hypothesis TestsOther Statistics• Probability Functions
Other Modules• Conjugate Gradient• Linear Solvers• PMML Export• Random Sampling• Term Frequency for Text
Time Series• ARIMA
Aug 2015
Data Types and Transformations• Array Operations• Dimensionality Reduction (PCA)• Encoding Categorical Variables• Matrix Operations• Matrix Factorization (SVD, Low Rank)• Norms and Distance Functions• Sparse Vectors
Model Evaluation• Cross Validation
Predictive Analytics Library
45© Copyright 2015 Pivotal. All rights reserved.
A single address for everything analyticsAnalytics with Pivotal
Time-to-Insights
FORECASTING CLUSTERING
REGRESSION
CLASSIFICATION
OPTIMIZATION
46© Copyright 2015 Pivotal. All rights reserved.
Smart Systems = Sensors + Digital Brain + Actuators
Problem Formulation
Modeling Step
Data StepApplication Step
Data Science forBuilding Models
Sensors & Actuators
Data Lake
47© Copyright 2015 Pivotal. All rights reserved. 47© Copyright 2013 Pivotal. All rights reserved.
Data Science Use Cases
48© Copyright 2015 Pivotal. All rights reserved. 48© Copyright 2013 Pivotal. All rights reserved.
Financial Services
49© Copyright 2015 Pivotal. All rights reserved.
Identifying and Pricing Cross-Sell Opportunities
CUSTOMER
A global financial services provider
BUSINESS PROBLEM
Identify cross-sell opportunities between two business arms of a financial institution.
CHALLENGES
Integration of large-scale data originating from multiple data warehouses. Developing predictive models to identify novel cross-sell opportunities within the financial institution. Evaluate the identified cross-sell opportunities by their revenue potential.
SOLUTIONS
Fast integration of data in Pivotal Greenplum Database.
Predictive models and evaluation of profitability:– Association rule. – Logistic regression for each product
offered.– Estimation of revenue opportunity.
On-demand reporting and visualization via custom dashboards connected to in-database models.
Identified multi-million dollar opportunities for the bank.
50© Copyright 2015 Pivotal. All rights reserved.
Financial Compliance
BUSINESS PROBLEMEnsure compliance with Dodd-Frank and Basel Committee regulationsIdentify underlying risk and fraud while reducing the compliance department’s overburdened
Emails Chats Trades
Transactions Policy Securities
Phone Calls Watch Lists …
Financial complianceData Lake
Data integration
Data clean up Modeling Classification
and ranking
Analyst user interfacesFeedback
Analytics
Analyst feedbackData integration: e.g., append trade information with email and chat communications
Data cleanup: e.g., identify newsletters and spam emails
Modeling: • Predictive modeling to flag
messages and trades• Graph and cohort analysis
Analyst feedbackReviewed fraud instances included in periodic model refreshes
SOLUTION A data lake platform coupled with cutting edge data
science techniques Flexible user interface to promote an adaptive,
continuously learning compliance framework
51© Copyright 2015 Pivotal. All rights reserved. 51© Copyright 2013 Pivotal. All rights reserved.
Telco & Mobile
52© Copyright 2015 Pivotal. All rights reserved.
Subscriber Micro-SegmentationCUSTOMER
A major telco with cable & VOD, internet, and phone business unitsBUSINESS PROBLEM
Better understand aggregated subscriber behavior to drive business strategy using newly available data sources
CHALLENGES
▪ Large quantities of deep packet inspection data and set top box data that had not been analyzed before
▪ Needed to incorporate internet usage and TV consumption information into pre-existing subscriber segments
SOLUTION
▪ Generated new subscriber segments that incorporated features based on consumption of TV and internet services across a variety of devices
▪ Crossed new segments with existing segments to generate new micro-segments for cross-sell/upsell and new product development opportunities
Customized Micro-Segments
53© Copyright 2015 Pivotal. All rights reserved.
Newly Identified Behavior-Based Segments S
ubsc
riber
s
Moderates
OTT & Data Heavyweights
Portable OTT Entertainment Seekers
iPhone Heavy
Android Heavy
iPad Heavy
In-Home OTT Entertainment Seekers
In-Home Native Content Seekers
VOD Heavy
TV Heavy
54© Copyright 2015 Pivotal. All rights reserved.
Opportunities for Data-Driven Decisions in Pharma
55© Copyright 2015 Pivotal. All rights reserved.
Data driven drugs: From discovery to delivery
RICH DATA SOURCES• Molecular data
• Cellular drug screens• Animal models
• Clinical data including notes, images, markers (e.g. genomics, lab results)
• Sensor and assay data• Internal and partner/purchased
external data• Contact center data• Patient registries, public and federal
data, clinical partnerships
Clinical Trials
Manufacturing
Marketing
Distribution and surveillance
Drug discovery + development
56© Copyright 2015 Pivotal. All rights reserved.
A pipeline of sensors and opportunities for optimizing outputInternet of Things in Manufacturing
Input materials Mix Incubate Filter Centrifuge Final Product
0 20 40 60 80 100 120 140 160 180 200
0
5
10
15
20
25
30
Sensors
High-Content Screens
TEM
PTIME
Abs
orba
nce
Elution volume
Velo
city
TimeAutomated raw
materials mixing
57© Copyright 2015 Pivotal. All rights reserved.
Vaccine Potency PredictionCUSTOMER A major pharmaceutical company
BUSINESS PROBLEMPredict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.
CHALLENGES Customer’s data model was not optimal for
running analytical queries Manual data quality issues Data capture was performed with varying
consistency due to high cost associated with manual data collection
SOLUTION Introduced a new data model to make data
accessible and enable analytics (including LIMS and DeltaV)
Built automated outlier detection/correction methods to address manual data entry quality issues
Devised imputation methods to deal with data completeness issues
Built predictive models with high accuracy
58© Copyright 2015 Pivotal. All rights reserved.
http://blog.pivotal.io/data-science-pivotalCheck out the Pivotal Data Science Blog!
59© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
• Or reach out to your local Pivotal Account Executive…
60© Copyright 2015 Pivotal. All rights reserved.
Pivotal Data Science Labs: Packaged Services
• Analytics Roadmap
• Prioritized Opportunities
• Architectural Recommendations
• Hands-on training
• Hosted data on
Pivotal Data stack
• Results review &assessment
• On-site MPP analytics training
• Analytics tool-kit
• Quick insight (2 weeks)
• Prof. services
• Data science model building
• Ready-to-deploymodel(s)
• Prof. services
• Data sciencemodel building
• Ready-to-deploymodel(s)
LAB PRIMER(2-Week Roadmapping)
LAB 600(6-Week Lab)
LAB 1200(12-Week Lab)
LAB 100(Analytics Bundle)
DATA JAM(Internal DS Contest)
61© Copyright 2015 Pivotal. All rights reserved.
Data Streaming and Predictive AnalyticsUsing Pivotal Big Data Suite
62© Copyright 2015 Pivotal. All rights reserved.
Converging Trends
InnovationNew Data
New Processes
New Insights
The Journey to the Data-Driven Enterprise
Data Scienceand Machine
LearningBig DataIoT, Mobile Apps,
Social Media
63© Copyright 2015 Pivotal. All rights reserved.
HDFS
Data Lake
Ingest Store Analytics
Hard to changeLabor intensive
Inefficient
Coding basedNo real-time informationBased on expensive ETL
Migrating from a Reactive, Static and Constrained Model…
64© Copyright 2015 Pivotal. All rights reserved.
HDFSData Lake Expert System / Machine Learning
In-Memory Real-Time
Data
Continuous LearningContinuous Improvement
Continuous Adapting
Data Stream Pipeline
Multiple Data SourcesReal-Time ProcessingStore Everything
To Pro-Active, Self-Improving, Machine Learning Systems
65© Copyright 2015 Pivotal. All rights reserved.
New York Times Research: http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
“50-80% OF THE TIME ON DATA
SCIENCE PROJECTS IS SPENT ON DATA WRANGLING
”
66© Copyright 2015 Pivotal. All rights reserved.
Data FeedsStream Processing
Expert Systems Machine Learning
Historical Data
Business ValueSmart
Decisions
Still…
HDFS
Data Lake
67© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform SinkSpringXD
GemFire
Data Stream Needs an Agile, Scalable and Fast Solution
HAWQ GPDBDataLake
68© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform SinkSpringXD
DistributedComputing
In-Memory Real-Time Data
Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining
Expert System / Machine Learning
ExtensibleOpen-SourceFault-TolerantHorizontally Scalable HAWQ GPDB
DataLake
69© Copyright 2015 Pivotal. All rights reserved.
INGEST / SINK PROCESS ANALYZE
• No coding required
• Dozens of built-in connectors
• Seamless integration with Kafka, Sqoop
• Create new connectors easily using Spring
• Call Spark, Reactor or RxJava
• Built-in configurable filtering, splitting and transformation
• Out-of-box configurable jobs for batch processing
• Import and invoke PMML jobs easily
• Call Python, R, Madlib and other tools
• Built-in configurable counters and gauges
Spring XDState of the Art Data Pipeline Automation
70© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform SinkSpringXD
DistributedComputing
GemFire Provides Scalable, Low-Latency Data Access, Storage and Event Processing
Expert System / Machine Learning
GemFire
ExtensibleOpen-SourceFault-TolerantHorizontally Scalable HAWQ GPDB
DataLake
71© Copyright 2015 Pivotal. All rights reserved.
GemFire
• In-Memory Enterprise Data Grid• Horizontally Scalable, Consistent,
Highly Available • Event handling• Continuous Queries• Enterprise Data Geo Distribution
In-memory Real Time Data
72© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform SinkSpringXD
DistributedComputing
Pivotal Provides SQL Based Advanced Analytics
Expert System / Machine Learning
GemFire
ExtensibleOpen-SourceFault-TolerantHorizontally Scalable
DataLake HAWQ GPDB
73© Copyright 2015 Pivotal. All rights reserved.
HAWQ
• Massively Parallel Processing RDBMS on HADOOP
• ANSI SQL on Hadoop• Extremely high
performance for analytics (not like Hive)
• Stores all data directly on HDFS
• Open-Source
Advanced SQL analytics in Hadoop
Combining SQL with Hadoop is key for analytics
SQL remains #1 choice for Data Science
74© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform SinkSpringXD
Developers and Data Scientists Can Focus onthe Business Value of Data
GemFire
ExtensibleOpen-SourceFault-TolerantHorizontally Scalable
DataLake HAWQ GPDB
75© Copyright 2015 Pivotal. All rights reserved.
Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps
Data Stream Pipeline
DistributedComputing Real-Time Data Expert Systems &
Machine LearningAdvancedAnalytics
HDFSData Lake
76© Copyright 2015 Pivotal. All rights reserved.
Data Streaming Reference ArchitectureData Feeds Transactional Apps Analytic Apps
Data Stream Pipeline
HDFSData Lake
GemFire HAWQ GPDB
SpringXD
77© Copyright 2015 Pivotal. All rights reserved.
“SO WE ARE MOVING TO A WORLD WHERE THE
MACHINES WE WORK WITH ARE NOT JUST INTELLIGENT; THEY ARE BRILLIANT.THEY ARE SELF-
AWARE, THEY ARE PREDICTIVE, REACTIVE AND SOCIAL. IT'S A WORLD WHERE INFORMATION
ITSELF BECOMES INTELLIGENT AND COMES TO US AUTOMATICALLY WHEN WE NEED IT WITHOUT
HAVING TO LOOK FOR IT.
”MARCO ANNUNZIATA, GE
78© Copyright 2015 Pivotal. All rights reserved.
DemoPowered by Pivotal Big Data Suite
79© Copyright 2015 Pivotal. All rights reserved.
It's all about DATA
Data SourcesLook for patterns
Prediction
Transform Sink
SpringXD
ExtensibleOpen-SourceFault-TolerantHorizontally ScalableCloud-Native
Machine Learning
Enrich Filter
Split
Dashboard
Indicators
1
2
Predict
3
Real data
Simulator
/Stocks
/TechIndicators
/Predictions
81© Copyright 2015 Pivotal. All rights reserved. 81
91© Copyright 2015 Pivotal. All rights reserved.
“THE REAL OPPORTUNITY FOR
CHANGE...SURPASSING THE MAGNITUDE OF THE CONSUMER INTERNET...IS THE INDUSTRIAL
INTERNET, AN OPEN, GLOBAL NETWORK THAT CONNECTS PEOPLE, DATA AND MACHINES.
”JEFF IMMELT, CEO, GE
100© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO, CHECKOUT…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
• Or reach out to your local Pivotal Account Executive…
BUILT FOR THE SPEED OF BUSINESS
102© Copyright 2015 Pivotal. All rights reserved. 102© Copyright 2013 Pivotal. All rights reserved.
Accelerating the Generation of New Insights
October 27, 2015
Sarah Aerni, Data ScienceAntonio Petrole, Data Engineering
BUILT FOR THE SPEED OF BUSINESS
104© Copyright 2015 Pivotal. All rights reserved.
Gene Sequencing
Smart Grids
COST TO SEQUENCE ONE GENOMEHAS FALLEN FROM $100M IN 2001 TO $10K IN 2011TO $1K IN 2014
READING SMART METERSEVERY 15 MINUTES IS
3000X MOREDATA INTENSIVE
Stock Market
Social Media
FACEBOOK UPLOADS250 MILLION
PHOTOS EACH DAY
Technology to process and store data is needed in all industries
Oil Exploration
Video Surveillance
OIL RIGS GENERATE
25000DATA POINTS PER SECOND
Medical Imaging
Mobile Sensors
105© Copyright 2015 Pivotal. All rights reserved.
What is Big Data Analytics?
DescriptiveAnalytics
WHAT HAPPENED?DiagnosticAnalytics
WHY DID IT HAPPEN?
BI
Data Science
PredictiveAnalytics
WHAT WILL HAPPEN?
PrescriptiveAnalytics
HOW CAN WE MAKE IT HAPPEN?
Hindsight
Insight
Foresight
Complexity
Value ofAnalytics
($)
106© Copyright 2015 Pivotal. All rights reserved.
Opportunities for Data-Driven Decisions in Pharma
107© Copyright 2015 Pivotal. All rights reserved.
Data driven drugs: From discovery to delivery
RICH DATA SOURCES• Molecular data
• Cellular drug screens• Animal models
• Clinical data including notes, images, markers (e.g. genomics, lab results)
• Sensor and assay data• Internal and partner/purchased
external data• Contact center data• Patient registries, public and federal
data, clinical partnerships
Clinical Trials
Manufacturing
Marketing
Distribution and surveillance
Drug discovery + development
108© Copyright 2015 Pivotal. All rights reserved.
A pipeline of sensors and opportunities for optimizing outputInternet of Things in Manufacturing
Input materials Mix Incubate Filter Centrifuge Final Product
0 20 40 60 80 100 120 140 160 180 200
0
5
10
15
20
25
30
Sensors
High-Content Screens
TEM
PTIME
Abs
orba
nce
Elution volume
Velo
city
TimeAutomated raw
materials mixing
109© Copyright 2015 Pivotal. All rights reserved.
Vaccine Potency PredictionCUSTOMER A major pharmaceutical company
BUSINESS PROBLEMPredict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.
CHALLENGES Customer’s data model was not optimal for
running analytical queries Manual data quality issues Data capture was performed with varying
consistency due to high cost associated with manual data collection
SOLUTION Introduced a new data model to make data
accessible and enable analytics (including LIMS and DeltaV)
Built automated outlier detection/correction methods to address manual data entry quality issues
Devised imputation methods to deal with data completeness issues
Built predictive models with high accuracy
110© Copyright 2015 Pivotal. All rights reserved.
Interpreting the utility of a measure obtained during manufacturing based on model outcomes
Sample model insights
Some features may reveal tunable parameters to alter potency, others may simply be markers
Features consistently absent from models may be uninformative for predicting potency
Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes
Assayed value Duration of a step
Pot
ency
Pot
ency
Correlation=0.45 Correlation=0.38
111© Copyright 2015 Pivotal. All rights reserved.
Need for new environments to process big data?HDFS STORAGE AND MPP
ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT
VARIETY/VELOCITY
DISTRIBUTED COMPUTATION FOR PARALLELIZATION
PETABYTES OF DATA
OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK
TO ACCESS COMMON LANGUAGES
RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND
TOOLS
SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP
MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE
FLEXIBLE
SCALABLE
ENABLING
ACCESSIBLE
112© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed storage with in-place computation
PivotalHadoop
Pivotal Greenplum Database
HAWQ
113© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed storage with in-place computation
Think of it as multiple PostGreSQL servers
Segments/Workers
Master
Rows are distributed across segments by a particular field (or randomly)
PivotalHadoop
Pivotal Greenplum Database
HAWQ
114© Copyright 2015 Pivotal. All rights reserved.
Identifying duplicates: counting with groupingOpportunities for performance improvements
– Sorting and re-sorting is required in many pipelines
– Single-threaded processes create bottlenecks in speed and need to move data
https://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-1-Map_and_Dedup.pdf
Solution: Leverage Pivotal’s distributed MPP environment by using common database functions
115© Copyright 2015 Pivotal. All rights reserved.
Identifying duplicates: counting with grouping
Reference genome
Mapped reads
116© Copyright 2015 Pivotal. All rights reserved.
Identifying duplicates: counting with grouping
Duplicates
13
11115
13
select locus, count(*) from readsgroup by locusReference genome
Mapped reads
117© Copyright 2015 Pivotal. All rights reserved.
Reference genome
Mapped reads
Counting numbers of reads mapped to featuresselect exon, count(*) from reads JOIN refseqON(<reads overlap exon>)group by exon
5
17
12
118© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed storage with in-place computation
Think of it as distributed file system with very large blocks of data
Schema on read allows flexibility for a variety of datasetsCompute using a variety of paradigms (e.g. MapReduce)
PivotalHadoop
Pivotal Greenplum Database
HAWQ
Name Node
Data Node 1
Data Node 2
Data Node 3
Data Node 4
1 2 3 2 3 1 1 2
119© Copyright 2015 Pivotal. All rights reserved.
Multiple tools with a single, simple goal: Distributed storage with in-place computation
SQL compliantWorld-class query optimizerInteractive queryHorizontal scalabilityRobust data managementCommon Hadoop formatsDeep analytics
PivotalHadoop
Pivotal Greenplum Database
HAWQ
Think of it as distributed PostGreSQL (GPDB) on Hadoop• SQL compliant• World-class query optimizer• Interactive query• Horizontal scalability• Robust data management• Common Hadoop formats• Deep analytics
120© Copyright 2015 Pivotal. All rights reserved.
A single address for everything analyticsAnalytics with Pivotal
Time-to-Insights
FORECASTING CLUSTERING
REGRESSION
CLASSIFICATION
OPTIMIZATION
121© Copyright 2015 Pivotal. All rights reserved.
P L A T F O R M
Data Science Toolkit
KEY TOOLS KEY LANGUAGES
SQL
122© Copyright 2015 Pivotal. All rights reserved.
Historically data was studied in silosBRCA dataset
Treatments
Protein Assays
Imaging
VariantsPatient History &Follow-Ups
Gene Expression
Copy NumberVariation
miRNA
123© Copyright 2015 Pivotal. All rights reserved.
GenomicsData Center
ResearcherComputing
ClusterUnnecessary data movementNetwork usage
Need for new environment: Data movementComputation and storage in a single location
124© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
NetworkInterconnect
MasterSevers
SegmentSevers
SQL & RCOVARIATES GENOTYPESIndiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
SNP1 2 MAA
CC
TT
AT
CG
TT
AA
GG
TC
TT CG
TC
125© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
NetworkInterconnect
MasterSevers
SegmentSevers
SNP1 SNP2 SNPM
SQL & RIndiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SNP
Geno
1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG
N M TC
COVARIATES GENOTYPES
126© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
NetworkInterconnect
MasterSevers
SegmentSevers
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & RIndiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SNP
Geno
1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG
N M TC
COVARIATES GENOTYPES
127© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
NetworkInterconnect
MasterSevers
SegmentSevers
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & RIndiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SNP
Geno
1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG
N M TC
COVARIATES GENOTYPES
128© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
NetworkInterconnect
MasterSevers
SegmentSevers
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & RIndiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SNP
Geno
1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG
N M TC
SNP P-value1 2.34x10-21
2 0.3953 7.15x10-17
M 0.000142
COVARIATES GENOTYPES RESULTS
129© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
NetworkInterconnect
MasterSevers
SegmentSevers
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
SQL & RIndiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SNP
Geno
1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG
N M TC
SNP P-value1 2.34x10-21
2 0.3953 7.15x10-17
M 0.000142
COVARIATES GENOTYPES RESULTS
• In-database computation of 1 million loci for thousands of individuals in seconds
• Results are easily manipulated and explored
130© Copyright 2015 Pivotal. All rights reserved.
In-database genome-wide association study
NetworkInterconnect
MasterSevers
SegmentSevers
SNP1 SNP2 SNPM
Pval1 Pval2 PvalM
LOR1 LOR2 LORM
SQL & RIndiv Covariates
1 2 10
1 F 23 18
2 M 39 41
3 M 50 23
N F 19 24
Indiv SNP
Geno
1 1 AA2 1 AT3 1 AA1 2 CC2 2 CG3 2 GG
N M TC
SNP P-value1 2.34x10-21
2 0.3953 7.15x10-17
M 0.000142
COVARIATES GENOTYPES RESULTS
• In-database computation of 1 million loci for thousands of individuals in seconds
• Results are easily manipulated and explored
131© Copyright 2015 Pivotal. All rights reserved.
Procedural Languages in Big Data Science HAWQ & PL/X can take advantage of “data
parallel” tasks by performing analyses in parallel – embarrassingly parallel tasks
– Little/no effort required to break up the problem into parallel tasks
– No dependency (or communication) between tasks
Examples of ‘data parallel’ problems:– Counting words in documents– Genome-Wide Association Study– Studying network anomalies
Sample Implementations by PDL– Digital image processing– Bayesian Inference with MCMC– Parallel Bagged Decision Trees
Doc1 Doc2 DocM
Stem1 Stem2 StemM
SQL & R
Count1 Count2 CountM
NetworkInterconnect
MasterSevers
SegmentSevers
132© Copyright 2015 Pivotal. All rights reserved.
Finding Causal Variants in LupusCustomer
Biotech Company
Business Problem
The customer wants to establish internal data science capabilities: building a culture and acquiring hardware and people to support it
Challenges
Customer needs to establish a culture around sharing and analyzing datasets for value
Current in-house technology is unable to support large-scale analysis (e.g. unable to analyze genomics datasets)
Need to learn new paradigms for analyzing data at-scale
Solution
Train customer employees on our solution stack, provide one-on-one consulting and run a hackathon
Greatly reduced computation time enabling implementation and results during a 30 hour period
– Module previously requiring 30min took only 5sec in using SQL in database
Novel scientific discovery on untouched data– Dataset previously untouched for 2 years due to
limited resources– Mine for statistically significant associations
between ~400,000 variants for 1000 phenotypes
133© Copyright 2015 Pivotal. All rights reserved.
Processing images and building integrated models at scale
134© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop Sequenc
eFile
Thousands of Images
Image Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS
Map reduce
Map reduce
One sequence file
135© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop Sequenc
eFile
Thousands of Images
Image Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map reduce
Map reduce
Join to additional datasets
ProteomicsMedical HistoryVariants
Additional Datasets
Build Models at-Scale
SQLOne sequence file
136© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop Sequenc
eFile
Thousands of Images
One sequence file
Image Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map reduce
Map reduce
Raw Pixels
img1 [rgb1-rgbK]
img2 [rgb1-rgbK]
imgN [rgb1-rgbK]
Map reduce
Join to additional datasets
ProteomicsMedical HistoryVariants
Additional Datasets
Build Models at-Scale
SQL
137© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop Sequenc
eFile
Thousands of Images
One sequence file
Image Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map reduce
Map reduce
Raw Pixels
img1 [rgb1-rgbK]
img2 [rgb1-rgbK]
imgN [rgb1-rgbK]
Map reduce PL/X
SQL
Join to additional datasets
Feature Generation
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
ProteomicsMedical HistoryVariants
Additional Datasets
Build Models at-Scale
SQL
138© Copyright 2015 Pivotal. All rights reserved.
Representing an image in HAWQHAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations
Source Image:Col
Row
0 1 2012
0 00 10 21 01 11 22 02 12 2
col
row
intsy
Structured:
139© Copyright 2015 Pivotal. All rights reserved.
Translating image processing to simple SQL
Function Distribution of pixel intensities
SQL SELECT intsy, count(*) FROM tbl GROUP BY intsy
Output 150, 5215, 4
HAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations
No data movement required
Simple SQL queries for data exploration0 00 10 21 01 11 22 02 12 2
Source Image:Col
Row
0 1 2012
col
row
intsy
Structured:
140© Copyright 2015 Pivotal. All rights reserved.
Image Processing PipelineFor Object Counting
Original
Image name # Cells
Tma_001.jpg 359
Tma_002.jpg 1892
Tma_003.jpg 871
… …
SmoothingAverage over
window of pixels
ThresholdingSelect pixels under intensity threshold
CleanupMin/max over
window of pixels
Object DetectionConnected
components
Object CountingSelect components
with size filter
141© Copyright 2015 Pivotal. All rights reserved.
Image Computation Framework
Hadoop Sequenc
eFile
Thousands of Images
One sequence file
Image Pre-Processing
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
Feature Generation
HDFS HAWQ/GPDB
Map reduce
Map reduce
Raw Pixels
img1 [rgb1-rgbK]
img2 [rgb1-rgbK]
imgN [rgb1-rgbK]
Map reduce PL/X
SQL
Join to additional datasets
Feature Generation
Features
img1 [x1-xM]
img2 [x1-xM]
imgN [x1-xM]
ProteomicsMedical HistoryVariants
Additional Datasets
Build Models at-Scale
SQL
142© Copyright 2015 Pivotal. All rights reserved.
A Drug-Centric Data Lake to Enable Drug DiscoveryCustomer A major pharmaceutical company
Business ProblemIdentifying promising drug targets leveraging and integrating the vast datasets available will reduce time and cost to bring a new product to market
Challenges Data for drugs screens across multiple modalities
cannot be easily integrated Current environments cannot support the growing
data form high-content screens Researchers are unable to leverage the entirety of
datasets, instead work with aggregates or summaries
Solution Proved that current customer models can be ported,
sped up and scaled in the Pivotal environment
Created richer models integrating multiple types of data (genomics, images, etc)
Improve models using the raw, most-granular data
Demonstrate how the availability of tools and data enables scientists to interrogate models and derive a deeper, actionable insight
143© Copyright 2015 Pivotal. All rights reserved.
http://blog.pivotal.io/data-science-pivotalCheck out the Pivotal Data Science Blog!
144© Copyright 2015 Pivotal. All rights reserved.
FOR FURTHER INFO, CHECKOUT…
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Academy @ https://pivotal.biglms.com
145© Copyright 2015 Pivotal. All rights reserved. 145© Copyright 2013 Pivotal. All rights reserved.
Driving Insights from Data Lakes Manufacturing Demo
Matthew Ross & Antonio PetrolePivotal Data Engineering
146© Copyright 2015 Pivotal. All rights reserved.
Internet of What????
147© Copyright 2015 Pivotal. All rights reserved.
Industrial Internet of Things?
148© Copyright 2015 Pivotal. All rights reserved.
IoT Goes MainstreamAccording to Gartner, Inc. (a technology research and advisory corporation), there will be nearly 26 billion devices on the Internet of Things by 2020.
149© Copyright 2015 Pivotal. All rights reserved.
GE Doubles DownGE invests in IIoT cloud and creates Predix cloud built upon Pivotal Cloud Foundry. GE estimates that connecting these industrial machines to the IoT could boost global GDP by $10 trillion to $15 trillion in 20 years. McKinsey Global Institute research holds that IoT in general could add $6.2 billion to the global economy by 2025.
150© Copyright 2015 Pivotal. All rights reserved.
Converging Trends
InnovationNew Data New Processes New Insights
The Journey to the Data-Driven Enterprise
Data Scienceand Machine
LearningBig Data
Internet of Things
151© Copyright 2015 Pivotal. All rights reserved.
IoT Key WorkflowsData Flow Management
Reliable Infrastructure
Enterprise Level Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources and Destinations
● Workflow Orchestration
● Admin Tooling
● Developer Enablement
152© Copyright 2015 Pivotal. All rights reserved.
IoT Key WorkflowsData Flow Management
Reliable Infrastructure
Enterprise Level Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources and Destinations
● Workflow Orchestration
● Admin Tooling
● Developer Enablement
153© Copyright 2015 Pivotal. All rights reserved.
Data Flow Management-Data Overload● The ability to stream
and process massive amounts of data
● Must have a platform that can handle that much data without losing or corrupting any of it
154© Copyright 2015 Pivotal. All rights reserved.
Data Flow Management-Data Normalization and Cleansing
● Organizing Fields to fit into a relational structure
● Adding extra fields or removing unneeded ones
155© Copyright 2015 Pivotal. All rights reserved.
Data Flow Management-Multiple Sources and Destinations
● Stream data from multiple different sources
● Persist it so multiple different destinations
● Process multiple different data formats
156© Copyright 2015 Pivotal. All rights reserved.
IoT Key WorkflowsData Flow Management
Reliable Infrastructure
Enterprise Level Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources and Destinations
● Workflow Orchestration
● Admin Tooling
● Developer Enablement
157© Copyright 2015 Pivotal. All rights reserved.
Reliable Infrastructure-High Availability and Fault Tolerance
● Resources are available under any conditions
● System stays up even if some resources go down
158© Copyright 2015 Pivotal. All rights reserved.
Reliable Infrastructure-Scalability
● Must be able to handle more traffic demand at any time
● Easily process big data workloads
159© Copyright 2015 Pivotal. All rights reserved.
IoT Key WorkflowsData Flow Management
Reliable Infrastructure
Enterprise Level Tooling
● High Availability
● Fault Tolerant
● Scalable
● Data Overload
● Normalization
● Multiple Sources and Destinations
● Workflow Orchestration
● Admin Tooling
● Developer Enablement
160© Copyright 2015 Pivotal. All rights reserved.
Enterprise Level Tooling-Workflows and Tooling
● Manage complex flows of data
● Provide rich User Interface Applications
161© Copyright 2015 Pivotal. All rights reserved.
Enterprise Level Tooling-Developer Enablement
● Full Featured APIs
● Extreme module customization if needed
162© Copyright 2015 Pivotal. All rights reserved.
Are you making the most out of your data?
163© Copyright 2015 Pivotal. All rights reserved.
Bringing it all together
164© Copyright 2015 Pivotal. All rights reserved. 164© Copyright 2013 Pivotal. All rights reserved.
Reporting is nice, but being able to take action is what drives the value of a platform
165© Copyright 2015 Pivotal. All rights reserved.
Predictive Analytics
Proactive Monitoring
Reactive Maintenance
166© Copyright 2015 Pivotal. All rights reserved.
ETL vs Streaming● Data is loaded in large
batches● Typically happens once a
day● Analysis can only be done
once data is transformed and persisted to data warehouse
167© Copyright 2015 Pivotal. All rights reserved.
ETL vs Streaming● Continuous, data streams
are “listening” for data being emitted from sensors
● Data can be analyzed in stream
● Can be integrated into data driven applications
168© Copyright 2015 Pivotal. All rights reserved.
Reactive Maintenance● Alert is sent out to someone
on factory floor● Worker must receive alert, and
be able to react● Equipment is down for
however long it takes for worker to receive notification and complete repairs.
169© Copyright 2015 Pivotal. All rights reserved.
Proactive Maintenance Workflow● Manager has Dashboard with
all gauges of the system● Historical Log of Recent Alerts
are visible on the Dashboard● Has the ability to dispatch a
worker to investigate a specific line or robot
170© Copyright 2015 Pivotal. All rights reserved.
Predictive Analytics Workflow● Run Machine Learning and
Data Science models
● Incorporate Business Intelligence Tools
● Persist data to Data Lake and run advanced queries
171© Copyright 2015 Pivotal. All rights reserved.
Data Streaming Needs an Agile, Scalable and Fast Solution
Data Lake
Data Ingestion
Business Intelligence
Real Time Analytics
Mobile App
172© Copyright 2015 Pivotal. All rights reserved.
Ingest Transform SinkSpringXD
Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining
HDB PHD
DataLake
ExtensibleOpen-SourceFault-TolerantHorizontally Scalable
173© Copyright 2015 Pivotal. All rights reserved.
INGEST / SINK PROCESS ANALYZE
• No coding required
• Dozens of built-in connectors
• Seamless integration with Kafka, Sqoop
• Create new connectors easily using Spring
• Call Spark, Reactor or RxJava
• Built-in configurable filtering, splitting and transformation
• Out-of-box configurable jobs for batch processing
• Import and invoke PMML jobs easily
• Call Python, R, Madlib and other tools
• Built-in configurable counters and gauges
Spring XDState of the Art Data Pipeline Automation
174© Copyright 2015 Pivotal. All rights reserved.
Pivotal HDBHadoop Native SQL
• Exceptional Hadoop Native SQL Performance
• No compatibility risks to SQL developers or SQL BI tools and applications
• Support query roll-ups, dynamic partitions and joins
• Massive MPP scalability to petabytes
• On premise or on the cloud
• Scale your cluster out, not up
• World class parallel loading and unloading
• Fast performance for complex and advanced data analytics
• Integrated with MADLib for advanced machine learning
• Powerful Cost-based Query Optimizer
BUILT FOR THE SPEED OF BUSINESS