Data as the New Oil: Producing Value in the Oil and Gas Industry
1 © 2014 Pivotal Software, Inc. All rights reserved.
Data as the New Oil Producing Value for the Oil & Gas Industry
Data: The New Oil
• Oil and gas exploration and production activities generate large amounts of data from sensors, logistics, business operations and more
• The rise of cost-effective data collection, storage and computing devices is giving an established industry a new boost
• Producing value from big data is a challenge and an opportunity in the industry
• The promise of data as "the new oil" is realized when we can tap into its value in a meaningful, cross-functional way to enhance decision-making; that is what provides the competitive advantage
http://commons.wikimedia.org/wiki/File:Rig_wind_river.jpg
Challenges and Opportunities
Challenges
• Current data collection and curation practices are mostly in silos
• Different data models for data from different functions in the organization
• Missing or incomplete data for integrating varied data sources
• Legacy systems that need to be taken into consideration
• Domain expertise in silos: the ability to work across domains is needed to extract full value from 'the new oil'
Opportunities
• Data Lake concepts and technology allow data to be stored centrally and curated in a meaningful way
• Comprehensive, single view of the truth:
  – Integration of data assets leads to more informed, powerful models
  – Many "first-of-its-kind" models become possible for the business
  – These models enhance decision making by providing better predictions
• Real-time application of predictive models can speed up responses to events
Significant Use Cases
• Predictive Maintenance
  – Model equipment function and failure
  – Optimize maintenance schedules
  – Real-time alerts based on predictive models
• Seismic Imaging and Inversion Analysis
• Reservoir Simulation and Management
• Production Optimization
• Supply Chain Optimization
• Energy Trading
Predictive Analytics for Drilling Operations Predicting Equipment Function and Failure
Predictive Analytics for Drilling Operations
Business Goals
• Increase efficiency, reduce costs
• Take steps towards zero unplanned downtime
• Predict equipment function for maintenance
• Provide an early-warning system for equipment failure
• Optimize parameters for drilling operations
• Reduce health, safety and environmental risks

Big Data Sources
• Sensor data
  – Surface and down-hole sensors
  – Measurement While Drilling (MWD)
  – SCADA data
• Drill operator data
  – Operator comments
  – Activity logs / codes
  – Incident reports / logs
• And more …
Introduction Data Integration Feature Building Modeling & Impact
Predicting Equipment Function and Failure
• Business Problem: Predict drilling equipment function and failure – a step towards early warning systems and zero unplanned downtime
• Motivation: Drilling wells is expensive, and equipment failure during drilling adds substantial cost. Example: drilling motor damage could account for 35% of rig non-productive time (NPT) and can cost $150,000 per incident [1]
• Goals:
  – Predict equipment function and failure → this enables:
    • Optimization of parameters for efficient drilling
    • Reduced non-productive drill time (and costs)
    • Fewer failures
– Provide insights into prominent features impacting operation and failure
1 The American Oil & Gas Reporter, April 2014 Cover Story
The Eightfold Path of Data Science: Four Phases and Four Differentiating Factors

Technology Selection – Select the right platform and the right set of tools for solving the problem at hand
Iterative Approach – Perform each phase in an agile manner, team up with domain experts and SMEs, and iterate as required
Creativity – Take the opportunity to innovate at every phase
Building a Narrative – Create a fact-based narrative that clearly communicates insights to stakeholders

Phase 1: Problem Formulation – Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders
Phase 2: Data Step – Build the right feature set, making full use of the volume, variety and velocity of all available data
Phase 3: Modeling Step – This is where you move from answering what, where and when to answering why and what if
Phase 4: Application – Create a framework for integrating the model with decision-making processes and taking action using the Internet of Things
Technology Selection
• Platform for all phases of the analytics cycle
• Support development of complex and extensible predictive models to predict equipment function and failure
• Provide a framework for integrating data from multiple sources across data warehouses and rig operators
• Ability to analyze both structured and unstructured data in a unified manner. For instance:
  – Support fast computation of hundreds of features over time windows within hundreds of millions (or billions / trillions) of records of time-series data
  – Natural language processing pipeline for analysis of operator comments to identify failures from unstructured text
PL/Python PL/R
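As a minimal sketch of the unstructured-text requirement above: the example below flags operator comments that mention equipment failures via keyword matching. The failure phrases and sample comments are invented for illustration; a production NLP pipeline would go well beyond simple regular expressions.

```python
import re

# Hypothetical failure phrases; a real pipeline would use a fuller
# NLP model rather than keyword matching.
FAILURE_PATTERN = re.compile(
    r"\b(motor (?:failure|damage)|stuck pipe|washout|twist[- ]?off)\b",
    re.IGNORECASE,
)

def flag_failures(comments):
    """Return (comment, matched phrase) for comments that mention a failure."""
    hits = []
    for c in comments:
        m = FAILURE_PATTERN.search(c)
        if m:
            hits.append((c, m.group(1).lower()))
    return hits

comments = [
    "Tripped out of hole, suspected motor damage at 8200 ft",
    "Routine connection, no issues",
    "Twist-off on drill string, fishing operations started",
]
for comment, phrase in flag_failures(comments):
    print(phrase, "->", comment)
```

On the Pivotal platform the same logic would run in-database via PL/Python over the comments table.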
Predictive Analytics for Drilling Operations
• Consider two examples:
  – Predicting drill rate-of-penetration (ROP)
  – Predicting drilling equipment failure
• Primary data sources for these examples:
  – Drill rig sensor data: depth, rate of penetration (ROP), RPM, torque, weight on bit, etc. (billions of records)
  – Operator data: drill bit details, failure details, component details, etc. (hundreds of thousands of records)
Comprehensive Data Integration Framework
• Need a comprehensive framework for data integration at scale
  – Data cleansing: removing NULLs and outliers, missing-value imputation techniques
  – Standardizing columns that are used to join across multiple data sources

[Diagram: drill rig sensor data and operator data integrated into a single dataset]
Data Integration Challenges
• Data sources do not use consistent entries in the features / columns that link them (join columns), e.g. well names
• Manually entered data (some operator data) is prone to entry errors
  – Hitting several keys
  – Key strokes not appearing (e.g. missing a character / digit)
• Invalid values for sensor measurements
  – Invalid values could be placeholders for sensor malfunction or non-recording time
  – The duration of invalid values can range from one-off occurrences to several hours
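Locating those placeholder runs and measuring their duration can be sketched in a few lines of Python. The sentinel value -999.25 (a null marker commonly seen in well-log data) and the sample series are assumptions for illustration:

```python
def invalid_runs(values, sentinel=-999.25):
    """Locate runs of a placeholder value in a sensor series.
    Returns (start_index, run_length) pairs."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v == sentinel and start is None:
            start = i                      # a run of invalid values begins
        elif v != sentinel and start is not None:
            runs.append((start, i - start))  # the run just ended
            start = None
    if start is not None:                  # series ended inside a run
        runs.append((start, len(values) - start))
    return runs

series = [52.1, -999.25, -999.25, 53.0, 52.8, -999.25]
print(invalid_runs(series))  # [(1, 2), (5, 1)]
```

Run length times the sampling interval gives the invalid duration, from one-off occurrences up to hours.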
Data Integration Challenges
• Standardization of join column entries across data sources
• Problem: Data sources do not use consistent entries in join columns
• Resolution options: derive a canonical representation for the columns
  – Regular expression transformations
  – String edit distance computations → closest-distance matches
  – Plus manual correction where needed
• Include standardized entries in each table

Example mismatched entries:
Data Source #1      Data Source #2
A B C               A-B-C
PARENT-TEACHER      PARENT-TEACHERS
GRANDFATHER CLOK    GRANDFATHER_CLOCK
KOALA 123           KOALA 122
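The closest-distance matching above can be sketched with the Python standard library's difflib (similar in spirit to edit-distance matching). The canonical name list and the normalization rules are hypothetical:

```python
import difflib

# Hypothetical canonical name list; in practice this would come from a
# master/reference table.
canonical = ["A-B-C", "PARENT-TEACHERS", "GRANDFATHER_CLOCK", "KOALA 122"]

def standardize(entry, candidates, cutoff=0.6):
    """Map a raw join-column entry to its closest canonical form,
    or None if no candidate is close enough (flag for manual review)."""
    def norm(s):
        # regex-style transformation step: unify separators and case
        return s.upper().replace("-", " ").replace("_", " ")
    normalized = {norm(c): c for c in candidates}
    matches = difflib.get_close_matches(norm(entry), normalized, n=1, cutoff=cutoff)
    return normalized[matches[0]] if matches else None

print(standardize("A B C", canonical))             # A-B-C
print(standardize("GRANDFATHER CLOK", canonical))  # GRANDFATHER_CLOCK
```

Entries that fall below the cutoff come back as None, which is where the manual-correction step picks up.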
Data Integration Challenges
• Problem: Manually entered data is prone to operator entry errors
  – Hitting several keys
  – Key strokes not appearing (e.g. missing a digit / character)
• Resolution options:
  – Ignore rows if the depth does not lie between the previous and next values
  – Replace the value with an interpolated result

Timestamp              Depth
2014-09-01 00:06:00    13504
2014-09-02 00:05:00    140068   (entry error: extra digit)
2014-09-03 00:07:00    14754
2014-09-04 00:11:00    15388
2014-09-05 00:16:00    16100
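The second resolution option can be sketched as follows on the Depth column above. For brevity this uses the midpoint of the neighboring readings rather than timestamp-weighted interpolation:

```python
def repair_depths(depths):
    """Replace any depth reading that does not lie between its neighbors
    with the midpoint of those neighbors (simple interpolation)."""
    fixed = list(depths)
    for i in range(1, len(fixed) - 1):
        prev, cur, nxt = fixed[i - 1], fixed[i], fixed[i + 1]
        if not (min(prev, nxt) <= cur <= max(prev, nxt)):
            fixed[i] = (prev + nxt) / 2
    return fixed

# Depth column from the table above; 140068 is the entry error
print(repair_depths([13504, 140068, 14754, 15388, 16100]))
# [13504, 14129.0, 14754, 15388, 16100]
```

The same rule, expressed with lead/lag window functions, runs in-database over the full time series.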
Understanding Correlations in Data
• Summary statistics and correlations between variables need to be computed at scale for thousands of variable combinations
• Able to leverage MADlib's parallel implementations of:
  – the 'summary' function
  – Pearson's correlation
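MADlib runs this in parallel inside the database; the underlying computation is just Pearson's r, sketched here in plain Python with made-up RPM / torque readings:

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))   # unnormalized covariance
    sx = sqrt(sum((a - mx) ** 2 for a in x))               # unnormalized std dev terms
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up sensor readings for illustration
rpm    = [100, 110, 120, 130, 140]
torque = [10.0, 11.2, 11.9, 13.1, 14.0]
print(round(pearson(rpm, torque), 3))
```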
Big Data Machine Learning in SQL
Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Fields (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics
• Sketch-based Estimators
  – CountMin (Cormode-Muthukrishnan)
  – FM (Flajolet-Martin)
  – MFV (Most Frequent Values)
• Correlation
• Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• PMML Export

http://madlib.net/
Complex Feature Set Across Multiple Data Sources
• Often useful to create features from time series variables rather than just using them raw
• One such class of features is statistical features computed on moving windows of time series data
• Fast computation of features is possible on Pivotal's MPP platform, leveraging window functions in native SQL (and MADlib or PL/R if needed for added functionality)
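A single-machine Python sketch of these moving-window statistical features (on the MPP platform the same quantities are computed at scale with SQL window functions); the ROP readings below are invented:

```python
from statistics import mean, median, stdev

def window_features(series, width):
    """Statistical features (mean, median, stdev, range) on each full
    moving window of the given width over a time-ordered series."""
    rows = []
    for i in range(len(series) - width + 1):
        w = series[i:i + width]
        rows.append({
            "mean": mean(w),
            "median": median(w),
            "stdev": stdev(w),
            "range": max(w) - min(w),
        })
    return rows

rop = [30.1, 29.8, 31.0, 28.5, 27.9, 30.4]  # hypothetical ROP readings
feats = window_features(rop, width=3)
print(len(feats), feats[0])
```

Each sensor variable contributes one such feature set per window, so hundreds of features accumulate quickly across billions of rows.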
Complex Feature Set Across Multiple Data Sources

Drill rig sensor data (Depth, Rate of Penetration, Torque, Weight on Bit, RPM, …) and operator data (drill bit details, component details, failure events, …) feed into features on time windows (mean, median, standard deviation, range, skewness, …), yielding the final set of features on time windows.

Leverage GPDB / HAWQ (+ MADlib and PL/R if needed) for fast computation of hundreds of features over time windows within billions of rows of time-series data
Working with Time Series Data
• Pivotal GPDB has built-in support for dealing with time series data
  – SQL window functions: e.g. lead, lag, custom windows
  – More details in Pivotal's Time Series Analysis blogs: http://blog.pivotal.io/tag/time-series-analysis
• Aggregations: by time slice or by custom window; example aggregates: avg, median, variance
• Mapping: which time slice does an observation at a particular timestamp map to?
• Pattern detection
• Rolling averages
• Gap filling and interpolation
• Running accumulations
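The time-slice mapping question above can be sketched in Python. Slices here are assumed to be fixed-width and aligned to the start of the day; in GPDB the same mapping would be a date-truncation expression in SQL:

```python
from datetime import datetime, timedelta

def slice_start(ts, width):
    """Start of the fixed-width time slice containing ts; slices are
    assumed to be aligned to the start of the day."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    n = int((ts - day) / width)   # how many full slices precede ts today
    return day + n * width

ts = datetime(2014, 9, 1, 0, 6, 37)
print(slice_start(ts, timedelta(minutes=5)))  # 2014-09-01 00:05:00
```

Grouping observations by their slice start is then all that per-slice aggregation needs.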
Predictive Analytics for Drilling Operations
Predict function
• Predict rate-of-penetration (ROP)
  – Linear Regression
  – Elastic Net Regularized Regression (Gaussian)
  – Support Vector Machines

Predict failure
• Predict the occurrence of equipment failure in a chosen future time window
  – Logistic Regression
  – Elastic Net Regularized Regression (Binomial)
  – Support Vector Machines
• Predict remaining life of equipment
  – Cox Proportional Hazards Regression

Why Elastic Net Regularized Regression?
• Fits the problem statements
• Ease of interpretation, scoring and operationalization
• Provides a probability of failure in the binomial case
• Leveraged MADlib's in-database parallel implementation
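Scoring the binomial model reduces to applying the logistic function to a weighted feature sum. The coefficients and feature values below are made up for illustration, not fitted model output:

```python
from math import exp

def failure_probability(features, coefs, intercept):
    """Logistic model score: p = 1 / (1 + exp(-(intercept + w . x)))."""
    z = intercept + sum(w * x for w, x in zip(coefs, features))
    return 1.0 / (1.0 + exp(-z))

# Made-up coefficients, e.g. for (torque_stddev, rop_mean, vibration_max)
coefs = [0.8, -0.3, 1.2]
p = failure_probability([1.5, 0.4, 2.0], coefs, intercept=-3.0)
print(round(p, 3))
```

This per-row probability is what an early-warning threshold would be applied to in real time.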
Background on Elastic Net Regularization
• Elastic Net regularization seeks to find a weight vector w that, for any given set of training examples, minimizes:

    L(w) + λ [ α ‖w‖₁ + ((1 − α)/2) ‖w‖₂² ]

  where α ∈ [0, 1], λ ≥ 0 and L(w) is the linear / logistic objective function
• If α = 0 → Ridge regularization
• If α = 1 → LASSO regularization

Ordinary Least Squares
  Advantages: unbiased estimators; significance levels for coefficients
  Limitations: highly affected by multi-collinearity; requires more records than predictors; no feature selection

Elastic Net Regularization
  Advantages: biased towards smaller MSE; fewer limitations on the number of predictors; better at handling multi-collinearity; feature selection
  Limitations: multiple parameters; no significance levels for coefficients

Available in MADlib: http://doc.madlib.net/latest/group__grp__elasticnet.html
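The regularization term itself is easy to check numerically; a plain-Python sketch of the penalty (the weight vector and hyperparameters are arbitrary example values):

```python
def elastic_net_penalty(w, lam, alpha):
    """Elastic Net term: lam * (alpha * ||w||_1 + ((1 - alpha)/2) * ||w||_2^2).
    alpha = 1 reduces to the LASSO penalty, alpha = 0 to the Ridge penalty."""
    l1 = sum(abs(x) for x in w)        # ||w||_1
    sq_l2 = sum(x * x for x in w)      # ||w||_2^2
    return lam * (alpha * l1 + (1 - alpha) * sq_l2 / 2)

w = [0.5, -1.0, 2.0]
print(elastic_net_penalty(w, lam=0.1, alpha=1.0))  # pure L1 term
print(elastic_net_penalty(w, lam=0.1, alpha=0.0))  # pure (L2^2)/2 term
```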
Predictive Analytics for Drilling Operations
[Figures: ROP time series (actual vs. predicted over time) for ROP prediction, and ROC curve for equipment-failure prediction]
Data Science Platform and Technology Summary

[Diagram: platform, PL/Python, PL/R, and visualization tooling logos]
One step closer to zero unplanned downtime …
Business Impacts
• Ability to fully utilize big data: volume, variety and velocity
• Comprehensive data integration framework for multiple complex data sources
• Learn and implement best practices for:
  – Data governance policy
  – Data capture techniques, flow, and curation
  – Platform and toolset for the data fabric
• Build and operationalize complex and extensible predictive models
• Improve efficiency; reduce costs and risks
• Gain competitive advantage by leveraging the full big data analytics pipeline