Data as the New Oil: Producing Value in the Oil and Gas Industry


Transcript of Data as the New Oil: Producing Value in the Oil and Gas Industry

© 2014 Pivotal Software, Inc. All rights reserved.


Data: The New Oil

•  Oil and gas exploration and production activities generate large amounts of data from sensors, logistics, business operations and more

•  The rise of cost-effective data collection, storage and computing devices is giving an established industry a new boost

•  Producing value from big data is a challenge and an opportunity in the industry

•  The promise of data as “the new oil” is realized when its value can be tapped in a meaningful, cross-functional way to enhance decision-making; that is where the competitive advantage lies

http://commons.wikimedia.org/wiki/File:Rig_wind_river.jpg


Challenges and Opportunities

Challenges
•  Current data collection and curation practices are mostly siloed
•  Different data models for data from different functions in the organization
•  Missing or incomplete data when integrating varied data sources
•  Legacy systems that need to be taken into consideration
•  Domain expertise in silos – the ability to work across domains is needed to extract full value from ‘the new oil’

Opportunities
•  Data Lake concepts and technology allow data to be stored centrally and curated in a meaningful way
•  Comprehensive, single view of the truth:
   –  Integration of data assets leads to more informed, powerful models
   –  Many “first-of-its-kind” models become possible for the business
   –  These models enhance decision making by providing better predictions
•  Real-time application of predictive models can speed up responses to events


Significant Use Cases

•  Predictive Maintenance
   –  Model equipment function and failure
   –  Optimize maintenance schedules
   –  Real-time alerts based on predictive models
•  Seismic Imaging and Inversion Analysis
•  Reservoir Simulation and Management
•  Production Optimization
•  Supply Chain Optimization
•  Energy Trading


Predictive Analytics for Drilling Operations Predicting Equipment Function and Failure


Predictive Analytics for Drilling Operations

Business Goals
•  Increase efficiency, reduce costs
•  Take steps towards zero unplanned downtime
•  Predict equipment function for maintenance
•  Provide an early warning system for equipment failure
•  Optimize parameters for drilling operations
•  Reduce health, safety and environmental risks

Big Data Sources
•  Sensor data
   –  Surface and down-hole sensors
   –  Measurement While Drilling (MWD)
   –  SCADA data
•  Drill operator data
   –  Operator comments
   –  Activity log / codes
   –  Incident reports / logs
•  And more …

(Presentation sections: Introduction – Data Integration – Feature Building – Modeling & Impact)


Predicting Equipment Function and Failure

•  Business Problem: Predict drilling equipment function and failure – a step towards early warning systems and zero unplanned downtime

•  Motivation: Drilling wells is expensive, and equipment failure during the process adds to the cost. Example: drilling motor damage could account for 35% of rig non-productive time (NPT) and can cost $150,000 per incident [1]

•  Goals:
   –  Predict equipment function and failure → this enables:
      •  Optimization of parameters for efficient drilling
      •  Reducing non-productive drill time (and costs)
      •  Reducing failures
   –  Provide insights into prominent features impacting operation and failure


[1] The American Oil & Gas Reporter, April 2014 Cover Story


The Eightfold Path of Data Science Four Phases and Four Differentiating Factors

Four differentiating factors:
•  Technology Selection – Select the right platform and the right set of tools for solving the problem at hand
•  Iterative Approach – Perform each phase in an agile manner, team up with domain experts and SMEs, and iterate as required
•  Creativity – Take the opportunity to innovate at every phase
•  Building a Narrative – Create a fact-based narrative that clearly communicates insights to stakeholders

Four phases:
•  Phase 1: Problem Formulation – Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders
•  Phase 2: Data Step – Build the right feature set, making full use of the volume, variety and velocity of all available data
•  Phase 3: Modeling Step – This is where you move from answering what, where and when to answering why and what if
•  Phase 4: Application – Create a framework for integrating the model with decision-making processes and taking action using the Internet of Things



Technology Selection

•  Platform for all phases of the analytics cycle
•  Support development of complex and extensible predictive models to predict equipment function and failure
•  Provide a framework for integrating data from multiple sources across data warehouses and rig operators
•  Ability to analyze both structured and unstructured data in a unified manner. For instance:
   –  Support fast computation of hundreds of features over time windows within 100s of millions (or billions / trillions) of records of time-series data
   –  Natural language processing pipeline for analysis of operator comments to identify failures from unstructured text
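As a minimal illustration of the last point, the sketch below flags failure mentions in free-text operator comments with a keyword pattern. The comments and keyword list are hypothetical, not Pivotal's actual pipeline; a production NLP pipeline would add tokenization, normalization and a trained classifier on top of this kind of first pass.

```python
import re

# Hypothetical failure vocabulary for drilling operator comments.
FAILURE_PATTERN = re.compile(
    r"\b(fail(?:ed|ure)?|broke(?:n)?|stall(?:ed)?|wash\s*out|twist\s*off)\b",
    re.IGNORECASE)

comments = [
    "Motor stalled at 9200 ft, tripping out",            # hypothetical entries
    "Routine connection, no issues",
    "Suspected washout in drill string, pressure drop",
]

# Keep only comments that mention a failure-related term.
flagged = [c for c in comments if FAILURE_PATTERN.search(c)]
print(flagged)
```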

Technologies: PL/Python, PL/R


Predictive Analytics for Drilling Operations

•  Consider two examples:
   –  Predicting drill rate-of-penetration (ROP)
   –  Predicting drilling equipment failure
•  Primary data sources for these examples:
   –  Drill rig sensor data: Depth, Rate of Penetration (ROP), RPM, Torque, Weight on Bit, etc. (billions of records)
   –  Operator data: drill bit details, failure details, component details, etc. (100s of thousands of records)

(Workflow: Data Integration → Feature Building → Modeling)



Comprehensive Data Integration Framework

•  Need a comprehensive framework for data integration at scale:
   –  Data cleansing – removing NULLs and outliers, missing-value imputation techniques
   –  Standardizing columns that are used to join across multiple data sources

(Diagram: drill rig sensor data and operator data integrated into a single view)


Data Integration Challenges

•  Data sources do not use consistent entries in the features / columns that link them (join columns) – e.g. well names
•  Manually entered data (some operator data) is prone to entry errors
   –  Hitting several keys
   –  Key strokes not appearing (e.g. missing a character / digit)
•  Invalid values for sensor measurements
   –  Invalid values could be placeholders for sensor malfunction or non-recording time
   –  Duration of invalid values can range from one-off occurrences to several hours


Data Integration Challenges

•  Standardization of join column entries across data sources
•  Problem: Data sources do not use consistent entries in join columns
•  Resolution options: derive a canonical representation for the columns
   –  Regular expression transformations
   –  String edit distance computations → closest-distance matches
   –  Plus manual correction
•  Include standardized entries in each table

Data Source #1      Data Source #2
A B C               A-B-C
PARENT-TEACHER      PARENT-TEACHERS
GRANDFATHER CLOK    GRANDFATHER_CLOCK
KOALA 123           KOALA 122
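The resolution steps above (regex-style normalization, then closest edit-distance match, with unmatched rows left for manual correction) can be sketched in a few lines of Python. The canonical names below mirror the slide's examples; `difflib.get_close_matches` stands in for the edit-distance computation a real pipeline would run in-database.

```python
import difflib

# Canonical join-key values from data source #2 (the slide's examples).
canonical = ["A-B-C", "PARENT-TEACHERS", "GRANDFATHER_CLOCK", "KOALA 122"]

def normalize(name):
    """Regex-transformation stand-in: uppercase and collapse separators."""
    return name.upper().replace("_", " ").replace("-", " ").strip()

def closest_match(raw, candidates, cutoff=0.6):
    """Return the candidate closest in edit distance (highest similarity
    ratio), or None if nothing clears the cutoff -- those rows go to
    manual correction."""
    norm_candidates = [normalize(c) for c in candidates]
    matches = difflib.get_close_matches(normalize(raw), norm_candidates,
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    # Map the normalized match back to its original canonical form.
    return candidates[norm_candidates.index(matches[0])]

print(closest_match("GRANDFATHER CLOK", canonical))   # GRANDFATHER_CLOCK
print(closest_match("KOALA 123", canonical))          # KOALA 122
```

The cutoff trades false matches against manual-correction workload; it would be tuned per join column.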


Data Integration Challenges

•  Problem: Manually entered data is prone to operator entry errors
   –  Hitting several keys
   –  Key strokes not appearing (e.g. missing a digit / character)
•  Resolution options:
   –  Ignore rows if depth does not lie between the previous and next values
   –  Replace the value with an interpolated result

Timestamp              Depth
2014-09-01 00:06:00    13504
2014-09-02 00:05:00    140068
2014-09-03 00:07:00    14754
2014-09-04 00:11:00    15388
2014-09-05 00:16:00    16100
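On the slide's example data, the second resolution option can be sketched as follows: a depth reading is flagged when it does not lie between its neighbours (depth increases monotonically while drilling) and is replaced by a simple interpolation. This is a minimal sketch of the idea, not a production cleansing routine.

```python
rows = [
    ("2014-09-01 00:06:00", 13504),
    ("2014-09-02 00:05:00", 140068),   # entry error: extra digit typed
    ("2014-09-03 00:07:00", 14754),
    ("2014-09-04 00:11:00", 15388),
    ("2014-09-05 00:16:00", 16100),
]

def clean_depths(rows):
    depths = [d for _, d in rows]
    cleaned = depths[:]
    for i in range(1, len(depths) - 1):
        # A valid reading must lie between the (already cleaned)
        # previous value and the next raw value.
        prev, cur, nxt = cleaned[i - 1], depths[i], depths[i + 1]
        if not (prev <= cur <= nxt):
            cleaned[i] = (prev + nxt) / 2   # replace with interpolated result
    return cleaned

print(clean_depths(rows))   # 140068 becomes (13504 + 14754) / 2 = 14129.0
```

Interpolating against timestamps rather than taking a midpoint would be the natural refinement when sampling intervals vary.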



Understanding Correlations in Data

•  Summary statistics and correlations between variables need to be computed at scale for 1000s of variable combinations
•  Able to leverage MADlib’s parallel implementation of:
   –  the ‘summary’ function
   –  Pearson’s correlation
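For reference, the statistic MADlib parallelizes is ordinary Pearson correlation; here is a plain-Python version for a single variable pair, with hypothetical weight-on-bit and torque readings standing in for real sensor columns.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for one pair of variables --
    MADlib computes this in parallel across many pairs in-database."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sensor readings: weight-on-bit vs. torque.
wob    = [10.0, 12.0, 14.0, 16.0, 18.0]
torque = [ 5.1,  5.9,  7.2,  7.8,  9.0]
print(round(pearson(wob, torque), 3))   # 0.995
```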

Introduction Data Integration Feature Building Modeling & Impact


Big Data Machine Learning in SQL


Predictive Modeling Library (MADlib, http://madlib.net/)

•  Linear Systems
   –  Sparse and dense solvers
•  Matrix Factorization
   –  Singular Value Decomposition (SVD)
   –  Low-rank factorization
•  Generalized Linear Models
   –  Linear Regression
   –  Logistic Regression
   –  Multinomial Logistic Regression
   –  Cox Proportional Hazards Regression
   –  Elastic Net Regularization
   –  Sandwich Estimators (Huber-White, clustered, marginal effects)
•  Machine Learning Algorithms
   –  Principal Component Analysis (PCA)
   –  Association Rules (affinity analysis, market basket)
   –  Topic Modeling (parallel LDA)
   –  Decision Trees
   –  Ensemble Learners (Random Forests)
   –  Support Vector Machines
   –  Conditional Random Field (CRF)
   –  Clustering (K-means)
   –  Cross Validation
•  Descriptive Statistics
   –  Sketch-based estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
   –  Correlation, summary
•  Support Modules
   –  Array operations, sparse vectors, random sampling, probability functions, PMML export


Complex Feature Set Across Multiple Data Sources

•  Often useful to create features from time series variables rather than just using them raw
•  One such class of features is statistical features created on moving windows of time series data
•  Fast computation of features is possible on Pivotal’s MPP platform, leveraging window functions in native SQL (and MADlib or PL/R if needed for added functionality)
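The moving-window statistical features described above can be sketched in plain Python; the ROP values are hypothetical, and at scale this computation would run as SQL window functions rather than a Python loop.

```python
from collections import deque
import statistics

def window_features(values, width):
    """Statistical features (mean, stdev, range) over a moving window
    of a time series -- one feature row per fully populated window."""
    win = deque(maxlen=width)
    feats = []
    for v in values:
        win.append(v)
        if len(win) == width:
            feats.append({
                "mean": sum(win) / width,
                "stdev": statistics.stdev(win),
                "range": max(win) - min(win),
            })
    return feats

rop = [30.0, 32.0, 31.0, 29.0, 35.0, 34.0]   # hypothetical ROP readings
for f in window_features(rop, 3):
    print(f)
```

Median, skewness and the rest of the slide's feature list follow the same pattern, one aggregate per window.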



Complex Feature Set Across Multiple Data Sources

•  Drill rig sensor data: Depth, Rate of Penetration, Torque, Weight on Bit, RPM, …
•  Operator data: drill bit details, component details, failure events, …
•  Features on time windows: mean, median, standard deviation, range, skewness, …
•  Combined into the final set of features on time windows

Leverage GPDB / HAWQ (+ MADlib and PL/R if needed) for fast computation of hundreds of features over time windows within billions of rows of time-series data


Working with Time Series Data

•  Pivotal GPDB has built-in support for dealing with time series data
   –  SQL window functions: e.g. lead, lag, custom windows
   –  More details in Pivotal’s Time Series Analysis blogs: http://blog.pivotal.io/tag/time-series-analysis
•  Aggregations
   –  By time slice
   –  By custom window
   –  Example aggregates: avg, median, variance
•  Mapping: what time slice does an observation at a particular timestamp map to?
•  Pattern detection
•  Rolling averages; gap filling and interpolation; running accumulations
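The mapping operation above, assigning each observation to its time slice, can be sketched in plain Python; in SQL this kind of bucketing is typically done with `date_trunc` and interval arithmetic.

```python
from datetime import datetime

def to_time_slice(ts, slice_minutes=15):
    """Map an observation's timestamp to the start of its time slice.
    Assumes slice_minutes divides 60 so slices align within each hour."""
    return ts.replace(minute=(ts.minute // slice_minutes) * slice_minutes,
                      second=0, microsecond=0)

print(to_time_slice(datetime(2014, 9, 1, 0, 6, 0)))                   # 2014-09-01 00:00:00
print(to_time_slice(datetime(2014, 9, 4, 0, 11, 0), slice_minutes=5)) # 2014-09-04 00:10:00
```

Grouping by the slice value then gives the per-slice aggregates (avg, median, variance) from the slide.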


Predictive Analytics for Drilling Operations

Predict function
•  Predict Rate-of-Penetration
   –  Linear Regression
   –  Elastic Net Regularized Regression (Gaussian)
   –  Support Vector Machines

Predict failure
•  Predict occurrence of equipment failure in a chosen future time window
   –  Logistic Regression
   –  Elastic Net Regularized Regression (Binomial)
   –  Support Vector Machines
•  Predict remaining life of equipment
   –  Cox Proportional Hazards Regression

Why Elastic Net Regularized Regression?
•  Fits the problem statements
•  Ease of interpretation, scoring and operationalization
•  Provides a probability of failure in the binomial case
•  Leveraged MADlib’s in-database parallel implementation
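Scoring a fitted binomial model of this kind reduces to a logistic function of the linear predictor, which is what makes it easy to operationalize. The coefficients and window features below are hypothetical stand-ins for MADlib's fitted output, shown only to illustrate the scoring step.

```python
import math

def failure_probability(features, weights, intercept):
    """Probability of failure in the chosen future time window for a
    fitted binomial (logistic) model: logistic(intercept + w . x)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients on standardized window features
# (mean torque, torque stdev, ROP range).
weights = [0.8, 1.5, -0.4]
intercept = -2.0

p = failure_probability([0.2, 1.1, 0.5], weights, intercept)
print(round(p, 3))
```

A threshold on this probability (chosen against the cost of false alarms vs. missed failures) then drives the early-warning alerts.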


Background on Elastic Net Regularization

•  Elastic Net regularization seeks to find a weight vector w that, for any given training example set, minimizes:

      L(w) + λ [ α ‖w‖₁ + ((1 − α) / 2) ‖w‖₂² ]

   where α ∈ [0, 1], λ ≥ 0 and L(w) is the linear / logistic objective function
•  If α = 0 → Ridge regularization
•  If α = 1 → LASSO regularization

Ordinary Least Squares
   –  Advantages: unbiased estimators; significance levels for coefficients
   –  Limitations: highly affected by multi-collinearity; requires more records than predictors; no feature selection

Elastic Net Regularization
   –  Advantages: biased, but towards smaller MSE; fewer limitations on the number of predictors; better at handling multi-collinearity; feature selection
   –  Limitations: multiple parameters to tune; no significance levels for coefficients

Available in MADlib: http://doc.madlib.net/latest/group__grp__elasticnet.html
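The shrinkage effect of the penalty can be seen numerically in the α = 0 (Ridge) special case. For a one-feature linear model with no intercept, minimizing ½ Σᵢ (yᵢ − w·xᵢ)² + (λ/2) w² has the closed form w = Σ xy / (Σ x² + λ), so larger λ pulls the coefficient towards zero. This toy example is illustrative only; MADlib solves the general problem iteratively.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form Ridge (alpha = 0) solution for one feature, no
    intercept: w = sum(x*y) / (sum(x^2) + lambda)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]           # exact relationship y = 2x
print(ridge_1d(xs, ys, 0.0))   # 2.0  (ordinary least squares fit)
print(ridge_1d(xs, ys, 14.0))  # 1.0  (heavy regularization shrinks w)
```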



Predictive Analytics for Drilling Operations

(Figures: actual vs. predicted ROP time series; ROC curve for the equipment failure classifier)


Data Science Platform and Technology Summary

(Diagram: Pivotal platform, PL/Python and PL/R in-database languages, and visualization tools)


One step closer to zero unplanned downtime …

•  Ability to fully utilize big data – volume, variety and velocity
•  Comprehensive data integration framework for multiple complex data sources
•  Learn and implement best practices for:
   –  Data governance policy
   –  Data capture techniques, flow, and curation
   –  Platform and toolset for the data fabric
•  Build and operationalize complex and extensible predictive models

Business Impacts
•  Improve efficiency, reduce costs and risks
•  Gain competitive advantage by leveraging the full big data analytics pipeline

A NEW PLATFORM FOR A NEW ERA