Data as the New Oil: Producing Value in the Oil and Gas Industry


Transcript of Data as the New Oil: Producing Value in the Oil and Gas Industry

© 2014 Pivotal Software, Inc. All rights reserved.


Data: The New Oil

•  Oil and gas exploration and production activities generate large amounts of data from sensors, logistics, business operations and more

•  The rise of cost-effective data collection, storage and computing devices is giving an established industry a new boost

•  Producing value from big data is a challenge and an opportunity in the industry

•  The promise of data as “the new oil” is realized when its value can be tapped in a meaningful, cross-functional way to enhance decision-making; that is where the competitive advantage lies

http://commons.wikimedia.org/wiki/File:Rig_wind_river.jpg


Challenges and Opportunities

Challenges
•  Current data collection and curation practices are mostly siloed
•  Different data models for data from different functions in the organization
•  Missing or incomplete data when integrating varied data sources
•  Legacy systems that need to be taken into consideration
•  Domain expertise in silos – the ability to work across domains is needed to extract full value from ‘the new oil’

Opportunities
•  Data Lake concepts and technology allow data to be stored centrally and curated in a meaningful way
•  Comprehensive, single view of the truth:
   –  Integration of data assets leads to more informed, powerful models
   –  Many “first-of-its-kind” models become possible for the business
   –  These models enhance decision making by providing better predictions
•  Real-time application of predictive models can speed up responses to events


Significant Use Cases

•  Predictive Maintenance
   –  Model equipment function and failure
   –  Optimize maintenance schedules
   –  Real-time alerts based on predictive models
•  Seismic Imaging and Inversion Analysis
•  Reservoir Simulation and Management
•  Production Optimization
•  Supply Chain Optimization
•  Energy Trading


Predictive Analytics for Drilling Operations Predicting Equipment Function and Failure


Predictive Analytics for Drilling Operations

Business Goals
•  Increase efficiency, reduce costs
•  Take steps towards zero unplanned downtime
•  Predict equipment function for maintenance
•  Provide an early warning system for equipment failure
•  Optimize parameters for drilling operations
•  Reduce health, safety and environmental risks

Big Data Sources
•  Sensor data
   –  Surface and down-hole sensors
   –  Measurement While Drilling (MWD)
   –  SCADA data
•  Drill operator data
   –  Operator comments
   –  Activity log / codes
   –  Incident reports / logs
•  And more …

(Presentation sections: Introduction – Data Integration – Feature Building – Modeling & Impact)


Predicting Equipment Function and Failure

•  Business Problem: Predict drilling equipment function and failure – a step towards early warning systems and zero unplanned downtime

•  Motivation: Drilling wells is expensive, and equipment failure during the process adds to the cost. Example: drilling motor damage could account for 35% of rig non-productive time (NPT) and can cost $150,000 per incident [1]

•  Goals:
   –  Predict equipment function and failure → this enables:
      •  Optimization of parameters for efficient drilling
      •  Reducing non-productive drill time (and costs)
      •  Reducing failures
   –  Provide insights into prominent features impacting operation and failure


[1] The American Oil & Gas Reporter, April 2014 Cover Story


The Eightfold Path of Data Science Four Phases and Four Differentiating Factors

Four differentiating factors:
•  Technology Selection – Select the right platform and the right set of tools for solving the problem at hand
•  Iterative Approach – Perform each phase in an agile manner, team up with domain experts and SMEs, and iterate as required
•  Creativity – Take the opportunity to innovate at every phase
•  Building a Narrative – Create a fact-based narrative that clearly communicates insights to stakeholders

Four phases:
•  Phase 1: Problem Formulation – Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders
•  Phase 2: Data Step – Build the right feature set, making full use of the volume, variety and velocity of all available data
•  Phase 3: Modeling Step – This is where you move from answering what, where and when to answering why and what if
•  Phase 4: Application – Create a framework for integrating the model with decision-making processes and taking action using the Internet of Things



Technology Selection

•  Platform for all phases of the analytics cycle
•  Support development of complex and extensible predictive models to predict equipment function and failure
•  Provide a framework for integrating data from multiple sources across data warehouses and rig operators
•  Ability to analyze both structured and unstructured data in a unified manner. For instance:
   –  Support fast computation of hundreds of features over time windows within 100s of millions (or billions / trillions) of records of time-series data
   –  Natural language processing pipeline for analysis of operator comments to identify failures from unstructured text
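As a minimal illustration of the last point, the sketch below flags failure mentions in free-text operator comments with a keyword pattern. The comments and keyword list are hypothetical, not Pivotal's actual pipeline; a production NLP pipeline would add tokenization, normalization and a trained classifier on top of this kind of first pass.

```python
import re

# Hypothetical failure vocabulary for drilling operator comments.
FAILURE_PATTERN = re.compile(
    r"\b(fail(?:ed|ure)?|broke(?:n)?|stall(?:ed)?|wash\s*out|twist\s*off)\b",
    re.IGNORECASE)

comments = [
    "Motor stalled at 9200 ft, tripping out",            # hypothetical entries
    "Routine connection, no issues",
    "Suspected washout in drill string, pressure drop",
]

# Keep only comments that mention a failure-related term.
flagged = [c for c in comments if FAILURE_PATTERN.search(c)]
print(flagged)
```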

Technologies: PL/Python, PL/R


Predictive Analytics for Drilling Operations

•  Consider two examples:
   –  Predicting drill rate-of-penetration (ROP)
   –  Predicting drilling equipment failure
•  Primary data sources for these examples:
   –  Drill rig sensor data: Depth, Rate of Penetration (ROP), RPM, Torque, Weight on Bit, etc. (billions of records)
   –  Operator data: drill bit details, failure details, component details, etc. (100s of thousands of records)

(Workflow: Data Integration → Feature Building → Modeling)



Comprehensive Data Integration Framework

•  Need a comprehensive framework for data integration at scale:
   –  Data cleansing – removing NULLs and outliers, missing-value imputation techniques
   –  Standardizing columns that are used to join across multiple data sources

(Diagram: drill rig sensor data and operator data integrated into a single view)


Data Integration Challenges

•  Data sources do not use consistent entries in the features / columns that link them (join columns) – e.g. well names
•  Manually entered data (some operator data) is prone to entry errors
   –  Hitting several keys
   –  Key strokes not appearing (e.g. missing a character / digit)
•  Invalid values for sensor measurements
   –  Invalid values could be placeholders for sensor malfunction or non-recording time
   –  Duration of invalid values can range from one-off occurrences to several hours


Data Integration Challenges

•  Standardization of join column entries across data sources
•  Problem: Data sources do not use consistent entries in join columns
•  Resolution options: derive a canonical representation for the columns
   –  Regular expression transformations
   –  String edit distance computations → closest-distance matches
   –  Plus manual correction
•  Include standardized entries in each table

Data Source #1      Data Source #2
A B C               A-B-C
PARENT-TEACHER      PARENT-TEACHERS
GRANDFATHER CLOK    GRANDFATHER_CLOCK
KOALA 123           KOALA 122
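The resolution steps above (regex-style normalization, then closest edit-distance match, with unmatched rows left for manual correction) can be sketched in a few lines of Python. The canonical names below mirror the slide's examples; `difflib.get_close_matches` stands in for the edit-distance computation a real pipeline would run in-database.

```python
import difflib

# Canonical join-key values from data source #2 (the slide's examples).
canonical = ["A-B-C", "PARENT-TEACHERS", "GRANDFATHER_CLOCK", "KOALA 122"]

def normalize(name):
    """Regex-transformation stand-in: uppercase and collapse separators."""
    return name.upper().replace("_", " ").replace("-", " ").strip()

def closest_match(raw, candidates, cutoff=0.6):
    """Return the candidate closest in edit distance (highest similarity
    ratio), or None if nothing clears the cutoff -- those rows go to
    manual correction."""
    norm_candidates = [normalize(c) for c in candidates]
    matches = difflib.get_close_matches(normalize(raw), norm_candidates,
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    # Map the normalized match back to its original canonical form.
    return candidates[norm_candidates.index(matches[0])]

print(closest_match("GRANDFATHER CLOK", canonical))   # GRANDFATHER_CLOCK
print(closest_match("KOALA 123", canonical))          # KOALA 122
```

The cutoff trades false matches against manual-correction workload; it would be tuned per join column.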


Data Integration Challenges

•  Problem: Manually entered data is prone to operator entry errors
   –  Hitting several keys
   –  Key strokes not appearing (e.g. missing a digit / character)
•  Resolution options:
   –  Ignore rows if depth does not lie between the previous and next values
   –  Replace the value with an interpolated result

Timestamp              Depth
2014-09-01 00:06:00    13504
2014-09-02 00:05:00    140068
2014-09-03 00:07:00    14754
2014-09-04 00:11:00    15388
2014-09-05 00:16:00    16100
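On the slide's example data, the second resolution option can be sketched as follows: a depth reading is flagged when it does not lie between its neighbours (depth increases monotonically while drilling) and is replaced by a simple interpolation. This is a minimal sketch of the idea, not a production cleansing routine.

```python
rows = [
    ("2014-09-01 00:06:00", 13504),
    ("2014-09-02 00:05:00", 140068),   # entry error: extra digit typed
    ("2014-09-03 00:07:00", 14754),
    ("2014-09-04 00:11:00", 15388),
    ("2014-09-05 00:16:00", 16100),
]

def clean_depths(rows):
    depths = [d for _, d in rows]
    cleaned = depths[:]
    for i in range(1, len(depths) - 1):
        # A valid reading must lie between the (already cleaned)
        # previous value and the next raw value.
        prev, cur, nxt = cleaned[i - 1], depths[i], depths[i + 1]
        if not (prev <= cur <= nxt):
            cleaned[i] = (prev + nxt) / 2   # replace with interpolated result
    return cleaned

print(clean_depths(rows))   # 140068 becomes (13504 + 14754) / 2 = 14129.0
```

Interpolating against timestamps rather than taking a midpoint would be the natural refinement when sampling intervals vary.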



Understanding Correlations in Data

•  Summary statistics and correlations between variables need to be computed at scale for 1000s of variable combinations
•  Able to leverage MADlib’s parallel implementation of:
   –  the ‘summary’ function
   –  Pearson’s correlation
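For reference, the statistic MADlib parallelizes is ordinary Pearson correlation; here is a plain-Python version for a single variable pair, with hypothetical weight-on-bit and torque readings standing in for real sensor columns.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for one pair of variables --
    MADlib computes this in parallel across many pairs in-database."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sensor readings: weight-on-bit vs. torque.
wob    = [10.0, 12.0, 14.0, 16.0, 18.0]
torque = [ 5.1,  5.9,  7.2,  7.8,  9.0]
print(round(pearson(wob, torque), 3))   # 0.995
```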

Introduction Data Integration Feature Building Modeling & Impact


Big Data Machine Learning in SQL


Predictive Modeling Library (MADlib, http://madlib.net/)

•  Linear Systems
   –  Sparse and dense solvers
•  Matrix Factorization
   –  Singular Value Decomposition (SVD)
   –  Low-rank factorization
•  Generalized Linear Models
   –  Linear Regression
   –  Logistic Regression
   –  Multinomial Logistic Regression
   –  Cox Proportional Hazards Regression
   –  Elastic Net Regularization
   –  Sandwich Estimators (Huber-White, clustered, marginal effects)
•  Machine Learning Algorithms
   –  Principal Component Analysis (PCA)
   –  Association Rules (affinity analysis, market basket)
   –  Topic Modeling (parallel LDA)
   –  Decision Trees
   –  Ensemble Learners (Random Forests)
   –  Support Vector Machines
   –  Conditional Random Field (CRF)
   –  Clustering (K-means)
   –  Cross Validation
•  Descriptive Statistics
   –  Sketch-based estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
   –  Correlation, summary
•  Support Modules
   –  Array operations, sparse vectors, random sampling, probability functions, PMML export


Complex Feature Set Across Multiple Data Sources

•  Often useful to create features from time series variables rather than just using them raw
•  One such class of features is statistical features created on moving windows of time series data
•  Fast computation of features is possible on Pivotal’s MPP platform, leveraging window functions in native SQL (and MADlib or PL/R if needed for added functionality)
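The moving-window statistical features described above can be sketched in plain Python; the ROP values are hypothetical, and at scale this computation would run as SQL window functions rather than a Python loop.

```python
from collections import deque
import statistics

def window_features(values, width):
    """Statistical features (mean, stdev, range) over a moving window
    of a time series -- one feature row per fully populated window."""
    win = deque(maxlen=width)
    feats = []
    for v in values:
        win.append(v)
        if len(win) == width:
            feats.append({
                "mean": sum(win) / width,
                "stdev": statistics.stdev(win),
                "range": max(win) - min(win),
            })
    return feats

rop = [30.0, 32.0, 31.0, 29.0, 35.0, 34.0]   # hypothetical ROP readings
for f in window_features(rop, 3):
    print(f)
```

Median, skewness and the rest of the slide's feature list follow the same pattern, one aggregate per window.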



Complex Feature Set Across Multiple Data Sources

•  Drill rig sensor data: Depth, Rate of Penetration, Torque, Weight on Bit, RPM, …
•  Operator data: drill bit details, component details, failure events, …
•  Features on time windows: mean, median, standard deviation, range, skewness, …
•  Combined into the final set of features on time windows

Leverage GPDB / HAWQ (+ MADlib and PL/R if needed) for fast computation of hundreds of features over time windows within billions of rows of time-series data


Working with Time Series Data

•  Pivotal GPDB has built-in support for dealing with time series data
   –  SQL window functions: e.g. lead, lag, custom windows
   –  More details in Pivotal’s Time Series Analysis blogs: http://blog.pivotal.io/tag/time-series-analysis
•  Aggregations
   –  By time slice
   –  By custom window
   –  Example aggregates: avg, median, variance
•  Mapping: what time slice does an observation at a particular timestamp map to?
•  Pattern detection
•  Rolling averages; gap filling and interpolation; running accumulations
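The mapping operation above, assigning each observation to its time slice, can be sketched in plain Python; in SQL this kind of bucketing is typically done with `date_trunc` and interval arithmetic.

```python
from datetime import datetime

def to_time_slice(ts, slice_minutes=15):
    """Map an observation's timestamp to the start of its time slice.
    Assumes slice_minutes divides 60 so slices align within each hour."""
    return ts.replace(minute=(ts.minute // slice_minutes) * slice_minutes,
                      second=0, microsecond=0)

print(to_time_slice(datetime(2014, 9, 1, 0, 6, 0)))                   # 2014-09-01 00:00:00
print(to_time_slice(datetime(2014, 9, 4, 0, 11, 0), slice_minutes=5)) # 2014-09-04 00:10:00
```

Grouping by the slice value then gives the per-slice aggregates (avg, median, variance) from the slide.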


Predictive Analytics for Drilling Operations

Predict function
•  Predict Rate-of-Penetration
   –  Linear Regression
   –  Elastic Net Regularized Regression (Gaussian)
   –  Support Vector Machines

Predict failure
•  Predict occurrence of equipment failure in a chosen future time window
   –  Logistic Regression
   –  Elastic Net Regularized Regression (Binomial)
   –  Support Vector Machines
•  Predict remaining life of equipment
   –  Cox Proportional Hazards Regression

Why Elastic Net Regularized Regression?
•  Fits the problem statements
•  Ease of interpretation, scoring and operationalization
•  Provides a probability of failure in the binomial case
•  Leveraged MADlib’s in-database parallel implementation
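Scoring a fitted binomial model of this kind reduces to a logistic function of the linear predictor, which is what makes it easy to operationalize. The coefficients and window features below are hypothetical stand-ins for MADlib's fitted output, shown only to illustrate the scoring step.

```python
import math

def failure_probability(features, weights, intercept):
    """Probability of failure in the chosen future time window for a
    fitted binomial (logistic) model: logistic(intercept + w . x)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients on standardized window features
# (mean torque, torque stdev, ROP range).
weights = [0.8, 1.5, -0.4]
intercept = -2.0

p = failure_probability([0.2, 1.1, 0.5], weights, intercept)
print(round(p, 3))
```

A threshold on this probability (chosen against the cost of false alarms vs. missed failures) then drives the early-warning alerts.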


Background on Elastic Net Regularization

•  Elastic Net regularization seeks to find a weight vector w that, for any given training example set, minimizes:

      L(w) + λ [ α ‖w‖₁ + ((1 − α) / 2) ‖w‖₂² ]

   where α ∈ [0, 1], λ ≥ 0 and L(w) is the linear / logistic objective function
•  If α = 0 → Ridge regularization
•  If α = 1 → LASSO regularization

Ordinary Least Squares
   –  Advantages: unbiased estimators; significance levels for coefficients
   –  Limitations: highly affected by multi-collinearity; requires more records than predictors; no feature selection

Elastic Net Regularization
   –  Advantages: biased, but towards smaller MSE; fewer limitations on the number of predictors; better at handling multi-collinearity; feature selection
   –  Limitations: multiple parameters to tune; no significance levels for coefficients

Available in MADlib: http://doc.madlib.net/latest/group__grp__elasticnet.html
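The shrinkage effect of the penalty can be seen numerically in the α = 0 (Ridge) special case. For a one-feature linear model with no intercept, minimizing ½ Σᵢ (yᵢ − w·xᵢ)² + (λ/2) w² has the closed form w = Σ xy / (Σ x² + λ), so larger λ pulls the coefficient towards zero. This toy example is illustrative only; MADlib solves the general problem iteratively.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form Ridge (alpha = 0) solution for one feature, no
    intercept: w = sum(x*y) / (sum(x^2) + lambda)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]           # exact relationship y = 2x
print(ridge_1d(xs, ys, 0.0))   # 2.0  (ordinary least squares fit)
print(ridge_1d(xs, ys, 14.0))  # 1.0  (heavy regularization shrinks w)
```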



Predictive Analytics for Drilling Operations

(Figures: actual vs. predicted ROP time series; ROC curve for the equipment failure classifier)


Data Science Platform and Technology Summary

(Diagram: Pivotal platform, PL/Python and PL/R in-database languages, and visualization tools)


One step closer to zero unplanned downtime …

•  Ability to fully utilize big data – volume, variety and velocity
•  Comprehensive data integration framework for multiple complex data sources
•  Learn and implement best practices for:
   –  Data governance policy
   –  Data capture techniques, flow, and curation
   –  Platform and toolset for the data fabric
•  Build and operationalize complex and extensible predictive models

Business Impacts
•  Improve efficiency, reduce costs and risks
•  Gain competitive advantage by leveraging the full big data analytics pipeline

A NEW PLATFORM FOR A NEW ERA