
Predictive Maintenance with Sensors in Utilities

Tina Zhang

Agenda

Sensors in the IoT era

Predictive Maintenance

Predictive Maintenance with sensor data in the Utilities industry

Architecture for a real-time distributed sensor data collection, analysis, visualization, and storage system

Modeling imprecise sensor readings

Sensors in the IoT era

Sensors

Sensors are a bridge between the physical world and the internet. They will play an ever-increasing role in just about every field imaginable, powering the “Internet of Things”.

Potential Uses of Sensor Data

Sensors can be used to monitor machines, infrastructure, and the environment: ventilation equipment, bridges, energy meters, airplane engines, temperature, humidity, etc.

One use of this data is for predictive maintenance, to repair or replace the items before they break.

Three classes of maintenance

Corrective maintenance (CM) is simply fixing things after they suffer a breakdown; it is also called reactive maintenance.

Preventive maintenance (PM) is about replacing or replenishing consumables at scheduled intervals.

Predictive maintenance (PdM), or condition-based maintenance, focuses on detecting failures before they occur.

PdM incorporates inspections of the system at predetermined intervals to determine system condition.

Depending on the outcome of each inspection, either a preventive maintenance activity or no maintenance activity is performed.

Fault Detection Method in Predictive Maintenance

PdM employs many fault or defect detection methods which compare current sensor or inspection data with some reference data.

If the reference data are the outcome of a representation of the real system, the fault detection method is called model-based.

Mainly, two distinct kinds of models are used: analytical models and machine learning models.

Analytical models are limited to representing linear characteristics. Modern machine learning techniques, however, such as neural networks, Bayesian (belief) networks, and support vector machines, are capable of capturing nonlinearities and complex interdependencies. Even a relatively "simple" machine learning tool such as a decision tree can allow for nonlinearities.

Machine Learning in Predictive Maintenance

Data mining and machine learning allow systematic classification of the patterns contained in data sets.

Patterns of data (“attributes”) containing information about the condition of physical assets can be represented by “instances” with an associated failure mode, or “class”.

Predictions can be made based on patterns in real time data.

Decision tree model example

Here is an example of building a decision tree model where the strategy is to either perform maintenance or not, based on the outcomes of several independent measurements (variables).
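As a minimal sketch of such a model, the following uses Spark MLlib's DecisionTree, which the architecture later in this talk already includes; the measurements, labels, and values are made up for illustration and are not from the talk:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    val sc = new SparkContext(new SparkConf().setAppName("PdM-DecisionTree"))

    // Hypothetical training data: label 1.0 = "perform maintenance",
    // 0.0 = "no maintenance"; features are independent measurements,
    // e.g. temperature, vibration, and load.
    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(105.0, 0.9, 260.0)),
      LabeledPoint(0.0, Vectors.dense(70.0, 0.1, 180.0)),
      LabeledPoint(1.0, Vectors.dense(98.0, 0.8, 250.0)),
      LabeledPoint(0.0, Vectors.dense(65.0, 0.2, 170.0))))

    // Binary classifier over purely numeric features (no categorical features).
    val model = DecisionTree.trainClassifier(training, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
      maxDepth = 4, maxBins = 32)

    // Maintenance decision for a new set of measurements.
    println(model.predict(Vectors.dense(101.0, 0.85, 255.0)) == 1.0)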

Naïve Bayes example
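A minimal Naïve Bayes sketch in the same spirit, assuming Spark MLlib's NaiveBayes (whose multinomial variant requires non-negative features) and an existing SparkContext sc, e.g. in spark-shell; all values below are invented:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Label 1.0 = failure-prone, 0.0 = healthy; features are non-negative
    // counts or levels, as multinomial Naive Bayes expects.
    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(3.0, 1.0, 2.0)),
      LabeledPoint(0.0, Vectors.dense(0.0, 0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(2.0, 2.0, 2.0)),
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0))))

    val model = NaiveBayes.train(training, lambda = 1.0) // additive smoothing

    println(model.predict(Vectors.dense(2.0, 1.0, 2.0))) // 1.0 => maintain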

Predictive Maintenance in the Utility Industry

By analyzing the patterns of circumstances surrounding past equipment failures and power outages and by accessing multiple data sources including sensors in real time, utility companies can predict and prevent future failures.

Predictive Maintenance allows utility companies to not only prepare for known consumption peaks, such as those caused by extreme weather conditions, but also react quickly to unexpected problems when the warning signs appear.

Utility companies can spot problems early on (see the sketch after this list):

When some of a sensor's values are abnormal;

When the number of abnormal values exceeds a given threshold;

Or when the values of a given sensor are significantly different from the values of its neighbors.
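A minimal sketch of these three rules in Scala; the Reading type, the normal range, and both thresholds are assumptions made for illustration:

    case class Reading(sensorId: String, value: Double)

    val normalRange = (0.0, 250.0)    // assumed per-sensor normal range
    val abnormalCountThreshold = 5    // assumed count threshold
    val neighborDeviationLimit = 30.0 // assumed allowed deviation from neighbors

    // Rule 1: a single value outside the normal range.
    def isAbnormal(r: Reading): Boolean =
      r.value < normalRange._1 || r.value > normalRange._2

    // Rule 2: the number of abnormal values exceeds a given threshold.
    def tooManyAbnormal(rs: Seq[Reading]): Boolean =
      rs.count(isAbnormal) > abnormalCountThreshold

    // Rule 3: a sensor's value differs significantly from its neighbors' mean.
    def deviatesFromNeighbors(r: Reading, neighbors: Seq[Reading]): Boolean = {
      val mean = neighbors.map(_.value).sum / neighbors.size
      math.abs(r.value - mean) > neighborDeviationLimit
    }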

Big and fast sensor data requires a different architecture

Due to the rapid advances in sensor technologies, the number of sensors and the amount of sensor data have been increasing at incredible rates.

Therefore, the scalability, availability, and speed requirements of sensor data collection, storage, and analysis solutions call for new technologies that can efficiently distribute data over many servers and dynamically add new attributes to data records.

Architecture for a real-time distributed sensor data collection, analysis, visualization, and storage system

The new architecture must be able to scale to support a large number of sensors and big data sizes.

It must be able to automatically gather and analyze a large number of sensor measurements over long periods of time, and also to deploy statistics and machine learning to execute computationally complex data analysis algorithms with many influencing factors.

Open source big data frameworks can be utilized for large-scale sensor data analysis requirements.

[Architecture diagram: data sources (socket, shared files, user input, web services) feed Kafka; Spark Streaming with Spark SQL and MLlib consumes the streams; results are stored in HDFS, HBase, and Hive; analysis results flow back through Kafka to a Web UI.]

An example use case

Display all the transformers located in Houston, Texas on the map, and when a transformer icon is clicked, display in an info window the following details for each transformer: Transformer ID, Age, Designed Capacity, exact location, and the current Load reading.

If a transformer is of type "Pole-Top" with rating 230 and age > 20, its load exceeds its designed capacity by more than 10 kVA, and the air temperature at its location is above 100 degrees, we highlight the transformer icon in red.

When the user clicks on a specific transformer, we populate the details for that transformer, including its Load reading. Both the transformer icon color and the transformer Load reading (shown in red or green) continuously update every second, in real time.
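The highlight rule above translates directly into a predicate. This is a sketch with a hypothetical Transformer type, not the talk's actual data model:

    // Hypothetical record for one transformer and its latest readings.
    case class Transformer(id: String, kind: String, rating: Int, age: Int,
                           designedCapacityKva: Double, loadKva: Double,
                           airTemperatureF: Double)

    // Red icon when all of the alert conditions in the use case hold.
    def highlightRed(t: Transformer): Boolean =
      t.kind == "Pole-Top" &&
      t.rating == 230 &&
      t.age > 20 &&
      t.loadKva - t.designedCapacityKva > 10.0 &&
      t.airTemperatureF > 100.0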

Why Spark?

Spark presents a new distributed memory abstraction, called resilient distributed datasets (RDDs), which provides a data structure for in-memory computations on large clusters.

RDDs achieve fault tolerance: if a given task fails for some reason, such as a hardware failure or erroneous user code, lost data can be recovered and recomputed automatically on the remaining nodes.

Spark has a high-level Java API for working with distributed data, similar to Hadoop's, and presents an in-memory processing solution.

We run Spark on Hortonworks HDP 2.2 in YARN mode, and have also made Spark 1.3.1 work on HDP 2.2 (whose default Spark version is 1.2).

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams.

It offers an additional abstraction called discretized streams, or DStreams. A DStream is a continuous sequence of RDDs representing a stream of data.

DStreams can be created from live incoming data or by transforming other DStreams.  

Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.

Spark 1.3 offers Streaming K-means Clustering and Streaming Linear Regression
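A minimal sketch of a DStream pipeline feeding Streaming K-means. It assumes a socket source on localhost:9999 emitting comma-separated two-dimensional readings; the source, k = 3, and the batch interval are arbitrary choices for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("SensorStreamingKMeans"), Seconds(1))

    // DStream of sensor readings parsed into feature vectors.
    val readings = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    // Streaming K-means: the model is updated with every incoming batch.
    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)      // weight of old data vs. new batches
      .setRandomCenters(2, 0.0) // 2-dimensional readings

    model.trainOn(readings)
    model.predictOn(readings).print() // cluster id per incoming reading

    ssc.start()
    ssc.awaitTermination()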

Spark SQL

Spark SQL is Spark's module for working with structured data.

The foundation of Spark SQL is a type of RDD, called SchemaRDD (pre-V1.3) or DataFrame (V1.3), an object similar to a table in a relational database.

Spark SQL can run queries against mixed types of data.
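A minimal DataFrame sketch against the Spark 1.3 API, assuming an existing SparkContext sc; the AbnormalLoad schema and sample rows are invented to mirror the HBase table described later:

    import org.apache.spark.sql.SQLContext

    // Hypothetical schema for abnormal-load records.
    case class AbnormalLoad(transformerId: String, ts: Long,
                            overload: Double, airTemperature: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Turn an RDD of case classes into a DataFrame (a SchemaRDD before V1.3).
    val df = sc.parallelize(Seq(
      AbnormalLoad("T-1001", 1431000000L, 25.0, 106.0),
      AbnormalLoad("T-1002", 1431000060L, 12.0, 98.0))).toDF()

    df.registerTempTable("abnormal_load")

    // SQL over the in-memory table, in the same shape as the Hive query below.
    sqlContext.sql(
      """SELECT transformerId, ts, overload
        |FROM abnormal_load
        |WHERE overload > 20 AND airTemperature > 105
        |ORDER BY ts DESC""".stripMargin).show()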

Spark piece in detail:

Sensor Data Storage – HBase

NoSQL databases provide efficient alternatives for storing large amounts of sensor data. In this example, we will use HBase, a NoSQL key/value store that runs on top of HDFS.

Unlike Hive, HBase operations run in real time against its database rather than as batch MapReduce jobs.

Each key/value pair in HBase is defined as a cell, and each key consists of row-key, column family, column, and time-stamp. A row in HBase is a grouping of key/value mappings identified by the row-key.

In our case, we’ll store the anomalous sensor data in a table “abnormal_load” in the format of:

key, Transformer_ID, Timestamp, Load, Overload, Location, Air_Temperature
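A minimal sketch of writing one such row, assuming the classic HBase client API (as shipped with HDP 2.2); the column family name "d", the row-key layout, and all values are assumptions:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "abnormal_load")

    // Hypothetical row-key: transformer id + timestamp.
    val put = new Put(Bytes.toBytes("T-1001_1431000000"))
    put.add(Bytes.toBytes("d"), Bytes.toBytes("Transformer_ID"), Bytes.toBytes("T-1001"))
    put.add(Bytes.toBytes("d"), Bytes.toBytes("Load"), Bytes.toBytes("260.0"))
    put.add(Bytes.toBytes("d"), Bytes.toBytes("Overload"), Bytes.toBytes("25.0"))
    put.add(Bytes.toBytes("d"), Bytes.toBytes("Location"), Bytes.toBytes("Houston,TX"))
    put.add(Bytes.toBytes("d"), Bytes.toBytes("Air_Temperature"), Bytes.toBytes("106.0"))

    table.put(put)
    table.close()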

We can query our HBase table by creating an external Hive table, linking the HBase table to the Hive table, and then running HiveQL:

SELECT Transformer_ID, Timestamp, Overload
FROM spark_poc.abnormal_load
WHERE Overload > 20 AND Air_Temperature > 105
ORDER BY Timestamp DESC;

Why send all source data to Kafka

In the diagrams on the next two slides:

The first shows what happens without Kafka.

Since each source needs its own connection to each target, the setup is difficult to maintain and can cause lots of programming and security issues.

The second diagram uses Kafka, so all sources send data to Kafka.

We only need to develop one interface/program to get all the different data into Kafka. Each kind of data is one topic.

And from the consumer side, a consumer only deals with Kafka. When we add a new source or a new consumer, it does not affect any existing source or target at all. Thus the pipeline is easy to maintain, clean, secure, and scalable.

[Diagram: data pipelines without Kafka — every source holds a direct connection to every target (HBase, Hive, HDFS, DB).]

[Diagram: data pipelines with Kafka — all sources publish to Kafka, and all targets (HBase, Hive, HDFS, DB) consume from it.]
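As a minimal sketch of that single producer-side interface, assuming Kafka's Java producer API; the broker address, topic name, and message format are illustrative:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // One topic per kind of data, e.g. raw transformer load readings.
    producer.send(new ProducerRecord[String, String](
      "transformer_load", "T-1001", "1431000000,260.0,106.0"))

    producer.close()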

Why write the analysis result data stream to Kafka before publishing it to the web UI

If we sent the data stream (the analysis results) to a queue on the web server and then used a WebSocket to push it to the browser, the queue would be very tedious to maintain.

Kafka comes in handy as a distributed, persistent message queue which supports multiple concurrent writers, as well as multiple groups of readers that maintain their own offsets within the queue (which Kafka calls a “topic”).

This enables us to build applications that consume data from a topic at their own pace without disrupting access from other groups of readers.
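A sketch of one such reader group, assuming Kafka's Java consumer API; the group id and topic name are invented:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "web-ui-readers") // each group tracks its own offsets
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("analysis_results"))

    while (true) {
      // Consume at this group's own pace without disturbing other groups.
      for (record <- consumer.poll(1000).asScala) {
        println(s"${record.key}: ${record.value}") // e.g. push to the browser here
      }
    }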

Sensor Data Analysis

To analyze data on the aforementioned architecture we use distributed machine-learning algorithms in Apache Mahout and MLlib by Apache Spark.

MLlib is a Spark component: a fast and flexible iterative computing framework for implementing machine-learning algorithms, including classification, clustering, linear regression, collaborative filtering, and decomposition. It aims to create and analyze large-scale data hosted in memory.

We use the K-means algorithm for clustering sensor data and finding the anomalies. K-means is a very popular unsupervised learning algorithm. It aims to assign objects to groups; all of the objects to be grouped need to be represented as numerical features. The technique iteratively assigns points to clusters, using distance as a similarity factor, until no point changes its cluster.

We also use Spark’s Streaming K-means.
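A minimal sketch of distance-based anomaly flagging with MLlib's batch K-means, assuming an existing SparkContext sc; the readings, k, and the distance threshold are all invented:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Historical (mostly normal) readings as numerical feature vectors.
    val history = sc.parallelize(Seq(
      Vectors.dense(70.0, 180.0), Vectors.dense(72.0, 185.0),
      Vectors.dense(68.0, 175.0), Vectors.dense(71.0, 182.0)))

    val model = KMeans.train(history, k = 2, maxIterations = 20)

    // Distance from a reading to its nearest learned cluster center.
    def distanceToCenter(v: Vector): Double =
      math.sqrt(Vectors.sqdist(v, model.clusterCenters(model.predict(v))))

    // Flag fresh readings that fall far from every learned cluster.
    val fresh = Seq(Vectors.dense(71.0, 181.0), Vectors.dense(110.0, 290.0))
    fresh.filter(distanceToCenter(_) > 25.0).foreach(println) // prints the outlier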

Modeling imprecise sensor readings

Sensor readings are inherently imprecise because of the noise introduced by the equipment itself.

Two main approaches have emerged for modeling uncertain data series:

In the first, a Probability Density Function (PDF) over the uncertain values is estimated by using some a priori knowledge.

In the second, the uncertain data distribution is summarized by repeated measurements (i.e., samples).

Dynamic probabilistic models over the sensor readings

The KEN technique builds and maintains dynamic probabilistic models over the sensor readings, taking into account the spatio-temporal correlations that exist in the sensor readings.

These models organize the sensor nodes in non-overlapping groups, and are shared by the sensor nodes and the sink.

The expected values of the probabilistic models are the values that are recorded by the sink. If the sensors observe that these values are more than εVT away from the sensed values, then a model update is triggered.
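That trigger condition reduces to a one-line check; the epsilon value and the readings below are illustrative, not from the KEN paper:

    // The sink records the model's expected value; a node triggers a model
    // update when its sensed value drifts more than epsilon away from it.
    val epsilon = 2.0 // stands in for the εVT error bound

    def needsModelUpdate(expected: Double, sensed: Double): Boolean =
      math.abs(expected - sensed) > epsilon

    println(needsModelUpdate(25.0, 28.5)) // true -> trigger a model update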

The PAQ and SAF methods employ linear regression and autoregressive models, respectively, for modeling the measurements produced by the nodes, with SAF leading to a more accurate model than PAQ.