
DEGREE PROJECT IN MECHANICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

Machine Learning Based Fault Prediction for Real-time Scheduling on Shop Floor

WENDA WU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT


Master in Science, TSCRM
Date: August 6, 2018
Supervisor: Wei Ji
Examiner: Lihui Wang
Swedish title: Maskininlärningsbaserad felprognos för realtidsplanering på butiksgolv
EES / ITM


Abstract

Nowadays, scheduling on a shop floor focuses only on the availability of resources, and potential faults cannot be predicted.

Big data analytics based fault prediction has been proposed for scheduling, which requires real-time decision making. To select a proper machine learning algorithm for real-time scheduling, this paper first proposes a data generation method in terms of pattern complexity and scale. Three levels of depth, an index of data complexity, and three levels of data attributes, an index of data scale, are used to obtain the data sets. Based on those data sets, ten commonly used machine learning algorithms are trained, with their parameters adjusted to achieve high accuracy. Three indexes from the test results, namely training time, testing time and prediction accuracy, are used to evaluate the algorithms.

The results of the tests show that when working with data of simple structure and small scale, typical machine learning methods like the Naive Bayes classifier and SVM are good enough, with fast training and high accuracy. When dealing with complex data on a large scale, deep learning methods like CNN and DBN outperform all other methods.


Summary (Sammanfattning)

Nowadays, scheduling on a shop floor focuses only on the availability of resources, and potential faults cannot be predicted.

Big data analytics based fault prediction has been proposed for scheduling, which requires real-time decision making. To select a proper machine learning algorithm for real-time scheduling, this paper first proposes a data generation method in terms of pattern complexity and scale. Based on these data sets, ten commonly used machine learning algorithms are trained, with their parameters adjusted to achieve high accuracy. Three indexes from the test results, namely training time, testing time and prediction accuracy, are used to evaluate the algorithms.

The results of the tests show that typical machine learning methods like the Naive Bayes classifier and SVM are good enough, with fast training and high accuracy, when working with data of simple structure and small scale. When dealing with complex data on a large scale, deep learning methods like CNN and DBN outperform all other methods.

Contents

1 Introduction
    1.1 Background
    1.2 Fault prediction for scheduling on a shop floor
    1.3 Research Motivation

2 Selected Algorithms
    2.1 Typical Machine Learning Methods
        2.1.1 Logistic Regression
        2.1.2 Naive Bayes Classifier
        2.1.3 Support Vector Machine
        2.1.4 Ensemble Learning
    2.2 Artificial Neural Network and its Variations
        2.2.1 MLP
        2.2.2 Radial Basis Function Neural Network
        2.2.3 Hopfield Network
        2.2.4 Convolutional Neural Network
        2.2.5 Deep Belief Network
        2.2.6 Stacked Auto-encoders
    2.3 Algorithm Configuration

3 Evaluation
    3.1 Data Generation
        3.1.1 Data Structure
        3.1.2 Data Attributes
        3.1.3 Data Metrics
    3.2 Evaluation Index
    3.3 Comparison on Depths
    3.4 Comparison on Scales
    3.5 Discussion

4 Conclusion and Future Work
    4.1 Conclusion
    4.2 Future Work

A Results of All Tests

Chapter 1

Introduction

1.1 Background

Modern manufacturing industry, unlike its predecessor, requires both precision and swiftness. This is even more so in the case of fault prediction for shop floor scheduling. Meanwhile, machine learning techniques have been widely applied in the IT industry to perform prediction. It is well known that corporations like Google, Amazon and Spotify have gained their advantages by exploiting data using machine learning techniques in their recommendation systems and other services. However, little work based on these advanced methods has been done in the manufacturing industry, where the need for precise and swift fault prediction is imperative. Today's scheduling systems lack prediction capability with regard to errors or potential faults of planned or ongoing tasks on the shop floor. How to predict potential faults, and what the error patterns are before scheduling, also remains a big challenge [1]. Being able to predict potential faults before the manufacturing process allows rescheduling with better optimized resource allocation, lower manufacturing cost and higher manufacturing efficiency. Thus this paper focuses on applying modern machine learning methods to the task of real-time fault prediction on the shop floor.

To further introduce the background of this project, the concept of a shop floor is presented. The main goals of a manufacturing shop floor are to finish ordered parts and to deliver the targeted products. The main elements on a manufacturing shop floor are divided into several categories, as shown in Figure 1.1, according to their functions and properties [1].

• Physical operation level: humans, robots, machine tools, forklifts, automatic guided vehicles (AGV), and accessories (fixtures and cutting tools).

• Physical monitor and control level: monitoring equipment, control equipment, and computers (connecting with the controllers of robots or machine tools).

• Product level: materials (including blanks), semi-finished products, and products.

• Cyber level: networks and databases.

Figure 1.1: Main elements on shop floor

1.2 Fault prediction for scheduling on a shop floor

In such a complex shop floor scenario, scheduling has become an important research question to which much effort has been devoted. Research was first conducted on static scheduling, which is particularly beneficial for mass production. Manne [2] developed a method based on a mathematical model for static scheduling. Khoshnevis and Chen [3] integrated process planning and scheduling. Rajkumar et al. [4] applied a greedy randomized adaptive search procedure algorithm to the integration of process planning with production scheduling.

However, since static scheduling lacks the capacity to handle unexpected faults, dynamic scheduling methods were developed. Vieira et al. [5] presented a rescheduling method based on experimental and practical approaches by introducing dynamic scheduling and predictive–reactive scheduling. Iwamura et al. [6] introduced a real-time scheduling approach for holonic manufacturing systems (HMS) based on an estimation of future status. In their method, the future status of an HMS is predicted by a simulation model based on a neural network. Moreover, various algorithms were developed and used for real-time scheduling and especially real-time decision making. Metan et al. [7] proposed a new scheduling system for selecting dispatching rules in real time by constructing a decision tree. As summarized by Ji and Wang [1], dynamism, flexibility and adaptability are the important features of modern scheduling, and a scheduling system should be able to rearrange tasks in case of unexpected events. Nevertheless, job delays and faults in manufacturing remain hard to predict, if not unavoidable.

Ji and Wang [1] present the concept of fault prediction for scheduling shown in Figure 1.2, where planned tasks and ongoing tasks are compared with mined fault patterns to obtain a similarity that serves as a reference for final decision making. Inspired by their work, this project uses a simplified concept of fault prediction. Instead of comparing mined fault patterns with tasks, this project develops machine learning classifiers trained on historical data, then uses planned tasks represented by data attributes as inputs to perform real-time prediction and directly obtain a final result of fault or normal.

Figure 1.2: Concept of fault prediction for scheduling

1.3 Research Motivation

The main goal of this project, in general, is to find the best machine learning algorithm for fault prediction in real-time scheduling on the shop floor. Further, the questions can be described as follows:

• What the data look like. Since real shop floor data are lacking, a hypothetical data set based on the form of real shop floor data is created and presented in this project; details are given in Section 3.1 in Chapter 3.

• Which algorithms to select. A set of algorithms within the sub-field of machine learning is selected and trained on the generated data. The motivation for the selection is to cover as many fields as possible, including regression analysis, Bayesian classifiers, artificial neural networks, and the recently rising field of deep learning; details are described in Sections 2.1 and 2.2 in Chapter 2.

• How to evaluate the selected algorithms. Three particular metrics were selected to evaluate the performance of the algorithms. The details are presented in Section 3.2 in Chapter 3.

Chapter 2

Selected Algorithms

Data and algorithms are the two keys to solving the problem of fault prediction for real-time scheduling on the shop floor. The data structure is demonstrated in terms of the real data. Ten machine learning algorithms are selected, i.e. Logistic Regression, Naive Bayes Classifier, Support Vector Machine (SVM), Ensemble Learning, Multilayer Perceptron (MLP), Radial Basis Function Neural Network (RBFNN), Hopfield Network, Convolutional Neural Network (CNN), Deep Belief Network (DBN) and Stacked Auto-encoders (SAEs).

2.1 Typical Machine Learning Methods

2.1.1 Logistic Regression

Logistic regression, developed by David Cox in 1958, is a regression model that describes data and explains the relationship between a binary categorical dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables. Like all regression analysis, logistic regression performs predictive analysis, leading to its broad applicability, ranging from investigating changes in birthweight for term singleton infants in Scotland [8] to serious injuries associated with motor vehicle crashes [9] and assessing groundwater vulnerability to contamination in Hawaii [10]. In this project, we use logistic regression to analyse shop-floor data for real-time fault prediction. Since the dependent variable has only two categories, 0 for good and 1 for fault, logistic regression is fit for the purpose. Alternative machine learning algorithms are evaluated in this project as well, similar to the work of Westreich et al. [11] on propensity score estimation.

After simplifying the research problem to a binary classification problem, that is, predicting binary-valued labels $y_i \in \{0, 1\}$ (normal or fault) for the $i$'th example $x_i$, we try to learn a function of the form:

$$P(y = 1 \mid x) = h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)} \equiv \sigma(\theta^T x) \tag{2.1}$$

$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = 1 - h_\theta(x) \tag{2.2}$$

where $\sigma(z) \equiv \frac{1}{1 + \exp(-z)}$ is the logistic function that pushes the value of $\theta^T x$ into the range [0, 1] in order to achieve a probabilistic interpretation. The cost function evaluating $h_\theta$ can be described as:

$$J(\theta) = -\sum_i \left( y_i \log(h_\theta(x_i)) + (1 - y_i) \log(1 - h_\theta(x_i)) \right) \tag{2.3}$$

With a cost function measuring the performance of a given hypothesis $h_\theta$, the training process finds the best choice of $\theta$ by minimizing $J(\theta)$. To minimize the cost function, its gradient is needed, which can be expressed as:

$$\nabla_\theta J(\theta) = \sum_i x_i \left( h_\theta(x_i) - y_i \right) \tag{2.4}$$
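Equations (2.1), (2.3) and (2.4) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the configuration used in this project: the toy data, the learning rate `lr`, the step count, and the use of plain batch gradient descent are all chosen here for the example.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """h_theta(x) = sigma(theta^T x), as in Eq. (2.1)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    """Cross-entropy cost J(theta), as in Eq. (2.3)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total -= yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return total

def gradient(theta, X, y):
    """Gradient of J(theta), as in Eq. (2.4): sum_i x_i (h_theta(x_i) - y_i)."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict_proba(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += xij * err
    return grad

def fit(X, y, lr=0.1, steps=500):
    """Batch gradient descent on J(theta); lr and steps are assumed values."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(theta, X, y)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta

# Hypothetical toy data: a constant 1 (bias term) followed by one attribute.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]  # 0 = normal, 1 = fault
theta = fit(X, y)
```

After fitting, `predict_proba(theta, x)` gives the estimated fault probability for a new attribute vector `x`; thresholding it at 0.5 yields the fault/normal decision.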

2.1.2 Naive Bayes Classifier

Speaking of probabilistic classifiers in machine learning, it is natural to think of naive Bayes classifiers, a group of simple classifiers developed by using Bayes' theorem under the assumption of strong independence between the features. Since the early 1950s, naive Bayes has been extensively studied in various fields including text categorization, automatic diagnostics, etc. For example, Singh et al. [12] developed a naive Bayes classifier for Hindi word disambiguation, Zhen et al. [13] combined a vessel trajectory clustering technique with a naive Bayes classifier for maritime anomaly detection, and Zhang et al. [14] even applied it to developmental toxicity assessment. Besides using the naive Bayes method to solve different problems, research has been conducted to improve the method itself as well. For example, Netti and Radhika [15] proposed a novel method to minimize the accuracy loss brought by the assumption of independence in the naive Bayes classifier. Yan et al. [16] improved the naive Bayes classifier by dividing its decision regions. Moreover, Tsangaratos and Ilia [17] compared logistic regression and the naive Bayes classifier in landslide susceptibility assessment and concluded that the naive Bayes classifier outperforms logistic regression.

A naive Bayes classifier is built according to Bayes' theorem. Let the vector $x = (x_1, \cdots, x_n)$ represent $n$ independent features, and let $C_k$ denote each of the $K$ classes; in our case $C_k \in \{0, 1\}$, $k \in \{1, 2\}$. With Bayes' theorem, the conditional probability can be described as:

$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)} \tag{2.5}$$

Using the chain rule, the joint probability can be expanded as:

$$p(C_k, x_1, \cdots, x_n) = p(x_1, \cdots, x_n, C_k) \tag{2.6}$$

$$= p(x_1 \mid x_2, \cdots, x_n, C_k)\, p(x_2 \mid x_3, \cdots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k) \tag{2.7}$$

Assuming each feature $x_i$ is conditionally independent of every other feature $x_j$, $i \neq j$, given the class $C_k$, the conditional distribution over $C_k$ is:

$$p(C_k \mid x_1, \cdots, x_n) = \frac{1}{p(x)}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k) \tag{2.8}$$

Thus the corresponding naive Bayes classifier can be expressed as follows:

$$y = \underset{k \in \{1, \cdots, K\}}{\arg\max}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k) \tag{2.9}$$

where the class label is predicted as $y = C_k$ for the maximizing $k$.
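The decision rule in Eq. (2.9) can be illustrated with a small categorical naive Bayes sketch. Several choices below are assumptions made for the example rather than details from this project: the features are treated as categorical, the probabilities are estimated by counting with Laplace smoothing (`alpha`), the product is evaluated in log space to avoid underflow, and the attribute values are hypothetical.

```python
import math
from collections import Counter

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(C_k) and p(x_i | C_k) by counting categorical values.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    n_features = len(X[0])
    class_counts = Counter(y)
    # value_counts[k][i][v] = number of class-k samples whose feature i equals v
    value_counts = {k: [Counter() for _ in range(n_features)] for k in class_counts}
    values = [set() for _ in range(n_features)]
    for xs, k in zip(X, y):
        for i, v in enumerate(xs):
            value_counts[k][i][v] += 1
            values[i].add(v)
    return {"class_counts": class_counts, "value_counts": value_counts,
            "values": values, "alpha": alpha, "n": len(y)}

def predict(model, xs):
    """Return arg max_k p(C_k) * prod_i p(x_i | C_k), Eq. (2.9), in log space."""
    best_class, best_score = None, -math.inf
    for k, n_k in model["class_counts"].items():
        score = math.log(n_k / model["n"])
        for i, v in enumerate(xs):
            count = model["value_counts"][k][i][v]
            denom = n_k + model["alpha"] * len(model["values"][i])
            score += math.log((count + model["alpha"]) / denom)
        if score > best_score:
            best_class, best_score = k, score
    return best_class

# Hypothetical task attributes: (machine temperature, tool condition).
X = [("hot", "worn"), ("hot", "new"), ("cool", "new"),
     ("cool", "new"), ("hot", "worn"), ("cool", "worn")]
y = [1, 1, 0, 0, 1, 0]  # 0 = normal, 1 = fault
model = fit_naive_bayes(X, y)
```

The normalizing factor $1/p(x)$ from Eq. (2.8) is omitted because it is the same for every class and does not change the arg max.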

2.1.3 Support Vector Machine

Support vector machines (SVM), introduced by Cortes, Vapnik and coworkers [18] in the 1990s, are a family of supervised learning models widely used in classification and regression analysis. By constructing a hyperplane in a high- or infinite-dimensional space that separates the training data into two categories, the SVM algorithm creates a non-probabilistic binary classification model. To handle data that are not linearly separable, SVM uses kernel functions, which avoid computing the high-dimensional mapping explicitly and so keep the computation efficient.

SVM has been widely used in research in recent years, like other machine learning algorithms. Yu et al. [19] compared SVM and random forest based on real-time radar-derived rainfall forecasting and concluded that SVM outperforms random forest. Klass et al. [20] used an SVM model to capture lithium-ion battery dynamics. Dong et al. [21] conducted crash prediction at the level of traffic analysis zones with an SVM model.

In short, SVM is an algorithm that aims at finding the $(p - 1)$-dimensional hyperplane that best separates the given data points, represented as $p$-dimensional vectors.

Suppose the given training data has the form $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$, where the labels $y_i$ are transformed from $\{0, 1\}$ to $\{-1, 1\}$. A hyperplane can be written as $\vec{w} \cdot \vec{x} - b = 0$. Finding the hyperplane with the largest margin to both sides of the data then becomes the problem of maximizing the distance between the two parallel hyperplanes bounding the data labeled $y = 1$ and $y = -1$. These two hyperplanes can be described as

$$\vec{w} \cdot \vec{x} - b = 1 \tag{2.10}$$

$$\vec{w} \cdot \vec{x} - b = -1 \tag{2.11}$$

The distance between the two hyperplanes is $\frac{2}{\|\vec{w}\|}$, so to maximize this distance we minimize $\|\vec{w}\|$. To prevent data from falling into the margin, the following constraint is added: for each $i$ either

$$\vec{w} \cdot \vec{x}_i - b \geq 1, \quad \text{if } y_i = 1 \tag{2.12}$$

or

$$\vec{w} \cdot \vec{x}_i - b \leq -1, \quad \text{if } y_i = -1 \tag{2.13}$$

The above can be summarised as

$$y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1, \quad \text{for all } 1 \leq i \leq n \tag{2.14}$$


Since in our case the data are not linearly separable, the hinge loss function is introduced with the form

$$\max(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)) \tag{2.15}$$

so that for data on the wrong side of the margin, the value of the hinge loss is proportional to the distance from the margin.

The problem then becomes minimizing

$$\left[ \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)) \right] + \lambda \|\vec{w}\|^2 \tag{2.16}$$
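As an illustration of the objective in Eq. (2.16), the sketch below minimizes it directly by subgradient descent for a linear classifier. This is a swapped-in technique chosen for brevity, not the quadratic-programming route discussed in this section, and the data, `lam`, `lr` and `steps` are assumed values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(w, b, X, y, lam):
    """Mean hinge loss plus lam * ||w||^2, as in Eq. (2.16)."""
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) - b)) for xi, yi in zip(X, y))
    return hinge / len(X) + lam * dot(w, w)

def fit_linear_svm(X, y, lam=0.01, lr=0.1, steps=1000):
    """Subgradient descent on Eq. (2.16); lam, lr and steps are assumed values."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw = [2.0 * lam * wj for wj in w]   # gradient of the regularizer
        gb = 0.0
        for xi, yi in zip(X, y):
            # Only points inside the margin or misclassified contribute.
            if yi * (dot(w, xi) - b) < 1.0:
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb += yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    """Classify by the sign of w . x - b."""
    return 1 if dot(w, x) - b >= 0 else -1

# Hypothetical two-attribute data with labels already mapped to {-1, 1}.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
y = [-1, -1, -1, 1, 1, 1]
w, b = fit_linear_svm(X, y)
```

With a constant step size the iterates hover near the minimizer rather than converging exactly, which is sufficient to show how the hinge term and the regularizer interact.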

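The process of reducing the above expression to a quadratic programming problem is discussed below.

First, for each $i \in \{1, \ldots, n\}$ introduce a variable $\zeta_i = \max(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b))$, where $\zeta_i$ is the smallest non-negative number that satisfies $y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1 - \zeta_i$.

The optimization problem can be written as

$$\text{minimize} \quad \frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \|\vec{w}\|^2$$

$$\text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1 - \zeta_i \ \text{ and } \ \zeta_i \geq 0, \ \text{for all } i.$$

After solving the Lagrangian dual of the above, we get the simplified problem

$$\text{maximize} \quad f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (\vec{x}_i \cdot \vec{x}_j) y_j c_j$$

$$\text{subject to} \quad \sum_{i=1}^{n} c_i y_i = 0, \ \text{ and } \ 0 \leq c_i \leq \frac{1}{2n\lambda} \ \text{for all } i,$$

where the variables $c_i$ are defined such that $\vec{w} = \sum_{i=1}^{n} c_i y_i \vec{x}_i$. The offset $b$ can be computed, for an $\vec{x}_i$ lying on the margin's boundary, by solving

$$y_i(\vec{w} \cdot \vec{x}_i - b) = 1 \iff b = \vec{w} \cdot \vec{x}_i - y_i \tag{2.17}$$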

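To learn a non-linear classification rule for transformed data points $\varphi(\vec{x}_i)$, a kernel function $k$ is needed, which satisfies $k(\vec{x}_i, \vec{x}_j) = \varphi(\vec{x}_i) \cdot \varphi(\vec{x}_j)$. The vector $\vec{w}$ in the transformed space satisfies

$$\vec{w} = \sum_{i=1}^{n} c_i y_i \varphi(\vec{x}_i) \tag{2.18}$$

where the $c_i$ can be obtained by solving

$$\text{maximize} \quad f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i k(\vec{x}_i, \vec{x}_j) y_j c_j$$

$$\text{subject to} \quad \sum_{i=1}^{n} c_i y_i = 0, \ \text{ and } \ 0 \leq c_i \leq \frac{1}{2n\lambda} \ \text{for all } i,$$

and then

$$b = \vec{w} \cdot \varphi(\vec{x}_i) - y_i = \left[ \sum_{k=1}^{n} c_k y_k \varphi(\vec{x}_k) \cdot \varphi(\vec{x}_i) \right] - y_i \tag{2.19}$$

$$= \left[ \sum_{k=1}^{n} c_k y_k k(\vec{x}_k, \vec{x}_i) \right] - y_i \tag{2.20}$$

Eventually the new data points can be classified after computing

$$\vec{z} \mapsto \operatorname{sgn}(\vec{w} \cdot \varphi(\vec{z}) - b) = \operatorname{sgn}\left( \left[ \sum_{i=1}^{n} c_i y_i k(\vec{x}_i, \vec{z}) \right] - b \right). \tag{2.21}$$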
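Equations (2.20) and (2.21) can be sketched as follows, assuming the dual coefficients $c_i$ are already known. The Gaussian kernel, its `gamma` value and the two training points are assumptions of this example. For exactly two training points with opposite labels the dual has a closed-form maximizer, $c_1 = c_2 = 1 / (1 - k(\vec{x}_1, \vec{x}_2))$, provided the upper bound $\frac{1}{2n\lambda}$ is not active; a real problem would need a quadratic programming solver instead.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel k(u, v) = exp(-gamma * ||u - v||^2); gamma is assumed."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def offset(c, y, X, i, kernel):
    """b = sum_k c_k y_k k(x_k, x_i) - y_i, Eq. (2.20), for a margin point x_i."""
    return sum(ck * yk * kernel(xk, X[i]) for ck, yk, xk in zip(c, y, X)) - y[i]

def classify(z, c, y, X, b, kernel):
    """sgn(sum_i c_i y_i k(x_i, z) - b), Eq. (2.21)."""
    s = sum(ci * yi * kernel(xi, z) for ci, yi, xi in zip(c, y, X)) - b
    return 1 if s >= 0 else -1

# Hypothetical two-point training set, labels in {-1, 1}.
X = [[0.0, 0.0], [2.0, 2.0]]
y = [-1, 1]

# Closed-form dual solution for this two-point case (box constraint assumed inactive).
k12 = rbf_kernel(X[0], X[1])
c = [1.0 / (1.0 - k12)] * 2

b = offset(c, y, X, 0, rbf_kernel)
```

Note that `classify` never evaluates $\varphi$ itself: only kernel values between the new point and the training points are needed.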

2.1.4 Ensemble Learning

Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the performance (classification, prediction, function approximation, etc.) of a model, or to reduce the likelihood of an unfortunate selection of a poor one. Other applications of ensemble learning include assigning a confidence to the decision made by the model, selecting optimal (or near optimal) features, data fusion, incremental learning, nonstationary learning and error-correcting [22].

Singh et al. [23] developed tree ensemble models for seasonal discrimination and air quality prediction and found that their models outperform SVMs. Elowsson and Friberg [24] used an ensemble learning model to predict performed dynamics of music audio and found the result well above that of individual human listeners. Sing and Gupta [25] used a few simple non-quantum mechanical molecular descriptors in an ensemble classifier to discriminate toxic and non-toxic chemicals and predict the toxicity of chemicals in multiple species. Bai and Wang [26] developed a multi-view ensemble learning model to improve malware detection and successfully reduced the false alarm rate.

In our case, a simple stacking ensemble classifier was built. Unlike bagging and boosting, a stacking ensemble classifier does not use a weighted or equal vote from sub-classifiers to output its prediction. Similar to the architecture of artificial neural networks, stacking ensemble classifiers are built from layers of different learning algorithms plus one learning algorithm that combines the predictions of all the others. The training process of a stacking ensemble classifier involves two steps. First, all of the other algorithms are trained using the training data set. Then the combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. In our case, the stacking ensemble classifier has two layers. The first layer has a random forest classifier and an SVM classifier, and the second layer is a logistic regression classifier that acts as the combining algorithm and produces the prediction.
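The two training steps can be sketched generically. To keep the example self-contained, the first-layer learners and the combiner below are hypothetical stand-ins (`ThresholdRule` and `LookupCombiner`), not the random forest, SVM and logistic regression models used in this project, and the combiner here sees only the first-layer predictions rather than the original attributes as well.

```python
from collections import Counter, defaultdict

class ThresholdRule:
    """Hypothetical stand-in base learner: thresholds a single attribute."""

    def __init__(self, feature):
        self.feature = feature
        self.threshold = 0.0
        self.flip = False

    def fit(self, X, y):
        # Place the threshold midway between the two class means of the attribute.
        mean = {}
        for label in (0, 1):
            vals = [x[self.feature] for x, t in zip(X, y) if t == label]
            mean[label] = sum(vals) / len(vals)
        self.threshold = (mean[0] + mean[1]) / 2.0
        self.flip = mean[1] < mean[0]   # class 1 lies below the threshold
        return self

    def predict(self, x):
        above = x[self.feature] > self.threshold
        return int(above != self.flip)

class LookupCombiner:
    """Hypothetical stand-in combiner: majority label per pattern of base predictions."""

    def fit(self, Z, y):
        votes = defaultdict(Counter)
        for z, t in zip(Z, y):
            votes[tuple(z)][t] += 1
        self.table = {z: c.most_common(1)[0][0] for z, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, z):
        return self.table.get(tuple(z), self.default)

def fit_stacking(base_learners, combiner, X, y):
    # Step 1: train every first-layer learner on the training set.
    for learner in base_learners:
        learner.fit(X, y)
    # Step 2: train the combiner on the first-layer predictions.
    Z = [[learner.predict(x) for learner in base_learners] for x in X]
    combiner.fit(Z, y)

def predict_stacking(base_learners, combiner, x):
    return combiner.predict([learner.predict(x) for learner in base_learners])

# Hypothetical two-attribute data, 0 = normal, 1 = fault.
X = [[0.1, 5.0], [0.3, 4.0], [0.2, 6.0], [0.9, 1.0], [0.8, 2.0], [0.7, 1.5]]
y = [0, 0, 0, 1, 1, 1]
base = [ThresholdRule(0), ThresholdRule(1)]
combiner = LookupCombiner()
fit_stacking(base, combiner, X, y)
```

Any objects exposing `fit` and `predict` can be dropped into `base` and `combiner`, which is the property that lets a stacking ensemble mix different learning algorithms.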

The structure of the ensemble classifier used in this case is shown in Figure 2.1.

Figure 2.1: The structure of ensemble classifier.

2.2 Artificial Neural Network and its Variations

2.2.1 MLP

The multilayer perceptron (MLP) is a multi-layer feedforward network trained with the error backpropagation algorithm and is one of the most widely applied neural network models. An MLP contains one input layer, at least one hidden layer and one output layer. The input layer has the same number of nodes as the number of attributes of the input data. The numbers of nodes in the hidden layers are set manually and are supposed to decrease from one layer to the next until the output layer. Each node is connected to every node in the next layer by weights, while the nodes in the same layer are not connected. The structure of an MLP model is shown in Figure 2.2. An MLP can learn and store a large number of input-output mapping relations without the mathematical equation describing these mappings being specified in advance. Its learning rule adopts the steepest descent method, in which backpropagation is used to adjust the weights and thresholds of the network so as to minimize the sum of squared errors.

MLP has been studied in many different fields including imagerecognition [27], electrical engineering [28] as well as on-line recom-mendation [29].

For a general feed-forward network, the training process using backpropagation can be divided into three phases. First, a forward pass is performed where the activities of the nodes are computed layer after layer. Then the backward pass is performed, where an error signal δ is computed for each node. Because the value of δ depends on the values of δ in the following layer, this second step is performed backward, from the output layer to the input layer. The final step is the weight update. The computations of the three phases are described in detail as follows.

Figure 2.2: The structure of a MLP classifier with 2 hidden layers

First let $x_i$ denote the activity level of node $i$ in the input layer, $h_j$ the activity of node $j$ in the hidden layer, and $\varphi(x)$ the non-linear transfer function of the network. The output signal $h_j$ becomes

$$h_j = \varphi(h_j^*) \tag{2.22}$$

where $h_j^*$ denotes the summed input signal to node $j$, $h_j^* = \sum_i w_{j,i} x_i$, and $w_{j,i}$ is the weight between nodes $j$ and $i$. After the same operation in the next layer with $k$ nodes, the final output can be written as

$$o_k = \varphi(o_k^*), \qquad o_k^* = \sum_j v_{k,j} h_j \tag{2.23}$$

The second phase is the backward pass computation. In this phase the error δ at the output layer is computed as

$$\delta_k^{(o)} = (o_k - t_k) \cdot \varphi'(o_k^*) \tag{2.24}$$

where $t_k$ is the target value for output node $k$. The error δ in the hidden layer is

$$\delta_j^{(h)} = \left( \sum_k v_{k,j} \delta_k^{(o)} \right) \cdot \varphi'(h_j^*) \tag{2.25}$$

In the final step, the weight update, with learning rate $\eta$, becomes

$$\Delta w_{j,i} = -\eta x_i \delta_j^{(h)} \tag{2.26}$$

$$\Delta v_{k,j} = -\eta h_j \delta_k^{(o)} \tag{2.27}$$
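The three phases in Eqs. (2.22) to (2.27) can be traced in a minimal sketch with one hidden layer. The sigmoid transfer function, the initial weights, the single training example and the learning rate `eta` are assumptions made for this illustration; bias (threshold) terms are left out to match the equations as written.

```python
import math

def phi(z):
    """Sigmoid transfer function (an assumed choice of phi)."""
    return 1.0 / (1.0 + math.exp(-z))

def phi_prime(z):
    s = phi(z)
    return s * (1.0 - s)

def forward(W, V, x):
    """Forward pass, Eqs. (2.22) and (2.23)."""
    h_star = [sum(wji * xi for wji, xi in zip(row, x)) for row in W]
    h = [phi(z) for z in h_star]
    o_star = [sum(vkj * hj for vkj, hj in zip(row, h)) for row in V]
    o = [phi(z) for z in o_star]
    return h_star, h, o_star, o

def train_step(W, V, x, t, eta=0.5):
    """One backpropagation step; updates W and V in place."""
    h_star, h, o_star, o = forward(W, V, x)
    # Backward pass, Eqs. (2.24) and (2.25).
    delta_o = [(ok - tk) * phi_prime(zk) for ok, tk, zk in zip(o, t, o_star)]
    delta_h = [sum(V[k][j] * delta_o[k] for k in range(len(V))) * phi_prime(h_star[j])
               for j in range(len(W))]
    # Weight update, Eqs. (2.26) and (2.27).
    for k in range(len(V)):
        for j in range(len(h)):
            V[k][j] -= eta * h[j] * delta_o[k]
    for j in range(len(W)):
        for i in range(len(x)):
            W[j][i] -= eta * x[i] * delta_h[j]

def squared_error(W, V, x, t):
    o = forward(W, V, x)[3]
    return 0.5 * sum((ok - tk) ** 2 for ok, tk in zip(o, t))

# Hypothetical 2-2-1 network and a single training example.
W = [[0.1, -0.2], [0.3, 0.4]]   # hidden-layer weights w_{j,i}
V = [[0.5, -0.5]]               # output-layer weights v_{k,j}
x, t = [1.0, 2.0], [1.0]

error_before = squared_error(W, V, x, t)
for _ in range(100):
    train_step(W, V, x, t)
error_after = squared_error(W, V, x, t)
```

Note that `delta_h` is computed before `V` is modified, so the backward pass uses the same weights as the forward pass.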

2.2.2 Radial Basis Function Neural Network

The radial basis function network (RBF network), first formulated in a 1988 paper by Broomhead and Lowe [30], is a class of artificial neural networks that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters. RBF networks are now widely used in function approximation, time series prediction, classification, and system control.

Dey et al. [31] introduced computational intelligence to spectrumsensing in cognitive radio by using RBF network as a model and foundtheir model better than conventional ones. Nishikawa and Ozawa [32]proposed a novel type of RBF network for multitask pattern recogni-tion. Dai et al. [33] developed an improved RBF network for structuralreliability analysis. Sabour and Movahed [34] applied RBF neural net-work to predict soil sorption partition coefficient.

The architecture of a typical RBF network consists of three layers: an input layer, a hidden layer with a non-linear RBF activation function and an output layer. We model the input layer as a vector of real numbers $x \in \mathbb{R}^n$, and the output of the network is thus a scalar function of the input vector, $\varphi : \mathbb{R}^n \to \mathbb{R}$, which can be described as:

$$\varphi(x) = \sum_{i=1}^{N} a_i \rho(\|x - c_i\|) \tag{2.28}$$

where $N$ is the number of neurons in the hidden layer, $c_i$ is the center vector for neuron $i$ and $a_i$ is the weight of neuron $i$ in the linear output neuron. The radial basis function in the hidden layer is commonly taken to be Gaussian:

$$\rho(\|x - c_i\|) = \exp\left[ -\beta \|x - c_i\|^2 \right] \tag{2.29}$$

and the Gaussian basis functions are local to the center vector, as

$$\lim_{\|x\| \to \infty} \rho(\|x - c_i\|) = 0 \tag{2.30}$$

The training of RBF networks is performed using pairs of input and target values $x(n), y(n)$, $n = 1, \ldots, N$, with a three-step algorithm.

Step one, choose the center vectors $c_i$ of the RBF functions in the hidden layer by random sampling from the training set.

Step two, fit a linear model with coefficients $w_i$ to the hidden layer's outputs with respect to some objective function. In our case the objective function is the least squares function:

$$K(w) \overset{\text{def}}{=} \sum_{n=1}^{N} K_n(w) \tag{2.31}$$

where $K_n(w)$ is the squared error between the target $y(n)$ and the network output for $x(n)$.

The third step is backpropagation to fine-tune all of the RBF network's parameters, and the method is described in section 2.6.
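Steps one and two can be sketched as below, under several assumptions: the toy data, `beta`, the random seed, and the use of gradient descent to minimize the least squares objective of Eq. (2.31) (a closed-form least squares solve would serve equally well). The fine-tuning of step three is not included.

```python
import math
import random

def rbf_features(x, centers, beta):
    """Gaussian basis values rho(||x - c_i||), as in Eq. (2.29)."""
    return [math.exp(-beta * sum((xj - cj) ** 2 for xj, cj in zip(x, c)))
            for c in centers]

def rbf_output(x, centers, weights, beta):
    """Network output: weighted sum of basis values, as in Eq. (2.28)."""
    return sum(a * r for a, r in zip(weights, rbf_features(x, centers, beta)))

def fit_rbf(X, y, n_centers, beta=1.0, lr=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    # Step one: pick centers by random sampling from the training set.
    centers = rng.sample(X, n_centers)
    # Step two: fit the linear output weights by minimizing the sum of squared errors.
    weights = [0.0] * n_centers
    feats = [rbf_features(x, centers, beta) for x in X]
    for _ in range(steps):
        grad = [0.0] * n_centers
        for f, target in zip(feats, y):
            err = sum(a * r for a, r in zip(weights, f)) - target
            for i, r in enumerate(f):
                grad[i] += 2.0 * err * r
        weights = [a - lr * g for a, g in zip(weights, grad)]
    return centers, weights

# Hypothetical one-attribute data, 0 = normal, 1 = fault.
X = [[0.0], [1.0], [4.0], [5.0]]
y = [0.0, 0.0, 1.0, 1.0]
beta = 1.0
centers, weights = fit_rbf(X, y, n_centers=len(X), beta=beta)
```

Far from every center the output decays to zero, which is the locality property stated in Eq. (2.30); for example `rbf_output([100.0], centers, weights, beta)` is numerically zero.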

2.2.3 Hopfield Network

The Hopfield network is a form of recurrent artificial neural network described by Little in 1974 and popularized by John Hopfield in 1982 [35][36]. Besides its recurrent structure, a main feature of the Hopfield network is the use of binary threshold units, meaning the units only take on two different values for their states, and the value is determined by whether or not the unit's input exceeds its threshold. Since the Hopfield network has the ability of association, which in machine learning means the ability to retrieve distorted patterns, it provides a way to understand human memory as well.

Much research has been conducted on the Hopfield network in recent years. For example, Srivastava et al. [37] used a Hopfield network to model the microtubules in the brain. Basistov and Yanovskii [38] compared Bayes, a correlation algorithm and the Hopfield network in the field of image recognition, finding Bayes and Hopfield to be equals that outperform the correlation algorithm in general. Furthermore, the Hopfield network was also used in biochemistry by Zou et al. [39] for RNA secondary structure prediction. Li and Serpen [40] equipped a wireless sensor network with computational intelligence and adaptation capability using a Hopfield network model.

The structure of a Hopfield network is shown in Figure 2.3. The units in Hopfield networks are binary threshold units that only take on values of 1 or −1. Every pair of units $i$ and $j$ in a Hopfield network is connected by a connectivity weight $w_{ij}$, and the connections follow the restrictions below:

$$w_{ii} = 0, \ \forall i \qquad w_{ij} = w_{ji}, \ \forall i, j$$

The update of one unit in the Hopfield network follows the rule below:

$$s_i \leftarrow \begin{cases} 1 & \text{if } \sum_j w_{ij} s_j \geq \theta_i, \\ -1 & \text{otherwise} \end{cases} \tag{2.32}$$

where $w_{ij}$ is the weight of the connection, $s_j$ is the state of unit $j$ and $\theta_i$ is the threshold of unit $i$.

To train the network, we use the Storkey learning rule in our case. The weight matrix is said to follow the Storkey learning rule if it obeys:

$$w_{ij}^{\nu} = w_{ij}^{\nu-1} + \frac{1}{n} \epsilon_i^{\nu} \epsilon_j^{\nu} - \frac{1}{n} \epsilon_i^{\nu} h_{ji}^{\nu} - \frac{1}{n} \epsilon_j^{\nu} h_{ij}^{\nu} \tag{2.33}$$

where $h_{ij}^{\nu} = \sum_{k=1, k \neq i,j}^{n} w_{ik}^{\nu-1} \epsilon_k^{\nu}$ is a form of local field at neuron $i$, $n$ is the number of neurons and $\epsilon_i^{\nu}$ is bit $i$ of the $\nu$-th training pattern.


Figure 2.3: Typical structure of a Hopfield network.

Since the main function of a Hopfield network is to remember and recall, we have changed our data structure slightly to fit our goal of binary prediction. For every vector of training data, the corresponding label is transformed into a one-hot label and appended to the end of the attribute vector. For example, an input vector with 200 attributes becomes a vector of length 202, with 200 attributes and 2 label digits at its end. To perform a prediction, a 200-attribute input vector from the test set is fed into the network, and the output "restores" the input vector to a vector of length 202, whose last two digits are our predicted label.
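The label-appending scheme, the Storkey rule of Eq. (2.33) and the update rule of Eq. (2.32) can be combined in a small sketch. The following choices are assumptions of this example and not details taken from this project: attributes and one-hot labels are coded as +1/−1, the unknown label units start at 0 before recall, all thresholds are 0, units are updated one at a time in index order, and the 6-attribute patterns are hypothetical.

```python
def storkey_train(patterns):
    """Build the weight matrix with the Storkey rule, Eq. (2.33)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for eps in patterns:
        # Full local fields under the previous weights (the diagonal of W is zero).
        field = [sum(W[i][k] * eps[k] for k in range(n)) for i in range(n)]
        new_W = [row[:] for row in W]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue   # keep w_ii = 0
                # h_ij sums over k != i, j, so remove the k = j term.
                h_ij = field[i] - W[i][j] * eps[j]
                h_ji = field[j] - W[j][i] * eps[i]
                new_W[i][j] += (eps[i] * eps[j] - eps[i] * h_ji - eps[j] * h_ij) / n
        W = new_W
    return W

def recall(W, state, thresholds=None, sweeps=5):
    """Asynchronous updates following Eq. (2.32)."""
    n = len(state)
    s = list(state)
    theta = thresholds or [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            total = sum(W[i][j] * s[j] for j in range(n))
            s[i] = 1 if total >= theta[i] else -1
    return s

NORMAL, FAULT = [1, -1], [-1, 1]   # one-hot labels in +1/-1 coding (assumed)

def predict_label(W, attributes):
    # Label units start at 0 (unknown); this initial value is an assumption.
    restored = recall(W, list(attributes) + [0, 0])
    return restored[-2:]

# Hypothetical training vectors: 6 attributes followed by the 2 label units.
train = [
    [1, 1, 1, -1, -1, -1] + NORMAL,
    [1, 1, -1, -1, -1, 1] + FAULT,
]
W = storkey_train(train)
```

`predict_label` returns the last two units of the restored vector, which play the role of the two label digits described above.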

2.2.4 Convolutional Neural Network

The convolutional neural network (CNN) is a class of deep feed-forward artificial neural network. A convolutional neural network is comprised of one or more convolutional layers (often with a sub-sampling step), followed by one or more fully connected layers as in a standard artificial neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling, which results in translation-invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.

Along with the development of GPU computation and other supporting techniques, CNNs have achieved significant results in image recognition, video analysis, natural language processing and other fields. More and more researchers are focusing on CNNs now. Sun et al. [41] developed an enhanced deep CNN for breast cancer diagnosis. Hajinoroozi et al. [42] used a deep CNN to predict driver cognitive performance with EEG data. Zhang et al. [43] developed an adaptive CNN and explored its performance on face recognition. Perlin et al. [44] used a CNN as an approach to extract multiple human attributes from images. Janssens et al. [45] used a CNN for fault detection on rotating machinery.

A typical CNN is constructed by three different kinds of layers:convolutional layer, pooling layer and fully connected layer.

The convolutional layer consists of a set of kernels with a small receptive field. During the forward pass, each kernel is convolved across the input volume, computing the dot product and producing a two-dimensional map of that kernel.

A pooling layer, in our case a max pooling layer, partitions the input map into a set of rectangles and, for each sub-region, outputs the maximum value. In this way, the pooling layer serves to significantly reduce the spatial size of the representation, and thus reduces the number of parameters and the computational cost as well as controlling overfitting.

Fully connected layers are constructed after the convolutional and max pooling layers in order to perform high-level reasoning in the network. In a fully connected layer, neurons have connections to all activations in the previous layer.

In our case, a six-layer CNN was built to perform predictions. The first layer is a convolutional layer which also serves as the input layer. The second layer is a max pooling layer. The third layer is another convolutional layer, followed by another max pooling layer as the fourth layer. After max pooling comes the first fully connected layer, followed by a second fully connected layer that outputs the logits of the prediction. The structure of the CNN is shown in Figure 2.4.

Figure 2.4: CNN with 2 conv layers and 2 max pooling layers.
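The two layer types that make up the first four layers can be sketched in plain Python. This only illustrates a single convolution followed by max pooling on a tiny hypothetical input with one hand-picked kernel; it is not the six-layer network itself. As in most deep learning libraries, the "convolution" is computed without flipping the kernel (cross-correlation), with no padding and stride 1, all of which are assumptions of this sketch.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2d(fmap, size=2):
    """Partition the map into size x size blocks and keep the maximum of each."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Hypothetical 3x3 input and 2x2 kernel.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]

feature_map = conv2d(image, kernel)   # [[6, 8], [12, 14]]
pooled = max_pool2d(feature_map)      # [[14]]
```

The 3x3 input shrinks to a 2x2 feature map and then to a single value, showing how pooling reduces the spatial size of the representation passed to the fully connected layers.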

2.2.5 Deep Belief Network

Deep belief nets (DBNs) are probabilistic generative models that are composed of multiple layers of stochastic, latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. The top two layers have undirected, symmetric connections between them and form an associative memory. The lower layers receive top-down, directed connections from the layer above. The states of the units in the lowest layer represent a data vector [46]. A DBN can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs). The structure of a typical DBN is shown in Figure 2.5. In this structure, each sub-network's hidden layer serves as the visible layer for the next, which leads to a fast, layer-by-layer training process.

Deep belief nets have been widely used for image and video sequence recognition, as well as fault diagnosis and prediction in other fields. For example, Liu et al. [47] developed a discriminative DBN for visual data classification which outperforms both representative semi-supervised classifiers and existing deep learning techniques. Huang et al. [48] combined a regression-based DBN with SVM to predict the sound quality of vehicle interior noise; their results show that the combined model outperforms four conventional machine learning methods: multiple linear regression (MLR), back-propagation neural network (BPNN), general regression neural network (GRNN) and SVM. Shen et al. [49] used a DBN for exchange rate forecasting in finance and found their model better than typical forecasting methods such as the feed forward neural network (FFNN). Feng et al. [50] developed a DBN model for a fault-diagnosis simulation study of a gas turbine engine.

Figure 2.5: Schematic of DBN structure

The training method for the DBN components (RBMs) is called contrastive divergence (CD), which provides an approximation of maximum likelihood for learning the weights. To train a single RBM, the gradient-based weight update is described as:

$$w_{ij}(t+1) = w_{ij}(t) + \eta \frac{\partial \log p(v)}{\partial w_{ij}} \tag{2.34}$$

where $p(v)$ is the probability of a visible vector, given by $p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$. Here $Z$ is the partition function and $E(v, h)$ is the energy function assigned to the state of the network.

The CD process can be described as five steps:


step 1 Initialize the visible units to a training vector.

step 2 Update the hidden units in parallel given the visible units: $p(h_j = 1 \mid V) = \sigma(b_j + \sum_i v_i w_{ij})$, where $\sigma$ is the sigmoid function and $b_j$ is the bias of $h_j$.

step 3 Update the visible units in parallel given the hidden units: $p(v_i = 1 \mid H) = \sigma(a_i + \sum_j h_j w_{ij})$, where $a_i$ is the bias of $v_i$.

step 4 Re-update the hidden units in parallel given the reconstructed visible units, as in step 2.

step 5 Perform the weight update: $\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{reconstruction}}$.
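The five steps can be sketched for a tiny RBM in plain Python. This is an illustrative sketch under stated assumptions, not the TensorFlow implementation used in the experiments: it performs a single Gibbs step (CD-1) on one training vector at a time, uses hidden probabilities rather than sampled states in the statistics of step 5, and also updates the biases, which the step list above does not spell out. The function names, the network size and the training vector are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probabilities, rng):
    """Draw binary states from independent Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def cd1_update(v0, w, a, b, rng, lr=0.05):
    """One CD-1 step on a binary training vector v0 (step 1 is the choice of v0).

    w[i][j] links visible unit i to hidden unit j; a and b are the visible
    and hidden biases. All parameters are updated in place.
    """
    n_v, n_h = len(a), len(b)
    # Step 2: hidden probabilities and sampled states given the data vector.
    ph0 = [sigmoid(b[j] + sum(v0[i] * w[i][j] for i in range(n_v)))
           for j in range(n_h)]
    h0 = sample(ph0, rng)
    # Step 3: reconstruct the visible units from the hidden states.
    pv1 = [sigmoid(a[i] + sum(h0[j] * w[i][j] for j in range(n_h)))
           for i in range(n_v)]
    v1 = sample(pv1, rng)
    # Step 4: hidden probabilities given the reconstruction.
    ph1 = [sigmoid(b[j] + sum(v1[i] * w[i][j] for i in range(n_v)))
           for j in range(n_h)]
    # Step 5: data statistics minus reconstruction statistics.
    for i in range(n_v):
        for j in range(n_h):
            w[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    # Bias updates (an assumption of this sketch).
    for i in range(n_v):
        a[i] += lr * (v0[i] - v1[i])
    for j in range(n_h):
        b[j] += lr * (ph0[j] - ph1[j])


rng = random.Random(0)
n_visible, n_hidden = 4, 2
w = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden

for _ in range(100):
    cd1_update([1, 1, 0, 0], w, a, b, rng)
```

The default learning rate of 0.05 mirrors the RBM learning rate listed in Table 2.3; every other constant is chosen only for illustration.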

Once an RBM is trained, another RBM is stacked on top of it, taking its input from the final trained layer. The new visible layer is initialized to a training vector, and the values of the units in the already-trained layers are assigned using the current weights and biases. The new RBM is then trained with the same procedure. This process is repeated for a given number of iterations.

2.2.6 Stacked Auto-encoders

A stacked auto-encoder is a neural network consisting of multiple layers of sparse auto-encoders in which the outputs of each layer are wired to the inputs of the successive layer. Stacked auto-encoders take advantage of the greedy layerwise approach for pretraining a deep network by training each layer in turn. To do this, first train the first layer on the raw input to obtain its weights and biases. Use the first layer to transform the raw input into a vector consisting of the activations of the hidden units. Train the second layer on this vector to obtain its weights and biases. Repeat for subsequent layers, using the output of each layer as input for the subsequent layer.

This method trains the parameters of each layer individually while freezing the parameters of the remainder of the model. After this phase of training is complete, fine-tuning with back propagation can be used to improve the results by adjusting the parameters of all layers at the same time.


Wu et al. [51] constructed a stacked denoising auto-encoder architecture with an adaptive learning rate for action recognition based on skeleton features, and reported better robustness and accuracy than classic machine learning models including SVM, REFTrees, Linear Regression, RBF Network and Deep Belief Network. Suk et al. [52] used stacked auto-encoders for the diagnosis of Alzheimer's disease and its prodromal stage, mild cognitive impairment. Ijjina and Mohan C [53] built a stacked auto-encoder for human action classification using pose-based features.

2.3 Algorithm Configuration

Each algorithm was configured with a fixed set of parameters. During training and testing on all five data sets, the parameters remain the same to preserve the model structure. The configuration of the typical machine learning algorithms in Section 2.1 is shown in Table 2.1. The configuration of the ANN-related algorithms in Section 2.2 is shown in Table 2.2 and Table 2.3.

Table 2.1: Configuration of typical ML algorithms

Algorithm   Configuration
LR          learning rate = 0.01, regularization constant = 0, iteration = 100
NB          model type = Gaussian
SVM         kernel = linear, penalty parameter = 1.0
Ensemble    first layer = random forest + SVM, second layer = LR

Table 2.2: Configuration of ANN algorithms

Algorithm    Structure               Learning Rate   Iteration  Estimator       Batch Size
MLP          hidden (20+10 nodes)    0.0001          200        Adam            Not applicable
RBFNN        hidden kernels = 20, function = interpolation
Hopfield NN  attribute number + 2    Not applicable  1000       Not applicable  100

The selected algorithms have simple structures for test efficiency and sufficiently many training iterations to guarantee convergence. The configuration of libraries and GPU acceleration is presented in Table 2.5. The system environment of all training and testing processes is presented in Table 2.4.


Table 2.3: Configuration of Deep NN algorithms

Algorithm               CNN
Structure               conv layer + pooling + conv layer + pooling + fc layer + fc layer
Kernel                  conv1: [2,2]-20 kernels, conv2: [2,2]-40 kernels, pooling: [1, 2, 2, 1]
Fully Connected Layer   fc layer 1: 10 nodes, fc layer 2: 2 nodes
Activation Function     ReLU
Learning Rate           0.0001
Estimator               Adam
Iteration               5000
Batch Size              100

Algorithm               DBN
Structure               2 hidden layers, 20 units each
Activation Function     ReLU
Learning Rate RBM       0.05
Learning Rate Network   0.1
Estimator               Adam
Iteration RBM           100
Iteration BP            100
Batch Size              100

Algorithm               SAEs
Structure               100 auto-encoders + 200 neurons in final layer
Noise Rate              0.3
Pre-train               100 epochs
Final Layer             100 epochs
Fine Tune               100 epochs

Table 2.4: System environment for all tests

CPU        Intel(R) Core(TM) i7-6700HQ @2.60GHz
GPU        NVIDIA GeForce GTX 960M
System     Windows 10
Software   Python 3.5


Table 2.5: Libraries and computational environment

Algorithm    Libraries     GPU Acceleration
LR           None          No
NB           Sklearn       No
SVM          Sklearn       No
Ensemble     ML-ENS [54]   No
MLP          Sklearn       No
RBFNN        None          No
Hopfield NN  TensorFlow    Yes
CNN          TensorFlow    Yes
DBN          TensorFlow    Yes
SAEs         None          No

Chapter 3

Evaluation

3.1 Data Generation

Typical machine learning models are trained on experimental data. However, since scheduling data suitable for big data analytics are unavailable, hypothetical data were employed in this project. Therefore, the selected algorithms are trained on several data sets with different complexities and scales. A set of artificial patterns is designed to represent the data on the shop floor.

3.1.1 Data Structure

The design of the data structure is inspired by the structure of artificial neural networks. Starting from the data attributes, layers of logic relations are added one above another until the final label. Here the notion of depth describes how many layers of logic relations there are, and thus gives a way to evaluate and control the complexity of the data sets. For example, the data structure of depth 6 and attribute number 400 is shown in Figure 3.1.

Figure 3.1: The data structure of depth 6 and attribute number 400.

Inspired by the work of Chun Wang et al. [cite international journal of Production research], where the shop floor is modeled as a collection of work cells, this project assumes that the data attributes in one data point are composed of several sections. In Figure 3.1, every five attributes represent a work cell on the shop floor. Taking the work cell {x1, . . . x5} as an example, the corresponding hidden logic unit a1 at depth=1 is calculated as:

$$a_1 = \begin{cases} 0, & \frac{1}{5}\sum_{i=1}^{5} x_i \le 0.5 \\ 1, & \frac{1}{5}\sum_{i=1}^{5} x_i > 0.5 \end{cases}$$

The hidden logic unit b1 is calculated as:

$$b_1 = a_1 \lor a_2$$

In the logic relation from depth=2 to depth=3, dependence is added. Taking the section of hidden units {b1 . . . b5} as an example, the rule is that if any unit bi, i = 1, 2, . . . , 5, has the value 1, then all the following units in this section take the value 1, no matter what value they obtained from depth=1. This relation is added to represent the influence of a previous process on its successors. The hidden logic unit c1 is calculated by:

$$c_1 = \begin{cases} 0, & \sum_{i=1}^{5} b_i < 3 \\ 1, & \sum_{i=1}^{5} b_i \ge 3 \end{cases}$$

At depth=4, the hidden logic unit d1 is calculated as:

$$d_1 = c_1 \lor c_2$$

At depth=5, the hidden units are calculated as:

$$e_1 = d_1 \land d_2$$

and

$$e_2 = d_3 \land d_4$$

The final label is then calculated by:

$$\text{label} = e_1 \lor e_2$$
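The chain of logic relations from the attributes to the label can be traced with a short sketch in plain Python. It follows the depth 6 structure with 400 attributes and five attributes per work cell, so the layer sizes are 400, 80, 40, 8, 4, 2 and 1. Two points are assumptions of the sketch rather than statements of the thesis: a section whose sum of b units equals exactly 3 is mapped to 1, and the pairing and sectioning of units simply follows their index order. All function names are hypothetical.

```python
def threshold_cells(x, cell_size=5):
    """Depth 1: a unit is 1 when the mean of its work cell exceeds 0.5."""
    return [
        1 if sum(x[k:k + cell_size]) / cell_size > 0.5 else 0
        for k in range(0, len(x), cell_size)
    ]


def pairwise(units, op):
    """Combine neighbouring units two at a time with a binary logic operator."""
    return [op(units[k], units[k + 1]) for k in range(0, len(units), 2)]


def propagate(units, section_size=5):
    """Dependence rule: after the first 1 in a section, every later unit becomes 1."""
    out = []
    for k in range(0, len(units), section_size):
        seen = 0
        for u in units[k:k + section_size]:
            seen = seen or u
            out.append(seen)
    return out


def count_sections(units, section_size=5, minimum=3):
    """Depth 3: a unit is 1 when at least `minimum` units of its section are 1."""
    return [
        1 if sum(units[k:k + section_size]) >= minimum else 0
        for k in range(0, len(units), section_size)
    ]


def label_depth6(x):
    a = threshold_cells(x)                # 400 attributes -> 80 units
    b = pairwise(a, lambda p, q: p | q)   # 80 -> 40, logical OR
    c = count_sections(propagate(b))      # 40 -> 8, dependence then counting
    d = pairwise(c, lambda p, q: p | q)   # 8 -> 4, logical OR
    e = pairwise(d, lambda p, q: p & q)   # 4 -> 2, logical AND
    return e[0] | e[1]


no_fault = label_depth6([0.0] * 400)  # every unit stays 0
fault = label_depth6([1.0] * 400)     # every unit becomes 1
```

Because of the AND at depth 5, a single active work cell is not enough to raise the label in this sketch: two neighbouring d units have to be active at the same time.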

The networks used to generate data with depth 4 and depth 2 are similar to the depth 6 network in Figure 3.1. For the depth 4 network model, layers d and e in Figure 3.1 are removed and the final label is calculated as:

$$\text{label}_{\text{depth}=4} = c_1 \lor c_2 \lor c_3 \lor c_4$$


For the depth 2 network model, the structure has only three layers: input data (depth=0), layer a (depth=1) and label (depth=2). In this case the input data is divided evenly into four sectors. The final label is calculated as:

$$\text{label}_{\text{depth}=2} = a_1 \lor a_2 \lor a_3 \lor a_4$$

3.1.2 Data Attributes

In our case, each data set has a fixed number of data points in both the training set and the testing set. The attribute number of a data set refers to the length of one observation (data point) and describes the scale of one input data point x: x is a vector x = [x1, x2, . . . , xn], where n equals the corresponding attribute number. Each attribute is also a hypothetical representation of a certain task in shop floor scheduling, which gives a way to represent the difference in complexity of shop floor scheduling tasks. It is assumed that all attributes have already been preprocessed and normalized. Every attribute xi takes a value as: xi ∼ N(0, 0.01) or xi ∼ N(1, 0.01).
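As an illustration, one data point with 400 attributes could be sampled as in the sketch below. It assumes that the second argument of N(·, 0.01) is a variance, so the standard deviation is 0.1, and that each attribute is drawn from the two distributions with equal probability; the thesis does not state either choice, and the function name is hypothetical.

```python
import random


def sample_attributes(n, rng, p_high=0.5, std=0.1):
    """Draw n attributes, each from N(0, std^2) or N(1, std^2)."""
    return [
        rng.gauss(1.0 if rng.random() < p_high else 0.0, std)
        for _ in range(n)
    ]


rng = random.Random(42)
x = sample_attributes(400, rng)  # one hypothetical data point
```

With a standard deviation this small, the two groups of values stay well separated around 0 and 1.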

For every observation (data point) there is also a corresponding label that represents the result of this scheduling. The value of each label is either 0 or 1, where 0 means no fault is reported and 1 means there is a fault.

For data sets B, C and D, the depth of the data remains 4 while the number of attributes increases from 200 to 400 and to 600. In these three data sets the logic structure remains the same; the only difference is at the input level (depth=0), meaning that each sector of input data has twice as many attributes in data set C and three times as many in data set D as in data set B.

3.1.3 Data Metrics

By varying this hypothetical structure, five data sets with controlled complexities and scales were produced. Here the complexity of a data set is represented by the "depth" of the logic relationship in the hypothetical structure, and the scale of a data set is represented by the number of attributes of the input data. The metrics of the five data sets are shown in Table 3.1.

Table 3.1: Metrics of data sets

Data set  Attribute Number  Depth  Training data  Testing data
A         400               2      100000         50000
B         200               4      100000         50000
C         400               4      100000         50000
D         600               4      100000         50000
E         400               6      100000         50000

The composition of the data is 90% with label 0 and 10% with label 1 for all training and testing sets. Furthermore, a singular rate of 5% is introduced in the data by randomly selecting 5% of the data points and reversing their labels.
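The singular rate can be sketched as a label-flipping step in plain Python. The sketch assumes that the 90%/10% composition describes the labels before flipping and that the 5% of points are chosen without replacement; the function name and the small sample of 1000 labels are hypothetical.

```python
import random


def flip_labels(labels, rate, rng):
    """Return a copy of `labels` with a fraction `rate` of the entries reversed."""
    noisy = list(labels)
    n_flip = round(rate * len(noisy))
    # Sampling indices without replacement flips each chosen point exactly once.
    for idx in rng.sample(range(len(noisy)), n_flip):
        noisy[idx] = 1 - noisy[idx]
    return noisy


rng = random.Random(7)
clean = [0] * 900 + [1] * 100  # 90% label 0, 10% label 1
noisy = flip_labels(clean, 0.05, rng)
changed = sum(c != n for c, n in zip(clean, noisy))
```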

With these five data sets, we are able to compare A, C and E to find the influence of depth on the performance of the machine learning algorithms. By comparing B, C and D, we are able to find the influence of data scale on the performance of the selected algorithms.

3.2 Evaluation Index

Each algorithm was trained on the data described in Section 3.1. Three indexes are used to evaluate performance: accuracy, training time and testing (predicting) time.

Here accuracy represents how well the prediction of a certain model matches the actual result in the testing set. The accuracy is computed as:

$$\text{Accuracy} = \frac{|y_{\text{predict}} - y_{\text{test}}|}{N} \tag{3.1}$$

where $y_{\text{predict}}$ is the vector of output predictions, $y_{\text{test}}$ is the vector of labels in the testing set and $N$ is the number of data points in the testing set, in our case $N = 50000$.
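A small sketch may make the metric concrete. Read literally, the ratio in Equation (3.1) counts the disagreements between the two vectors, so the sketch assumes that accuracy is meant as the share of test points whose labels agree, which for 0/1 labels equals one minus that ratio. The function name and the ten example labels are hypothetical.

```python
def accuracy(y_predict, y_test):
    """Fraction of test points whose predicted label equals the true label."""
    if len(y_predict) != len(y_test):
        raise ValueError("prediction and label vectors must have equal length")
    # For 0/1 labels, |p - t| is 1 exactly where prediction and label differ.
    mismatches = sum(abs(p - t) for p, t in zip(y_predict, y_test))
    return 1.0 - mismatches / len(y_test)


y_test = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
y_predict = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
score = accuracy(y_predict, y_test)  # 8 of the 10 labels agree
```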


In this project, accuracy is considered the main evaluation metric since it directly reflects the quality of prediction. With more accurate predictions, faults in scheduling can be found reliably, and the efficiency of shop floor manufacturing can therefore be increased by rescheduling the high-risk tasks.

Training time and testing time are also recorded and used as auxiliary metrics. The shop floor is not a static environment, and its conditions are updated at regular intervals. New orders and changes in machine status may generate new data with possibly new patterns. In this case, the models need to be retrained with new data and kept up to date in order to provide consistent predictions with high accuracy. Thus a very long training time of hours or days is not acceptable.

The last index is testing (predicting) time. Testing time shows how fast a model can predict given a schedule input. It is crucial that the model has a short enough testing time to perform real-time fault prediction. The faster the model predicts, the more likely it is that faults in scheduling can be pointed out on the shop floor in time.

All results for the three metrics are presented in Appendix A.

3.3 Comparison on Depths

Figure 3.2 depicts the test accuracy of the algorithms on data sets A, C and E in Table 3.1. In general, the test accuracy decreases as the depth (complexity) of the data increases. At depth 2, NB, MLP, Hopfield NN, CNN and DBN all achieve the highest accuracy of 0.9522, followed by SVM with 0.9389. At depth 4, the top two algorithms are DBN (0.9444) and CNN (0.9398); the accuracies of all other algorithms fall to around 0.86, except NB, which has the lowest accuracy of 0.7176. At depth 6, the top two algorithms remain DBN and CNN with accuracies of 0.9526 and 0.9521 respectively, while the accuracies of the other algorithms are between 0.86 and 0.88 and NB still has the lowest accuracy of 0.7728. Overall, two of the algorithms, CNN and DBN, maintain a high accuracy of 93%-95% regardless of the complexity of the data.


Figure 3.2: Comparison of accuracies with 400 attributes and different depth

The training times on data sets A, C and E are shown in Figure 3.3. It is obvious that SVM takes extremely long to train. The exact training times of the SVM classifier are 26257.2 s on data set A, 43988.6 s on data set C and 44012.8 s on data set E, as listed in the tables of results in Appendix A. The long training time of SVM is mainly due to its sensitivity to increases in dimension: the computational cost of SVM increases significantly as the dimension of the input data rises. Apart from the ensemble classifier on data set A (8261.27 s), all other algorithms have an acceptable training time of less than 7200 s, which guarantees a model update period of at most two hours.

The testing times on data sets A, C and E are shown in Figure 3.4. The testing times of all algorithms, except for SVM and the ensemble classifier, are rather short. With a prediction time of 0.1-1.5 s for 50000 data points, the expected prediction time for one data point is 0.002-0.03 ms, which satisfies the criterion of real-time prediction.
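The conversion from the testing time of a whole test set to the time per data point is simple arithmetic, sketched below with a hypothetical helper function.

```python
def per_point_ms(batch_seconds, n_points=50000):
    """Average prediction time per data point, in milliseconds."""
    return batch_seconds / n_points * 1000.0


fastest = per_point_ms(0.1)  # 0.1 s for the whole test set
slowest = per_point_ms(1.5)  # 1.5 s for the whole test set
```

The two bounds evaluate to 0.002 ms and 0.03 ms per data point, matching the range quoted above.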


Figure 3.3: Comparison of training time with 400 attributes and different depth

Figure 3.4: Comparison of testing time with 400 attributes and different depth


3.4 Comparison on Scales

Results of the experiments on data sets B, C and D are shown in Figure 3.5, Figure 3.6 and Figure 3.7. The comparison of accuracy shows a clear decrease as the number of attributes rises. For data with 200 attributes, SVM, Ensemble Classifier, MLP, Hopfield NN, CNN and DBN all achieve the highest accuracy of 0.9520, while LR, NB, RBFNN and SAEs have lower accuracies of around 0.86. For 400 attributes, DBN and CNN are the top two algorithms with accuracies of 0.9444 and 0.9398, while the performance of all other algorithms drops to around 0.86 and NB reaches its worst performance of 0.7176. As the scale increases further to 600, almost none of the ten selected classifiers can hold the accuracy above 90 percent. The algorithm that predicts best is CNN with an accuracy of 0.9014.

Figure 3.6 shows even more clearly how the training time of SVM grows as the number of attributes increases. Excluding SVM, the longest training time is 5183.27 s for DBN on data set D. In this case, the training times of all classifiers except SVM fit within the update cycle of at most two hours.

Figure 3.7 shows that only SVM and the Ensemble Classifier take significantly longer to predict than the other eight classifiers.

3.5 Discussion

According to the results in this chapter, the two best candidates in terms of fault prediction performance are CNN and DBN. On data sets C and E, as shown in Table A.3 and Table A.5 (in Appendix A), both deep-learning classifiers outperform the traditional machine learning classifiers NB, MLP and SVM, which indicates the superiority of CNN and DBN on these data. Regarding training time and testing time, both the CNN and DBN classifiers satisfy the criteria of an update training cycle of less than two hours and real-time prediction in less than 1 ms per data point. However, since DBN has a longer training time, the best candidate in this case is CNN. Only on data set A, as shown in Table A.1 (in Appendix A), with low complexity and medium scale, do the typical machine learning classifiers prevail, and the best candidate is NB since it has the smallest training time.
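The reasoning for data set A can be written out as a small selection rule. The rule used here is one reading of the discussion and is an assumption of the sketch: discard classifiers whose training time exceeds the two-hour update cycle, keep those with the highest accuracy, and break ties by the shortest training time. The accuracies and training times are copied from Table A.1; the function and variable names are hypothetical.

```python
# (accuracy, training time in seconds) for data set A, from Table A.1
results_a = {
    "LR": (0.8624, 7.83),
    "NB": (0.9522, 1.15),
    "SVM": (0.9389, 26257.2),
    "Ensemble": (0.8624, 8261.27),
    "MLP": (0.9522, 10.84),
    "RBFNN": (0.8624, 16.19),
    "Hopfield NN": (0.9522, 101.356),
    "CNN": (0.9522, 1584.41),
    "DBN": (0.9522, 4966.16),
    "SAEs": (0.8624, 3846.06),
}

MAX_TRAIN_SECONDS = 7200  # two-hour model update cycle


def best_candidate(results, max_train=MAX_TRAIN_SECONDS):
    """Highest accuracy among feasible classifiers, ties broken by training time."""
    feasible = {name: r for name, r in results.items() if r[1] <= max_train}
    top = max(acc for acc, _ in feasible.values())
    ties = [name for name, (acc, _) in feasible.items() if acc == top]
    return min(ties, key=lambda name: feasible[name][1])


best = best_candidate(results_a)
```

Under this rule the selection for data set A is NB, in line with the discussion above.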


Figure 3.5: Comparison of accuracies with depth of 4 and different attributes

Figure 3.6: Comparison of training time with depth of 4 and different attributes

Figure 3.7: Comparison of testing time with depth of 4 and different attributes

Based on the results of the tests on data set B, of small data scale, shown in Table A.2 (in Appendix A), the best choice of classifier is MLP, which has the same accuracy as DBN and CNN but a much shorter training time. For data set C, of medium data scale, shown in Table A.3 (in Appendix A), only DBN and CNN achieve acceptable prediction accuracy, which makes them the best two candidates on the medium-scale data set. For data set D, of large scale, shown in Table A.4 (in Appendix A), the best candidate is CNN because it is the only classifier that holds an acceptable accuracy.

Chapter 4

Conclusion and Future Work

4.1 Conclusion

Towards real-time fault prediction for scheduling on the shop floor, ten selected machine learning algorithms are programmed and tested. According to the complexity of machining resources on a shop floor, a novel model of hypothetical data structure is designed and presented, and it is used to generate five data sets which are applied to train and test the ten selected algorithms. The performance of the algorithms is evaluated by accuracy, training time and testing (predicting) time. From the results and discussion in Chapter 3, given different scheduling data on the shop floor, three conclusions are drawn as follows:

• For data with low complexity, the best candidate algorithm is the Naive Bayes classifier.

• For data with small scale, the best candidate algorithm is the multi-layer perceptron.

• For data with high complexity and data with large scale, the best candidate algorithm is the convolutional neural network.

4.2 Future Work

The main focus of this project lies in the design of a hypothetical data structure for the shop floor and in benchmarking selected algorithms. To further expand the study, there are four aspects where effort needs to be spent.

• It would be most beneficial to acquire real data from the shop floor and use it as the basis of optimization, creating real benefits by solving fault prediction on the shop floor for industrial purposes.

• It would be interesting to use part of one data point, for example the first half of its digits, to represent an ongoing task and feed it to the trained algorithms to find out how well faults can be predicted for ongoing tasks on the shop floor.

References

[1] W. Ji and L. Wang, “Big data analytics based fault prediction for shop floor scheduling”, Journal of Manufacturing Systems, vol. 43, pp. 187–194, 2017.

[2] A. S. Manne, “On the job-shop scheduling problem”, Operations Research, vol. 8, no. 2, pp. 219–223, 1960.

[3] B. Khoshnevis and Q. M. Chen, “Integration of process planning and scheduling functions”, Journal of Intelligent Manufacturing, vol. 2, no. 3, pp. 165–175, 1991.

[4] M. Rajkumar, P. Asokan, T. Page, and S. Arunachalam, “A GRASP algorithm for the integration of process planning and scheduling in a flexible job-shop”, International Journal of Manufacturing Research, vol. 5, no. 2, pp. 230–251, 2010.

[5] G. E. Vieira, J. W. Herrmann, and E. Lin, “Rescheduling manufacturing systems: A framework of strategies, policies, and methods”, Journal of Scheduling, vol. 6, no. 1, pp. 39–62, 2003.

[6] K. Iwamura, N. Okubo, Y. Tanimizu, and N. Sugimura, “Real-time scheduling for holonic manufacturing systems based on estimation of future status”, International Journal of Production Research, vol. 44, no. 18-19, pp. 3657–3675, 2006.

[7] G. Metan, I. Sabuncuoglu, and H. Pierreval, “Real time selection of scheduling rules and knowledge extraction via dynamically controlled data mining”, International Journal of Production Research, vol. 48, no. 23, pp. 6909–6938, 2010.

[8] S. R. Bonellie, “Use of multiple linear regression and logistic regression models to investigate changes in birthweight for term singleton infants in Scotland”, Journal of Clinical Nursing, vol. 21, no. 19pt20, pp. 2780–2788, 2012.


[9] D. W. Kononen, C. A. Flannagan, and S. C. Wang, “Identification and validation of a logistic regression model for predicting serious injuries associated with motor vehicle crashes”, Accident Analysis & Prevention, vol. 43, no. 1, pp. 112–122, 2011.

[10] A. Mair and A. I. El-Kadi, “Logistic regression modeling to assess groundwater vulnerability to contamination in Hawaii, USA”, Journal of Contaminant Hydrology, vol. 153, pp. 1–23, 2013.

[11] D. Westreich, J. Lessler, and M. J. Funk, “Propensity score estimation: Neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression”, Journal of Clinical Epidemiology, vol. 63, no. 8, pp. 826–833, 2010.

[12] S. Singh, T. J. Siddiqui, and S. K. Sharma, “Naïve Bayes classifier for Hindi word sense disambiguation”, in Proceedings of the 7th ACM India Computing Conference, ACM, 2014, p. 1.

[13] R. Zhen, Y. Jin, Q. Hu, Z. Shao, and N. Nikitakos, “Maritime anomaly detection within coastal waters based on vessel trajectory clustering and naïve Bayes classifier”, The Journal of Navigation, vol. 70, no. 3, pp. 648–670, 2017.

[14] H. Zhang, J.-X. Ren, Y.-L. Kang, P. Bo, J.-Y. Liang, L. Ding, W.-B. Kong, and J. Zhang, “Development of novel in silico model for developmental toxicity assessment by using naïve Bayes classifier method”, Reproductive Toxicology, vol. 71, pp. 8–15, 2017.

[15] K. Netti and Y. Radhika, “A novel method for minimizing loss of accuracy in naive Bayes classifier”, in Computational Intelligence and Computing Research (ICCIC), 2015 IEEE International Conference on, IEEE, 2015, pp. 1–4.

[16] Z.-y. Yan, C.-f. Xu, and Y.-h. Pan, “Improving naive Bayes classifier by dividing its decision regions”, Journal of Zhejiang University SCIENCE C, vol. 12, no. 8, p. 647, 2011.

[17] P. Tsangaratos and I. Ilia, “Comparison of a logistic regression and naïve Bayes classifier in landslide susceptibility assessments: The influence of models complexity and training dataset size”, Catena, vol. 145, pp. 164–179, 2016.

[18] C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.


[19] P.-S. Yu, T.-C. Yang, S.-Y. Chen, C.-M. Kuo, and H.-W. Tseng, “Comparison of random forests and support vector machine for real-time radar-derived rainfall forecasting”, Journal of Hydrology, vol. 552, pp. 92–104, 2017.

[20] V. Klass, M. Behm, and G. Lindbergh, “Capturing lithium-ion battery dynamics with support vector machine-based battery model”, Journal of Power Sources, vol. 298, pp. 92–101, 2015.

[21] N. Dong, H. Huang, and L. Zheng, “Support vector machine in crash prediction at the level of traffic analysis zones: Assessing the spatial proximity effects”, Accident Analysis & Prevention, vol. 82, pp. 192–198, 2015.

[22] R. Polikar, “Ensemble learning”, Scholarpedia, vol. 4, no. 1, p. 2776, 2009, revision #186077. DOI: 10.4249/scholarpedia.2776.

[23] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods”, Atmospheric Environment, vol. 80, pp. 426–437, 2013.

[24] A. Elowsson and A. Friberg, “Predicting the perception of performed dynamics in music audio with ensemble learning”, The Journal of the Acoustical Society of America, vol. 141, no. 3, pp. 2224–2242, 2017.

[25] K. P. Singh and S. Gupta, “In silico prediction of toxicity of non-congeneric industrial chemicals using ensemble learning based modeling approaches”, Toxicology and Applied Pharmacology, vol. 275, no. 3, pp. 198–212, 2014.

[26] J. Bai and J. Wang, “Improving malware detection using multi-view ensemble learning”, Security and Communication Networks, vol. 9, no. 17, pp. 4227–4241, 2016.

[27] G. Sarker, “An optimal backpropagation network for face identification and localization”, International Journal of Computers and Applications, vol. 35, no. 2, pp. 63–69, 2013.

[28] M. Mejía-Lavalle and G. Rodríguez-Ortiz, “Flashover forecasting on high-voltage insulators with a backpropagation neural net”, Canadian Journal of Electrical and Computer Engineering, vol. 21, no. 1, pp. 29–32, 1996.


[29] T. Chen, “Ubiquitous hotel recommendation using a fuzzy-weighted-average and backpropagation-network approach”, International Journal of Intelligent Systems, vol. 32, no. 4, pp. 316–341, 2017.

[30] D. S. Broomhead and D. Lowe, “Radial basis functions, multi-variable functional interpolation and adaptive networks”, Royal Signals and Radar Establishment Malvern (United Kingdom), Tech. Rep., 1988.

[31] B. Dey, A. Hossain, A. Bhattacharjee, R. Dey, and R. Bera, “Function approximation based energy detection in cognitive radio using radial basis function network”, Intelligent Automation & Soft Computing, vol. 23, no. 3, pp. 393–403, 2017.

[32] H. Nishikawa and S. Ozawa, “Radial basis function network for multitask pattern recognition”, Neural Processing Letters, vol. 33, no. 3, p. 283, 2011.

[33] H. Dai, W. Zhao, W. Wang, and Z. Cao, “An improved radial basis function network for structural reliability analysis”, Journal of Mechanical Science and Technology, vol. 25, no. 9, p. 2151, 2011.

[34] M. R. Sabour and S. M. A. Movahed, “Application of radial basis function neural network to predict soil sorption partition coefficient using topological descriptors”, Chemosphere, vol. 168, pp. 877–884, 2017.

[35] S. Sathasivam and W. A. T. W. Abdullah, “Logic learning in Hopfield networks”, arXiv preprint arXiv:0804.4075, 2008.

[36] K. Gurney, An introduction to neural networks. CRC Press, 2014.

[37] D. P. Srivastava, V. Sahni, and P. S. Satsangi, “Modelling microtubules in the brain as n-qudit quantum Hopfield network and beyond”, International Journal of General Systems, vol. 45, no. 1, pp. 41–54, 2016.

[38] Y. A. Basistov and Y. G. Yanovskii, “Comparison of image recognition efficiency of Bayes, correlation, and modified Hopfield network algorithms”, Pattern Recognition and Image Analysis, vol. 26, no. 4, pp. 697–704, 2016.

[39] Q. Zou, T. Zhao, Y. Liu, and M. Guo, “Predicting RNA secondary structure based on the class information and Hopfield network”, Computers in Biology and Medicine, vol. 39, no. 3, pp. 206–214, 2009.


[40] J. Li and G. Serpen, “Adaptive and intelligent wireless sensor networks through neural networks: An illustration for infrastructure adaptation through Hopfield network”, Applied Intelligence, vol. 45, no. 2, pp. 343–362, 2016.

[41] W. Sun, T.-L. B. Tseng, J. Zhang, and W. Qian, “Enhancing deep convolutional neural network scheme for breast cancer diagnosis with unlabeled data”, Computerized Medical Imaging and Graphics, vol. 57, pp. 4–9, 2017.

[42] M. Hajinoroozi, Z. Mao, T.-P. Jung, C.-T. Lin, and Y. Huang, “EEG-based prediction of driver’s cognitive performance by deep convolutional neural network”, Signal Processing: Image Communication, vol. 47, pp. 549–555, 2016.

[43] Y. Zhang, D. Zhao, J. Sun, G. Zou, and W. Li, “Adaptive convolutional neural network and its application in face recognition”, Neural Processing Letters, vol. 43, no. 2, pp. 389–399, 2016.

[44] H. A. Perlin and H. S. Lopes, “Extracting human attributes using a convolutional neural network approach”, Pattern Recognition Letters, vol. 68, pp. 250–259, 2015.

[45] O. Janssens, V. Slavkovikj, B. Vervisch, K. Stockman, M. Loccufier, S. Verstockt, R. Van de Walle, and S. Van Hoecke, “Convolutional neural network based fault detection for rotating machinery”, Journal of Sound and Vibration, vol. 377, pp. 331–345, 2016.

[46] G. E. Hinton, “Deep belief networks”, Scholarpedia, vol. 4, no. 5, p. 5947, 2009, revision #91189. DOI: 10.4249/scholarpedia.5947.

[47] Y. Liu, S. Zhou, and Q. Chen, “Discriminative deep belief networks for visual data classification”, Pattern Recognition, vol. 44, no. 10-11, pp. 2287–2296, 2011.

[48] H. B. Huang, X. R. Huang, R. X. Li, T. C. Lim, and W. P. Ding, “Sound quality prediction of vehicle interior noise using deep belief networks”, Applied Acoustics, vol. 113, pp. 149–161, 2016.

[49] F. Shen, J. Chao, and J. Zhao, “Forecasting exchange rate using deep belief networks and conjugate gradient method”, Neurocomputing, vol. 167, pp. 243–253, 2015.


[50] D.-l. Feng, M.-q. Xiao, Y.-x. Liu, H.-f. Song, Z. Yang, and Z.-w. Hu, “Finite-sensor fault-diagnosis simulation study of gas turbine engine using information entropy and deep belief networks”, Frontiers of Information Technology & Electronic Engineering, vol. 17, no. 12, pp. 1287–1304, 2016.

[51] D. Wu, W. Pan, L. Xie, and C. Huang, “An adaptive stacked denoising auto-encoder architecture for human action recognition.”, Applied Mechanics & Materials, 2014.

[52] H.-I. Suk, S.-W. Lee, D. Shen, A. D. N. Initiative, et al., “Latent feature representation with stacked auto-encoder for AD/MCI diagnosis”, Brain Structure and Function, vol. 220, no. 2, pp. 841–859, 2015.

[53] E. P. Ijjina et al., “Classification of human actions using pose-based features and stacked auto encoder”, Pattern Recognition Letters, vol. 83, pp. 268–277, 2016.

[54] S. Flennerhag, Ml-ensemble, Nov. 2017. DOI: 10.5281/zenodo.1042144. [Online]. Available: https://dx.doi.org/10.5281/zenodo.1042144.

Appendix A

Results of All Tests

Table A.1: Data set A: depth=2, attribute=400

Algorithm    Accuracy  Training Time (s)  Testing Time (s)
LR           0.8624    7.83               0.17
NB           0.9522    1.15               0.81
SVM          0.9389    26257.2            249.15
Ensemble     0.8624    8261.27            339.693
MLP          0.9522    10.84              0.11
RBFNN        0.8624    16.19              7.94
Hopfield NN  0.9522    101.356            0.29
CNN          0.9522    1584.41            7.40
DBN          0.9522    4966.16            1.32
SAEs         0.8624    3846.06            1.42984


Table A.2: Data set B: depth=4, attribute=200

Algorithm    Accuracy  Training Time (s)  Testing Time (s)
LR           0.8623    4.30               0.11
NB           0.8586    0.58               0.41
SVM          0.9520    9004.03            0.6707
Ensemble     0.9520    1162.39            150.609
MLP          0.9520    6.87               0.08
RBFNN        0.8623    15.72              7.69


Table A.3: Data set C: depth=4, attribute=400

Algorithm    Accuracy  Training Time (s)  Testing Time (s)
LR           0.8616    7.54               0.18
NB           0.7176    1.14               0.84
SVM          0.8616    43988.60           487.96
Ensemble     0.8616    2112.59            330.41
MLP          0.8616    11.88              0.11
RBFNN        0.8616    16.56              8.14



Table A.4: Data set D: depth=4, attribute=600

Algorithm    Accuracy  Training Time (s)  Testing Time (s)
LR           0.8621    10.84              0.11
NB           0.7201    2.01               0.41
SVM          0.8621    111845.0           460.42
Ensemble     0.8621    3444.44            703.25
MLP          0.8621    14.06              0.17
RBFNN        0.8621    16.77              8.59


Table A.5: Data set E: depth=6, attribute=400

Algorithm    Accuracy  Training Time (s)  Testing Time (s)
LR           0.8624    7.54               0.16
NB           0.7728    1.17               0.83
SVM          0.8624    44012.80           492.34
Ensemble     0.8624    2155.59            335.69
MLP          0.8503    35.52              0.99
RBFNN        0.8624    16.46              8.18


TRITA -ITM-EX- 2018:555

www.kth.se
