Best Basis Intelligent Monitoring


Mechanical Systems and Signal Processing 19 (2005) 357–370

    Best basis-based intelligent machine fault diagnosis

    S. Zhang, J. Mathew, L. Ma, Y. Sun

    CRC for Integrated Engineering Asset Management, School of Mechanical, Manufacturing and Medical Engineering,

    Queensland University of Technology, Brisbane, QLD 4001, Australia

    Received 5 April 2004; received in revised form 21 May 2004; accepted 16 June 2004

    Abstract

The wavelet packet transform decomposes a signal into a set of bases for time–frequency analysis. This

    decomposition creates an opportunity for implementing distributed data mining where features are

extracted from different wavelet packet bases and serve as feature vectors for applications. This paper

    presents a novel approach for integrated machine fault diagnosis based on localised wavelet packet bases of

    vibration signals. The best basis is firstly determined according to its classification capability. Data mining

    is then applied to extract features and local decisions are drawn using Bayesian inference. A final conclusion

    is reached using a weighted average method in data fusion. A case study on rolling element bearing

    diagnosis shows that this approach can greatly improve the accuracy of diagnosis.

© 2004 Elsevier Ltd. All rights reserved.

    Keywords: Wavelet packet transform; Best basis; Fault diagnosis; Bayesian inference; Data mining/fusion

    1. Introduction

    Condition monitoring is an important part of the process of modern equipment maintenance.

Its implementation typically consists of data acquisition, feature extraction, condition identification and fault diagnosis [1]. Researchers in the field have tended to focus on two areas

    for their work. The extraction of features that represent the faults in some way is an identified area

    of work. The other is design and implementation of an automatic fault diagnosis procedure.

    ARTICLE IN PRESS

    www.elsevier.com/locate/jnlabr/ymssp

0022-460X/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

    doi:10.1016/j.ymssp.2004.06.001

    Corresponding author.

    E-mail address: [email protected] (S. Zhang).


    Various methods are available for feature extraction. For example, statistical methods are used

    to derive time-domain features, such as signal energy and kurtosis. The fast Fourier transform

(FFT) is a traditional tool to extract frequency-domain features. Joint time–frequency features

which can be generated by short-time Fourier transforms are increasingly used since the majority of real-world signals are essentially time varying. In the past two decades, the wavelet transform

(WT) and wavelet packet transform (WPT) [2,3], have been researched and applied in a variety of

ways [4]. More particularly, in machine fault diagnostics, WT and WPT have become preferred

techniques to the traditional FFT method in the analysis of transient signals [5–7].

    The features extracted from signals build a foundation for subsequent condition identification

    and fault diagnosis. On the other hand, different approaches have been developed to design

    condition classifiers, aimed at enhancing the accuracy of diagnosis and automating the diagnosis

    procedure. Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and

    Bayesian inference are known statistical methods. Modern methods, such as neural networks,

fuzzy logic and expert systems, are preferred due to their intelligent properties. Some integrated approaches consider different signal features in a combined fashion to enhance the accuracy of

    diagnosis [8].

    In this work, the authors propose a novel approach to conduct integrated fault diagnosis based

    on the best bases of the WPT of vibration signals, using data mining and fusion. The best bases

    of WPT are firstly selected according to their classification capability. Features are then

extracted from each individual best basis and local decisions are made by classifiers such as Bayesian

    inference. A final conclusion is reached using the decision-fusion technique, where the

classification capabilities of the best bases serve as the decision weights. This proposed

    approach is similar to the distributed data mining approach (DDM), which generally starts from

    local data analysis and subsequently generates a global model [9]. However, the proposed

approach has not been previously reported in work related to wavelet packet-based fault diagnosis in the literature.

    This paper is arranged as follows. Section 2 presents the techniques used in this work, such as

    WPT, best basis selection, Bayesian inference, data mining and fusion. Section 3 describes the

    integrated procedure for fault classification by fusing local information from each best basis of

    wavelet packets. The proposed method is validated using signals from faulty rolling element

bearings in Section 4. In addition, back-propagation (BP) neural networks are compared as an alternative classifier. The conclusions are presented in Section 5.

    2. Brief introduction of the techniques

    2.1. WPT

    Both WT and WPT have continuous and discrete formats. The discrete format of WPT was

    adopted in this work because it is more popularly used in engineering applications. To illustrate

the underlying mathematical theory of WPT briefly, we denote $\{h_k\}_{k \in Z}$ and $\{g_k\}_{k \in Z}$ as the quadrature mirror filter banks. A signal can be decomposed at different scales on the basis

functions of the form $2^{-j/2}u_n(2^{-j}t - k)$, $j, k \in Z$, $n \in Z^+$, where $Z$ denotes the integers and $Z^+$


denotes the non-negative integers. These functions are iterated as

$$u_{2n}(t) = \sqrt{2}\sum_{k \in Z} h_k u_n(2t - k), \qquad (1)$$

$$u_{2n+1}(t) = \sqrt{2}\sum_{k \in Z} g_k u_n(2t - k), \qquad (2)$$

where $j$ is a scale parameter, $k$ is a time localisation parameter and $n$ is an oscillation parameter. Thus, $u_0(t)$ is a scale function which corresponds to a low-pass filter; the filtered signal is an approximation of the analysed signal. The function $u_1(t)$ is a wavelet function which corresponds to a high-pass filter; the filtered signal is a detail of the analysed signal.

The approximation and detail can be further sliced by dyadic decomposition using the dilated and translated scale functions and wavelet functions. Consequently, WPT generates a binary tree, with $2^j$ bases at decomposition level $j$. Each basis is indexed by a pair of integers $(j, k)$. The binary structure of the tree enables WPT to be used in various applications. For signal representation, for example, a signal can be reconstructed from wavelet packet coefficients

    confined in some specific frequency bands. For pattern recognition, features can be extracted from

    different wavelet packet bases. In addition, the distributed best bases create opportunities for

    feature extraction and combination, where data mining, a convergence of knowledge discovering

    techniques [10], can play an important role. Based on the features of each best basis, local

    decisions can be made by a classifier.
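The iterated filter-bank construction of Eqs. (1) and (2) can be sketched numerically. The following minimal illustration uses the Haar quadrature mirror filter pair rather than the longer Db20 filters used in the paper's case study; each tree node is split into an approximation and a detail by filtering and downsampling by two.

```python
import numpy as np

# Haar quadrature mirror filter pair {h_k} (low-pass) and {g_k} (high-pass).
# The paper's case study uses Db20; Haar keeps this sketch short.
H = np.array([1.0, 1.0]) / np.sqrt(2.0)
G = np.array([1.0, -1.0]) / np.sqrt(2.0)

def split(x):
    """One filter-bank step: filter with h and g, then downsample by two."""
    return np.convolve(x, H)[1::2], np.convolve(x, G)[1::2]

def wpt(x, levels):
    """Wavelet packet binary tree; tree[j][k] holds the data of basis (j, k)."""
    tree = {0: {0: np.asarray(x, dtype=float)}}
    for j in range(levels):
        tree[j + 1] = {}
        for k, node in tree[j].items():
            a, d = split(node)
            tree[j + 1][2 * k] = a        # u_{2n}: approximation branch, Eq. (1)
            tree[j + 1][2 * k + 1] = d    # u_{2n+1}: detail branch, Eq. (2)
    return tree

x = np.arange(8, dtype=float)
tree = wpt(x, 3)
print(len(tree[3]))  # 8 bases at level 3
```

Because the Haar pair is orthonormal, the total energy at each level equals the energy of the original signal, which is a quick sanity check on the decomposition.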

    2.2. Best basis selection

The binary tree of bases can also be considered as a 2-D time–frequency plane. The information

in the bases is redundant along two axes, i.e. information in the child bases overlaps with that

    in the parent basis. The best basis is preferably selected from the binary tree, so as to reduce the

data analysis effort without losing information. For signal representation, best bases are defined such

that they cover the complete horizontal axis while not overlapping along the vertical axis [11]. This

definition results in a complete tree and ensures no redundant information. The Shannon entropy-

based criterion [12] is well suited to the selection of the complete tree. When signals come from

different classes and a common best basis is required, a WPT-structured tree [13,14] is used for the

best basis selection. For pattern recognition, common best bases are selected such that they have the

best classification capability; they do not necessarily form a complete tree

[15,16]. In this work, the best bases were searched to guarantee class separation, since fault diagnosis is essentially about pattern recognition.

Suppose there are $c$ classes $\omega_i$, $i = 1, \ldots, c$, in a classification problem, and $s_i$ denotes the cluster centre of the $i$th class $\omega_i$; then the normalised distance between two classes $i$ and $j$ is

$$d_{i,j} = \frac{\|s_i - s_j\|}{\sum_{i=1}^{c-1}\sum_{j=i+1}^{c} \|s_i - s_j\|}. \qquad (3)$$

The minimal distance $\inf\{d_{i,j}\}$ is selected as the discriminant distance for best basis selection

$$d = \min\{d_{i,j}\}, \quad i = 1, \ldots, c-1, \; j = i+1, \ldots, c. \qquad (4)$$


Apparently, a larger $d$ indicates a better capability of classification. It is noted that the minimal

    distance, rather than other measures, such as mean distance, is adopted. This choice assists in the

    determination of a best basis in which the classes are relatively well separated.
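Eqs. (3) and (4) can be sketched as follows; the cluster centres are hypothetical values, not taken from the paper's data.

```python
import numpy as np
from itertools import combinations

def discriminant_distance(centres):
    """Eqs. (3) and (4): normalised pairwise distances between cluster centres
    and their minimum, used to score a basis's classification capability."""
    pairs = list(combinations(range(len(centres)), 2))
    raw = {(i, j): float(np.linalg.norm(centres[i] - centres[j])) for i, j in pairs}
    total = sum(raw.values())
    d_ij = {p: v / total for p, v in raw.items()}
    return min(d_ij.values()), d_ij

# Hypothetical cluster centres for c = 3 fault classes in a 2-D feature space.
centres = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 4.0])]
d, d_ij = discriminant_distance(centres)
print(round(d, 4))  # 0.25: the (0, 1) pair is the closest after normalisation
```

A basis would be selected when its $d$ is large, i.e. when even the two closest classes remain well separated.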

    2.3. Bayesian inference for classification

Bayesian inference is an application of Bayes' theorem and has been used as a fundamental

classifier for pattern recognition [17,18]. Bayesian inference works by assigning an unknown

pattern $x$ to the class which has the highest posterior probability. According to Bayes' theorem,

    the posterior probability is given by

$$P(\omega_i|x) = \frac{P(\omega_i)P(x|\omega_i)}{P(x)}, \qquad (5)$$

where $P(\omega_i)$ is the prior probability of class $\omega_i$ and $P(x|\omega_i)$ is the class-conditional probability which represents the probability distribution of $x$ in class $\omega_i$. The total probability $P(x)$ is given by

$$P(x) = \sum_{i=1}^{c} P(\omega_i)P(x|\omega_i). \qquad (6)$$

    To obtain the posterior probability, the prior probability and class-conditional probability

    must be known. The prior probability can be inferred from prior knowledge of the application,

    estimated from the data or assumed to be equal. The class-conditional probability can be

estimated from the data using either parametric or non-parametric methods. For simplicity, the

parametric multivariate normal distribution is commonly used as an approximation in probability density estimation in the case of multivariate features. If the underlying distribution does not follow a

normal probability distribution, non-parametric density estimation provides an alternative

approach. In this work, the signal energy and kurtosis were extracted from each best basis as

scalar features separately. Their distributions were estimated using both parametric and non-

parametric methods.
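As an illustration of Eqs. (5) and (6), here is a minimal sketch with univariate normal class-conditional densities; the per-class parameters are hypothetical, not estimates from the paper's data.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density, used here as the class-conditional P(x|w_i)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posteriors(x, class_params, priors):
    """Eqs. (5) and (6): P(w_i|x) = P(w_i) P(x|w_i) / sum_j P(w_j) P(x|w_j)."""
    joint = np.array([p * gaussian_pdf(x, mu, s)
                      for p, (mu, s) in zip(priors, class_params)])
    return joint / joint.sum()

# Hypothetical per-class (mean, std) of a scalar feature such as band energy.
class_params = [(0.0, 1.0), (4.0, 1.0), (8.0, 1.0)]
priors = [1.0 / 3, 1.0 / 3, 1.0 / 3]
post = posteriors(3.5, class_params, priors)
print(int(np.argmax(post)))  # the pattern is assigned to class 1
```

With equal priors and equal variances, the decision reduces to picking the class whose mean is nearest to the observed feature value.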

    2.4. Data fusion at the decision level

    Distributed data resources, such as the distributed sensors, require the integration of local

information to make a final decision. The data-fusion technique provides such a solution and has been successfully used in military and civilian applications [17]. Data fusion helps improve the

    identification accuracy in pattern classification and is typically performed at three levels, i.e. (1)

    sensor-level fusion, (2) feature-level fusion, and (3) decision-level fusion. More recently, the

decision-level fusion has been termed classifier fusion [19–21]. In the work reported in this paper,

    local decisions were drawn from each best basis of wavelet packets. The decision-level data fusion

    is therefore used for integration. Different methods are used for decision-level fusion, such as the

weighted average method, majority voting technique, Bayesian inference and the Dempster–Shafer

    method. The discriminant distance in Eq. (4) supplies a reasonable decision weight for each best

    basis. As a result, the weighted average method was adopted for decision fusion in this work.
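A minimal sketch of the weighted average method follows, assuming each best basis has already produced a local posterior vector and a discriminant-distance weight; the numbers are hypothetical.

```python
import numpy as np

def fuse(local_posteriors, weights):
    """Weighted-average fusion: fused P(w_i|x) = sum_j w_j * P_j(w_i|x)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalise the discriminant-distance weights
    return w @ np.asarray(local_posteriors, dtype=float)

# Hypothetical local posteriors from three best bases over three fault classes,
# with discriminant distances (Eq. (4)) used as the raw weights.
local = [[0.7, 0.2, 0.1],
         [0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3]]
weights = [0.5, 0.3, 0.2]
fused = fuse(local, weights)
print(int(np.argmax(fused)))  # final decision: class 0
```

Note that the third basis disagrees with the first two; the weighting lets the more discriminative bases dominate the fused decision.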


    3. A procedure to implement integrated fault diagnosis

    A procedure using the above techniques to implement integrated fault diagnosis based on

    wavelet packets is illustrated in Fig. 1. It has the following steps.

    (1) Wavelet packet transformation of signals: The m signals from c signal classes collected for

    training and testing classifiers are decomposed by WPT. This step results in m wavelet packet

    trees.

    (2) Common best basis selection: The discriminant distance (Eq. (4)) is applied for the selection of

a set of n best bases from the m binary trees.

    (3) Feature extraction from best basis: The data in a best basis is essentially a time-domain signal

    confined in a specific frequency band. Features, such as signal energy or signal kurtosis, are

extracted to construct a feature vector $x$. A local feature set for a best basis from the $m$ signals is $X = \{x\}$.

(4) Decision making on local feature set: The class-conditional probability is firstly estimated from the feature set $X$. Given the prior probability, the posterior probability $P(\omega_i|x)$ is computed using Bayesian inference for the unknown signal with a feature vector $x$ in a best basis. The posterior

    probability represents which class x belongs to and is prepared for final decision making.

    (5) Data fusion for final decision making: The weighted average method is adopted for decision

    fusion, while Bayesian inference produces probabilities or confidence values corresponding to

each class in an individual best basis. For a specific class $\omega_i$, a fused probability is given by

$$P(\omega_i|x) = \sum_{j=1}^{n} w_{i,j} P(\omega_{i,j}|x), \qquad (7)$$

where $w_{i,j}$ is the normalised weight given by Eq. (4). A final decision that $x$ is assigned to class

$I$ is made by selecting the maximal averaged posterior probability,

$$P(\omega_I|x) = \max_i P(\omega_i|x), \quad i = 1, \ldots, c. \qquad (8)$$


[Fig. 1 block diagram: a signal is decomposed by WPT into best bases 1 to n; each best basis yields a feature set, each feature set feeds a Bayesian inference module, and the n local decisions are combined in a fusion centre.]

Fig. 1. A procedure for wavelet packets-based fault diagnosis by data mining/fusion.


Alternatively, a binary classification method can be used by voting for the highest posterior

probability (Eq. (9)). This majority voting technique takes the winner-take-all principle:

$$P(\omega_I|x) = \begin{cases} 1, & \max P(\omega_i|x), \; i = 1, \ldots, c, \\ 0, & \text{otherwise}. \end{cases} \qquad (9)$$

    This procedure facilitates an automatic integrated fault diagnosis approach, since WPT, best

basis selection and Bayesian inference can all be carried out computationally.
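The winner-take-all rule of Eq. (9) reduces the fused posteriors to a binary decision vector; a short sketch with a hypothetical fused posterior:

```python
import numpy as np

def winner_take_all(fused_posteriors):
    """Eq. (9): 1 for the class with the maximal fused posterior, 0 for the others."""
    out = np.zeros_like(fused_posteriors)
    out[np.argmax(fused_posteriors)] = 1.0
    return out

fused = np.array([0.52, 0.31, 0.17])  # hypothetical fused posteriors from Eq. (7)
print(winner_take_all(fused))  # [1. 0. 0.]
```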

    4. A case study

    Rolling element bearings are key components in mechanical systems. Their failures account for

    a large percentage of breakdowns in rotating machinery. Some of these breakdowns can be

catastrophic. Conducting diagnosis and prognosis on bearings is therefore fundamental to maintaining the integrity of mechanical systems.

    In this case study, experimental data of faulty ball bearings were used to test our methodology.

    For these bearings, a single defect was introduced by laser processing on the outer-race, inner-race

    and ball, respectively. The data were collected under different operation conditions, i.e. different

    speeds and loads, to ensure that broad conditions are covered for the benefit of the generalisation

    of classifiers.

    Seven hundred samples were acquired for each fault class. Among them, 600 samples were used

    for classifier training, while 100 samples were used for classifier testing. Since three types of faults

    were involved, there were a total of 2100 samples.

Following the procedure in Section 3, the signals were decomposed by WPT up to level 3 using

    Db20 wavelets. Fig. 2 illustrates a signal from the faulty outer-race and its WPT. The signal

    energy and kurtosis were adopted as features separately and formed the training and testing


    Fig. 2. WPT for an outer race signal.


datasets [22,23]. Tables 1 and 2 list the normalised discriminant distances for all nodes. The six

selected common best bases are illustrated in Figs. 3 and 4 for the energy and kurtosis features,

respectively. The corresponding decision weights measured by the discriminant distances are

shown in Tables 3 and 4. For each best basis, it was assumed that the prior probabilities for the three faults were equal,

i.e. $P(\omega_i) = 1/3$, $i = 1, 2, 3$, and the class-conditional probabilities were estimated from the training datasets. According to Bayesian inference (Eqs. (5) and (6)), the testing signals were classified to

    reach the local decisions, which were further fused (Eq. (7)) to produce a final decision using

    Eq. (8) or (9).
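The two features used throughout the case study can be computed from the data in a best basis as follows; this is a generic sketch, with kurtosis taken as the fourth standardised moment (the exact normalisation used in the paper is not stated).

```python
import numpy as np

def energy(x):
    """Signal energy of the data in one best basis."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def kurtosis(x):
    """Fourth standardised moment (Pearson kurtosis; equals 3.0 for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    s2 = np.mean((x - m) ** 2)
    return float(np.mean((x - m) ** 4) / s2 ** 2)

x = np.array([1.0, -1.0, 1.0, -1.0])
print(energy(x), kurtosis(x))  # 4.0 1.0
```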


    Table 1

Discriminant distance (energy), listed by wavelet packet tree level

Level 0: 0.1865

Level 1: 0.3949 0.1024

Level 2: 0.2558 0.2758 0.1148 0.1021

Level 3: 0.4918 0.2043 0.2306 0.4812 0.0618 0.1430 0.0897 0.2886

    Table 2

Discriminant distance (kurtosis), listed by wavelet packet tree level

Level 0: 0.4898

Level 1: 0.2026 0.4791

Level 2: 0.3257 0.3362 0.1705 0.3982

Level 3: 0.0355 0.4888 0.4676 0.0236 0.1730 0.4503 0.3836 0.3728

[Fig. 3 tree diagram: selected nodes (1,0), (2,0), (2,1), (3,0), (3,3), (3,7).]

Fig. 3. Common best basis (energy).



    Table 3

    Decision weights (energy)

    Node (1,0) (2,0) (2,1) (3,0) (3,3) (3,7)

    Weight 0.1805 0.1169 0.1261 0.2248 0.2199 0.1319

    Table 4

    Decision weights (kurtosis)

    Node (0,0) (1,1) (2,3) (3,1) (3,2) (3,5)

    Weight 0.1766 0.1727 0.1436 0.1762 0.1686 0.1623

[Fig. 4 tree diagram: selected nodes (0,0), (1,1), (2,3), (3,1), (3,2), (3,5).]

Fig. 4. Common best basis (kurtosis).

Fig. 5. ASH estimated distribution (energy).


To obtain the class-conditional probabilities, the features were assumed to follow either a normal

distribution or an unknown distribution. For the normal distribution, the mean and variance are

estimated relatively easily. The averaged shifted histogram (ASH), a non-parametric estimation

technique, was used to estimate the unknown probability distribution [24].
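A compact sketch of the ASH estimator follows, using the usual construction (counts on narrow bins of width h/m combined with triangular weights, equivalent to averaging m shifted histograms of bin width h); the parameter values and data are illustrative, not those of the case study.

```python
import numpy as np

def ash(data, lo, hi, h=1.0, m=5):
    """Averaged shifted histogram: counts on narrow bins of width h/m are
    combined with triangular weights (1 - |i|/m)."""
    data = np.asarray(data, dtype=float)
    delta = h / m
    nbins = int(round((hi - lo) / delta))
    counts, edges = np.histogram(data, bins=nbins, range=(lo, hi))
    dens = np.zeros(nbins)
    for i in range(1 - m, m):
        shifted = np.roll(counts, i).astype(float)
        if i > 0:
            shifted[:i] = 0.0          # discard values wrapped around by roll
        elif i < 0:
            shifted[i:] = 0.0
        dens += (1.0 - abs(i) / m) * shifted
    dens /= len(data) * h              # normalise so the estimate integrates to ~1
    return edges[:-1] + delta / 2, dens

centres, dens = ash([4.9, 5.0, 5.1, 5.2], lo=0.0, hi=10.0)
print(round(float(dens.sum()) * 0.2, 6))  # integral of the estimate over the range
```

The triangular weighting smooths out the bin-origin sensitivity of a single histogram, which is why ASH is a convenient plug-in density estimate for Eq. (5).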


Fig. 6. Estimated normal distribution (energy).

Fig. 7. ASH estimated distribution (kurtosis).


    The misclassification rate was calculated using the winner-take-all principle.

$$r_i = 1 - \frac{1}{J}\sum_{j=1}^{J} b_{i,j}, \qquad (11)$$

where $b_{i,j}$ was 1 if the related result of Bayesian inference for class $i$ was maximum, and otherwise it

was 0.
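Eq. (11) can be sketched directly, with hypothetical posteriors for J = 4 test samples:

```python
import numpy as np

def misclassification_rate(posteriors, true_class):
    """Eq. (11): r_i = 1 - (1/J) * sum_j b_ij, where b_ij = 1 when the
    posterior of the true class i is maximal for test sample j."""
    P = np.asarray(posteriors, dtype=float)      # shape (J, number of classes)
    b = np.argmax(P, axis=1) == true_class
    return 1.0 - float(b.mean())

# Hypothetical posteriors for J = 4 test samples that all belong to class 0.
P = [[0.8, 0.1, 0.1],
     [0.6, 0.3, 0.1],
     [0.2, 0.7, 0.1],   # the one misclassified sample
     [0.5, 0.3, 0.2]]
print(misclassification_rate(P, 0))  # 0.25
```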

Table 5 shows that when either signal energy or signal kurtosis was employed as the feature, all 100

test signals in each class were correctly classified. However, employing energy as the feature

resulted in significantly higher decision confidences. It is concluded that signal energy is the better

feature choice. Another finding is that the classification results by ASH estimation are comparable with

those obtained by normal distribution estimation for each feature case. This finding suggests that

    both probability estimation methods work well for the case study.

For classification problems, a feature vector may alternatively be built such that its elements come

from different best bases. Instead of using the DDM approach, a final decision can be made directly from this feature vector. A matched classifier is required. However, if Bayesian

inference is used, the assumption of multivariate normality for the feature vectors is

always violated, resulting in unacceptable misclassifications. Non-parametric multivariate

probability density estimation is also difficult to implement in this case. As a comparison, a

6–nh–3 BP neural network [25,26] was designed. In the three-layer neural network, six input nodes corresponded to the features extracted from the six best bases, and three output nodes

corresponded to the three types of faults. The target outputs were [1, 0, 0], [0, 1, 0] and [0, 0, 1],

respectively. The number of hidden nodes, nh, was varied from 5 to 20 to reach an optimal design.

The signal features in each common best basis were concatenated into a normalised feature vector

which constructed the training and testing datasets. During network training, the cross-validation

technique [26] was used to prevent overfitting. Four-fifths of the training samples were used for


[Fig. 9: training and validation error (log scale) versus epoch, with the goal level marked.]

Fig. 9. Learning curve (energy).


    training and one-fifth was used for validation. The maximum iteration was 1000 and the target

    error was 0.00001. The training of the BP neural networks ceased when either the maximum

iteration or the target error was reached. Another criterion to stop training was cross-validation. The initial values of the weights and biases of the networks were randomly set. It was found

that nh = 15 generated the best results for energy feature vectors, and nh = 13 generated the best results for

kurtosis feature vectors. Figs. 9 and 10 present the learning curves. In our case study, cross-validation stopped the training of the BP neural networks with an acceptable total error lower than

    0.01. In addition, 230 and 108 epochs were required for the energy and kurtosis features,

    respectively. The classification results are also listed in Table 5. Similarly, the averaged confidence

    and misclassification rate were computed by Eqs. (10) and (11).
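The 6–nh–3 BP network can be sketched in outline as follows. The data here are synthetic 6-D clusters standing in for the bearing feature vectors, and the validation split and early-stopping logic described above are omitted for brevity; this is a minimal batch gradient-descent sketch, not the paper's trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

# Synthetic stand-in data: three well-separated 6-D clusters, one per fault class.
means = np.array([[0, 0, 0, 0, 0, 0],
                  [3, 3, 3, 0, 0, 0],
                  [0, 0, 0, 3, 3, 3]], dtype=float)
X = np.vstack([mu + 0.3 * rng.standard_normal((30, 6)) for mu in means])
T = np.repeat(np.eye(3), 30, axis=0)          # targets [1,0,0], [0,1,0], [0,0,1]

nh = 15                                        # hidden nodes (best for energy per the paper)
W1 = 0.1 * rng.standard_normal((6, nh)); b1 = np.zeros(nh)
W2 = 0.1 * rng.standard_normal((nh, 3)); b2 = np.zeros(3)
lr = 0.5

for epoch in range(2000):                      # batch gradient descent on MSE loss
    H = sigmoid(X @ W1 + b1)                   # hidden activations
    Y = sigmoid(H @ W2 + b2)                   # network outputs
    dY = (Y - T) * Y * (1 - Y)                 # output-layer delta
    dH = (dY @ W2.T) * H * (1 - H)             # back-propagated hidden delta
    W2 -= lr * H.T @ dY / len(X); b2 -= lr * dY.mean(axis=0)
    W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(axis=0)

pred = np.argmax(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), axis=1)
accuracy = float(np.mean(pred == np.argmax(T, axis=1)))
print(accuracy)
```

On such well-separated synthetic clusters the network fits the training set easily; the paper's harder comparison was against real concatenated bearing features, where the BP results deteriorated.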

From Table 5, it is clear that the BP networks relying on the concatenated feature vectors produced

deteriorated classification results. Misclassifications occurred in each fault class for both individual

features, and the signal kurtosis led to the poorer results. The comparison suggests that the

proposed integrated method significantly outperforms the adopted BP neural networks for

classification.

    5. Conclusion

    This work has presented an automatic fault classification technique based on the WPT and best

    basis selection. The novel approach performs integrated fault diagnosis based on vibration

    signals. The following conclusions are drawn:

(1) Signals can be classified (diagnosed) based on the best bases of WPT. For each best basis, its

capability to discriminate between classes served as the decision weight for final decision fusion.


[Fig. 10: training and validation error (log scale) versus epoch, with the goal level marked.]

Fig. 10. Learning curve (kurtosis).


    (2) Both signal energy and kurtosis can be used to classify the signals 100% correctly by the

    integrated method. Signal energy, however, resulted in higher decision confidences and is

    preferred.

(3) The probability estimation methods by ASH and normal estimation led to comparable results in the case study.

(4) BP neural networks employing concatenated feature vectors, with elements coming from

individual best bases, produced deteriorated classification results in terms of both misclassification rate

and decision confidence.

    References

[1] A. Davies, Handbook of Condition Monitoring: Techniques and Methodology, Chapman & Hall, UK, 1998.

[2] I. Daubechies, Ten lectures on wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61,

    SIAM, Philadelphia, PA, 1992.

    [3] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on

Pattern Analysis and Machine Intelligence 11 (1989) 674–692.

    [4] B.K. Alsberg, A.M. Woodward, D.B. Kell, An introduction to wavelet transforms for chemometricians: a

time–frequency approach, Chemometrics and Intelligent Laboratory Systems 37 (1997) 215–239.

    [5] S.K. Goumas, M.E. Zervakis, G.S. Stavrakakis, Classification of washing machines vibration signals using discrete

    wavelet analysis for feature extraction, IEEE Transactions on Instrumentation and Measurement 51 (3) (2002)

497–508.

[6] G. Meltzer, Y.H. Ivanov, Fault detection in gear drives with non-stationary rotational speed, part II: the

time–frequency approach, Mechanical Systems and Signal Processing 17 (2) (2003) 273–283.

[7] N.G. Nikolaou, I.A. Antoniadis, Rolling element bearing fault diagnosis using wavelet packets, NDT&E International 35 (2002) 197–205.

    [8] S.L. Chen, Y.W. Jen, Data fusion neural network for tool condition monitoring in CNC milling machining,

International Journal of Machine Tools and Manufacture 40 (2000) 381–400.

    [9] D.E. Hershberger, H. Kargupta, Distributed multivariate regression using wavelet-based collective data mining,

Journal of Parallel and Distributed Computing 61 (2001) 372–400.

    [10] K. Mehmed, Data Mining: Concepts, Models, Methods and Algorithms, IEEE Press, Wiley, New York, 2002.

    [11] M. Cocchi, R. Seeber, A. Ulrici, WPTER: wavelet packet transform for efficient pattern recognition of signals,

Chemometrics and Intelligent Laboratory Systems 57 (2001) 97–119.

    [12] R.R. Coifman, M.V. Wickerhauser, Entropy-based algorithms for best basis selection, IEEE Transactions on

Information Theory 38 (2) (1992) 713–718.

    [13] B. Walczak, D.L. Massart, Wavelet packet transform applied to a set of signals: a new approach to the best-basis

selection, Chemometrics and Intelligent Laboratory Systems 38 (1997) 39–50.

    [14] N. Saito, R.R. Coifman, F.B. Geshwind, F. Warner, Discriminant feature extraction using empirical probability

density estimation and a local basis library, Pattern Recognition 35 (2002) 2841–2852.

    [15] Y. Wu, R. Du, Feature extraction and assessment using wavelet packets for monitoring of machining processes,

Mechanical Systems and Signal Processing 10 (1) (1996) 29–53.

    [16] S. Zhang, J. Mathew, L. Ma, Common best basis selection of wavelet packets for machine fault diagnosis,

Proceedings of the 10th Asia-Pacific Vibration Conference, 2003, pp. 835–840.

    [17] D.L. Hall, J. Llinas, Handbook of Multisensor Data Fusion, CRC Press, Boca Raton, FL, 2001.

    [18] B. Chen, P.K. Varshney, A Bayesian sampling approach to decision fusion using hierarchical model, IEEE

Transactions on Signal Processing 50 (8) (2002) 1809–1818.

    [19] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and

Machine Intelligence 20 (3) (1998) 226–239.


    [20] D.M.J. Tax, M.V. Breukelen, R.P.W. Duin, J. Kittler, Combining multiple classifiers by averaging or by

multiplying?, Pattern Recognition 33 (2000) 1475–1485.

[21] S. Prabhakar, A.K. Jain, Decision-level fusion in fingerprint verification, Pattern Recognition 35 (2002) 861–874.

[22] B. Samanta, K.R. Al-Balushi, Artificial neural network based fault diagnostics of rolling element bearings using time-domain features, Mechanical Systems and Signal Processing 17 (2) (2003) 317–328.

    [23] J. Shiroishi, Y. Li, S. Liang, T. Kurfess, S. Danyluk, Bearing condition diagnostics via vibration and acoustic

emission measurements, Mechanical Systems and Signal Processing 11 (5) (1997) 693–705.

    [24] W.L. Martinez, A.R. Martinez, Computational Statistics Handbook with MATLAB, Chapman & Hall/CRC,

    New York, 2002.

    [25] D. Rumelhart, G. Hinton, R. Williams, Learning representation by back-propagating errors, Nature 323 (1986)

533–536.

    [26] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
