
Expert Systems with Applications 38 (2011) 11311–11320


A hybrid feature selection scheme for unsupervised learning and its application in bearing fault diagnosis

Yang Yang a,*, Yinxia Liao b, Guang Meng a, Jay Lee b

a State Key Laboratory of Mechanical System and Vibration, Shanghai Jiaotong University, Shanghai 200240, PR China
b NSF I/UCR Center for Intelligent Maintenance Systems, 560 Rhodes Hall, University of Cincinnati, Cincinnati, OH 45221, USA

Article info

Keywords: Feature selection; Unsupervised learning; Fault diagnostics

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.02.181

* Corresponding author. Tel.: +86 21 34206831x322. E-mail address: [email protected] (Y. Yang).

Abstract

With the development of condition-based maintenance techniques and the consequent requirement for good machine learning methods, new challenges arise in unsupervised learning. In real-world situations, because the relevant features that reflect the true machine condition are often unknown a priori, condition monitoring systems built on unimportant features, e.g. noise, might suffer high false-alarm rates, especially when the characteristics of failures are costly or difficult to learn. Therefore, it is important to select the most representative features for unsupervised learning in fault diagnostics. In this paper, a hybrid feature selection scheme (HFS) for unsupervised learning is proposed to improve the robustness and the accuracy of fault diagnostics. It provides a general framework for feature selection based on significance evaluation and similarity measurement with respect to multiple clustering solutions. The effectiveness of the proposed HFS method is demonstrated by a bearing fault diagnostics application and a comparison with other feature selection methods.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

As sensing and signal processing technologies advance rapidly, increasingly many features have become involved in condition monitoring systems and fault diagnosis. A challenge in this area is to select the most sensitive parameters for the various types of fault, especially when the characteristics of failures are costly or difficult to learn (Malhi & Gao, 2004). In reality, since the relevant or important features are often not available a priori, a large number of candidate features have been proposed to achieve a better representation of the machine health condition (Dash & Liu, 1997; Jardine, Lin, & Banjevic, 2006; Peng & Chu, 2004). Because of the irrelevant and redundant features in the original feature space, employing all features might lead to high complexity and low performance of fault diagnosis. Moreover, most unsupervised learning methods assume that all features have a uniform degree of importance during clustering operations (Dash & Koot, 2009). Even in the optimal feature set, it is also assumed that each feature has the same sensitivity throughout clustering operations. In fact, it is known that an important feature facilitates creating clusters, while an unimportant feature, on the contrary, may jeopardize the clustering operation by blurring the clusters. It is therefore better to select only the most representative features (Xu, Xuan, Shi, & Wu, 2009) rather than simply reducing the number of features. Hence, it is of significance to develop a systematic and automatic feature selection method that is capable of selecting the prominent features to achieve a better insight into the underlying machine performance.

In summary, feature selection is one of the essential and frequently used techniques in machine learning (Blum & Langley, 1997; Dash & Koot, 2009; Dash & Liu, 1997; Ginart, Barlas, & Goldin, 2007; Jain, Duin, & Mao, 2000; Kwak & Choi, 2002). Its aim is to select the most representative features, which brings immediate benefits for mining performance such as predictive accuracy and solution comprehensibility (Guyon & Elisseeff, 2003; Liu & Yu, 2005).

However, traditional feature selection algorithms for classification do not work for unsupervised learning since there is no class information available. Dimensionality reduction or feature extraction methods are frequently used for unsupervised data, such as Principal Component Analysis (PCA), the Karhunen–Loeve transformation, or Singular Value Decomposition (SVD) (Dash & Koot, 2009). Malhi and Gao (2004) presented a PCA-based feature selection model for bearing defect classification in a condition monitoring system. Compared to using all features initially considered relevant to the classification results, it provided more accurate classifications for both supervised and unsupervised purposes with fewer feature inputs. But the drawback is the difficulty of understanding the data and the found clusters through the extracted features (Dash & Koot, 2009). Given sufficient computation time, feature subset selection investigates all candidate feature subsets and selects the optimal one satisfying the cost function. Greedy search algorithms like sequential forward feature selection (SFFS) (or backward search feature selection (BSFS)) and random feature selection were commonly used.


Oduntan, Toulouse, and Baumgartner (2008) developed a multilevel tabu search algorithm combined with a hierarchical search framework, and compared it to sequential forward feature selection, random feature selection and tabu search feature selection (Zhang & Sun, 2002). Feature subset selection requires intensive computational time and shows poor performance for non-monotonic indices. In order to overcome these drawbacks, feature selection methods tend to rank features or select a subset of the original features (Guyon & Elisseeff, 2003). Feature ranking techniques sort the features according to cost functions and select a subset from the ordered features. Hong, Kwong, and Chang (2008a) introduced an effective method, feature ranking from multiple views (FRMV). It scores each feature using a ranking criterion by considering multiple clustering results, and selects the first several features with the best quality as the "optimal" feature subset. However, FRMV favours the importance of features that achieve better classification over the redundancy of the selected features, so the selected features are not necessarily the optimal subset. Considering the approaches for evaluating the cost function of feature selection techniques, feature selection algorithms broadly fall into three categories: the filter model, the wrapper model and the hybrid model (Liu & Yu, 2005). The filter model discovers the general characteristics of the data and treats feature selection as a preprocessing step which is independent of any mining algorithm. The filter method is less time consuming but less effective. The wrapper model incorporates one predetermined learning algorithm and selects the feature subset aiming to improve its mining performance according to certain criteria. It is more time-consuming but more effective compared to the filter model. Moreover, the predetermined learning algorithm remains biased towards the shape of the cluster, that is, it is sensitive to the data structure according to its operating concept. The hybrid model takes advantage of the two models in different search stages according to different criteria. Mitra, Murthy, and Pal (2002) described a filter feature selection algorithm for high-dimensional data sets based on measuring similarity between features, whereby the redundancy therein was removed; a maximum information compression index was also introduced to estimate the similarity between features. Li, Dong, and Hua (2008) proposed a filter feature selection algorithm through feature clustering (FFC) to group the features into different clusters based on feature similarity and to select the representative features in each cluster to reduce the feature redundancy. Wei and Billings (2007) introduced a forward orthogonal search feature selection algorithm that maximizes the overall dependency to find significant variables, which also provides a ranked list of selected features ordered according to their percentage contribution to representing the overall structure. Liu, Ma, Zhang, and Mathew (2006) presented a wrapper model based on the fuzzy c-means (FCM) algorithm for rolling element bearing fault diagnostics. Sugumaran and Ramachandran (2007) employed a wrapper approach based on a decision tree with information gain and entropy reduction as criteria to select representative features that could discriminate bearing faults. Hong, Kwong, and Chang (2008b) described a feature selection algorithm based on unsupervised learning ensembles and a population-based incremental learning algorithm. It searches for a subset among all candidate feature subsets such that the clustering operation based on this feature subset generates the clustering result most similar to the one obtained by an unsupervised learning ensemble method. Huang, Cai, and Xu (2007a) developed a two-stage hybrid genetic algorithm to find a subset of features. In the first stage, the mutual information between the predicted labels and the true class labels served as a fitness function for the genetic algorithm to conduct the global search in a wrapper way. In the second stage, the conditional mutual information served as an independent measure for the feature ranking, considering both the relevance and the redundancy of features.

As mentioned above, these techniques either require the available features to be independent initially, which is the opposite of realistic situations, or remain biased toward the shape of the cluster due to their fundamental concept. This paper introduces a hybrid feature selection scheme for unsupervised learning that can overcome those deficiencies. The proposed scheme generates two randomly selected subspaces for further clustering, combines different genres of clustering analysis to obtain a population of sub-decisions of feature selection based on significance measurement, and removes redundant features based on feature similarity measurement to improve the quality of the selected features. The effectiveness of the proposed scheme is validated by an application of bearing defect classification, and the experimental results illustrate that the proposed method is able to (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features.

The rest of this paper is arranged as follows. Section 2 illustrates the proposed HFS scheme in detail. Section 3 discusses the application of the proposed feature selection scheme in bearing fault diagnosis. Finally, Section 4 concludes this paper.

2. Hybrid feature selection scheme (HFS) for unsupervised classification

It is time consuming, or even a difficult mission for an experienced fault diagnosis engineer, to determine which of all the available features is able to distinguish the characteristics of various failures, especially when no prior knowledge (class information) is available. To tackle the problem of absent class information, FRMV (Hong et al., 2008a) extended the feature ranking methodology to unsupervised data clustering. It offered a generic approach to boost the performance of clustering analysis. A stable and robust unsupervised feature ranking approach was proposed based on ensembles of multiple feature rankings obtained from different views of the same data set. When conducting FRMV, data instances were first classified in a randomly selected feature subspace to obtain a clustering solution, and then all features were ranked according to their relevancies to the obtained clustering solution. These two steps iterated until a population of feature rankings was achieved. Thereby, all obtained feature rankings were combined by a consensus function into a single consensus ranking.

However, FRMV clusters the data instances in a subspace that consists of a randomly selected half of the features every time. It is likely that some valuable features might be ignored at the beginning, that is, some features might never be included in any iteration. Besides, FRMV only ensembles the results of one unsupervised learning algorithm, which overlooks the reality that a learning algorithm is likely to hold a bias toward a particular structure of the data, such as the hyper-spherical structure or the hierarchical structure (Frigui, 2008; Greene, Cunningham, & Mayer, 2008). As illustrated in Fig. 1, the data set consists of 11 points. If the data set contains two clusters, a classifier based on the hierarchical concept tends to assign points 1, 2, 3, 4 and 5 to one cluster and the remaining points to the other cluster, while a classifier based on the hyper-spherical concept tends to assign points 1, 2, 3, 4, 6, 7 and 8 to one cluster and the remaining points to the other cluster.

Furthermore, in FRMV all features were assumed to be independent before the selection process, which is usually not the case in the real world. Thereby, highly ranked features are related to their neighbours with high probability; in other words, some top-ranked features might turn out to be redundant.

Since the abovementioned shortcomings of FRMV are obstacles to achieving higher classification performance and constraints on wider application in the real world, a hybrid feature selection scheme (HFS) is proposed to overcome these deficiencies.

Fig. 1. An example of data structures in 2D: (a) hierarchical cluster; (b) spherical cluster.

The remainder of this section presents the HFS scheme for unsupervised learning and introduces the criteria used in HFS.

2.1. Procedure of the hybrid unsupervised feature selection method

The HFS is inspired by FRMV, which ranks each feature according to the relevancy between the feature and the combined clustering solutions. Moreover, the HFS is developed to combine different genres of clustering analysis into a consensus decision and to rank features according to the relevancies between the features and the consensus decision as well as the independencies between features.

Generally, HFS involves two aspects: (1) significance evaluation, which determines the contribution of each feature with respect to multiple clustering results; and (2) redundancy evaluation, which retains the most significant and independent features with respect to feature similarity.

Some notations used throughout this paper are given as follows. The input vector of the original feature space $X$ with $D$ candidate features is denoted as $X_i = \{x_i^1, x_i^2, \ldots, x_i^D\}$ $(i = 1, \ldots, M)$, where $i$ denotes the $i$th instance and $M$ is the number of instances. Let $RF^{(k)} = \{\mathrm{rank}^{(k)}(F_1), \mathrm{rank}^{(k)}(F_2), \ldots, \mathrm{rank}^{(k)}(F_n)\}$ with $1 \le \mathrm{rank}^{(k)}(F_i) \le D$ be the $k$th sub-decision of the feature ranking, where $\mathrm{rank}^{(k)}(F_i)$ denotes the rank of the $i$th feature $F_i$ in the $k$th sub-decision. Assuming there are $P$ sub-decisions of the feature ranking $\{RF^{(1)}, RF^{(2)}, \ldots, RF^{(P)}\}$, a combine function determines a final decision by combining the $P$ sub-decisions into a single consensus feature decision $RF_{\mathrm{pre\text{-}final}}$, which is thereafter processed according to the feature similarity to obtain $RF_{\mathrm{final}}$. Details of the scheme are described as follows.

Algorithm: Hybrid feature selection scheme for unsupervised learning

Input: feature space X, the number of clusters N, maximum iteration L
Output: decision of feature selection

(1) Iterate until a population of sub-decisions is obtained. For k = 1:L, do:
    (1.1) Randomly divide the original feature space into two subspaces X1 and X2.
    (1.2) Group the data with the first and second groups of unsupervised learning algorithms in subspaces X1 and X2 separately.
    (1.3) Evaluate the significance of each feature based on the significance measurement to obtain the kth sub-decision of feature selection RF(k).
    End
(2) Combine all rankings into a single consensus one: RFpre-final = combiner{RF(1), RF(2), ..., RF(2L)}.
(3) Perform the redundancy evaluation based on feature similarities.
(4) Return RFfinal.
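As an illustration of steps (1) and (2), the following Python sketch mimics the sub-decision loop. It uses k-means as the hyper-spherical clustering genre and agglomerative clustering as the hierarchical genre (the paper itself adopts fuzzy c-means and hierarchical clustering in Section 3), and ranks features by the absolute linear correlation coefficient of Eq. (1). The function names, the scikit-learn estimators and the parameter defaults are illustrative assumptions, not the authors' original implementation.

```python
# Sketch of the HFS sub-decision loop (steps 1-2), assuming scikit-learn is available.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def rank_by_lcc(X, labels):
    """Rank every feature by |linear correlation| (Eq. (1)) with the cluster labels."""
    labels = labels.astype(float)
    scores = np.array([
        0.0 if X[:, j].std() * labels.std() == 0
        else abs(np.cov(X[:, j], labels, bias=True)[0, 1]) / (X[:, j].std() * labels.std())
        for j in range(X.shape[1])
    ])
    order = np.argsort(-scores)                 # best score first
    ranks = np.empty(X.shape[1])
    ranks[order] = np.arange(1, X.shape[1] + 1) # rank 1 = most significant
    return ranks

def hfs_sub_decisions(X, n_clusters=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    sub_decisions = []
    for _ in range(n_iter):
        perm = rng.permutation(D)
        halves = (perm[: D // 2], perm[D // 2:])                   # step (1.1): two subspaces
        models = (KMeans(n_clusters=n_clusters, n_init=10),        # hyper-spherical genre
                  AgglomerativeClustering(n_clusters=n_clusters))  # hierarchical genre
        for idx, model in zip(halves, models):                     # step (1.2)
            clusters = model.fit_predict(X[:, idx])
            sub_decisions.append(rank_by_lcc(X, clusters))         # step (1.3)
    return np.vstack(sub_decisions)                                # 2L sub-decisions in total

def combine(sub_decisions):
    """Step (2): simple-average combiner of Eq. (9); lower average rank = more significant."""
    return sub_decisions.mean(axis=0)
```

Note that, as summarized later in Table 1, both random halves of the feature space are clustered in every iteration, so every feature contributes to each pair of sub-decisions.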


At the beginning, the original feature space X is randomly divided into two feature subspaces X1 and X2, in which the instances are clustered correspondingly. In step 1.2, two different genres of clustering analysis, e.g. hyper-spherical clustering and hierarchical clustering, are used to classify the data instances in the two subspaces respectively. Thereby, a clustering solution is obtained in each subspace. Then all features are ranked with respect to their relevancies to the obtained clustering solutions in step 1.3, named the significance evaluation. These two steps iterate until a population of feature rankings, named sub-decisions, is achieved. In step 2, a consensus function is utilized to combine all sub-decisions into a pre-final decision. Thereafter, the final decision of feature selection is accomplished by re-ranking the pre-final decision according to the feature similarity in step 3, named the redundancy evaluation. The details of HFS are introduced in Sections 2.2 and 2.3. Fig. 2 illustrates the framework of the HFS.

Fig. 2. Flowchart of HFS for unsupervised learning.

Table 1
Comparison between FRMV and HFS.

                            FRMV                                             HFS
Subspace                    Randomly select N/2 features                     Randomly divide the feature space into two subspaces X1, X2
Clustering analysis         Remains biased towards a single data structure   Considers the bias of each individual algorithm towards the data structure
Independence evaluation     None                                             Re-ranks the features according to the similarity measurement

Note: N is the number of all features.

Table 1 lists the differences between FRMV and the proposed HFS. First of all, in order to make sure that every feature in the original feature set is able to contribute to the decision making, HFS makes use of both randomly divided subspaces of the original feature space, instead of ignoring some features by randomly selecting half of the original feature space as FRMV does. Secondly, HFS considers the bias of the individual unsupervised learning algorithm; thereby, different genres of clustering methods are used to cluster the data in the subspaces. Moreover, HFS provides a redundancy evaluation according to the feature similarity and re-ranks the features. It is therefore more appropriate for real-world applications than FRMV.

2.2. Significance evaluation

The goal of unsupervised feature selection is to find as few features as possible that best uncover the "interesting natural" clusters in the data, which can be found by an unsupervised learning algorithm. Therefore, the relationship between the clustering solution and a feature is considered as the significance of that feature to the clustering solution. In step 1.3, the features are ranked with respect to their relevancies to the obtained clustering solutions, named the significance evaluation. The sub-decisions serve as the target and each feature is considered as a variable. In this research, the widely used linear correlation coefficient (LCC) (Hong et al., 2008a), the symmetrical uncertainty (SU) (Yu & Liu, 2004) and the Davies–Bouldin index (DB) (Davies & Bouldin, 1979) are used for significance evaluation. The details of each criterion are introduced as follows. For convenience, denote $F_k$ and $R_k$ as the $k$th feature and the $k$th sub-decision respectively.

First, the linear correlation coefficient measures the correlation between the variable and the target, and is calculated as follows:

$$\mathrm{LCC}(F_k, R_k) = \frac{\mathrm{cov}(F_k, R_k)}{\sigma(F_k)\,\sigma(R_k)}, \qquad (1)$$

where $\sigma(R_k)$ is the standard deviation of the $k$th target and $\mathrm{cov}(F_k, R_k)$ is the covariance between $F_k$ and $R_k$.

Secondly, the symmetrical uncertainty is defined as follows:

$$\mathrm{SU}(F_k, R_k) = 2\left[\frac{\mathrm{IG}(F_k \mid R_k)}{H(F_k) + H(R_k)}\right], \qquad (2)$$

with

$$\mathrm{IG}(F_k \mid R_k) = H(F_k) - H(F_k \mid R_k), \qquad (3)$$

$$H(F_k) = -\sum_{F'_k \in \Omega(F_k)} P(F'_k)\log P(F'_k), \qquad (4)$$

$$H(F_k \mid R_k) = -\sum_{R'_k \in \Omega(R_k)} P(R'_k)\left[\sum_{F'_k \in \Omega(F_k)} P(F'_k \mid R'_k)\log P(F'_k \mid R'_k)\right], \qquad (5)$$

$$P(F'_k) = \frac{\sum_{i=1}^{N} \delta(d_i, F'_k)}{N}, \qquad (6)$$

$$\delta(d_i, F'_k) = \begin{cases} 1, & \text{if } d_i = F'_k \\ 0, & \text{otherwise} \end{cases}, \qquad (7)$$


where $H(F_k)$ is the entropy of $F_k$ and $H(F_k \mid R_k)$ is the conditional entropy of $F_k$. $\Omega(F_k)$ denotes all possible values of $F_k$ and $\Omega(R_k)$ denotes all possible values of $R_k$. $P(F'_k)$ is the probability that $F_k$ equals $F'_k$, and $P(F'_k \mid R'_k)$ is the probability that $F_k$ equals $F'_k$ under the condition that the instances are assigned to the group $R'_k$. In addition, a symmetrical uncertainty $\mathrm{SU}(F_k, R_k)$ of 1 indicates that $F_k$ is completely related to $R_k$, whereas a value of 0 means that $F_k$ is absolutely irrelevant to the target (Hong et al., 2008a; Shao & Nezu, 2000).
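The symmetrical uncertainty of Eqs. (2)-(7) can be computed directly from discrete values; a minimal sketch follows, assuming the continuous feature has already been discretized (the equal-width binning shown in the comment is an illustrative assumption).

```python
# Symmetrical uncertainty between a (discretized) feature and cluster labels, per Eqs. (2)-(7).
import numpy as np

def entropy(values):
    """Empirical entropy of a discrete sequence."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(feature, labels):
    h_f, h_r = entropy(feature), entropy(labels)
    # Conditional entropy H(F|R): weight the per-cluster entropy of the feature by P(R = r).
    h_f_given_r = 0.0
    for r in np.unique(labels):
        mask = labels == r
        h_f_given_r += mask.mean() * entropy(feature[mask])
    info_gain = h_f - h_f_given_r                                        # Eq. (3)
    return 0.0 if (h_f + h_r) == 0 else 2 * info_gain / (h_f + h_r)      # Eq. (2)

# Example: discretize a continuous feature into 10 equal-width bins before scoring.
# f_binned = np.digitize(f, np.histogram_bin_edges(f, bins=10))
# su = symmetrical_uncertainty(f_binned, cluster_labels)
```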

Thirdly, the DB index is a function of the ratio of the within-cluster scatter to the between-cluster separation, and is computed as follows:

$$\mathrm{DB} = \frac{1}{n}\sum_{i=1}^{n} \max_{j \neq i}\left[\frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)}\right], \qquad (8)$$

where $n$ is the number of clusters, $Q_i$ stands for the $i$th cluster, $S_n(Q_i)$ denotes the average distance of all objects in the cluster to its cluster centre, and $S(Q_i, Q_j)$ is the distance between the cluster centres. The DB index is small if the clusters are compact and far from each other; in other words, a small DB index indicates a good clustering.
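For reference, scikit-learn ships an implementation of the DB index of Eq. (8) (davies_bouldin_score); the sketch below, an illustration rather than the authors' code, evaluates the index for one feature at a time so that it can serve as a per-feature significance measure (lower is better).

```python
# Per-feature Davies-Bouldin score against a given clustering, using scikit-learn.
import numpy as np
from sklearn.metrics import davies_bouldin_score

def db_per_feature(X, labels):
    """Return the DB index of each single feature with respect to the cluster labels."""
    return np.array([davies_bouldin_score(X[:, [j]], labels) for j in range(X.shape[1])])
```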

2.3. Combination and similarity measurement

Besides the maximization of clustering performance, the other important purpose is the selection of features based on feature dependency or similarity, since any feature carrying little or no additional information beyond that subsumed by the remaining features is redundant and should be eliminated (Mitra et al., 2002). That is, if a highly ranked feature carrying valuable information is very similar to a lower-ranked feature, the latter should be eliminated because it carries no additional valuable information. Therefore, the similarities between features are considered as the reference for the redundancy evaluation.

In step 2, a consensus function, named the combiner, is utilized to combine all sub-decisions into a pre-final decision. A large number of combiners for combining classifier results were discussed in Dietrich, Palm, and Schwenker (2003). The most common combiners are the majority vote, the simple average and the weighted average. In the simple average, the average of the learning model results is calculated and the variable with the largest average value is selected as the final decision. The weighted average follows the same concept as the simple average except that the weights are selected heuristically, while the majority vote assigns the kth variable a rank j if more than half of the sub-decisions vote it to rank j.

Practically, the determination of the weights in the weighted average combiner relies on experience. On the other hand, the majority vote could lead to confusion in decision making, e.g. one feature could be nominated with two ranks at the same time. Therefore, the simple average combiner is applied in this study to combine the sub-decisions, which is computed as follows:

$$AR(j) = \frac{\sum_{k=1}^{M}\mathrm{rank}^{(k)}(j)}{M}, \qquad (9)$$

where $M$ is the population of sub-decisions, and $\mathrm{rank}^{(k)}(j)$ is the significance measurement of feature $j$ in the $k$th sub-decision $RF^{(k)}$.

Thereafter, in step 3, in order to reduce the redundancy, those highly ranked but less independent features with respect to the obtained pre-final decision are eliminated. The similarity between features can be utilized to estimate the redundancy. There are broadly criteria for measuring the similarity between two random variables based on the linear dependency between them. The reason for choosing linear dependency as the feature similarity measure is that the data remain linearly separable when all but one of the linearly dependent features are eliminated, provided the data are linearly separable in the original representation. In this research, the most well-known measure of similarity between two random variables, the correlation coefficient, is adopted. The correlation coefficient $\rho$ between two variables $x$ and $y$ is defined as

$$\rho(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}, \qquad (10)$$

where $\mathrm{var}(x)$ denotes the variance of $x$ and $\mathrm{cov}(x, y)$ is the covariance between the two variables $x$ and $y$.

The elimination procedure is then conducted according to the pre-final decision and the similarity measure between features. For example, the most significant (top-ranked) feature is retained, and the features most related to it according to the similarity measure are considered redundant and removed; the successive features are processed likewise until the remaining well-ranked features are linearly independent.
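A minimal sketch of this greedy elimination step, assuming the pre-final decision is supplied as feature indices ordered from most to least significant; the correlation threshold used as the redundancy cut-off is an illustrative assumption.

```python
# Greedy redundancy elimination per Section 2.3: keep a feature only if it is not too
# correlated (Eq. (10)) with any already-kept, higher-ranked feature.
import numpy as np

def eliminate_redundant(X, ranked_idx, rho_max=0.95):
    kept = []
    corr = np.abs(np.corrcoef(X, rowvar=False))   # pairwise |correlation| between features
    for j in ranked_idx:                          # ranked_idx: most significant feature first
        if all(corr[j, k] < rho_max for k in kept):
            kept.append(j)
    return kept                                   # de-correlated final ranking RF_final
```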

3. HFS’s application in bearing fault diagnosis

This section applies the HFS to bearing fault diagnostics. The comparison results between HFS and other feature selection methods are demonstrated and discussed.

To validate that the proposed feature selection scheme can improve the classification accuracy, a comparison between the proposed hybrid feature selection scheme and five other feature selection approaches was carried out. The eight approaches are listed as follows:

(1) HFS with symmetrical uncertainty (HFS_SU);
(2) HFS with linear correlation coefficient (HFS_LCC);
(3) HFS with DB index (HFS_DB);
(4) PCA-based feature selection (Malhi & Gao, 2004);
(5) FRMV based on k-means clustering with symmetrical uncertainty (FRMV_KM) (Hong et al., 2008a);
(6) Forward search feature selection (SFFS) (Oduntan et al., 2008);
(7) Forward orthogonal search feature selection by maximizing the overall dependency (fosmod) (Wei & Billings, 2007);
(8) Feature selection through feature clustering (FFC) (Li, Hu, Shen, Chen, & Li, 2008).

The comparisons among them were in terms of classification accuracy. Following Hong et al. (2008a), the number of iterations of FRMV_KM was set to 100, k-means clustering was used to obtain the population of clustering solutions, and SU was adopted as the evaluation criterion. In order to obtain a comparable population of sub-decisions, the number of iterations of the proposed algorithm was set to 50. The threshold of fosmod was set to 0.2. Two commonly used clustering algorithms were adopted in the HFS: fuzzy c-means (FCM) clustering and hierarchical clustering. In this research, the result of FCM was defuzzified as follows:

$$R^{(k)} = \begin{cases} 1, & \text{if } P^{(k)} = \max(P) \\ 0, & \text{otherwise} \end{cases}, \qquad (11)$$

where $P$ and $P^{(k)}$ denote the membership of an instance to each cluster and the possibility that the instance belongs to the $k$th cluster, respectively.
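In code, Eq. (11) reduces to a hard assignment by maximum membership; a minimal sketch follows, assuming a membership matrix U of shape (clusters x instances) as produced by any fuzzy c-means routine (the variable names are illustrative).

```python
# Defuzzify a fuzzy c-means membership matrix U (n_clusters x n_instances), per Eq. (11).
import numpy as np

def defuzzify(U):
    return np.argmax(U, axis=0)   # hard label = cluster with the maximum membership
```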

The features discussed in this section for bearing defects include features extracted from the time domain, the frequency domain, the time-frequency domain and the empirical mode decomposition (EMD). Firstly, in the time domain, statistical parameters were extracted directly from the waveform of the vibration signals. A wide set of statistical parameters, such as rms, kurtosis, skewness, crest factor and normalized high-order central moments, have been developed (Jack & Nandi, 2002; Lei, He, & Zi, 2008; Samanta & Nataraj, 2009; Samanta, Al-Balushi, & Al-Araimi, 2003). Secondly, the characteristic frequencies related to the bearing components were located, e.g. the ball spin frequency (BSF), the ball-pass frequency of the inner ring (BPFI), and the ball-pass frequency of the outer ring (BPFO). Besides, in order to interpret real-world signals effectively, the envelope technique for the frequency spectrum was used to extract the features of the modulated carrier frequency signals (Patil, Mathew, & RajendraKumar, 2008). In addition, a new signal feature extracted from the envelope signal, proposed by Huang, Xi, and Li (2007b), the power ratio of the maximal defective frequency to the mean (PMM for short), was calculated as follows:

$$\mathrm{PMM} = \frac{\max\left(p(f_{po}), p(f_{pi}), p(f_{bc})\right)}{\mathrm{mean}(p)}, \qquad (12)$$

where $p(f_{po})$, $p(f_{pi})$ and $p(f_{bc})$ are the average powers at the defective frequencies of the outer-race, inner-race and ball defects, respectively, and $\mathrm{mean}(p)$ is the average of the overall frequency power.
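A small sketch of the PMM feature of Eq. (12), assuming the envelope power spectrum and its frequency axis are already available; taking the single spectral line nearest each defect frequency (rather than a band average) is an illustrative simplification.

```python
# Power ratio of the maximal defective frequency to the mean (PMM), per Eq. (12).
import numpy as np

def pmm(power, freqs, defect_freqs):
    """power: envelope power spectrum; freqs: frequency axis; defect_freqs: (BPFO, BPFI, BSF)."""
    defect_power = [power[np.argmin(np.abs(freqs - f))] for f in defect_freqs]
    return max(defect_power) / power.mean()
```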

Thirdly, Yen (2000) introduced the wavelet packet transform (WPT) as follows:

$$e_{j,n} = \sum_{k} w_{j,n,k}^{2}, \qquad (13)$$

where $w_{j,n,k}$ is the wavelet packet coefficient, $j$ is the scaling parameter, $k$ is the translation parameter, and $n$ is the oscillation parameter. Each wavelet packet coefficient measures a specific sub-band frequency content.
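As an illustration (not the authors' implementation), the wavelet packet node energies of Eq. (13) can be obtained with the PyWavelets package; the Daubechies-4 wavelet and the level-4 decomposition (giving 16 sub-bands, as used in Section 3.2) are assumptions.

```python
# Wavelet packet node energies e_{j,n} (Eq. (13)) via PyWavelets, assuming a level-4
# decomposition (2^4 = 16 sub-bands) and a Daubechies-4 wavelet.
import numpy as np
import pywt

def wpt_energies(signal, wavelet="db4", level=4):
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, mode="symmetric", maxlevel=level)
    nodes = wp.get_level(level, order="freq")          # sub-bands ordered by frequency
    return np.array([np.sum(np.square(node.data)) for node in nodes])
```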

Fig. 3. Vibration signals of the first test (acceleration in g versus time in s), including the normal pattern and seven failure patterns: roller defect, inner-race defect, outer-race defect, inner-race & roller defect, outer & inner-race defect, outer-race & roller defect, and outer & inner-race & roller defect.

In addition, EMD was used to decompose the signal into several intrinsic mode functions (IMFs) and a residual; the EMD energy entropy (Yu, Yu, & Cheng, 2006) was computed from the first several IMFs of the signal.
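A sketch of the EMD energy entropy, assuming the IMFs have already been obtained from any EMD routine (the decomposition itself is not shown); this follows the usual energy-entropy definition and is an illustrative reading of Yu, Yu, and Cheng (2006) rather than a verbatim reproduction.

```python
# EMD energy entropy computed from the first m intrinsic mode functions (IMFs).
import numpy as np

def emd_energy_entropy(imfs, m=6):
    """imfs: array of shape (n_imfs, n_samples) from any EMD implementation."""
    energies = np.array([np.sum(np.square(imf)) for imf in imfs[:m]])
    p = energies / energies.sum()                 # relative energy of each IMF
    return -np.sum(p * np.log(p))                 # Shannon entropy of the energy distribution
```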

In this research, a self-organizing map (SOM) was used to validate the classification performance based on the selected features. The theoretical background of the unsupervised SOM has been extensively studied in the literature; a brief introduction of the SOM for bearing fault diagnosis can be found in Liao and Lee (2009). With available data from different bearing failure modes, the SOM can be applied to build a health map in which different regions indicate different defects of a bearing. Each input vector is represented by a best matching unit (BMU) in the SOM. After training, the input vectors of a specific bearing defect are represented by a cluster of BMUs in the map, which forms a region indicating the defect. If the input vectors are labeled, each region can be defined to represent a defect.
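As a sketch of this validation step (the paper does not specify its SOM implementation), the MiniSom package can train a map on the selected features and label each neuron by a majority vote of the training BMUs; the map size, training length and learning parameters below are assumptions.

```python
# Build a SOM "health map" on the selected features and classify test vectors by their BMU label.
# MiniSom is used here as an illustrative SOM implementation; map size and parameters are assumed.
import numpy as np
from minisom import MiniSom

def train_health_map(X_train, y_train, size=10, n_iter=5000, seed=0):
    som = MiniSom(size, size, X_train.shape[1], sigma=1.0, learning_rate=0.5, random_seed=seed)
    som.train_random(X_train, n_iter)
    # Label each neuron with the most frequent training class among vectors mapped to it.
    votes = {}
    for x, y in zip(X_train, y_train):
        votes.setdefault(som.winner(x), []).append(y)
    labels = {bmu: max(set(ys), key=ys.count) for bmu, ys in votes.items()}
    return som, labels

def classify(som, labels, X_test):
    return [labels.get(som.winner(x)) for x in X_test]   # None if the BMU was never labeled
```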

3.1. Experiments

In this research, two tests were conducted on two types of bearing, and the class information was considered unknown in both cases.

In the first test, bearings were artificially made to have a roller defect, an inner-race defect, an outer-race defect, and four different combinations of these single failures.


Fig. 5a. Comparison results of classification accuracy of HFS_LCC, HFS_SU and HFS_DB (validated by SOM).

In this case, the SKF32208 bearing was tested, with an accelerometer installed in the vertical direction on its housing. The sampling rate for the vibration signal was 50 kHz. The BPFI, BPFO and BSF for this case were calculated as 131.73 Hz, 95.2 Hz and 77.44 Hz, respectively. Fig. 3 shows the vibration signals of all defects as well as the normal condition in the first test.

In the second test, a set of 6308-2R single-row deep-groove ball bearings were run to failure, resulting in roller defects, inner-race defects and outer-race defects (Huang et al., 2007b). In total, 10 bearings were involved in the experiment. The data sampling frequency was 20 kHz. The BPFI, BPFO and BSF in this case were calculated as 328.6 Hz, 205.3 Hz and 274.2 Hz, respectively. It should be pointed out that the beginning of the second test was not stable, after which the bearing ran through a long normal period. Hence, two separate segments from the stable normal period were selected as baselines for training and testing, respectively. On the other hand, the data that exceeded the mean value before the end of the test were regarded as potential failure patterns. Therefore, 70% of the faulty patterns and half of the good patterns were used for training the unsupervised learning model, while all the faulty patterns and the other half of the good patterns were used for testing. Fig. 4 shows part of the data segments of one bearing from the run-to-failure experiment in the second test.

3.2. Analysis and result

In the first test, a total of 24 features were computed as follows (a sketch of the statistical feature extraction is given after the list). Half of the data was used for training the SOM and the remaining part for testing.

• Energies centred at 1×BPFO, 2×BPFO, 1×BPFI, 2×BPFI, 1×BSF and 2×BSF.
• 6 statistics of the raw signal (mean, rms, kurtosis, crest factor, skewness, entropy).
• 6 statistics of the envelope signal obtained by the Hilbert transform.
• 6 statistics of the spectrum of the waveform obtained by FFT.
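The six statistics mentioned in the list above could be computed as in the following sketch; the histogram-based entropy estimate and the Hilbert-transform envelope are illustrative assumptions.

```python
# Six statistics (mean, rms, kurtosis, crest factor, skewness, entropy) for a 1-D signal,
# plus a Hilbert-transform envelope, as used for the raw/envelope/spectrum feature groups.
import numpy as np
from scipy.stats import kurtosis, skew
from scipy.signal import hilbert

def six_stats(x, bins=64):
    rms = np.sqrt(np.mean(np.square(x)))
    hist, _ = np.histogram(x, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return {
        "mean": np.mean(x),
        "rms": rms,
        "kurtosis": kurtosis(x),
        "crest_factor": np.max(np.abs(x)) / rms,
        "skewness": skew(x),
        "entropy": -np.sum(p * np.log(p)),
    }

# envelope = np.abs(hilbert(raw_signal))   # envelope signal via the Hilbert transform
```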

Fig. 5 shows the results of the first test, with the x-axis representing the number of selected features fed into the unsupervised SOM for clustering and the y-axis representing the corresponding classification accuracy. The first 12 features selected by each algorithm are shown for convenience.

Fig. 4. Vibration signal of one bearing of the second test (acceleration in g): (1) unstable beginning of the test; (2) first stable segment; (3) second stable segment; (4) failure pattern (inner-race defect).

Taking Fig. 5a as an example, the classification accuracies based on HFS_SU, HFS_LCC and HFS_DB with the top-ranked feature as input were 92.11%, 92.11% and 85.59%, respectively. When using the first three ranked features, accuracies of 97.19%, 99.77% and 97.03% were achieved. In comparison with HFS_SU and HFS_DB, the features selected by HFS_LCC achieved the highest classification accuracy of 99.77%; in other words, HFS_LCC apparently selected the most representative features for this specific application. As shown in Fig. 5b, the highest classification accuracy for PCA was 99.38% with 5 features. In Fig. 5c, the classification accuracies based on HFS_SU, HFS_DB and HFS_LCC were higher than the results based on FRMV_KM; for FRMV_KM, the highest classification accuracy of 98.36% was achieved with 12 features. Fig. 5d compares SFFS with the three HFS methods: HFS_LCC selected the most representative features, and the accuracies reached by HFS were higher. For SFFS, the highest classification accuracy of 98.43% was achieved with 9 features. As shown in Fig. 5e, although the first 11 features selected by fosmod ultimately reached an accuracy of 99.14%, HFS not only obtained higher accuracy but also ranked the features with higher reliability.

Fig. 5b. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.

Fig. 5c. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.

Fig. 5d. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.

Fig. 5e. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.

Fig. 5f. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.

Compared to FFC, as shown in Fig. 5f, the features selected by HFS provided better classification accuracy with fewer features; for FFC, the highest accuracy of 98.43% was reached with 9 features. The performance improvement of the proposed model over FRMV_KM, SFFS, fosmod and FFC was mainly due to making use of every feature, combining the clustering solutions, and evaluating independence, which overcome the deficiencies of limited diversity in the clustering solutions and the harm caused by correlated features.

In order to illustrate the effect of the redundancy of the feature set on the classification performance, and to demonstrate the robustness of HFS, more candidate features were involved in the second test. In total, 40 features were calculated, given as follows.

• 10 statistics of the raw signal (variance, rms, skewness, kurtosis, crest factor, 5th to 9th central moments).
• Energies centred at 1×BPFO, 1×BPFI and 1×BSF for both the raw signal and the envelope.
• PMMs for both the raw signal and the envelope.
• 16 wavelet packet node energies (WPNs).
• 6 IMF energy entropies.

The results in Fig. 6 show the classification accuracy for the second test. As shown in Fig. 6a, HFS_LCC reached the highest classification accuracy of 88.56% with the first 10 features, while HFS_DB and HFS_SU achieved their highest classification accuracies of 87.29% and 85.17% with the first 8 and 11 features, respectively. Fig. 6b shows that, compared to the PCA-based feature selection method (highest accuracy 86.02%), HFS_LCC and HFS_DB achieved higher accuracy with the same number of features or fewer. In comparison with FRMV_KM (highest accuracy 83.90%, as shown in Fig. 6c), the HFS group showed clearly better classification accuracy with fewer features. As shown in Figs. 6d and 6e, the features selected by SFFS and fosmod resulted in accuracies of 84.75% and 85.17%, which were worse than that of the single feature selected by HFS_DB. Compared to FFC (as shown in Fig. 6f), HFS_LCC showed better performance, since an accuracy of 86.86% was reached by FFC with 6 features selected.

From the results of the two tests, the conclusion can be drawn that the proposed HFS is robust and effective in selecting the most representative features, which maximizes the unsupervised classification performance.

Fig. 6a. Comparison results of classification accuracy of HFS_LCC, HFS_SU and HFS_DB.

Fig. 6b. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.

Fig. 6c. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.

Fig. 6d. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.

Fig. 6e. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.

Fig. 6f. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.


It should also be noted that, for both tests, FRMV_KM and HFS_SU shared the same evaluation criterion, but the decision provided by HFS_SU was always better than that of FRMV_KM, which indicates that the proposed HFS scheme is superior to FRMV with respect to the same evaluation criterion.

Besides, it is worth noticing that in both tests the proposed HFS based on the three evaluation criteria, i.e. SU, LCC and the DB index, generated results with only slight differences. This suggests that the effectiveness of the features selected by the proposed HFS depends on the applied evaluation criterion, and LCC was considered more appropriate for these two cases. Nonetheless, it is still appropriate to conclude that the overall performance based on the features selected by HFS was better compared with the other five methods.

4. Conclusion

This paper presented a hybrid unsupervised feature selection (HFS) approach to select the most representative features for unsupervised learning and used two experimental bearing data sets to demonstrate the effectiveness of HFS. The performance of the HFS approach was compared with five other feature selection methods with respect to the accuracy improvement of the unsupervised learning algorithm SOM. The results showed that the proposed model could (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features. Moreover, the results suggested that the performance of HFS depends on the evaluation criterion suited to the particular application. Therefore, further research will focus on expanding HFS to broader applications and to online machinery defect diagnostics and prognostics.

Acknowledgement

The authors gratefully acknowledge the support of the 863 Program (No. 50821003), PR China, for this work.

References

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 1–2, 245–271.
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 224–227.
Dash, M., & Koot, P. W. (2009). Feature selection for clustering. In Encyclopedia of database systems (pp. 1119–1125).
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1–4, 131–156.
Dietrich, C., Palm, G., & Schwenker, F. (2003). Decision templates for the classification of bioacoustic time series. Information Fusion, 2, 101–109.
Frigui, H. (2008). Clustering: Algorithms and applications. In 2008 1st international workshops on image processing theory, tools and applications, IPTA 2008, Sousse.
Ginart, A., Barlas, I., & Goldin, J. (2007). Automated feature selection for embeddable prognostic and health monitoring (PHM) architectures. In AUTOTESTCON (Proceedings), Anaheim, CA (pp. 195–201).
Greene, D., Cunningham, P., & Mayer, R. (2008). Unsupervised learning and clustering. Lecture Notes in Applied and Computational Mechanics, 51–90.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 1157–1182.
Hong, Y., Kwong, S., & Chang, Y. (2008a). Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 5, 595–602.
Hong, Y., Kwong, S., & Chang, Y. (2008b). Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recognition, 9, 2742–2756.
Huang, J., Cai, Y., & Xu, X. (2007a). A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters, 13, 1825–1844.
Huang, R., Xi, L., & Li, X. (2007b). Residual life predictions for ball bearings based on self-organizing map and back propagation neural network methods. Mechanical Systems and Signal Processing, 1, 193–207.
Jack, L. B., & Nandi, A. K. (2002). Fault detection using support vector machines and artificial neural networks, augmented by genetic algorithms. Mechanical Systems and Signal Processing, 2–3, 373–390.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 4–37.
Jardine, A. K. S., Lin, D., & Banjevic, D. (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 7, 1483–1510.
Kwak, N., & Choi, C. H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 1, 143–159.
Lei, Y. G., He, Z. J., & Zi, Y. Y. (2008). A new approach to intelligent fault diagnosis of rotating machinery. Expert Systems with Applications, 4, 1593–1600.
Li, G., Hu, X., Shen, X., et al. (2008). A novel unsupervised feature selection method for bioinformatics data sets through feature clustering. In IEEE international conference on granular computing, GRC 2008, Hangzhou (pp. 41–47).
Li, Y., Dong, M., & Hua, J. (2008). Localized feature selection for clustering. Pattern Recognition Letters, 10–18.
Liao, L., & Lee, J. (2009). A novel method for machine performance degradation assessment based on fixed cycle features test. Journal of Sound and Vibration, 326, 894–908.
Liu, X., Ma, L., Zhang, S., & Mathew, J. (2006). Feature group optimisation for machinery fault diagnosis based on fuzzy measures. Australian Journal of Mechanical Engineering, 2, 191–197.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 4, 491–502.
Malhi, A., & Gao, R. X. (2004). PCA-based feature selection scheme for machine defect classification. IEEE Transactions on Instrumentation and Measurement, 6, 1517–1525.
Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3, 301–312.
Oduntan, I. O., Toulouse, M., & Baumgartner, R. (2008). A multilevel tabu search algorithm for the feature selection problem in biomedical data. Computers & Mathematics with Applications, 5, 1019–1033.
Patil, M. S., Mathew, J., & RajendraKumar, P. K. (2008). Bearing signature analysis as a medium for fault detection: A review. Journal of Tribology, 1.
Peng, Z. K., & Chu, F. L. (2004). Application of the wavelet transform in machine condition monitoring and fault diagnostics: A review with bibliography. Mechanical Systems and Signal Processing, 2, 199–221.
Samanta, B., Al-Balushi, K. R., & Al-Araimi, S. A. (2003). Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection. Engineering Applications of Artificial Intelligence, 7–8, 657–665.
Samanta, B., & Nataraj, C. (2009). Use of particle swarm optimization for machinery fault detection. Engineering Applications of Artificial Intelligence, 2, 308–316.
Shao, Y., & Nezu, K. (2000). Prognosis of remaining bearing life using neural networks. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 3, 217–230.
Sugumaran, V., & Ramachandran, K. I. (2007). Automatic rule learning using decision tree for fuzzy classifier in fault diagnosis of roller bearing. Mechanical Systems and Signal Processing, 5, 2237–2247.
Wei, H. L., & Billings, S. A. (2007). Feature subset selection and ranking for data dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 162–166.
Xu, Z., Xuan, J., Shi, T., & Wu, B. (2009). Application of a modified fuzzy ARTMAP with feature-weight learning for the fault diagnosis of bearing. Expert Systems with Applications, 6, 9961–9968.
Yen, G. G. (2000). Wavelet packet feature extraction for vibration monitoring. IEEE Transactions on Industrial Electronics, 3, 650–667.
Yu, Y., Yu, D., & Cheng, J. (2006). A roller bearing fault diagnosis method based on EMD energy entropy and ANN. Journal of Sound and Vibration, 1–2, 269–277.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 1205–1224.
Zhang, H., & Sun, G. (2002). Feature selection using tabu search method. Pattern Recognition, 35, 701–711.