  • Biomedical Data Analysis by Supervised Manifold Learning

    A. M. Álvarez-Meza¹, G. Daza-Santacoloma¹, and G. Castellanos-Domínguez¹

    Abstract— Biomedical data analysis is usually carried out by assuming that the information structure embedded into the biomedical recordings is linear, but that assumption does not actually correspond to the real behavior of the extracted features. In order to improve the accuracy of an automatic diagnosis support system, and to reduce the computational complexity of the employed classifiers, we propose a nonlinear dimensionality reduction methodology based on manifold learning with multiple kernel representations, which learns the underlying data structure of biomedical information. Moreover, our approach can be used as a tool that allows the specialist to perform a visual analysis and interpretation of the studied variables describing the health condition. Obtained results show how our approach maps the original high dimensional features into an embedding space where simple and straightforward classification strategies achieve a suitable system performance.

    I. INTRODUCTION

    The analysis of biomedical data is a challenge that mainly requires discovering the appropriate structure of the information embedded into the recordings. A suitable analysis of the data improves the performance of automatic diagnosis support systems and simplifies their implementation. Often, the embedded structure of the information is assumed to be linear by many techniques, such as principal component analysis (PCA), linear discriminant analysis (LDA), and multidimensional scaling (MDS). Nevertheless, the linearity assumption does not usually correspond to the real behavior of biomedical data. Indeed, the most common recordings describing human health status (e.g., speech recordings, phonocardiograms, electrocardiograms, and electroencephalograms, among others) are composed of several nonlinearly correlated variables lying in high dimensional spaces. In this regard, it is necessary to analyze the data by means of nonlinear dimensionality reduction (NLDR) techniques.

    In this way, it is possible to represent the high dimensional data in a low dimensional (or embedding) space, generally assuming that the input data are sampled from a smooth underlying manifold. NLDR methods aim to obtain useful and compact representations of the information. Nonetheless, there are some limitations in the application of this kind of technique when data lie in separated groups [4]. The above issue arises during the design of automatic

    *Research carried out under grants provided by "Centro de Investigación e Innovación de Excelencia – ARTICA" - COLCIENCIAS, a Ph.D. scholarship funded by Universidad Nacional de Colombia, and project 20201006594 funded by Universidad Nacional de Colombia and Universidad de Caldas.

    ¹All authors are with the Signal Processing and Recognition Group, Universidad Nacional de Colombia, Campus La Nubia, km 7 vía al Magdalena, Manizales, Colombia. {amalvarezme, gdazas, cgcastellanosd}@unal.edu.co

    diagnosis support systems. Indeed, most NLDR methods are not conceived to consider the class labels of the data (e.g., control or pathological patients) as extra information, which could enhance the low dimensional representation of the data and improve the system accuracy.

    Several approaches to supervised manifold learning have been presented. In [7], a supervised Locally Linear Embedding (LLE) method, called α-LLE, is reported, which enforces the separation among classes according to a weighting parameter controlling the amount of class information that is incorporated. In [9], a local formulation of linear discriminant analysis is introduced. Next, in [6] the class labels are used to determine the neighbors of the training data, so as to map overlapping high dimensional data into clusters in the embedded space. Mostly, these approaches either omit the preservation of the high dimensional local data structure in the embedding space (yielding overtraining), or they require free parameters controlling the smoothing of the transformation to avoid overfitting and to reduce the noise sensitivity. Furthermore, in [4] a supervised version of LLE, termed C-LLE, is proposed, which employs class labels as extra information to guide the dimensionality reduction procedure, allowing a suitable representation to be found for each class. The amount of class label information incorporated in the embedding process is controlled by a tradeoff parameter; however, the upper bound of the tradeoff is not well defined, which can be problematic for an inexpert user and could increase the computational load of the algorithm.

    Recently, some machine learning approaches have shown that using multiple kernels as the similarity measure, instead of just one, can be useful to improve feature extraction [5], [13]. In this sense, we propose a supervised NLDR method based on the Laplacian Eigenmaps (LEM) algorithm [3] using multiple kernel representations (MKR). Our approach, which we name LEM-MKR, incorporates class label information while the local structure topology of the data is preserved during the mapping. The proposed LEM-MKR improves the accuracy of automatic diagnosis support systems while reducing the computational complexity of the final classifiers, which could be useful for real-time implementations. Moreover, our approach can also be used as a tool that allows the specialist to perform a visual analysis and interpretation of the studied variables of the health condition.

    The remainder of this work is organized as follows. In Section II the proposed LEM-MKR methodology is described. In Section III the experimental conditions and results are presented. Finally, in Sections IV and V we discuss and conclude about the attained results.


  • II. BACKGROUND

    Laplacian Eigenmaps (LEM) is an NLDR technique based on preserving the intrinsic geometric structure of the manifold [3]. Let $X \in \mathbb{R}^{n \times p}$ be the input data matrix with sample vectors $x_i$ ($i = 1, \ldots, n$); LEM aims at providing a mapping to a low dimensional Euclidean space $Y \in \mathbb{R}^{n \times m}$, with row samples $y_i$, being $m \ll p$. The algorithm comprises the following steps. First, an undirected weighted graph $G(V, E)$ is built, where $V$ are the vertices and $E$ the edges. In this case, there are $n$ vertices, one for each $x_i$. Nodes $i$ and $j$ are connected by $E_{ij} = 1$ if $i$ is one of the $k$ nearest neighbors of $j$ (or vice versa), as measured by the Euclidean distance [3]. Second, a weight matrix $W \in \mathbb{R}^{n \times n}$ is calculated as $W_{ij} = \kappa(x_i, x_j)$ if $E_{ij} = 1$, otherwise $W_{ij} = 0$, being $\kappa(\cdot, \cdot)$ a kernel function. After that, the graph Laplacian matrix $L \in \mathbb{R}^{n \times n}$ is given by $L = D - W$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. Finally, $Y$ is calculated by minimizing

    $$\sum_{ij} (y_i - y_j)^2 W_{ij}. \qquad (1)$$

    Note that the above minimization implies a penalty if neighboring points $x_i$ and $x_j$ are mapped far apart. Equation (1) can be solved as the generalized eigenvalue problem $L Y_{:,s} = \lambda_s D Y_{:,s}$, where $\lambda_s$ is the eigenvalue corresponding to the eigenvector $Y_{:,s}$, with $s = 1, \ldots, n$. The first eigenvector is the unit vector with all equal components, while the remaining $m$ eigenvectors form the embedded space. However, conventional NLDR approaches are not suitable for pattern recognition tasks, e.g., biomedical data analysis, because they lack methods to incorporate label information in the mapping. A supervised NLDR mapping could enhance the data separability in further classification stages [4], [12].
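
    As a minimal sketch of the LEM steps above (not the authors' implementation), the following NumPy/SciPy code builds the k-nearest-neighbor graph, the kernel weight matrix, and the graph Laplacian, and then solves the generalized eigenproblem. The Gaussian kernel and its bandwidth sigma are assumptions; any kernel $\kappa(\cdot, \cdot)$ fits the formulation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmaps(X, m=2, k=10, sigma=1.0):
    """Map X (n x p) to Y (n x m), preserving local graph structure."""
    n = X.shape[0]
    D2 = cdist(X, X, metric='sqeuclidean')
    # Step 1: symmetric k-nearest-neighbor graph E (connect i and j
    # if either is among the k nearest neighbors of the other).
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]          # skip self-distance
    E = np.zeros((n, n), dtype=bool)
    E[np.repeat(np.arange(n), k), idx.ravel()] = True
    E |= E.T
    # Step 2: kernel weights on connected pairs (Gaussian kernel assumed).
    W = np.where(E, np.exp(-D2 / (2.0 * sigma ** 2)), 0.0)
    # Step 3: graph Laplacian L = D - W, with D the degree matrix.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 4: generalized eigenproblem L y = lambda D y; discard the
    # trivial constant eigenvector and keep the next m as the embedding.
    _, vecs = eigh(L, D)
    return vecs[:, 1:m + 1]
```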

    Recently, some machine learning approaches have shown that using multiple kernels to infer the data similarity instead of just one (MKR) can be useful to improve the data interpretability [5], [13]. Thus, based on MKR, we propose to incorporate the label information of the data into the LEM mapping. Given $Z$ kernel functions, a combined kernel function can be computed as $\kappa(x_i, x_j) = \sum_{z=1}^{Z} \beta_z \kappa_z(x_i, x_j)$, subject to $\beta_z \geq 0$ and $\sum_{z=1}^{Z} \beta_z = 1$ ($\beta_z \in \mathbb{R}$). In this work, we propose to analyze the input data $X$ considering both the local and the class membership relationships among samples. Note that the optimization (1) is directly related to the weight matrix $W$, which can be analyzed as a kernel matrix. Therefore, it is possible to employ different similarity measures in the LEM formulation by means of MKR, in what we call the LEM-MKR approach. Hence, we consider two weight matrices, $W^e$ and $W^c$, in the LEM mapping process. The former is computed as

    $$W^e_{ij} = \begin{cases} \kappa(x_i, x_j), & E_{ij} = 1 \\ 0, & E_{ij} = 0, \end{cases} \qquad (2)$$

    considering traditional LEM assumptions. Now, let $c \in \mathbb{R}^{n \times 1}$ be a class label vector with $c_i \in \{1, \ldots, C\}$, being $C$ the number of classes in $X$; then $W^c$ can be expressed as

    $$W^c_{ij} = \delta(c_i - c_j)\,\big[\kappa(x_i, x_j) + \kappa(x_i, \mu_i) + \kappa(x_j, \mu_j)\big]. \qquad (3)$$

    It is important to note that the function $\delta(c_i - c_j)$ in (3) penalizes non-class membership: $\delta(c_i - c_j) = 1$ if and only if $c_i = c_j$, and it equals zero otherwise. Besides, $\mu_i$ and $\mu_j$ are the average vectors of the classes to which $x_i$ and $x_j$ belong. The terms $\kappa(x_i, \mu_i)$ and $\kappa(x_j, \mu_j)$ aim to reveal the similarity of each sample to the average intra-manifold structure of its class. Moreover, $\kappa(x_i, x_j)$ considers the similarity between samples of the same class. Inspired by MKR, equations (2) and (3) can be used to compute the weight matrix $W^T$ as

    $$W^T = \beta_e W^e + \beta_c W^c, \qquad (4)$$

    subject to $\beta_e + \beta_c = 1$. From equation (4), the combined Laplacian matrix is calculated as $L^T = D^T - W^T$, where $D^T \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D^T_{ii} = \sum_j W^T_{ij}$. Therefore, a new NLDR objective function can be written as $\sum_{ij} (y_i - y_j)^2 W^T_{ij}$, which can be solved as a generalized eigenvalue problem, fixing $\beta_e$ and $\beta_c$.
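
    Below is a compact sketch of how $W^e$ (eq. 2), $W^c$ (eq. 3), and the combined matrix $W^T$ (eq. 4) might be assembled and the resulting generalized eigenproblem solved. It is an illustration under stated assumptions, not the authors' implementation: kappa is any kernel returning pairwise similarities (e.g., the linear kernel lambda A, B: A @ B.T used later in the experiments), E is the boolean neighborhood matrix, and the similarities are assumed positive so that the degree matrix stays positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def lem_mkr_embedding(X, c, E, kappa, beta_e, m=2):
    """Sketch of the LEM-MKR mapping for data X, labels c, boolean
    neighborhood matrix E, kernel kappa, and tradeoff beta_e."""
    c = np.asarray(c)
    K = kappa(X, X)
    # Eq. (2): local similarities restricted to the graph edges.
    We = np.where(E, K, 0.0)
    # Per-sample class means mu_i (rows of M follow the labels in c).
    means = {ci: X[c == ci].mean(axis=0) for ci in np.unique(c)}
    M = np.vstack([means[ci] for ci in c])
    k_mu = np.diag(kappa(X, M))                        # kappa(x_i, mu_i)
    delta = (c[:, None] == c[None, :]).astype(float)   # delta(c_i - c_j)
    # Eq. (3): same-class similarity plus similarity to the class means.
    Wc = delta * (K + k_mu[:, None] + k_mu[None, :])
    # Eq. (4): convex combination with beta_e + beta_c = 1.
    WT = beta_e * We + (1.0 - beta_e) * Wc
    DT = np.diag(WT.sum(axis=1))
    LT = DT - WT
    # Generalized eigenproblem; drop the trivial constant eigenvector.
    _, vecs = eigh(LT, DT)
    return vecs[:, 1:m + 1]
```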

    The $\beta_e$ and $\beta_c$ parameters in (4) give a tradeoff between the local appearance and the class label similarities retained in $Y$. If $\beta_c = 0$ ($\beta_e = 1$), we recover the original LEM mapping. As $\beta_c$ increases, $\beta_e$ decreases due to the constraint $\beta_e + \beta_c = 1$. Therefore, for a given pair $(\beta_e, \beta_c)$, we can infer both the local and the class label representation errors as $\epsilon_e = \sum_{ij} (y_i - y_j)^2 W^e_{ij}$ and $\epsilon_c = \sum_{ij} (y_i - y_j)^2 W^c_{ij}$, respectively. Looking for the simultaneous minimization of both errors, we employ the parametric plot of $\epsilon_e$ versus $\epsilon_c$ as a tool to study the behavior of these quantities, with both normalized between 0 and 1. Based on the L-curve criterion for Tikhonov regularization, the point with maximum curvature turns out to be a good choice for $\beta_e$ and $\beta_c$. Note that $\epsilon_e$ measures how well the underlying data structure is preserved in $Y$, while $\epsilon_c$ establishes the quality of the separability among classes.
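
    The maximum-curvature point can be located numerically. The sketch below assumes an embedding routine embed(beta_e) (such as lem_mkr_embedding above) is re-run for a grid of candidate tradeoffs, normalizes both errors to [0, 1], and estimates the curvature of the parametric curve by finite differences; all names are illustrative, and re-running the embedding per candidate is the dominant cost.

```python
import numpy as np

def representation_error(Y, W):
    """eps = sum_ij (y_i - y_j)^2 W_ij for an embedding Y."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=-1)
    return (D2 * W).sum()

def select_tradeoff(embed, We, Wc, betas):
    """Pick beta_e at the maximum-curvature (L-curve) point of the
    normalized parametric plot of eps_e versus eps_c."""
    errs = np.array([[representation_error(Y, We),
                      representation_error(Y, Wc)]
                     for Y in (embed(b) for b in betas)])
    errs = (errs - errs.min(0)) / (errs.max(0) - errs.min(0) + 1e-12)
    x, y = errs[:, 0], errs[:, 1]
    # Discrete curvature from first and second finite differences.
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    curvature = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2 + 1e-12)**1.5
    return betas[int(np.argmax(curvature))]
```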

    On the other hand, even when NLDR algorithms provide an embedding for a fixed dataset, it is necessary to generalize their results to new locations in the input space. Therefore, given the embedding space $Y$, a new sample $x_{new}$ can be mapped by minimizing $\| x_{new} - \sum_{r=1}^{k} v_r x_r \|^2$, where $\sum_{r=1}^{k} v_r = 1$, being $x_r$ one of the $k$ nearest neighbors of $x_{new}$ in $X$ (see [14] for details). In Fig. 1, a comparison between the traditional PCA projection and LEM-MKR is presented for synthetic nonlinearly structured data [4].

    [Fig. 1 panels: Nonlinear data | PCA | LEM-MKR]

    Fig. 1. Triple swiss-roll (n = 1000, p = 3, C = 3, m = 2). The traditional PCA projection does not unfold the underlying structure of the input data, and it overlaps samples of different classes. The LEM-MKR projection conserves the local structure of each swiss-roll while separating, as far as possible, samples from different classes.
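
    The out-of-sample extension described above can be sketched as follows: the new sample is reconstructed from its k nearest training neighbors with weights constrained to sum to one (in the spirit of the LLE formulation of [14]), and the same weights are applied to the embedded points. The small regularization term is an assumption added for numerical stability.

```python
import numpy as np

def map_new_sample(x_new, X, Y, k=10, reg=1e-3):
    """Map x_new into the embedding Y learned on training data X."""
    d = np.linalg.norm(X - x_new, axis=1)
    nn = np.argsort(d)[:k]                 # k nearest training neighbors
    Z = X[nn] - x_new                      # neighbors centered on x_new
    C = Z @ Z.T                            # local Gram matrix
    C += reg * np.trace(C) * np.eye(k)     # regularize (assumption)
    v = np.linalg.solve(C, np.ones(k))
    v /= v.sum()                           # enforce sum_r v_r = 1
    return v @ Y[nn]                       # y_new = sum_r v_r y_r
```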


  • III. EXPERIMENTS

    We aim to demonstrate the advantages of our approach as a tool to improve diagnosis support systems for the automatic detection of diseases. A general scheme of such a system is summarized in Fig. 2. First, a preprocessing stage is used to prepare the raw data for further analysis. Then, some features are estimated according to the studied phenomenon. In most cases, it is difficult to directly interpret the obtained information, due to the complexity and the large number of obtained features. Before a classifier can be applied with a reasonable hope of generalization, a small number of useful features has to be extracted.

    [Fig. 2 block diagram: Biomedical data → Preprocessing → Feature estimation → Feature extraction]

    Fig. 2. Diagnosis support system scheme for automatic detection of diseases.

    LEM-MKR is tested as the feature extraction stage, using a linear kernel to find the local and the class membership relationships among samples in (2) and (3). We compare the performance of LEM-MKR against linear and nonlinear unsupervised feature extraction methods, PCA, LLE, and LEM [3], [14], and against the supervised NLDR techniques α-LLE and C-LLE [4], [12]. The dimension of the embedding space $m$ is fixed looking for 95% of expected local variability [7]. The $k$ value for the NLDR algorithms is chosen according to [10], in which a specific number of neighbors for each sample is computed. The $\alpha$ parameter in α-LLE is selected from the set $\{0, 0.2, \ldots, 1\}$ according to the training error. Two straightforward classifiers are tested: the linear discriminant classifier (ldc) and the k-nearest neighbors classifier (knnc). A 10-fold cross-validation scheme is employed to determine the performance of the system. The number of neighbors for knnc is optimized with respect to the leave-one-out error of the training set.
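
    As an illustration of this evaluation protocol (a simplification, not the authors' code: in the paper the embedding and the knnc neighbor count are tuned on the training data only), a scikit-learn sketch over the embedded features could look as follows.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_embedding(Y_emb, c, n_neighbors=5):
    """10-fold cross-validated accuracy of ldc and knnc on Y_emb."""
    accs = {'ldc': [], 'knnc': []}
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True).split(Y_emb, c):
        for name, clf in (('ldc', LinearDiscriminantAnalysis()),
                          ('knnc', KNeighborsClassifier(n_neighbors))):
            clf.fit(Y_emb[tr], c[tr])
            accs[name].append(clf.score(Y_emb[te], c[te]))
    return {name: (np.mean(a), np.std(a)) for name, a in accs.items()}
```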

    Four biomedical databases are tested. The first contains voice records of children with and without Cleft Lip and Palate (CLP). The main goal is to detect and characterize the presence of hypernasality in the speech registers. This dataset is provided by the Signal Processing and Recognition Group (SPRG) of the Universidad Nacional de Colombia, Manizales. The database holds 266 voice records from children between 5 and 15 years old, who uttered the Spanish vowels. There are 110 children labeled by a phoniatry expert as healthy, and 156 labeled as hypernasal. We apply the preprocessing and feature estimation stages to the voice records according to [11], yielding 147 acoustic features for each sample.

    The second database is the phonocardiographic database (PCG), also provided by SPRG, which is composed of 35 adult subjects (16 normal and 19 with murmur). Eight recordings were taken from each patient, corresponding to the four traditional foci of auscultation (mitral, tricuspid, aortic, and pulmonary areas) in the phases of post-expiratory and post-inspiratory apnea. The signals were digitized at 44.1 kHz with 16 bits per sample. Furthermore, in order to select beats without artifacts and other types of noise that can degrade the performance of the algorithms, a visual and audible inspection was carried out by cardiologists. Thus, 548 individual beats were extracted, 274 for each class, using an R-peak detector. The recordings are characterized by means of a time-frequency representation, particularly the spectrogram obtained by a Short-Time Fourier Transform (38 frequencies and 480 instants of time; see [2] for details).
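
    A minimal sketch of this time-frequency characterization for a single beat is given below; the STFT window length and overlap are assumptions, since the exact parametrization producing the 38 × 480 grid follows [2].

```python
import numpy as np
from scipy.signal import spectrogram

def pcg_features(beat, fs=44100):
    """Spectrogram (squared-magnitude STFT) of one PCG beat."""
    f, t, S = spectrogram(beat, fs=fs, nperseg=256, noverlap=192)
    return S.ravel()  # flattened time-frequency features for the classifier
```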

    The third database is the Oxford Parkinson's disease detection dataset (PARKINSON), which is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). The target is to discriminate healthy people from those with PD. The dataset contains 195 voice recordings, which are preprocessed and characterized as in [8], obtaining 23 acoustic features for each sample.

    Finally, an EPILEPSY database is tested, which contains EEG signals of 29 patients with medically intractable focal epilepsies. They were recorded by the Department of Epileptology of the University of Bonn by means of intracranially implanted electrodes [1]. The database comprises five sets (denoted Z, O, N, F, S), each composed of 100 single-channel EEG segments, which were selected and extracted after visual inspection from continuous multichannel EEG recordings to avoid artifacts (e.g., muscular activity or eye movements). All EEG signals were recorded with a 128-channel acquisition system, using an average common reference. Data were digitized at 173.61 Hz, and time-frequency and time-varying decomposition methods are used as the feature extraction stage.

    Table I presents the dataset properties and the classification accuracy for all the studied feature extraction methods. Moreover, the PCA and LEM-MKR 2D projections are shown for CLP and EPILEPSY (Figures 3(a) and 3(b)).


    Fig. 3. Datasets visualization using PCA and LEM-MKR.

    IV. DISCUSSION


  • TABLE I
    CLASSIFICATION RESULTS (AVERAGE ACCURACY ± STANDARD DEVIATION FOR 10-FOLD CROSS VALIDATION)

    Dataset                             Classifier  Without DR   PCA          LLE          LEM          α-LLE        C-LLE        LEM-MKR
    CLP (n=238, p=147, C=2, m=3)        ldc         75.63±06.49  83.66±08.64  83.55±08.66  84.87±04.89  63.41±15.70  91.16±03.71  92.83±04.91
                                        knnc        87.39±06.62  81.97±07.30  85.27±07.74  82.77±05.82  66.38±09.31  91.63±04.81  90.33±05.06
    PCG (n=548, p=18240, C=2, m=15)     ldc         Diverges     89.59±02.88  84.51±05.00  82.85±05.76  94.35±04.87  92.69±03.98  92.13±04.28
                                        knnc        Diverges     96.53±01.61  88.15±04.27  83.38±03.34  92.90±04.12  93.99±03.63  93.59±04.36
    PARKINSONS (n=197, p=23, C=2, m=4)  ldc         77.44±06.48  75.89±04.27  77.84±06.31  74.35±03.29  89.76±05.31  86.73±04.72  88.29±05.90
                                        knnc        87.04±07.01  84.11±06.88  78.89±08.73  73.52±09.63  88.18±05.97  86.23±04.60  85.74±07.13
    EPILEPSY (n=500, p=2052, C=5, m=5)  ldc         20.00±00.00  67.20±04.24  63.60±05.87  66.00±06.04  83.60±04.88  85.60±04.40  90.80±05.18
                                        knnc        94.00±02.83  74.20±05.53  79.60±04.70  72.60±06.11  83.00±09.81  90.20±03.19  92.60±04.90

    From Figures 3(a) and 3(b), it is possible to observe that the low dimensional space found by PCA overlaps the observations, because it considers neither the local relationships nor the class membership similarities among samples. Hence, PCA is not able to unfold the underlying data structure, and its embedding is not suitable for separating the different classes. In contrast, LEM-MKR (our approach) preserves the local geometry of the original space while keeping samples of different classes apart. This behavior can be explained by the tradeoff between the local similarity and the class membership matrices, which allows the main structure of the data to be unfolded in a space of lower dimension $m$ than the original input space dimension $p$ (see Table I).

    Regarding the classification results presented in Table I, our approach attains, in most cases, a suitable classification accuracy for both kinds of classifiers, ldc and knnc. Therefore, LEM-MKR allows identifying the nonlinear structure of the input data, finding an embedding space where a classifier with a simple decision boundary (i.e., ldc) can be used. Indeed, it is important to emphasize that the proposed tradeoff selection avoids the need for manual tuning, finding a tradeoff that balances the intrinsic geometry conservation and the separability margin.

    Moreover, it can be noticed that traditional PCA and the unsupervised NLDR algorithms, LLE and LEM, do not attain reliable classification performances. On the other hand, the α-LLE algorithm, overall, seems to obtain a high classification performance when the technique is used in conjunction with a nonlinear boundary classifier. Nonetheless, as the number of classes grows, or when the underlying data structure is more complex, the classification accuracy strongly diminishes, because of the induced overtraining. Finally, the C-LLE methodology seems to be a good alternative for classification tasks. However, its lower performance on some databases in comparison with our proposal (i.e., PARKINSONS and EPILEPSY) can be explained by the lack of tuning of the required free parameter, which is not well defined in the original formulation [4].

    V. CONCLUSIONS

    A feature extraction methodology was proposed to learn the underlying data structure of biomedical information. For this purpose, a manifold learning framework based on LEM is enhanced by MKR, in order to learn both the local data structure and the class label relationships among samples. In addition, we established a scheme for the automatic selection of the required free parameters, based on a tradeoff between the contributions of the local and the class membership relationships among observations. The proposed LEM-MKR minimizes, as far as possible, the local structure and the class separability representation errors in the embedding space. Our approach was tested as the feature extraction stage for developing diagnosis support systems for the automatic detection of diseases from biomedical data. The attained results showed how our methodology unfolds the main structure of the data, mapping the original high dimensional space (original estimated features) to an embedding space where simple and straightforward classification strategies can be used to obtain a suitable system performance. Moreover, our approach is also a tool for the specialist to visually analyze the studied phenomenon. As future work, it would be interesting to test our methodology using other kinds of similarity/dissimilarity measures, according to the application of interest.

    REFERENCES

    [1] R. Andrzejak, K. Lehnertz, F. Mormann, C. Rieke, P. David, and C. Elger, "Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state," Phys. Rev. E, vol. 64, pp. 71–86, 2001.
    [2] L. D. Avendaño, G. C. Domínguez, and J. I. G. Llorente, "Feature extraction from parametric time-frequency representations for heart murmur detection," in EMBC, 2010.
    [3] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
    [4] G. Daza, G. Castellanos, and J. C. Principe, "Locally linear embedding based on correntropy measure for visualization and classification," Neurocomputing, vol. 80, pp. 19–30, 2012.
    [5] M. Gönen and E. Alpaydın, "Localized multiple kernel regression," in ICPR, 2010.
    [6] Q. Jiang, M. Jia, J. Hu, and F. Xu, "Machinery fault diagnosis using supervised manifold learning," Mechanical Systems and Signal Processing, vol. 23, pp. 2301–2311, 2009.
    [7] O. Kouropteva, O. Okun, and M. Pietikäinen, "Supervised locally linear embedding algorithm for pattern recognition," in IbPRIA, 2003.
    [8] M. Little, P. McSharry, S. Roberts, D. Costello, and I. Moroz, "Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection," BioMedical Engineering OnLine, vol. 6, pp. 1–19, 2007.
    [9] M. Loog and D. de Ridder, "Local discriminant analysis," in ICPR, 2006.
    [10] A. Alvarez, J. Valencia, G. Daza, and G. Castellanos, "Global and local choice of the number of nearest neighbors in locally linear embedding," Pattern Recognition Letters, vol. 32, no. 16, pp. 2171–2177, 2011.
    [11] J. Orozco, S. Murillo, A. Alvarez, J. Arias, E. Delgado, J. Vargas, and G. Castellanos, "Automatic selection of acoustic and non-linear dynamic features in voice," in INTERSPEECH, 2011.
    [12] M. Pillati and C. Viroli, "Supervised locally linear embedding for classification: An application to gene expression data analysis," in CLADAG, 2005.
    [13] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.
    [14] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.
