Cancer classification by Regularized Least Square Classifiers

Annarita D’Addabbo a, Rosalia Maglietta a, Sabino Liuni b, Graziano Pesole b,c and Nicola Ancona a

a) Istituto di Studi sui Sistemi Intelligenti per l’Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, Italy
b) Istituto di Tecnologie Biomediche-Sezione di Bari, CNR, Via Amendola 122/D, 70126 Bari, Italy
c) Dipartimento Scienze Biomolecolari e Biotecnologie, Università di Milano, Via Caloria 26, 20133 Milano, Italy

Abstract
Support Vector Machines (SVM) [1] are the state-of-the-art supervised learning techniques for cancer classification. Other machine learning approaches, such as Regularized Least Squares (RLS) [2] classifiers, may represent a highly suitable alternative because of their simplicity and reliability. We compared the performance of RLS classifiers with that of SVM on three different benchmark data sets, also with respect to the number of selected genes and to different gene selection strategies. We show that RLS classifiers perform comparably to SVM classifiers in terms of the leave-one-out (LOO) error. The main advantage of RLS machines is that, to solve a classification problem, they only require solving a linear system of order equal to the number of training examples. Moreover, RLS machines yield an exact measure of the LOO error with just one training.
Benchmark Data set description
• Leukemia data set [3]. 25 examples of Acute Myeloid Leukemia (AML) vs 47 examples of Acute Lymphoblastic Leukemia (ALL), divided into a training set and a test set. Each sample consists of 7129 human gene expression levels (see www.genome.wi.mit.edu/MPR).
• Colon data set [4]. 40 examples of Tumor Colon tissue vs 22 Normal Colon tissue samples. Each sample consists of 2000 human gene expression levels (see www.molbio.princeton.edu/colondata).
• Multi-cancer data set [5]. 190 examples of Cancer tissues, spanning 14 common tumor types, vs 90 Normal tissue samples. Each example consists of the expression levels of 16063 genes (see www.genome.wi.mit.edu/MPR/GCM.html).
| Measure | SVM | RLS |
|---|---|---|
| LOO error on Leukemia training set | 2 | 2 |
| Leukemia test error | 3 | 3 |
| LOO error on Leukemia data set | 1 | 2 |
| LOO error on Colon data set | 8 | 9 |
| LOO error on Multi-Cancer data set | 88 | 90 |
RLS computes the LOO error with just one training, using all the training examples:
$$
\mathrm{LOO}_{RLS} = \sum_{i=1}^{l} V\left( y_i,\; y_i - \frac{y_i - f_S(x_i)}{1 - (KG)_{ii}} \right)
$$
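The poster does not define G or spell out the training system. The sketch below assumes the common RLS convention in which the coefficients c solve (K + λl·I)c = y and G = (K + λl·I)^{-1}, so that KG maps the labels to the fitted values f_S(x_i). Under that assumption, a small pure-Python example with a linear kernel can check that the closed-form leave-one-out values agree with explicit retraining. The function names, the toy data, the choice of V as a misclassification count, and the `reg` parameter (standing for λl and held fixed when a point is left out) are illustrative and not taken from the poster.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit(X, y, reg):
    """Assumed RLS training step: solve (K + reg*I) c = y, linear kernel."""
    l = len(X)
    K = [[dot(a, b) for b in X] for a in X]
    A = [[K[i][j] + (reg if i == j else 0.0) for j in range(l)] for i in range(l)]
    return solve(A, y), K, A

def loo_closed_form(X, y, reg):
    """LOO predictions from one training: y_i - (y_i - f_S(x_i)) / (1 - (KG)_ii)."""
    c, K, A = fit(X, y, reg)
    out = []
    for i in range(len(X)):
        f_i = dot(K[i], c)            # f_S(x_i)
        # (KG)_ii with G = A^{-1}; K is symmetric, so row i equals column i.
        # A matrix library would factor or invert A once instead of re-solving.
        h_ii = solve(A, K[i])[i]
        out.append(y[i] - (y[i] - f_i) / (1.0 - h_ii))
    return out

def loo_retrain(X, y, reg):
    """Reference: retrain l times, each time predicting the left-out point."""
    out = []
    for i in range(len(X)):
        X_r, y_r = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        c, _, _ = fit(X_r, y_r, reg)
        out.append(sum(cj * dot(xj, X[i]) for cj, xj in zip(c, X_r)))
    return out

# Toy data (hypothetical): 6 samples, 3 features, labels +1 / -1.
X = [[2.0, 1.0, 0.5], [1.5, 2.0, 0.0], [1.0, 1.2, 0.3],
     [-1.0, -0.5, 0.2], [-2.0, -1.0, 0.1], [-0.5, -1.5, -0.3]]
y = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

closed = loo_closed_form(X, y, reg=0.1)
brute = loo_retrain(X, y, reg=0.1)
# LOO error with V taken as a misclassification indicator (an assumption).
loo_errors = sum(1 for yi, fi in zip(y, closed) if yi * fi <= 0)
```

The two lists `closed` and `brute` should agree up to floating-point error, which is the sense in which a single training gives the exact LOO error. The agreement relies on keeping the ridge term `reg` unchanged when a point is removed.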
GENE SELECTION strategies
Two techniques are used to rank the genes, and a nonparametric permutation test is used to determine how many genes are really important for classifying a given specimen: 999 genes in the Leukemia data set, 500 in the Colon data set and 1400 in the Multi-Cancer data set.
S2N Statistic:

$$
T_{S2N}(j) = \frac{\mu_{+1}(j) - \mu_{-1}(j)}{\sigma_{+1}(j) + \sigma_{-1}(j)}
$$
NRFE Statistic:

$$
T_w(j) = w_j
$$

with j = 1, 2, …, number of genes
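As an illustration of the S2N statistic, the sketch below computes the difference of the per-class means of each gene divided by the sum of the per-class standard deviations, then ranks the genes. It assumes class labels coded as +1 and -1, the sample standard deviation, and a ranking by absolute value. None of these details is stated on the poster, and the function names and toy expression values are hypothetical.

```python
from statistics import mean, stdev

def s2n_scores(expr, labels):
    """Per-gene S2N: (mean_pos - mean_neg) / (std_pos + std_neg).

    expr   : list of samples, each a list of gene expression levels
    labels : list of +1 / -1 class labels (assumed coding)
    """
    scores = []
    for j in range(len(expr[0])):
        pos = [x[j] for x, t in zip(expr, labels) if t == 1]
        neg = [x[j] for x, t in zip(expr, labels) if t == -1]
        # Sample standard deviation is an assumption of this sketch.
        scores.append((mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg)))
    return scores

def rank_genes(scores):
    """Gene indices sorted by decreasing |score| (assumed ranking rule)."""
    return sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)

# Toy expression matrix (hypothetical): 6 samples, 3 genes.
expr = [[10.0, 5.0, 1.0], [11.0, 4.0, 2.0], [12.0, 6.0, 1.5],
        [1.0, 5.0, 1.2], [2.0, 6.0, 1.8], [3.0, 4.0, 1.4]]
labels = [1, 1, 1, -1, -1, -1]

scores = s2n_scores(expr, labels)
ranking = rank_genes(scores)
```

In this toy example gene 0 separates the two classes and is ranked first, while gene 1 has identical class means and is ranked last.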
[Figure: Visualization of the S2N statistic. 47 ALL examples, 25 AML examples; labels HP and HN.]
Observed $T_{S2N}(j)$ distribution computed on the Leukemia data set, compared with the distribution obtained from randomly permuted class distinctions.
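The poster does not give the details of the permutation test, so the following is only one plausible way to compare observed statistics with those from permuted class labels. It assumes that null values are pooled over all genes and permutations, and that a gene counts as important when its observed |S2N| exceeds the (1 - alpha) quantile of the pooled null values. The function names, `n_perm`, `alpha`, the seeds and the synthetic data are all hypothetical.

```python
import random
from statistics import mean, stdev

def s2n(expr, labels):
    """Per-gene S2N statistic for +1 / -1 labels (sample standard deviation)."""
    out = []
    for j in range(len(expr[0])):
        pos = [x[j] for x, t in zip(expr, labels) if t == 1]
        neg = [x[j] for x, t in zip(expr, labels) if t == -1]
        out.append((mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg)))
    return out

def count_significant_genes(expr, labels, n_perm=200, alpha=0.05, seed=0):
    """Count genes whose |S2N| exceeds a pooled permutation threshold."""
    rng = random.Random(seed)
    observed = [abs(t) for t in s2n(expr, labels)]
    null = []
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random class assignment, class sizes preserved
        null.extend(abs(t) for t in s2n(expr, shuffled))
    null.sort()
    # Empirical (1 - alpha) quantile of the pooled null statistics.
    threshold = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
    n_sig = sum(1 for t in observed if t > threshold)
    return n_sig, threshold

# Synthetic data (hypothetical): 20 samples, 30 genes, first 3 informative.
data_rng = random.Random(1)
labels = [1] * 10 + [-1] * 10
expr = [[data_rng.gauss(0.0, 1.0) + (5.0 if (j < 3 and t == 1) else 0.0)
         for j in range(30)] for t in labels]

n_sig, threshold = count_significant_genes(expr, labels, n_perm=50, seed=0)
```

With the strong class shift placed on the first three synthetic genes, their observed statistics lie well above the permutation threshold. Uninformative genes exceed it only at roughly the rate set by `alpha`.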
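S2N Statistic

| Leukemia: genes | Leukemia: SVM | Leukemia: RLS | Colon: genes | Colon: SVM | Colon: RLS | Multi-Cancer: genes | Multi-Cancer: SVM | Multi-Cancer: RLS |
|---|---|---|---|---|---|---|---|---|
| 999 | 1 | 2 | 500 | 4 | 6 | 1400 | 53 | 46 |
| 99 | 1 | 2 | 400 | 5 | 6 | 1000 | 50 | 47 |
| 49 | 1 | 1 | 300 | 5 | 6 | 500 | 52 | 42 |
| 39 | 2 | 2 | 200 | 7 | 6 | 300 | 51 | 43 |
| 29 | 2 | 2 | 100 | 8 | 7 | 200 | 50 | 45 |
| 19 | 3 | 3 | 50 | 8 | 7 | 100 | 66 | 40 |
| 9 | 1 | 1 | 10 | 8 | 8 | 50 | 56 | 37 |
| 5 | 2 | 4 | 5 | 7 | 9 | 10 | 65 | 59 |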
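NRFE Statistic

| Leukemia: genes | Leukemia: SVM | Leukemia: RLS | Colon: genes | Colon: SVM | Colon: RLS | Multi-Cancer: genes | Multi-Cancer: SVM | Multi-Cancer: RLS |
|---|---|---|---|---|---|---|---|---|
| 999 | 0 | 0 | 500 | 4 | 3 | 1400 | 46 | 37 |
| 99 | 0 | 0 | 400 | 4 | 3 | 1000 | 41 | 39 |
| 49 | 0 | 0 | 300 | 4 | 3 | 500 | 32 | 30 |
| 39 | 1 | 1 | 200 | 3 | 3 | 300 | 29 | 29 |
| 29 | 0 | 0 | 100 | 3 | 3 | 200 | 27 | 27 |
| 19 | 3 | 3 | 50 | 3 | 3 | 100 | 51 | 35 |
| 9 | 6 | 9 | 10 | 11 | 12 | 50 | 52 | 43 |
| 5 | 6 | 11 | 5 | 15 | 14 | 10 | 70 | 70 |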
Conclusions
RLS classifiers perform comparably to SVM classifiers on the problem of cancer classification from gene expression data, and they are a valuable alternative to SVM because they enjoy several interesting properties. RLS machines are fast and easy to implement and, more importantly, they allow the exact LOO error to be measured with a single training.

References
[1] Vapnik, V. (1998) Statistical Learning Theory, John Wiley & Sons, Inc.
[2] Tikhonov, A.N., Arsenin, V.Y. (1977) Solutions of Ill-Posed Problems, W.H. Winston, Washington D.C.
[3] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 286, 531-537.
[4] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, PNAS, 96, 6745-6750.
[5] Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., Golub, T.R. (2001) Multi-class cancer diagnosis using tumor gene expression signatures, PNAS, 98, 15149-15154.