Cancer classification by Regularized Least Square Classifiers

Annarita D’Addabbo a, Rosalia Maglietta a, Sabino Liuni b, Graziano Pesole b,c and Nicola Ancona a

a) Istituto di Studi sui Sistemi Intelligenti per l’Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, Italy
b) Istituto di Tecnologie Biomediche-Sezione di Bari, CNR, Via Amendola 122/D, 70126 Bari, Italy
c) Dipartimento Scienze Biomolecolari e Biotecnologie, Università di Milano, Via Caloria 26, 20133 Milano, Italy

Abstract

Support Vector Machines (SVM) [1] are the state-of-the-art supervised learning techniques for cancer classification. Other machine learning approaches, such as Regularized Least Squares (RLS) [2] classifiers, may be a highly suitable alternative because of their simplicity and reliability. We compared the performance of RLS classifiers with that of SVM on three benchmark data sets, also with respect to the number of selected genes and to different gene selection strategies. We show that RLS classifiers perform comparably to SVM classifiers in terms of the leave-one-out (LOO) error. The main advantage of RLS machines is that solving a classification problem only requires solving a linear system whose order equals the number of training examples. Moreover, RLS machines yield an exact measure of the LOO error with just one training.
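The "linear system of order equal to the number of training examples" can be illustrated with a minimal sketch. It assumes a linear kernel, labels in {-1, +1}, no bias term, and the common convention (K + λ l I) c = y for the RLS coefficients; the function names, the value of λ and the synthetic data are illustrative assumptions, not taken from the poster.

```python
import numpy as np

def train_rls(K, y, lam):
    """Solve the l x l system (K + lam * l * I) c = y for the coefficients c.

    K is the l x l kernel matrix of the training examples. The scaling of the
    regularization term by l is an assumed convention.
    """
    l = K.shape[0]
    return np.linalg.solve(K + lam * l * np.eye(l), y)

def predict_rls(K_new_train, c):
    """Classify new examples from their kernel values against the training set."""
    return np.where(K_new_train @ c >= 0, 1, -1)

# Synthetic stand-in for expression data: 6 samples, 50 "genes".
rng = np.random.default_rng(0)
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
X = rng.normal(size=(6, 50)) + 1.5 * y[:, None]

lam = 1e-3
K = X @ X.T                      # linear kernel, 6 x 6 regardless of gene count
c = train_rls(K, y, lam)
train_pred = predict_rls(K, c)   # predictions on the training examples
```

The size of the system depends only on the number of samples, which is why the approach stays cheap for expression data with thousands of genes and a few dozen specimens.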

Benchmark data set description

• Leukemia data set [3]. 25 examples of Acute Myeloid Leukemia (AML) vs 47 examples of Acute Lymphoblastic Leukemia (ALL), divided into a training set and a test set. Each sample consists of 7129 human gene expression levels (see www.genome.wi.mit.edu/MPR).
• Colon data set [4]. 40 examples of tumor colon tissue vs 22 examples of normal colon tissue. Each sample consists of 2000 human gene expression levels (see www.molbio.princeton.edu/colondata).
• Multi-cancer data set [5]. 190 examples of cancer tissue, spanning 14 common tumor types, vs 90 examples of normal tissue. Each example consists of the expression levels of 16063 genes (see www.genome.wi.mit.edu/MPR/GCM.html).

| | SVM | RLS |
|---|---|---|
| LOO error on Leukemia training set | 2 | 2 |
| Leukemia test error | 3 | 3 |
| LOO error on Leukemia data set | 1 | 2 |
| LOO error on Colon data set | 8 | 9 |
| LOO error on Multi-Cancer data set | 88 | 90 |

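RLS computes the LOO error in just one training by using all the training examples:

$$LOO_{RLS} = \sum_{i=1}^{l} V\left(y_i,\; y_i - \frac{y_i - f_S(x_i)}{1 - G_{ii}}\right)$$

Here f_S is the RLS classifier trained on all l examples, V is the loss function, and G_{ii} is the i-th diagonal element of the matrix G built from the kernel matrix K.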
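The one-training LOO computation can be checked numerically. The sketch below assumes G = K (K + λ l I)^{-1}, a linear kernel, no bias term, and a loss V that counts misclassifications; under these assumptions the output on a held-out example is (f_S(x_i) - G_ii y_i) / (1 - G_ii). The function names and the synthetic data are illustrative, not taken from the poster.

```python
import numpy as np

def rls_loo_outputs(K, y, lam):
    """LOO outputs f^i(x_i) for all i from a single training on all l examples."""
    l = K.shape[0]
    G = K @ np.linalg.inv(K + lam * l * np.eye(l))  # f_S on the training set is G @ y
    f_S = G @ y
    g = np.diag(G)
    return (f_S - g * y) / (1.0 - g)

def rls_loo_outputs_bruteforce(K, y, lam):
    """Reference: retrain l times, each time leaving one example out."""
    l = K.shape[0]
    out = np.empty(l)
    for i in range(l):
        keep = np.arange(l) != i
        # Keep the regularization constant lam * l fixed, as in the full problem.
        A = K[np.ix_(keep, keep)] + lam * l * np.eye(l - 1)
        c = np.linalg.solve(A, y[keep])
        out[i] = K[i, keep] @ c
    return out

# Synthetic data: 10 samples, 30 "genes".
rng = np.random.default_rng(0)
y = np.array([1.0] * 5 + [-1.0] * 5)
X = rng.normal(size=(10, 30)) + 0.5 * y[:, None]
K = X @ X.T

loo_fast = rls_loo_outputs(K, y, lam=0.1)
loo_slow = rls_loo_outputs_bruteforce(K, y, lam=0.1)
loo_errors = int(np.sum(loo_fast * y <= 0))  # number of misclassified held-out examples
```

The two routes agree to numerical precision, but the first needs one matrix inversion instead of l separate trainings.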

Gene selection strategies

Two techniques are used to rank the genes, and a nonparametric permutation test is used to determine how many genes are really important for classifying a given specimen: 999 genes in the Leukemia data set, 500 in the Colon data set and 1400 in the Multi-Cancer data set.

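S2N Statistic

$$T_{S2N}(j) = \frac{\mu_{+1}(j) - \mu_{-1}(j)}{\sigma_{+1}(j) + \sigma_{-1}(j)}$$

NRFE Statistic

$$T_{NRFE}(j) = \frac{|w_j|}{\|w\|}$$

with j = 1, 2, ..., number of genes. Here \mu_{\pm 1}(j) and \sigma_{\pm 1}(j) are the mean and the standard deviation of the expression levels of gene j in each of the two classes, and w_j is the j-th component of the weight vector w of the linear classifier.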
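A minimal sketch of gene ranking by the signal-to-noise (S2N) statistic, taken here as the difference of the class means divided by the sum of the class standard deviations. The use of the sample standard deviation (ddof=1), ranking by absolute value, the function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def s2n_statistic(X, y):
    """Signal-to-noise statistic per gene.

    X: (samples, genes) expression matrix; y: labels in {-1, +1}.
    """
    pos, neg = X[y == 1], X[y == -1]
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    std_sum = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    return mean_diff / std_sum

# Synthetic data: 20 samples, 100 "genes"; gene 0 is made informative.
rng = np.random.default_rng(0)
y = np.array([1] * 10 + [-1] * 10)
X = rng.normal(size=(20, 100))
X[:, 0] += 5.0 * y

T = s2n_statistic(X, y)
ranking = np.argsort(-np.abs(T))  # most discriminative genes first
top_genes = ranking[:10]          # keep the 10 best-ranked genes
```

A classifier would then be trained on `X[:, top_genes]` only, which is how the number of selected genes can be varied.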

[Figure: Visualization of the S2N statistic on the Leukemia data set (47 ALL examples, 25 AML examples; labels HP and HN). Observed T_{S2N}(j) distribution computed on the Leukemia data set compared to randomly permuted class distinctions.]
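The comparison against randomly permuted class labels can be sketched as a per-gene permutation test. This is only one plausible reading: the poster does not give the exact procedure, so the per-gene p-value, the 0.05 threshold, the number of permutations, the function names and the synthetic data are all assumptions.

```python
import numpy as np

def s2n(X, y):
    """Signal-to-noise statistic per gene for labels in {-1, +1}."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (
        pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1))

def permutation_p_values(X, y, n_perm=200, seed=0):
    """Fraction of label permutations whose |S2N| reaches the observed value."""
    rng = np.random.default_rng(seed)
    observed = np.abs(s2n(X, y))
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        y_perm = rng.permutation(y)              # random class distinction
        exceed += np.abs(s2n(X, y_perm)) >= observed
    return (1.0 + exceed) / (1.0 + n_perm)       # add-one estimate, never exactly 0

# Synthetic data: 24 samples, 200 "genes"; the first 5 genes are informative.
rng = np.random.default_rng(1)
y = np.array([1] * 12 + [-1] * 12)
X = rng.normal(size=(24, 200))
X[:, :5] += 3.0 * y[:, None]

p = permutation_p_values(X, y)
n_selected = int(np.sum(p < 0.05))  # genes judged significant at the assumed level
```

Because no correction for multiple testing is applied in this sketch, some uninformative genes are expected to pass the threshold by chance.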

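S2N Statistic

| Leukemia: genes | SVM | RLS | Colon: genes | SVM | RLS | Multi-Cancer: genes | SVM | RLS |
|---|---|---|---|---|---|---|---|---|
| 999 | 1 | 2 | 500 | 4 | 6 | 1400 | 53 | 46 |
| 99 | 1 | 2 | 400 | 5 | 6 | 1000 | 50 | 47 |
| 49 | 1 | 1 | 300 | 5 | 6 | 500 | 52 | 42 |
| 39 | 2 | 2 | 200 | 7 | 6 | 300 | 51 | 43 |
| 29 | 2 | 2 | 100 | 8 | 7 | 200 | 50 | 45 |
| 19 | 3 | 3 | 50 | 8 | 7 | 100 | 66 | 40 |
| 9 | 1 | 1 | 10 | 8 | 8 | 50 | 56 | 37 |
| 5 | 2 | 4 | 5 | 7 | 9 | 10 | 65 | 59 |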

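NRFE Statistic

| Leukemia: genes | SVM | RLS | Colon: genes | SVM | RLS | Multi-Cancer: genes | SVM | RLS |
|---|---|---|---|---|---|---|---|---|
| 999 | 0 | 0 | 500 | 4 | 3 | 1400 | 46 | 37 |
| 99 | 0 | 0 | 400 | 4 | 3 | 1000 | 41 | 39 |
| 49 | 0 | 0 | 300 | 4 | 3 | 500 | 32 | 30 |
| 39 | 1 | 1 | 200 | 3 | 3 | 300 | 29 | 29 |
| 29 | 0 | 0 | 100 | 3 | 3 | 200 | 27 | 27 |
| 19 | 3 | 3 | 50 | 3 | 3 | 100 | 51 | 35 |
| 9 | 6 | 9 | 10 | 11 | 12 | 50 | 52 | 43 |
| 5 | 6 | 11 | 5 | 15 | 14 | 10 | 70 | 70 |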

Conclusions

RLS classifiers perform comparably to SVM classifiers on the problem of cancer classification from gene expression data, and they are a valuable alternative to SVM because they enjoy several interesting properties. RLS machines are fast and easy to implement and, more importantly, they allow the exact LOO error to be measured with a single training.

References

[1] Vapnik, V. (1998) Statistical Learning Theory, John Wiley & Sons, Inc.
[2] Tikhonov, A.N., Arsenin, V.Y. (1977) Solutions of Ill-posed Problems, W.H. Winston, Washington D.C.
[3] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 286, 531-537.
[4] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, PNAS, 96, 6745-6750.
[5] Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., Golub, T.R. (2001) Multi-class cancer diagnosis using tumor gene expression signatures, PNAS, 98, 15149-15154.