Supervised gene expression data analysis using SVMs and MLPs

Giorgio Valentini, e-mail: [email protected]


Page 1: Supervised gene expression data analysis using SVMs and MLPs

Supervised gene expression data analysis using SVMs and MLPs

Giorgio Valentini, e-mail: [email protected]

Page 2: Supervised gene expression data analysis using SVMs and MLPs

Outline

A real problem: lymphoma gene expression data analysis by machine learning methods:

• Diagnosis of tumors using a supervised approach

• Discovering groups of genes related to carcinogenic processes

• Discovering subgroups of diseases using gene expression data.

Page 3: Supervised gene expression data analysis using SVMs and MLPs

DNA microarray

DNA hybridization microarrays supply information about gene expression by measuring the mRNA levels of large numbers of genes in a cell.

They offer a snapshot of the overall functional status of a cell: virtually all differences in cell type or state are related to changes in the mRNA levels of many genes.

DNA microarrays have been used in mutational analyses, genetic mapping studies, genome monitoring of gene expression, pharmacogenomics and metabolic pathway analysis.

Page 4: Supervised gene expression data analysis using SVMs and MLPs

A DNA microarray image (E. coli)

• Each spot corresponds to the expression level of a particular gene

• Red spots correspond to overexpressed genes

• Green spots correspond to underexpressed genes

• Yellow spots correspond to intermediate levels of gene expression

Page 5: Supervised gene expression data analysis using SVMs and MLPs

Analyzing microarray data by machine learning methods

Unsupervised approach

No or limited a priori knowledge. Clustering algorithms are used to group together similar expression patterns:

• grouping sets of genes

• grouping different cells or different functional statuses of the cell.

Example: hierarchical clustering, fuzzy or possibilistic clustering, self-organizing maps.

Supervised approach

“A priori” biological and medical knowledge on the problem domain. Learning algorithms with labeled examples are used to associate gene expression data with classes:

• separating normal from cancerous tissues

• classifying different classes of cells on a functional basis

• predicting the functional class of unknown genes.

Example: multi-layer perceptrons, support vector machines, decision trees, ensembles of classifiers.

The large amount of gene expression data calls for machine learning methods to analyze DNA microarray data and extract significant knowledge from it.

Page 6: Supervised gene expression data analysis using SVMs and MLPs

A real problem: A gene expression analysis of lymphoma

Biological problems

1. Separating cancerous and normal tissues using the overall information available.

2. Identifying groups of genes specifically related to the expression of two different tumour phenotypes through expression signatures.

Machine learning methods

1. - Support Vector Machines (SVM): linear, RBF and polynomial kernels

- Multi Layer Perceptron (MLP)

- Linear Perceptron (LP)

2. Two-step method: a priori knowledge and unsupervised methods select “candidate” subgroups; SVM or MLP identify the most correlated subgroups.

Page 7: Supervised gene expression data analysis using SVMs and MLPs

The data

• Data of a specialized DNA microarray, named "Lymphochip", developed at the Stanford University School of Medicine:

96 tissue samples from normal and cancerous populations of human lymphocytes

4026 different genes preferentially expressed in lymphoid cells or with known roles in processes important in immunology or cancer

High dimensional data

Small sample size

A challenging machine learning problem

Page 8: Supervised gene expression data analysis using SVMs and MLPs

Types of lymphoma

Three main classes of lymphoma:

• Diffuse Large B-Cell Lymphoma (DLBCL)

• Follicular Lymphoma (FL)

• Chronic Lymphocytic Leukemia (CLL)

together with Transformed Cell Lines (TCL) and normal lymphoid tissues.

Type of tissue           Number of samples
Normal lymphoid cells    24
DLBCL                    46
FL                       9
CLL                      11
TCL                      6

Page 9: Supervised gene expression data analysis using SVMs and MLPs

Visualizing data with Tree View

Page 10: Supervised gene expression data analysis using SVMs and MLPs

The first problem: Separating normal from cancerous tissues.

Our first task consists in distinguishing cancerous from normal tissues using the overall information available, i.e. all the gene expression data.

From a machine learning standpoint it is a dichotomous (two-class) classification problem.

Data characteristics:

• Small sample size

• High dimension

• Missing values

• Noise

Main applicative goal:

Supporting functional-molecular diagnosis of tumors and polygenic diseases

Page 11: Supervised gene expression data analysis using SVMs and MLPs

Supervised approaches to molecular classification of diseases

Several supervised methods have been applied to the analysis of cDNA microarrays and high density oligonucleotide chips:

• Decision trees

• Fisher linear discriminant

• Multi-Layer Perceptrons

• Nearest-Neighbours classifiers

• Linear discriminant analysis

• Parzen windows

• Support Vector Machines

Proposed by different authors:

Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001), Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001), Dudoit et al. (2002).

Page 12: Supervised gene expression data analysis using SVMs and MLPs

Why use Support Vector Machines?

“General” motivations

• SVMs are two-class classifiers theoretically founded on Vapnik's Statistical Learning Theory.

• They act as linear classifiers in a high dimensional feature space obtained by a projection of the original input space.

• The resulting classifier is in general non linear in the input space.

• SVMs achieve good generalization performance by maximizing the margin between the classes.

• The SVM learning algorithm has no local minima.

“Specific” motivations

• Kernels are well-suited to working with high dimensional data.

• Small sample sizes require algorithms with good generalization capabilities.

• Automatic diagnosis of tumors requires high sensitivity and very effective classifiers.

• SVMs can identify mis-labeled data (i.e. incorrect diagnoses).

• We could design specific kernels to incorporate “a priori” knowledge about the problem.

Page 13: Supervised gene expression data analysis using SVMs and MLPs

SVM to classify cancerous and normal cells

We consider 3 standard SVM kernels:

• Gaussian

• Polynomial

• Dot-product

Varying:

• Values of the kernel parameters

• The regularization factor C
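As a minimal illustration of the three kernels listed above, the following sketch writes each one as a plain Python function. The parameter names (`degree`, `coef0`, `sigma`) and their default values are assumptions made for this example; the slides do not state which parameterization or values were used.

```python
import math


def dot_kernel(x, y):
    """Dot-product (linear) kernel: K(x, y) = <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: K(x, y) = (<x, y> + coef0) ** degree.

    degree and coef0 are illustrative defaults, not values from the slides.
    """
    return (dot_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    sigma is an illustrative default, not a value from the slides.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

Varying the kernel parameters then amounts to re-running the training with different values of `degree` or `sigma`, alongside different values of the regularization factor C.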

Estimation of the generalization error through:

• 10-fold cross-validation

• leave-one-out

Comparing them with:

• MLP

• LP

Varying:

• Number of hidden units

• Backpropagation parameters
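The two error estimation schemes named on this slide, 10-fold cross-validation and leave-one-out, differ only in how the samples are split. The sketch below generates the index splits and counts test errors; `fit_predict` is a hypothetical callback standing in for whichever learning machine is being evaluated, and the folds are contiguous and unshuffled for simplicity, which is an assumption rather than the procedure used in the study.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold cross-validation with k equal to the sample count."""
    return k_fold_splits(n_samples, n_samples)


def cross_validated_error(labels, splits, fit_predict):
    """Fraction of samples misclassified when each is predicted while held out.

    fit_predict(train_indices, test_indices) is a hypothetical callback that
    trains on the training indices and returns one prediction per test index.
    """
    errors = 0
    for train, test in splits:
        predictions = fit_predict(train, test)
        errors += sum(1 for i, p in zip(test, predictions) if p != labels[i])
    return errors / len(labels)
```

For example, with 72 cancerous and 24 normal samples (the split assumed here from the sample counts given earlier), a classifier that always answers "cancerous" gets 24 of 96 samples wrong under either scheme.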

Page 14: Supervised gene expression data analysis using SVMs and MLPs

Results

Learning machine model   Gen. error   St. dev.   Prec.    Sens.
SVM-linear               1.04         3.16       98.63    100.0
SVM-poly                 4.17         5.46       94.74    100.0
SVM-RBF                  25.00        4.48       75.00    100.0
MLP                      2.08         4.45       98.61    98.61
LP                       9.38         10.24      95.65    91.66

• 10-fold cross-validation and leave-one-out give approximately the same error estimates.

• SVM-linear achieves the best results.

• Sensitivity is high no matter what type of kernel function is used.

• The radial basis SVM shows a high misclassification rate and a high estimated VC dimension.
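The error, precision and sensitivity columns of the results table can be read as simple ratios of confusion-matrix counts. The sketch below assumes that cancerous tissue is the positive class and that the 96 samples split into 72 cancerous and 24 normal. The counts passed in are an inference that reproduces the reported SVM-linear percentages; they are not figures taken from the slides.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return error, precision and sensitivity as percentages.

    tp/fn count positive (assumed: cancerous) samples classified correctly
    or incorrectly; tn/fp do the same for negative (assumed: normal) samples.
    """
    total = tp + fp + fn + tn
    return {
        "error": 100.0 * (fp + fn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
    }


# Hypothetical counts: all 72 cancerous samples detected, one normal sample
# misclassified as cancerous.
linear_svm = classification_metrics(tp=72, fp=1, fn=0, tn=23)
print({name: round(value, 2) for name, value in linear_svm.items()})
```

Under the same assumptions, a classifier that labels every sample as cancerous (tp=72, fp=24, fn=0, tn=0) gives 25% error, 75% precision and 100% sensitivity, which is the pattern reported for SVM-RBF.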

Page 15: Supervised gene expression data analysis using SVMs and MLPs

ROC analysis

• The ROC curve of the SVM-linear is ideal.

• The polynomial SVM also achieves a reasonably good ROC curve.

• The SVM-RBF shows a diagonal ROC curve: the highest sensitivity is achieved only when it completely fails to detect normal cells.

• The ROC curve of the MLP is also nearly optimal

• Linear perceptron shows a worse ROC curve, but with reasonable values lying on the highest and leftmost part of the ROC plane.
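To make the "ideal" and "diagonal" ROC curves described above concrete, the following sketch computes ROC points from classifier scores by sweeping the decision threshold. The scores and labels in the examples are made up for illustration (1 is assumed to mean cancerous, 0 normal); they are not outputs of the classifiers in the study.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    labels are 1 for positive and 0 for negative samples; a higher score means
    the sample is considered more likely to be positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of ROC points ordered by increasing FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores: perfectly separated classes versus uninformative scores.
ideal = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
diagonal = roc_points([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

An ideal curve passes through the top-left corner (area 1.0), whereas scores that carry no class information give only the end points of the diagonal (area 0.5).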

Page 16: Supervised gene expression data analysis using SVMs and MLPs

Summary of the results on the first problem

Using hierarchical clustering, 14.6% of the examples are misclassified (Alizadeh, 2000), against 1.04% for the linear SVM, 2.08% for the MLP and 9.38% for the LP.

Supervised methods exploit a priori biological knowledge (i.e. labeled data), while clustering methods use only gene expression data to group together different tissues, without any labeled data.

Linear SVMs achieve the best results, but the MLP and the 2nd degree polynomial SVM also show a relatively low generalization error.

Linear SVMs and MLPs can be used to build classifiers with high sensitivity and a low rate of false positives.

These results must be considered with caution because the size of the available data set is too small to infer general statements about the performances of the proposed learning machines.

Page 17: Supervised gene expression data analysis using SVMs and MLPs

The second problem: Identifying DLBCL subgroups

It starts from a hypothesis of Alizadeh et al. about the existence of two distinct functional types of lymphoma inside DLBCL.

Actually, we consider two problems:

1. Validation of Alizadeh’s hypothesis

• They identified two molecularly distinct subgroups of DLBCL: germinal centre B-like (GCB-like) and activated B-like (AB-like).

• These two classes correspond to patients with very different prognoses.

2. Finding the groups of genes most related to this separation

Different subsets of genes could be responsible for the distinction between these two DLBCL subgroups: the expression signatures Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).

Page 18: Supervised gene expression data analysis using SVMs and MLPs

A feature selection approach based on “a priori” knowledge

Finding the most correlated genes means searching over an exponential number of gene subsets ($2^n - 1$), where n is usually of the order of thousands.
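A short calculation shows why exhaustive search is out of reach. The sketch below counts the non-empty subsets of n genes, checks the formula against explicit enumeration for a tiny case, and then reports how many decimal digits the count has for the 4026 genes of this data set. The gene names in the small example are placeholders.

```python
from itertools import combinations


def count_nonempty_subsets(n):
    """Number of non-empty subsets of n items: 2^n - 1."""
    return 2 ** n - 1


def enumerate_nonempty_subsets(items):
    """Explicitly list every non-empty subset (feasible only for tiny inputs)."""
    return [subset
            for size in range(1, len(items) + 1)
            for subset in combinations(items, size)]


# Placeholder gene names, used only to check the formula on a small case.
small = enumerate_nonempty_subsets(["g1", "g2", "g3"])
print(len(small), count_nonempty_subsets(3))

# For the full data set the count is far too large to enumerate.
print(len(str(count_nonempty_subsets(4026))), "decimal digits")
```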

We need greedy algorithms and heuristic methods.

Can we exploit “a priori” biological knowledge about the problem ?

Page 19: Supervised gene expression data analysis using SVMs and MLPs

A heuristic method (1)

A two-stage approach:

I. Select groups of coordinately expressed genes.

II. Identify among them the ones most correlated with the disease.

• We do not consider single genes.

• We consider only groups of coordinately expressed genes.

Page 20: Supervised gene expression data analysis using SVMs and MLPs

A heuristic method (2)

I. Selecting groups of coordinately expressed genes:

• Use “a priori” biological and medical knowledge about groups of genes with known or suspected roles in carcinogenic processes

And/or

• Use unsupervised methods such as clustering algorithms to identify coordinately expressed sets of genes

II. Identify subgroups of genes most related to the disease:

1. Train a set of classifiers using only the subgroups of genes selected in the first stage.

2. Evaluate and rank the performance of the trained classifiers.

3. Select the subgroups whose corresponding classifiers achieve the best ranking.
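The second stage described above (train one classifier per candidate subgroup, rank by estimated error, keep the best) can be sketched as follows. To stay self-contained, this example swaps in a nearest-centroid classifier with leave-one-out error in place of the SVM, MLP and LP used in the study. The expression values, group names and column indices are invented toy data.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leave_one_out_error(samples, labels):
    """Leave-one-out error of a nearest-centroid classifier (stand-in model)."""
    errors = 0
    for i, held_out in enumerate(samples):
        train = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        centroids = {cls: centroid([s for s, l in train if l == cls])
                     for cls in {l for _, l in train}}
        predicted = min(centroids, key=lambda c: squared_distance(held_out, centroids[c]))
        if predicted != labels[i]:
            errors += 1
    return errors / len(samples)


def rank_gene_groups(expression, labels, gene_groups):
    """Return (error, group_name) pairs sorted from best to worst.

    expression: one row of gene values per sample.
    gene_groups: group name -> list of column indices (the candidate subgroup).
    """
    ranking = []
    for name, columns in gene_groups.items():
        restricted = [[row[c] for c in columns] for row in expression]
        ranking.append((leave_one_out_error(restricted, labels), name))
    return sorted(ranking)


# Invented toy data: columns 0-1 separate the classes, columns 2-3 do not.
expression = [
    [2.0, 1.8, 0.1, -0.3], [1.9, 2.1, -0.2, 0.4], [2.2, 2.0, 0.3, 0.0],
    [-2.0, -1.9, 0.2, -0.1], [-1.8, -2.2, -0.1, 0.3], [-2.1, -2.0, 0.0, -0.4],
]
labels = ["GCB-like"] * 3 + ["AB-like"] * 3
ranking = rank_gene_groups(expression, labels, {"informative": [0, 1], "noise": [2, 3]})
```

The subgroup at the head of `ranking` is the one that would be selected in step 3; any classifier with a train-and-predict interface could replace the stand-in model.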

Page 21: Supervised gene expression data analysis using SVMs and MLPs

Applying the heuristic method

1. Selecting “candidate” subgroups of genes: we used biological knowledge and hierarchical clustering algorithms to select four subgroups:

• Proliferation: sets of genes involved in the biological process of proliferation

• T-cell: genes preferentially expressed in T-cells

• Lymphnode: sets of genes normally expressed in lymph nodes

• GCB: genes that distinguish germinal centre B-cells from other stages in B-cell ontogeny

2. Identifying subgroups of genes most related to the GCB-like / AB-like separation:

• Training of SVM, MLP and LP as classifiers using each subgroup of genes and all the subgroups together (All): 5 classification tasks

• Leave-one-out estimation with gaussian, polynomial and linear SVM

• 10-fold cross-validation with gaussian, polynomial and linear SVM, MLP and LP.

Page 22: Supervised gene expression data analysis using SVMs and MLPs

GCB signature

Learn. machine model   Gen. error   St. dev.   Prec.    Sens.
SVM-linear             10.50        11.16      90.00    90.00
SVM-poly               8.70         14.54      96.67    88.33
SVM-RBF                4.50         9.55       100.0    90.00
MLP                    8.70         10.50      90.90    90.90
LP                     8.70         10.50      90.90    90.90

All signatures

Learn. machine model   Gen. error   St. dev.   Prec.    Sens.
SVM-linear             15.00        11.16      85.00    85.00
SVM-poly               14.00        18.97      93.33    76.67
SVM-RBF                10.00        10.54      100.00   76.67
MLP                    8.70         13.28      95.00    86.36
LP                     10.87        14.28      86.96    90.90

Page 23: Supervised gene expression data analysis using SVMs and MLPs

Results

Page 24: Supervised gene expression data analysis using SVMs and MLPs

The second problem: summary

• The results support the hypothesis of Alizadeh about the existence of two distinct subgroups in DLBCL.

• The heuristic method identifies the GCB signature as a cluster of coordinately expressed genes related to the separation between the GCB-like and AB-like DLBCL subgroups.

Page 25: Supervised gene expression data analysis using SVMs and MLPs

Developments

I. Methods to discover subclasses of tumors on a molecular basis.

Integrating “a priori” biological knowledge, supervised machine learning methods and unsupervised clustering methods

Stratifying patients into molecularly relevant categories, enhancing the discrimination power and precision of clinical trials

New perspectives on the development of new cancer therapeutics based on a molecular understanding of the cancer phenotype.

II. Methods to identify small subsets of genes correlated to tumors

- Refinements of the proposed heuristic method using clustering algorithms with semi-automatic selection of the number of the significant subgroups of genes.

- Greedy algorithms based on mutual information measures.

Enhancing biological knowledge about tumoral processes

Automatic diagnosis of tumors using DNA microchips

Discovery of new subclasses of tumors