Classification of Cancer Patients

Classification (Supervised Clustering)

Naomi Altman

Nov '06

Objective

Starting from a sample from known groups:

1) Select a set of genes that identify the groups

2) Compute a function of the expression values that can be used to classify a new sample.

e.g. Normal and cancer prostate tissues from 24 patients

1) a) Find the set of genes that may be involved in the disease process (differential expression analysis).

b) Find a set of genes that mark the disease (possibly not all the genes involved.)

2) Take a sample from a new patient - does this person have prostate cancer?

The Main Picture for Linear Discriminant Analysis

separating hyperplane

linear discrimination direction

To classify a new point, see what side of the hyperplane it lies on

The Main Picture for Support Vector Machines

separating hyperplane

To classify a new point, see what side of the hyperplane it lies on

The Main Picture for Quadratic Discriminant Analysis

To classify a new point, see what side of the hypercurve it lies on

The Main Picture for Recursive Partitioning

To classify a new point, see the classification of its partition in space

Linear and Quadratic Discriminant Analysis, Logistic Regression

Each sample belongs to A or B.

Linear and quadratic discriminant analysis are essentially regressions on a 0/1 indicator variable.

Suppose we have samples of sizes m from A and n from B. Sample ts (t = group, s = sample within group) has gene expression values $Y_{1ts}, \ldots, Y_{Gts}$.

For each group we can compute the mean expression value of each gene, $\bar{Y}_{it}$ (collected in the vector $\bar{Y}_t$), the variance of each gene, $s_{it}^2$, and the covariance between genes, $s_{ijt}$.

We can also compute the pooled variance and covariance of each gene, which is essentially the average over the 2 groups.
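The group summaries and the pooled matrix can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `group_summaries`, the rows-are-samples layout, and the toy numbers are assumptions, and the pooling weights each group by its degrees of freedom (a common convention that reduces to a plain average when the groups are the same size).

```python
import numpy as np

def group_summaries(Y_A, Y_B):
    """Return the two group mean vectors and the pooled covariance matrix.

    Y_A has shape (m, G) and Y_B has shape (n, G): rows are samples and
    columns are genes (this layout is an assumption of the sketch).
    """
    m, n = Y_A.shape[0], Y_B.shape[0]
    mean_A = Y_A.mean(axis=0)
    mean_B = Y_B.mean(axis=0)
    S_A = np.cov(Y_A, rowvar=False)  # within-group variances and covariances
    S_B = np.cov(Y_B, rowvar=False)
    # Pool the two matrices, weighting by degrees of freedom (assumed convention).
    S = ((m - 1) * S_A + (n - 1) * S_B) / (m + n - 2)
    return mean_A, mean_B, S

# Toy data: 3 samples from A, 4 samples from B, 2 genes (made-up values).
Y_A = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]])
Y_B = np.array([[5.0, 1.0], [6.0, 2.0], [7.0, 0.0], [6.0, 1.0]])
mean_A, mean_B, S = group_summaries(Y_A, Y_B)
```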

Linear Discriminant Analysis

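$$(\bar{Y}_A - \bar{Y}_B)' S^{-1} Y_{*s}$$

is the linear discriminant function, where $S$ is the pooled variance matrix and $Y_{*s}$ is the vector of expression values for sample s.

In the simplest case, we classify each sample depending on whether it is above (A) or below (B) the midpoint of the line, which is

$$\frac{1}{2} (\bar{Y}_A - \bar{Y}_B)' S^{-1} (\bar{Y}_A + \bar{Y}_B)$$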

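If the 2 conditions are not equally likely, we may wish to weight so that we classify new samples proportionally to the expected percentages. With prior probabilities $\pi_1$ for A and $\pi_2$ for B, classify as A when

$$(\bar{Y}_A - \bar{Y}_B)' S^{-1} Y_{*s} \ge \frac{1}{2} (\bar{Y}_A - \bar{Y}_B)' S^{-1} (\bar{Y}_A + \bar{Y}_B) - \ln(\pi_1/\pi_2)$$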
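A minimal sketch of the two-group rule, assuming rows are samples and columns are genes. The function names, the toy data, and the `prior_A` / `prior_B` parameters are illustrative assumptions; the prior adjustment used here is the standard one, which shifts the midpoint by ln(prior_B / prior_A).

```python
import numpy as np

def lda_two_group(Y_A, Y_B, prior_A=0.5, prior_B=0.5):
    """Return the discriminant direction w and the classification threshold."""
    m, n = len(Y_A), len(Y_B)
    mean_A, mean_B = Y_A.mean(axis=0), Y_B.mean(axis=0)
    # Pooled covariance matrix S, weighted by degrees of freedom.
    S = ((m - 1) * np.cov(Y_A, rowvar=False)
         + (n - 1) * np.cov(Y_B, rowvar=False)) / (m + n - 2)
    # w = S^{-1} (mean_A - mean_B); solve() avoids forming the inverse.
    w = np.linalg.solve(S, mean_A - mean_B)
    # Midpoint of the projected means, shifted by the log prior ratio.
    threshold = 0.5 * w @ (mean_A + mean_B) + np.log(prior_B / prior_A)
    return w, threshold

def classify(y_new, w, threshold):
    """Assign A if the discriminant score is above the threshold, else B."""
    return "A" if w @ y_new > threshold else "B"

# Toy data (made-up values): 2 genes, 3 samples from A and 4 from B.
Y_A = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]])
Y_B = np.array([[5.0, 1.0], [6.0, 2.0], [7.0, 0.0], [6.0, 1.0]])
w, threshold = lda_two_group(Y_A, Y_B)
w_p, threshold_p = lda_two_group(Y_A, Y_B, prior_A=0.1, prior_B=0.9)
```

With equal priors the threshold is the midpoint; making B more likely a priori raises the threshold, so a borderline sample such as `[4.0, 2.4]` moves from A to B.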

Linear Discriminant Analysis

This is extended to p groups by considering the discriminant score, which is based on another SVD and is similar to multivariate ANOVA.

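1. Consider the covariance matrix of the sample means weighted by the sample sizes.

between variance: $\sum_t n_t (\bar{Y}_{it} - \bar{Y}_i)^2 / (p-1)$

between covariance: $\sum_t n_t (\bar{Y}_{it} - \bar{Y}_i)(\bar{Y}_{jt} - \bar{Y}_j) / (p-1)$

Here $n_t$ is the size of group t and $\bar{Y}_i$ is the overall mean of gene i. Assemble these into the Between group variance matrix B.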

2. Consider the pooled covariance matrix S, (which in this context is often called W for Within group variance matrix).

Now consider the SVD of $S^{-1/2} B S^{-1/2}$. (It is symmetric, so the left and right eigenvectors are the same.)

The first eigenvector is the direction of greatest separation of the means, in terms of the axes of the ellipses defining the groups.

The 2nd eigenvector is the direction of 2nd greatest separation that is orthogonal to the first. etc.

The rank of B is at most p-1, so there are at most p-1 non-zero eigenvalues.

Each sample is assigned to the group with nearest mean in the eigenvector coordinates.

This is equivalent to looking at the combinations of the pairwise discriminant functions and mapping every sample to the group with the nearest mean.
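The p-group procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `fit_lda` and `predict` and the toy data are hypothetical, rows are samples, and the symmetric matrix is decomposed with an eigendecomposition (for a symmetric positive semi-definite matrix this gives the same vectors as the SVD).

```python
import numpy as np

def fit_lda(groups):
    """groups: list of (n_t, G) arrays, one per group. Returns the projection
    onto the discriminant coordinates and the group means in those coordinates."""
    p = len(groups)
    sizes = np.array([len(Y) for Y in groups])
    N = sizes.sum()
    means = np.array([Y.mean(axis=0) for Y in groups])
    grand = np.vstack(groups).mean(axis=0)
    # Pooled within-group covariance matrix S (also called W).
    S = sum((len(Y) - 1) * np.cov(Y, rowvar=False) for Y in groups) / (N - p)
    # Between-group matrix B: size-weighted covariance of the group means.
    D = means - grand
    B = (D.T * sizes) @ D / (p - 1)
    # S^{-1/2} from the eigendecomposition of S (S must be invertible).
    vals, vecs = np.linalg.eigh(S)
    S_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    # Eigenvectors of S^{-1/2} B S^{-1/2}; keep the p-1 largest eigenvalues.
    evals, evecs = np.linalg.eigh(S_inv_half @ B @ S_inv_half)
    order = np.argsort(evals)[::-1][: p - 1]
    proj = S_inv_half @ evecs[:, order]
    return proj, means @ proj

def predict(y_new, proj, centers):
    """Index of the group whose mean is nearest in discriminant coordinates."""
    z = y_new @ proj
    return int(np.argmin(((centers - z) ** 2).sum(axis=1)))

# Toy data (made-up values): three groups, two genes.
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
groups = [square, square + [5.0, 5.0], square + [0.0, 5.0]]
proj, centers = fit_lda(groups)
```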



[Figure panels: SVD, LDA, LDA Regions]

As in the 2-group case, you can weight the discriminant scores by the prior probability of group membership.

Quadratic Discriminant Analysis

Quadratic discriminant analysis is very similar to linear discriminant analysis, except that every group is allowed to have its own variance matrix, allowing the ellipses to have different orientations.
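A minimal sketch of the quadratic rule, assuming each group is modeled as multivariate normal with its own mean and covariance matrix. The function names, priors, and toy data are illustrative assumptions; the score is the usual Gaussian log density plus the log prior, with constants dropped.

```python
import numpy as np

def qda_score(y, mean, cov, prior):
    """Quadratic discriminant score for one group; larger is more plausible."""
    diff = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return (-0.5 * logdet
            - 0.5 * diff @ np.linalg.solve(cov, diff)
            + np.log(prior))

def qda_classify(y, groups, priors):
    """Index of the group with the highest score; each group keeps its own covariance."""
    scores = [qda_score(y, Y.mean(axis=0), np.cov(Y, rowvar=False), pr)
              for Y, pr in zip(groups, priors)]
    return int(np.argmax(scores))

# Toy data (made-up values): a tight group and a widely spread group.
tight = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
wide = np.array([[4.0, 4.0], [8.0, 4.0], [4.0, 8.0], [8.0, 8.0], [6.0, 6.0]])
groups, priors = [tight, wide], [0.5, 0.5]
label = qda_classify(np.array([2.5, 2.5]), groups, priors)
```

The point `[2.5, 2.5]` is closer in straight-line distance to the mean of the tight group, but it is assigned to the wide group because that group's larger variance makes the point far more plausible there. That is the practical effect of dropping the common-variance assumption.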

Logistic Regression

Let $\pi_t$ be the probability of membership in group t.

Use maximum likelihood to fit

$$\log(\pi_t/(1-\pi_t)) = \beta_0 + \sum_i \beta_i Y_{its}$$

Classify a sample into group t if the predicted $\log(\pi_t/(1-\pi_t))$ is the maximum over all groups.

Again, we can weight by prior probability.
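For two groups, the maximum likelihood fit can be sketched with Newton-Raphson iterations. This is an illustration only: the function names, the 0/1 coding `z` for group membership, the fixed iteration count, and the one-gene toy data are assumptions. The toy groups overlap on purpose, because the estimates do not converge when the groups are perfectly separable.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Newton-Raphson maximum likelihood for log(pi/(1-pi)) = b0 + sum_i b_i x_i.

    X has shape (samples, genes); z holds 1 for the group of interest, else 0.
    """
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (z - pi)                          # score vector
        hess = X1.T @ (X1 * (pi * (1 - pi))[:, None])   # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict_prob(X, beta):
    """Fitted probability of membership in the group coded 1."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

# Toy data (made-up values): one gene, six samples, overlapping groups.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
z = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta = fit_logistic(X, z)
probs = predict_prob(X, beta)
```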

Recursive Partitioning

PL < 2.45
    yes: se (50 0 0)
    no:  PW < 1.75
        yes: ve (0 49 5)
        no:  vi (0 1 45)
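The tree above amounts to two nested comparisons. The sketch below assumes that the left branch is taken when the condition is true and that each triple lists the training counts per class in the order (se, ve, vi); the function name and the `leaf_counts` dictionary are illustrative.

```python
def classify_by_tree(PL, PW):
    """Walk the two splits of the tree and return the leaf label."""
    if PL < 2.45:
        return "se"
    if PW < 1.75:
        return "ve"
    return "vi"

# Training counts per leaf, assumed to be ordered (se, ve, vi).
leaf_counts = {"se": (50, 0, 0), "ve": (0, 49, 5), "vi": (0, 1, 45)}
classes = ["se", "ve", "vi"]

# A training sample is misclassified when it sits in a leaf with another label.
total = sum(sum(counts) for counts in leaf_counts.values())
correct = sum(leaf_counts[label][classes.index(label)] for label in classes)
training_errors = total - correct
```

Under these assumptions the tree misclassifies 6 of the 150 training samples.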

Assessing Accuracy

Count the number of misclassifications of the training sample (optimistic).

Cross-validation: Hold out a fraction of the data (test data).

"Train" using the remainder of the sample, with the same rule used for the complete data.

Count the number of misclassifications of the test data.

Repeat.

Holding out about 1/3 of the data as test data appears to work best.
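The hold-out procedure can be sketched with the standard library alone. Everything named here is an assumption made for illustration: `nearest_mean_rule` is a toy one-gene classifier standing in for whichever rule is being assessed, and the data, seed, and number of repeats are made up.

```python
import random

def nearest_mean_rule(train):
    """Toy one-gene classifier: assign the label whose training mean is nearest."""
    means = {}
    for label in {lab for _, lab in train}:
        vals = [v for v, lab in train if lab == label]
        means[label] = sum(vals) / len(vals)
    return lambda v: min(means, key=lambda lab: abs(v - means[lab]))

def holdout_error(data, train_rule, test_fraction=1/3, repeats=100, seed=0):
    """Average misclassification rate over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, round(len(data) * test_fraction))
    errors = 0
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        rule = train_rule(train)  # retrain from scratch on the training part only
        errors += sum(rule(v) != lab for v, lab in test)
    return errors / (repeats * n_test)

# Toy data (made-up values): (expression value, group label) pairs.
data = ([(v, "A") for v in (1.0, 1.2, 0.8, 1.1, 0.9, 1.3)]
        + [(v, "B") for v in (3.0, 3.2, 2.8, 3.1, 2.9, 2.7)])
error_rate = holdout_error(data, nearest_mean_rule)
```

Any step that looks at the group labels, including gene selection, belongs inside `train_rule` so that the test samples play no part in building the classifier.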

But...

a) If the number of genes exceeds the number of samples, we always "overfit" - e.g. with logistic regression we can almost always achieve perfect classification.

b) The rank of S is at most min(number of samples, number of genes), so with more genes than samples S is not invertible (LDA) and neither are the within-treatment variance matrices (QDA).

c) Most of the methods use all of the genes.

i.e. With microarray data, we will need to select a smaller set of genes to work with.

For medical diagnostics we often want a very small set of markers.
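The rank problem is easy to check numerically. The sketch below uses simulated noise rather than real expression data, and the sizes (6 samples, 20 genes) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 6, 20

# Simulated expression matrix: rows are samples, columns are genes.
Y = rng.normal(size=(n_samples, n_genes))

# The covariance matrix is 20 x 20, but it is built from only 6 samples.
S = np.cov(Y, rowvar=False)
rank = np.linalg.matrix_rank(S)

# The rank cannot exceed n_samples - 1 (one degree of freedom goes to the
# mean), so S is singular and S^{-1} does not exist.
is_singular = rank < n_genes
```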

Reducing the Number of Genes

1. With n samples, use the n-k most significantly differentially expressed genes.

2. Cluster the genes and take the most significantly differentially expressed gene in each cluster.

3. Add variables to your discrimination function stepwise.

4. PAM - shrink each group center toward the overall center, and then apply a robust QDA with moderated variance estimates (like SAM). The method ends up with the within-group centroid equal to the total centroid for most genes, so the differences among groups rely only on the remaining genes, which are the only genes used in the QDA.
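The first option in the list, keeping only the most significantly differentially expressed genes, can be sketched as a ranking by a two-sample t statistic. The choice of a pooled-variance t statistic, the function names, the gene names, and the values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic for one gene."""
    m, n = len(a), len(b)
    pooled = ((m - 1) * variance(a) + (n - 1) * variance(b)) / (m + n - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / m + 1 / n))

def top_genes(expr_A, expr_B, k):
    """Return the k genes with the largest absolute t statistic.

    expr_A and expr_B map a gene name to its expression values in each group.
    """
    scores = {g: abs(t_statistic(expr_A[g], expr_B[g])) for g in expr_A}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (hypothetical gene names, made-up values).
expr_A = {"g1": [1.0, 2.0, 3.0], "g2": [1.0, 5.0, 3.0], "g3": [1.0, 2.0, 3.0]}
expr_B = {"g1": [7.0, 8.0, 9.0], "g2": [2.0, 4.0, 3.5], "g3": [3.0, 4.0, 5.0]}
selected = top_genes(expr_A, expr_B, k=2)
```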

Problem: (All methods) Often replicability is lost when studies are repeated. e.g. we can tell the difference between ALL and AML in all studies, but different discriminant functions are required, maybe different genes.