Improved Statistical Test

32
Improved Statistical Test for Multiple- Condition Microarray Data Integrating ANOVA with Clustering Jared Friedman 1 This research was done at The Rockefeller University and supervised by Brian Kirk 2 1

description

My Intel paper. Now it is finally published.

Transcript of Improved Statistical Test

Page 1: Improved Statistical Test

Improved Statistical Test for Multiple-Condition Microarray Data Integrating ANOVA with

Clustering

Jared Friedman1

This research was done at The Rockefeller University and supervised by Brian Kirk2

1. 180 East End Ave, New York, NY, 10128. Email: [email protected]. Phone: 212-249-0134

2. Email: [email protected]. Phone: 212-327-7064

Summary1

Page 2: Improved Statistical Test

Microarrays are exciting new biological instruments. Microarrays promise to have

important applications in many fields of biological research, but they are currently fairly

inaccurate instruments and this inaccuracy increases their expense and reduces their usefulness.

This research introduces a new technique for analyzing data from some types of microarray

experiments that promises to produce more accurate results. Essentially, the method works by

identifying patterns in the genes.

Abstract

Motivation: Statistical significance analysis of microarray data is currently a pressing

problem. A multitude of statistical techniques have been proposed, and in the case of simple two

condition experiments these techniques work well, but in the case of multiple condition

experiments there is additional information that none of these techniques take into account. This

information is the shape of the expression vector for each gene, i.e., the ordered set of expression

measurements, and its usefulness lies in the fact that genes that are actually affected biologically

by some experimental circumstance tend to fall into a relatively small number of clusters. Genes

that do appear to fall into these clusters should be selected for by the significance test, those that

do not should be selected against.

Results: Such a test was successfully designed and tested using a large number of

artificially generated data sets. Where the above assumption of the correlation between

clustering and significance is true, this test gives considerably better performance than

conventional measures. Where the assumption is not entirely true, the test is robust to the

deviation.

Introduction

Microarrays are part of an exciting new class of biotechnologies that promise to allow

monitoring of genome-wide expression levels (1). Although the technique presented in this

paper is quite general and would, in principle, work with other types of data, it has been

developed for microarray data. Microarray experiments work by testing expression levels of

some set of genes in organisms exposed to at least two different conditions, and usually then

trying to determine which (if any) of these genes are actually changing in response to the change 2

Page 3: Improved Statistical Test

in experimental condition (2). Unfortunately, the random variation in microarrays is often large

relative to the changes trying to be detected. This makes it very hard to eliminate false positives,

genes whose random fluctuations make them erroneously seem to respond to the experimental

condition, and false negatives, genes whose random fluctuations mask the fact that they are

actually responding to the experimental condition. It is standard statistical practice to handle this

problem by using replicates, i.e., several microarrays run in identical settings (1-9). Although

replicates are highly effective, the expense of additional replicates makes it worthwhile to

minimize the number needed by using more effective statistical tests.

When there are only two conditions in an experiment, the conventional statistical test for

differential expression is the t-test (2,6). But Long, et. al. (5, also 6,7) realized that by running a

separate t-test on each gene, valuable information is lost, namely the fact that the actual

population standard deviation of all the genes is rather similar. Due to the small number of

replicates in microarray experiments there are usually a few genes that, by chance, appear to

have extremely low standard deviations and thus register even a very small change as highly

significant by a conventional t-test. In reality -- and as additional replicates would confirm -- the

standard deviations of these genes are much higher, and the change is not at all significant. In

response to this issue, Long et. al. derived using Bayesian statistics a method of transforming

standard deviations that, in essence, moves outlying standard deviations closer to the mean

standard deviation. Unfortunately, the resulting algorithm, which they call Cyber-T, works only

with two condition experiments. Since this research is concerned with multiple condition

experiments, it was necessary to expand the method to work with this type of experiment, the

standard statistical test for which is called ANOVA (ANalysis Of Variations) (8). Here we

report the successful extension of the Cyber-T algorithm to multiple condition experiments,

creating the algorithm we call Cyber-ANOVA.

In microarray experiments involving several conditions, the expression vector of each

gene formed by consecutive measurements of its expression level has a distinctive shape, which

contains useful information. The more conditions in the experiment (and thus numbers in the

expression vector), the more distinctive will be the shape. Often biologists are interested in

finding genes with expression vectors of similar shape, and this interest has generated a well-3

Page 4: Improved Statistical Test

developed process for grouping genes by shape known as clustering (9). The basis of clustering

techniques is the concept of correlation, the similarity between two vectors. We chose to use the

Pearson Correlation, because this measure generally captures more accurately the actual

biological idea of correlation (2). With the Pearson correlation, as elsewhere, distance is defined

as one minus the correlation and measures the dissimilarity between two vectors. Using this

measure of distance, it is possible to solve the problem of taking a large number of unorganized

vectors and placing them into clusters of similar shapes: this is known as K-means clustering

(12). Once the K-means algorithm has been completed, it is often useful to determine how well

each gene has clustered. This is done by computing a representative vector, often called a center,

for each cluster and calculating the distance between each gene and the center of that gene’s

cluster (14,15).

The fundamental thesis of this research is that genes actually affected by experimental

circumstances tend to fall into a relatively small number of clusters, and that this information can

be used to make a more accurate statistical test. While it is in principle possible, it is not often

the case that large numbers of affected genes behave in seemingly random and entirely dissimilar

patterns. Given all the clusters of clearly significant genes, genes that fit well in a cluster are

more likely to be differentially expressed than those that do not. Unfortunately, this hypothesis

is very difficult to test, for no microarray experiments have been replicated enough to allow for

accurate determination of differentially expressed genes. The studies that have come closest to

this conclusion are Zhao et al, (10) and Jiang et al., (11). Both studies ran K-Means clustering on

multiple condition data and found several distinct cluster shapes that accounted for all clearly

significant genes.

We used this connection between clustering and significance to create a more accurate

statistical test whose essential steps are the following. First Cyber-ANOVA is run on the entire

dataset and the highly significant genes are clustered. The other genes are placed into the cluster

in which they fit best, and the distance between each gene and the center of its cluster is

computed. There are now two values for each gene, the P value (probability) from the Cyber-

ANOVA test and the D value (distance) from the clustering. These two values are combined and

the resulting value is more accurate than either one alone. 4

Page 5: Improved Statistical Test

The potential advantage of this algorithm was validated by the creation and successful

testing of a large number of artificial data sets. Artificial data was chosen over real experimental

data for two main reasons: (1) in experimental data, uncertainty always exists over which genes

are actually differentially expressed; this presents major problems in determining how well some

given algorithm works; and (2) no single microarray experiment is actually representative of all

possible microarray experiments – to determine accurately the performance advantage of an

algorithm, a large number of very different experiments must be used, a prohibitively expensive

and difficult process. Artificial creation of data sets allows every parameter of the data to be

perfectly controlled and the algorithm’s performance can be tested on every possible

combination of parameters.

The basic method of analysis was to generate an artificial data set from a number of

parameters, run the clustering algorithm on the data and measure its performance, run Cyber-

ANOVA on that data and on similar data but with additional replicates and measure its

performance, and then compare the results. The performance gain of the clustering algorithm is

probably best expressed as equivalent replicate gain: the number of additional replicates that

would be needed to produce the results of the clustering algorithm using a standard Cyber-

ANOVA. On reasonable simulated data, we found that the benefit was approximately one

additional replicate added to an original two. With additional improvements to the algorithm

design, this saving could probably be increased still further, and the advantage is clearly

significant enough to warrant use in actual microarray experiments.

Methods

Algorithm Design

The algorithm is implemented as a large (>2000 lines) macro script in Microsoft Visual

Basic for Applications controlling Microsoft Excel. The essential steps of the algorithm are the

following:

1. Run Cyber-ANOVA on all genes;

2. Adjust P values for multiple testing;

5

Page 6: Improved Statistical Test

3. Use the results of (2) to choose an appropriate number of most significant genes;

4. Cluster this selection of genes using K-Means;

5. Take the remaining genes and group each into the cluster with which it correlates best;

6. Compute the distance between each gene and the center of its cluster;

7. Compute a combined measurement of significance for each gene.

1. Run Cyber-ANOVA on all genes. Cyber-ANOVA is just a simple extension of Cyber-T, but

its novelty causes it to warrant some discussion. Cyber-T takes as input a list of genes, means in

each condition, standard deviations in each condition, and number of replicates. It gives as

output a set of P-values much like those from a conventional t-test, but more accurate. The

improvement is based on the idea of regularizing the standard deviations, essentially moving

outlying standard deviations closer to the mean. At the same time, it recognizes that standard

deviation may change significantly (and inversely) with intensity level, and thus genes are

regularized only with genes of similar intensity. The algorithm works by first computing for

each gene a background standard deviation, the average standard deviation of the hundred genes

with closest intensity, and then combining the background standard deviation with the observed

standard deviation in a type of weighted average. A standard t-test is then run, but using the

regularized standard deviation instead of the observed one.

Cyber-ANOVA works by the same method, taking as input means and standard

deviations in any number of conditions. Each background standard deviation is calculated using

the intensities of the genes in their own condition, and the values are combined using the same

formula. But because there are more than two conditions, a t-test cannot be applied; instead we

use ANOVA, inputting the regularized standard deviation instead of the observed one. The

result gives a dramatic increase in accuracy, an equivalent replicate gain of roughly one.

2. Adjust P values for multiple testing. To run the clustering algorithm, we need to select

a number of genes that are highly significant to form the seed clusters around which all other

genes will cluster. The difficult step in the process is deciding how many genes to select: it is

critical to get at least a few seeds for all clusters but also critical not to have too many false 6

Page 7: Improved Statistical Test

positives. Since the first requirement is impossible to determine by calculation, we use the

second. In order to choose an appropriate number of expected false positives, we must have an

accurate estimate of the false positive rate. The simplest way of estimating the false positive rate

is a Bonferroni correction, which simply multiplies each p value by the total number of tests.

This method assumes that all genes are independent, and tends to be too conservative,

particularly with small numbers of replicates, and a number of more accurate techniques have

been proposed. However, for this step, only a rough estimate is necessary, and thus the

Bonferroni correction suffices.

3. Use the results of (2) to choose an appropriate number of most significant genes. Once

the expected false positive rate for any number of the most significant genes has been estimated,

some number of genes must be selected such that the number of expected false positives is some

percentage of the number selected genes. There is no particular correct way to define this

percentage; it depends on how significant the genes in the least significant cluster are, which

cannot be determined directly. However, we have found that the algorithm is tolerant to

deviations from optimality (Table 6).

4. Cluster this selection of genes using K-Means. The chosen group of most significant

genes is clustered using the K-Means algorithm with the Pearson correlation. The primary

problem with all K-means clustering algorithms is that the number of clusters must be declared

at the start (13). But like the determination of the number of genes to include in the initial

clustering, there is no problem as long as K is rather high. This can be explained conceptually by

considering what happens as K increases to differentially expressed and non-differentially

expressed genes. If there are only differentially expressed genes in the original set, then as K is

raised, genes that were originally placed in one cluster will separate into multiple clusters.

Fortunately, this is guaranteed by the process of K-Means to lower the D values and create better

cluster shapes. If there are non-differentially expressed genes, then as K is increased these genes

will tend to separate out from the differentially expressed genes and form their own clusters – a

highly desirable occurrence. Unfortunately, if K becomes very large, the noise genes may start

to cluster very well with other noise genes forming erroneously low distances, and the other

noise genes added in the next step will have a better chance of incidentally finding a cluster with 7

Page 8: Improved Statistical Test

which they correlate well. In principle, too large a K might be partially monitored by checking

for very small clusters and for clusters with very high average distances, both of which could be

removed and their genes forced to join a different cluster. These methods have not yet been

implemented.

5. Take the remaining genes and group each into the cluster with which it correlates best.

The clusters formed in step 4 act as seeds for the remaining genes. The clusters formed

optimally include all of the cluster shapes of actually differentially expressed genes and no

others. It was possible to form these clusters, even if they were less than ideal, by using only a

subset of the genes in the data set. Now it is possible to look at the remaining genes and to see

how well each belongs in one of the clusters. Ideally, each differentially expressed gene would

correlate perfectly with exactly one cluster and each noise gene would correlate poorly with all

of them. In practice, of course, this is far from the case, but the principle remains. Since each

gene is, of course, only expected to correlate well with one cluster, each gene in the rest of the

dataset is added to the cluster with which it correlates best.

6. Compute the distance between each gene and the center of its cluster. The

measurement of how well each gene belongs in its cluster is now determined by computing the

Pearson correlation between each gene and the center of its cluster. The resulting correlations

are converted to distances (distance = 1- correlation). These distance values are the key

numbers produced by the first half of the algorithm. They represent whether a gene vector’s

shape is similar to the shape of gene vectors known to be significant. The basis of this research

is that low distances tend to imply differentially expressed genes. Unfortunately, distance values

alone are not an accurate measure of significance – ranking genes by distance alone produces

results far worse that the original Cyber-ANOVA. Instead, the distance (“D”) values must be

combined with the original P values to give new values better than either alone.

7. Compute a combined measurement of significance for each gene. The goal now is to

take the P value and the D value and combine them in some way to make a new value. Just like

with the calculation of the previous probabilities, what we are really interested in is finding the

probability of a non-differentially expressed gene with a certain standard deviation attaining both 8

Page 9: Improved Statistical Test

a P value less than or equal to the observed P value and a D value less than or equal to the

observed D value. Unfortunately, the theoretical basis for the connection between P and D

values is not well defined, and it may be very complicated and experiment-dependent; there may

well be no useful analytic solution to this problem. Instead, we generate the distribution of P and

D values for each experiment. First thousands of non-differentially expressed genes are

generated using the standard deviation that would be expected at their intensity level and their P

and D values are calculated. Then, for each gene in the data set, the number of non-differentially

expressed genes that have both P and D values equal to or lower than the P and D values of that

gene is counted. That number is divided by the number of non-differentially expressed genes

generated, and that quotient gives an approximate probability.

It is important to note that for highly significant genes, this method does not give accurate

probabilities, for if 105 non-differentially expressed genes are generated, only genes with an

actual probability of occurring of about 10-5 are likely to be generated. Thus, genes in the data

set that are actually significant to a probability of below 10-5 will almost certainly not have any

generated genes more significant and will all be assigned probabilities of zero. But this is not a

major concern, since these genes are so significant that there is little question that they are

differentially expressed, and their precise rank is not likely to be important. If it is, then the

genes with zero combined values can be ranked internally by their P values. In this case, within

the very most significant genes, the algorithm will not have had any effect.

Generation of the Data

To test the performance of the algorithm, a number of artificial data sets were generated.

The goal was to see how much benefit could be derived from the algorithm and how different

parameter combinations would impact the results. Above all, the data is intended to mimic real

microarray data, at least in all respects that are likely to significantly influence the results of the

test. The basic process of data generation was to use certain parameters to create a mean value

and standard deviation for each gene and condition and to use a random number generator

assuming a normal distribution to generate plausible experimental values. Some of the most

9

Page 10: Improved Statistical Test

important parameters are the numbers of differentially and non-differentially expressed genes,

number of groups, cluster shapes, standard deviations, and fold changes.

For each data set, about 8500 genes were created, with the number of differentially

expressed genes ranging from about 100 to about 500. The rest were genes whose mean value

did not change across conditions, but that mean value did vary between different non-

differentially expressed genes. The number of groups used is a key determinant of the

performance of the algorithm, more groups causing better performance. Most of the data sets

generated used six groups, but a set with twenty groups was tried.

The cluster shapes of non-differentially expressed genes are, of course, flat lines, for that

is the definition of being non-differentially expressed. The cluster shapes of the differentially

expressed genes were more complicated. They are drawn loosely from Zhao et. al. (10) but

really the specific shapes chosen are not all that important. An important property of the Pearson

correlation is that, given some expression vector, the probability of some other expression vector

having a distance to the original vector below some value is the same regardless of the shape of

the original vector; it depends only on the number of conditions (8). On the other hand, given

two original vectors, the probability of a test vector having a distance below some value to either

one of them does depend on the shapes of the original vectors. As intuition suggests, vectors of

opposite shape (like a linear increasing vector and linear decreasing vector) make it more likely

for a test vector to cluster well with one of them. To the extent that cluster shape does affect the

algorithm, the more similar the cluster shapes, the better the algorithm will perform. Cluster

shapes can be graphed and assigned names based on their shape. Data was generated using up to

twelve cluster shapes (Figure 1a, 1b).

10

Page 11: Improved Statistical Test

The method of assigning standard deviations was chosen to be as realistic as possible.

Cyber-ANOVA takes into account the fact that standard deviation often changes significantly

with absolute intensity in microarray experiments, and the generated data model this 11

Page 12: Improved Statistical Test

phenomenon. The data of Long, et. al, (5) which is freely available, was graphed, mean intensity

vs. standard deviation, and a quadratic regression calculated. This regression equation is an

explicit function for standard deviation in terms of mean, and it was used with a modification to

calculate standard deviations for the generated data. The modification is required because it is

not the case that standard deviation depends solely on intensity. The mean intensity vs. standard

deviation graph does not make a perfect line; there is considerable width to that curve, and this

too was simulated. The standard deviation itself was randomized with a small meta-standard

deviation. The assumption of normality in this generation is probably false, but the change in the

algorithm performance caused by introducing the meta-randomization at all is so small that it

seems highly unlikely that a better model would produce a significant difference.

To take full advantage of the standard deviation dependence on mean, non-differentially

expressed genes were generated with means separated by hundredths between 8 and 16 (on a log2

scale). This created a considerable range of standard deviations for each data set. The means of

differentially expressed genes were generated starting from 8 different baselines, the integers

from 8 to 15.

Each differentially expressed gene was generated at five different fold changes: 1.5, 1.7,

2.0, 3.0, and 6.0 on an unlogged scale. The six and three fold change genes generally made up

the seed clustering group, and the 1.5 and 1.7 fold change genes were responsible for most of the

difference in performance between algorithms.

Scoring Algorithms

Once data has been generated and assigned P values using Cyber-ANOVA and combined

values using the clustering algorithm, there must be some way to compare the performance. We

introduce a new method for determining the performance of an algorithm, which we believe to

fix a shortcoming in the standard method.

The most common method of scoring appears to be a consistency test (1-7) which works

the following way. Say there is a data set with N genes known to be differentially expressed. 12

Page 13: Improved Statistical Test

First, some significance test is performed and each gene assigned a significance level. Next, the

genes are ranked by significance level, most significant first. Out of the top N most significant

genes, the number D of genes that are actually differentially expressed is determined, and the

score is D / N.

The problem with the consistency test is that ignores the fact that it often will matter

whether the (N-D) genes are listed consecutively right after the Nth gene or at the very bottom of

the list. Biologists would prefer that the differentially expressed genes be listed as close to the

top as possible, for it is easier to find an interesting gene or group of genes if it is higher on the

list.

As a solution, we introduce an algorithm called rank sum. Rank sum also takes a list of

genes ranked by significance level, but computes the score differently. Rank sum is equal to the

sum of the ranks of all the N differentially expressed genes minus (1+2+3+…N), the minimum

score possible. Lower rank sums indicate better performance, and a perfect significance test

gives a rank sum of zero.

But rank sum has significant flaws, too, for it gives too much weight to the genes placed

after the Nth rank. In the worst case, one single misplaced gene could change the score from 0 to

the number of genes in the experiment, 8000 in this case. Here is a successful modification.

Instead of summing the actual ranks, we sum for all differentially expressed genes F(Rank),

where F is some increasing concave-down function. There is no particular right choice for F, for

the preferred function will vary depending on the experiment and experimenter, but both

logarithmic and polynomial (to a power less than, say, 1/2) functions give almost identical

results.

The modified rank sum successfully determines how well a significance test has ranked

the genes. But the score from the rank sum itself is not particularly useful information; it must

be placed in a context. Since the goal of any significance test is ultimately to reduce the number

of replicates necessary to attain accurate results, we convert the rank sum to a new measure that

we call equivalent replicate gain. Equivalent replicate gain is defined as the number of replicates 13

Page 14: Improved Statistical Test

needed to reach the same rank sum using only a standard Cyber-ANOVA. Specifically, this is

done by running Cyber-ANOVA on several data sets with parameters all identical except for the

number of replicates. The graph of number of replicates vs. rank sum is drawn and a regression

calculated. The clustering algorithm is then run and a rank sum obtained. That rank sum is

entered into the inverse of the regression function to find the approximate number of replicates

required to attain the same performance using only Cyber-ANOVA. The equivalent replicate

gain is the difference between that number of replicates and the number of replicates actually

used. Another useful measure is the equivalent replicate gain percentage, the equivalent

replicate gain divided by the equivalent replicate number. This gives an approximation of the

percent cost saving possible by using the algorithm and fewer replicates. The combination of the

techniques of rank sum and equivalent replicate gain percentage has been very successful.

Results

We have generated a number of artificial data sets using a wide range of parameters and

found the equivalent replicate gain percentage of the algorithm in many situations. The most

important parameters were found be the number of conditions, the number of cluster shapes, the

number of replicates, the number of non-differentiated genes in the initial clustering, and the

number of clusters omitted from the initial clustering.

In absolutely ideal situations, the algorithm can give an enormous benefit. An

experiment with twenty conditions, one cluster shape, two replicates, and perfect initial

clustering approaches such an ideal situation and gives excellent results (Table 1) .

Number of Conditions Equivalent Replicate

Number

Equivalent Replicate

Gain Percentage

6 3.43 41.7%

14

Page 15: Improved Statistical Test

20 6.28 68.1%

Table 1. Performance of the clustering algorithm on data with two numbers of conditions, one

cluster shape, two replicates, and perfect initial clustering.

A more realistic case would involve fewer conditions, say, six, and particularly with this

smaller number of conditions, the number of cluster shapes becomes a factor in the performance

of the algorithm. The performance benefit is still quite significant: recall that equivalent

replicate gain percentage is an approximation of the cost saving due to fewer replicates. As

expected, the performance benefit is reduced by additional cluster shapes (Table 2).

Number of Cluster

Shapes

Equivalent Replicate

Number

Equivalent Replicate

Gain Percentage

12 2.84 29.6 %

2 3.34 40.1 %

Table 2. Performance of the clustering algorithm on data with two numbers of cluster shapes,

using six conditions, two replicates, and perfect initial clustering.

Even when identical parameters are used to generate the data, the rank sum of any

algorithm will fluctuate between data sets differing only by random number generation.

Fortunately, this variation appears to affect Cyber-ANOVA and the Clustering algorithm

equally, for the equivalent replicate gain percentage does not change significantly (Table 3).

This invariance has the beneficial effect of making it unnecessary to run multiple data sets for

each parameter combination; more than one adds little information.

Data Set Equivalent Replicate

Number

Equivalent Replicate

Gain Percentage

1 3.34 40.1%

15

Page 16: Improved Statistical Test

2 3.47 42.3%

3 3.29 39.2%

4 3.25 38.5%

Table 3. Performance of the clustering algorithm on data generated randomly four times using

identical parameters: six conditions, two replicates, two cluster shapes, and perfect initial

clustering.

Change in the standard deviation or in the fold changes of the genes will certainly affect

both the rank sum from both Cyber-ANOVA and clustering. Over small changes, the equivalent

replicate gain percentage is not seriously affected, but if the standard deviation becomes so large

as to corrupt the initial clustering, the performance loss is significant (Table 4).

Standard Deviation

(in multiples of the

normal SD used

elsewhere)

Equivalent Replicate

Number

Equivalent Replicate

Gain Percentage

x1 3.34 40.1 %

x2 3.14 36.3%

x4 2.63 24.0%

Table 4. Performance of the clustering algorithm on data with three levels of standard deviation,

using six conditions, two replicates, two cluster shapes, and perfect initial clustering.

Like Cyber-ANOVA, and in fact all statistical tests, the equivalent replicate gain

percentage of the algorithm decreases as the number of replicates becomes larger and the test

becomes more accurate (Table 5). If the initial clustering is correct, there is no loss in

performance, but in experiments with parameters similar to those of Table 5, the algorithm

probably ceases to be worthwhile after four replicates.

Algorithm Tested Number of Replicates Equivalent Replicate

Number

Equivalent Replicate

Gain Percentage

Clustering 2 2.84 29.6%

16

Page 17: Improved Statistical Test

Clustering 4 4.74 15.7%

Clustering 8 8.12 1.51%

Cyber-ANOVA 2 2.93 31.7%

Cyber-ANOVA 4 5.23 23.5%

Cyber-ANOVA 8 8.75 8.60%

Table 5. Performance of the clustering algorithm on data with three numbers of replicates, using

six conditions, twelve cluster shapes, and perfect initial clustering. The rate of decrease in

performance at higher number of replicates is compared with that rate in Cyber-ANOVA. Note

that for the clustering algorithm, the equivalent replicate gain percentage is expressed as a gain

from Cyber-ANOVA, but that for Cyber-ANOVA, the equivalent replicate gain percentage is

expressed as a gain from a standard ANOVA test. Thus, on the scale that Cyber-ANOVA is

being compared to, the clustering algorithm result for a two replicate data set is equivalent to the

plain ANOVA of four replicates.

The most serious impact to the algorithm’s performance occurs if the initial clustering is

done incorrectly. The most difficult and important step is choosing the number of genes to

include in the initial group; if either too many or too few are chosen, the performance of the

algorithm will be diminished significantly. The ultimate goal is to choose a set of genes that

includes all the clusters of differentially expressed genes without including any non-differentially

expressed genes. It is true that this might be very difficult to do with experimental data, but it is

fortunately true that the algorithm is robust to small departures from optimality (Table 6). The

only time that the algorithm’s performance drops below the performance of Cyber-ANOVA is

when almost all of the clusters are left out.

Number of

Genes Selected

Expected

Number of

Number of

Clusters

Number of

non-

Equivalent

Replicate

Equivalent

Replicate Gain

17

Page 18: Improved Statistical Test

False Positives Omitted differentially

expressed

genes included

Number Percentage

3 0.005 10 0 1.57 -21.6

10 0.02 9 0 1.93 -3.3

20 0.05 5 0 2.11 5.18

30 0.09 3 0 2.12 5.84

50 0.24 0 0 2.34 14.7

100 1.3 0 0 2.63 23.9

300 91 0 1 2.55 21.4

400 300 0 23 2.37 15.8

500 496 0 95 2.09 4.33

Table 6. Performance of the clustering algorithm on data with several expected false positive

rates for the initial clustering, using six conditions, two replicates, and twelve cluster shapes.

Discussion

The potential cost saving of the algorithm in cases when the correlation between distance

and significance is strong is highly significant. It is interesting to note that for normal data with

two replicates such as that in Table 5, the percent equivalent replicate gain between Cyber-

ANOVA and Clustering is roughly equal to the performance gain between standard ANOVA and

Cyber-ANOVA, on the order of 30%.

It is true that this performance degrades in some cases, but this is also true of Cyber-

ANOVA, and also of other algorithms (3, 4). One situation that causes the percentage equivalent

replicate gain of all these algorithms to decrease is higher number of replicates. Unfortunately,

the performance decreases more rapidly in the clustering algorithm. This phenomenon is

perhaps best explained by a quality of information argument: additional replicates increase the

quality of information of the P values, but have little effect on the quality of the information of

the D values. Additional replicates will cause differentially expressed genes to cluster better, in

a sense reducing the number of false negatives, but will not reduce the false positives, because

18

Page 19: Improved Statistical Test

non-differentially expressed genes are just as likely to randomly have low distances given any

number of replicates. Thus, as the number of replicates increases, the D values gradually cause

less improvement.

The most major degradation of the clustering algorithm’s performance, however, is in a

situation that does not affect the other algorithms, that of bad initial clustering. Reducing the

severity of this problem will be the primary direction for further research on this algorithm.

Some possible solutions include (1) clustering all the genes but weighing their significance in the

clustering algorithm by P value; and (2) setting the number of genes to use as seeds by trying

many values and seeing the point above which new clusters stop appearing to form. Regardless

of whether this issue can be resolved in the general case, if in some experiment it is strongly

suspected (perhaps on biological grounds) that no clusters have been omitted from the initial set

of genes, this clustering method can be used and safely expected to produce a substantial

performance benefit.

References

1. Dudoit, S. Yang, Y., Callow, M.J., and Speed, T.P. (2000). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report #578, Stat Dept, UC-Berkeley.

2. Knudsen, S. (2002) A Biologist’s Guide to Analysis of DNA Microarray Data. John Wiley & Sons.

3. Pan, W., (2002) A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18(4):546-54

4. Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5119-5121.

5. Long, D., Mangalam, H., Chan, B., Tolleri, L., Hatfield, G., Baldi, P., Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical Framework. Journal of Biological Chemistry. 276 (3): 19937-19944.

6. Baldi, P., and Brunak, S., (1998) Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge MA.

19

Page 20: Improved Statistical Test

7. Baldi, P. and Long, A.D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17:509-519

8. Kreyszig, E. (1970) Mathematical Statistics: Principles and Methods. John Wiley &Sons, Inc.

9. Ewens, W., and Grant, G., (2001) Statistical Methods in Bioinformatics: An Introduction. Spriger-Verlag, NY.

10. Zhao, R., Gish, K., Murphy, M., Yin, Y., Notterman, D., Hoffman, W., Tom., E, Mack, D., and Levine, A. (2000). Analysis of p53-regulated gene expression patterns using oligonucleotide arrays. Genes and Development. 14 (8): 981-993.

11. Jiang M, Ryu J, Kiraly M, Duke K, Reinke V, and Kim SK. Genome-wide analysis of developmental and sex-regulated gene expression profiles in Caenorhabditis elegans. Proc. Natl. Acad. Sci. U S A 2001 Jan 2;98(1):218-23

12. Duda, R.O, and Hart, P.E. (1973) Pattern Classification and Scene Analysis. John Wiley and Sons.

13. Sharan, R., and Shamir, R. CLICK: a clustering algorithm with applications to gene expression analysis. In Proceedings of the 2000 Conference on Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, 307-316

14. Alon, U., Barkai, N,. notterman, D.A., Gish K., Ybarra, S., Mack, D., and Levine, A.J. (1999) Broad patterns of gene expression revalued by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acd. Sci. USA, 96:6745-6750

15. Sherlock, G. Analysis of large-scale gene expression data. Curr Opin Immunol 2000 Apr;12(2):201-5

20