
Statistical Applications in Genetics and Molecular Biology

Volume 5, Issue 1, 2006, Article 28

http://www.bepress.com/sagmb/vol5/iss1/art28

A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments

Hongmei Jiang∗ Rebecca W. Doerge†

∗Northwestern University, [email protected]

†Purdue University, [email protected]

Copyright ©2006 The Berkeley Electronic Press. All rights reserved.


Abstract

For situations where the number of tested hypotheses is increasingly large, the power to detect statistically significant multiple treatment effects decreases. As is the case with microarray technology, researchers are often interested in identifying differentially expressed genes for more than two types of cells or treatments. A two-step procedure is proposed for the purpose of increasing power to detect significant effects (i.e., to identify differentially expressed genes). Specifically, in the first step, the null hypothesis of equality across the mean expression levels for all treatments is tested for each gene. In the second step, only pairwise comparisons corresponding to the genes for which the treatment means are statistically different in the first step are tested. We propose an approach to estimate the overall FDR for both fixed rejection regions and fixed FDR significance levels. Also proposed is a procedure to find the FDR significance levels used in the first step and the second step such that the overall FDR can be controlled below a pre-specified FDR significance level. When compared via simulation, the two-step approach has increased power over a one-step procedure and controls the FDR at a desired significance level.

KEYWORDS: false discovery rate, multiple comparisons, multiple tests, testing differential expression

Acknowledgments: We are very grateful to two reviewers and the Associate Editor for their helpful comments and suggestions.

1 Introduction

Advances in many areas of technology (e.g., communication, health care, and biotechnology) are giving rise to vast experiments that provide data for conducting a very large number of repeated tests. These situations require a multiple comparison correction that not only accommodates the number of tests being conducted, but also controls the rate of false positives at a desired level. While this problem presents itself in a variety of applications, the one that motivated this work is microarray technology, a powerful tool that is widely applicable to almost every area of science (e.g., basic science, agriculture, and medical research). Microarrays provide a systematic way to study transcript variation for thousands of genes simultaneously. The key question addressed by most microarray experiments is which genes are differentially expressed between a pair of conditions (i.e., control and treatment). Numerous approaches that range from traditional statistical analyses to new statistical models have been proposed for testing differential gene expression between pairs of conditions (Schena et al., 1996; Baldi and Long, 2001; Efron, 2003; Newton et al., 2001; Gottardo et al., 2003; Tusher et al., 2001; Kerr et al., 2000; Wolfinger et al., 2001). Since the traditional familywise error rate (FWER) multiple comparison procedures, such as Bonferroni's procedure, are too conservative, false discovery rate (FDR) controlling procedures (Benjamini and Hochberg, 1995) have been widely used in microarray studies. Benjamini and Hochberg (2000) propose an adaptive procedure that has increased power over the original procedure by incorporating an estimate of the proportion of true null hypotheses. A variety of methods have been proposed to estimate the proportion of true null hypotheses for multiple testing problems, such as Storey's bootstrap method (Storey, 2002), Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), and Langaas et al.'s method based on nonparametric maximum likelihood estimation of the p-value density under the restriction of decreasing and convex decreasing densities (Langaas et al., 2005).

Although testing for differential expression of a gene between pairs of conditions or treatments is informative, in a microarray study it is quite common for researchers to be interested in comparing more than two treatment conditions for thousands of genes in the experiment. For instance, Hedenfalk et al. (2001) studied gene expression changes among breast cancers due to mutations in either the gene BRCA1 or the gene BRCA2 and sporadic tumor (i.e., three conditions) using 5,361 genes. With a large number ($m$) of genes, the number of pairwise comparisons is typically very large ($3m$ for 3 treatments, $6m$ for 4 treatments, etc.). Therefore, when the goal is to identify statistically differentially expressed genes between each pair of conditions, in the typical one-step multiple comparison procedure $C \times m$ hypothesis tests ($C$ is the number of pairwise comparisons for each gene) are treated as a family, and a false discovery rate (FDR) controlling procedure such as Benjamini and Hochberg's procedure (Benjamini and Hochberg, 1995) is applied at a significance level $\alpha$. In situations where the majority of genes are not differentially expressed across the treatments, applying the FDR controlling procedure to a large family of multiple comparisons may not be the most powerful approach, simply because the power to detect differentially expressed genes decreases as the number of hypotheses increases. Lu et al. (2005) explored this issue and proposed a two-step strategy. In the first step, a subset of genes that are potentially differentially expressed among the treatments is identified with a loose criterion. In the second step, these candidate genes are tested for differential expression with a more stringent criterion. It is expected that the smaller number of genes in the second step will give rise to a more powerful test. In both steps of the procedure Lu et al. (2005) employ a Bonferroni adjustment to address the multiple comparison problem. Lu et al. (2005) point out that Benjamini and Hochberg's FDR controlling procedure (Benjamini and Hochberg, 1995) can be used in both steps, but do not address the familywise error rate (FWER) or the FDR for the entire procedure. Specifically, suppose the FDR significance levels used in the two steps are 0.05 and 0.01, respectively. The FDR for the whole procedure must be taken into account, and not only the individual FDRs at each step, since the false rejections in the first step will affect the results of the second step.

Using this as our motivation, a two-step multiple comparison procedure is proposed for testing pairwise comparisons of more than two treatments for a large number of genes such that the power to detect differentially expressed genes, while controlling the FDR at a pre-chosen significance level, will be higher than that of a one-step procedure. Although Lu et al. (2005) used a mixed model approach for their two-step procedure, our proposed two-step procedure is not limited by the specifics of the model. Specifically, in the first step, the null hypothesis of equality across the mean expression levels for all treatments is tested for each gene. In the second step, only pairwise comparisons corresponding to the genes for which the treatment means are statistically different in the first step are tested. The two-step procedure can be applied in practice in three different ways. (1) The rejection regions in the first and second step can both be fixed. That is, equality tests of expression levels for the genes in the first step with corresponding p-values less than or equal to $c_1$ are considered statistically significant, and pairwise comparisons in the second step with p-values less than or equal to $c_2$ are statistically significant, where $c_1$ and $c_2$ are fixed and known. Although it is typical to use the term rejection region in conjunction with test statistics, here we use it in conjunction with p-values for ease of explanation. (2) One can apply an FDR controlling procedure at significance level $\alpha_1$ in the first step, and an FDR controlling procedure at significance level $\alpha_2$ in the second step, where $\alpha_1$ and $\alpha_2$ are fixed and known. (3) One can pre-specify an overall FDR significance level $\alpha$ and control the overall FDR below $\alpha$. In this work we propose an approach to estimate the overall FDR for both fixed rejection regions (situation 1) and fixed FDR significance levels (situation 2). We also propose a procedure to find the FDR significance levels used in the first step and the second step such that the overall FDR can be controlled below a pre-specified FDR significance level. Using simulated data we demonstrate that our proposed two-step procedure has increased power over a one-step procedure and controls the FDR for the entire procedure at a desired significance level.

2 A two-step multiple comparison procedure

A novel two-step multiple comparison procedure is proposed in the context of testing for differential expression. Initially, we present it generally with no specific FDR controlling procedure specified:

Step 1. The null hypothesis that a gene is not differentially expressed across all treatment conditions is tested for each gene (e.g., the global F-test from an ANOVA model). For the family of $m$ tests corresponding to the $m$ genes, an FDR controlling procedure is applied to control the FDR at level $\alpha_1$. Suppose there are $K$ tests that are significant. Let $A$ denote the collection of the genes which have statistically significant treatment effects. If $K = 0$, the procedure is stopped and it is concluded that no pairwise comparisons are significant and that there are no differentially expressed genes; otherwise, go to Step 2.

Step 2. (a) For genes not belonging to $A$, conclude that pairwise comparisons among the treatments for these genes are not significant.

(b) For genes belonging to $A$, perform the $C$ pairwise comparisons for each gene. Since there are $K$ genes, in total there are $C \times K$ pairwise comparisons. Apply an FDR controlling procedure for this family of $C \times K$ tests at level $\alpha_2$.

The procedure above uses FDR significance levels $\alpha_1$ and $\alpha_2$. It can also be carried out using fixed rejection regions in a similar way:



Step 1. The null hypothesis that a gene is not differentially expressed across all treatment conditions is tested for each gene (e.g., the global F-test from an ANOVA model). Tests with p-values $\le c_1$ are considered statistically significant. Suppose there are $K$ tests that are significant. Let $A$ denote the collection of the genes which have statistically significant treatment effects. If $K = 0$, the procedure is stopped and it is concluded that no pairwise comparisons are significant and that there are no differentially expressed genes; otherwise, go to Step 2.

Step 2. (a) For genes not belonging to $A$, conclude that pairwise comparisons among the treatments for these genes are not significant.

(b) For genes belonging to $A$, perform the $C$ pairwise comparisons for each gene. Since there are $K$ genes, in total there are $C \times K$ pairwise comparisons. Pairwise comparisons with p-values $\le c_2$ are considered statistically significant.

We assume that if a gene does not have a significant treatment effect (tested in Step 1), then all of the pairwise comparisons among the treatments corresponding to that gene are not significant. Only genes with a statistically significant treatment effect will enter the second step to be tested for pairwise comparisons. However, if a gene has a significant treatment effect (Step 1), some or all of the pairwise comparisons may not be significant.
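The fixed-rejection-region version of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `two_step_fixed_regions`, the argument layout (one Step 1 p-value per gene and a list of $C$ pairwise p-values per gene), and the toy p-values are all hypothetical, and the p-values are taken as given rather than computed from expression data.

```python
from typing import List, Set, Tuple


def two_step_fixed_regions(
    p_gene: List[float],
    p_pair: List[List[float]],
    c1: float,
    c2: float,
) -> Set[Tuple[int, int]]:
    """Return the (gene, comparison) index pairs rejected by the two-step rule.

    p_gene[i] is the Step 1 p-value of gene i; p_pair[i][j] is the p-value of
    the j-th pairwise comparison for gene i (hypothetical input layout).
    """
    # Step 1: genes whose global test falls in the rejection region [0, c1].
    selected = [i for i, p in enumerate(p_gene) if p <= c1]
    if not selected:  # K = 0: stop, nothing is declared significant
        return set()
    # Step 2: only comparisons that belong to selected genes are tested.
    return {
        (i, j)
        for i in selected
        for j, p in enumerate(p_pair[i])
        if p <= c2
    }


# Toy p-values (made up for illustration).
p_gene = [0.001, 0.20, 0.04]
p_pair = [[0.002, 0.30, 0.01], [0.01, 0.02, 0.03], [0.04, 0.06, 0.50]]
rejected = two_step_fixed_regions(p_gene, p_pair, c1=0.05, c2=0.05)
print(sorted(rejected))  # [(0, 0), (0, 2), (2, 0)]
```

Gene 1 has three small pairwise p-values, but none of them is rejected because its Step 1 p-value (0.20) exceeds $c_1$. Setting `c1=1.0` lets every gene through and recovers the one-step decision rule.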

For the fixed FDR significance levels $\alpha_1$ and $\alpha_2$, or the fixed rejection regions $[0, c_1]$ and $[0, c_2]$ in Step 1 and Step 2, respectively, determination of the overall FDR remains necessary. Choosing the significance level $\alpha_1$ in Step 1 and $\alpha_2$ in Step 2 so that the FDR for the entire two-step procedure is controlled at a desired significance level $\alpha$ is an additional issue of interest. To address these issues, the two-step multiple comparison procedure is investigated further to gain an appreciation of the overall FDR relative to the FDR in each step of the procedure.

3 Estimating FDR for fixed rejection regions

3.1 Derivation of the FDR

Assume the two-step procedure with fixed rejection regions is used. That is, assume that genes with p-values $\le c_1$ have a significant treatment effect (i.e., at least one treatment mean is different from the others) in Step 1, and that the pairwise comparisons with p-values $\le c_2$ are identified as statistically significant in Step 2, where $c_1$ and $c_2$ are known. Our goal is to compute the overall FDR for the two-step multiple comparison procedure. The approach is similar to Storey's positive false discovery rate (pFDR) procedure (Storey, 2002, 2003b), where one estimates the FDR for a given rejection region.

Let $H_0^{i}$ denote the null hypothesis of no treatment effect for the $i$th gene, and let $H_0^{ij}$ denote the null hypothesis that the $j$th pair of treatment means are not different for the $i$th gene. For instance, if three treatments are of interest, $j = 1, 2, 3$; if four treatments are of interest, $j = 1, 2, \ldots, 6$. Let $D_i = 0$ indicate that there is no treatment effect for the $i$th gene, and let $D_i = 1$ indicate a treatment effect for the $i$th gene. Furthermore, let $D_{ij} = 0$ indicate that the means of the $j$th pair of treatments for gene $i$ are the same, and $D_{ij} = 1$ when they are different. If $D_i = 0$, then $D_{ij} = 0$ for all $j$. Finally, let $p_i$ denote the p-value for testing the null hypothesis $H_0^{i}$ in Step 1, and let $p_{ij}$ denote the p-value for testing the null hypothesis $H_0^{ij}$ in Step 2.

Our two-step multiple comparison approach is different from the one-step multiple comparison procedure, where the decision to reject depends only on $p_{ij}$, since the decision whether or not to reject $H_0^{ij}$ in the two-step multiple comparison procedure depends on both $p_i$ and $p_{ij}$. Essentially, the two-step multiple comparison procedure has two criteria. The null hypothesis $H_0^{ij}$ is rejected if and only if both conditions $p_i \le c_1$ and $p_{ij} \le c_2$ are satisfied. Obviously, the two-step comparison procedure is exactly the one-step procedure when $c_1 \ge 1$. In fact, if $c_1$ is large enough that the two events, $\{p_{ij} \le c_2\}$ for some $j$ and $\{p_i \le c_1\}$, occur simultaneously for every gene $i$, then the two-step comparison procedure will produce the same results as the one-step procedure.

Theorem 1. In a two-step multiple comparison procedure, suppose that objects/genes with p-values $\le c_1$ are considered as having a significant treatment effect (i.e., at least one treatment mean is different from the others) in Step 1, and that the pairwise comparisons with p-values $\le c_2$ are identified as statistically significant in Step 2. Assume $c_1$ and $c_2$ are known, and the objects/genes are independent. The pFDR of this two-step multiple comparison procedure is

\[
\mathrm{pFDR} = \mathrm{pFDR}_1 \cdot \frac{P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} + (1 - \mathrm{pFDR}_1) \cdot \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)\, P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}, \tag{1}
\]

where $\mathrm{pFDR}_1 = P(D_i = 0 \mid p_i \le c_1)$ is the pFDR in Step 1.



Proof. Since the goal of the two-step multiple comparison procedure is to identify statistically significant pairwise comparisons, only the rejections in Step 2 are of interest. Assume the objects/genes are independent. Using the Bayesian interpretation of pFDR (Storey, 2003b), the pFDR for the whole procedure is the probability of having a false rejection of a pairwise comparison given that it is in the rejection region (i.e., the probability that $D_{ij} = 0$ given that $p_i \le c_1$ and $p_{ij} \le c_2$),

\[
\mathrm{pFDR} = P(D_{ij} = 0 \mid p_i \le c_1,\, p_{ij} \le c_2) = \frac{P(D_{ij} = 0,\, p_{ij} \le c_2 \mid p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}. \tag{2}
\]

To compute the numerator of equation (2), falsely rejected genes in the first step are treated separately from the rejected genes that in fact have different treatment effects.

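\[
\begin{aligned}
& P(D_{ij} = 0,\, p_{ij} \le c_2 \mid p_i \le c_1) \\
&\quad = P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1) \cdot P(D_i = 0 \mid p_i \le c_1) \\
&\qquad + P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 1,\, p_i \le c_1) \cdot P(D_i = 1 \mid p_i \le c_1) \\
&\quad = P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1) \cdot \mathrm{pFDR}_1 \\
&\qquad + P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 1,\, p_i \le c_1) \cdot (1 - \mathrm{pFDR}_1) \\
&\quad = P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 0,\, p_i \le c_1) \cdot P(D_{ij} = 0 \mid D_i = 0,\, p_i \le c_1) \cdot \mathrm{pFDR}_1 \\
&\qquad + P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) \cdot (1 - \mathrm{pFDR}_1) \cdot P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1).
\end{aligned}
\]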

Assume that all pairwise comparisons for a gene are not significant if that gene does not have a significant treatment effect. Then

\[
P(D_{ij} = 0 \mid D_i = 0,\, p_i \le c_1) = P(D_{ij} = 0 \mid D_i = 0) = 1.
\]

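Consequently,

\[
\begin{aligned}
P(D_{ij} = 0,\, p_{ij} \le c_2 \mid p_i \le c_1)
&= P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1) \cdot \mathrm{pFDR}_1 \\
&\quad + (1 - \mathrm{pFDR}_1) \cdot P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)\, P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1).
\end{aligned} \tag{3}
\]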

Combining equation (3) with equation (2) gives rise to the pFDR formulation in equation (1).
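As a consistency check on equation (1), consider the boundary case $c_1 = 1$, in which every gene passes Step 1 and the two-step procedure coincides with the one-step procedure. The short derivation below is a reader's sketch, not part of the original proof; it uses only the definitions above and the assumption that $D_i = 0$ implies $D_{ij} = 0$. With $c_1 = 1$ the conditioning on $p_i \le c_1$ is vacuous, so $\mathrm{pFDR}_1 = P(D_i = 0)$ and

```latex
\begin{align*}
\mathrm{pFDR}\big|_{c_1 = 1}
 &= \frac{P(D_i = 0)\, P(p_{ij} \le c_2 \mid D_i = 0)
      + P(D_i = 1)\, P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1)\,
        P(D_{ij} = 0 \mid D_i = 1)}
     {P(p_{ij} \le c_2)} \\
 &= \frac{P(p_{ij} \le c_2,\, D_{ij} = 0,\, D_i = 0)
      + P(p_{ij} \le c_2,\, D_{ij} = 0,\, D_i = 1)}
     {P(p_{ij} \le c_2)} \\
 &= \frac{P(p_{ij} \le c_2,\, D_{ij} = 0)}{P(p_{ij} \le c_2)}
  = P(D_{ij} = 0 \mid p_{ij} \le c_2).
\end{align*}
```

The last expression is the Bayesian form of the pFDR for a one-step procedure with rejection region $[0, c_2]$, which agrees with the earlier remark that the two procedures are identical when $c_1 \ge 1$.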



3.2 Estimation of the FDR

With respect to microarray studies, the probability of having at least one rejection, $P(R > 0)$, is almost 1, making the FDR and the pFDR essentially the same (Storey et al., 2004; Black, 2004). Therefore, the pFDR can be replaced with the FDR in equation (1), and the FDR for a two-step multiple comparison procedure is

\[
\mathrm{FDR} = \mathrm{FDR}_1 \cdot \frac{P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} + (1 - \mathrm{FDR}_1) \cdot \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)\, P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}. \tag{4}
\]

To estimate the FDR of the two-step multiple comparison procedure with fixed rejection regions, the five components of equation (4) have to be estimated:

(1) $P(p_{ij} \le c_2 \mid p_i \le c_1)$ can be estimated using the proportion of rejections among the pairwise comparisons performed in Step 2. That is,

\[
\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1) = \frac{\#\{p_{ij} : p_{ij} \le c_2,\, p_i \le c_1\}}{\#\{p_i : p_i \le c_1\} \cdot C}, \tag{5}
\]

where $C$ is the number of pairwise comparisons for each gene, $\#\{p_i : p_i \le c_1\}$ is the number of statistically significant genes (i.e., with p-values $\le c_1$) in Step 1, and $\#\{p_{ij} : p_{ij} \le c_2,\, p_i \le c_1\}$ is the number of significant pairwise comparisons (i.e., with p-values $\le c_2$) in Step 2.

(2) The FDR in Step 1, $\mathrm{FDR}_1$, can be estimated using the approach of Storey (2002):

\[
\widehat{\mathrm{FDR}}_1 = \frac{c_1 \cdot \hat{\pi}_{01}}{\#\{p_i : p_i \le c_1\} / m}, \tag{6}
\]

where $m$ is the total number of genes, $\#\{p_i : p_i \le c_1\}$ is the number of p-values $\le c_1$ in Step 1, and $\hat{\pi}_{01}$ is the estimate of $\pi_{01}$, the proportion of true null hypotheses in Step 1 (i.e., the proportion of genes which in fact have no treatment effect among all $m$ genes). Details about estimating the proportion of true null hypotheses are not covered here; references are given in Section 1.



(3) $P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)$ is the probability of claiming a statistically significant pairwise comparison which is associated with a falsely rejected gene (tested in Step 1). A resampling technique can be employed to estimate this probability. The following procedure applies to cases where a global F-test from an ANOVA model with constant variance and a normal distribution assumption is employed to test for the treatment effect in Step 1. The concept is to generate a large data set under the true null hypothesis (i.e., all treatment means are the same for all genes) and then analyze these data in the same manner as the real (actual) data. The proportion of rejections in Step 2 (the ratio of the number of rejections to the total number of pairwise comparisons) is then computed. The specifics are as follows:

(i) Using the same sample size as the real data, generate a random sample from a standard normal distribution for a large number of genes (e.g., $M = 100{,}000$). Assume there are 3 treatment conditions and $n$ observations within each treatment condition, making the random sample of size $3nM$. These data are then analyzed using the same analysis as used for the real data. The p-value ($p_i^*$) for testing the null hypothesis that the treatment means are equal, and the p-values ($p_{ij}^*$) for testing the pairwise comparisons, are computed for $i = 1, \ldots, M$.

(ii) Let $\#\{p_i^* : p_i^* \le c_1\}$ be the number of p-values such that $p_i^* \le c_1$, and let $\#\{p_{ij}^* : p_{ij}^* \le c_2,\, p_i^* \le c_1\}$ be the number of p-values such that $p_{ij}^* \le c_2$, where $i$ is chosen such that $p_i^* \le c_1$. These quantities, obtained by resampling, provide an estimate of the probability of claiming a statistically significant pairwise comparison that is associated with a falsely rejected gene, namely

\[
\hat{P}(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1) = \frac{\#\{p_{ij}^* : p_{ij}^* \le c_2,\, p_i^* \le c_1\}}{\#\{p_i^* : p_i^* \le c_1\} \cdot C}, \tag{7}
\]

where $C$ is the number of pairwise comparisons for each gene.

In Section 6, we present an algorithm for situations where the experimental design is unbalanced and the data are not normally distributed. A permutation method is used to estimate the true null distribution of the test statistics.

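(4) The estimate of $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)$ is $c_2$ when the probability $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1) = 1$. Notice that

\[
\begin{aligned}
& P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) - P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) \\
&\quad = \frac{P(p_{ij} \le c_2,\, p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)} - P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) \\
&\quad \le \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)} - P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) \\
&\quad = P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) \cdot \frac{1 - P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}.
\end{aligned}
\]

Since the p-value $p_{ij}$ corresponding to $D_i = 1$ and $D_{ij} = 0$ is uniformly distributed on the interval (0, 1), $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) = c_2$. Hence,

\[
P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) - c_2 \le c_2 \cdot \frac{1 - P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}. \tag{8}
\]

Therefore, when $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1) = 1$, $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) = c_2$ holds.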

For an infinite sample size, the event $\{p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1\}$ is deterministic regardless of the value that $c_1$ takes. For a finite sample size, $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)$ can be very close to, or equal to, 1 for a reasonable value of $c_1$. For example, suppose there are three treatment conditions with an equal sample size $n$ under each of the three conditions. Suppose further that a gene has treatment means (0, 0, 3). Using the non-central F-distribution under the assumption of the normal distribution, $P(p_i \le 0.01 \mid D_{ij} = 0,\, D_i = 1) = 0.9846$ when $n = 6$, 0.9999 when $n = 10$, and 1 when $n = 30$; $P(p_i \le 0.001 \mid D_{ij} = 0,\, D_i = 1) = 0.8563$ when $n = 6$, 0.9991 when $n = 10$, and 1 when $n = 30$.

When $c_1$ is extremely small, $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)$ can be much smaller than 1 for a finite sample size. Using equation (8), the following method can be employed to provide an overestimate of $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)$:

\[
\hat{P}(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) = c_2 + c_2 \cdot \frac{1 - \hat{P}(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{\hat{P}(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}. \tag{9}
\]



Let $E$ be the set of genes which enter the second step of the two-step procedure, but do not have all pairwise comparisons statistically significant, i.e.,

\[
E = \{\text{gene } g : p_g \le c_1,\ \exists \text{ at least one } j \text{ such that } p_{gj} > c_2\}.
\]

Since the true means are unknown, they have to be estimated. For gene $g \in E$, let $\bar{x}_{gj}$ denote the sample mean for gene $g$ under treatment condition $j$, and let $[j]$ denote the treatment whose sample mean has the $j$th smallest magnitude (absolute value). For example, if the three treatment means for gene $g$ satisfy $|\bar{x}_{g3}| < |\bar{x}_{g1}| < |\bar{x}_{g2}|$, then $[1] = 3$, $[2] = 1$ and $[3] = 2$. For gene $g \in E$, define the pseudo means under the $J$ treatment conditions as follows: $\mu_{g[1]} = \cdots = \mu_{g[J-1]} = 0$ and $\mu_{g[J]} = \hat{\mu}_g$, where $\hat{\mu}_g = \max\{|\bar{x}_{gi} - \bar{x}_{gj}| : i, j = 1, \ldots, J,\ i \ne j\}$. It becomes necessary to compute the probability that a gene with these pseudo means will have a p-value for testing the equality of means below $c_1$. Under the assumption of normality, the global F-test statistic for testing the equality of the means has a non-central F-distribution with non-centrality parameter

\[
\mathrm{ncp}_g = \frac{\sum_{j=1}^{J-1} n_{[j]} \left(0 - \hat{\mu}_g / J\right)^2 + n_{[J]} \left(\hat{\mu}_g - \hat{\mu}_g / J\right)^2}{\hat{\sigma}_g^2},
\]

where $n_j$ is the sample size under treatment $j$ and $\hat{\sigma}_g^2$ is the estimate of the variance for gene $g$. Then

\[
\hat{P}(p_g \le c_1 \mid D_{gj} = 0,\, D_g = 1) = P\left(f_{J-1,\, N-J,\, \mathrm{ncp}_g} \ge F^{-1}_{J-1,\, N-J}(1 - c_1)\right),
\]

where $N = \sum_j n_j$, $f_{J-1,\, N-J,\, \mathrm{ncp}_g}$ is a random variable with a non-central F-distribution with degrees of freedom $J - 1$ and $N - J$ and non-centrality parameter $\mathrm{ncp}_g$, and $F^{-1}_{J-1,\, N-J}(1 - c_1)$ is the $(1 - c_1) \times 100$th percentile of an F-distribution with degrees of freedom $J - 1$ and $N - J$. Thus,

\[
\hat{P}(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1) = \text{average of } \hat{P}(p_g \le c_1 \mid D_{gj} = 0,\, D_g = 1) \text{ over } g \in E. \tag{10}
\]

When the assumption of normality does not hold, a permutation method is presented (in Section 6) to estimate this probability.

(5) The last component of equation (4), $P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1)$, can be estimated using the proportion of non-significant pairwise comparisons among all pairwise comparisons associated with correctly rejected genes in Step 1. However, it is impossible to separate the correctly rejected genes from the falsely rejected genes, hence an overestimate is pursued.

Define $\hat{\pi}_{02}$ as the estimate of the proportion of "true" null hypotheses given the distribution of the p-values in Step 2. We emphasize "true" here because $\hat{\pi}_{02}$ is computed from the distribution of p-values in Step 2 using the same methods as those used to estimate $\pi_{01}$, and it is not exactly $P(D_{ij} = 0 \mid p_i \le c_1)$. Let $K$ denote the number of genes in Step 2. Then $C \times K \times \hat{\pi}_{02}$ estimates the number of "true" null hypotheses based on the p-values, and $C \times K \times (1 - \mathrm{FDR}_1)$ is the estimated number of pairwise comparisons generated by correctly rejected genes. Since the p-value ($p_{ij}$) corresponding to $D_i = 1$ and $D_{ij} = 0$ is approximately uniformly distributed, and the estimate $C \times K \times \hat{\pi}_{02}$ also includes some true null hypotheses corresponding to $D_i = 0$ and $D_{ij} = 0$, the number of true null hypotheses ($D_{ij} = 0$) corresponding to $D_i = 1$ is less than or equal to $C \times K \times \hat{\pi}_{02}$. Therefore,

\[
\hat{P}(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1) = \frac{C \times K \times \hat{\pi}_{02}}{C \times K \times (1 - \mathrm{FDR}_1)} = \frac{\hat{\pi}_{02}}{1 - \mathrm{FDR}_1}.
\]
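The resampling estimate of component (3) can be sketched in pure Python for the balanced three-treatment case. Several choices here are assumptions made for the sake of a self-contained example and are not specified by the text: the pairwise comparisons use pooled two-sample t-tests, the p-values use closed forms that hold only for an F-distribution with 2 numerator degrees of freedom and for a t-distribution with even degrees of freedom, and the default number of simulated genes is far smaller than the $M = 100{,}000$ suggested above. All function names are hypothetical.

```python
import math
import random
from itertools import combinations


def f_pvalue_df1_2(f_stat: float, df2: int) -> float:
    """Upper tail of F(2, df2); this closed form is valid only for df1 = 2."""
    return (1.0 + 2.0 * f_stat / df2) ** (-df2 / 2.0)


def t_pvalue_even_df(t_stat: float, df: int) -> float:
    """Two-sided t-test p-value; this finite series is valid only for even df."""
    theta = math.atan(abs(t_stat) / math.sqrt(df))
    cos2 = math.cos(theta) ** 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (2 * k - 1) / (2 * k) * cos2
        total += term
    return 1.0 - math.sin(theta) * total


def mean_var(x):
    """Sample mean and unbiased sample variance."""
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / (len(x) - 1)


def null_rejection_rate(c1, c2, n=6, n_genes=20000, seed=0):
    """Resampling estimate of P(p_ij <= c2 | D_i = 0, p_i <= c1), equation (7)."""
    rng = random.Random(seed)
    pairs = list(combinations(range(3), 2))  # C = 3 pairwise comparisons
    n_selected = 0
    n_rejections = 0
    for _ in range(n_genes):
        # Null data: every treatment has mean 0 and standard deviation 1.
        stats = [mean_var([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(3)]
        means = [s[0] for s in stats]
        variances = [s[1] for s in stats]
        grand = sum(means) / 3
        ms_between = n * sum((mu - grand) ** 2 for mu in means) / 2
        ms_within = sum(variances) / 3  # pooled variance for equal group sizes
        if f_pvalue_df1_2(ms_between / ms_within, 3 * (n - 1)) > c1:
            continue  # gene does not enter Step 2
        n_selected += 1
        for a, b in pairs:
            pooled = (variances[a] + variances[b]) / 2
            t_stat = (means[a] - means[b]) / math.sqrt(pooled * 2 / n)
            if t_pvalue_even_df(t_stat, 2 * n - 2) <= c2:
                n_rejections += 1
    if n_selected == 0:
        return float("nan")
    return n_rejections / (n_selected * len(pairs))


print(null_rejection_rate(c1=0.10, c2=0.05, n=6, n_genes=5000, seed=1))
```

The estimate is typically well above $c_2$, because genes that falsely pass the global test tend to have at least one large pairwise difference. For unbalanced designs, more treatments, or non-normal data, a statistics library or the permutation approach mentioned in item (3) would be needed instead of these closed forms.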

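Using equations (5)–(9), along with the estimates of the proportions of true null hypotheses ($\hat{\pi}_{01}$ and $\hat{\pi}_{02}$) in Step 1 and Step 2, the FDR (equation (4)) of the two-step multiple comparison procedure can be estimated by

\[
\widehat{\mathrm{FDR}} = \frac{\hat{P}(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)}{\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1)} \cdot \widehat{\mathrm{FDR}}_1 + \frac{\hat{P}(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) \cdot \hat{\pi}_{02}}{\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1)}. \tag{11}
\]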
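A minimal sketch of the plug-in estimate in equation (11) follows, assuming the p-values, the resampled null rate from equation (7), and the two null-proportion estimates are already available. The function names, argument names, and toy numbers are hypothetical. The argument `power_step1` stands for $\hat{P}(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)$; equation (9) simplifies algebraically to $c_2$ divided by that quantity, so the default of 1 reproduces the plain estimate $c_2$. The helper `pseudo_ncp` evaluates the non-centrality parameter of the pseudo means; turning it into a probability requires a non-central F tail from a statistics library, which is not attempted here.

```python
from typing import List


def pseudo_ncp(sample_means: List[float], sizes: List[int], var_hat: float) -> float:
    """Non-centrality parameter for the pseudo means (0, ..., 0, mu_hat)."""
    J = len(sample_means)
    mu_hat = max(abs(a - b) for a in sample_means for b in sample_means)
    # order[0] is treatment [1] (smallest |mean|), order[-1] is treatment [J].
    order = sorted(range(J), key=lambda j: abs(sample_means[j]))
    center = mu_hat / J  # the formula in the text centers the pseudo means here
    ncp = sum(sizes[j] * (0.0 - center) ** 2 for j in order[:-1])
    ncp += sizes[order[-1]] * (mu_hat - center) ** 2
    return ncp / var_hat


def estimate_two_step_fdr(
    p_gene: List[float],
    p_pair: List[List[float]],
    null_rate: float,      # equation (7), e.g. from resampling
    pi01_hat: float,       # estimated null proportion in Step 1
    pi02_hat: float,       # estimated null proportion in Step 2
    c1: float,
    c2: float,
    power_step1: float = 1.0,
) -> float:
    m = len(p_gene)
    selected = [i for i in range(m) if p_gene[i] <= c1]
    if not selected:
        raise ValueError("no gene passes Step 1; nothing to estimate")
    n_comp = len(p_pair[0])  # C, the number of comparisons per gene
    rejections = sum(1 for i in selected for p in p_pair[i] if p <= c2)
    if rejections == 0:
        raise ValueError("no rejection in Step 2; the estimate is undefined")
    rate_step2 = rejections / (len(selected) * n_comp)   # equation (5)
    fdr1_hat = c1 * pi01_hat / (len(selected) / m)       # equation (6)
    null_pair_rate = c2 / power_step1                    # equation (9), simplified
    return (null_rate * fdr1_hat + null_pair_rate * pi02_hat) / rate_step2


# Toy inputs (made up for illustration).
p_gene = [0.001, 0.20, 0.04, 0.60]
p_pair = [[0.002, 0.30, 0.01], [0.01, 0.02, 0.03],
          [0.04, 0.06, 0.50], [0.70, 0.80, 0.90]]
fdr_hat = estimate_two_step_fdr(p_gene, p_pair, null_rate=0.3,
                                pi01_hat=0.5, pi02_hat=0.4, c1=0.05, c2=0.05)
print(round(fdr_hat, 4))  # 0.07
print(pseudo_ncp([0.0, 0.0, 3.0], [6, 6, 6], var_hat=1.0))  # 36.0
```

In the toy call, two of four genes pass Step 1 and three of their six comparisons are rejected, so equation (5) gives 0.5, equation (6) gives 0.05, and the estimate is (0.3 × 0.05 + 0.05 × 0.4) / 0.5 = 0.07. The estimate from equation (6) is left uncapped here; capping it at 1 would be a further design choice.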

3.3 Simulation study and results

A simulation study is employed to illustrate the accuracy of the proposed method for estimating the FDR of the two-step multiple comparison procedure. Assume there are 3 treatments and $m = 1000$ genes. Allow a proportion ($R_1$) of the genes to have a treatment effect. For any gene having a treatment effect, there are two cases: it is differentially expressed across all three treatments; or it is not differentially expressed between two treatments, but differentially expressed under the third treatment. Among the genes which have a treatment effect, assume a proportion ($R_2$) of them are not differentially expressed between two treatments, but differentially expressed under the third treatment. That is, $R_1 \times m$ genes have treatment effects, and $R_1 \times R_2 \times m$ genes have treatment means $(\mu_a, \mu_0, \mu_0)$ or $(\mu_0, \mu_a, \mu_0)$ or $(\mu_0, \mu_0, \mu_a)$, where $\mu_0$ and $\mu_a$ are different; and $R_1 \times (1 - R_2) \times m$ genes have treatment means $(\mu_1, \mu_2, \mu_3)$, where $\mu_1$, $\mu_2$, and $\mu_3$ are different. In this simulation, half of the $R_1 \times R_2 \times m$ genes are chosen to have means (2, 0, 0) and the other half have means (4, 0, 0); and the $R_1 \times (1 - R_2) \times m$ genes have means (4, 2, 0). For the $(1 - R_1) \times m$ genes not having a treatment effect, the mean vector is (0, 0, 0). The values for $R_1$ are 0.10, 0.20, 0.30, 0.40, and 0.50, and the values for $R_2$ are 0.0, 0.20, 0.40, 0.60, 0.80 and 1. Large values of $R_1$ are not used in this simulation because the proportion of significant genes in most microarray studies is relatively small. Assume for each gene that there are $n = 6$ observations under each of the treatments. For each combination of $R_1$ and $R_2$, 1000 data sets (each of size 1000 genes × 6 replicates × 3 treatments) are generated from normal distributions with standard deviation 1. For each simulated data set, 1000 global F-test statistics corresponding to the $m = 1000$ genes are computed for testing equality of the three treatment means. If a gene has a p-value smaller than or equal to a pre-specified level $c_1$, then it is considered as having a significant treatment effect, and thus enters the second step. In the second step, for the genes with statistically significant treatment effects from Step 1, pairwise comparisons are performed using t-tests. Pairwise comparisons with a p-value less than or equal to a pre-specified level $c_2$ are considered statistically significant. Various values of $c_1$ and $c_2$ are used in the simulation. For each simulated data set, $\hat{\pi}_{01}$ and $\hat{\pi}_{02}$, the estimates of the proportion of true null hypotheses in Step 1 and Step 2, are computed using Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), and the FDR is estimated using equation (11). The averages of the estimated FDR from 1000 simulations for $(c_1, c_2) = (0.10, 0.05)$, $(0.10, 0.01)$ and $(0.05, 0.01)$ are presented in Table 1. The average of the true FDR from the 1000 simulations is also presented. For the estimated FDR presented in Table 1, $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)$ is estimated using $c_2$ instead of equation (9). It is clear that the estimated FDR is very close to the true FDR when $c_1$ is not too small, which indicates that $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)$ is close to 1.

As seen in Table 1, the proposed method yields accurate estimates of the overall FDR. As one would expect, the overall FDR for any two-step procedure depends on the configuration of $R_1$ and $R_2$. For our two-step approach with $c_1 = 0.10$ and $c_2 = 0.05$, when $R_1 = 0.10$ and $R_2 = 1.0$ the FDR can be as big as 0.39, yet when $R_1 = 0.50$ and $R_2 = 0.0$ the FDR can be as small as 0.046. For the same value of $R_1$ and the same rejection regions $[0, c_1]$ in Step 1 and $[0, c_2]$ in Step 2, the FDR increases as $R_2$ increases. On the other hand, for the same value of $R_2$ and the same rejection regions $[0, c_1]$ in Step 1 and $[0, c_2]$ in Step 2, the FDR decreases as $R_1$ (the proportion of genes having a treatment effect) increases.
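The simulation design described in this section can be written down compactly. The sketch below builds the list of true mean vectors for one $(R_1, R_2)$ configuration and computes the proportion of false discoveries among a set of rejected comparisons, which is the quantity averaged over simulated data sets to obtain the true FDR. The function names are hypothetical, rounding of the group sizes is an assumption for configurations that do not divide evenly, and the ordering of the three comparisons as (1,2), (1,3), (2,3) is likewise an illustrative choice.

```python
from typing import List, Set, Tuple

Mean = Tuple[float, float, float]


def simulation_means(m: int, r1: float, r2: float) -> List[Mean]:
    """True treatment means for m genes under the design with proportions R1, R2."""
    n_effect = round(r1 * m)               # genes with a treatment effect
    n_one_differs = round(r2 * n_effect)   # one mean differs, two are equal
    n_first_half = n_one_differs // 2
    means: List[Mean] = [(2.0, 0.0, 0.0)] * n_first_half
    means += [(4.0, 0.0, 0.0)] * (n_one_differs - n_first_half)
    means += [(4.0, 2.0, 0.0)] * (n_effect - n_one_differs)
    means += [(0.0, 0.0, 0.0)] * (m - n_effect)
    return means


def true_pair_flags(mu: Mean) -> List[int]:
    """D_ij for the comparisons (1,2), (1,3), (2,3): 1 if the two means differ."""
    return [int(mu[a] != mu[b]) for a, b in ((0, 1), (0, 2), (1, 2))]


def false_discovery_proportion(means: List[Mean], rejected: Set[Tuple[int, int]]) -> float:
    """Share of rejected (gene, comparison) pairs whose true D_ij is 0."""
    if not rejected:
        return 0.0  # convention: no rejections means no false discoveries
    n_false = sum(1 for i, j in rejected if true_pair_flags(means[i])[j] == 0)
    return n_false / len(rejected)


means = simulation_means(m=1000, r1=0.10, r2=0.40)
print(means.count((2.0, 0.0, 0.0)), means.count((4.0, 0.0, 0.0)),
      means.count((4.0, 2.0, 0.0)), means.count((0.0, 0.0, 0.0)))  # 20 20 60 900
```

For a gene with means (2, 0, 0), the comparison between the second and third treatments is a true null even though the gene has a treatment effect; such comparisons are the source of the $(1 - \mathrm{FDR}_1)$ term in equation (4).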

4 Estimating FDR for fixed FDR significance levels

The two-step multiple comparison procedure can also be applied using fixed FDR significance levels in Step 1 and Step 2, respectively. For instance, an FDR controlling procedure at FDR significance level $\alpha_1$ ($\alpha_1$ is known and fixed) is applied to the p-values in Step 1, and statistically significant genes are identified. Let $A$ denote the collection of statistically significant genes. Define $d_1$ to be the smallest p-value in Step 1 which is not statistically significant, i.e., $d_1 = \min\{p_i,\ i \in A^c\}$, where $A^c$ is the complement of $A$. In Step 2, pairwise comparisons associated with the statistically significant genes (i.e., genes in set $A$) are tested using an FDR controlling procedure at FDR significance level $\alpha_2$ ($\alpha_2$ is known and fixed), and statistically significant effects are identified. Let $d_2$ be the smallest p-value among the pairwise comparisons in Step 2 which are not statistically significant. Since the goal is to compute the overall FDR, this can be achieved by replacing $c_1$ and $c_2$ with $d_1$ and $d_2$, respectively, when using the method for estimating the FDR for fixed rejection regions (equation (11)). That is, assuming $d_1$ and $d_2$ are known,

\[
\widehat{\mathrm{FDR}}(\alpha_1, \alpha_2) = \frac{\hat{P}(p_{ij} \le d_2 \mid D_i = 0,\, p_i \le d_1)}{\hat{P}(p_{ij} \le d_2 \mid p_i \le d_1)} \cdot \widehat{\mathrm{FDR}}_1 + \frac{\hat{P}(p_{ij} \le d_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le d_1) \cdot \hat{\pi}_{02}}{\hat{P}(p_{ij} \le d_2 \mid p_i \le d_1)}. \tag{12}
\]

It is worth noting that for this approach, $d_1$ is determined by the p-values in Step 1, $\alpha_1$, and the FDR controlling procedure applied in Step 1; and $d_2$ is determined by the p-values in both steps, $\alpha_1$, $\alpha_2$, and the FDR controlling procedures applied in Step 1 and Step 2, respectively.
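To make the roles of $d_1$ and $d_2$ concrete, the sketch below runs the two steps with the Benjamini and Hochberg (1995) step-up procedure as the FDR controlling procedure in both steps. That choice is an assumption for illustration; the text allows any FDR controlling procedure. The function names and the toy p-values are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def bh_reject(pvalues: List[float], alpha: float) -> Set[int]:
    """Benjamini-Hochberg step-up: indices of the rejected hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k_max = rank  # largest rank whose p-value is under its threshold
    return set(order[:k_max])


def smallest_nonsignificant(pvalues: List[float], rejected: Set[int]) -> Optional[float]:
    """Smallest p-value that was not rejected (None if everything was rejected)."""
    rest = [p for i, p in enumerate(pvalues) if i not in rejected]
    return min(rest) if rest else None


def two_step_bh(p_gene: List[float], p_pair: List[List[float]],
                alpha1: float, alpha2: float):
    genes = bh_reject(p_gene, alpha1)                 # the set A
    d1 = smallest_nonsignificant(p_gene, genes)
    if not genes:                                     # K = 0: stop
        return genes, set(), d1, None
    # Step 2 family: the C x K comparisons that belong to genes in A.
    family: List[Tuple[int, int]] = [
        (i, j) for i in sorted(genes) for j in range(len(p_pair[i]))
    ]
    flat = [p_pair[i][j] for i, j in family]
    hits = bh_reject(flat, alpha2)
    d2 = smallest_nonsignificant(flat, hits)
    return genes, {family[k] for k in hits}, d1, d2


# Toy p-values (made up for illustration).
p_gene = [0.001, 0.008, 0.039, 0.041, 0.6]
p_pair = [[0.001, 0.30, 0.012], [0.02, 0.5, 0.04], [0.01, 0.01, 0.01],
          [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]
genes, pairs, d1, d2 = two_step_bh(p_gene, p_pair, alpha1=0.05, alpha2=0.05)
print(sorted(genes), sorted(pairs), d1, d2)
# [0, 1] [(0, 0), (0, 2), (1, 0)] 0.039 0.04
```

In the toy data, the Step 2 family contains only the six comparisons of genes 0 and 1, so $d_2$ depends on which genes survived Step 1, as noted above. The resulting $d_1$ and $d_2$ would then take the place of $c_1$ and $c_2$ in the fixed-region estimator.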

5 Controlling the FDR at a desired significance level

Instead of estimating the FDR for a fixed rejection region, traditional multiple comparison procedures (Hochberg and Tamhane, 1987; Hsu, 1996) reject the null hypotheses at a pre-chosen significance level. If the desired FDR significance level of the two-step multiple comparison procedure is α, then the problem becomes choosing the FDR significance levels α1 and α2 in Step 1 and Step 2, respectively, so that the overall FDR is controlled at level α.

Table 1: Simulation results. Estimated FDR (F̂DR) and true FDR of pairwise comparisons for 3 treatments and 1000 genes as applied to the two-step multiple comparison procedure using fixed rejection regions c1 and c2 in Steps 1 and 2, respectively. R1: the proportion of genes having a treatment effect; R2: the proportion of genes with a treatment effect having one treatment mean different and the other two the same.

| Rejection regions | Quantity | R1 | R2 = 0.0 | 0.20 | 0.40 | 0.60 | 0.80 | 1.0 |
|---|---|---|---|---|---|---|---|---|
| c1 = 0.10, c2 = 0.05 | F̂DR | 0.10 | 0.299 | 0.315 | 0.331 | 0.350 | 0.367 | 0.392 |
| | | 0.20 | 0.160 | 0.171 | 0.183 | 0.196 | 0.212 | 0.230 |
| | | 0.30 | 0.100 | 0.109 | 0.118 | 0.129 | 0.141 | 0.156 |
| | | 0.40 | 0.067 | 0.074 | 0.082 | 0.090 | 0.101 | 0.113 |
| | | 0.50 | 0.046 | 0.052 | 0.058 | 0.066 | 0.075 | 0.085 |
| | True FDR | 0.10 | 0.296 | 0.311 | 0.328 | 0.346 | 0.369 | 0.391 |
| | | 0.20 | 0.157 | 0.168 | 0.182 | 0.197 | 0.213 | 0.230 |
| | | 0.30 | 0.098 | 0.107 | 0.118 | 0.129 | 0.142 | 0.158 |
| | | 0.40 | 0.065 | 0.073 | 0.082 | 0.091 | 0.102 | 0.115 |
| | | 0.50 | 0.044 | 0.051 | 0.058 | 0.066 | 0.076 | 0.087 |
| c1 = 0.10, c2 = 0.01 | F̂DR | 0.10 | 0.104 | 0.111 | 0.118 | 0.126 | 0.134 | 0.144 |
| | | 0.20 | 0.049 | 0.053 | 0.057 | 0.061 | 0.066 | 0.072 |
| | | 0.30 | 0.029 | 0.032 | 0.035 | 0.038 | 0.041 | 0.046 |
| | | 0.40 | 0.019 | 0.021 | 0.023 | 0.026 | 0.028 | 0.032 |
| | | 0.50 | 0.013 | 0.015 | 0.016 | 0.018 | 0.020 | 0.023 |
| | True FDR | 0.10 | 0.102 | 0.107 | 0.116 | 0.122 | 0.130 | 0.142 |
| | | 0.20 | 0.048 | 0.051 | 0.056 | 0.060 | 0.066 | 0.071 |
| | | 0.30 | 0.028 | 0.031 | 0.034 | 0.037 | 0.041 | 0.045 |
| | | 0.40 | 0.018 | 0.021 | 0.023 | 0.025 | 0.028 | 0.032 |
| | | 0.50 | 0.012 | 0.014 | 0.016 | 0.018 | 0.020 | 0.023 |
| c1 = 0.05, c2 = 0.01 | F̂DR | 0.10 | 0.104 | 0.110 | 0.117 | 0.124 | 0.133 | 0.143 |
| | | 0.20 | 0.049 | 0.052 | 0.056 | 0.061 | 0.066 | 0.072 |
| | | 0.30 | 0.029 | 0.032 | 0.034 | 0.038 | 0.041 | 0.045 |
| | | 0.40 | 0.019 | 0.021 | 0.023 | 0.025 | 0.028 | 0.031 |
| | | 0.50 | 0.013 | 0.014 | 0.016 | 0.018 | 0.020 | 0.023 |
| | True FDR | 0.10 | 0.101 | 0.107 | 0.114 | 0.122 | 0.130 | 0.142 |
| | | 0.20 | 0.048 | 0.052 | 0.055 | 0.060 | 0.065 | 0.071 |
| | | 0.30 | 0.029 | 0.031 | 0.034 | 0.037 | 0.041 | 0.045 |
| | | 0.40 | 0.018 | 0.020 | 0.023 | 0.025 | 0.028 | 0.031 |
| | | 0.50 | 0.012 | 0.014 | 0.016 | 0.018 | 0.021 | 0.023 |

5.1 An approximate upper bound for FDR

The resampling procedure required for estimating the FDR (equation 11) can be a disadvantage: when the experimental design is complicated, it may be difficult to generate data under the null hypothesis. Fortunately, an upper bound on the ratio

$$
\frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
$$

is available, so estimating P(pij ≤ c2 | Di = 0, pi ≤ c1) via simulation can be avoided.

Theorem 2. In the two-step multiple comparison procedure,

$$
\frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \le 1.
$$

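Proof.

$$
\begin{aligned}
\frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
&= \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\ D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \\
&= \frac{P(p_{ij} \le c_2,\ D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1) \,/\, P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \\
&= \frac{P(p_{ij} \le c_2,\ D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1) \,/\, P(p_{ij} \le c_2 \mid p_i \le c_1)}{P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1)} \\
&= \frac{P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1,\ p_{ij} \le c_2)}{P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1)} \\
&\le 1.
\end{aligned} \tag{13}
$$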

When c2 ≤ 1, this inequality (equation 13) holds for two specific reasons. First, the probability of a false rejection in Step 1 (rejecting the null hypothesis H_i^0 when it is true) depends only on the p-values pi and c1. Second, with a constraint in Step 2 (pij ≤ c2 and c2 < 1), the chance of making a false rejection (rejecting the null hypothesis H_ij^0 when it is true) is smaller than for a procedure in which no constraint is applied.



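When P(pi ≤ c1 | Di = 1, Dij = 0 for some j) = 1, the FDR (equation 4) is

$$
\mathrm{FDR} = \mathrm{FDR}_1 \cdot \frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
+ (1 - \mathrm{FDR}_1)\, \frac{c_2\, P(D_{ij} = 0 \mid D_i = 1,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}.
$$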

Define π02, FDR2 and pFDR2 to be the proportion of true null hypotheses, the FDR and the pFDR in Step 2 based on the empirical distribution of the p-values. Then

$$
\mathrm{FDR}_2 = \mathrm{pFDR}_2 = \frac{c_2 \cdot \pi_{02}}{P(p_{ij} \le c_2 \mid p_i \le c_1)}.
$$

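Notice that

$$
\frac{c_2 \cdot P(D_{ij} = 0 \mid D_i = 1,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
\le \frac{c_2 \cdot \pi_{02} / (1 - \mathrm{FDR}_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
= \frac{\mathrm{FDR}_2}{1 - \mathrm{FDR}_1},
$$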

thus an upper bound for the overall FDR (equation 4) is,

FDR ≤ FDR1 + FDR2. (14)

Therefore, the overall FDR can be controlled below level α as long as the FDR significance levels α1 and α2 used in Step 1 and Step 2, respectively, satisfy α1 + α2 ≤ α.
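For readers who want the intermediate step, the bound (14) can be written out as follows. This is only a restatement of the argument above, under the same condition P(pi ≤ c1 | Di = 1, Dij = 0 for some j) = 1: the first term is bounded using Theorem 2, and the second using the inequality involving FDR2.

```latex
\begin{aligned}
\mathrm{FDR}
&= \mathrm{FDR}_1 \cdot
   \frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
 + (1 - \mathrm{FDR}_1)\,
   \frac{c_2\, P(D_{ij} = 0 \mid D_i = 1,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \\
&\le \mathrm{FDR}_1 \cdot 1
 + (1 - \mathrm{FDR}_1) \cdot \frac{\mathrm{FDR}_2}{1 - \mathrm{FDR}_1} \\
&= \mathrm{FDR}_1 + \mathrm{FDR}_2 .
\end{aligned}
```

For example, levels α1 = 0.04 and α2 = 0.01 give the bound 0.04 + 0.01 = 0.05.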

However, when P(pi ≤ c1 | Di = 1, Dij = 0 for some j) is far less than 1, the realized FDR may exceed FDR1 + FDR2. One strategy is to allocate more of the overall FDR to FDR1, so that P(pi ≤ c1 | Di = 1, Dij = 0 for some j) is closer to 1 and, at the same time, more genes are included in the analysis in Step 2. Next, we investigate the performance of the two-step procedure with fixed FDR significance levels in Step 1 and Step 2, and propose a method to choose the FDR significance levels in the two steps so that the overall FDR can be controlled below a pre-chosen overall FDR significance level.

5.2 Fixing the FDR significance levels

A simulation study is employed to illustrate the improved power of the two-step multiple comparison procedure over the one-step procedure. The simulation scenario is the same as in Section 3.3. There are 3 treatment conditions, a sample size of n = 6 within each treatment condition, and m = 1000 genes. For each combination of R1 and R2, 1000 data sets are generated from standard normal distributions, and there are 3nm data points within each data set. The FDR controlling procedure is then applied to the corresponding 1000 genes at an FDR significance level α1. In the second step, for the genes with significant treatment effects from Step 1, pairwise comparisons are performed with the FDR controlling procedure at FDR significance level α2. The respective FDR significance levels used in the first and second step are (α1, α2) = (0.04, 0.01) and (0.03, 0.02), and the estimated FDR, the true FDR and the average power are listed in Tables 2 and 3. Here, the average power is defined to be the expected proportion of correct rejections among the true alternative hypotheses.

For the purpose of comparing the results with the one-step FDR controlling procedure, the true FDR and the average power for the one-step procedure are also listed in Table 2. For the one-step procedure, an FDR controlling procedure is applied to the family of 3m pairwise comparisons. Specifically, Benjamini and Hochberg's adaptive FDR controlling procedure (Benjamini and Hochberg, 2000), incorporating the estimate of the proportion of null hypotheses given by Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), is employed. When the proportion of genes having a treatment effect (R1) is small, the two-step multiple comparison procedure is more powerful than the one-step multiple comparison procedure because of the reduced number of tests in Step 2. For example, in this simulation, when R1 = 0.2 and R2 = 0.2, the one-step procedure has 80% power, while the two-step procedure has approximately 96% power. As observed from the simulations, when R2 (the proportion of significant genes for which one treatment effect is different but the other two are the same) increases, the power of the two-step procedure decreases. This is due to the fact that when R2 increases, fewer genes are included in Step 2.
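The two performance measures reported in the tables can be computed per simulated data set as in the following minimal Python sketch, assuming the truth of each pairwise hypothesis is known (as it is in a simulation). The function name `fdp_and_power` and the convention of reporting a false discovery proportion of 0 when nothing is rejected are assumptions made for illustration; averaging the two returned values over the simulated data sets gives the true FDR and the average power.

```python
def fdp_and_power(rejected, is_null):
    """Realized false discovery proportion and power for one data set.

    rejected[k] is True when comparison k is declared significant;
    is_null[k] is True when the null hypothesis of comparison k is true.
    """
    n_rejected = sum(rejected)
    false_rejections = sum(r and h for r, h in zip(rejected, is_null))
    correct_rejections = n_rejected - false_rejections
    n_alternatives = sum(not h for h in is_null)
    # Convention (assumption): the proportion is 0 when nothing is rejected.
    fdp = false_rejections / n_rejected if n_rejected else 0.0
    power = (correct_rejections / n_alternatives
             if n_alternatives else float("nan"))
    return fdp, power
```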

From this simulation, the power for α1 = 0.04, α2 = 0.01 is slightly bigger than that for α1 = 0.03, α2 = 0.02 when R1 is small. Furthermore, when α1 = 0.04, more genes are included in Step 2. Simulations have been performed for values of the FDR level α ranging over 0.01, 0.02, ..., 0.2. The FDR controlling procedure incorporating the estimate of the proportion of true null hypotheses is applied in both steps of the two-step procedure, and the Step 1 and Step 2 FDR levels are set to α1 = (4/5)α and α2 = (1/5)α. These simulations (Figure 1) demonstrate that the overall FDR is controlled at FDR level α for all values of α. Based on this work and experience, our ad hoc suggestion is to use α1 = (4/5)α and α2 = (1/5)α if the overall FDR is required to be controlled at FDR level α.



Table 2: Simulation results. Estimated FDR (F̂DR), true FDR, and average power for pairwise comparisons for 3 treatment conditions and 1000 genes using both the two-step and one-step procedure, respectively. For the two-step procedure, the FDR significance levels α1 = 0.04 and α2 = 0.01 are used in Step 1 and Step 2, respectively. For the one-step procedure, the FDR significance level is 0.05.

| Procedure | Quantity | R1 | R2 = 0.0 | 0.20 | 0.40 | 0.60 | 0.80 | 1.0 |
|---|---|---|---|---|---|---|---|---|
| Two-Step | F̂DR | 0.10 | 0.053 | 0.053 | 0.052 | 0.052 | 0.053 | 0.057 |
| | | 0.20 | 0.053 | 0.048 | 0.046 | 0.046 | 0.047 | 0.049 |
| | | 0.30 | 0.053 | 0.046 | 0.044 | 0.043 | 0.043 | 0.044 |
| | | 0.40 | 0.053 | 0.044 | 0.042 | 0.041 | 0.041 | 0.041 |
| | | 0.50 | 0.053 | 0.042 | 0.040 | 0.039 | 0.038 | 0.039 |
| | True FDR | 0.10 | 0.037 | 0.045 | 0.046 | 0.047 | 0.048 | 0.051 |
| | | 0.20 | 0.038 | 0.042 | 0.043 | 0.043 | 0.045 | 0.046 |
| | | 0.30 | 0.037 | 0.041 | 0.042 | 0.042 | 0.042 | 0.043 |
| | | 0.40 | 0.037 | 0.040 | 0.040 | 0.040 | 0.040 | 0.040 |
| | | 0.50 | 0.037 | 0.038 | 0.038 | 0.037 | 0.037 | 0.037 |
| | Power | 0.10 | 0.993 | 0.949 | 0.898 | 0.849 | 0.800 | 0.761 |
| | | 0.20 | 0.997 | 0.960 | 0.919 | 0.881 | 0.851 | 0.825 |
| | | 0.30 | 0.999 | 0.965 | 0.929 | 0.899 | 0.877 | 0.861 |
| | | 0.40 | 0.999 | 0.968 | 0.936 | 0.910 | 0.892 | 0.881 |
| | | 0.50 | 0.999 | 0.970 | 0.941 | 0.917 | 0.901 | 0.890 |
| One-Step | True FDR | 0.10 | 0.050 | 0.051 | 0.050 | 0.050 | 0.051 | 0.056 |
| | | 0.20 | 0.049 | 0.049 | 0.051 | 0.051 | 0.050 | 0.050 |
| | | 0.30 | 0.050 | 0.050 | 0.052 | 0.051 | 0.050 | 0.050 |
| | | 0.40 | 0.050 | 0.050 | 0.050 | 0.050 | 0.052 | 0.048 |
| | | 0.50 | 0.050 | 0.051 | 0.051 | 0.051 | 0.050 | 0.051 |
| | Power | 0.10 | 0.695 | 0.700 | 0.715 | 0.712 | 0.726 | 0.748 |
| | | 0.20 | 0.800 | 0.798 | 0.797 | 0.800 | 0.805 | 0.816 |
| | | 0.30 | 0.861 | 0.858 | 0.853 | 0.850 | 0.851 | 0.857 |
| | | 0.40 | 0.903 | 0.894 | 0.891 | 0.889 | 0.885 | 0.886 |
| | | 0.50 | 0.931 | 0.926 | 0.921 | 0.914 | 0.910 | 0.909 |



Table 3: Simulation results. Estimated FDR (F̂DR), true FDR, and average power for pairwise comparisons for 3 treatment conditions and 1000 genes using the two-step procedure at the FDR significance levels α1 = 0.03 and α2 = 0.02 in Step 1 and Step 2, respectively.

| Quantity | R1 | R2 = 0.0 | 0.20 | 0.40 | 0.60 | 0.80 | 1.0 |
|---|---|---|---|---|---|---|---|
| F̂DR | 0.10 | 0.042 | 0.056 | 0.053 | 0.054 | 0.055 | 0.059 |
| | 0.20 | 0.042 | 0.051 | 0.049 | 0.049 | 0.051 | 0.052 |
| | 0.30 | 0.042 | 0.049 | 0.047 | 0.048 | 0.048 | 0.050 |
| | 0.40 | 0.042 | 0.048 | 0.046 | 0.046 | 0.047 | 0.048 |
| | 0.50 | 0.042 | 0.046 | 0.045 | 0.045 | 0.045 | 0.046 |
| True FDR | 0.10 | 0.030 | 0.048 | 0.051 | 0.052 | 0.055 | 0.057 |
| | 0.20 | 0.030 | 0.047 | 0.049 | 0.050 | 0.052 | 0.054 |
| | 0.30 | 0.030 | 0.046 | 0.047 | 0.049 | 0.049 | 0.051 |
| | 0.40 | 0.030 | 0.045 | 0.046 | 0.047 | 0.048 | 0.049 |
| | 0.50 | 0.030 | 0.043 | 0.044 | 0.045 | 0.046 | 0.046 |
| Power | 0.10 | 0.990 | 0.953 | 0.904 | 0.854 | 0.796 | 0.736 |
| | 0.20 | 0.997 | 0.968 | 0.931 | 0.890 | 0.850 | 0.809 |
| | 0.30 | 0.999 | 0.975 | 0.944 | 0.912 | 0.882 | 0.854 |
| | 0.40 | 1.000 | 0.980 | 0.954 | 0.928 | 0.905 | 0.884 |
| | 0.50 | 1.000 | 0.983 | 0.961 | 0.940 | 0.922 | 0.906 |



[Figure 1: line plot of the true FDR (vertical axis, 0.05 to 0.20) against α (horizontal axis, 0.05 to 0.20). Legend: R1=0.2, R2=0.2; R1=0.2, R2=0.8; R1=0.4, R2=0.2; R1=0.4, R2=0.8.]

Figure 1: Simulation results of the FDR for the two-step multiple comparison procedure using α1 = (4/5)α and α2 = (1/5)α for different levels of α. In total there are m = 1000 genes, 3 treatment conditions, and four different combinations of R1 and R2: R1 = 0.2, R2 = 0.2 (short dashed line), R1 = 0.2, R2 = 0.8 (dotted line), R1 = 0.4, R2 = 0.2 (dotted-dashed line) and R1 = 0.4, R2 = 0.8 (long dashed line). The black straight line represents the pre-chosen FDR level. Here R1 is the proportion of genes having a treatment effect; R2 is the proportion of genes with a treatment effect having one treatment mean different but the other two the same.



5.3 Choosing the FDR significance levels

Here we propose an adaptive approach for choosing α1 and α2, and suggest some guidelines for selecting them. First, α1 should be bigger than α2. When a looser criterion is used in Step 1, more genes are available to enter the second step. Second, α1 and α2 should be chosen such that the overall FDR is close to but below the pre-specified significance level, so that the power for detecting a significant effect is maximized. Third, the choice of α1 and α2 should lead to the largest number of rejections occurring in Step 2.

With these guidelines in mind, we propose the following procedure for finding the significance levels α1 and α2. Let S be the set of values iα/n, where i = 1, ..., n − 1 and n is a positive integer. That is, S = {α/n, 2α/n, ..., (n − 1)α/n}. Let F̂DR(α1, α2) be the estimated overall FDR and R(α1, α2) the number of rejections (or statistically significant pairwise comparisons) in Step 2 when a two-step procedure with respective significance levels α1 and α2 in Steps 1 and 2 is applied. Then α1* and α2* are chosen such that

$$
(\alpha_1^*, \alpha_2^*) = \underset{\substack{\alpha_1, \alpha_2 \in S,\ \alpha_1 > \alpha_2, \\ \alpha_1 + \alpha_2 \le \alpha,\ \widehat{\mathrm{FDR}}(\alpha_1, \alpha_2) \le \alpha}}{\arg\max}\; R(\alpha_1, \alpha_2). \tag{15}
$$
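The search in equation (15) can be sketched in Python as below. This is an illustrative outline only: `estimate_fdr` and `count_rejections` are hypothetical caller-supplied functions standing for F̂DR(α1, α2) and R(α1, α2), and ties in the number of rejections are broken in favor of the first candidate encountered, which the text does not specify. Candidates are enumerated by their integer indices (α1 = iα/n, α2 = jα/n) so that the constraints α1 > α2 and α1 + α2 ≤ α are checked without floating-point rounding.

```python
def candidate_pairs(n):
    """Index pairs (i, j) with alpha1 = i*alpha/n and alpha2 = j*alpha/n
    satisfying alpha1 > alpha2 and alpha1 + alpha2 <= alpha."""
    return [(i, j) for i in range(1, n) for j in range(1, n)
            if i > j and i + j <= n]


def choose_levels(alpha, n, estimate_fdr, count_rejections):
    """Return (alpha1, alpha2) maximizing the number of Step 2 rejections
    among candidates whose estimated overall FDR does not exceed alpha,
    or None when no candidate qualifies."""
    best = None
    for i, j in candidate_pairs(n):
        a1, a2 = i * alpha / n, j * alpha / n
        if estimate_fdr(a1, a2) > alpha:
            continue  # estimated overall FDR too large
        r = count_rejections(a1, a2)
        if best is None or r > best[0]:
            best = (r, a1, a2)
    return None if best is None else (best[1], best[2])
```

With n = 5 the index pairs are (2, 1), (3, 1), (3, 2) and (4, 1), which for α = 0.05 correspond to (α1, α2) = (0.02, 0.01), (0.03, 0.01), (0.03, 0.02) and (0.04, 0.01).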

Using the same simulation as in Section 3.3, for each of the 1000 data sets, we apply our guidelines to find α1* and α2*. Suppose the overall FDR significance level is α = 0.05 and S = {α/5, 2α/5, 3α/5, 4α/5}; then α1* and α2* can be chosen from (α1, α2) = (0.02, 0.01), (0.03, 0.01), (0.03, 0.02), and (0.04, 0.01). Table 4 gives the frequency distribution of α1* and α2* based on these 1000 simulations. As can be seen, when R1 = 0.20, R2 = 0.60, the choice of (α1*, α2*) is (0.03, 0.01) for 12 simulated data sets, (0.03, 0.02) for 877 simulated data sets, and (0.04, 0.01) for 111 simulated data sets. The chosen significance levels in the two-step method are more diverse when R1 is small, and they converge to (α1*, α2*) = (0.03, 0.02) as R1 gets larger. Evidently, the case where R2 = 0.0 (genes which have a treatment effect where all means are different from each other) yields random results. This is most likely due to the fact that almost all pairwise comparisons in Step 2 are significant. Given the choices of α1* and α2* (Table 4), the average FDR is controlled below α = 0.05 (Table 5), and the two-step procedure has more power than the one-step procedure (Table 2). For these results, α1* and α2* take values from S = {α/5, 2α/5, 3α/5, 4α/5}. However, for more accurate results, we suggest S = {α/20, 2α/20, ..., 19α/20}.



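Table 4: Frequency distribution of α1* and α2* from 1000 simulations for pairwise comparisons for 3 treatment conditions and 1000 genes. Here α1* and α2* are determined using the stated guidelines, and by controlling the overall FDR for the two-step procedure below α = 0.05.

| R1 | R2 | α1* = 0.01, α2* = 0.01 | α1* = 0.03, α2* = 0.01 | α1* = 0.03, α2* = 0.02 | α1* = 0.04, α2* = 0.01 |
|---|---|---|---|---|---|
| 0.10 | 0.0 | 105 | 351 | 178 | 366 |
| | 0.20 | 53 | 345 | 152 | 450 |
| | 0.40 | 18 | 214 | 280 | 488 |
| | 0.60 | 17 | 185 | 224 | 574 |
| | 0.80 | 16 | 251 | 120 | 613 |
| | 1.0 | 125 | 458 | 10 | 407 |
| 0.20 | 0.0 | 37 | 221 | 256 | 486 |
| | 0.20 | 1 | 83 | 505 | 411 |
| | 0.40 | 0 | 23 | 792 | 185 |
| | 0.60 | 0 | 10 | 761 | 229 |
| | 0.80 | 0 | 5 | 527 | 468 |
| | 1.0 | 0 | 13 | 90 | 897 |
| 0.30 | 0.0 | 9 | 129 | 309 | 553 |
| | 0.20 | 0 | 20 | 843 | 137 |
| | 0.40 | 0 | 5 | 970 | 25 |
| | 0.60 | 0 | 4 | 957 | 39 |
| | 0.80 | 0 | 1 | 856 | 143 |
| | 1.0 | 0 | 0 | 503 | 497 |
| 0.40 | 0.0 | 3 | 76 | 299 | 622 |
| | 0.20 | 0 | 4 | 963 | 33 |
| | 0.40 | 0 | 1 | 995 | 4 |
| | 0.60 | 0 | 0 | 999 | 1 |
| | 0.80 | 0 | 0 | 986 | 14 |
| | 1.0 | 0 | 0 | 934 | 66 |
| 0.50 | 0.0 | 1 | 48 | 294 | 657 |
| | 0.20 | 0 | 2 | 995 | 3 |
| | 0.40 | 0 | 0 | 1000 | 0 |
| | 0.60 | 0 | 0 | 1000 | 0 |
| | 0.80 | 0 | 0 | 1000 | 0 |
| | 1.0 | 0 | 0 | 1000 | 0 |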



Table 5: Simulation results. Estimated FDR (F̂DR), true FDR, and power for pairwise comparisons for 3 treatment conditions and 1000 genes using the two-step procedure. The FDR for the entire procedure is controlled below 0.05 with significance levels α1* and α2* chosen automatically (results are listed in Table 4).

| Quantity | R1 | R2 = 0.0 | 0.20 | 0.40 | 0.60 | 0.80 | 1.0 |
|---|---|---|---|---|---|---|---|
| F̂DR | 0.10 | 0.045 | 0.050 | 0.050 | 0.050 | 0.050 | 0.051 |
| | 0.20 | 0.048 | 0.050 | 0.051 | 0.051 | 0.049 | 0.049 |
| | 0.30 | 0.048 | 0.050 | 0.050 | 0.050 | 0.049 | 0.048 |
| | 0.40 | 0.049 | 0.049 | 0.050 | 0.048 | 0.048 | 0.047 |
| | 0.50 | 0.049 | 0.049 | 0.047 | 0.047 | 0.047 | 0.046 |
| True FDR | 0.10 | 0.036 | 0.047 | 0.049 | 0.050 | 0.050 | 0.052 |
| | 0.20 | 0.036 | 0.045 | 0.048 | 0.050 | 0.051 | 0.050 |
| | 0.30 | 0.036 | 0.046 | 0.046 | 0.047 | 0.050 | 0.049 |
| | 0.40 | 0.037 | 0.045 | 0.046 | 0.046 | 0.048 | 0.048 |
| | 0.50 | 0.037 | 0.043 | 0.044 | 0.045 | 0.046 | 0.047 |
| Power | 0.10 | 0.992 | 0.950 | 0.903 | 0.851 | 0.799 | 0.755 |
| | 0.20 | 0.997 | 0.967 | 0.930 | 0.890 | 0.853 | 0.824 |
| | 0.30 | 0.999 | 0.975 | 0.944 | 0.912 | 0.882 | 0.861 |
| | 0.40 | 0.999 | 0.980 | 0.954 | 0.928 | 0.904 | 0.885 |
| | 0.50 | 0.999 | 0.983 | 0.961 | 0.939 | 0.921 | 0.908 |



6 A case study

We present an example where the experimental design is unbalanced (unequal sample sizes) and the normality assumption on the data is not valid. These data are employed to illustrate the estimation of the FDR via resampling (permutation) techniques. In Hedenfalk et al. (2001), cDNA spotted microarrays were used to assess 6512 complementary DNA (cDNA) clones corresponding to 5361 genes for the purpose of studying two hereditary types of breast cancer: those with the BRCA1 mutation and those with the BRCA2 mutation. There were n1 = 7 RNA tumor samples from seven patients with the BRCA1 mutation, n2 = 8 RNA samples from seven patients with the BRCA2 mutation, and n3 = 7 RNA samples from seven patients with sporadic tumors. After data filtering, from the remaining m = 3226 genes, 51 genes were identified by Hedenfalk et al. as best differentiating the three types of breast cancer, using F-tests from ANOVA and permutation techniques at significance level α = 0.001.

Both the proposed two-step multiple comparison procedure and the one-step procedure are applied to identify genes that are differentially expressed for all three pairwise comparisons for the three types of breast cancer. The same 3226 genes as in the original analysis (Hedenfalk et al., 2001) were analyzed using FDR significance level α = 0.05. For each gene, the null hypothesis of equality across the three mean expression levels is tested using a global F-test statistic, and the null hypothesis of equal expression between any two tumor types is tested using a two-sample t-test statistic. Let F_i, t_i1, t_i2, and t_i3 represent the test statistics for the global F-test and for the pairwise comparisons between BRCA1 mutation and BRCA2 mutation, BRCA1 mutation and sporadic cancer, and BRCA2 mutation and sporadic cancer, respectively, for gene i. Since the gene expression levels do not follow a normal distribution, permutation resampling is used to generate the null distribution for the F and t test statistics. For each permutation b = 1, ..., B = 1000, we randomly assign the 22 arrays to BRCA1 (7 arrays), BRCA2 (8 arrays) and sporadic tumor (7 arrays), and compute the test statistics F_i^b, t_i1^b, t_i2^b, and t_i3^b. Pooling the simulated null statistics across genes, the p-value for gene i from the global F-test is

$$
p_i = \frac{\sum_{b=1}^{B} \#\{l : F_l^b \ge F_i,\ l = 1, \cdots, m\}}{m \cdot B},
$$

and the p-value for the jth (j = 1, 2, 3) pairwise comparison of gene i is

$$
p_{ij} = \frac{\sum_{b=1}^{B} \#\{l : t_{lj}^b \ge t_{ij},\ l = 1, \cdots, m\}}{m \cdot B}.
$$
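The pooled p-value formulas above amount to counting, for each observed statistic, how many of the m·B permuted statistics are at least as large. A minimal Python sketch follows; the function name `pooled_permutation_pvalues` and the list-of-lists data layout are assumptions made for illustration. Sorting the pooled null statistics once and using binary search avoids an m·B scan per gene. For a two-sided t comparison one would presumably pass absolute values of the statistics; the formulas as written in the text compare the raw statistics.

```python
from bisect import bisect_left


def pooled_permutation_pvalues(observed, null_stats):
    """Pooled permutation p-values.

    observed: list of m observed statistics (one per gene).
    null_stats: list of B lists, each holding the m statistics computed
    from one permuted data set.
    Returns p[i] = #{(b, l) : null_stats[b][l] >= observed[i]} / (m * B).
    """
    pooled = sorted(s for perm in null_stats for s in perm)
    total = len(pooled)  # equals m * B
    # bisect_left gives the number of pooled values strictly below obs,
    # so total minus that index counts the values >= obs.
    return [(total - bisect_left(pooled, obs)) / total for obs in observed]
```

The same function can be applied separately to each pairwise comparison j by passing the observed t_ij and the permuted t_lj^b for that j.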



The optimal significance levels α1* and α2* in Step 1 and Step 2 for the proposed two-step procedure are found in the following way. For given significance levels α1 and α2,

(1) Apply the adaptive FDR controlling procedure to pi (i = 1, ..., m) at FDR significance level α1. Let A be the collection of genes for which the tests are statistically significant, and K be the number of genes in set A. Let d1 be the smallest p-value which is not statistically significant, and h1 be the corresponding largest F-test statistic, i.e.,

$$
h_1 = \max\{F_i,\ i \in A^c\},
$$

where A^c is the complement of A. The FDR in Step 1 can be estimated by

$$
\widehat{\mathrm{FDR}}_1 = \frac{d_1 \hat{\pi}_{01}}{K/m},
$$

where π̂01 is the estimated proportion of true null hypotheses in Step 1, obtained using Storey and Tibshirani's smoother method.

(2) For the K statistically significant genes, apply the FDR controlling procedure to the 3K p-values pij (i ∈ A and j = 1, 2, 3) at FDR significance level α2. Let R(α1, α2) denote the number of statistically significant pairwise comparisons (rejections) in this step. Let d2 be the smallest p-value which is not statistically significant. Note that the three pairwise comparisons might have different distributions due to different sample sizes; therefore they are treated separately. Let h2j be the largest non-statistically significant t statistic for the jth pairwise comparison, i.e.,

$$
h_{2j} = \max\{t_{ij} : i \in A,\ p_{ij} \text{ is not statistically significant}\}.
$$

Let π̂02 be the estimated proportion of true null hypotheses among the 3K pairwise comparisons using Storey and Tibshirani's smoother method. Compute

$$
\hat{P}(p_{ij} \le d_2 \mid p_i \le d_1) = \frac{R(\alpha_1, \alpha_2)}{3K}.
$$

(3) Count the number of null test statistics obtained via permutation that would be declared statistically significant if the same cut-off points were used as in steps (1) and (2). The quantity

$$
\sum_{b=1}^{B} \sum_{i=1}^{m} 1\{F_i^b > h_1\}
$$

is the number of false rejections of equality of the means across the three types of tumors, and

$$
\sum_{b=1}^{B} \sum_{i=1}^{m} 1\{t_{ij}^b > h_{2j}\} \cdot 1\{F_i^b > h_1\}
$$

is the number of false rejections for the jth pairwise comparison between two tumor types. Therefore, P(pij ≤ d2 | Di = 0, pi ≤ d1) can be estimated by

$$
\hat{P}(p_{ij} \le d_2 \mid D_i = 0,\ p_i \le d_1)
= \frac{\sum_{b=1}^{B} \sum_{i=1}^{m} \sum_{j=1}^{3} 1\{t_{ij}^b > h_{2j}\} \cdot 1\{F_i^b > h_1\}}{3 \sum_{b=1}^{B} \sum_{i=1}^{m} 1\{F_i^b > h_1\}}.
$$

(4) P(pij ≤ d2 | Dij = 0, Di = 1, pi ≤ d1) is estimated by

$$
d_2 + d_2 \cdot \frac{1 - \hat{P}(p_i \le d_1 \mid D_{ij} = 0,\ D_i = 1)}{\hat{P}(p_i \le d_1 \mid D_{ij} = 0,\ D_i = 1)},
$$

where P̂(pi ≤ d1 | Dij = 0, Di = 1) is computed using a permutation method, as the data set may not be normally distributed. Let E be the set of genes which enter the second step of the two-step procedure but do not have all pairwise comparisons statistically significant, i.e.,

$$
E = \{\text{gene } g : p_g \le d_1,\ \exists \text{ at least one } j \text{ such that } p_{gj} > d_2\}.
$$

For gene g ∈ E, let x̄_gj denote the sample mean for gene g under treatment condition j, and let [j] denote the treatment whose sample mean has the jth smallest magnitude (absolute value), so that treatment [3] has the largest. For gene g ∈ E, compute the mean-centered residuals, r_gjk = x_gjk − x̄_gj, where k = 1, ..., n_j. For each permutation b = 1, 2, ..., B = 1000, generate a permuted data set r_gjk^b using the residuals r_gjk as the source data and by randomly assigning the 22 arrays to the three types of breast cancer. Then update r_gjk^b for treatment [3], which has the largest magnitude of average gene expression, by adding µ̂_g, i.e.,

$$
r_{g[3]k}^b \leftarrow r_{g[3]k}^b + \hat{\mu}_g, \qquad \hat{\mu}_g = \max\{|\bar{x}_{g1} - \bar{x}_{g2}|,\ |\bar{x}_{g1} - \bar{x}_{g3}|,\ |\bar{x}_{g2} - \bar{x}_{g3}|\}.
$$

Let F_g^{r,b} denote the F-test statistic for testing the null hypothesis of equality across the three treatment conditions using the data set r_gjk^b. Then P(pg ≤ d1 | Dgj = 0, Dg = 1) is estimated by the proportion of F-test statistics F_g^{r,b} at least as large as h1,

$$
\hat{P}(p_g \le d_1 \mid D_{gj} = 0,\ D_g = 1) = \frac{\sum_{b=1}^{B} \#\{g : F_g^{r,b} \ge h_1\}}{\#\{g : g \in E\} \cdot B}.
$$



Table 6: Number of significant comparisons for the Hedenfalk et al. (2001) breast cancer data using the proposed two-step procedure at FDR significance levels α1 = 0.0425 and α2 = 0.005, and the one-step adaptive FDR controlling procedure incorporating the proportion of true null hypotheses estimated by the smoother method (Storey and Tibshirani, 2003) at FDR significance level 0.05.

| Comparison | Two-step | One-step | Intersection |
|---|---|---|---|
| BRCA1 vs BRCA2 | 61 | 45 | 43 |
| BRCA1 vs Sporadic | 34 | 9 | 9 |
| BRCA2 vs Sporadic | 43 | 16 | 16 |

(5) Estimate the overall FDR(α1, α2) (equation 12) using the estimates from steps (1) to (4).
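The arithmetic in steps (1), (2), (3) and (5) can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' code: all function names are hypothetical, the permuted statistics are assumed to be stored as nested lists, and the quantities d1, h1, h2j, π̂01, π̂02 and the step (4) estimate are taken as given inputs.

```python
def fdr_step1(d1, pi01, K, m):
    """Step (1): estimated Step 1 FDR, d1 * pi01 / (K / m)."""
    return d1 * pi01 / (K / m)


def prob_reject_step2(R, K):
    """Step (2): estimate of P(p_ij <= d2 | p_i <= d1) as R / (3K)."""
    return R / (3 * K)


def prob_null_reject(F_null, t_null, h1, h2):
    """Step (3): estimate of P(p_ij <= d2 | D_i = 0, p_i <= d1).

    F_null[b][i] is the permuted F statistic of gene i in permutation b;
    t_null[b][i] lists its three permuted t statistics; h2 lists the
    three cut-offs h_21, h_22, h_23.
    """
    numerator = 0
    denominator = 0
    for F_b, t_b in zip(F_null, t_null):
        for F_bi, t_bi in zip(F_b, t_b):
            if F_bi > h1:  # null gene falsely passes Step 1
                denominator += 1
                numerator += sum(t > h for t, h in zip(t_bi, h2))
    return numerator / (3 * denominator) if denominator else 0.0


def overall_fdr(p_null, p_alt, p_all, fdr1, pi02):
    """Step (5): equation (12).

    p_null: estimate from step (3); p_alt: estimate from step (4);
    p_all: estimate from step (2).
    """
    return (p_null / p_all) * fdr1 + p_alt * pi02 / p_all
```

Returning 0 from `prob_null_reject` when no permuted F statistic exceeds h1 is a convention chosen here; the text does not address that case.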

For each combination of α1 and α2 where α1 ∈ S, α2 ∈ S, α1 > α2, S is the set of values {(1/20)α, (2/20)α, ..., (19/20)α} and α is the pre-specified FDR significance level, perform steps (1) through (5) and record the number of rejections R(α1, α2) and the estimated FDR(α1, α2). The optimal choice of (α1*, α2*) has the largest number of rejections R(α1, α2) among the combinations whose estimated FDR(α1, α2) is less than α. For these data, when α = 0.05, the best choice of significance levels in Step 1 and Step 2 for the proposed two-step procedure is α1* = 0.0425 and α2* = 0.005, and there are 138 statistically significant comparisons (61 between BRCA1 and BRCA2, 34 between BRCA1 and sporadic, and 43 between BRCA2 and sporadic) out of 3226 × 3 = 9678 comparisons. The estimated overall FDR is 0.0486. The one-step procedure identifies 70 significant comparisons (45 between BRCA1 and BRCA2, 9 between BRCA1 and sporadic, and 16 between BRCA2 and sporadic). As expected, nearly all of these (68 of 70; Table 6) are also identified using the two-step procedure. When compared to the one-step multiple comparison procedure, the two-step multiple comparison procedure identifies a greater number of significant pairwise comparisons (138 versus 70). These results reflect the increased power of the two-step procedure over the one-step procedure when detecting differentially expressed genes.

7 Simulation studies

So far we have studied the two-step procedure for three treatment conditions in examples with independent genes. Here we present numerical results for dependency among genes and for four treatment conditions.



7.1 Dependency among genes example

Many assumptions about the dependency among the genes have been made in microarray gene expression studies. However, few if any of these assumptions have been verified. Storey (2003a) hypothesized the form of dependency to be "clumpy dependence". That is, genes are more likely to be dependent in small groups (such as pathways), and each group is independent of the others. Here, we generate 10 groups of genes with 100 genes in each group. Genes from different groups are independent of each other, and within each group the gene expression is generated from a multivariate normal distribution with mean vector 0, standard deviation 1, correlation coefficient 0.5, and sample size 6 for each of the three treatment conditions. We randomly select 10% of the genes as having a treatment effect. Of this 10% of genes, a proportion R2/2 have mean (0, 0, 2), R2/2 have mean (0, 0, 4), and (1 − R2) have mean (0, 2, 4). For this simulation study, R2 is chosen to be 0.2 or 0.8. Permutation resampling (permutation size 1000) is used to compute the p-values for the global test of equality among the three treatment means, and the p-values for the pairwise comparisons. Both the one-step procedure at an FDR significance level α = 0.05 and the two-step procedure with respective FDR significance levels α1 = 0.04 and α2 = 0.01 are applied to the 100 simulated data sets, and the estimated FDR, the true FDR, and the average power over the 100 simulations are listed in Table 7. For completeness and comparison, we also compute the p-values from the global F-tests and t-tests, and estimate the FDR based on resampling from independent standard normal distributions as described in Section 3.2. Overall, the two-step procedure is more powerful than the one-step procedure. For the two-step procedure, the estimated and realized FDR are far below 0.05 when the permutation method is used, and only a bit bigger than 0.05 when resampling from the standard normal distribution is employed. Interestingly, for the one-step procedure the opposite conclusions are drawn, and it seems that permutation methods should be used in the two-step procedure when there is dependence among the genes.
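One simple way to generate data with this kind of clumpy dependence is a shared-factor construction, sketched below in Python. It assumes that the same correlation ρ applies to every pair of genes within a group (equicorrelation), which the text suggests but does not state explicitly; the function name `clumpy_normal` and its arguments are hypothetical. Each gene's value on an array is its mean plus sqrt(ρ) times a draw shared by the group plus sqrt(1 − ρ) times its own draw, which gives unit variance, within-group correlation ρ, and independence between groups.

```python
import math
import random


def clumpy_normal(n_groups, group_size, n_arrays, rho, means, rng):
    """Simulate expression for one treatment condition.

    Returns data[g][a] for gene g and array a. Genes g in the same block of
    `group_size` consecutive indices share one latent N(0, 1) draw per array.
    means[g] is the treatment mean of gene g; rng is a random.Random.
    """
    n_genes = n_groups * group_size
    data = [[0.0] * n_arrays for _ in range(n_genes)]
    for a in range(n_arrays):
        for grp in range(n_groups):
            shared = rng.gauss(0.0, 1.0)  # common to the whole group
            for k in range(group_size):
                g = grp * group_size + k
                own = rng.gauss(0.0, 1.0)  # gene-specific noise
                data[g][a] = (means[g]
                              + math.sqrt(rho) * shared
                              + math.sqrt(1.0 - rho) * own)
    return data
```

For the setting in the text one would call it once per treatment condition, for example `clumpy_normal(10, 100, 6, 0.5, means, random.Random(0))`, with `means` holding that condition's entry of each gene's mean vector.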

Table 7: Simulation results. Estimated FDR (F̂DR), true FDR, and power for pairwise comparisons for the dependent example of 3 treatment conditions and 1000 genes using both the two-step and one-step procedure, respectively. For the two-step procedure, the FDR significance levels α1 = 0.04 and α2 = 0.01 are used in Step 1 and Step 2, respectively. Two-StepP represents the two-step procedure where permutation is used to compute the p-values and estimate the FDR. Two-StepN represents the two-step procedure where the standard F-test and t-test are used to compute the p-values and sampling from independent standard normal distributions is used to estimate the FDR. One-StepP represents the one-step procedure where permutation is used to compute the p-values. One-StepN represents the one-step procedure where the standard F-test and t-test are used to compute the p-values.

| Procedure | R1 = 0.1, R2 = 0.2: F̂DR | True FDR | Power | R1 = 0.1, R2 = 0.8: F̂DR | True FDR | Power |
|---|---|---|---|---|---|---|
| Two-StepP | 0.033 | 0.033 | 0.946 | 0.034 | 0.039 | 0.803 |
| Two-StepN | 0.053 | 0.048 | 0.949 | 0.052 | 0.059 | 0.810 |
| One-StepP | 0.050 | 0.066 | 0.787 | 0.050 | 0.063 | 0.792 |
| One-StepN | 0.050 | 0.055 | 0.697 | 0.050 | 0.050 | 0.731 |

7.2 Four treatment conditions example

The two-step multiple comparison procedure has been fully investigated for 3 treatment conditions. While extensions to more treatments are conceptually straightforward, when more than three treatments are of interest, one has to be very cautious when considering the configuration of the means, as they become complex. For example, if the means are not all equal for three treatments, there are two cases: either all three means are different, or two of them are the same and the third is different. With four treatments, if the means are not all equal, there are four possible configurations. If we let µ1, µ2, µ3, µ4 represent the treatment means, then all four means may be different from each other, or two means are the same and the other two are different (e.g., µ1 = µ2 ≠ µ3 ≠ µ4), or exactly two pairs of means are the same (e.g., µ1 = µ2 ≠ µ3 = µ4), or three means are the same and the fourth one is different (e.g., µ1 = µ2 = µ3 ≠ µ4).

Simulation studies are used to compare the two-step procedure and the one-step procedure when there are four treatment conditions, 1000 genes, and a sample size of 6. For all cases, we assume 90% of the genes do not have an effect (i.e., µ1 = µ2 = µ3 = µ4). The relationships among the means for the remaining genes are as follows. Case 1: 5% of the 1000 genes have a treatment mean vector (0, 0, 0, 2), and another 5% of the 1000 genes have a treatment mean vector (0, 0, 0, 4); Case 2: 5% of the 1000 genes have a treatment mean vector (0, 0, 2, 2), and another 5% of the 1000 genes have a treatment mean vector (0, 0, 4, 4); Case 3: 10% of the 1000 genes have a treatment mean vector (0, 0, 2, 4); and Case 4: 10% of the 1000 genes have a treatment mean vector (0, 2, 4, 6). The data are generated from independent normal distributions with standard deviation 1. Table 8 lists the results.

Table 8: Simulation results. Estimated FDR (F̂DR), true FDR, and average power for pairwise comparisons for 4 treatment conditions and 1000 genes using both the two-step and one-step procedures, respectively. For the two-step procedure, the FDR significance levels α1 = 0.04 and α2 = 0.01 are used in Step 1 and Step 2, respectively. For the one-step procedure, the FDR significance level is 0.05.

| Procedure | Quantity | Case 1 | Case 2 | Case 3 | Case 4 |
|---|---|---|---|---|---|
| Two-Step | F̂DR | 0.048 | 0.042 | 0.044 | 0.054 |
| | True FDR | 0.044 | 0.040 | 0.037 | 0.037 |
| | Power | 0.768 | 0.851 | 0.939 | 0.999 |
| One-Step | True FDR | 0.050 | 0.050 | 0.050 | 0.049 |
| | Power | 0.744 | 0.771 | 0.742 | 0.808 |

The two-step procedure has the lowest power for Case 1, where all the genes with an effect have three out of four treatment means the same. One possible reason is that the F-test in Step 1 (for the work presented here, the F-test for the ANOVA model is employed in Step 1) for testing H0 : µ1 = µ2 = µ3 = µ4 may not be powerful. As a consequence, the number of genes entering Step 2 will be smaller. The two-step procedure has the largest power for Case 4, where all the genes with an effect have all four treatment means different.
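The counts of mean configurations quoted above (two for three treatments, four for four treatments) can be checked by enumerating how k means can be grouped into blocks of equal values, ignoring which treatment falls in which block. The Python sketch below does this by listing the integer partitions of k and dropping the all-equal case; the function name `equality_patterns` is hypothetical, and treating configurations as block-size patterns is an interpretation of the text. The number of pairwise comparisons per gene is the binomial coefficient C(k, 2).

```python
from math import comb


def equality_patterns(k):
    """Block-size patterns of k treatment means, excluding 'all equal'.

    Each pattern is a tuple of block sizes in non-increasing order;
    for example (2, 1, 1) means two means are equal and the other two
    differ from them and from each other.
    """
    def partitions(n, largest):
        # Integer partitions of n with parts no larger than `largest`.
        if n == 0:
            yield ()
            return
        for part in range(min(n, largest), 0, -1):
            for rest in partitions(n - part, part):
                yield (part,) + rest

    return [p for p in partitions(k, k) if p != (k,)]


patterns3 = equality_patterns(3)   # [(2, 1), (1, 1, 1)]
patterns4 = equality_patterns(4)   # [(3, 1), (2, 2), (2, 1, 1), (1, 1, 1, 1)]
pairwise4 = comb(4, 2)             # 6 pairwise comparisons per gene
```

In this notation, Cases 1 to 4 of the simulation correspond to the patterns (3, 1), (2, 2), (2, 1, 1) and (1, 1, 1, 1), respectively.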

8 Conclusion and discussion

A general two-step multiple comparison procedure for identifying statistical significance for repetitive testing with multiple treatment conditions has been proposed in the context of testing for differential expression of genes. This procedure is not limited to microarray applications, as it can be applied to any large data set with multiple groups of treatment conditions (greater than two) and a large number of objects, where there is interest in identifying objects that behave differently among the pairwise comparisons of treatments. This two-step procedure has more power than the one-step procedure in terms of detecting significant effects because the number of pairwise comparisons is greatly reduced in the second step.

The flexibility of the two-step procedure can be seen in three ways. (1) One can fix the rejection regions in Step 1 and Step 2. For example, in the context of microarrays, if the p-value for testing the equality among the treatment means is smaller than a fixed number c1, the corresponding gene will enter the second step. The pairwise comparisons with a p-value less than a fixed number c2 will then be considered statistically significant. (2) One may apply an FDR controlling procedure with the respective FDR significance levels α1 and α2 (both fixed and known) in each of the two steps. (3) One can select the FDR significance levels α1* and α2* in the two steps such that the overall FDR is controlled at a pre-chosen significance level α. For situations (1) and (2), we have proposed estimates of the overall FDR for the two-step procedure. Our simulation studies demonstrate that the estimate of the FDR is very accurate. When the normality assumption is valid or the sample size is large, resampling from the standard normal distribution or multivariate normal distribution is used to estimate one component of the FDR (Section 3.2). In Section 6, we presented a permutation method to estimate the FDR when the normality assumption may not be valid. We also addressed the situation where the experimental design is not balanced and, therefore, the two-sample t-test statistics may have different distributions depending on the sample sizes. For situation (3), both criteria and a search procedure were proposed for α1* and α2* such that the realized FDR can be controlled below a pre-specified FDR significance level α and the number of rejections is maximized. From the simulation, the best choice of α1* and α2* depends only on the proportion of differentially expressed genes (R1 and R2 in the case of three treatment conditions).

It should be noted that the particular FDR controlling procedure applied in the two steps is very important. In this work, simulations for the two-step procedure were based on an adaptive FDR controlling procedure (Benjamini and Hochberg’s FDR controlling procedure with incorporation of the estimate of the proportion of true null hypotheses, Storey and Tibshirani (2003)) in both steps, because it controls the FDR at the pre-chosen significance level for independent test statistics. Other FDR controlling procedures can be used as long as the realized FDR is below, but very close to, the pre-chosen FDR level for independent cases. Otherwise, the overall FDR will be far below the pre-chosen FDR level, and hence the power may actually decrease.
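An adaptive procedure of this kind can be sketched as follows. This is a minimal illustration, not the implementation used in the simulations: the proportion of true null hypotheses is estimated here with a single tuning value λ = 0.5 (an assumption made for brevity; Storey and Tibshirani (2003) smooth the estimate over a range of λ), and the function names and toy p-values are hypothetical.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true nulls as #{p > lam} / (m * (1 - lam)).

    The single fixed lam is a simplifying assumption for this sketch.
    """
    m = len(pvalues)
    pi0 = sum(p > lam for p in pvalues) / (m * (1.0 - lam))
    return min(1.0, pi0)


def adaptive_bh(pvalues, alpha, lam=0.5):
    """Benjamini-Hochberg step-up procedure with the pi0 estimate plugged in.

    Rejects the hypotheses with the k smallest p-values, where k is the
    largest rank such that p_(k) * m * pi0 <= k * alpha.
    Returns (sorted indices of rejected hypotheses, pi0 estimate).
    """
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Written without division so that pi0 == 0 needs no special case.
        if pvalues[idx] * m * pi0 <= rank * alpha:
            k_max = rank
    return sorted(order[:k_max]), pi0


# Toy example with hypothetical p-values.
pvals = [0.001, 0.008, 0.012, 0.02, 0.03, 0.2, 0.45, 0.6, 0.74, 0.95]
rejected, pi0_hat = adaptive_bh(pvals, alpha=0.05)
print(rejected, pi0_hat)
```

Because the estimate of the proportion of true nulls is at most 1, the adaptive thresholds are never more stringent than those of the unadjusted Benjamini and Hochberg procedure, which is why the realized FDR stays closer to the pre-chosen level.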

While extensions of the two-step multiple comparison procedure to more treatments are conceptually straightforward, when more than three treatments are of interest, one has to be very cautious when thinking about the configuration and complexity of the means. For situations where the genes with an effect have one mean different and all the other means the same, the global F-test in Step 1 for testing equality of all means may not be powerful. As a consequence, the number of genes entering Step 2 will decrease, and therefore the two-step procedure may fail to be more powerful than the one-step procedure. As a remedy for this situation, we are currently investigating the use of a closed testing procedure to increase power.



References

Baldi, P. and Long, A. D. (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17(6), 509–519.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.

Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25(1), 60–83.

Black, M. A. (2004). A note on the adaptive control of false discovery rates. Journal of the Royal Statistical Society, Series B, 66(2), 297–304.

Efron, B. (2003). Robbins, empirical Bayes and microarrays. The Annals of Statistics, 31(2), 366–378.

Gottardo, R., Pannucci, J. A., Kuske, C. R., and Brettin, T. (2003). Statistical analysis of microarray data: a Bayesian approach. Biostatistics, 4(4), 597–620.

Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Raffeld, M., Yakhini, Z., Ben-Dor, A., Dougherty, E., Kononen, J., Bubendorf, L., Fehrle, W., Pittaluga, S., Gruvberger, S., Loman, N., Johannsson, O., Olsson, H., Wilfond, B., Sauter, G., Kallioniemi, O. P., Borg, A., and Trent, J. (2001). Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine, 344(8), 539–548.

Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures. John Wiley & Sons.

Hsu, J. C. (1996). Multiple Comparisons: Theory and methods. Chapman & Hall.

Kerr, M. K., Martin, M., and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7, 819–837.



Langaas, M., Lindqvist, B., and Ferkingstad, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B, 67, 555–572.

Lu, Y., Zhu, J., and Liu, P. (2005). A two-step strategy for detecting differential gene expression in cDNA microarray data. Current Genetics, 47(2), 121–131.

Newton, M. A., Kendziorski, C. M., Richmond, C. S., and Tsui, F. B. K. W. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8(1), 37–52.

Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P. O., and Davis, R. W. (1996). Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proceedings of the National Academy of Sciences, USA, 93, 10614–10619.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64, 479–498.

Storey, J. D. (2003a). Comment on resampling-based multiple testing for DNA microarray data analysis by Ge, Dudoit, and Speed. Test, 12(1), 52–60.

Storey, J. D. (2003b). The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31(6), 2013–2035.

Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, USA, 100(16), 9440–9445.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B, 66, 187–205.

Tusher, V. S., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, USA, 98(9), 5116–5121.

Wolfinger, R. D., Gibson, G., Wolfinger, E. D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., and Paules, R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology, 8(6), 625–637.


