www.proteomics-journal.com Page 1 Proteomics
Received: 28/01/2016; Revised: 12/07/2016; Accepted: 21/07/2016
This article has been accepted for publication and undergone full peer review but has not been
through the copyediting, typesetting, pagination and proofreading process, which may lead to
differences between this version and the Version of Record. Please cite this article as doi:
10.1002/pmic.201600044.
This article is protected by copyright. All rights reserved.
Viewpoint
Multiple testing corrections in quantitative proteomics: a useful but blunt tool
Dana Pascovici1, David Handler2, Jemma X. Wu1, Paul A. Haynes2,*
1Australian Proteome Analysis Facility, Macquarie University, Sydney, Australia
2Department of Chemistry and Biomolecular Sciences, Macquarie University, NSW
2109, Australia
*To whom correspondence should be addressed
Professor Paul A. Haynes
Department of Chemistry and Biomolecular Sciences
Macquarie University
North Ryde, NSW 2109, Australia
Email: [email protected]
Phone: 61-2-9850 6258
Fax: 61-2-9850 6200
Keywords: false discovery rate, shotgun proteomics, FDR, multiple testing corrections
Abbreviations
BH – Benjamini and Hochberg, FDR – false discovery rate, iTRAQ – isobaric tags for
relative and absolute quantitation, NSAF – normalized spectral abundance factor.
Abstract
Multiple testing corrections are a useful tool for restricting the false discovery rate,
but can be blunt in the context of low power, as we demonstrate by a series of
simple simulations. Unfortunately, in proteomics experiments low power can be
common, driven by proteomics-specific issues such as small effects due to ratio
compression, and few replicates due to high reagent cost, limited instrument time
availability, and other constraints; in such situations, most multiple testing correction methods, if
used with conventional thresholds, will fail to detect any true positives even when
many exist. In this low power, medium scale situation, other methods such as effect
size considerations or peptide-level calculations may be a more effective option,
even if they do not offer the same theoretical guarantee of a low false discovery rate.
Thus, we aim to highlight in this article that proteomics presents some specific
challenges to the standard multiple testing corrections methods, which should be
employed as a useful tool but not be regarded as a required rubber stamp.
Viewpoint
Multiple testing corrections come in many “flavours”. They are employed to limit the
number of false positives occurring by chance when an analysis is repeated many
times and thus to reduce the false discovery rate (FDR) at the analysis level. In
proteomics they were initially borrowed from microarray research and other high
throughput areas, where they quickly became the norm. The informative review by
Diz [1] shows that multiple testing corrections were seldom used in quantitative
proteomics until relatively recently, and recommends a sensible multi-method
approach. Here we suggest that one reason for the slower uptake of such methods
in proteomics experiments, despite their ease of use and theoretical appeal, is that
they remain a useful but blunt tool that is less effective in discovery proteomics than
in, for example, the microarray environment.
In this paper we describe five key factors that combine to make multiple testing
corrections less effective in proteomics: medium problem scale; lower effect size due
to possible compression; lower analysis power due to high cost; percentage of data
showing an effect; and data distribution quirks. We then discuss some simple
alternatives that can help reduce, or at least understand, the FDR in this medium
scale, low power situation.
(1) The typical microarray experiment often contains around 50,000 probes [2];
assuming 10% of the data shows a pattern of differential expression, this could yield
over 5000 probes to be considered for further statistical analysis. In contrast, a large
scale proteomics experiment might identify and quantify 1000-2000 proteins at the
start using current methods, of which perhaps 100-200 might be found as
differentially expressed; if strict multiple testing correction approaches are
undertaken (such as Bonferroni or Holm [3]), the number may drop to only tens.
Restricting the analysis to such low numbers precludes, from the start, further
analyses by pathway or biological process enrichment; that may be acceptable if the
goal is to find one or two potential biomarkers, but not if the goal is to canvass the
differentially expressed proteins and sort them into categories meaningful for the
project at hand.
(2) The p-values generated by hypothesis testing from a particular experiment
depend crucially on the effect size (i.e. the difference between groups compared);
larger differences yield lower p-values. However, proteomics experiments that rely
on quantitation at the MS/MS level such as iTRAQ and TMT are known to undergo
ratio compression [4-6], and hence will report apparently smaller effects.
Necessarily, the resulting p-values will be higher, and the corrected p-values higher
still, meaning that few (if any) proteins pass the corrected thresholds.
To demonstrate the contribution of effect size, we analyse a simple simulated
dataset of 1000 entities (proteins), drawn randomly in triplicate from two normally
distributed samples (groups), which are then compared by a two-sample t-test. We
evaluate first the uncorrected t-test p-values and the commonly used Benjamini and
Hochberg [7] FDR-adjusted p-values. In Figure 1A, the two groups are sampled
from the same distribution (mean=0, SD=1), thus any differentially expressed
proteins should be regarded as false positives. The t-test p-value histogram is
shown, and the number of differentially expressed proteins with or without multiple
testing corrections is listed. The corrections work well, reducing the false positives
from about 5% to none. The middle panels show results obtained when sampling
from different distributions (mean difference 2); no proteins are found differentially
expressed when using the BH-FDR adjustments and the conventional 0.05
threshold. In this situation, the BH corrections are a blunt tool, unable to detect any
differentially expressed proteins despite the clear structure in the p-value histogram.
In the right panels, the effect size is larger again (mean difference 4), and many
proteins are found to be differentially expressed with or without corrections. Thus, in
the more difficult situation where effect sizes are smaller, corrections are too strict a
tool.
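The paper's simulations are provided as a Supplemental R script, which is not reproduced here; as a language-agnostic illustration, the BH step-up adjustment used throughout (equivalent to R's p.adjust(p, method = "BH")) can be sketched in Python:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure).

    Sort the p-values, scale each by m/rank, then enforce monotonicity by
    taking the running minimum from the largest p-value downwards.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Five individually suggestive p-values all collapse onto the
# conventional threshold after adjustment (each becomes 0.05):
print(bh_adjust([0.01, 0.02, 0.03, 0.04, 0.05]))
```

This makes the bluntness concrete: a cluster of modest p-values, each below 0.05 on its own, can sit exactly on the threshold once adjusted, so none survive a strict "adjusted p < 0.05" rule.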
(3) Proteomics remains a high cost undertaking; hence, experiments with a low
number of replicates (2-5) are often performed, particularly in situations where the
biological variability is expected to be low (cell lines, pools of plants, etc). In such low
power experiments, p-value corrections are bound to be useful but blunt. Figure 1B
revisits the simulation described above, but now with 4 replicates rather than
triplicates; note that the simulations reported in [1] used 5 replicates, and hence had
higher power still. With more replicates, p-value corrections now yield several
hundred differentially expressed entities (BH-adjusted p-values < 0.05), with either
small or large effects. The boxed area highlighted in Figure 1, with small effects and
few replicates, is particularly difficult, and corrections are quite blunt in this scenario.
The issues described here are apparent in the spike-in SWATH experiment of Wu et
al [8], containing human and yeast cell lysates mixed so that the yeast proteins are
differentially expressed at known fold changes. At low fold changes (small effects)
and with a small number of replicates (n=3), multiple testing corrections correctly
enforce low FDR, but at the expense of having no true positives even when many
are present [8].
(4) Perhaps the key factor determining the usefulness of multiple testing corrections
and the final FDR is the percentage of proteins showing a genuine effect. This
percentage plays a similar role to the prevalence of a disease for a diagnostic test.
With no true effects, any multiple testing correction method, however strict, will work
well as long as it minimizes the number of false positives. At the other (unlikely)
extreme, when all entities have an underlying true effect, multiple testing corrections
are detrimental as they reduce the true positives without lowering the FDR. Multiple
testing corrections become vital as we move towards a situation with fewer true
effects. We now expand our previous simulation to include a varying prevalence of
true effect, working in the difficult situation with few replicates (n=3) and a small
effect size (Effect=2). We again randomly draw 3 replicates from a normal
distribution (mean =0, SD=1) for one group, while for the other group we consider a
mixture. For a certain proportion of the tests, which we vary in steps from 0 to 1, we
sample from a normal distribution with a small effect (mean=2, SD=1); these are
the entities showing a true effect, as there is a shift in the underlying distributions
they are sampled from. For the remaining tests we sample from the same
distribution (mean=0, SD=1), so the null hypothesis holds. We run the
simulation 1000 times (Figure 2, Panel A), for all proportions, thus considering
varying prevalence scenarios, and summarize the true and false positives in each
case. We consider both BH-adjusted p-values, and Storey q-values [9], the two
methods most commonly used for FDR correction [1], with standard cut-offs of less
than 0.05. Both corrections restrict the false positives to near zero, but with both
methods no true positives are identified for effect prevalence less than 80%. In
this simulation the Storey corrections yield some true positives once the prevalence
reaches 80%; the BH corrections never do. The FDR stays low, but the false
negatives increase as the prevalence increases. In microarray analysis, a frequent
assumption is that changes occur for a small percentage of the data – and with
50,000 entities this may not be too limiting an assumption anyway; in proteomics the
prevalence may be higher. In the presence of small effects and few replicates, the
bluntness of the method will increase with the prevalence of true effects.
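The Storey q-value calculation referenced above can also be sketched in a few lines; this is a simplified single-lambda version (the qvalue R package smooths the pi0 estimate over many lambda values, so treat this only as an illustration of the idea):

```python
def storey_qvalues(pvals, lam=0.5):
    """Simplified Storey q-values with a single-lambda pi0 estimate.

    pi0 (the estimated proportion of true nulls) rescales the BH-style
    adjusted p-values; whenever pi0 < 1 the q-values are correspondingly
    less conservative than plain BH-adjusted p-values.
    """
    m = len(pvals)
    # p-values above lam should be (almost) all nulls; nulls are uniform,
    # so scaling that tail count up estimates the true null proportion
    pi0 = min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * m))
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals
```

The pi0 rescaling is why, in the simulations above, the Storey corrections begin to recover true positives at high prevalence while the BH corrections never do.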
(5) In the tests above, the simulated data satisfies the requirements of normality and
homoscedasticity imposed by the statistical methods applied. However, in many
cases proteomics data does not. For example, iTRAQ log ratio data is better
modelled by a Cauchy distribution [10], and the departure from normality may impact
the resulting p-value distributions. Normalized spectral abundance factor (NSAF)
data [11] has an underlying integer-based count structure which may create other
problems, such as unnaturally low spurious variances (e.g. resulting from underlying
triplicate counts such as (1 1 1) or (2 3 2)); such proteins would have low p-values
even when FDR-corrected. These issues are outside the scope of this note and its
simple simulations, and separate methods or tools have been devised to
account for them: more sophisticated statistical models for iTRAQ analysis [12-14],
and different normalization and analysis approaches for NSAF data [15, 16]. If
unaccounted for, they can make the ordering of the list of p-values, which is always
preserved by multiple testing corrections, less reliable.
The simulations undertaken show that, even when all statistical assumptions are
met, multiple testing corrections can be a blunt tool that fails to detect true changes,
especially in a low power situation. Unfortunately, this can be precisely the situation
encountered in a labelled experiment (iTRAQ or TMT), where ratios are compressed
and reagent costs are high. Nonetheless, none of the key reasons above is an
excuse for generating datasets with high numbers of false positives,
which would only further waste time and money. Hence, if multiple testing
corrections are not applied, then other means for reducing, or at least understanding,
the FDR should be employed. Below we list a few alternatives; none of them are
general, or able to guarantee a particular FDR, so they perhaps lack the theoretical
appeal of multiple testing corrections. However, in practical terms, they are simple
and potentially more effective, particularly in the low power situation driven by effect
compression and high cost.
Imposing an effect size cut-off, such as a fold change cut-off, is commonly employed
to help reduce the FDR, but is often criticised for not offering a theoretical guarantee
of keeping the FDR below a prescribed level. Yet over-reliance on null
hypothesis statistical testing (hence on p-values alone, corrected or otherwise,
without regard for effect sizes) is the first of the statistical deadly sins [17]; failure to
account for multiple testing comes much later, at number five! In order to examine the
effect of an imposed cut-off, we revisit the varying prevalence simulation above, with
a small effect (means difference = 2), and only three replicates (Figure 2). With such
low statistical power, both BH and Storey multiple testing corrections failed to detect
differentially expressed entities at prevalence less than 80%; we now look at the
impact of imposing an effect size cut-off (difference > 2). While panel A runs the
simulation in a small experiment (N=1000), panel B simulates a large experiment
(N=50,000).
The simulation confirms that in the absence of any multiple testing corrections, using
only a p-value cut-off of 0.05, the FDR is considerable, and worse at low effect
prevalence. For this low power simulation, the FDR can be as high as ~40% when
the prevalence of true effect is 10%, and of course 100% when there is no true effect
at all. Applying a fold change cut-off reduces the FDR%, without eliminating the true
positive identifications. This approach works better when the scale of the problem is
medium: with 50,000 entities, the same simulation would still yield ~300 false
positives, whilst at the lower range of the scale (N=1000) the number of false
positives remains very small, typically in the single digits. Although the percentage
and thus FDR remains the same in the small or large scale scenario, validation (for
example by Western blots) could be more readily undertaken for a smaller number of
proteins.
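The FDR figures quoted above follow directly from the prevalence, the p-value threshold, and the power of the test. A back-of-envelope calculation makes this explicit (the power value of 0.675 is an illustrative assumption chosen to reproduce the ~40% figure, not a number taken from the paper's simulation):

```python
def expected_fdr(n_tests, prevalence, alpha, power):
    """Expected FDR for a plain p < alpha rule, ignoring sampling noise.

    False positives come from the (1 - prevalence) fraction of true nulls,
    each rejected with probability alpha; true positives come from the
    prevalence fraction of real effects, each detected with the given power.
    """
    false_pos = n_tests * (1 - prevalence) * alpha
    true_pos = n_tests * prevalence * power
    return false_pos / (false_pos + true_pos)

# 10% prevalence, alpha = 0.05, assumed power 0.675 -> FDR of 40%
print(expected_fdr(1000, 0.10, 0.05, 0.675))  # 0.4 (up to float rounding)
```

The same arithmetic shows why scale matters: the expected FDR is independent of N, but the absolute number of false positives to be validated grows linearly with it.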
Selecting an effect-size cut-off does not offer a theoretical guarantee that the false
discoveries are kept at a known level, so that estimation has to come from
elsewhere: either accepted knowledge or the current experiment. In the presence of
replicates (biological or technical), the right cut-off can be evaluated by applying it to
same-same samples: if a certain cut-off yields few changes for same-same
comparisons, it can then be used for comparisons between different groups.
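One simple way to pick such a cut-off empirically, assuming same-same log ratios are available, is to take a high quantile of their absolute values; the 99th percentile below is an arbitrary illustrative choice, not a recommendation from the paper:

```python
def cutoff_from_same_same(log_ratios, quantile=0.99):
    """Choose an effect-size cut-off from same-same comparisons.

    Sort the absolute same-same log ratios and return the value at the
    requested quantile; only a small fraction of null comparisons would
    exceed the returned cut-off (about 1% at quantile=0.99).
    """
    vals = sorted(abs(r) for r in log_ratios)
    idx = min(len(vals) - 1, int(quantile * len(vals)))
    return vals[idx]
```

This gives an empirical, experiment-specific handle on the false discovery rate, rather than the theoretical guarantee that multiple testing corrections provide.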
For a real life experiment, it is hard to know whether the multiple testing corrections
are too blunt, or whether the experiment is mostly noise and there simply are no
differentially expressed proteins. P-value histograms and utilities developed to
estimate the proportion of true null hypotheses may help in this situation. A
histogram of all p-values generated from repeated tests can show whether the
dataset is likely to have a non-negligible proportion of true positives [1, 18]. In a
random dataset the likely result will be a flat p-value histogram. When the dataset
contains a proportion of entities with a true effect, they will generate lower p-values,
and the histogram will usually show a peak on the left; this structure is clearly visible
in the simulations above, even at low prevalence. A clear left peak in the p-value
histogram, but no differentially expressed proteins after multiple testing corrections,
likely indicates that the corrections used have been too strict. One could then
increase the significance threshold or investigate the option of a fold change and p-
value cut-off only.
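A crude version of this left-peak check is easy to automate; the factor-of-two enrichment threshold below is an arbitrary heuristic for illustration, not a published rule:

```python
def pvalue_histogram(pvals, bins=20):
    """Count p-values falling into equal-width bins on [0, 1]."""
    counts = [0] * bins
    for p in pvals:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts

def has_left_peak(pvals, bins=20):
    """Flag a dataset whose leftmost bin holds over twice the uniform share.

    Under the global null the p-values are uniform, so each bin expects
    len(pvals)/bins entries; a clear excess in the first bin suggests a
    non-negligible proportion of true effects.
    """
    counts = pvalue_histogram(pvals, bins)
    expected_per_bin = len(pvals) / bins
    return counts[0] > 2 * expected_per_bin
```

A dataset that trips this flag but yields nothing after multiple testing corrections is exactly the "blunt tool" situation described above.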
Since the prevalence of true positives impacts the subsequent analysis, utilities have
been developed to estimate the proportion of true null hypotheses (no real effect), for
example as part of the influential microarray analysis R package ‘limma’ [19]. Such
utilities can be usefully employed in proteomics. Functions like propTrueNull [20] and
convest [21] take a list of p-values as input and estimate the percentage of true null
hypotheses under certain assumptions. An experiment with a peak structure in the
histogram, and an estimated percentage of true effect higher than 10%, but yielding
no proteins differentially expressed after multiple testing corrections, would suggest
that the approach taken is too strict.
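The idea behind these estimators can be illustrated with a Storey-style single-lambda estimate; limma's propTrueNull and convest are considerably more sophisticated, so this sketch only shows the principle, on synthetic p-values where the true null proportion is 90%:

```python
import random

def prop_true_null(pvals, lam=0.5):
    """Estimate the proportion of true null hypotheses from p-values.

    Null p-values are uniform on [0, 1], so the tail above `lam` is
    (almost) pure null; scaling that tail count up estimates pi0.
    """
    m = len(pvals)
    return min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * m))

random.seed(1)
# 900 null tests (uniform p-values) plus 100 true effects (p-values
# concentrated near zero, drawn here from a Beta(0.1, 1) distribution)
pvals = [random.random() for _ in range(900)]
pvals += [random.betavariate(0.1, 1) for _ in range(100)]
print(prop_true_null(pvals))  # typically close to the true value of 0.9
```

An estimate well below 1 from such a utility, combined with no significant proteins after correction, is the signature of corrections that are too strict for the experiment at hand.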
In recent years, analysis platforms for discovery (such as DAVID, IPA, GeneGO,
and KEGG) have improved immensely. Often the approach taken when investigating
broad changes between conditions is to place the differentially expressed proteins in
the context of pathways or processes, and to place more emphasis on the proteins in
the top enriched categories. Implicitly, we assume false positives are evenly
distributed, but true changes should cluster in relevant processes. This approach
constitutes an additional filter, though its precise effect on the FDR is difficult to
assess in a general manner. Naturally, an enrichment approach is not applicable in a
purely technical context such as a spike-in experiment, or when single biomarkers
are sought rather than a broad pathway-based functional understanding.
Another option for restricting the FDR is to consider the protein quantitation quality,
for instance by using the underlying peptides. Whilst the final dataset containing
protein abundances may be small scale in proteomics experiments, there is often a
wealth of information hidden underneath: peptide level peak areas or ratios, or ion
level information. For instance, proteins whose quantitation is based on multiple or
better quality peptides may generate more reliable information; this underlying ion
and peptide level information is already harnessed by tools such as MSstats [14].
The SWATH spike-in experiment previously mentioned also provides an example of
restricting the FDR by considering peptide level calculations, in the same low power
situation where using protein level abundances and multiple testing corrections
proved too strict [8].
Our range of simulations and examples (see Supplemental R script) shows that,
while multiple testing corrections adequately constrain the FDR and remain an
important tool, they can be blunt in low power situations, which unfortunately are
common in proteomics, where triplicate experiments are frequently run, and
particularly so for labelled experiments affected by ratio compression (TMT and
iTRAQ). The low power, low effect prevalence scenario is difficult, and if p-values are
used without corrections the FDR will be high – with prevalence of true effect of 10%
the FDR may be as high as 40% when the common p-value cut-off of 0.05 is
employed alone. However, other simple methods such as considering effect size cut-
offs, peptide-level information, or enrichment analyses, can also reduce the FDR in
the smaller scale scenarios of discovery proteomics without completely eliminating
the true positives. Hence, we believe that multiple testing corrections should be
regarded as one of several available tools, rather than as an automatic requirement.
Acknowledgements
The authors acknowledge support from BioPlatforms Australia through the Australian
Government’s National Collaborative Research Infrastructure Scheme, and the
Australian Research Council training centre for Molecular Technologies in the Food
Industry. PAH completed work on this manuscript while on sabbatical at UCSD, and
wishes to acknowledge the gracious support of Majid Ghassemian and Betsy
Komives.
References
[1] Diz, A. P., Carvajal-Rodríguez, A., Skibinski, D. O. F., Multiple Hypothesis Testing in Proteomics: A Strategy for Experimental Work. Molecular & Cellular Proteomics 2011, 10.
[2] Schinke-Braun, M., Couget, J. A., Cardiac Gene Expression, Springer 2007, pp. 13-40.
[3] Holm, S., A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 1979, 65-70.
[4] Bantscheff, M., Schirle, M., Sweetman, G., Rick, J., Kuster, B., Quantitative mass spectrometry in proteomics: a critical review. Analytical and Bioanalytical Chemistry 2007, 389, 1017-1031.
[5] Ow, S. Y., Salim, M., Noirel, J., Evans, C., et al., iTRAQ Underestimation in Simple and Complex Mixtures: “The Good, the Bad and the Ugly”. Journal of Proteome Research 2009, 8, 5347-5355.
[6] Savitski, M. M., Sweetman, G., Askenazi, M., Marto, J. A., et al., Delayed Fragmentation and Optimized Isolation Width Settings for Improvement of Protein Identification and Accuracy of Isobaric Mass Tag Quantification on Orbitrap-Type Mass Spectrometers. Analytical Chemistry 2011, 83, 8959-8967.
[7] Benjamini, Y., Hochberg, Y., Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 1995, 289-300.
[8] Wu, J. X., Song, X., Pascovici, D., Zaw, T., et al., SWATH Mass Spectrometry Performance Using Extended Peptide MS/MS Assay Libraries. Molecular & Cellular Proteomics 2016, 15, 2501-2514.
[9] Storey, J. D., A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002, 64, 479-498.
[10] Breitwieser, F. P., Müller, A., Dayon, L., Köcher, T., et al., General statistical modeling of data from protein relative expression isobaric tags. Journal of Proteome Research 2011, 10, 2758-2766.
[11] Zybailov, B., Mosley, A. L., Sardiu, M. E., Coleman, M. K., et al., Statistical Analysis of Membrane Proteome Expression Changes in Saccharomyces cerevisiae. Journal of Proteome Research 2006, 5, 2339-2347.
[12] Hill, E. G., Schwacke, J. H., Comte-Walters, S., Slate, E. H., et al., A statistical model for iTRAQ data analysis. Journal of Proteome Research 2008, 7, 3091-3101.
[13] Oberg, A. L., Mahoney, D. W., Eckel-Passow, J. E., Malone, C. J., et al., Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. Journal of Proteome Research 2008, 7, 225-233.
[14] Choi, M., Chang, C.-Y., Clough, T., Broudy, D., et al., MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 2014.
[15] Zhang, Y., Wen, Z., Washburn, M. P., Florens, L., Improving Label-Free Quantitative Proteomics Strategies by Distributing Shared Peptides and Stabilizing Variance. Analytical Chemistry 2015, 87, 4749-4756.
[16] Pavelka, N., Fournier, M. L., Swanson, S. K., Pelizzola, M., et al., Statistical similarities between transcriptomics and quantitative shotgun proteomics data. Molecular & Cellular Proteomics 2008, 7, 631-644.
[17] Millis, S. R., Statistical practices: The seven deadly sins. Child Neuropsychology 2003, 9, 221-233.
[18] Pounds, S. B., Estimation and control of multiple testing error rates for microarray studies. Briefings in Bioinformatics 2006, 7, 25-36.
[19] Smyth, G. K., Bioinformatics and computational biology solutions using R and Bioconductor, Springer 2005, pp. 397-420.
[20] Phipson, B., Empirical bayes modelling of expression profiles and their associations, University of Melbourne, Australia 2013.
[21] Langaas, M., Lindqvist, B. H., Ferkingstad, E., Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005, 67, 555-572.
Figure Legends
Figure 1: 1000 repeated independent t-tests comparing two groups, using either
n=3 or n=4 replicates; the distributions from which the samples are drawn are shown
in the top row. Left panels: no true effects, samples are drawn at random from the same
distribution (mean=0, SD=1); the resulting p-values are uniformly distributed. There
are ~5% proteins differentially expressed (uncorrected P-value < 0.05), but none if
using the BH FDR correction (BH-adjusted p-value < 0.05). Middle panels: samples
are drawn from shifted distributions, with a true difference in means of 2. Without
corrections ~380 proteins are found as differentially expressed (P-value < 0.05), or
~600 when using 4 replicates. With BH corrections no proteins are differentially
expressed when n=3, despite a clear effect. When n=4, there are ~300 proteins
differentially expressed after BH corrections. Right panels: effect size has been
increased to 4; now there are a large number of differentially expressed proteins
when using BH-corrections, for either 3 or 4 replicates.
Figure 2: Simulated data with low power (difference in means=2, n=3 replicates) and
varying prevalence of true effect. The numbers of true positives and false positives
are shown for each prevalence, along with the FDR (in brackets), under 4 scenarios
shown in order top to bottom: BH-adjusted p-values < 0.05, Q-values < 0.05, P-value
< 0.05, and P-value and effect cut-off > 2. As the true effect prevalence increases
from 0% to 80%, multiple testing corrections keep the false positives limited. If using
BH-adjusted p-values < 0.05, no true positives are detected at any stage; the false
negatives increase linearly. For the Storey Q-values some true positives start to be
detected at very high prevalence (80%). The simulation is run with either N=1000
entities (center) or N=50,000 (right). In the medium scale experiment, using an effect
cut-off restricts the false positives to single digits, while still allowing for some true
positives to be detected.