www.proteomics-journal.com Page 1 Proteomics
Received: 28/01/2016; Revised: 12/07/2016; Accepted: 21/07/2016
This article has been accepted for publication and undergone full peer review but has not been
through the copyediting, typesetting, pagination and proofreading process, which may lead to
differences between this version and the Version of Record. Please cite this article as doi:
10.1002/pmic.201600044.
This article is protected by copyright. All rights reserved.
Viewpoint
Multiple testing corrections in quantitative proteomics: a useful but blunt tool
Dana Pascovici1, David Handler2, Jemma X. Wu1, Paul A. Haynes2,*
1Australian Proteome Analysis Facility, Macquarie University, Sydney, Australia
2Department of Chemistry and Biomolecular Sciences, Macquarie University, NSW
2109, Australia
*To whom correspondence should be addressed
Professor Paul A. Haynes
Department of Chemistry and Biomolecular Sciences
Macquarie University
North Ryde, NSW 2109, Australia
Email: [email protected]
Phone: 61-2-9850 6258
Fax: 61-2-9850 6200
Keywords: false discovery rate, shotgun proteomics, FDR, multiple testing corrections
Abbreviations
BH – Benjamini and Hochberg, FDR – false discovery rate, iTRAQ – isobaric tags for
relative and absolute quantitation, NSAF – normalized spectral abundance factor.
Abstract
Multiple testing corrections are a useful tool for restricting the false discovery rate,
but can be blunt in the context of low power, as we demonstrate by a series of
simple simulations. Unfortunately, in proteomics experiments low power can be
common, driven by proteomics-specific issues such as small effects due to ratio
compression, and few replicates due to high reagent cost, limited instrument time
availability, and other constraints; in such situations, most multiple testing correction methods, if
used with conventional thresholds, will fail to detect any true positives even when
many exist. In this low power, medium scale situation, other methods such as effect
size considerations or peptide-level calculations may be a more effective option,
even if they do not offer the same theoretical guarantee of a low false discovery rate.
Thus, we aim to highlight in this article that proteomics presents some specific
challenges to the standard multiple testing corrections methods, which should be
employed as a useful tool but not be regarded as a required rubber stamp.
Viewpoint
Multiple testing corrections come in many “flavours”. They are employed to limit the
number of false positives occurring by chance when an analysis is repeated many
times and thus to reduce the false discovery rate (FDR) at the analysis level. In
proteomics they were initially borrowed from microarray research and other high
throughput areas, where they quickly became the norm. The informative review by
Diz [1] shows that multiple testing corrections were seldom used in quantitative
proteomics until relatively recently, and recommends a sensible multi-method
approach. Here we suggest that one reason for the slower uptake of such methods
in proteomics experiments, despite their ease of use and theoretical appeal, is that
they remain a useful but blunt tool that is less effective in discovery proteomics than
in, for example, the microarray environment.
In this paper we describe five key factors that combine to make multiple testing
corrections less effective in proteomics: medium problem scale; lower effect size due
to possible compression; lower analysis power due to high cost; percentage of data
showing an effect; and data distribution quirks. We then discuss some simple
alternatives that can help reduce, or at least understand, the FDR in this medium
scale, low power situation.
(1) The typical microarray experiment often contains around 50,000 probes [2];
assuming 10% of the data shows a pattern of differential expression, this could yield
over 5000 probes to be considered for further statistical analysis. In contrast, a large
scale proteomics experiment might identify and quantify 1000-2000 proteins at the
start using current methods, of which perhaps 100-200 might be found as
differentially expressed; if strict multiple testing correction approaches are
undertaken (such as Bonferroni or Holm [3]), the number may drop to only tens.
Restricting the analysis to such low numbers precludes, from the start, further
analyses by pathway or biological process enrichment; that may be acceptable if the
goal is to find one or two potential biomarkers, but not if the goal is to canvass the
differentially expressed proteins and sort them into categories meaningful for the
project at hand.
(2) The p-values generated by hypothesis testing from a particular experiment
depend crucially on the effect size (i.e. the difference between groups compared);
larger differences yield lower p-values. However, proteomics experiments that rely
on quantitation at the MS/MS level such as iTRAQ and TMT are known to undergo
ratio compression [4-6], and hence will report apparently smaller effects.
Necessarily, the resulting p-values will be higher, and the corrected p-values higher
still, meaning that few (if any) proteins pass the corrected thresholds.
To demonstrate the contribution of effect size, we analyse a simple simulated
dataset of 1000 entities (proteins), drawn randomly in triplicate from two normally
distributed samples (groups), which are then compared by a two-sample t-test. We
evaluate first the uncorrected t-test p-values and the commonly used Benjamini and
Hochberg [7] FDR-adjusted p-values. In Figure 1A, the two groups are sampled
from the same distribution (mean=0, SD=1), thus any differentially expressed
proteins should be regarded as false positives. The t-test p-value histogram is
shown, and the number of differentially expressed proteins with or without multiple
testing corrections is listed. The corrections work well, reducing the false positives
from about 5% to none. The middle panels show results obtained when sampling
from different distributions (mean difference 2); no proteins are found differentially
expressed when using the BH-FDR adjustments and the conventional 0.05
threshold. In this situation, the BH corrections are a blunt tool, unable to detect any
differentially expressed proteins despite the clear structure in the p-value histogram.
In the right panels, the effect size is larger again (mean difference 4), and many
proteins are found to be differentially expressed with or without corrections. Thus, in
the more difficult situation where effect sizes are smaller, corrections are too strict a
tool.
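The paper's simulations are provided as a Supplemental R script, which is not reproduced here; as a language-agnostic illustration, the BH step-up adjustment used throughout (equivalent to R's p.adjust(p, method = "BH")) can be sketched in Python:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure).

    Sort the p-values, scale each by m/rank, then enforce monotonicity by
    taking the running minimum from the largest p-value downwards.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Five individually suggestive p-values all collapse onto the
# conventional threshold after adjustment (each becomes 0.05):
print(bh_adjust([0.01, 0.02, 0.03, 0.04, 0.05]))
```

This makes the bluntness concrete: a cluster of modest p-values, each below 0.05 on its own, can sit exactly on the threshold once adjusted, so none survive a strict "adjusted p < 0.05" rule.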
(3) Proteomics remains a high cost undertaking; hence, experiments with a low
number of replicates (2-5) are often performed, particularly in situations where the
biological variability is expected to be low (cell lines, pools of plants, etc). In such low
power experiments, p-value corrections are bound to be useful but blunt. Figure 1B
revisits the simulation described above, but now with 4 replicates rather than
triplicates; note that the simulations reported in [1] used 5 replicates, and hence had
higher power still. With more replicates, p-value corrections now yield several
hundred differentially expressed entities (BH-adjusted p-values < 0.05), with either
small or large effects. The boxed area highlighted in Figure 1, with small effects and
few replicates, is particularly difficult, and corrections are quite blunt in this scenario.
The issues described here are apparent in the spike-in SWATH experiment of Wu et
al [8], containing human and yeast cell lysates mixed so that the yeast proteins are
differentially expressed at known fold changes. At low fold changes (small effects)
and with a small number of replicates (n=3), multiple testing corrections correctly
enforce low FDR, but at the expense of having no true positives even when many
are present [8].
(4) Perhaps the key factor determining the usefulness of multiple testing corrections
and the final FDR is the percentage of proteins showing a genuine effect. This
percentage plays a similar role to the prevalence of a disease for a diagnostic test.
With no true effects, any multiple testing correction method, however strict, will work
well as long as it minimizes the number of false positives. At the other (unlikely)
extreme, when all entities have an underlying true effect, multiple testing corrections
are detrimental as they reduce the true positives without lowering the FDR. Multiple
testing corrections become vital as we move towards a situation with fewer true
effects. We now expand our previous simulation to include a varying prevalence of
true effect, working in the difficult situation with few replicates (n=3) and a small
effect size (Effect=2). We again randomly draw 3 replicates from a normal
distribution (mean =0, SD=1) for one group, while for the other group we consider a
mixture. For a certain proportion of the tests, which we vary in steps from 0 to 1, we
sample from a normal distribution with a small effect (mean=2, SD=1); these are
the entities showing a true effect, as there is a shift in the underlying distributions
they are sampled from. For the remaining tests we sample from the same
distribution (mean=0, SD=1), so the null hypothesis holds. We run the
simulation 1000 times (Figure 2, Panel A), for all proportions, thus considering
varying prevalence scenarios, and summarize the true and false positives in each
case. We consider both BH-adjusted p-values, and Storey q-values [9], the two
methods most commonly used for FDR correction [1], with standard cut-offs of less
than 0.05. Both corrections restrict the false positives to near zero, but with both
methods no true positives are identified for effect prevalence less than 80%. In
this simulation the Storey corrections yield some true positives once the prevalence
reaches 80%; the BH corrections never do. The FDR stays low, but the false
negatives increase as the prevalence increases. In microarray analysis, a frequent
assumption is that changes occur for a small percentage of the data – and with
50,000 entities this may not be too limiting an assumption anyway; in proteomics the
prevalence may be higher. In the presence of small effects and few replicates, the
bluntness of the method will increase with the prevalence of true effects.
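The Storey q-value calculation referenced above can also be sketched in a few lines; this is a simplified single-lambda version (the qvalue R package smooths the pi0 estimate over many lambda values, so treat this only as an illustration of the idea):

```python
def storey_qvalues(pvals, lam=0.5):
    """Simplified Storey q-values with a single-lambda pi0 estimate.

    pi0 (the estimated proportion of true nulls) rescales the BH-style
    adjusted p-values; whenever pi0 < 1 the q-values are correspondingly
    less conservative than plain BH-adjusted p-values.
    """
    m = len(pvals)
    # p-values above lam should be (almost) all nulls; nulls are uniform,
    # so scaling that tail count up estimates the true null proportion
    pi0 = min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * m))
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals
```

The pi0 rescaling is why, in the simulations above, the Storey corrections begin to recover true positives at high prevalence while the BH corrections never do.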
(5) In the tests above, the simulated data satisfies the requirements of normality and
homoscedasticity imposed by the statistical methods applied. However, in many
cases proteomics data does not. For example, iTRAQ log ratio data is better
modelled by a Cauchy distribution [10], and the departure from normality may impact
the resulting p-value distributions. Normalized spectral abundance factor (NSAF)
data [11] has an underlying integer-based count structure which may create other
problems, such as unnaturally low spurious variances (e.g. resulting from underlying
triplicate counts such as (1 1 1) or (2 3 2)); such proteins would have low p-values
even when FDR-corrected. These issues are outside the scope of this note and its
simple simulations, and separate methods or tools have been devised to
account for them: more sophisticated statistical models for iTRAQ analysis [12-14],
and different normalization and analysis approaches for NSAF data [15, 16]. If
unaccounted for, they can make the ordering of the list of p-values, which is always
preserved by multiple testing corrections, less reliable.
The simulations undertaken show that, even when all statistical assumptions are
met, multiple testing corrections can be a blunt tool that fails to detect true changes,
especially in a low power situation. Unfortunately, this can be precisely the situation
encountered in a labelled experiment (iTRAQ or TMT), where ratios are compressed
and reagent costs are high. Nonetheless, none of the key reasons above is an
excuse for generating datasets with high numbers of false positives,
which would only further waste time and money. Hence, if multiple testing
corrections are not applied, then other means for reducing, or at least understanding,
the FDR should be employed. Below we list a few alternatives; none of them are
general, or able to guarantee a particular FDR, so they perhaps lack the theoretical
appeal of multiple testing corrections. However, in practical terms, they are simple
and potentially more effective, particularly in the low power situation driven by effect
compression and high cost.
Imposing an effect size cut-off, such as a fold change cut-off, is commonly employed
to help reduce the FDR, but is often criticised for not offering a theoretical guarantee
of keeping the FDR below a prescribed level. Yet over-reliance on null
hypothesis statistical testing (hence on p-values alone, corrected or otherwise,
without regard for effect sizes) is the first of the statistical deadly sins [17]; failure to
account for multiple testing comes much later, at number five! In order to examine the
effect of an imposed cut-off, we revisit the varying prevalence simulation above, with
a small effect (means difference = 2), and only three replicates (Figure 2). With such
low statistical power, both BH and Storey multiple testing corrections failed to detect
differentially expressed entities at prevalence less than 80%; we now look at the
impact of imposing an effect size cut-off (difference > 2). While panel A runs the
simulation in a small experiment (N=1000), panel B simulates a large experiment
(N=50,000).
The simulation confirms that in the absence of any multiple testing corrections, using
only a p-value cut-off of 0.05, the FDR is considerable, and worse at low effect
prevalence. For this low power simulation, the FDR can be as high as ~40% when
the prevalence of true effect is 10%, and of course 100% when there is no true effect
at all. Applying a fold change cut-off reduces the FDR%, without eliminating the true
positive identifications. This approach works better when the scale of the problem is
medium: with 50,000 entities, the same simulation would still yield ~300 false
positives, whilst at the lower range of the scale (N=1000) the number of false
positives remains very small, typically in the single digits. Although the percentage
and thus FDR remains the same in the small or large scale scenario, validation (for
example by Western blots) could be more readily undertaken for a smaller number of
proteins.
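The FDR figures quoted above follow directly from the prevalence, the p-value threshold, and the power of the test. A back-of-envelope calculation makes this explicit (the power value of 0.675 is an illustrative assumption chosen to reproduce the ~40% figure, not a number taken from the paper's simulation):

```python
def expected_fdr(n_tests, prevalence, alpha, power):
    """Expected FDR for a plain p < alpha rule, ignoring sampling noise.

    False positives come from the (1 - prevalence) fraction of true nulls,
    each rejected with probability alpha; true positives come from the
    prevalence fraction of real effects, each detected with the given power.
    """
    false_pos = n_tests * (1 - prevalence) * alpha
    true_pos = n_tests * prevalence * power
    return false_pos / (false_pos + true_pos)

# 10% prevalence, alpha = 0.05, assumed power 0.675 -> FDR of 40%
print(expected_fdr(1000, 0.10, 0.05, 0.675))  # 0.4 (up to float rounding)
```

The same arithmetic shows why scale matters: the expected FDR is independent of N, but the absolute number of false positives to be validated grows linearly with it.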
Selecting an effect-size cut-off does not offer a theoretical guarantee that the false
discoveries are kept at a known level, so that estimation has to come from
elsewhere: either accepted knowledge or the current experiment. In the presence of
replicates (biological or technical), the right cut-off can be evaluated by applying it to
same-same samples: if a certain cut-off yields few changes for same-same
comparisons, it can then be used for comparisons between different groups.
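One simple way to pick such a cut-off empirically, assuming same-same log ratios are available, is to take a high quantile of their absolute values; the 99th percentile below is an arbitrary illustrative choice, not a recommendation from the paper:

```python
def cutoff_from_same_same(log_ratios, quantile=0.99):
    """Choose an effect-size cut-off from same-same comparisons.

    Sort the absolute same-same log ratios and return the value at the
    requested quantile; only a small fraction of null comparisons would
    exceed the returned cut-off (about 1% at quantile=0.99).
    """
    vals = sorted(abs(r) for r in log_ratios)
    idx = min(len(vals) - 1, int(quantile * len(vals)))
    return vals[idx]
```

This gives an empirical, experiment-specific handle on the false discovery rate, rather than the theoretical guarantee that multiple testing corrections provide.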
For a real life experiment, it is hard to know whether the multiple testing corrections
are too blunt, or whether the experiment is mostly noise and there simply are no
differentially expressed proteins. P-value histograms and utilities developed to
estimate the proportion of true null hypotheses may help in this situation. A
histogram of all p-values generated from repeated tests can show whether the
dataset is likely to have a non-negligible proportion of true positives [1, 18]. In a
random dataset the likely result will be a flat p-value histogram. When the dataset
contains a proportion of entities with a true effect, they will generate lower p-values,
and the histogram will usually show a peak on the left; this structure is clearly visible
in the simulations above, even at low prevalence. A clear left peak in the p-value
histogram, but no differentially expressed proteins after multiple testing corrections,
likely indicates that the corrections used have been too strict. One could then
increase the significance threshold or investigate the option of a fold change and p-
value cut-off only.
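A crude version of this left-peak check is easy to automate; the factor-of-two enrichment threshold below is an arbitrary heuristic for illustration, not a published rule:

```python
def pvalue_histogram(pvals, bins=20):
    """Count p-values falling into equal-width bins on [0, 1]."""
    counts = [0] * bins
    for p in pvals:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts

def has_left_peak(pvals, bins=20):
    """Flag a dataset whose leftmost bin holds over twice the uniform share.

    Under the global null the p-values are uniform, so each bin expects
    len(pvals)/bins entries; a clear excess in the first bin suggests a
    non-negligible proportion of true effects.
    """
    counts = pvalue_histogram(pvals, bins)
    expected_per_bin = len(pvals) / bins
    return counts[0] > 2 * expected_per_bin
```

A dataset that trips this flag but yields nothing after multiple testing corrections is exactly the "blunt tool" situation described above.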
Since the prevalence of true positives impacts the subsequent analysis, utilities have
been developed to estimate the proportion of true null hypotheses (no real effect), for
example as part of the influential microarray analysis R package ‘limma’ [19]. Such
utilities can be usefully employed in proteomics. Functions like propTrueNull [20] and
convest [21] take a list of p-values as input and estimate the percentage of true null
hypotheses under certain assumptions. An experiment with a peak structure in the
histogram, and an estimated percentage of true effect higher than 10%, but yielding
no proteins differentially expressed after multiple testing corrections, would suggest
that the approach taken is too strict.
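The idea behind these estimators can be illustrated with a Storey-style single-lambda estimate; limma's propTrueNull and convest are considerably more sophisticated, so this sketch only shows the principle, on synthetic p-values where the true null proportion is 90%:

```python
import random

def prop_true_null(pvals, lam=0.5):
    """Estimate the proportion of true null hypotheses from p-values.

    Null p-values are uniform on [0, 1], so the tail above `lam` is
    (almost) pure null; scaling that tail count up estimates pi0.
    """
    m = len(pvals)
    return min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * m))

random.seed(1)
# 900 null tests (uniform p-values) plus 100 true effects (p-values
# concentrated near zero, drawn here from a Beta(0.1, 1) distribution)
pvals = [random.random() for _ in range(900)]
pvals += [random.betavariate(0.1, 1) for _ in range(100)]
print(prop_true_null(pvals))  # typically close to the true value of 0.9
```

An estimate well below 1 from such a utility, combined with no significant proteins after correction, is the signature of corrections that are too strict for the experiment at hand.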
In recent years, analysis platforms for discovery (such as DAVID, IPA, GeneGO,
and KEGG) have improved immensely. Often the approach taken when investigating
broad changes between conditions is to place the differentially expressed proteins in
the context of pathways or processes, and to place more emphasis on the proteins in
the top enriched categories. Implicitly, we assume false positives are evenly
distributed, but true changes should cluster in relevant processes. This approach
constitutes an additional filter, though its precise effect on the FDR is difficult to
assess in a general manner. Naturally, an enrichment approach is not applicable in a
purely technical context such as a spike-in experiment, or when single biomarkers
are sought rather than a broad pathway-based functional understanding.
Another option for restricting the FDR is to consider the protein quantitation quality,
for instance by using the underlying peptides. Whilst the final dataset containing
protein abundances may be small scale in proteomics experiments, there is often a
wealth of information hidden underneath: peptide level peak areas or ratios, or ion
level information. For instance, proteins whose quantitation is based on multiple or
better quality peptides may generate more reliable information; this underlying ion
and peptide level information is already harnessed by tools such as MSstats [14].
The SWATH spike-in experiment previously mentioned also provides an example of
restricting the FDR by considering peptide level calculations, in the same low power
situation where using protein level abundances and multiple testing corrections
proved too strict [8].
Our range of simulations and examples (see Supplemental R script) shows that,
while multiple testing corrections adequately constrain the FDR and remain an
important tool, they can be blunt in low power situations, which unfortunately are
common in proteomics, where triplicate experiments are frequently run, and
particularly so for labelled experiments affected by ratio compression (TMT and
iTRAQ). The low power, low effect prevalence scenario is difficult, and if p-values are
used without corrections the FDR will be high – with prevalence of true effect of 10%
the FDR may be as high as 40% when the common p-value cut-off of 0.05 is
employed alone. However, other simple methods such as considering effect size cut-
offs, peptide-level information, or enrichment analyses, can also reduce the FDR in
the smaller scale scenarios of discovery proteomics without completely eliminating
the true positives. Hence, we believe that multiple testing corrections should be
regarded as one of several available tools, rather than as an automatic requirement.
Acknowledgements
The authors acknowledge support from BioPlatforms Australia through the Australian
Government’s National Collaborative Research Infrastructure Scheme, and the
Australian Research Council training centre for Molecular Technologies in the Food
Industry. PAH completed work on this manuscript while on sabbatical at UCSD, and
wishes to acknowledge the gracious support of Majid Ghassemian and Betsy
Komives.
References
[1] Diz, A. P., Carvajal-Rodríguez, A., Skibinski, D. O. F., Multiple Hypothesis Testing in Proteomics: A Strategy for Experimental Work. Molecular & Cellular Proteomics 2011, 10.
[2] Schinke-Braun, M., Couget, J. A., Cardiac Gene Expression, Springer 2007, pp. 13-40.
[3] Holm, S., A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 1979, 65-70.
[4] Bantscheff, M., Schirle, M., Sweetman, G., Rick, J., Kuster, B., Quantitative mass spectrometry in proteomics: a critical review. Analytical and Bioanalytical Chemistry 2007, 389, 1017-1031.
[5] Ow, S. Y., Salim, M., Noirel, J., Evans, C., et al., iTRAQ Underestimation in Simple and Complex Mixtures: “The Good, the Bad and the Ugly”. Journal of Proteome Research 2009, 8, 5347-5355.
[6] Savitski, M. M., Sweetman, G., Askenazi, M., Marto, J. A., et al., Delayed Fragmentation and Optimized Isolation Width Settings for Improvement of Protein Identification and Accuracy of Isobaric Mass Tag Quantification on Orbitrap-Type Mass Spectrometers. Analytical Chemistry 2011, 83, 8959-8967.
[7] Benjamini, Y., Hochberg, Y., Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 1995, 289-300.
[8] Wu, J. X., Song, X., Pascovici, D., Zaw, T., et al., SWATH Mass Spectrometry Performance Using Extended Peptide MS/MS Assay Libraries. Molecular & Cellular Proteomics 2016, 15, 2501-2514.
[9] Storey, J. D., A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002, 64, 479-498.
[10] Breitwieser, F. P., Müller, A., Dayon, L., Köcher, T., et al., General statistical modeling of data from protein relative expression isobaric tags. Journal of Proteome Research 2011, 10, 2758-2766.
[11] Zybailov, B., Mosley, A. L., Sardiu, M. E., Coleman, M. K., et al., Statistical Analysis of Membrane Proteome Expression Changes in Saccharomyces cerevisiae. Journal of Proteome Research 2006, 5, 2339-2347.
[12] Hill, E. G., Schwacke, J. H., Comte-Walters, S., Slate, E. H., et al., A statistical model for iTRAQ data analysis. Journal of Proteome Research 2008, 7, 3091-3101.
[13] Oberg, A. L., Mahoney, D. W., Eckel-Passow, J. E., Malone, C. J., et al., Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. Journal of Proteome Research 2008, 7, 225-233.
[14] Choi, M., Chang, C.-Y., Clough, T., Broudy, D., et al., MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 2014.
[15] Zhang, Y., Wen, Z., Washburn, M. P., Florens, L., Improving Label-Free Quantitative Proteomics Strategies by Distributing Shared Peptides and Stabilizing Variance. Analytical Chemistry 2015, 87, 4749-4756.
[16] Pavelka, N., Fournier, M. L., Swanson, S. K., Pelizzola, M., et al., Statistical similarities between transcriptomics and quantitative shotgun proteomics data. Molecular & Cellular Proteomics 2008, 7, 631-644.
[17] Millis, S. R., Statistical practices: The seven deadly sins. Child Neuropsychology 2003, 9, 221-233.
[18] Pounds, S. B., Estimation and control of multiple testing error rates for microarray studies. Briefings in Bioinformatics 2006, 7, 25-36.
[19] Smyth, G. K., Bioinformatics and computational biology solutions using R and Bioconductor, Springer 2005, pp. 397-420.
[20] Phipson, B., Empirical bayes modelling of expression profiles and their associations, University of Melbourne, Australia 2013.
[21] Langaas, M., Lindqvist, B. H., Ferkingstad, E., Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005, 67, 555-572.
Figure Legends
Figure 1: 1000 repeated independent t-tests comparing two groups, using either
n=3 or n=4 replicates; the distributions from which the samples are drawn are shown
in the top row. Left panels: no true effects, samples are drawn at random from the same
distribution (mean=0, SD=1); the resulting p-values are uniformly distributed. There
are ~5% proteins differentially expressed (uncorrected P-value < 0.05), but none if
using the BH FDR correction (BH-adjusted p-value < 0.05). Middle panels: samples
are drawn from shifted distributions, with a true difference in means of 2. Without
corrections ~380 proteins are found as differentially expressed (P-value < 0.05), or
~600 when using 4 replicates. With BH corrections no proteins are differentially
expressed when n=3, despite a clear effect. When n=4, there are ~300 proteins
differentially expressed after BH corrections. Right panels: effect size has been
increased to 4; now there are a large number of differentially expressed proteins
when using BH-corrections, for either 3 or 4 replicates.
Figure 2: Simulated data with low power (difference in means=2, n=3 replicates) and
varying prevalence of true effect. The numbers of true positives and false positives
are shown for each prevalence, along with the FDR (in brackets), under 4 scenarios
shown in order top to bottom: BH-adjusted p-values < 0.05, Q-values < 0.05, P-value
< 0.05, and P-value and effect cut-off > 2. As the true effect prevalence increases
from 0% to 80%, multiple testing corrections keep the false positives limited. If using
BH-adjusted p-values < 0.05, no true positives are detected at any stage; the false
negatives increase linearly. For the Storey Q-values some true positives start to be
detected at very high prevalence (80%). The simulation is run with either N=1000
entities (center) or N=50,000 (right). In the medium scale experiment, using an effect
cut-off restricts the false positives to single digits, while still allowing for some true
positives to be detected.