False Discovery Rate Guy Yehuda

63
1 alse Discovery Rate Guy Yehuda

description

False Discovery Rate Guy Yehuda. Outline. Short introduction to statistics The problem of multiplicity FDR vs. FWE FDR control procedures and resampling Results of using FDR control procedures Discussion. Short introduction to statistics. - PowerPoint PPT Presentation

Transcript of False Discovery Rate Guy Yehuda

Page 1: False Discovery Rate Guy Yehuda

1

False Discovery RateGuy Yehuda

Page 2: False Discovery Rate Guy Yehuda

2

Outline

• Short introduction to statistics

• The problem of multiplicity

• FDR vs. FWE

• FDR control procedures and resampling

• Results of using FDR control procedures

• Discussion

Page 3: False Discovery Rate Guy Yehuda

3

Short introduction to statistics

• (Arithmetic) Mean is the sum of all the scores divided by the number of scores:

μ = ΣX/N

where μ is the population mean and N is the number of scores.

Page 4: False Discovery Rate Guy Yehuda

4

• The variance is a measure of how spread out a distribution is:

where μ is the mean and N is the number of scores.

• When the variance is computed in a sample, the statistic

(where M is the mean of the sample) can be used. S2 is a biased estimate of σ2, however.

Page 5: False Discovery Rate Guy Yehuda

5

• By far the most common formula for computing variance in a sample is:

which gives an unbiased estimate of σ2 .• A standard deviation is simply the square root of

variance.• In a normal distribution, about 68% of the scores

are within one standard deviation of the mean and about 95% of the scores are within two standards deviations of the mean.

Page 6: False Discovery Rate Guy Yehuda

6

• Usually, if you are given a large enough group of subjects, you will find that the distribution of just about any given trait will be a pretty bell shaped curve like the one below:

Page 7: False Discovery Rate Guy Yehuda

7

T test

A T-test is used to evaluate the statistical similarity of two samples

• Gather your data.

• Calculate mean and then variance.

• Calculate T-statistic.

Page 8: False Discovery Rate Guy Yehuda

8

• degrees of freedom = N-1

Page 9: False Discovery Rate Guy Yehuda

9

• Null hypothesis (H0) states that the two samples are identical.

• Alternative hypothesis (H1) states that the samples are statistically different.

• A result is significant if it is unlikely to have occurred by chance.

• The significance level of a test is the maximum probability of accidentally rejecting a true null hypothesis (a decision known as a Type I Error). The significance of a result is also called its p-value.

Page 10: False Discovery Rate Guy Yehuda

10

• We use the degrees of freedom and the t-test in-order to find the p-value in t-statistics table.

• If p-value is 0.03 and the significance level is 0.05, than the result is significant, i.e. the null hypothesis is rejected.

• The power of a statistical test is the probability that the test will reject a false null hypothesis. The higher the power, the greater the chance of obtaining a statistically significant result when the null hypothesis is false.

Page 11: False Discovery Rate Guy Yehuda

11

…Back to Microarray World

Page 12: False Discovery Rate Guy Yehuda

12

Identifying Differentially Expressed Genes

• Individual Hypotheses Testing For each gene use a significance test of 0.05 level

H0: Gene is similarly expressed

H1: Gene is differently expressed

A t-statistic is calculated for comparing gene expression mean between the control and treatment groups.

Page 13: False Discovery Rate Guy Yehuda

13

Thousands of genes in an experiment

Thousands of hypotheses are tested

Probability of type I error (false discovery)committed increases sharply

Multiplicity problem!

Are you sure all the tests are independent?

Page 14: False Discovery Rate Guy Yehuda

14

Are you sure all the tests are independent?

• Genes tend to subgroup into highly correlated expression levels for reasons such as co-regulation.

• Gene expression measurement errors may be dependent due to factors related to the RNA source, the normalization process etc.

We need to account for the dependency structure between the test statistics

Page 15: False Discovery Rate Guy Yehuda

15

What shall we do ?Let’s pretend there is no (multiplicity) problem!

If we deal with 10,000 genes for example, working at the individual 0.05 level, would identify, on the average, 10,000*0.05=500 differentially expressed genes,

even if really no such gene exists!Do we have an extra time and money to waste

exploring that amount of genes in vain at later stages?

Page 16: False Discovery Rate Guy Yehuda

16

What shall we do ?

Let’s panic!!!Control the probability of making even one error (or

more) - the FamilyWise (type I) Error-rate (FWE)

The Westfall and Young step-down alg. (WY, 89)Holm procedure (79)Bonferroni procedure – most conservative

Aren’t we missing many genes we need to explore more carefully later ?

Page 17: False Discovery Rate Guy Yehuda

17

Is there an in-between approach?

Don’t worry,

everything is under (FDR) control!

Page 18: False Discovery Rate Guy Yehuda

18

The False Discovery Rate (FDR) criterion

Benjamini and Hochberg (95) :

R = # rejected hypotheses = # discoveries

V of these may be in error = # false discoveries

The error (type I) in the entire study is measured by

Q V

RR 0

0 R 0

i.e. the proportion of false discoveries among the discoveries (0 if none found)

FDR = E(Q)

Does it make sense?

Page 19: False Discovery Rate Guy Yehuda

19

Does it make sense?• Inspecting 100 features:3 false ones among 60 discovered - bearable3 false ones among 4 discovered - unbearableSo this error rate is adaptive• The same argument holds when inspecting

10,000So this error rate is scalable• If nothing is “real” controlling the FDR at level q

guarantees FWE = Prob( V ≥ 1 ) = E( V/R ) = FDR ≤ q

• But otherwiseFWE = Prob( V ≥ 1 ) ≥ FDR

So there is room for improving detection power

Page 20: False Discovery Rate Guy Yehuda

20

How do we control FDR?

m = #simultaneously tested null hypotheses

m0 of them are true

For each hypothesis Hi a test statistic is calculated along with the corresponding p-value Pi

Page 21: False Discovery Rate Guy Yehuda

21

Linear step up procedure (BH procedure)

• Order the p-values P(1) ≤ P(2) ≤…≤ P(m)

For a desired FDR level q :• Let

• Reject

• If no such k exists reject none

kmax{i:p(i)(i / m)q}

H(1)0 , H(2)

0 ,..., H(k )0

Page 22: False Discovery Rate Guy Yehuda

22

FDR control of the BH procedure

Independent and continuous test statistics

FDR q · m0 /m q

Benjamini & Hochberg `95

Independent + Positive dependent

FDR q · m0 /m

General dependency FDR q · m0 / m · ( 1 + 1/2 + 1/3 + … 1/m)

Benjamini and Yekutieli `01

Page 23: False Discovery Rate Guy Yehuda

23

Imaginary ExampleLet’s explore expression levels of 6 genes.

A Simple research Question: find differentially expressed genes.

Solution: for each gene calculate a test statistics and get the corresponding p-value Pi ;

The values: 0.042, 0.007, 0.035, 0.12, 0.03, 0.00005

Test: BH procedure at q=0.05

1. Sort the 6 p-values p(1) … p(k) … p(6)

2. Rejection threshold: maximal p(k) such that

p(k) 0.05·k /6

Results: k = 4 => H(1), H(2) rejected => we have found those 2 differentially expressed genes

Page 24: False Discovery Rate Guy Yehuda

24

Adaptive procedures

The BH procedure is too conservative

FDR q · m0 /m

If m0 is known: use q` = q · m / m0 control FDR q and gain power.

Benjamini & Hochberg `00 Storey `01

Mosig et al. `02

Turkheimer et al. `02

Page 25: False Discovery Rate Guy Yehuda

25

Is it always that good?

• Adaptive methods offer better performance only by utilizing the difference between m / m0 and 1.

• If that difference is small, i.e. the potential proportion of differentially expressed genes is small, they offer little advantage in power while their properties aren’t well established

No proven FDR control

Page 26: False Discovery Rate Guy Yehuda

26

Multiplicity adjusted p-Values

• The results of a multiple testing procedure can be reported as multiplicity adjusted p-values: Each adjusted p-value is compared to the

desired significance level, and if smaller then the hypothesis is rejected.

Page 27: False Discovery Rate Guy Yehuda

27

FWE adjusted p-Values

For an FWE controlling procedure, the adjusted p-value of an individual hypothesis is the lowest level for which FWE ≤ α

For instance, for the Bonferroni procedure, the adjusted p-value is simply Pi*m

Page 28: False Discovery Rate Guy Yehuda

28

FDR adjusted p-Values

• For an FDR controlling procedure, the adjusted p-value of an individual hypothesis is the lowest level of FDR for which the hypothesis is first included in the set of rejected hypotheses.

• Thus the adjusted p-value of P(k) using BH :

Define PBH(k) = min { P(i)m/i, i ≥ k }

pBH(k) ≤ q <=> for some i ≥ k, p(i) ≤ qi/m <=>

H(k) is rejected at FDR level q

Page 29: False Discovery Rate Guy Yehuda

29

Resampling

Drawing repeated samples from the given data.

In place of the formidable formulas and mysterious tables of parametric and non-parametric tests based on complicated mathematics and arcane approximations, the basic resampling tools are simulations, created especially for the task.

Page 30: False Discovery Rate Guy Yehuda

30

Resampling

• If the data contains high inter-correlations (as in our case), generally designed multiple comparisons may be over-conservative in specific dependency structure.

• Resampling-based multiple testing proc. (as Westfall and Young 89) utilize the empirical dependency structure of the data to construct more powerful procedures.

Page 31: False Discovery Rate Guy Yehuda

31

Resampling

• In p-value resampling, the data is repeatedly resampled under the complete null hypotheses, and a vector of resample-based p-values is computed (for each hypothesis).

• V*(p) = #resampling-based p-values less than p.• V*(p) is an estimated upper bound to the expected

number of p-values corresponding to true null hypotheses less than .

• In simple words : we try to estimate the distribution of the p-values.

Page 32: False Discovery Rate Guy Yehuda

32

Resampling FWE adjustments

• Resample the data N times. The WY proc. estimates the FWE by

• The testing procedure is then simply: reject

if

N

pVpFWE est 0)(#

)(*

iH 0 )( iest pFWE

Page 33: False Discovery Rate Guy Yehuda

33

Resampling FDR adjustments

• But FDR is also a function of the number of false null hypotheses being rejected.

FDRest(p) = E { V*(p) / V*(p) + s(p) }

s(p) = #false null hypotheses less than p

Do we know s(p) ? No.

But we can try and estimate it!

Page 34: False Discovery Rate Guy Yehuda

34

Estimation methods

The two next methods use resampling to estimate the joint distribution of the p-values.

The FDR local (point) estimator is conservative on the mean.

The FDR upper limit bounds the FDR with probability 95%. More strict than the first method, and therefore less powerful. In the other hand it controls the empirical FDR with probability 0.95, hence supplies more protection.

Page 35: False Discovery Rate Guy Yehuda

35

Alternative Estimation method

Let’s use the BH procedure, but rather than using the raw p-values corresponding to the test statistics, the p-values are estimated by resampling from the marginal distribution and collapsing over all hypotheses.

In simple words: averaging over all the genes, and then find the adjusted p-values and compare to q.

The gain: It overcomes the discrete nature of the permutation distribution of a test statistics based on few observations (supply more possible values for the p-values).

Page 36: False Discovery Rate Guy Yehuda

36

If you insist then formally:

For the k-th gene, with an observed test statistics tk , while N = #iterations and I = #genes, the estimated p-value is:

And then obtain the BH point estimate for the k-th gene:

N

j

I

ik

ji

estk tt

INP

1 1

*#11

i

mPP

esti

ik

BHk

)()( min

Page 37: False Discovery Rate Guy Yehuda

37

FDR controlling procedures• Use the p-values from t-tests in the BH procedure• Use the (marginal) p-values from resampling in

the BH procedure• The FDR resampling “point estimate”-procedure• The FDR resampling “upper-bound”- procedure

both estimate by resampling the dist’n of: FDRest

(p) = E { V*(p) / (V*(p) + S(p))}

Now that we know the procedures let’s test them.

Page 38: False Discovery Rate Guy Yehuda

38

The data Dudoit, Yang, Callow, Speed (2000):

Statistical analysis of a lipid metabolism study in mice.

• Treatment: 8 low HDL level knockout mice• Control: 8 “normal” C57B1/6 mice• Purpose: Identification of single differentially

expressed genes in the livers of mice with very low HDL cholesterol levels (treatment groups) compared to inbred control mice

Page 39: False Discovery Rate Guy Yehuda

39

• The micro-array data consists of 6359 genes.

• Data was suitably standardized using log transformation and lowess smoother.

• Compare gene expression means between the control and treatment groups.

Page 40: False Discovery Rate Guy Yehuda

40

Results of multiple testing on the original data

• The p-values obtained directly from the “raw” t-statistics with 14 degrees of freedom. Ignoring multiplicity, the actual number of raw p-values larger than 0.05 is 568 (out of 6,359).

Let’s examine the FWE control in this case

Page 41: False Discovery Rate Guy Yehuda

41

• Using Bonferroni procedure always controls the FWE

• To address multiplicity with Bonferroni at 0.05, conduct each test at level .05/6359.

• 8 genes identified

Bonferroni adjusted p-values:

pBON(i) =p(i)*m are compared

to 0.05

*The enlarged table is in the appendix

Table 1: First 12 Largest T-Statistics 1,2

T-Statistic P-Value

(df=14)

-20.6 7.0*10-12

-12.5 5.6*10-9

-11.9 1.1*10-8

-11.7 1.3*10-8

-11.4 1.8*10-8

-11.3 1.9*10-8

-7.8 1.8*10-6

-7.4 3.6*10-6

5.0 1.8*10-4

-4.5 4.6*10-4

-4.5 4.9*10-4

-4.4 6.5*10-4

1. The t-statistics were ranked according to

their absolute values.

2. Bonferroni adjusted p-value is 1.6*10-4.

Page 42: False Discovery Rate Guy Yehuda

42

• Applying the FDR controlling BH on the raw p-values, we came up with the same 8 rejected null hypotheses obtained in the original analysis.

Surprising ? Not really.

Take a look at the table and observe the large gap between the p-values of genes identified by the FWE controlling procedure and the other p-values.

Also don’t forget the actual distribution of the test statistics is not quite the same t-distribution underlying the derivation of the p-values.

So what’s the next improvement? You know it.

Page 43: False Discovery Rate Guy Yehuda

43

• Let’s estimate the distribution of the t-statistics over 1000 resampling iterations.

• Adjusted p-values were calculated using the WY alg. (FWE & resampling) and the 3 resampling-based procedures we learned.

• The ten highest absolute t-values (except the largest one, 20.6, which is too far outside of the plotted x-axis) are marked on the added plot.

• At the 0.05 significance level, we may still reject the same 8 hypotheses by all procedures. What about 0.1?

*The enlarged graph is in the appendix

That’s right: Resampling

Page 44: False Discovery Rate Guy Yehuda

44

FDR and FWE adjusted p-valuesoriginal data

Page 45: False Discovery Rate Guy Yehuda

45

Summary of the Results

• The results of FDR control are very close to the results of the FWE control, and both are very different from the unadjusted results.

• The reduced conservativeness of FDR controlling procedures does not trigger discovery of artifacts.

But what if there is no clear distinction between differentially expressed genes and similarly expressed ones ?

Page 46: False Discovery Rate Guy Yehuda

46

Simulated Data Configuration

• We set the number of differentially expressed genes to be fixed at 70 (~1%)

• Differential expression was modeled using the weak lp-ball model described in Abramovich et al. (2000), by which a sparse signal pattern was generated:

r * n 1/p * i -1/p, i=1,…,n where p is a decay rate parameter, r is the maximum of the

decay function and n is the number of values. We used p = {0.5,0.75,1,1.25,1.5,1.75,2}.

Page 47: False Discovery Rate Guy Yehuda

47

1. Remove from the original data the differences in expression levels between the experiment and control groups that were not attributed to noise.

2. On each simulation repetition, shuffle the experiment and control groups. However, don’t shuffle the genes (why?).

3. Add the sparse signal of differential expression values to 70 randomly selected genes.

4. Do 100 resampling iterations. 5. Apply procedures described earlier.6. Do 400 simulation repetitions and calculate the

average FDR and power over the simulations, for.

Page 48: False Discovery Rate Guy Yehuda

48

Simulation Study Results• The next figure presents the mean curves of the

adjusted p-values vs. the test statistics, for the decay rate p=2 and adjusted FDR level less than 0.25.

• The maximal standard error of the estimated FDR was less than 0.003.

• True FDR = the real FDR (we expect 70 genes so if we identify 80 then we know 10 are by chance).

What can we learn from the “true FDR” ?

*The enlarged figure is in the appendix

Page 49: False Discovery Rate Guy Yehuda

49

FDR and FWE adjusted p-values simulated data (p = 2)

Page 50: False Discovery Rate Guy Yehuda

50

What can we learn from the “true FDR? ”

• For all FDR controlling procedures, the adjusted p-values are larger then the corresponding true FDR, indicating that using them guarantees the control of the FDR.

• The FDR procedures produce FDR adjusted p-values much closer to the true FDR than the FWE adjusted p-values obtained by the WY alg.

Page 51: False Discovery Rate Guy Yehuda

51

What about the power?• Here we also include Holm’s non-resampling

multiple testing procedure, which controls the FWE.

• All FDR controlling procedures obtain substantially more power than the FWE controlling procedures.

• Pay attention for the differences of resampling and non-resampling proc.

*The enlarged figure is in the appendix

Page 52: False Discovery Rate Guy Yehuda

52

Power by Configuration

P

Po

we

r

0.5 1.0 1.5 2.0

0.0

0.2

0.4

0.6

0.8

1.0

Westfall-Young (FWE)Res. upper limit (FDR)BH point-estimate (FDR)Res. point-estimate (FDR)BH LSU - No Resampling (FDR)Holm - No Resampling (FWE)

Power by configuration

Page 53: False Discovery Rate Guy Yehuda

53

Power

• The resampling point-estimator is the most powerful procedure

• The other two resampling estimators following very closely behind. Although the upper-limit res. est. estimates the joint distribution of the test statistics, its relative conservativeness does not allow increase of power over the BH resampling est., which estimates only the marginal distribution.

Page 54: False Discovery Rate Guy Yehuda

54

Power

• The BH linear step-up procedure performs relatively well, in spite of it assuming t-distributed test statistics that are either independent or positively dependent, and not using resampling.

• Holm’s procedure, performing the most poorly, reconfirms the advantage of resampling for cases in which the test statistics are not independent.

Page 55: False Discovery Rate Guy Yehuda

55

Discussion

• Why limiting the power to generate candidate hypotheses at this stage?

• Controlling the FDR at the screening stage of the research carries a benefit for the next stages of the analysis (and then we can use FWE or FDR for the follow-up studies)

Page 56: False Discovery Rate Guy Yehuda

56

• It is not quite clear how the dependency structure affects the performance of multiple testing procedures. Measurement error of microarray data tends to be positively dependent.

• A co-regulation dependency may be a result of biological variability of the co-regulation and need not be positive.

• Characterization of the dependency structure attributed to the above common co-regulation is one of the main research targets of micro-array experiments.

• Pre-selections of genes that pass an initial FDR testing at not too low level can suppress noise?

Page 57: False Discovery Rate Guy Yehuda

57

Summary

• We learned four procedures that were shown to control the FDR at the desired level

• Clearly, all FDR controlling procedures retain higher power than FWE controlling procedures, and are therefore highly useful for the discovery of differential genetic expression.

• The choice among the four is a matter of buying more power and better properties at the expense of more complicated computations.

• Substantial increase in power is already gained when the p-values are estimated by resampling

Page 58: False Discovery Rate Guy Yehuda

58

Thanks

Yoav Benjamini

Anat Reiner

Note: On the course site you can watch this presentation along with the added notes to the slides which will help you to understand it even better.

Page 59: False Discovery Rate Guy Yehuda

59

Appendix

Page 60: False Discovery Rate Guy Yehuda

60

FDR and FWE adjusted p-values original data

Page 61: False Discovery Rate Guy Yehuda

61

FDR and FWE adjusted p-values simulated data

Page 62: False Discovery Rate Guy Yehuda

62

Power by configuration

Page 63: False Discovery Rate Guy Yehuda

63