
Statistical Methods for Rare Variant Association Test Using Summarized Data

Qunyuan Zhang Ingrid Borecki, Michael A. Province

Division of Statistical Genomics


Motivation

Next-generation sequencing => rare variants

Two types of data:

Individual level

Subject   V1   V2   V3   Trait
1         0    0    0    case
2         1    0    0    case
3         0    0    0    case
4         0    0    0    control
5         0    0    0    control
6         0    0    1    control
…         …    …    …    …

Summarized level

                          V1    V2    V3
Variant No. in cases      10    8     3
Variant No. in controls   2     0     1
No. of cases              300   300   300
No. of controls           500   500   500

• Pooled DNA sequencing
• Public data (as control)
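To make the relationship between the two data types concrete, here is a minimal sketch of how individual-level data could be collapsed into summarized counts. The function name `summarize`, the dictionary keys, and the 0/1 genotype coding are illustrative assumptions, not something specified on the slide.

```python
def summarize(genotypes, traits):
    """Collapse individual-level data into per-variant summary counts.

    genotypes: one row per subject, each a list of variant counts (assumed 0/1)
    traits:    one "case"/"control" label per subject
    """
    n_variants = len(genotypes[0])
    summary = {
        "variant_in_cases": [0] * n_variants,
        "variant_in_controls": [0] * n_variants,
        "n_cases": traits.count("case"),
        "n_controls": traits.count("control"),
    }
    for row, trait in zip(genotypes, traits):
        key = "variant_in_cases" if trait == "case" else "variant_in_controls"
        for j, g in enumerate(row):
            summary[key][j] += g
    return summary


# The six subjects listed in the individual-level table of this slide
genotypes = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
traits = ["case", "case", "case", "control", "control", "control"]
result = summarize(genotypes, traits)
```

Once the data are in this form, the subject-level link between variants is gone, which is why methods for summarized data work only from per-variant counts.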


Models for Individual-level Data

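Burden/collapsing test

$$Y = \beta \left( \sum_{i=1}^{k} w_i x_i \right) + \varepsilon, \qquad H_0: \beta = 0$$

Collective/group test

$$Y = \sum_{i=1}^{k} \beta_i x_i + \varepsilon, \qquad H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$

Single-variant test (Regular GWAS)

$$
\begin{aligned}
Y &= \beta_1 x_1 + \varepsilon, & H_0&: \beta_1 = 0 \\
Y &= \beta_2 x_2 + \varepsilon, & H_0&: \beta_2 = 0 \\
&\ \vdots \\
Y &= \beta_k x_k + \varepsilon, & H_0&: \beta_k = 0
\end{aligned}
$$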
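As a small illustration of the burden/collapsing idea, the sketch below computes a per-subject weighted sum of variant counts across the k variants. The function name `burden_scores`, the default of equal weights, and the toy genotypes are assumptions made for this example; the slides do not specify a weighting scheme.

```python
def burden_scores(genotypes, weights=None):
    """Return one burden score per subject: the weighted sum of variant counts.

    genotypes: one row per subject, each a list of k variant counts
    weights:   k weights (assumed equal to 1.0 when not given)
    """
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants
    return [sum(w * x for w, x in zip(weights, row)) for row in genotypes]


# Toy data: three subjects, three variants
genotypes = [[0, 0, 0], [1, 0, 0], [0, 1, 1]]
equal = burden_scores(genotypes)
weighted = burden_scores(genotypes, weights=[2.0, 1.0, 0.5])
```

Because a single coefficient is then fitted to this score, risk and protective variants contribute to the same sum, which is one way to see why a burden-type test may lose power when effects point in both directions.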


Methods for Individual-level Data


CMC (Li and Leal, 2008)

WSS (Madsen and Browning, 2009)

VT (Price et al, 2010)

aSum (Han and Pan, 2010)

KBAC (Liu and Leal, 2010)

RBT (Ionita-Laza et al, 2011)

PWST (Zhang et al, 2011)

SKAT (Wu et al, 2011)

EREC (Lin et al, 2011) …


Methods for Summarized Data

Method    Description                                                                 Bi-directional effects   Ref.
EFT       Exclusive Frequency Test: testing mutually exclusive allele/carrier freq.   ×                        Commonly used in publications, such as Cohen et al., 2004
TFT       Total Frequency Test: testing total allele/carrier freq.                    ×                        Commonly used in publications, such as Cohen et al., 2004
CAST      Cohort Allelic Sums Test: testing total allele/carrier number               ×                        Morgenthaler & Thilly, 2006
C-alpha   Testing variance                                                            √                        Neale et al., 2011


An Example of Summarized Data

Jonathan C. Cohen, et al. Science 305, 869 (2004)


An Example of Summarized Data (cont.)

                 Variant allele number   Reference allele number   Total
Low-HDL group    20                      236                       256
High-HDL group   2                       254                       256
Total            22                      490                       512
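As a quick arithmetic reading of this table, the variant allele frequencies in the two groups and the corresponding allelic odds ratio work out as follows. The odds ratio is not reported on the slide; it is computed here only as an illustration from the four cell counts.

```latex
\begin{align*}
\hat{f}_{\text{low}} &= \frac{20}{256} \approx 0.078, &
\hat{f}_{\text{high}} &= \frac{2}{256} \approx 0.0078, \\
\widehat{OR} &= \frac{20 \times 254}{236 \times 2} = \frac{5080}{472} \approx 10.8
\end{align*}
```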


QQ Plots of Existing Methods (under the null)

• EFT and C-alpha: inflated with false positives

• TFT and CAST: no inflation, but need to assume a single direction of effects

• Objective: more general, non-inflated, powerful methods …

[Figure: Q-Q plots under the null; panels: EFT, TFT, CAST, C-alpha]
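For readers who want to reproduce this kind of check, a Q-Q plot under the null compares sorted observed -log10 p-values with the values expected from a uniform distribution; points rising above the diagonal suggest inflation. The sketch below only computes the plotting coordinates. The function name `qq_points` and the (i - 0.5)/n plotting positions are assumptions, since the slides do not say how their plots were drawn.

```python
from math import log10


def qq_points(pvalues):
    """Return (expected, observed) -log10 p-value pairs for a Q-Q plot.

    Assumes every p-value is strictly greater than 0.
    """
    n = len(pvalues)
    observed = sorted(-log10(p) for p in pvalues)
    # Expected quantiles of n Uniform(0, 1) draws at plotting positions (i - 0.5) / n
    expected = sorted(-log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# p-values that sit exactly on the uniform plotting positions fall on the diagonal
points = qq_points([0.875, 0.125, 0.625, 0.375])
```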


Structure of Summarized Data

[Figure: one table per variant (variant 1, variant 2, variant 3, …, variant i, …, variant k)]

Strategy

Instead of testing total freq./number, we test the randomness of all tables.


Exact Probability Test (EPT)

1. Calculating the probability of each table based on the hypergeometric distribution

$$P_i = \frac{C(n_{i1}, a_{i1}) \, C(n_{i2}, a_{i2})}{C(N_i, n_{iA})}$$

2. Calculating the log-transformed joint probability (L) for multiple tables

$$L = \sum_{i=1}^{k} \log(P_i)$$

3. Enumerating all possible table combinations and L scores

4. Calculating the p-value: P = Prob.( )
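The four EPT steps can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the authors' implementation: each table is given as (a1, n1, a2, n2), read here as the variant count and total count in group 1 and group 2; the margins of every table are held fixed; and the p-value is taken to be the total probability of all table combinations whose L score is no larger than the observed one, since the expression inside Prob.( ) did not survive on the slide. The names `table_prob` and `ept_pvalue` are hypothetical.

```python
from itertools import product
from math import comb, exp, log


def table_prob(a1, n1, n2, n_a):
    """Step 1: hypergeometric probability of a1 variants in group 1,
    given n_a variants in total among n1 + n2 observations."""
    return comb(n1, a1) * comb(n2, n_a - a1) / comb(n1 + n2, n_a)


def ept_pvalue(tables, tol=1e-9):
    """tables: list of (a1, n1, a2, n2), one tuple per variant."""
    log_supports = []
    l_obs = 0.0
    for a1, n1, a2, n2 in tables:
        n_a = a1 + a2
        lo, hi = max(0, n_a - n2), min(n1, n_a)
        # Log-probabilities of every table compatible with the fixed margins
        log_supports.append(
            [log(table_prob(x, n1, n2, n_a)) for x in range(lo, hi + 1)]
        )
        # Step 2: observed L is the sum of log-probabilities over tables
        l_obs += log(table_prob(a1, n1, n2, n_a))
    p_value = 0.0
    # Step 3: enumerate all table combinations and their L scores
    for combo in product(*log_supports):
        l_score = sum(combo)
        # Step 4 (assumed): accumulate combinations at least as unlikely as observed
        if l_score <= l_obs + tol:
            p_value += exp(l_score)
    return p_value


# Single table from the Cohen et al. (2004) example: 20/256 vs. 2/256
cohen = ept_pvalue([(20, 256, 2, 256)])
# Two toy variants with effects in opposite directions
opposite = ept_pvalue([(2, 10, 0, 10), (0, 10, 2, 10)])
```

The number of combinations grows as the product of the per-variant supports, so this direct enumeration is practical only for a handful of rare variants, which is consistent with the slides' remark that EPT needs more computer time.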


Likelihood Ratio Test (LRT)

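$$
LR = -2 \log \frac{\prod_{i=1}^{k} \Pr(a_{i1}, b_{i1}, a_{i2}, b_{i2} : H_0,\ p_{i1} = p_{i2} = p_{iA})}{\prod_{i=1}^{k} \Pr(a_{i1}, b_{i1}, a_{i2}, b_{i2} : H_1,\ p_{i1} \neq p_{i2})} \sim \chi^2_{df = k}
$$

Binomial distribution

Maximum likelihood estimation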
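A minimal sketch of such a likelihood ratio test is given below, assuming that each variant contributes counts (a1, b1, a2, b2), read here as variant and reference allele counts in group 1 and group 2, that each group's count is binomial, and that frequencies are replaced by their maximum likelihood estimates (separate frequencies under the alternative, one pooled frequency under the null). The helper names and the closed-form chi-square tail function are choices made for this example so that it runs with the standard library only.

```python
from math import erfc, exp, gamma, log, sqrt


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df, built from
    the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)."""
    z = x / 2.0
    if df % 2:
        a, q = 0.5, erfc(sqrt(z))
    else:
        a, q = 1.0, exp(-z)
    while a < df / 2.0:
        q += z ** a * exp(-z) / gamma(a + 1.0)
        a += 1.0
    return q


def binom_loglik(a, b, p):
    """Binomial log-likelihood up to a constant, with 0 * log(0) taken as 0."""
    first = 0.0 if a == 0 else a * log(p)
    second = 0.0 if b == 0 else b * log(1.0 - p)
    return first + second


def lrt(tables):
    """tables: list of (a1, b1, a2, b2). Returns (LR statistic, p-value)."""
    stat = 0.0
    for a1, b1, a2, b2 in tables:
        p1 = a1 / (a1 + b1)                    # MLE in group 1 under H1
        p2 = a2 / (a2 + b2)                    # MLE in group 2 under H1
        p0 = (a1 + a2) / (a1 + b1 + a2 + b2)   # pooled MLE under H0
        ll1 = binom_loglik(a1, b1, p1) + binom_loglik(a2, b2, p2)
        ll0 = binom_loglik(a1 + a2, b1 + b2, p0)
        stat += 2.0 * (ll1 - ll0)
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    return stat, chi2_sf(stat, len(tables))


# Single table from the Cohen et al. (2004) example
stat, p_value = lrt([(20, 236, 2, 254)])
```

The binomial coefficients cancel between numerator and denominator, so only the frequency terms are needed. Because the chi-square reference distribution is asymptotic, some bias with small N and sparse counts is to be expected.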


Q-Q Plots of EPT and LRT (under the null)

[Figure: Q-Q plots; panels: EPT (N=500), EPT (N=3000), LRT (N=500), LRT (N=3000)]


Power Comparison (significance level = 0.00001)

Simulation

Logistic model; N = 500, 1000, 3000
50% cases, 50% controls
5-15 variants, MAF < 1%
Positive causal: 80% (OR = 2 to 6)
Neutral: 20%
Negative causal: 0%

[Figure: power vs. sample size]


Power Comparison (significance level = 0.00001)

Simulation

Logistic model; N = 500, 1000, 3000
50% cases, 50% controls
5-15 variants, MAF < 1%
Positive causal: 60% (OR = 2 to 6)
Neutral: 20%
Negative causal: 20% (OR = 1/6 to 1/2)

[Figure: power vs. sample size]


Power Comparison (significance level = 0.00001)

Simulation

Logistic model; N = 500, 1000, 3000
50% cases, 50% controls
5-15 variants, MAF < 1%
Positive causal: 40% (OR = 2 to 6)
Neutral: 20%
Negative causal: 40% (OR = 1/6 to 1/2)

[Figure: power vs. sample size]


Application

-LOG10 p-values of 933 cancer-related genes

Cases: 460 ovarian cancer cases, germline exome data, from TCGA

Controls: ~3500 individuals from the NHLBI exome project


Individual-level Data Based Methods vs. Summarized Data Based Methods

An interesting question:

If we have individual-level data, but we choose to perform summarized data based analysis, will there be any power gain or loss?


Power Comparison: individual-level data vs. summarized data

N = 1000, significance level = 0.00001

[Figure: power vs. variant proportion, positive : neutral : negative (%)]

Individual-level data based methods: CMC (Li & Leal, 2008), SKAT (Wu et al., 2011)


Conclusions

• EFT and C-alpha produce inflated p-values.

• TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.

• EPT produces correct p-values and maintains power regardless of effect directions, but requires more computer time.

• LRT produces slightly biased p-values for small N, which improves with larger N. It has power similar to EPT and needs less computer time, making it a good alternative for large datasets.

• If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.

(This study has not been published)


Acknowledgements

Dr. Li Ding

Charles Lu

Krishna-Latha Kanchi

(for providing the TCGA and NHLBI exome data)