
Statistical Methods for Rare Variant Association Test Using Summarized Data

Qunyuan Zhang, Ingrid Borecki, Michael A. Province

Division of Statistical Genomics

http://medschool.wustl.edu/

Motivation

Next-generation sequencing => rare variants

Two types of data:

Individual-level data

Subject   V1   V2   V3   Trait
1         0    0    0    case
2         1    0    0    case
3         0    0    0    case
4         0    0    0    control
5         0    0    0    control
6         0    0    1    control
…         …    …    …    …

Summarized-level data

                          V1    V2    V3
Variant No. in cases      10    8     3
Variant No. in controls   2     0     1
No. of cases              300   300   300
No. of controls           500   500   500

Typical sources of summarized data:

• Pooled DNA sequencing
• Public data (as control)
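The relationship between the two data types can be illustrated with a minimal sketch that collapses an individual-level genotype matrix into per-variant counts. The function name `summarize` and the dictionary keys are hypothetical, chosen only for this illustration; the six subjects are the ones listed in the individual-level table.

```python
def summarize(genotypes, traits):
    """Collapse a subject-by-variant 0/1 matrix into per-variant counts."""
    n_variants = len(genotypes[0])
    summary = {
        "variant_no_cases": [0] * n_variants,
        "variant_no_controls": [0] * n_variants,
        "n_cases": traits.count("case"),
        "n_controls": traits.count("control"),
    }
    for row, trait in zip(genotypes, traits):
        key = "variant_no_cases" if trait == "case" else "variant_no_controls"
        for j, g in enumerate(row):
            summary[key][j] += g  # add this subject's variant count for variant j
    return summary


# The six subjects shown in the individual-level table (columns V1, V2, V3).
genotypes = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
traits = ["case", "case", "case", "control", "control", "control"]
summary = summarize(genotypes, traits)
print(summary)
```

Once the data are collapsed this way, subject-level information (and with it any covariate adjustment) is no longer available, which is why the methods for summarized data work on counts alone.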

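Models for Individual-level Data

In the models below, $x_i$ is the genotype at variant $i$, $w_i$ is its weight, and $k$ is the number of variants.

Burden/collapsing test

$$Y = \beta_0 + \beta\Big(\sum_{i=1}^{k} w_i x_i\Big) + \varepsilon, \qquad H_0: \beta = 0$$

Collective/group test

$$Y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \varepsilon, \qquad H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$

Single-variant test (Regular GWAS)

$$Y = \beta_{01} + \beta_1 x_1 + \varepsilon, \qquad H_0: \beta_1 = 0$$

$$Y = \beta_{02} + \beta_2 x_2 + \varepsilon, \qquad H_0: \beta_2 = 0$$

$$\dots$$

$$Y = \beta_{0k} + \beta_k x_k + \varepsilon, \qquad H_0: \beta_k = 0$$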

Methods for Individual-level Data

CMC (Li and Leal, 2008)

WSS (Madsen and Browning, 2009)

VT (Price et al., 2010)

aSum (Han and Pan, 2010)

KBAC (Liu and Leal, 2010)

RBT (Ionita-Laza et al., 2011)

PWST (Zhang et al., 2011)

SKAT (Wu et al., 2011)

EREC (Lin et al., 2011) …

Methods for Summarized Data

Method    Description                                                                 Bi-directional effects   Ref.
EFT       Exclusive Frequency Test: testing mutually exclusive allele/carrier freq.   ×                        Commonly used in publications, such as Cohen et al., 2004
TFT       Total Frequency Test: testing total allele/carrier freq.                    ×                        Commonly used in publications, such as Cohen et al., 2004
CAST      Cohort Allele Sum Test: testing total allele/carrier number                 ×                        Morgenthaler & Thilly, 2006
C-alpha   Testing variance                                                            √                        Neale et al., 2011


An Example of Summarized Data

Jonathan C. Cohen, et al. Science 305, 869 (2004)

An Example of Summarized Data (cont.)

                 Variant allele number   Reference allele number   Total
Low-HDL group    20                      236                       256
High-HDL group   2                       254                       256
Total            22                      490                       512
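To make the table concrete, here is a minimal sketch of one common way to test a single 2x2 table of total allele counts: a two-sided Fisher exact test, written with the standard library only. The slides do not state which test was applied to this table, so the choice of Fisher's test and the function name `fisher_exact_two_sided` are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x variant alleles in the first row,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Low-HDL group: 20 variant / 236 reference; high-HDL group: 2 variant / 254 reference.
p_hdl = fisher_exact_two_sided(20, 236, 2, 254)
print(f"two-sided p = {p_hdl:.2e}")
```

A test on pooled totals like this one is only sensitive when the variants in a gene shift frequencies in the same direction.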

QQ Plots of Existing Methods (under the null)

[Figure: Q-Q plot panels for CAST, C-alpha, EFT and TFT]

• EFT and C-alpha: inflated with false positives

• TFT and CAST: no inflation, but need to assume a single direction of effects

• Objective: more general, non-inflated, powerful methods …

Structure of Summarized Data

[Figure: one table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]

Strategy

Instead of testing total freq./number, we test the randomness of all tables.

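Exact Probability Test (EPT)

1. Calculating the probability of each table based on the hypergeometric distribution

$$P_i = \frac{\binom{n_{1i}}{a_{1i}} \binom{n_{2i}}{a_{2i}}}{\binom{N_i}{n_{Ai}}}$$

where, for variant $i$, $a_{1i}$ and $a_{2i}$ are the variant counts in the two groups, $n_{1i}$ and $n_{2i}$ are the group totals, $N_i = n_{1i} + n_{2i}$, and $n_{Ai} = a_{1i} + a_{2i}$.

2. Calculating the log-transformed joint probability (L) for multiple tables

$$L = \sum_{i=1}^{k} \log(P_i)$$

3. Enumerating all possible table combinations and their L scores

4. Calculating the p-value

$$P = \mathrm{Prob.}(L \le L_{obs})$$

where $L_{obs}$ is the L score of the observed tables.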
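The four EPT steps can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each table is passed as a hypothetical tuple `(a1, a2, n1, n2)` (variant counts and totals in the two groups), the margins of every table are held fixed, and the p-value is taken as the total probability of all table combinations whose L score does not exceed the observed one.

```python
from itertools import product
from math import comb, exp, log


def table_probs(n1, n2, n_a):
    """Step 1: hypergeometric probability of every possible a1, margins fixed."""
    total = comb(n1 + n2, n_a)
    lo, hi = max(0, n_a - n2), min(n1, n_a)
    return {a1: comb(n1, a1) * comb(n2, n_a - a1) / total for a1 in range(lo, hi + 1)}


def ept_p_value(tables):
    """tables: list of (a1, a2, n1, n2), one tuple per variant."""
    dists = [table_probs(n1, n2, a1 + a2) for a1, a2, n1, n2 in tables]
    # Step 2: log-transformed joint probability of the observed tables.
    l_obs = sum(log(d[t[0]]) for d, t in zip(dists, tables))
    # Steps 3 and 4: enumerate every combination of tables and accumulate the
    # probability of those whose L score is no larger than the observed one.
    p_value = 0.0
    for combo in product(*(d.values() for d in dists)):
        l_score = sum(log(p) for p in combo)
        if l_score <= l_obs + 1e-9:
            p_value += exp(l_score)
    return p_value


# Counts for V1-V3 from the summarized-level table (300 cases, 500 controls).
p_example = ept_p_value([(10, 2, 300, 500), (8, 0, 300, 500), (3, 1, 300, 500)])
print(f"EPT p = {p_example:.3g}")
```

The number of combinations is the product of the number of possible tables per variant, so full enumeration grows quickly with the number of variants and their counts, which is consistent with the longer computing time noted for EPT.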

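Likelihood Ratio Test (LRT)

$$LR = -2 \log \frac{\prod_{i=1}^{k} \Pr(a_{1i}, b_{1i}, a_{2i}, b_{2i} : \theta_{Ai}, H_0)}{\prod_{i=1}^{k} \Pr(a_{1i}, b_{1i}, a_{2i}, b_{2i} : \theta_{1i}, \theta_{2i}, H_1)} \sim \chi^2_{df=k}$$

where $a_{1i}$, $a_{2i}$ are the variant counts and $b_{1i}$, $b_{2i}$ the reference counts in the two groups for variant $i$.

• Binomial distribution: each probability is a binomial likelihood, with a common variant frequency $\theta_{Ai}$ under $H_0$ and group-specific frequencies $\theta_{1i}$, $\theta_{2i}$ under $H_1$

• Maximum likelihood estimation: the frequencies are replaced by their maximum likelihood estimates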
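A minimal sketch of the LRT for summarized counts follows, assuming binomial likelihoods with maximum likelihood frequency estimates (a pooled frequency under the null, group-specific frequencies under the alternative) and a chi-square reference distribution with one degree of freedom per variant. The function names and the `(a1, a2, n1, n2)` tuple layout are hypothetical; `chi2_sf` is a small standard-library helper for integer degrees of freedom, included so the sketch needs no third-party package.

```python
from math import erfc, exp, gamma, log, sqrt


def binom_loglik(x, n, theta):
    """Binomial log-likelihood without the binomial coefficient (it cancels in the ratio)."""
    ll = 0.0
    if x > 0:
        ll += x * log(theta)
    if n - x > 0:
        ll += (n - x) * log(1 - theta)
    return ll


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    y = x / 2.0
    a, q = (1.0, exp(-y)) if df % 2 == 0 else (0.5, erfc(sqrt(y)))
    while a < df / 2.0:  # recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1)
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q


def lrt(tables):
    """tables: list of (a1, a2, n1, n2); returns (LR statistic, p-value)."""
    stat = 0.0
    for a1, a2, n1, n2 in tables:
        theta_a = (a1 + a2) / (n1 + n2)  # pooled MLE under H0
        ll0 = binom_loglik(a1, n1, theta_a) + binom_loglik(a2, n2, theta_a)
        ll1 = binom_loglik(a1, n1, a1 / n1) + binom_loglik(a2, n2, a2 / n2)
        stat += 2.0 * (ll1 - ll0)
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    return stat, chi2_sf(stat, len(tables))


# Counts for V1-V3 from the summarized-level table (300 cases, 500 controls).
lr_stat, lr_p = lrt([(10, 2, 300, 500), (8, 0, 300, 500), (3, 1, 300, 500)])
print(f"LR = {lr_stat:.2f}, p = {lr_p:.3g}")
```

Because the statistic is a sum over variants, its cost is linear in the number of variants. The chi-square approximation is asymptotic, so it can be inaccurate when counts are small.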

Q-Q Plots of EPT and LRT (under the null)

[Figure: four Q-Q plot panels: EPT, N=500; EPT, N=3000; LRT, N=500; LRT, N=3000]

Power Comparison (significance level = 0.00001)

Simulation

Logistic model; N = 500, 1000, 3000

50% cases, 50% controls

5-15 variants, MAF < 1%

Positive causal: 80% (OR = 2 to 6)

Neutral: 20%

Negative causal: 0%

[Figure: three panels plotting power against sample size]
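The simulation settings listed for the power comparison can be illustrated with a rough sketch of how one replicate of summarized data might be generated. This is not the authors' simulation code. The baseline prevalence, the single shared MAF, the carrier coding and the function name `simulate_summary` are all assumptions introduced here; only the logistic model, the 50/50 case-control split, MAF < 1% and the odds-ratio range come from the slide.

```python
import random
from math import exp, log


def simulate_summary(n, odds_ratios, maf=0.005, prevalence=0.05, seed=1):
    """Draw subjects until n/2 cases and n/2 controls are collected, then
    return per-variant carrier counts as (cases, controls)."""
    rng = random.Random(seed)
    betas = [log(o) for o in odds_ratios]       # log odds ratio per variant
    beta0 = log(prevalence / (1 - prevalence))  # baseline log odds (assumed)
    carrier_prob = 1 - (1 - maf) ** 2           # carries at least one rare allele
    half = n // 2
    counts = {True: [0] * len(betas), False: [0] * len(betas)}
    collected = {True: 0, False: 0}
    while collected[True] < half or collected[False] < half:
        carried = [rng.random() < carrier_prob for _ in betas]
        eta = beta0 + sum(b for b, c in zip(betas, carried) if c)
        is_case = rng.random() < 1 / (1 + exp(-eta))  # logistic disease model
        if collected[is_case] < half:           # keep the subject only if the group is not full
            collected[is_case] += 1
            for j, c in enumerate(carried):
                counts[is_case][j] += c
    return counts[True], counts[False]


# 10 variants: 80% positive causal (OR between 2 and 6), 20% neutral (OR = 1).
ors = [2, 3, 4, 5, 6, 2, 4, 6, 1, 1]
cases, controls = simulate_summary(1000, ors)
print(cases, controls)
```

Power at a given significance level would then be estimated as the fraction of replicates (different seeds) in which a test applied to these counts yields a p-value below that level.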

Simulation

5-15 variants, MAF < 1%

(OR = 1/6 to 1/2)

[Figure: power plotted against sample size]

Simulation

5-15 variants, MAF < 1%

(OR = 1/6 to 1/2)

[Figure: power plotted against sample size]

Application

Cases: 460 ovarian cancer cases, germline exome data, from TCGA

Controls: ~3500 individuals from the NHLBI exome project

[Figure: -log10 p-values of 933 cancer-related genes]

Individual-level Data Based Methods vs. Summarized Data Based Methods

An interesting question:

If we have individual-level data, but we choose to perform summarized data based analysis, will there be any power gain or loss?

Power Comparison: individual-level data vs. summarized data

N = 1000, significance level = 0.00001

[Figure: power plotted against variant proportion, positive : neutral : negative (%)]

Individual-level data based methods: CMC (Li & Leal, 2008), SKAT (Wu et al., 2011)

Conclusions

• EFT and C-alpha produce inflated p-values.

• TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.

• EPT produces correct p-values and maintains power regardless of effect directions, but needs more computer time.

• LRT produces slightly biased p-values for small N, which can be improved by larger N; it has similar power to EPT and needs less computer time, making it a good alternative for large datasets.

• If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.

(This study has not been published)


Acknowledgements

Dr. Li Ding

Charles Lu

Krishna-Latha Kanchi

(for providing the TCGA and NHLBI exome data)
