Statistical Methods for Rare Variant Association Test Using Summarized Data
Qunyuan Zhang, Ingrid Borecki, Michael A. Province
Division of Statistical Genomics
Motivation
Next-generation sequencing => rare variants
Two types of data: individual level and summarized level
Individual level
Subject  V1  V2  V3  Trait
1        0   0   0   case
2        1   0   0   case
3        0   0   0   case
4        0   0   0   control
5        0   0   0   control
6        0   0   1   control
…        …   …   …   …
Summarized level
                          V1   V2   V3
Variant No. in cases      10    8    3
Variant No. in controls    2    0    1
No. of cases             300  300  300
No. of controls          500  500  500
• Pooled DNA sequencing
• Public data (as control)
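The relation between the two data types can be sketched in a few lines: summarized data are per-variant column sums of the individual-level table, split by trait. The function name `summarize` and the dictionary keys are illustrative choices, not from the slides, and only the six subjects shown in the individual-level table are used.

```python
from collections import Counter

def summarize(genotypes, traits):
    """Collapse individual-level genotypes into per-variant summary counts.

    genotypes: one row per subject, one 0/1 entry per variant.
    traits: "case" or "control" for each subject.
    """
    n_variants = len(genotypes[0])
    group_sizes = Counter(traits)
    counts = {"case": [0] * n_variants, "control": [0] * n_variants}
    for row, trait in zip(genotypes, traits):
        for j, g in enumerate(row):
            counts[trait][j] += g
    return {
        "variant_no_in_cases": counts["case"],
        "variant_no_in_controls": counts["control"],
        "no_of_cases": group_sizes["case"],
        "no_of_controls": group_sizes["control"],
    }

# The six subjects listed in the individual-level table (V1, V2, V3)
genotypes = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
traits = ["case", "case", "case", "control", "control", "control"]
summary = summarize(genotypes, traits)
```

The summary no longer records which subject carries which variant, which is why methods that need per-subject genotypes cannot be applied to it directly.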
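Models for Individual-level Data
Burden/collapsing test
$Y = \beta_0 + \beta \sum_{i=1}^{k} (w_i x_i) + \varepsilon$, $H_0: \beta = 0$
Collective/group test
$Y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$, $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
Single-variant test (Regular GWAS)
$Y = \beta_{10} + \beta_1 x_1 + \varepsilon$, $H_0: \beta_1 = 0$
$Y = \beta_{20} + \beta_2 x_2 + \varepsilon$, $H_0: \beta_2 = 0$
…
$Y = \beta_{k0} + \beta_k x_k + \varepsilon$, $H_0: \beta_k = 0$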
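As a small illustration of the burden/collapsing idea, the weighted sum over a subject's k variants can be computed as below. This is only a sketch: the function name `burden_scores`, the default of equal weights, and the example weights are assumptions, and the regression of the trait on this score is not shown.

```python
def burden_scores(genotypes, weights=None):
    """Collapse each subject's k variant genotypes into one weighted sum."""
    k = len(genotypes[0])
    if weights is None:
        weights = [1.0] * k  # assumption: unweighted collapsing by default
    return [sum(w * x for w, x in zip(weights, row)) for row in genotypes]

# Six subjects, three variants (same layout as an individual-level table)
genotypes = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
scores = burden_scores(genotypes)                              # equal weights
weighted = burden_scores(genotypes, weights=[2.0, 1.0, 0.5])   # hypothetical weights
```

A single coefficient is then tested for the score, instead of one coefficient per variant as in the collective or single-variant models.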
Methods for Individual-level Data
CMC (Li and Leal, 2008)
WSS (Madsen and Browning, 2009)
VT (Price et al., 2010)
aSum (Han and Pan, 2010)
KBAC (Liu and Leal, 2010)
RBT (Ionita-Laza et al., 2011)
PWST (Zhang et al., 2011)
SKAT (Wu et al., 2011)
EREC (Lin et al., 2011) …
Methods for Summarized Data
Method | Description | Bi-directional effects | Ref.
EFT | Exclusive Frequency Test: testing mutually exclusive allele/carrier freq. | × | Commonly used in publications, such as Cohen et al., 2004
TFT | Total Frequency Test: testing total allele/carrier freq. | × | Commonly used in publications, such as Cohen et al., 2004
CAST | Cohort Allele Sum Test: testing total allele/carrier number | × | Morgenthaler & Thilly, 2006
C-alpha | Testing variance | √ | Neale et al., 2011
An Example of Summarized Data
Jonathan C. Cohen, et al. Science 305, 869 (2004)
An Example of Summarized Data (cont.)
                 Variant allele number   Reference allele number   Total
Low-HDL group    20                      236                       256
High-HDL group   2                       254                       256
Total            22                      490                       512
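One standard way to test a single 2×2 table like this is Fisher's exact test, which is built on the hypergeometric distribution. The sketch below uses only the counts from the table; the function names are illustrative, and this is not necessarily the test applied in the cited paper or the one labelled TFT in these slides.

```python
from math import comb

def hypergeom_prob(a1, n1, a2, n2):
    """Probability of a 2x2 table with fixed margins:
    a1 of n1 alleles are variant in group 1, a2 of n2 in group 2."""
    return comb(n1, a1) * comb(n2, a2) / comb(n1 + n2, a1 + a2)

def fisher_two_sided(a1, n1, a2, n2):
    """Sum the probabilities of all tables no more likely than the observed one."""
    total = a1 + a2
    p_obs = hypergeom_prob(a1, n1, a2, n2)
    lo = max(0, total - n2)
    hi = min(n1, total)
    p = 0.0
    for x in range(lo, hi + 1):
        p_x = hypergeom_prob(x, n1, total - x, n2)
        if p_x <= p_obs * (1 + 1e-9):  # small tolerance for float ties
            p += p_x
    return p

# Low-HDL group: 20 variant alleles of 256; High-HDL group: 2 of 256
p_value = fisher_two_sided(20, 256, 2, 256)
```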
QQ Plots of Existing Methods (under the null)
• EFT and C-alpha: inflated with false positives
• TFT and CAST: no inflation, but need to assume a single direction of effects
• Objective: more general, non-inflated, powerful methods …
[Figure: QQ plot panels for CAST, C-alpha, EFT and TFT]
Structure of Summarized Data
[Figure: one 2×2 table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]
Strategy
Instead of testing total freq./number, we test the randomness of all tables.
Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution:
$P_i = \dfrac{C_{n_{1i},\,a_{1i}} \; C_{n_{2i},\,a_{2i}}}{C_{N_i,\,n_{Ai}}}$
2. Calculating the log-transformed joint probability (L) for multiple tables:
$L = \sum_{i=1}^{k} \log(P_i)$
3. Enumerating all possible table combinations and their L scores
4. Calculating the p-value: P = Prob.(L ≤ observed L)
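The EPT steps can be sketched with brute-force enumeration for a handful of variants. Several things here are assumptions rather than the authors' implementation: the function names, treating the two group sizes as the same for every variant, the reading of the p-value as the total probability of table combinations whose L does not exceed the observed L, and the use of the counts from the Motivation slide as input.

```python
from itertools import product
from math import comb, exp, log

def table_log_probs(total_variants, n1, n2):
    """Step 1: log hypergeometric probability of every possible split of one
    variant's total count between group 1 (size n1) and group 2 (size n2)."""
    lo = max(0, total_variants - n2)
    hi = min(n1, total_variants)
    log_denom = log(comb(n1 + n2, total_variants))
    return {a1: log(comb(n1, a1)) + log(comb(n2, total_variants - a1)) - log_denom
            for a1 in range(lo, hi + 1)}

def ept_p_value(a1, a2, n1, n2, tol=1e-9):
    """Steps 2-4: joint log probability L, full enumeration, exact p-value."""
    per_table = [table_log_probs(x + y, n1, n2) for x, y in zip(a1, a2)]
    l_obs = sum(t[x] for t, x in zip(per_table, a1))      # observed L
    p = 0.0
    for combo in product(*(t.items() for t in per_table)):
        l = sum(lp for _, lp in combo)                    # L of this combination
        if l <= l_obs + tol:                              # as or less probable
            p += exp(l)
    return p

# Counts from the summarized-level table: 3 variants, 300 cases, 500 controls
p_example = ept_p_value([10, 8, 3], [2, 0, 1], 300, 500)
```

The number of table combinations grows multiplicatively with the number of variants, which is consistent with the slides' remark that EPT needs more computer time.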
Likelihood Ratio Test (LRT)
$LR = -2 \sum_{i=1}^{k} \log \dfrac{\Pr(a_{1i}, b_{1i}, a_{2i}, b_{2i} : H_0)}{\Pr(a_{1i}, b_{1i}, a_{2i}, b_{2i} : H_A)} \sim \chi^2_{df=k}$
Probabilities: binomial distribution
Parameters: maximum likelihood estimation
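A minimal sketch of a binomial likelihood ratio test on summarized counts follows. It assumes that, for each variant, the variant count in each group is binomial; that the null hypothesis is one shared frequency per variant and the alternative is a separate frequency per group; and that the maximum likelihood estimates are the observed proportions. The function names, the series-based chi-square tail, and the example counts (taken from the Motivation slide) are illustrative choices, not the authors' code.

```python
from math import exp, lgamma, log

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)

def binom_loglik(a, n, p):
    """Binomial log-likelihood without the constant binomial coefficient."""
    return _xlogy(a, p) + _xlogy(n - a, 1 - p)

def lrt_statistic(a1, a2, n1, n2):
    """LR = 2 * sum_i [loglik under H_A - loglik under H_0], MLEs plugged in."""
    lr = 0.0
    for x, y in zip(a1, a2):
        p0 = (x + y) / (n1 + n2)            # shared frequency under H_0
        ll0 = binom_loglik(x, n1, p0) + binom_loglik(y, n2, p0)
        lla = binom_loglik(x, n1, x / n1) + binom_loglik(y, n2, y / n2)
        lr += 2 * (lla - ll0)
    return lr

def chi2_sf(x, df):
    """Chi-square upper tail from the series for the regularized lower
    incomplete gamma function; loses precision for p-values near 1e-15."""
    if x <= 0:
        return 1.0
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * exp(-z + a * log(z) - lgamma(a))
    return max(0.0, 1.0 - lower)

a_cases, a_controls = [10, 8, 3], [2, 0, 1]
lr = lrt_statistic(a_cases, a_controls, 300, 500)
p_value = chi2_sf(lr, df=len(a_cases))      # one degree of freedom per variant
```

Because the chi-square reference distribution is an asymptotic approximation, this fits the slides' note that the LRT p-value is slightly biased for small N.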
Q-Q Plots of EPT and LRT (under the null)
[Figure: Q-Q plot panels: EPT, N=500; EPT, N=3000; LRT, N=500; LRT, N=3000]
Power Comparison (significance level = 0.00001)
Simulation
Logistic model; N = 500, 1000, 3000
50% cases, 50% controls
5-15 variants, MAF < 1%
Positive causal: 80% (OR = 2 to 6)
Neutral: 20%
Negative causal: 0%
[Figure: three panels, power vs. sample size]
Power Comparison (significance level = 0.00001)
Simulation
Logistic model; N = 500, 1000, 3000
50% cases, 50% controls
5-15 variants, MAF < 1%
Positive causal: 60% (OR = 2 to 6)
Neutral: 20%
Negative causal: 20% (OR = 1/6 to 1/2)
[Figure: power vs. sample size]
Power Comparison (significance level = 0.00001)
Simulation
Logistic model; N = 500, 1000, 3000
50% cases, 50% controls
5-15 variants, MAF < 1%
Positive causal: 40% (OR = 2 to 6)
Neutral: 20%
Negative causal: 40% (OR = 1/6 to 1/2)
[Figure: power vs. sample size]
Application
-log10 p-values of 933 cancer-related genes
Cases: 460 ovarian cancer cases, germline exome data, from TCGA
Controls: ~3500 individuals from the NHLBI exome project
Individual-level Data Based Methods vs.
Summarized Data Based Methods
An interesting question:
If we have individual-level data, but we choose to perform summarized data based analysis, will there be any power gain or loss?
Power Comparison: individual-level data vs. summarized data
N = 1000, significance level = 0.00001
[Figure: power vs. variant proportion, positive : neutral : negative (%)]
Individual-level data based methods: CMC (Li & Leal, 2008), SKAT (Wu et al., 2011)
Conclusions
• EFT and C-alpha produce inflated p-values.
• TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.
• EPT produces correct p-values and maintains power regardless of effect directions, but needs more computer time.
• LRT produces slightly biased p-values for small N, which can be improved by larger N; it has power similar to EPT and needs less computer time, making it a good alternative for large datasets.
• If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.
(This study has not been published)
Acknowledgements
Dr. Li Ding
Charles Lu
Krishna-Latha Kanchi
(for providing the TCGA and NHLBI exome data)