1 Association Analysis of Rare Genetic Variants Qunyuan Zhang Division of Statistical Genomics...

Association Analysis of Rare Genetic Variants

Qunyuan Zhang

Division of Statistical Genomics

Course M21-621 Computational Statistical Genetics

Rare Variants

Low allele frequency: usually less than 1%

Low power: for most analyses, because carriers are few and the genotype varies little across observations

High false positive rate: for some model-based analyses, due to sparse data, unstable or biased parameter estimates, and inflated significance (p-values that are too small).

An Example of Low Power

Jonathan C. Cohen, et al. Science 305, 869 (2004)

An Example of High False Positive Rate (Q-Q plots from GWAS data, unpublished)

[Figure: four Q-Q plot panels: (1) N≈2500, MAF>0.03; (2) N≈2500, MAF<0.03; (3) N≈2500, MAF<0.03, permuted; (4) N=50000, MAF<0.03, bootstrapped]

Three Levels of Rare Variant Data

Level 1: Individual-level

Level 2: Summarized over subjects

Level 3: Summarized over both subjects and variants

Level 1: Individual-level

Subject  V1  V2  V3  V4  Trait-1  Trait-2
1        1   0   0   0   90.1     1
2        0   1   0   .   99.2     1
3        0   0   0   0   105.9    0
4        0   0   0   0   89.5     0
5        0   .   0   0   97.6     0
6        0   0   0   0   110.5    0
7        0   0   1   0   88.8     0
8        0   0   0   1   95.4     1
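The relation between the three data levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Trait-2 is used as the grouping variable, a genotype value is read as the number of variant alleles, each genotyped subject contributes two alleles, and "." is treated as a missing genotype that is skipped. The function names `to_level2` and `to_level3` are hypothetical.

```python
# Level 1 rows from the table above: (subject, [V1..V4], Trait-1, Trait-2).
# None stands for a missing genotype (shown as "." in the table).
LEVEL1 = [
    (1, [1, 0, 0, 0], 90.1, 1),
    (2, [0, 1, 0, None], 99.2, 1),
    (3, [0, 0, 0, 0], 105.9, 0),
    (4, [0, 0, 0, 0], 89.5, 0),
    (5, [0, None, 0, 0], 97.6, 0),
    (6, [0, 0, 0, 0], 110.5, 0),
    (7, [0, 0, 1, 0], 88.8, 0),
    (8, [0, 0, 0, 1], 95.4, 1),
]


def to_level2(rows, n_variants=4):
    """Level 2: per variant and group, (variant allele count, alleles genotyped)."""
    level2 = []
    for v in range(n_variants):
        counts = {0: [0, 0], 1: [0, 0]}
        for _, genos, _, group in rows:
            g = genos[v]
            if g is None:          # ASSUMPTION: missing genotypes are skipped
                continue
            counts[group][0] += g  # variant alleles carried
            counts[group][1] += 2  # ASSUMPTION: two alleles per genotyped subject
        level2.append({grp: tuple(c) for grp, c in counts.items()})
    return level2


def to_level3(level2):
    """Level 3: Level 2 counts summed over all variants, per group."""
    totals = {0: [0, 0], 1: [0, 0]}
    for per_variant in level2:
        for grp, (variant_alleles, alleles) in per_variant.items():
            totals[grp][0] += variant_alleles
            totals[grp][1] += alleles
    return {grp: tuple(t) for grp, t in totals.items()}


level2 = to_level2(LEVEL1)
level3 = to_level3(level2)
print(level2)
print(level3)
```

Each step discards information: Level 2 no longer knows which subject carried which variant, and Level 3 no longer knows which variant an allele belongs to.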

Level 2: Summarized over subjects (by group)

Jonathan C. Cohen, et al. Science 305, 869 (2004)

Level 3: Summarized over subjects (by group) and variants (usually by gene)

                 Variant allele number   Reference allele number   Total
Low-HDL group    20                      236                       256
High-HDL group   2                       254                       256
Total            22                      490                       512
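A table like this can be tested directly. The sketch below applies a two-sided Fisher exact test, which is one common choice for a 2×2 table; the slides do not state which test was actually used, so treat the choice of test and the function name `fisher_exact_two_sided` as assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Counts from the Level 3 table: variant vs. reference alleles per group.
p_total_freq = fisher_exact_two_sided(20, 236, 2, 254)
print(p_total_freq)
```

Because all variants in the gene are pooled into one count per group, a single test uses the evidence that would otherwise be split across many very sparse single-variant tables.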

Methods For Level 3 Data

Single-variant Test vs Total Freq. Test (TFT)

Jonathan C. Cohen, et al. Science 305, 869 (2004)

What we have learned …

Single-variant test of rare variants has very low power for detecting association, due to extremely low frequency (usually < 0.01)

Testing the collective effect of a set of rare variants may increase power (such tests are variously called sum, collective, group, collapsing, or burden tests)

Methods For Level 2 Data

Allowing different sample sizes for different variants

Different variants can be weighted differently

CAST: A cohort allelic sums test

Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56

S: variant number; N: sample size

Under H0:

$$\frac{S_{cases}}{2N_{cases}} - \frac{S_{controls}}{2N_{controls}} = 0$$

$$T = S_{cases} - S_{controls}\,\frac{N_{cases}}{N_{controls}} = S_{cases} - S^{*}_{controls}$$

(S can be calculated variant by variant and can be weighted differently; the final T = sum(W_i S_i))

$$Z = \frac{T}{\sqrt{Var(T)}} \sim N(0,1)$$

$$Var(T) = Var(S_{cases} - S^{*}_{controls}) = Var(S_{cases}) + Var(S^{*}_{controls}) = Var(S_{cases}) + Var(S_{controls}) \times \left[\frac{N_{cases}}{N_{controls}}\right]^{2}$$
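The CAST formulas can be traced numerically with a short sketch. Two things here are assumptions rather than part of the slide: Var(S) is estimated by S itself (a Poisson approximation for rare allele counts), and the Level 3 HDL counts are reused with the low-HDL group playing the role of "cases" and 128 subjects per group (inferred from 256 alleles per row). The function name `cast_z` is hypothetical.

```python
from math import erfc, sqrt


def cast_z(s_cases, n_cases, s_controls, n_controls):
    """Return (T, Z, two-sided p) for the CAST statistic defined above."""
    ratio = n_cases / n_controls
    t = s_cases - s_controls * ratio            # T = S(cases) - S*(controls)
    # ASSUMPTION: Poisson counts, so Var(S) is estimated by S itself.
    var_t = s_cases + s_controls * ratio ** 2   # Var(S_cases) + Var(S_controls) * ratio^2
    z = t / sqrt(var_t)
    p_two_sided = erfc(abs(z) / sqrt(2))        # 2 * P(N(0,1) > |z|)
    return t, z, p_two_sided


# Variant allele counts 20 vs. 2, with 128 subjects assumed in each group.
t_stat, z_stat, p_cast = cast_z(20, 128, 2, 128)
print(t_stat, z_stat, p_cast)
```

With equal group sizes the scaling ratio is 1, so T is simply the difference in variant counts between the two groups.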

C-alpha

PLOS Genetics, 2011 | Volume 7 | Issue 3 | e1001322

Effect direction problem

QQ Plots of Existing Methods (under the null)

• EFT and C-alpha: inflated with false positives

• TFT and CAST: no inflation, but assuming a single effect direction

• Objective: more general, powerful methods …

[Q-Q plot panels: CAST, C-alpha, EFT, TFT]

More Generalized Methods For Level 2 Data

Structure of Level 2 data

[Figure: one table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]

Strategy

Instead of testing total freq./number, we test the randomness of all tables.

Exact Probability Test (EPT)

1. Calculating the probability of each table based on the hypergeometric distribution

$$P_i = \frac{C_{n_{i1},a_{i1}} \, C_{n_{i2},a_{i2}}}{C_{N_i,A_i}}$$

2. Calculating the log-transformed joint probability (L) for all k tables

$$L = \sum_{i=1}^{k} \log(P_i)$$

3. Enumerating all possible tables and L scores

4. Calculating p-value P = Prob.( )

ASHG Meeting 1212, Zhang
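The four EPT steps can be sketched for a small number of variants. Several points are assumptions made for illustration only: each table is written as (a1, n1, a2, n2) with variant allele counts a1, a2 out of n1, n2 alleles in the two groups; margins are held fixed; and the p-value is taken as the total probability of all sets of tables whose L score is no larger than the observed one (the slide's own p-value expression did not survive). The counts in `example_tables` are made up.

```python
from itertools import product
from math import comb, exp, log


def table_distribution(n1, n2, total_variant):
    """Step 1: hypergeometric probability of every table with these margins.

    Returns {a1: probability}, where a1 is the variant count in group 1.
    """
    lo, hi = max(0, total_variant - n2), min(n1, total_variant)
    denom = comb(n1 + n2, total_variant)
    return {a1: comb(n1, a1) * comb(n2, total_variant - a1) / denom
            for a1 in range(lo, hi + 1)}


def ept(tables):
    """Return (observed L, p-value) for a list of (a1, n1, a2, n2) tables."""
    dists = [table_distribution(n1, n2, a1 + a2) for a1, n1, a2, n2 in tables]
    # Step 2: log joint probability of the observed tables.
    l_obs = sum(log(d[a1]) for d, (a1, _, _, _) in zip(dists, tables))
    # Steps 3 and 4: enumerate every combination of possible tables.
    p_value = 0.0
    for combo in product(*(d.values() for d in dists)):
        l_score = sum(log(pr) for pr in combo)
        if l_score <= l_obs + 1e-9:   # ASSUMED rejection region: L <= observed L
            p_value += exp(l_score)
    return l_obs, p_value


# Made-up counts for three variants; the third points in the opposite direction.
example_tables = [(3, 100, 0, 100), (2, 100, 0, 100), (0, 100, 2, 100)]
l_obs, p_ept = ept(example_tables)
print(l_obs, p_ept)
```

Full enumeration grows multiplicatively with the number of variants, so this sketch is only practical for a handful of sparse tables.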

Likelihood Ratio Test (LRT)

$$LR = -2\log\frac{\Pr(a_1, b_1, a_2, b_2, \ldots : H_0)}{\Pr(a_1, b_1, a_2, b_2, \ldots : H_1)} \sim \chi^2_{df=k}$$

Pr(·) is based on the binomial distribution

ASHG Meeting 1212, Zhang
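One way to read the LRT is sketched below. The parameterization is an assumption, not taken from the slide: for variant i, a_i and b_i are the variant allele counts in the two groups out of n1 and n2 alleles; under H0 both groups share one frequency per variant, under H1 each group has its own, which adds one parameter per variant and gives df = k. The function names and the example counts are hypothetical.

```python
from math import log


def binom_loglik(x, n, p):
    """Binomial log-likelihood without the binomial coefficient.

    The coefficient is the same under H0 and H1, so it cancels in the ratio.
    """
    ll = 0.0
    if x > 0:
        ll += x * log(p)
    if n - x > 0:
        ll += (n - x) * log(1 - p)
    return ll


def lrt_statistic(tables):
    """Return (LR, df) for a list of (a, n1, b, n2) per-variant tables."""
    lr = 0.0
    for a, n1, b, n2 in tables:
        p0 = (a + b) / (n1 + n2)   # H0: one shared frequency for this variant
        ll0 = binom_loglik(a, n1, p0) + binom_loglik(b, n2, p0)
        ll1 = binom_loglik(a, n1, a / n1) + binom_loglik(b, n2, b / n2)  # H1
        lr += -2.0 * (ll0 - ll1)
    return lr, len(tables)


# Made-up counts for three variants, one with the opposite direction.
lr_stat, df = lrt_statistic([(3, 100, 0, 100), (2, 100, 0, 100), (0, 100, 2, 100)])
print(lr_stat, df)
```

The statistic would then be compared with a chi-square distribution with df degrees of freedom, for example through a chi-square survival function from a statistics library. Because each variant contributes a non-negative term, effects in opposite directions add up rather than cancel.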

Q-Q Plots of EPT and LRT (under the null)

[Q-Q plot panels: EPT, N=500; EPT, N=3000; LRT, N=500; LRT, N=3000]

Power Comparison (significance level = 0.00001)

Variant proportion: positive causal 80%, neutral 20%, negative causal 0%

[Figure: power vs. sample size]

Power Comparison (significance level = 0.00001)

Variant proportion: positive causal 60%, neutral 20%, negative causal 20%

[Figure: power vs. sample size]

Power Comparison (significance level = 0.00001)

Variant proportion: positive causal 40%, neutral 20%, negative causal 40%

[Figure: power vs. sample size]

Methods For Level 1 Data

•Including covariates

•Extended to quantitative trait

•Better control for population structure

•More sophisticated models

Collapsing (C) test

Step 1

Step 2

logit(P(y=1)) = a + b*X (logistic regression; X is the collapsed variant indicator)

Li and Leal, The American Journal of Human Genetics 2008 (83): 311–321

Variant Collapsing

         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          1
2        0    1    0    0    1          1
3        0    0    0    0    0          0
4        0    0    0    0    0          0
5        0    0    0    0    0          0
6        0    0    0    0    0          0
7        0    0    1    0    1          0
8        0    0    0    1    1          1
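The collapsing step for this table can be written out as a short sketch. It only covers the collapsing itself and a carrier-by-trait cross-tabulation; fitting the logistic regression of the second step is left out. The helper names `collapse` and `cross_tab` are hypothetical.

```python
# Genotypes V1..V4 and the binary trait for the eight subjects above.
GENOTYPES = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
TRAIT = [1, 1, 0, 0, 0, 0, 0, 1]


def collapse(genotypes):
    """Collapsed indicator: 1 if the subject carries any variant, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def cross_tab(collapsed, trait):
    """2x2 counts indexed as table[carrier][trait]."""
    table = [[0, 0], [0, 0]]
    for x, y in zip(collapsed, trait):
        table[x][y] += 1
    return table


collapsed = collapse(GENOTYPES)
table = cross_tab(collapsed, TRAIT)
print(collapsed)
print(table)
```

The collapsed column is the single predictor X that enters the regression, so four sparse variant columns become one column with four carriers.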

WSS

Weighted Sum Test

$$s = \sum_{i=1}^{m} w_i g_i$$

Collapsing test (Li & Leal, 2008): w_i = 1, and s = 1 if s > 1

Weighted-sum test (Madsen & Browning, 2009): w_i calculated based on allele frequency in the control group

aSum, adaptive sum test (Han & Pan, 2010): w_i = -1 if b < 0 and p < 0.1, otherwise w_i = 1

KBAC (Liu and Leal, 2010): w_i = left-tail p-value

RBT (Ionita-Laza et al., 2011): w_i = log-scaled probability

PWST, p-value weighted sum test (Zhang et al., 2011): w_i = rescaled left-tail p-value, incorporating both significance and direction

EREC (Lin et al., 2011): w_i = estimated effect size
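All of these methods share the same score and differ only in how the weights are chosen. A minimal sketch of the shared part follows; the function name `weighted_sum`, its `collapse` flag, and the three-variant genotypes are hypothetical and used only for illustration.

```python
def weighted_sum(genotypes, weights, collapse=False):
    """Per-subject score s = sum_i w_i * g_i.

    With collapse=True the score is capped at 1, which together with
    unit weights gives the collapsing test's indicator.
    """
    scores = []
    for row in genotypes:
        s = sum(w * g for w, g in zip(weights, row))
        if collapse and s > 1:
            s = 1
        scores.append(s)
    return scores


# Made-up genotypes: subject 1 carries two variants, subject 2 one, subject 3 none.
genotypes = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]

unit_scores = weighted_sum(genotypes, [1, 1, 1])
collapsed_scores = weighted_sum(genotypes, [1, 1, 1], collapse=True)
# A sign flip on the second variant, in the spirit of a direction-aware weight.
signed_scores = weighted_sum(genotypes, [1, -1, 1])
print(unit_scores, collapsed_scores, signed_scores)
```

Keeping the weights as a plain argument makes it easy to swap in frequency-based, annotation-based, or data-driven weights without touching the scoring code.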

When there are only causal(+) variants …

         (+)  (+)
Subject  V1   V2   Collapsed  Trait
1        1    0    1          3.00
2        0    1    1          3.10
3        0    0    0          1.95
4        0    0    0          2.00
5        0    0    0          2.05
6        0    0    0          2.10

Collapsing (Li & Leal, 2008) works well, power increased

When there are causal(+) and non-causal(.) variants …

         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          3.00
2        0    1    0    0    1          3.10
3        0    0    0    0    0          1.95
4        0    0    0    0    0          2.00
5        0    0    0    0    0          2.05
6        0    0    0    0    0          2.10
7        0    0    1    0    1          2.00
8        0    0    0    1    1          2.10

Collapsing still works, power reduced

When there are causal(+), non-causal(.) and causal(-) variants …

         (+)  (+)  (.)  (.)  (-)  (-)
Subject  V1   V2   V3   V4   V5   V6   Collapsed  Trait
1        1    0    0    0    0    0    1          3.00
2        0    1    0    0    0    0    1          3.10
3        0    0    0    0    0    0    0          1.95
4        0    0    0    0    0    0    0          2.00
5        0    0    0    0    0    0    0          2.05
6        0    0    0    0    0    0    0          2.10
7        0    0    1    0    0    0    1          2.00
8        0    0    0    1    0    0    1          2.10
9        0    0    0    0    1    0    1          0.95
10       0    0    0    0    0    1    1          1.00

Power of the collapsing test drops substantially

P-value Weighted Sum Test (PWST)

           (+)   (+)   (.)    (.)   (-)    (-)
Subject    V1    V2    V3     V4    V5     V6     Collapsed  pSum   Trait
1          1     0     0      0     0      0      1          0.86   3.00
2          0     1     0      0     0      0      1          0.90   3.10
3          0     0     0      0     0      0      0          0.00   1.95
4          0     0     0      0     0      0      0          0.00   2.00
5          0     0     0      0     0      0      0          0.00   2.05
6          0     0     0      0     0      0      0          0.00   2.10
7          0     0     1      0     0      0      1          -0.02  2.00
8          0     0     0      1     0      0      1          0.08   2.10
9          0     0     0      0     1      0      1          -0.90  0.95
10         0     0     0      0     0      1      1          -0.88  1.00

t          1.61  1.84  -0.04  0.11  -1.84  -1.72
p(x≤t)     0.93  0.95  0.49   0.54  0.05   0.06
2*(p-0.5)  0.86  0.90  -0.02  0.08  -0.90  -0.88

Rescaled left-tail p-value [-1,1] is used as weight
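The weight and pSum columns can be reproduced from the left-tail p-values shown on the slide. The sketch below takes those p-values as given and does not recompute the per-variant t statistics; the variable names are hypothetical.

```python
# Left-tail p-values p(x <= t) for V1..V6, copied from the slide.
P_LEFT = [0.93, 0.95, 0.49, 0.54, 0.05, 0.06]

# Genotypes V1..V6 for the ten subjects in the table.
GENOTYPES = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

# Rescale each left-tail p-value from [0, 1] to a weight in [-1, 1].
weights = [round(2 * (p - 0.5), 2) for p in P_LEFT]

# pSum: weighted sum of each subject's genotypes.
p_sum = [round(sum(w * g for w, g in zip(weights, row)), 2) for row in GENOTYPES]

print(weights)
print(p_sum)
```

A variant with p close to 1 gets a weight near +1, one with p close to 0 gets a weight near -1, and a variant with p near 0.5 contributes almost nothing, so opposite effects no longer cancel in the sum.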

P-value Weighted Sum Test (PWST)

Power of collapsing test is retained even when there are bidirectional effects

PWST: Q-Q Plots Under the Null

Direct test: inflation of type I error

Corrected by permutation test (permutation of phenotype)
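A phenotype permutation test can be sketched generically. This is not the PWST implementation: the statistic used here (carrier mean minus non-carrier mean) is a hypothetical stand-in, and the data are the six subjects from the "only causal(+) variants" example. For a data-driven method, the weights would have to be re-estimated inside `statistic` on every permuted phenotype, which is why the statistic is passed in as a function.

```python
import random


def permutation_pvalue(statistic, genotypes, trait, n_perm=999, seed=0):
    """Two-sided permutation p-value obtained by shuffling the phenotype."""
    rng = random.Random(seed)
    observed = abs(statistic(genotypes, trait))
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(statistic(genotypes, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction: the observed data count as one permutation.
    return (hits + 1) / (n_perm + 1)


def carrier_mean_difference(genotypes, trait):
    """HYPOTHETICAL statistic: mean trait of carriers minus non-carriers."""
    carriers = [y for row, y in zip(genotypes, trait) if any(row)]
    others = [y for row, y in zip(genotypes, trait) if not any(row)]
    return sum(carriers) / len(carriers) - sum(others) / len(others)


genotypes = [[1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0]]
trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10]
p_perm = permutation_pvalue(carrier_mean_difference, genotypes, trait)
print(p_perm)
```

Because the null distribution is built from the same procedure applied to shuffled phenotypes, any optimism introduced by learning the weights from the data is present in both the observed and the permuted statistics.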

Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)

GLMM & WST

$$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon$$

Y : quantitative trait or logit(binary trait)
α : intercept
β : regression coefficient of weighted sum
m : number of RVs to be collapsed
w_i : weight of variant i
g_i : genotype (recoded) of variant i
Σ w_i g_i : weighted sum (WS)
X : covariate(s), such as population structure variable(s)
τ : fixed effect(s) of X
Z : design matrix corresponding to γ
γ : random polygene effects for individual subjects, ~N(0, G), G = 2σ²K, where K is the kinship matrix and σ² the additive polygenic variance
ε : residual

Weight

$$\sum_{i=1}^{m} w_i g_i$$

Based on allele frequency: binary (0,1) or continuous, fixed or variable threshold;

Based on functional annotation/prediction: SIFT, PolyPhen, etc.;

Based on sequencing quality (coverage, mapping quality, genotyping quality, etc.);

Data-driven, using both genotype and phenotype data: learning weights from data or adaptive selection, permutation test;

Any combination …

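Application 1: Family Data

Adjusting relatedness in family data for non-data-driven test of rare variants.

Unadjusted:

$$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + \varepsilon$$

Adjusted:

$$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon, \qquad \gamma \sim N(0, 2\sigma^2 K)$$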

Q-Q Plots of –log10(P) under the Null

Li & Leal’s collapsing test, ignoring family structure: inflation of type-1 error

Li & Leal’s collapsing test, modeling family structure via GLMM: inflation is corrected

(From Zhang et al, 2011, BMC Proc.)

Application 2: Permuting Family Data

MMPT: Mixed Model-based Permutation Test

Adjusting relatedness in family data for data-driven permutation test of rare variants.

$$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon, \qquad \gamma \sim N(0, 2\sigma^2 K)$$

[Equation annotations: "Permuted"; "Non-permuted, subject IDs fixed"]

Q-Q Plots under the Null

[Q-Q plot panels: WSS, SPWST, PWST, aSum]

Permutation test, ignoring family structure: inflation of type-1 error

(From Zhang et al, 2011, IGES Meeting)

Q-Q Plots under the Null

[Q-Q plot panels: WSS, SPWST, PWST, aSum]

Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected

(From Zhang et al, 2011, IGES Meeting)

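Burden Test vs. Non-burden Test

Burden test

$$Y = \alpha + \beta \sum_{i=1}^{k} (w_i x_i) + \varepsilon, \qquad H_0: \beta = 0$$

Non-burden test

$$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon, \qquad H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

T-test, Likelihood Ratio Test, F-test, score test, …

SKAT: sequence kernel association test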

SKAT: sequence kernel association test

$$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon, \qquad H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
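The difference between a burden-type and a SKAT-type statistic can be seen on the ten-subject bidirectional example from earlier in the deck. The sketch assumes an intercept-only null model, unit weights, and the linear weighted kernel form of the SKAT statistic, Q = Σ (w_j S_j)², where S_j is the score of variant j; it computes the statistics only, since the SKAT p-value requires a mixture of chi-square distributions that is not implemented here. Function names are hypothetical.

```python
# Ten subjects, six variants (V1..V6), quantitative trait, from the earlier table.
GENOTYPES = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
TRAIT = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]


def variant_scores(genotypes, trait):
    """S_j = sum over subjects of g_ij * (y_i - mean(y)), one value per variant."""
    mean_y = sum(trait) / len(trait)   # ASSUMPTION: intercept-only null model
    n_variants = len(genotypes[0])
    return [sum(row[j] * (y - mean_y) for row, y in zip(genotypes, trait))
            for j in range(n_variants)]


def burden_statistic(scores, weights):
    """Burden-type statistic: square of the weighted sum of variant scores."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def skat_statistic(scores, weights):
    """SKAT-type statistic: sum of squared weighted variant scores."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


scores = variant_scores(GENOTYPES, TRAIT)
unit_weights = [1.0] * len(scores)
burden_q = burden_statistic(scores, unit_weights)
skat_q = skat_statistic(scores, unit_weights)
print(burden_q, skat_q)
```

In this example the positive and negative variant scores cancel in the burden sum, while squaring each score before summing keeps both directions as evidence.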

Extension of SKAT to Family Data

[Equation annotations: kinship matrix; polygenic heritability of the trait; residual]

Han Chen et al., 2012, Genetic Epidemiology

Other problems

Missing genotypes & imputation

Genotyping errors & QC (family consistency, sequence review)

Population Stratification

Inherited variants and de novo mutation

Family data & linkage information

Variant validation and association validation

Public databases

And more …