Bayesian variant-based pathway enrichment analysis using ...xiangzhu/CSHL_2015.pdf · Bayesian...

1
Bayesian variant-based pathway enrichment analysis using GWAS summary statistics Xiang Zhu 1 and Matthew Stephens 1,2 1 Department of Statistics, 2 Department of Human Genetics More information at [email protected] Can pathway enrichment and variant prioritization be integrated? Carbonetto and Stephens [1] proposed a single framework that integrated testing pathway enrichment, estimating enrichment level and prioritizing genetic variants in the enriched pathways. The software, BMApathway, is available at https://github.com/pcarbo/bmapathway. A novel integrated approach The GWAS full data are genotypes X := (x 1 ,..., x n ) | and phenotypes y := (y 1 ,..., y n ) | from n unrelated individuals. I For continuous traits, linear regression is used: y i |x i , β N (β 0 + x | i β -1 ). I For binary traits, logistic regression is used: y i |x i , β Bernoulli(η (x i , β )), logit(η (x i , β )) = β 0 + x | i β . The multiple-SNP effect of p SNPs, β := (β 1 ,...,β p ) | , has prior β j (1 - π j )δ 0 + π j N (02 B ). The enrichment of associations withing a pathway is modeled as logit 10 (π j )= θ 0 + a j θ, where a j := 1{SNP j is in the pathway}, θ is the log-fold enrichment and θ 0 is the genome-wide log-odds. Notations: a := (a 1 ,..., a p ) | and γ := (γ 1 ,...,γ p ) | , γ j = 1{β j 6=0}. Can the integrated method be applied to GWAS summary data? I Application of BMApathway is complicated by access to full data. I Summary statistics from single-SNP analysis are widely available. I The enrichment prior is useful even if full data are not provided. A similar integrated analysis is possible if we keep the prior and use a likelihood that only relies on summary data. References [1] P. Carbonetto, M. Stephens, PLoS Genetics 9, e1003770 (2013). [2] X. Zhu, M. Stephens, Presented at the 65th Annual Meeting of The American Society of Human Genetics, Baltimore, MD (2015). [3] Wellcome Trust Case Control Consortium , Nature 447, 661 (2007). [4] D. Croft, et al., Nucleic Acids Research 42, D472 (2014). [5] L. Y. Geer, et al., Nucleic Acids Research 38, D492 (2010). [6] E. G. Cerami, et al., Nucleic Acids Research 39, D685 (2011). [7] C. F. Schaefer, et al., Nucleic Acids Research 37, D674 (2009). [8] H. Mi, A. Muruganujan, P. D. Thomas, Nucleic Acids Research 41, D377 (2013). [9] C. Wrzodek, F. B¨ uchel, M. Ruff, A. Dr¨ ager, A. Zell, BMC systems biology 7, 15 (2013). [10] A. R. Pico, et al., PLoS Biology 6, e184 (2008). [11] L. S. Chen, et al., The American Journal of Human Genetics 86, 860 (2010). [12] H. Gui, M. Li, P. C. Sham, S. S. Cherny, BMC research notes 4, 386 (2011). [13] M.-X. Li, J. S. Kwan, P. C. Sham, The American Journal of Human Genetics 91, 478 (2012). [14] A. R. Wood, et al., Nature Genetics 46, 1173 (2014). [15] 1000 Genomes Project Consortium, Nature 467, 1061 (2010). Regression with Summary Statistics (RSS) provides a solution. We propose the following regression model for GWAS summary statistics: b β |S , R , β N (SRS -1 β , SRS ). I b β := ( ˆ β 1 ,..., ˆ β p ) | , where ˆ β j is the single-SNP effect estimate of SNP j ; I S := diag(s), s := (s 1 ,..., s p ) | , where s j is the standard error of ˆ β j ; I R is the population linkage disequilibrium (LD) matrix. We term the model Regression with Summary Statistics [2]. RSS uses an algorithm based on variational approximation. Our posterior computation exploits the fact that p (β , γ |D)= Z p (β , γ |D0 )p (θ 0 |D)dθ 0 dθ. The strength of pathway enrichment is measured by SBF(a) := p ( b β |S , R , a,θ> 0) / p ( b β |S , R , a= 0). The log-fold enrichment θ is estimated from p (θ |D), D := { b β , S , R , a}. The association signal of SNP j given the enrichment is summarized as SPIP(j ) := p (γ j =1| b β , S , R , a). Estimate p (β , γ |D0 ) given {θ 0 } Decomposition of marginal likelihood: log p (D|θ 0 )= E q log q (β , γ ) p (β , γ |D0 ) | {z } Kullback-Leibler (KL) divergence +E q log p (D, β , γ |θ 0 ) q (β , γ ) | {z } Evidence lower bound (LB) Optimization problem: q ? = arg min q KL(q ; θ 0 ) = arg max q LB(q ; θ 0 ) The only assumption being made: mean field approximation q (β , γ )= p Y j =1 q j (β j j ) Optimal solution q ? j for each each q j : q ? j (β j j )=[α j N (β j ; μ j 2 j )] γ j [(1 - α j )δ 0 (β j )] 1-γ j Iterative scheme for obtaining q ? j : σ 2 j =(s -2 j + σ -2 B ) -1 μ j = σ 2 j (s -2 j ˆ β j - s -1 j i 6=j s -1 i R ij α i μ i ) α j 1 - α j = π j 1 - π j · σ j σ B · exp ( μ 2 j 2σ 2 j ) Estimate p (θ 0 |D) I Current approach is to use points of {θ 0 } in a regular grid and set p (θ 0 |D) exp{LB(q ? ; θ 0 )}. I We are investigating more efficient approaches. Parallel implementation R ij =0 if SNP i and j are on different chromosomes. I Iterative update of q ? j only requires data from SNP i that R ij 6= 0. I LB(q ? ; θ 0 )= 22 c =1 LB c , LB c only uses data from Chromosome c . RSS yields results comparable to a method that requires genotype data. We compare RSS with a full data-based method, BMApathway [1], through simulations based on real genotype data [3]. I Null dataset assumes that each SNP is equally likely to be causal. I Enriched dataset assumes that SNPs in the pathway are more likely to be associated with the phenotypes. The pathway used in simulations is signal transduction (Reactome [4]) retrieved from BioSystems [5]. Type 1 error Power Enrichment estimation Acknowledgements We thank Peter Carbonetto and Xin He for helpful discussions. We thank Raman Shah and John Zekos for expert technical support. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. We acknowledge the University of Chicago Research Computing Center for support of this work. RSS provides a way to gain biological insights into complex human traits. Pathways are retrieved from Pathway Commons (PC) [6], Biosystems (BS) [5] and BioCarta (BC), which include gene sets derived from Reactome [4], PID [7], PANTHER [8], KEGG [9] and WikiPathways (Wiki) [10]. Crohn’s disease We applied RSS on 3160 curated pathways and GWAS summary statistics of 435,615 SNPs for Crohn’s disease from 4,686 individuals (1,748 cases and 2,938 controls) in the British population [3]. The top five candidate pathways for enrichment of Crohn’s disease detected by RSSpathway and BMApathway are the same, and their enrichments are also significant using other methods. Pathway Source Database log 10 (BF) - log 10 (P ) RSSpathway BMApathway [1] GRASS [11] Gates-Simes [12] HYST [13] IL12-mediated signaling events PID PC 9.42 4.00 11.06 10.80 15.65 Cytokine signaling in immune system Reactome BS 8.87 5.97 14.91 10.32 15.65 IL23-mediated signaling events PID PC 8.72 4.13 12.29 11.02 Inf Immune system Reactome BS 6.24 3.30 10.72 9.97 8.39 Immune system Reactome PC 5.75 3.07 10.04 10.13 7.49 The following figure shows P 1 , posterior probability that locus contains disease risk variants, with and without enrichment of cytokine signaling. Adult height We applied RSS on 3700 curated pathways and GWAS summary statistics of 1.06 million SNPs for adult human height from 253,288 individuals of European (EUR) ancestry [14]. The population LD matrix R was estimated from the 1000 Genomes [15] EUR samples. The top-ranked pathways are listed below, and most of them are linked to skeletal development and homeostasis, regulation of human stature, cell migration, cancer and etiology of bone diseases. Pathway Source Database # of genes Hedgehog signaling pathway Wiki BS 22 Hedgehog signaling pathway KEGG BS 51 Basal cell carcinoma KEGG BS 55 Hedgehog ’on’ state Reactome BS 84 RAC1 signaling pathway PID PC 54 Rho cell motility signaling pathway BC BC 32 Signaling by Hedgehog Reactome BS 135 Regulatory role of PI3K subunit p85 BC BC 16 How does salmonella hijack a cell BC BC 13 Y branching of actin filaments BC BC 20 Pathogenic Escherichia coli infection KEGG BS 55 Pathogenic Escherichia coli infection Wiki BS 56 How progesterone initiates oocyte membrane BC BC 33 Rho GTPases activate WASPs and WAVEs Reactome BS 35 Cytoskeletal regulation by Rho GTPase PANTHER PC 70 Signaling events mediated by the Hedgehog family PID BS 22 Signaling events mediated by the Hedgehog family PID PC 23 EPHB-mediated forward signaling Reactome PC 41 Ligand-receptor interactions Reactome BS 8 Software Software of using RSS for integrated enrichment analysis, RSSpathway, will be available from https://github.com/stephenslab/rss.

Transcript of Bayesian variant-based pathway enrichment analysis using ...xiangzhu/CSHL_2015.pdf · Bayesian...

Page 1: Bayesian variant-based pathway enrichment analysis using ...xiangzhu/CSHL_2015.pdf · Bayesian variant-based pathway enrichment analysis using GWAS summary statistics ... PID [7],

Bayesian variant-based pathway enrichmentanalysis using GWAS summary statistics

Xiang Zhu1 and Matthew Stephens1,2

1Department of Statistics, 2Department of Human Genetics

More information at [email protected]

Can pathway enrichment andvariant prioritization be integrated?

Carbonetto and Stephens [1] proposed a single framework that integratedtesting pathway enrichment, estimating enrichment level and prioritizinggenetic variants in the enriched pathways. The software, BMApathway, isavailable at https://github.com/pcarbo/bmapathway.

A novel integrated approach

The GWAS full data are genotypes X := (x1, . . . , xn)ᵀ and phenotypesy := (y1, . . . , yn)ᵀ from n unrelated individuals.

I For continuous traits, linear regression is used:

yi |xi ,β ∼ N(β0 + xᵀi β, τ−1).

I For binary traits, logistic regression is used:

yi |xi ,β ∼ Bernoulli(η(xi ,β)), logit(η(xi ,β)) = β0 + xᵀi β.

The multiple-SNP effect of p SNPs, β := (β1, . . . , βp)ᵀ, has prior

βj ∼ (1− πj)δ0 + πjN(0, σ2B).

The enrichment of associations withing a pathway is modeled as

logit10(πj) = θ0 + ajθ,

where aj := 1{SNP j is in the pathway}, θ is the log-fold enrichmentand θ0 is the genome-wide log-odds.

Notations: a := (a1, . . . , ap)ᵀ and γ := (γ1, . . . , γp)ᵀ, γj = 1{βj 6= 0}.

Can the integrated method beapplied to GWAS summary data?

I Application of BMApathway is complicated by access to full data.

I Summary statistics from single-SNP analysis are widely available.

I The enrichment prior is useful even if full data are not provided.

A similar integrated analysis is possible if

we keep the prior and use a likelihood that only relies on summary data.

References

[1] P. Carbonetto, M. Stephens, PLoS Genetics 9, e1003770 (2013).

[2] X. Zhu, M. Stephens, Presented at the 65th Annual Meeting of TheAmerican Society of Human Genetics, Baltimore, MD (2015).

[3] Wellcome Trust Case Control Consortium , Nature 447, 661 (2007).

[4] D. Croft, et al., Nucleic Acids Research 42, D472 (2014).

[5] L. Y. Geer, et al., Nucleic Acids Research 38, D492 (2010).

[6] E. G. Cerami, et al., Nucleic Acids Research 39, D685 (2011).

[7] C. F. Schaefer, et al., Nucleic Acids Research 37, D674 (2009).

[8] H. Mi, A. Muruganujan, P. D. Thomas, Nucleic Acids Research 41,D377 (2013).

[9] C. Wrzodek, F. Buchel, M. Ruff, A. Drager, A. Zell, BMC systemsbiology 7, 15 (2013).

[10] A. R. Pico, et al., PLoS Biology 6, e184 (2008).

[11] L. S. Chen, et al., The American Journal of Human Genetics 86, 860(2010).

[12] H. Gui, M. Li, P. C. Sham, S. S. Cherny, BMC research notes 4, 386(2011).

[13] M.-X. Li, J. S. Kwan, P. C. Sham, The American Journal of HumanGenetics 91, 478 (2012).

[14] A. R. Wood, et al., Nature Genetics 46, 1173 (2014).

[15] 1000 Genomes Project Consortium, Nature 467, 1061 (2010).

Regression with Summary Statistics(RSS) provides a solution.

We propose the following regression model for GWAS summary statistics:

β|S ,R ,β ∼ N(SRS−1β, SRS).

I β := (β1, . . . , βp)ᵀ, where βj is the single-SNP effect estimate of SNP j ;

I S := diag(s), s := (s1, . . . , sp)ᵀ, where sj is the standard error of βj ;

I R is the population linkage disequilibrium (LD) matrix.

We term the model Regression with Summary Statistics [2].

RSS uses an algorithm based onvariational approximation.

Our posterior computation exploits the fact that

p(β,γ|D) =

∫p(β,γ|D, θ0, θ)p(θ0, θ|D)dθ0dθ.

The strength of pathway enrichment is measured by

SBF(a) := p(β|S ,R , a, θ > 0) / p(β|S ,R , a, θ = 0).

The log-fold enrichment θ is estimated from

p(θ|D), D := {β, S ,R , a}.The association signal of SNP j given the enrichment is summarized as

SPIP(j) := p(γj = 1|β, S ,R , a).

Estimate p(β,γ|D, θ0, θ) given {θ0, θ}

Decomposition of marginal likelihood:

log p(D|θ0, θ) = Eq log

[q(β,γ)

p(β,γ|D, θ0, θ)

]︸ ︷︷ ︸

Kullback-Leibler (KL) divergence

+ Eq log

[p(D,β,γ|θ0, θ)

q(β,γ)

]︸ ︷︷ ︸

Evidence lower bound (LB)

Optimization problem:

q? = arg minq

KL(q; θ0, θ) = arg maxq

LB(q; θ0, θ)

The only assumption being made: mean field approximation

q(β,γ) =

p∏j=1

qj(βj , γj)

Optimal solution q?j for each each qj :

q?j (βj , γj) = [αjN(βj ;µj , σ2j )]γj [(1− αj)δ0(βj)]1−γj

Iterative scheme for obtaining q?j :

σ2j = (s−2

j + σ−2B )−1

µj = σ2j (s−2

j βj − s−1j

∑i 6=js−1i Rijαiµi)

αj

1− αj=

πj

1− πj·σj

σB· exp

{µ2

j

2σ2j

}

Estimate p(θ0, θ|D)

I Current approach is to use points of {θ0, θ} in a regular grid and set

p(θ0, θ|D) ∝ exp{LB(q?; θ0, θ)}.I We are investigating more efficient approaches.

Parallel implementation

Rij = 0 if SNP i and j are on different chromosomes.

I Iterative update of q?j only requires data from SNP i that Rij 6= 0.

I LB(q?; θ0, θ) =∑22

c=1 LBc , LBc only uses data from Chromosome c .

RSS yields results comparable to amethod that requires genotype data.

We compare RSS with a full data-based method, BMApathway [1], throughsimulations based on real genotype data [3].

I Null dataset assumes that each SNP is equally likely to be causal.

I Enriched dataset assumes that SNPs in the pathway are more likely to beassociated with the phenotypes. The pathway used in simulations is signaltransduction (Reactome [4]) retrieved from BioSystems [5].

Type 1 error

Power

Enrichment estimation

Acknowledgements

We thank Peter Carbonetto and Xin He for helpful discussions. We thankRaman Shah and John Zekos for expert technical support.

This study makes use of data generated by the Wellcome Trust Case ControlConsortium. A full list of the investigators who contributed to the generationof the data is available from www.wtccc.org.uk.

We acknowledge the University of Chicago Research Computing Center forsupport of this work.

RSS provides a way to gain biologicalinsights into complex human traits.

Pathways are retrieved from Pathway Commons (PC) [6], Biosystems (BS)[5] and BioCarta (BC), which include gene sets derived from Reactome [4],PID [7], PANTHER [8], KEGG [9] and WikiPathways (Wiki) [10].

Crohn’s disease

We applied RSS on 3160 curated pathways and GWAS summary statisticsof 435,615 SNPs for Crohn’s disease from 4,686 individuals (1,748 casesand 2,938 controls) in the British population [3].

The top five candidate pathways for enrichment of Crohn’s diseasedetected by RSSpathway and BMApathway are the same, and theirenrichments are also significant using other methods.

Pathway Source Database log10(BF) − log10(P)RSSpathway BMApathway [1] GRASS [11] Gates-Simes [12] HYST [13]

IL12-mediated signaling events PID PC 9.42 4.00 11.06 10.80 15.65Cytokine signaling in immune system Reactome BS 8.87 5.97 14.91 10.32 15.65IL23-mediated signaling events PID PC 8.72 4.13 12.29 11.02 InfImmune system Reactome BS 6.24 3.30 10.72 9.97 8.39Immune system Reactome PC 5.75 3.07 10.04 10.13 7.49

The following figure shows P1, posterior probability that locus containsdisease risk variants, with and without enrichment of cytokine signaling.

Adult height

We applied RSS on 3700 curated pathways and GWAS summary statisticsof 1.06 million SNPs for adult human height from 253,288 individuals ofEuropean (EUR) ancestry [14]. The population LD matrix R wasestimated from the 1000 Genomes [15] EUR samples.

The top-ranked pathways are listed below, and most of them are linked toskeletal development and homeostasis, regulation of human stature, cellmigration, cancer and etiology of bone diseases.

Pathway Source Database # of genesHedgehog signaling pathway Wiki BS 22Hedgehog signaling pathway KEGG BS 51Basal cell carcinoma KEGG BS 55Hedgehog ’on’ state Reactome BS 84RAC1 signaling pathway PID PC 54Rho cell motility signaling pathway BC BC 32Signaling by Hedgehog Reactome BS 135Regulatory role of PI3K subunit p85 BC BC 16How does salmonella hijack a cell BC BC 13Y branching of actin filaments BC BC 20Pathogenic Escherichia coli infection KEGG BS 55Pathogenic Escherichia coli infection Wiki BS 56How progesterone initiates oocyte membrane BC BC 33Rho GTPases activate WASPs and WAVEs Reactome BS 35Cytoskeletal regulation by Rho GTPase PANTHER PC 70Signaling events mediated by the Hedgehog family PID BS 22Signaling events mediated by the Hedgehog family PID PC 23EPHB-mediated forward signaling Reactome PC 41Ligand-receptor interactions Reactome BS 8

Software

Software of using RSS for integrated enrichment analysis, RSSpathway, willbe available from https://github.com/stephenslab/rss.