Fei Zou Department of Biostatistics Email: [email protected]
description
Transcript of Fei Zou Department of Biostatistics Email: [email protected]
Bayesian Variable Selection in Semiparametric Regression
Modeling with Applications to
Genetic Mappping
Fei Zou
Department of Biostatistics
Email: [email protected]
Outline
• Introduction– Experimental crosses– Existing QTL Mapping Methods
• Bayesian semi-parametric QTL Mapping
• Results
• Remarks and Conclusions
http://www.cs.unc.edu/Courses/comp590-090-f06/Slides/CSclass_Threadgill.ppt
Overview
• One gene one trait: very unlikely
• The vast majority of biological traits are caused by complex polygenes– Potentially interacting with each other
• Most traits have significant environmental exposure components– Potentially interacting with polygenes
ParentsP1
P2
Experimental Crosses: F2
Experimental Crosses
• F2 Backcross(BC)
F1 F1
P1 P2
F2: BC:
P1 P2
P1 F1
AA BB
AB AB
BB AB AB
AA BB
AA AB
AB AA AB
QTL Data Format
0: homozygous AA, 2: homozygous BB, 1: heterozygote AB.Marker positions:
Linkage Analysis
• Data structure: – Marker data (genotypes plus positions)– Phenotypic trait(s)– Other nongenetic covariates, such as age,
gender, environmental conditions etc
• Quantitative trait loci (QTL): a particular region of the genome containing one or more genes that are associated with the trait being assayed or measured
QTL Mapping of Experimental Crosses
• Single QTL Mapping• Single marker analysis (Sax, 1923 Genetics)• Interval mapping: Lander & Botstein (1989,
Genetics)
• Multiple QTL mapping• Composite interval mapping (Zeng 1993 PNAS,
1994 Genetics; Jansen & Stam, 1994 Genetics)• Multiple interval mapping (Kao et al., 1999
Genetics)• Bayesian analysis (Satagopan et al., 1997
Genetics)
Single QTL Interval Mapping
• For backcross, the model assumes
• QTL analysis:
• If QTL genotypes are observed, the analysis is trivial: simple t-test!
• However, QTL position is unknown and therefore QTL genotypes are unobserved
2(QTL genotype is AA) ~ ( , )i AAy N 2(QTL genotype is Aa) ~ ( , )i Aay N
0 : vs :AA Aa A AA AaH H
Interval Mapping
• For QTL between markers– QTL genotypes missing: can use marker genotypes to
infer the conditional probabilities of the QTL genotypes for a given QTL position
– Profile likelihood (LOD score) calculated across the whole genome or candidate regions using EM algorithm
– In any region where the profile exceeds a (genome-wide) significance threshold, a QTL declared at the position with the highest LOD score.
0
2
4
6
8
Chromosome
lod
1 2 3 4 5 6 7 8 9 10 11 12 13 14 1516171819 X
Profile LOD
Multiple QTL Mapping
• Most complicated traits are caused by multiple (potentially interacting) genes, which also interact with environment stimuli – Single QTL interval mapping
• Ghost QTL (Lander & Botstein 1989)• Low power
Multiple QTL Mapping
• Composite interval mapping (Zeng 1993, 1994; Jansen & Stam1993): searching for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust the effects of other QTLs outside the region
• which background markers to include; window size etc
• Multiple interval mapping (Kao et al 1999): fitting multiple QTLs simultaneously
• Computationally intensive; how many QTLs to include?
Multiple QTL Mapping
• Bayesian methods (Stephens and Fisch 1998 Biometrics; Sillanpaa and Arjas 1998 Genetics; Yi and Xu 2002 Genetic Research, and Yi et al. 2003 Genetics): treat the number of QTLs as a parameter by using reversible jump Markov chain Monte Carlo (MCMC) of Green (1995 Biometrika)
• change of dimensionality, the acceptance probability for such dimension change, which in practice, may not be handled correctly (Ven 2004 Genetics)
Multiple QTL Mapping
• Alternative, multiple QTL mapping can be viewed as a variable selection problem– Forward and step-wise selection procedures (Broman
and Speed 2002 JRSSB) – LASSO, etc– Bayesian QTL mapping
• Xu (2003 Genetics), Wang et al (2005 Genetics) Huang et al (2007 Genetics): Bayesian shrinkage
• Yi et al (2003 Genetics): stochastic search variable selection (SSVS) of George and McCulloch (1993 JASA)
• Yi (2004 Genetics): composite model space of Godsill (2001 J. Comp. Graph. Stat)
• Software: R/qtlbim by Yi’s group
Multiple QTL Mapping
• Limitations of existing QTL mapping methods – do not model covariates at all or only
model covariate effect linearly
– do not model interactions at all or model only lower order interactions, such as two way interactions
• The multiple QTL mapping is a very large variable selection problem: for p potential genes, with p being in the hundreds or thousands, there are possible main effect models, possible two-way interactions and possible higher order (k > 2) interactions.
2 p 22p
2pk
Semiparmetric Multiple (Potentially Interacting) QTL Mapping
• Goal: map multiple potentially interacting QTLs without specifically model all potential main and higher order interaction effects
• Semiparametric model:
where function is unspecified, QTL genotypes and represent all non- genetics factors/covariates.
• When equals : non-explicitly modeling the two way interaction between genes 1 and 2 and the gene-environmental interaction between gene 3 and covariate 1.
21 2 1( , ,... , ,..., ) , 1, , with (0, )~
iid
i i i ip i iq i iy x x x t t e i n e N
1 2, ,...i i ipx x x1,...,i iqt t
1 2 3 1i i i ix x x t
Bayesian Semi/non-parametric Methods
– Dirichlet process (Muller et al. 1996)– Splines (Smith and Kohn 1996; Denison et al. 1998
and DiMatteo et al. 2001)– Wavelets (Abramovich et al. 1998 JRSSB)– Kernel models (Liang et al 2007) – Gaussian process (Neal 1997; 1996)
• Gaussian process priors have a large support in the space of all smooth functions through an appropriate choice of covariance kernel.
• Gaussian process is flexible for curve estimation because of their flexible sample path shapes
• Gaussian process related to smoothing spline somehow (Wahba 1978 JRSSB)
Prior Specification on
• A Gaussian process such that all possible finite dimensional distributions follow multivariate normal with mean 0 and covariance function
where , s and s are hyperparameters and
1( ,..., )Tn
2 2' ' '
1 1
1cov( , ) exp[ ( ) ( ) ]
p q
i i xk ik i k tj ij i jk j
x x t t
xk
1 1( ,..., , ,..., )i i ip i iqx x t t tj
• Hyperparameter defines the vertical scale of variations, i.e., controls the magnitude of the exponential part. Hyperparameters related to length scales which characterize the distance in that particular direction over which y is expected to vary significantly– controls the smoothness of : when
the posterior mean of almost interpolates the data while centered around the prior mean function if
– When = 0, y is expected to be an essentially constant
function of that input variable xj, which is therefore deemed irrelevant (Mackay 1998).
( )xk tj 1/ xk
xk
0
Priors on • The original papers on the Gaussian process
(Mackay 1998; Neal 1997) did not view this method as an approach for variable selection and imposed a Gamma prior on the parameters. However, does provide information about the relevance of any QTL with value near zero indicating an irrelevant QTL.
• For variable selection purpose, we can impose the following Gamma mixture priors on
1/xk xk
s
Prior Specifications
• Inverse Gamma distributions are used for the priors of and . 2
Simulations
• Set ups:– backcross population – 200 or 500 individuals– 151 evenly spaced markers at 5cM intervals– Four QTLs with varying heritabilities:
• Main effect model: all four QTL act additively• Main plus two way interactions• Four way interactions only
n=500 and pure 4 way-interaction model
n=500 and pure additive model
Real Data Analysis
• A mouse study– # samples: 187 backcross samples– # markers: 85 with average marker distance
20 cM – Phenotypes: inguinal, gonadal, retroperitoneal
and mesenteric fat pad weights
Remarks
• For studies with large # of samples and/or large # of markers, MCMC converges very slowly– We employed the hybrid Monte Carlo method, which
merges the Metropolis-Hastings algorithm with sampling techniques based on dynamics simulation.
– We also estimated the maximum a posteriori (MAP) via conjugate gradient method (Hestenes et al 1952 J. Research of National Bureau of Standards)
• point estimate
Real Study: Cardiovascular Disease
• 2655 tag SNPs from roughly 200 selected candidate genes for cardiovascular disease
• 820 individuals
• Non-genetic covariates: gender, smoking status, age
Remarks
• Semiparemetric mapping is powerful in mapping multiple (potentially interacting with higher orders) QTL – Picks up genes related to the trait regardless
of their marginal main effects or joint epistasis effects
– Cannot readily differentiates genetic contributions
• main effect? interaction? or both?• Fine tuned parametric model with selected genes
Remarks and Future Research
• How to extend the methodologies to human genome-wide association (GWA) studies, where hundreds of thousands of markers are available– Is it possible?– potential solutions: pathway analysis; data reduction
techniques
• How to extend the method to human pedigree analysis where mixed effect model is used for correlated family members?– Use inheritance vector: so far results are very
promising
Acknowledgement
• Joint work with – Hanwen Huang – Haibo Zhou– Fuxia Cheng– Ina Hoeschele
• Funding support– NIH R01 GM074175