Modifying the Schwarz Bayesian Information Criterion to locate multiple interacting Quantitative Trait Loci
1. M. Bogdan, J. K. Ghosh and R. W. Doerge, Genetics 2004, 167: 989-999.
2. M. Bogdan and R. W. Doerge, "Mapping multiple interacting QTL by multidimensional genome searches".
$X_{ia}$ - genotype of the i-th individual at locus a
$X_{ia} = 1/2$ - individual is heterozygous at locus a
$X_{ia} = -1/2$ - individual is homozygous at locus a
$d_{ab} = 10$ cM - $\rho(X_{ia}, X_{ib}) = 0.81$
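As a rough check of the correlation quoted for loci 10 cM apart, the sketch below assumes the Haldane map function (no crossover interference), which the slides do not state explicitly. The function name `genotype_correlation` is hypothetical. Under this assumption the correlation of the ±1/2 genotype codes in a backcross is 1 - 2r, where r is the recombination fraction.

```python
import math

def genotype_correlation(distance_cm: float) -> float:
    """Correlation between backcross genotypes at two loci distance_cm apart.

    Assumption (not from the slides): Haldane map function.
    """
    d = distance_cm / 100.0                      # centimorgans -> Morgans
    recomb = 0.5 * (1.0 - math.exp(-2.0 * d))    # recombination fraction r
    return 1.0 - 2.0 * recomb                    # equals exp(-2 d)

rho = genotype_correlation(10.0)
print(round(rho, 4))
```

Under the Haldane assumption the value is exp(-0.2), which lies between 0.81 and 0.82 and so agrees with the figure on the slide up to rounding.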
Data for QTL mapping
$Y_1, \ldots, Y_n$ - vector of trait values for n backcross individuals
$X = [X_{ij}]$, $1 \le i \le n$, $1 \le j \le m$ - genotypes of m markers
Standard methods of QTL mapping
One QTL model
(1) $Y_i = \mu + \beta Q_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$
$Q_i \in \{-1/2, 1/2\}$ - QTL genotype
1. Search over markers - fit model (1) at each marker and take as candidate QTL locations the markers at which the likelihood exceeds a pre-established threshold value.
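The marker search can be sketched as follows: model (1) is fitted by least squares with each marker genotype in place of the QTL genotype, and the likelihood-ratio statistic against the mean-only model is compared with a threshold. The function name `marker_scan`, the simulated data and the threshold value are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def marker_scan(y, X):
    """Likelihood-ratio statistic 2*log(L1/L0) for model (1) fitted at each marker."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = y.size
    yc = y - y.mean()
    rss0 = float(yc @ yc)                  # null model: mean only
    Xc = X - X.mean(axis=0)
    sxx = (Xc ** 2).sum(axis=0)
    sxy = Xc.T @ yc
    rss1 = rss0 - sxy ** 2 / sxx           # RSS after regressing on each marker
    return n * np.log(rss0 / rss1)

rng = np.random.default_rng(0)
n, m = 200, 20
X = rng.choice([-0.5, 0.5], size=(n, m))   # unlinked markers, for illustration only
y = 1.0 * X[:, 7] + rng.normal(size=n)     # hypothetical QTL at marker index 7
lr = marker_scan(y, X)
threshold = 10.0                           # illustrative value, not from the source
candidates = np.flatnonzero(lr > threshold)
print(candidates)
```

For Gaussian errors the statistic reduces to n log(RSS0/RSS1), so no explicit likelihood evaluation is needed. In practice the threshold would be calibrated, for example by permutation.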
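Interval mapping (Lander and Botstein, 1989)
• Consider a fixed position between markers
$I_i$ - state of the flanking markers, $I_i \in \{(-\tfrac12, -\tfrac12), (-\tfrac12, \tfrac12), (\tfrac12, -\tfrac12), (\tfrac12, \tfrac12)\}$
$p_i = P(Q_i = \tfrac12 \mid I_i)$ - easy to compute
$$Y_i = \mu + \beta Q_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)$$
$$f(Y_i \mid I_i) = p_i \, N(\mu + \tfrac12 \beta, \sigma^2) + (1 - p_i) \, N(\mu - \tfrac12 \beta, \sigma^2)$$
$$L(Y \mid I) = \prod_{i=1}^{n} f(Y_i \mid I_i)$$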
1. Estimate μ, β, and σ by the EM algorithm and compute the corresponding likelihood.
2. Repeat this procedure for each new candidate QTL location.
3. Plot the resulting likelihoods as a function of the assumed QTL position.
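Step 1 can be sketched for a single candidate position as an EM fit of the two-component normal mixture with known mixing probabilities. This is a minimal illustration, assuming the probabilities p_i = P(Q_i = 1/2 | I_i) have already been computed from the flanking markers. The function names, starting values and simulated data are hypothetical.

```python
import numpy as np

def normal_pdf(y, mean, sigma):
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def em_interval(y, p, n_iter=200):
    """EM estimates of (mu, beta, sigma) and the log-likelihood at one position.

    p[i] = P(Q_i = 1/2 | I_i) is assumed to be precomputed from flanking markers.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    mu, beta, sigma = y.mean(), 0.1, y.std()
    for _ in range(n_iter):
        # E-step: posterior probability that Q_i = 1/2
        f_hi = p * normal_pdf(y, mu + beta / 2, sigma)
        f_lo = (1 - p) * normal_pdf(y, mu - beta / 2, sigma)
        w = f_hi / (f_hi + f_lo)
        # M-step: weighted means of the two genotype classes, pooled variance
        m_hi = np.sum(w * y) / np.sum(w)
        m_lo = np.sum((1 - w) * y) / np.sum(1 - w)
        mu, beta = (m_hi + m_lo) / 2, m_hi - m_lo
        sigma = np.sqrt(np.mean(w * (y - m_hi) ** 2 + (1 - w) * (y - m_lo) ** 2))
    mix = (p * normal_pdf(y, mu + beta / 2, sigma)
           + (1 - p) * normal_pdf(y, mu - beta / 2, sigma))
    return mu, beta, sigma, float(np.sum(np.log(mix)))

# Simulated example: informative flanking markers (p_i close to 0 or 1)
rng = np.random.default_rng(1)
n = 500
p = rng.choice([0.05, 0.95], size=n)
q = np.where(rng.random(n) < p, 0.5, -0.5)
y = 2.0 + 1.5 * q + rng.normal(scale=0.5, size=n)
mu_hat, beta_hat, sigma_hat, loglik = em_interval(y, p)
print(mu_hat, beta_hat, sigma_hat, loglik)
```

Steps 2 and 3 would call `em_interval` on a grid of positions, recomputing `p` at each one, and plot the returned log-likelihoods.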
• Problems with interval mapping
a) Not able to distinguish closely linked QTL
b) Not able to detect epistatic QTL (involved only in interactions)
• Solution
Estimate the locations of several QTL at once using a multiple regression model (Kao et al. 1999)
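$$Y_i = \mu + \sum_{j=1}^{p} \beta_j Q_{ij} + \sum_{1 \le j < l \le m} \gamma_{jl} Q_{ij} Q_{il} + \varepsilon_i$$
(p additive terms, r interaction terms)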
Problem: estimation of the number of additive and interaction terms
$$Y_i = \mu + \sum_{j=1}^{p} \beta_j X_{i h_j} + \sum_{j=1}^{r} \gamma_j X_{i k_j} X_{i u_j} + \varepsilon_i$$
$X_{ij}$ - genotype of the j-th marker
average number of markers - between 200 and 400
Bayesian Information Criterion
• Choose the model which maximizes
$$\log L - \tfrac12 k \log n$$
L – likelihood of the data for a given model
k – number of parameters in the model
n – sample size
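For a linear regression with Gaussian errors the criterion can be evaluated directly from the residual sum of squares. The sketch below is one possible implementation, with the hypothetical name `gaussian_bic`. It counts the regression coefficients, the intercept and the error variance in k.

```python
import numpy as np

def gaussian_bic(y, design):
    """Return log L - (1/2) k log n for a least-squares fit with Gaussian errors.

    `design` is an (n, q) array of regressors without the intercept column.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    A = np.column_stack([np.ones(n), design])        # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    # Maximized Gaussian log-likelihood, with sigma^2 estimated by RSS / n
    log_lik = -0.5 * n * (np.log(2.0 * np.pi) + np.log(rss / n) + 1.0)
    k = A.shape[1] + 1                               # coefficients + variance
    return log_lik - 0.5 * k * np.log(n)

rng = np.random.default_rng(2)
n = 200
X = rng.choice([-0.5, 0.5], size=(n, 3))
y = 1.0 * X[:, 0] + rng.normal(size=n)
bic_one = gaussian_bic(y, X[:, :1])    # true regressor only
bic_two = gaussian_bic(y, X[:, :2])    # true regressor plus an irrelevant one
print(bic_one, bic_two)
```

The model with the larger value is preferred. Adding a regressor can never lower the log-likelihood, so the second model loses only when its gain is smaller than the extra penalty of (1/2) log n.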
Broman (1997) and Broman and Speed (2002) – BIC overestimates QTL number
How to modify BIC?
$M_i$ - i-th linear model (specifies which markers are included in the regression)
$\theta = (\mu, \beta_1, \ldots, \beta_p, \gamma_1, \ldots, \gamma_r, \sigma)$ - vector of parameters for $M_i$
$f_i(\theta)$ - density of the prior distribution for θ
$\pi(i)$ - prior probability of $M_i$
$L(Y \mid \theta)$ - likelihood of the data given the vector of parameters θ
$m_i(Y)$ - likelihood of the data given the model $M_i$
$$P(M_i \mid Y) \propto \pi(i) \, m_i(Y)$$
BIC neglects π(i) and uses the asymptotic approximation
$$m_i(Y) = \int L(Y \mid \theta) f_i(\theta) \, d\theta$$
$$\log m_i(Y) \approx \log L(Y \mid \hat\theta) - \tfrac12 (p + r + 2) \log n$$
Neglecting π(i) = assigning the same prior probability to all models = assigning high prior probability to the event that there are many regressors
Example: 200 markers
200 models with one additive term
$\binom{200}{2} = 19\,900$ models with one interaction or with two additive terms
$\binom{200}{100} = 9.05 \times 10^{58}$ models with 100 additive terms
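The model counts in this example can be reproduced with exact integer arithmetic. This is a small check using the standard library; the variable names are illustrative.

```python
from math import comb

M = 200                                 # number of markers in the example
one_additive = comb(M, 1)               # models with one additive term
two_terms = comb(M, 2)                  # one interaction, or two additive terms
hundred_additive = comb(M, 100)         # models with 100 additive terms

print(one_additive, two_terms, f"{hundred_additive:.3e}")
```

A uniform prior over models therefore puts overwhelmingly more mass on models with about 100 terms than on models with one or two, which is the imbalance the example illustrates.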
Idea: supplement BIC with a more realistic prior distribution π
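$$S(i) = \log L(Y \mid \hat\theta) - \tfrac12 (p(i) + r(i)) \log n + \log \pi(i)$$
$$\log L(Y \mid \hat\theta) = C(n) - \tfrac{n}{2} \log RSS$$
RSS - residual sum of squares from regression
$$\tilde S(i) = n \log RSS + (p(i) + r(i)) \log n - 2 \log \pi(i)$$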
Choice of π (George and McCulloch, 1993)
M - number of markers
$N = \frac{M(M-1)}{2}$ - number of potential interactions
α - the probability that the i-th additive term appears in the model
ν - the probability that the j-th interaction term appears in the model
$$\pi(M) = \alpha^{p} \, \nu^{r} \, (1 - \alpha)^{M - p} \, (1 - \nu)^{N - r}$$
where, on the left-hand side, M denotes a model with p additive terms and r interactions
We choose $\alpha = 1/l$ and $\nu = 1/u$
$$\log \pi(M) = C(M, N, l, u) - p \log(l - 1) - r \log(u - 1)$$
$$\tilde S(i) = n \log RSS + (p + r) \log n + 2p \log(l - 1) + 2r \log(u - 1)$$
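The step from the product form of the prior to the log-prior with penalties log(l - 1) and log(u - 1) follows from α/(1 - α) = 1/(l - 1) and ν/(1 - ν) = 1/(u - 1). The sketch below checks this numerically; the function names are hypothetical, and the constant is taken to be M log(1 - 1/l) + N log(1 - 1/u), which is what the algebra gives.

```python
import math

def log_prior_product(p, r, M, l, u):
    """log of alpha^p nu^r (1-alpha)^(M-p) (1-nu)^(N-r) with alpha=1/l, nu=1/u."""
    N = M * (M - 1) // 2
    alpha, nu = 1.0 / l, 1.0 / u
    return (p * math.log(alpha) + (M - p) * math.log(1.0 - alpha)
            + r * math.log(nu) + (N - r) * math.log(1.0 - nu))

def log_prior_penalty(p, r, M, l, u):
    """Same quantity written as C(M, N, l, u) - p log(l-1) - r log(u-1)."""
    N = M * (M - 1) // 2
    const = M * math.log(1.0 - 1.0 / l) + N * math.log(1.0 - 1.0 / u)
    return const - p * math.log(l - 1.0) - r * math.log(u - 1.0)

a = log_prior_product(p=3, r=2, M=200, l=90.0, u=9000.0)
b = log_prior_penalty(p=3, r=2, M=200, l=90.0, u=9000.0)
print(a, b)
```

Because the constant does not depend on p or r, it can be dropped when models are compared, leaving only the two penalty terms.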
Prior distribution on the number of additive terms, p - Binomial(M, α)
Prior distribution on the number of interactions, r - Binomial(N, ν)
The choice of l and u should depend on the prior knowledge of the number of QTL.
$$E(p) = \frac{M}{l}, \qquad E(r) = \frac{N}{u}$$
Our choice: for the sample size 200, the probability of wrongly detecting a QTL (when there are none) is ≈ 0.05.
We keep E(p) and E(r) equal to 2.2.
The choice is supported by a theoretical bound on the type I error based on the Bonferroni inequality.
$$\tilde S(i) = n \log RSS + (p + r) \log n + 2p \log(M / 2.2) + 2r \log(N / 2.2)$$
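The criterion with E(p) = E(r) = 2.2 can be written as a small function of the residual sum of squares and the model size. This is a sketch only: the name `modified_bic` and the parameter `expected_terms` are hypothetical, and the penalty uses the form log(M/2.2) and log(N/2.2) shown on the slide. Since this form of the criterion equals minus twice the log-scale criterion up to a constant, smaller values indicate better models.

```python
import math

def modified_bic(rss, n, p, r, M, expected_terms=2.2):
    """n log RSS + (p + r) log n + 2p log(M/2.2) + 2r log(N/2.2); smaller is better."""
    N = M * (M - 1) / 2                      # number of potential interactions
    return (n * math.log(rss)
            + (p + r) * math.log(n)
            + 2 * p * math.log(M / expected_terms)
            + 2 * r * math.log(N / expected_terms))

# Illustrative numbers (not from the source): does one extra additive term pay off?
n, M = 200, 200
null_score = modified_bic(rss=250.0, n=n, p=0, r=0, M=M)
one_term_score = modified_bic(rss=200.0, n=n, p=1, r=0, M=M)
print(null_score, one_term_score)
```

An interaction term is penalized more heavily than an additive term because N is much larger than M.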
The additional penalty is similar to the Risk Inflation Criterion of Foster and George (2k log t, where t is the total number of available regressors) and to the modification of BIC proposed by Siegmund (2004).
Search over 12 chromosomes, markers spaced every 10 cM

               additive terms        interactions
n    h^2     p    corr.  extr.    r    corr.  extr.
200  0       0    0.95   0.03     0    -      0.02
500  0       0    0.99   0.01     0    -      0
200  0.2     1    1      0.03     0    0      0.02
200  0.195   0    -      0.01     1    0.95   0.04

               additive terms        interactions
n    h^2     p    corr.  extr.    r    corr.  extr.
200  0.55    0    -      0.02     3    2.88   0.08
200  0.5     7    5.06   0.26     0    -      0.09
500  0.5     7    6.99   0.14     0    -      0.03
200  0.43    12   2.39   0.31     0    -      0.03
500  0.43    12   9.68   0.47     0    -      0.02
200  0.71    12   9.53   0.75     0    -      0.02
200  0.53    2    1.95   0.04     5    2.11   0.11
500  0.53    2    2      0.03     5    3.47   0.08
• The criterion adjusts well to the number of available markers
• For n = 200 the criterion detects almost all additive QTL with individual $h^2 = 0.13$ and interactions with $h^2 = 0.2$.
• For n = 500 the criterion detects almost all additive QTL with individual $h^2 = 0.06$ and interactions with $h^2 = 0.12$.
Bound for the type I error
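$S_1$ - the maximum of the criterion over all one-dimensional models
$S_0 = \log L(Y \mid \hat\mu_0, \hat\sigma_0)$ - the value of the criterion for the null model
$D$ - the number of terms chosen by our criterion
$$P(D > 0) \approx P(S_1 > S_0)$$
$S_{M_i}$ - the value of the criterion for a given one-dimensional model $M_i$
$S_{M_i} > S_0$ if
$$2 \log \frac{L(Y \mid \hat\theta_{M_i})}{L(Y \mid \hat\theta_0)} > \log n + 2 \, (\log(l - 1) \text{ or } \log(u - 1))$$
$$P(S_{M_i} > S_0) \approx 2 \, P\!\left(Z > \sqrt{\log n + 2 \, (\log(l - 1) \text{ or } \log(u - 1))}\right), \quad \text{where } Z \sim N(0, 1)$$
By the Bonferroni inequality and the bound
$$P(Z > x) \le \frac{1}{x \sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right)$$
we obtain
$$P(S_1 > S_0) \le \frac{M}{(l - 1) \, C(l, n)} + \frac{N}{(u - 1) \, C(u, n)}, \qquad C(t, n) = \sqrt{\frac{\pi n \, (\log n + 2 \log(t - 1))}{2}}$$
For $l = M / 2.2$ and $u = N / 2.2$:
$$P(S_1 > S_0) \lesssim \frac{4.4}{\sqrt{2 \pi n}} \left( \frac{1}{\sqrt{\log n + 2 \log(l - 1)}} + \frac{1}{\sqrt{\log n + 2 \log(u - 1)}} \right)$$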
For n=200 and typical values of M this yields values in the range between 0.057 and 0.08.
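The quoted range can be checked by evaluating the bound numerically. The sketch below assumes the bound has the form (4.4 / sqrt(2πn)) (1/sqrt(log n + 2 log(l - 1)) + 1/sqrt(log n + 2 log(u - 1))) with l = M/2.2 and u = N/2.2, and the marker counts tried are illustrative. The function name `type_one_bound` is hypothetical.

```python
import math

def type_one_bound(n, M, expected_terms=2.2):
    """Approximate Bonferroni bound on P(S_1 > S_0) for sample size n and M markers."""
    N = M * (M - 1) / 2                      # number of potential interactions
    l = M / expected_terms
    u = N / expected_terms
    additive = 1.0 / math.sqrt(math.log(n) + 2.0 * math.log(l - 1.0))
    interaction = 1.0 / math.sqrt(math.log(n) + 2.0 * math.log(u - 1.0))
    return 2.0 * expected_terms / math.sqrt(2.0 * math.pi * n) * (additive + interaction)

for M in (12, 50, 200):                      # illustrative marker counts
    print(M, round(type_one_bound(200, M), 3))
```

With n = 200 the bound decreases slowly as M grows, from about 0.08 for a dozen markers to just under 0.06 for 200 markers.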