1 CASE STUDY: Genetic Linkage Analysis via Bayesian Networks We speculate a locus with alleles H...
Date posted: 21-Dec-2015
CASE STUDY: Genetic Linkage Analysis via Bayesian Networks
We speculate a locus with alleles H (Healthy) / D (affected)
If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively considered physically close.
[Figure: pedigree of five individuals (1-5), each labeled with a phenotype (H or D) and a marker genotype (A1/A1, A2/A2 or A1/A2). After the phase is inferred, one offspring is marked as recombinant.]
The Variables Involved
$L_{ijm}$ = Maternal allele at locus $i$ of person $j$. The values of this variable are the possible alleles $l_i$ at locus $i$.
$L_{ijf}$ = Paternal allele at locus $i$ of person $j$. The values of this variable are the possible alleles $l_i$ at locus $i$ (same as for $L_{ijm}$).
$X_{ij}$ = Unordered allele pair at locus $i$ of person $j$ (“the genotype”). The values are pairs of $i$th-locus alleles $(l_i, l'_i)$.
$Y_j$ = person $j$ is affected/not affected (“the phenotype”).
$S_{ijm}$ = a binary variable {0,1} that determines which of the mother's two alleles person $j$ receives at locus $i$. Similarly, $S_{ijf}$ = a binary variable {0,1} that determines which of the father's two alleles is received.
It remains to specify the joint distribution that governs these variables. Bayesian networks turn out to be a perfect choice.
The Bayesian network for Linkage
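[Figure: Bayesian network with one layer per locus: Locus 1, Locus 2 (Disease), Locus 3 and Locus 4. At locus $i$ the nodes are the alleles $L_{i1f}$, $L_{i1m}$, $L_{i2f}$, $L_{i2m}$, $L_{i3f}$, $L_{i3m}$, the selectors $S_{i3f}$, $S_{i3m}$ and the genotypes $X_{i1}$, $X_{i2}$, $X_{i3}$. The phenotype nodes $Y_1$, $Y_2$, $Y_3$ attach at the disease locus.]
This network depicts the qualitative relations between the variables. We have already specified the local conditional probability tables.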
Details regarding recombination
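[Figure: the network for persons 1-3 at loci 1 and 2. Locus 1 has nodes $L_{11f}$, $L_{11m}$, $L_{12f}$, $L_{12m}$, $L_{13f}$, $L_{13m}$, $S_{13f}$, $S_{13m}$, $X_{11}$, $X_{12}$, $X_{13}$; locus 2 has the corresponding nodes $L_{21f}$, ..., $X_{23}$; the phenotype nodes are $Y_1$, $Y_2$, $Y_3$.]
$$P(s_{23t} \mid s_{13t}, \theta) = \begin{pmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{pmatrix}, \qquad t \in \{m, f\}$$
where $\theta$ is the recombination fraction between loci 2 & 1.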
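The selector table above can be sketched in a few lines of Python. This is only an illustration: the function name `selector_transition` is hypothetical, and the range check assumes the usual convention that a recombination fraction lies between 0 and 0.5.

```python
def selector_transition(theta):
    """Return P(s_next | s_prev, theta) as a 2x2 table indexed [s_prev][s_next]."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction is assumed to lie in [0, 0.5]")
    # With probability theta the selector switches between adjacent loci.
    return [[1.0 - theta, theta],
            [theta, 1.0 - theta]]

table = selector_transition(0.1)
```

Each row of `table` is a probability distribution over the selector value at the next locus.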
Details regarding the Loci
The phenotype variables $Y_j$ take the value 0 or 1 (e.g., affected or not affected) and are connected to the $X_{ij}$ variables only at the disease locus. For example, a model of a perfectly recessive disease yields the penetrance probabilities:
$P(y_{11} = \text{sick} \mid X_{11} = (a,a)) = 1$
$P(y_{11} = \text{sick} \mid X_{11} = (A,a)) = 0$
$P(y_{11} = \text{sick} \mid X_{11} = (A,A)) = 0$
[Figure: network fragment with nodes $L_{i1f}$, $L_{i1m}$, $L_{i3m}$, $X_{i1}$, $S_{i3m}$ and $Y_1$.]
$P(L_{11m} = a)$ is the frequency of allele $a$.
$X_{11}$ is an unordered allele pair at locus 1 of person 1 (“the data”). $P(x_{11} \mid l_{11m}, l_{11f})$ = 0 or 1, depending on whether the pair is consistent with the two alleles.
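A minimal sketch of these two local tables, assuming alleles are written as the strings "A" and "a" as in the recessive example. The function names are hypothetical and chosen only for this illustration.

```python
def penetrance_recessive(genotype):
    """P(y = sick | X = genotype) for a perfectly recessive disease allele 'a'."""
    return 1.0 if sorted(genotype) == ["a", "a"] else 0.0

def genotype_given_alleles(x, l_m, l_f):
    """P(x | l_m, l_f): 1 if the unordered pair x matches the two alleles, else 0."""
    return 1.0 if sorted(x) == sorted((l_m, l_f)) else 0.0

carrier_risk = penetrance_recessive(("A", "a"))
consistent = genotype_given_alleles(("A", "a"), "a", "A")
```

Sorting both pairs makes the comparison independent of allele order, which matches the definition of $X_{ij}$ as an unordered pair.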
SUPERLINK
Stage 1: each pedigree is translated into a Bayesian network.
Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).
Stage 3: an elimination order for the variables is determined, according to some heuristic.
Stage 4: the likelihood of the pedigrees given the values is calculated using variable elimination according to the elimination order determined in stage 3.
Allele recoding and special matrix multiplication are used.
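Stages 2 and 4 can be illustrated on a toy network. The sketch below is a generic sum-product variable elimination written for this note, not Superlink's implementation. The factor representation (a tuple of variable names plus a dict from value tuples to probabilities) and the two-node example are assumptions, and every variable in the order is assumed to appear in at least one factor.

```python
from itertools import product

def sum_out(factors, var, domains):
    """Multiply all factors that mention `var`, then sum `var` out."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = sorted({v for names, _ in touching for v in names if v != var})
    table = {}
    for values in product(*(domains[v] for v in scope)):
        assignment = dict(zip(scope, values))
        total = 0.0
        for x in domains[var]:
            assignment[var] = x
            term = 1.0
            for names, tab in touching:
                term *= tab[tuple(assignment[v] for v in names)]
            total += term
        table[values] = total
    return rest + [(tuple(scope), table)]

def likelihood(factors, order, domains):
    """Eliminate the variables in the given order and return the final scalar."""
    for var in order:
        factors = sum_out(factors, var, domains)
    result = 1.0
    for _, tab in factors:
        result *= tab[()]
    return result

# Toy network A -> B with evidence B = 1.
p_a = (("A",), {(0,): 0.6, (1,): 0.4})
p_b_given_a = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5})
# Value elimination: the observed B keeps only its consistent value.
domains = {"A": [0, 1], "B": [1]}
p_evidence = likelihood([p_a, p_b_given_a], ["A", "B"], domains)
```

Here `p_evidence` is $P(B=1) = 0.6 \cdot 0.1 + 0.4 \cdot 0.5$. Restricting `domains` plays the role of value elimination, and `order` plays the role of the elimination order chosen in stage 3.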
Comparing to the HMM model
[Figure: hidden Markov chains over the loci, with hidden nodes $S_1, S_2, S_3, \dots, S_{i-1}, S_i, S_{i+1}$ and observed nodes $X_1, X_2, X_3, \dots, X_{i-1}, X_i, X_{i+1}$; in one of the chains the disease locus is observed through $Y_{i-1}$ instead of $X_{i-1}$.]
The compounded variable $S_i = (S_{i,1,m},\dots,S_{i,2n,f})$ is called the inheritance vector. It has $2^{2n}$ states, where $n$ is the number of persons that have parents in the pedigree (non-founders). The compounded variable $X_i = (X_{i,1,m},\dots,X_{i,2n,f})$ is the data regarding locus $i$. Similarly, for the disease locus we use $Y_i$.
REMARK: The HMM approach is equivalent to the Bayesian network approach provided we sum variables locus-after-locus, say from left to right.
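To get a feel for how quickly the inheritance-vector state space grows, the following sketch enumerates all selector assignments for a given number of non-founders. The function name is hypothetical, and the enumeration is only practical for very small pedigrees.

```python
from itertools import product

def inheritance_vectors(n_nonfounders):
    """Enumerate all assignments to the 2n binary selectors (one m and one f per non-founder)."""
    return list(product((0, 1), repeat=2 * n_nonfounders))

vectors = inheritance_vectors(3)
n_states = len(vectors)  # 2 ** (2 * 3)
```

Every additional non-founder multiplies the number of hidden states per locus by four.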
Experiment A (V1.0)
• Same topology (57 people, no loops)
• Increasing number of loci (each one with 4-5 alleles)
• Run time is in seconds.
Programs compared (column order): Superlink, Fastlink, Vitesse, Genehunter.
Elimination order: General, Person-by-Person, Locus-by-Locus (HMM).

File   No. of loci   Recorded run times
A0     2             0.03, 0.12, 0.27
A1     5             0.1, 3.77, 0.31
A2     6             0.14, 79.32, 0.39
A3     7             0.42, 0.69
A4     8             0.36, 2.81
A5     10            1.19, 84.66
A6     12            4.65
A7     14            3.01
A8     18            20.98
A9     37            8510.15
A10    38            10446.27
A11    40            (none)

Each row lists the run times that were recorded, left to right. Runs without a time are covered by the slide's annotations: "over 100 hours", "Out-of-memory", and "Pedigree size too big for Genehunter".
Experiment C (V1.0)
• Same topology (5 people, no loops)
• Increasing number of loci (each one with 3-6 alleles)
• Run time is in seconds.
Software (column order): Superlink, Fastlink, Vitesse, Genehunter.
Order type (same order): Bayesnets, Trees, Trees, HMM.

File   No. of loci   Recorded run times
D0     100           0.16 (2 l.e.), 0.41 (99 l.e.)
D1     110           0.2 (2 l.e.), 0.45 (109 l.e.)
D2     120           0.21 (2 l.e.), 0.48 (119 l.e.)
D3     130           0.22 (2 l.e.), 0.49 (129 l.e.)
D4     140           0.24 (2 l.e.), 0.51 (139 l.e.)
D5     150           0.25 (2 l.e.), 0.53 (149 l.e.)
D6     160           0.27 (2 l.e.), 0.54 (159 l.e.)
D7     170           0.3 (2 l.e.), 0.6 (169 l.e.)
D8     180           0.3 (2 l.e.), 0.59 (179 l.e.)
D9     190           0.32 (2 l.e.), 0.61 (189 l.e.)
D10    200           0.34 (2 l.e.), 0.66 (199 l.e.)
D11    210           0.37 (2 l.e.), 0.67 (209 l.e.)

Each row lists the two run times that were recorded, left to right. The remaining two programs are annotated on the slide with "Out-of-memory" and "Bus error".
Some options for improving efficiency
1. Multiplying special probability tables efficiently.
2. Grouping alleles together and removing inconsistent alleles.
3. Optimizing the elimination order of variables in a Bayesian network.
4. Performing approximate calculations of the likelihood.
$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$
Standard usage of linkage
There are usually 5-15 markers. 20-30% of the persons in large pedigrees are genotyped (namely, their $x_{ij}$ is measured). For each genotyped person about 90% of the loci are measured correctly. The recombination fraction between every two loci is known from previous studies (available genetic maps).
The user adds a locus called the “disease locus” and places it between two markers $i$ and $i+1$. The recombination fraction $\theta'$ between the disease locus and marker $i$ and $\theta''$ between the disease locus and marker $i+1$ are the unknown parameters being estimated using the likelihood function.
This computation is done for every gap between the given markers on the map. The MLE hints at the whereabouts of a single gene causing the disease (if a single one exists).
Relation to Treewidth
The unconstrained Elimination Problem reduces to finding treewidth if:
• the weight of each vertex is constant,
• the cost function is $C(X_\alpha) = \max_i C_{G_i}(X_{\alpha(i)})$.
• Finding the treewidth of a graph is known to be NP-complete (Arnborg et al., 1987).
• When no edges are added, the elimination sequence is perfect and the graph is chordal.
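The link between an elimination order, fill-in edges and width can be sketched as follows. This is an illustrative helper written for this note (the name `elimination_width` is hypothetical); it assumes unit vertex weights and measures the cost of a step as the number of neighbours the eliminated vertex has at that moment.

```python
def elimination_width(edges, order):
    """Return (largest neighbourhood met during elimination, number of fill-in edges)."""
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    width, fill = 0, 0
    for v in order:
        nbrs = list(adj.pop(v))
        width = max(width, len(nbrs))
        for u in nbrs:
            adj[u].discard(v)
        # Connect the remaining neighbours of v pairwise (fill-in edges).
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return width, fill

chain = elimination_width([("a", "b"), ("b", "c")], ["a", "b", "c"])
cycle = elimination_width([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")],
                          ["a", "b", "c", "d"])
```

For the chain no edge is added, so the order is perfect. For the 4-cycle one fill-in edge is needed and the width is 2. Minimizing the width over all orders gives the treewidth, which is the hard part.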
Parameter Estimation Lecture #10
Acknowledgement: Some slides of this lecture are due to Nir Friedman.
Likelihood function for a die: Multinomial sampling
Let $X$ be a random variable with 6 values $x_1,\dots,x_6$ denoting the six outcomes of a die. Suppose we observe a sequence of independent outcomes:
Data = $(x_6,x_1,x_1,x_3,x_2,x_2,x_3,x_4,x_5,x_2,x_6)$
What is the probability of this data?
If we knew the long-run frequencies $\theta_i$ for falling on side $x_i$, then
$$P(\text{Data} \mid \Theta) = \theta_1^{2}\,\theta_2^{3}\,\theta_3^{2}\,\theta_4\,\theta_5\,\Bigl(1 - \sum_{i=1}^{5}\theta_i\Bigr)^{2}$$
where $\Theta = \{\theta_1,\theta_2,\theta_3,\theta_4,\theta_5\}$ are called the parameters of the likelihood function. We wish to estimate these parameters from the data we have seen.
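As a quick numerical check, the likelihood of the observed sequence can be computed directly from the outcomes. The sketch below encodes side $x_i$ as the integer `i` and evaluates the likelihood for a fair die; the function name and this encoding are choices made for the illustration.

```python
data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]

def likelihood(data, theta):
    """theta holds theta_1..theta_5; the sixth frequency is 1 minus their sum."""
    probs = list(theta) + [1.0 - sum(theta)]
    result = 1.0
    for outcome in data:
        result *= probs[outcome - 1]
    return result

fair = likelihood(data, [1 / 6] * 5)
```

For a fair die every factor is 1/6, so `fair` equals $(1/6)^{11}$ up to rounding.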
Sufficient Statistics
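To compute the probability of the data in the die example we only need to record the number of times $N_i$ the die fell on side $i$ (namely, $N_1, N_2,\dots,N_6$).
We do not need to recall the entire sequence of outcomes:
$$P(\text{Data} \mid \Theta) = \theta_1^{N_1}\,\theta_2^{N_2}\,\theta_3^{N_3}\,\theta_4^{N_4}\,\theta_5^{N_5}\,\Bigl(1 - \sum_{i=1}^{5}\theta_i\Bigr)^{N_6}$$
$\{N_i \mid i=1,\dots,6\}$ is called the sufficient statistics for the multinomial sampling.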
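The same idea in code: the sketch below reduces the sequence to its six counts and evaluates the likelihood from the counts alone. Function names are hypothetical, and the data are the eleven outcomes from the die example.

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]

def counts(data):
    """Return [N_1, ..., N_6], the number of times each side was observed."""
    c = Counter(data)
    return [c[i] for i in range(1, 7)]

def likelihood_from_counts(n, theta):
    """Evaluate the multinomial likelihood using only the counts."""
    probs = list(theta) + [1.0 - sum(theta)]
    result = 1.0
    for p, k in zip(probs, n):
        result *= p ** k
    return result

n = counts(data)
value = likelihood_from_counts(n, [1 / 6] * 5)
```

Any reordering of `data` yields the same `n`, and therefore the same likelihood.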
Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, $s(\text{Data})$ is a sufficient statistic if for any two datasets $D$ and $D'$:
$$s(D) = s(D') \;\Rightarrow\; P(D \mid \Theta) = P(D' \mid \Theta)$$
[Figure: mapping from datasets to statistics.]
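Maximum Likelihood Estimate
The maximum likelihood estimate is an assignment to the parameters that maximizes the probability of the data (i.e., the likelihood function).
Usually one maximizes the log-likelihood function, which is easier to do and gives an identical answer:
$$\log P(\text{Data} \mid \Theta) = \log\Bigl[\theta_1^{N_1}\,\theta_2^{N_2}\,\theta_3^{N_3}\,\theta_4^{N_4}\,\theta_5^{N_5}\,\Bigl(1 - \sum_{i=1}^{5}\theta_i\Bigr)^{N_6}\Bigr] = \sum_{i=1}^{5} N_i \log\theta_i + N_6 \log\Bigl(1 - \sum_{i=1}^{5}\theta_i\Bigr)$$
A sufficient condition for a maximum is:
$$\frac{\partial \log P(\text{Data} \mid \Theta)}{\partial \theta_i} = \frac{N_i}{\theta_i} - \frac{N_6}{1 - \sum_{k=1}^{5}\theta_k} = 0$$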
Finding the Maximum
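We have just found that:
$$\frac{N_i}{\theta_i} = \frac{N_6}{1 - \sum_{k=1}^{5}\theta_k}, \qquad i = 1,\dots,5$$
Writing $\theta_6 = 1 - \sum_{k=1}^{5}\theta_k$, divide the $j$th and $i$th equations:
$$\frac{\theta_j}{\theta_i} = \frac{N_j}{N_i}$$
Sum from $j=1$ to 6:
$$\frac{1}{\theta_i} = \sum_{j=1}^{6}\frac{\theta_j}{\theta_i} = \frac{\sum_{j=1}^{6} N_j}{N_i} = \frac{N}{N_i}$$
Hence the MLE is given by:
$$\theta_i = \frac{N_i}{N}, \qquad i = 1,\dots,6$$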
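A short sketch of the resulting estimator, applied to the counts from the die example ($N_1,\dots,N_6 = 2,3,2,1,1,2$). The function name is chosen for this illustration only.

```python
def mle(counts):
    """Maximum likelihood estimate: relative frequency of each outcome."""
    total = sum(counts)
    return [n / total for n in counts]

theta_hat = mle([2, 3, 2, 1, 1, 2])
# Stationarity check from the derivation: N_i / theta_i is the same for every i.
ratios = [n / t for n, t in zip([2, 3, 2, 1, 1, 2], theta_hat)]
```

Every entry of `ratios` equals $N = 11$ up to rounding, which is the condition obtained by setting the derivative to zero.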
Adding Pseudo Counts
The MLE given by
$$\theta_i = \frac{N_i}{N}, \qquad i = 1,\dots,6$$
can be misleading for small data sets because it could happen that a small data set is not typical. For example, it might be that we know that the die is manufactured to be loaded but the small dataset we examined does not show this property.
The MAP estimate is given by
$$\theta_i = \frac{N_i + N'_i}{N + N'}, \qquad i = 1,\dots,6$$
The six pseudo counts $N'_i$ sum to $N'$. They express one’s assessment regarding the frequencies for each side prior to seeing the data. Large $N'$ indicates high confidence. Values smaller than 1 are possible.
The MAP estimate can be justified as maximizing one’s posterior (namely, after seeing the data) best estimate of the frequencies for each side. The theory formally justifying this formula is called Bayesian Statistics (not covered in this course due to time constraints).
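The effect of pseudo counts is easy to see numerically. The sketch below applies the formula to the die counts with one pseudo count per side; the choice of pseudo counts and the function name are assumptions made only for the example.

```python
def estimate_with_pseudo_counts(counts, pseudo):
    """Return (N_i + N'_i) / (N + N') for each outcome."""
    total = sum(counts) + sum(pseudo)
    return [(n + p) / total for n, p in zip(counts, pseudo)]

observed = [2, 3, 2, 1, 1, 2]
smoothed = estimate_with_pseudo_counts(observed, [1] * 6)
unseen = estimate_with_pseudo_counts([5, 0, 0, 0, 0, 0], [1] * 6)
```

With the plain MLE, a side that never appeared would get frequency 0; in `unseen` it gets 1/11 instead.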
Example: The ABO locus
Recall that a locus is a particular place on the chromosome. Each locus’ state (called genotype) consists of two alleles: one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.
The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. Genotypes a/a and a/o determine blood type A, genotypes b/b and b/o determine blood type B, genotype a/b determines blood type AB, and genotype o/o determines blood type O.
We wish to estimate the proportion in a population of the 6 genotypes.
Suppose we randomly sampled $N$ individuals and found that $N_{a/a}$ have genotype a/a, $N_{a/b}$ have genotype a/b, etc. Then, the MLE is given by:
$$\theta_{a/a} = \frac{N_{a/a}}{N},\quad \theta_{a/o} = \frac{N_{a/o}}{N},\quad \theta_{b/b} = \frac{N_{b/b}}{N},\quad \theta_{b/o} = \frac{N_{b/o}}{N},\quad \theta_{a/b} = \frac{N_{a/b}}{N},\quad \theta_{o/o} = \frac{N_{o/o}}{N}$$
The ABO locus (Cont.)
However, testing individuals for their genotype is a very expensive test. Can we estimate the proportions of genotypes using the common cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?
The problem is that among individuals measured to have blood type A, we don’t know how many have genotype a/a and how many have genotype a/o. So what can we do?
We use the Hardy-Weinberg equilibrium rule, which tells us that in equilibrium the frequencies of the three alleles a, b, o in the population determine the frequencies of the genotypes as follows:
$$\theta_{a/b} = 2\theta_a\theta_b,\quad \theta_{a/o} = 2\theta_a\theta_o,\quad \theta_{b/o} = 2\theta_b\theta_o,\quad \theta_{a/a} = \theta_a^2,\quad \theta_{b/b} = \theta_b^2,\quad \theta_{o/o} = \theta_o^2$$
So now we have three parameters that we need to estimate.
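A small sketch of the Hardy-Weinberg rule in code: given the three allele frequencies it returns the six genotype frequencies and the four blood-type frequencies. The allele frequencies used in the example (0.3, 0.1, 0.6) are made up for illustration.

```python
def genotype_frequencies(a, b, o):
    """Hardy-Weinberg genotype frequencies from allele frequencies a, b, o."""
    return {"a/a": a * a, "a/o": 2 * a * o,
            "b/b": b * b, "b/o": 2 * b * o,
            "a/b": 2 * a * b, "o/o": o * o}

def blood_type_frequencies(a, b, o):
    """Blood types merge the genotypes that the cheap test cannot tell apart."""
    g = genotype_frequencies(a, b, o)
    return {"A": g["a/a"] + g["a/o"], "B": g["b/b"] + g["b/o"],
            "AB": g["a/b"], "O": g["o/o"]}

blood = blood_type_frequencies(0.3, 0.1, 0.6)
```

Because $(\theta_a + \theta_b + \theta_o)^2 = 1$, both dictionaries sum to 1 whenever the allele frequencies do.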
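The Likelihood Function
Let $X$ be a random variable with 6 values $x_{a/a}, x_{a/o}, x_{b/b}, x_{b/o}, x_{a/b}, x_{o/o}$ denoting the six genotypes. The parameters are $\Theta = \{\theta_a, \theta_b, \theta_o\}$.
The probability $P(X = x_{a/b} \mid \Theta) = 2\theta_a\theta_b$.
The probability $P(X = x_{o/o} \mid \Theta) = \theta_o\theta_o$. And so on for the other four genotypes.
What is the probability of Data = {B, A, B, B, O, A, B, A, O, B, AB}?
$$P(\text{Data} \mid \Theta) = \bigl(\theta_a^2 + 2\theta_a\theta_o\bigr)^{3}\,\bigl(\theta_b^2 + 2\theta_b\theta_o\bigr)^{5}\,\bigl(2\theta_a\theta_b\bigr)^{1}\,\bigl(\theta_o^2\bigr)^{2}$$
Obtaining the maximum of this function yields the MLE. This can be done by a multidimensional Newton’s algorithm.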
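The likelihood of the blood-type data can be evaluated for any candidate allele frequencies. The sketch below works with the log-likelihood for numerical convenience; the function name and the starting point (all three frequencies equal to 1/3) are assumptions for the illustration.

```python
import math
from collections import Counter

def log_likelihood(blood_types, a, b, o):
    """Log-likelihood of observed blood types under Hardy-Weinberg frequencies."""
    p = {"A": a * a + 2 * a * o,
         "B": b * b + 2 * b * o,
         "AB": 2 * a * b,
         "O": o * o}
    n = Counter(blood_types)
    return sum(k * math.log(p[t]) for t, k in n.items())

data = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]
ll = log_likelihood(data, 1 / 3, 1 / 3, 1 / 3)
```

The exponents 3, 5, 1 and 2 in the formula are exactly the counts of A, B, AB and O in `data`.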
Computing MLE
Finding MLE parameters: nonlinear optimization problem
[Figure: plot of the likelihood $P(\text{Data} \mid \Theta)$.]
Gradient Ascent (Newton-like methods): Follow the gradient of the likelihood w.r.t. the parameters (as taught in your favorite Numerical Analysis course). Improve by adding line search methods to determine the step size and get faster convergence. Start at several random locations.
Gene Counting
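Had we known the counts $n_{a/a}$ and $n_{a/o}$ (blood type A individuals), we could have estimated $\theta_a$ from $n$ individuals as follows (and similarly estimate $\theta_b$ and $\theta_o$):
$$\theta_a = \frac{2n_{a/a} + n_{a/o} + n_{a/b}}{2n}$$
Can we compute what $n_{a/a}$ and $n_{a/o}$ are expected to be? Using the current estimates of $\theta_a$ and $\theta_o$ we can, as follows:
$$n_{a/a} = n_A\,\frac{\theta_a^2}{\theta_a^2 + 2\theta_a\theta_o}, \qquad n_{a/o} = n_A\,\frac{2\theta_a\theta_o}{\theta_a^2 + 2\theta_a\theta_o}$$
We repeat these two steps until the parameters converge.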
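Gene Counting (example of EM)
Input: Counts of each blood type $n_A, n_B, n_O, n_{AB}$ of $n$ people.
Desired Output: ML estimate of allele frequencies $\theta_a, \theta_b, \theta_o$.
Initialization: Set $\theta_a$, $\theta_b$, and $\theta_o$ to arbitrary values (say, 1/3).
Repeat
E-step (Expectation):
$$n_{a/a} = n_A\,\frac{\theta_a^2}{\theta_a^2 + 2\theta_a\theta_o}, \qquad n_{a/o} = n_A\,\frac{2\theta_a\theta_o}{\theta_a^2 + 2\theta_a\theta_o}$$
$$n_{b/b} = n_B\,\frac{\theta_b^2}{\theta_b^2 + 2\theta_b\theta_o}, \qquad n_{b/o} = n_B\,\frac{2\theta_b\theta_o}{\theta_b^2 + 2\theta_b\theta_o}$$
M-step (Maximization):
$$\theta_a = \frac{2n_{a/a} + n_{a/o} + n_{AB}}{2n}, \qquad \theta_b = \frac{2n_{b/b} + n_{b/o} + n_{AB}}{2n}, \qquad \theta_o = \frac{2n_O + n_{a/o} + n_{b/o}}{2n}$$
Until $\theta_a$, $\theta_b$, and $\theta_o$ converge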
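The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the stopping rule (`tol`, `max_iter`) is an assumption, and the code assumes the allele frequency estimates stay strictly positive so that the E-step denominators are non-zero. The example counts are those of the blood-type data set from the likelihood slide (3 A, 5 B, 1 AB, 2 O).

```python
def gene_counting(n_A, n_B, n_AB, n_O, tol=1e-10, max_iter=1000):
    """EM (gene counting) estimate of the allele frequencies (a, b, o)."""
    n = n_A + n_B + n_AB + n_O
    a = b = o = 1.0 / 3.0  # arbitrary starting point
    for _ in range(max_iter):
        # E-step: expected genotype counts among type A and type B individuals.
        n_aa = n_A * a * a / (a * a + 2 * a * o)
        n_ao = n_A * 2 * a * o / (a * a + 2 * a * o)
        n_bb = n_B * b * b / (b * b + 2 * b * o)
        n_bo = n_B * 2 * b * o / (b * b + 2 * b * o)
        # M-step: count alleles among the 2n chromosomes.
        new_a = (2 * n_aa + n_ao + n_AB) / (2 * n)
        new_b = (2 * n_bb + n_bo + n_AB) / (2 * n)
        new_o = (2 * n_O + n_ao + n_bo) / (2 * n)
        change = max(abs(new_a - a), abs(new_b - b), abs(new_o - o))
        a, b, o = new_a, new_b, new_o
        if change < tol:
            break
    return a, b, o

a, b, o = gene_counting(n_A=3, n_B=5, n_AB=1, n_O=2)
```

The three M-step numerators add up to $2n$, so the estimates sum to 1 after every iteration without any explicit normalization.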