
1

CASE STUDY: Genetic Linkage Analysis via Bayesian Networks

We speculate a locus with alleles H (Healthy) / D (affected)

If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively physically close.

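[Figure: pedigree of five individuals (numbered 1 to 5). Each individual is labeled with a disease-locus status (H or D) and a marker genotype (A1/A1, A2/A2, or A1/A2), together with the haplotypes of the offspring. Annotations: “Phase inferred” and “Recombinant”.]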

2

The Variables Involved

Lijm = Maternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i.

Lijf = Paternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i (same as for Lijm).

Xij = Unordered allele pair at locus i of person j (“the genotype”). The values are pairs of ith-locus alleles (li, l’i).

Yj = Person j is affected/not affected (“the phenotype”).

Sijm = a binary variable {0,1} that determines which maternal allele is received from the mother. Similarly, Sijf = a binary variable {0,1} that determines which paternal allele is received from the father.

It remains to specify the joint distribution that governs these variables. Bayesian networks turn out to be a perfect choice.

3

The Bayesian network for Linkage

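[Figure: Bayesian network spanning Locus 1, Locus 2 (Disease), Locus 3 and Locus 4. At each locus i the nodes are the parental alleles Li1m, Li1f, Li2m, Li2f, the offspring alleles Li3m, Li3f, the selectors Si3m, Si3f, and the genotypes Xi1, Xi2, Xi3. The phenotype nodes Y1, Y2, Y3 are attached at the disease locus.]

This network depicts the qualitative relations between the variables. We have already specified the local conditional probability tables.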

4

Details regarding recombination

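[Figure: the network for loci 1 and 2 and persons 1 to 3. Locus 1 nodes: L11m, L11f, L12m, L12f, L13m, L13f, S13m, S13f, X11, X12, X13. Locus 2 nodes: L21m, L21f, L22m, L22f, L23m, L23f, S23m, S23f, X21, X22, X23. Phenotype nodes: Y1, Y2, Y3.]

$$P(s_{23t} \mid s_{13t}, \theta) = \begin{pmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{pmatrix}, \qquad t \in \{m, f\}$$

where θ is the recombination fraction between loci 2 & 1.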
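As a rough illustration of the selector transition probabilities, the Python sketch below builds P(s23t | s13t, θ) as a 2×2 lookup table. The function name `selector_transition` and the range check on θ (recombination fractions are conventionally taken in [0, 0.5]) are assumptions made for this example and do not come from the slides.

```python
def selector_transition(theta):
    """Return P(s_23t | s_13t, theta) as table[s_13t][s_23t].

    The selector keeps its value with probability 1 - theta (no
    recombination) and flips with probability theta (recombination).
    """
    if not 0.0 <= theta <= 0.5:
        # Assumption of this sketch: theta is restricted to [0, 0.5].
        raise ValueError("recombination fraction expected in [0, 0.5]")
    return {
        0: {0: 1.0 - theta, 1: theta},
        1: {0: theta, 1: 1.0 - theta},
    }


table = selector_transition(0.1)
for previous in (0, 1):
    print(previous, table[previous])
```

With θ = 0.5 every entry equals 0.5, which corresponds to two loci that are inherited independently.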

5

Details regarding the Loci

The phenotype variables Yj take the value 0 or 1 (e.g., affected or not affected) and are connected to the Xij variables (only at the disease locus). For example, a model of a perfectly recessive disease yields the penetrance probabilities:

P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0

[Figure: network fragment with nodes Li1m, Li1f, Li3m, Si3m, Xi1 and Y1.]

P(L11m = a) is the frequency of allele a.

X11 is an unordered allele pair at locus 1 of person 1 = “the data”. P(x11 | l11m, l11f) = 0 or 1, depending on consistency.
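The two deterministic tables described here can be written down directly. The following is a minimal Python sketch, assuming alleles are represented as the strings "A" and "a" and genotypes as 2-tuples; the function names are hypothetical and chosen only for this example.

```python
def penetrance_recessive(genotype):
    """P(sick | genotype) for a perfectly recessive disease.

    Only the genotype (a, a) is affected; 'a' is the disease allele.
    """
    return 1.0 if sorted(genotype) == ["a", "a"] else 0.0


def genotype_consistency(x, l_m, l_f):
    """P(x | l_m, l_f): 1 if the unordered pair x equals {l_m, l_f}, else 0."""
    return 1.0 if sorted(x) == sorted((l_m, l_f)) else 0.0


print(penetrance_recessive(("a", "a")))          # affected genotype
print(penetrance_recessive(("A", "a")))          # carrier, not affected
print(genotype_consistency(("A", "a"), "a", "A"))  # order does not matter
```

Sorting both sides is what makes the allele pair unordered: (A, a) and (a, A) are treated as the same genotype.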

6

SUPERLINK

Stage 1: each pedigree is translated into a Bayesian network.

Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).

Stage 3: an elimination order for the variables is determined, according to some heuristic.

Stage 4: the likelihood of the pedigrees given the values is calculated using variable elimination according to the elimination order determined in stage 3.

Allele recoding and special matrix multiplication are used.

7

Comparing to the HMM model

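[Figure: the HMM for the pedigree. A hidden chain of inheritance vectors S1, S2, S3, ..., Si-1, Si, Si+1, with one observed data node per locus: X1, X2, X3, ..., Xi-1, Xi, Xi+1 (Yi-1 at the disease locus).]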

The compounded variable Si = (Si,1,m,…,Si,2n,f) is called the inheritance vector. It has 2^(2n) states, where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable Xi = (Xi,1,m,…,Xi,2n,f) is the data regarding locus i. Similarly, for the disease locus we use Yi.

REMARK: The HMM approach is equivalent to the Bayesian network approach provided we sum variables locus-after-locus, say from left to right.
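To get a feel for how quickly the state space of the inheritance vector grows, the short Python sketch below enumerates all assignments to the 2n binary selectors for a small n. The function name `inheritance_vectors` is an assumption of this example.

```python
from itertools import product


def inheritance_vectors(n_nonfounders):
    """Enumerate all assignments to the 2n binary selector variables
    (one maternal and one paternal selector per non-founder)."""
    return list(product((0, 1), repeat=2 * n_nonfounders))


vectors = inheritance_vectors(2)
print(len(vectors))  # 2**(2*2) = 16

# The count doubles with every additional selector variable.
for n in (1, 2, 5, 10):
    print(n, 2 ** (2 * n))
```

The exponential growth in n is why a locus-by-locus (HMM) order is only practical for pedigrees with few non-founders.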

8

Experiment A (V1.0)

• Same topology (57 people, no loops)
• Increasing number of loci (each one with 4-5 alleles)
• Run time is in seconds.

Files   No. of loci   Superlink    Fastlink   Vitesse   Genehunter
A0      2             0.03         0.12       0.27      -
A1      5             0.1          3.77       0.31      -
A2      6             0.14         79.32      0.39      -
A3      7             0.42         -          0.69      -
A4      8             0.36         -          2.81      -
A5      10            1.19         -          84.66     -
A6      12            4.65         -          -         -
A7      14            3.01         -          -         -
A8      18            20.98        -          -         -
A9      37            8510.15      -          -         -
A10     38            10446.27     -          -         -
A11     40            -            -          -         -

Slide annotations for the empty cells: over 100 hours; Out-of-memory; Pedigree size too big for Genehunter.

Elimination order: General; Person-by-Person; Locus-by-Locus (HMM).

9

Experiment C (V1.0)

• Same topology (5 people, no loops)
• Increasing number of loci (each one with 3-6 alleles)
• Run time is in seconds.

Software:     Superlink   Fastlink   Vitesse   Genehunter
Order type:   Bayesnets   Trees      Trees     HMM

Files   No. of loci   Superlink        Fastlink   Vitesse   Genehunter
D0      100           0.16 (2 l.e.)    -          -         0.41 (99 l.e.)
D1      110           0.2 (2 l.e.)     -          -         0.45 (109 l.e.)
D2      120           0.21 (2 l.e.)    -          -         0.48 (119 l.e.)
D3      130           0.22 (2 l.e.)    -          -         0.49 (129 l.e.)
D4      140           0.24 (2 l.e.)    -          -         0.51 (139 l.e.)
D5      150           0.25 (2 l.e.)    -          -         0.53 (149 l.e.)
D6      160           0.27 (2 l.e.)    -          -         0.54 (159 l.e.)
D7      170           0.3 (2 l.e.)     -          -         0.6 (169 l.e.)
D8      180           0.3 (2 l.e.)     -          -         0.59 (179 l.e.)
D9      190           0.32 (2 l.e.)    -          -         0.61 (189 l.e.)
D10     200           0.34 (2 l.e.)    -          -         0.66 (199 l.e.)
D11     210           0.37 (2 l.e.)    -          -         0.67 (209 l.e.)

Slide annotations for the empty columns: Out-of-memory; Bus error.

10

Some options for improving efficiency

1. Multiplying special probability tables efficiently.

2. Grouping alleles together and removing inconsistent alleles.

3. Optimizing the elimination order of variables in a Bayesian network.

4. Performing approximate calculations of the likelihood.

$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$
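The sum-of-products form of the likelihood can be illustrated on a toy network. The Python sketch below uses a hypothetical three-variable chain X1 → X2 → X3 with made-up probability tables (none of these numbers come from the slides), observes X3, and compares brute-force summation over the hidden variables with eliminating X1 first and X2 second.

```python
from itertools import product

# Hypothetical local probability tables for a binary chain X1 -> X2 -> X3.
p_x1 = [0.6, 0.4]                        # P(x1)
p_x2_given_x1 = [[0.7, 0.3], [0.2, 0.8]]  # P(x2 | x1), indexed [x1][x2]
p_x3_given_x2 = [[0.9, 0.1], [0.5, 0.5]]  # P(x3 | x2), indexed [x2][x3]
x3_observed = 1                           # the "data"


def likelihood_brute_force():
    """Sum the full product over every joint assignment of x1, x2."""
    total = 0.0
    for x1, x2 in product((0, 1), repeat=2):
        total += (p_x1[x1]
                  * p_x2_given_x1[x1][x2]
                  * p_x3_given_x2[x2][x3_observed])
    return total


def likelihood_by_elimination():
    """Eliminate x1 first (yielding a table over x2), then x2."""
    message_x2 = [
        sum(p_x1[x1] * p_x2_given_x1[x1][x2] for x1 in (0, 1))
        for x2 in (0, 1)
    ]
    return sum(message_x2[x2] * p_x3_given_x2[x2][x3_observed]
               for x2 in (0, 1))


print(likelihood_brute_force(), likelihood_by_elimination())
```

Both functions compute the same number; the elimination version never builds the full joint table, which is the point of choosing a good elimination order in larger networks.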

11

Standard usage of linkage

There are usually 5-15 markers. 20-30% of the persons in large pedigrees are genotyped (namely, their xij is measured). For each genotyped person about 90% of the loci are measured correctly. The recombination fraction between every two loci is known from previous studies (available genetic maps).

The user adds a locus called the “disease locus” and places it between two markers i and i+1. The recombination fraction θ’ between the disease locus and marker i and θ” between the disease locus and marker i+1 are the unknown parameters being estimated using the likelihood function.

This computation is done for every gap between the given markers on the map. The MLE hints at the whereabouts of a single gene causing the disease (if a single one exists).

12

Relation to Treewidth

The unconstrained Elimination Problem reduces to finding treewidth if:
• the weight of each vertex is constant,
• the cost function is

$$C(X_\alpha) = \max_i C_{G_i}(X_{\alpha(i)}).$$

• Finding the treewidth of a graph is known to be NP-complete (Arnborg et al., 1987).

• When no edges are added, the elimination sequence is perfect and the graph is chordal.
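The link between elimination orders, added edges and clique size can be made concrete with a small simulation. The Python sketch below eliminates the vertices of an undirected graph in a given order, records the fill-in edges, and reports the largest clique formed along the way (the width of the order is that size minus one, and the treewidth is the minimum width over all orders). The two example graphs and the function name `eliminate` are assumptions of this example.

```python
def eliminate(adjacency, order):
    """Simulate vertex elimination on an undirected graph.

    Returns (fill_edges, largest_clique_size), where each clique consists
    of the eliminated vertex and its neighbors at elimination time.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill_edges = []
    largest = 0
    for v in order:
        nbrs = sorted(graph[v])
        largest = max(largest, len(nbrs) + 1)
        # Connect every pair of neighbors that is not yet connected.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in graph[a]:
                    graph[a].add(b)
                    graph[b].add(a)
                    fill_edges.append((a, b))
        # Remove v from the graph.
        for u in nbrs:
            graph[u].discard(v)
        del graph[v]
    return fill_edges, largest


# A 4-cycle is not chordal: any order adds a fill-in edge.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
# The same cycle with the chord 2-4 is chordal.
chordal = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}

print(eliminate(cycle, [1, 2, 3, 4]))    # ([(2, 4)], 3)
print(eliminate(chordal, [1, 3, 2, 4]))  # ([], 3)
```

The second call adds no edges, so the order 1, 3, 2, 4 is a perfect elimination sequence for the chordal graph.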

Parameter Estimation Lecture #10

Acknowledgement: Some slides of this lecture are due to Nir Friedman.

14

Likelihood function for a die: Multinomial sampling

Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a die. Suppose we observe a sequence of independent outcomes:

Data = (x6,x1,x1,x3,x2,x2,x3,x4,x5,x2,x6)

What is the probability of this data?

If we knew the long-run frequencies θi for falling on side xi, then

$$P(\text{Data} \mid \Theta) = \theta_1^{2}\,\theta_2^{3}\,\theta_3^{2}\,\theta_4\,\theta_5 \left(1 - \sum_{i=1}^{5}\theta_i\right)^{2}$$

where Θ = {θ1, θ2, θ3, θ4, θ5} are called the parameters of the likelihood function. We wish to estimate these parameters from the data we have seen.
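The likelihood of the observed sequence can be evaluated directly for any candidate parameters. The Python sketch below encodes the outcomes as the integers 1 to 6 and takes the five free parameters θ1,…,θ5, with θ6 implied; the function name and the two parameter settings tried are assumptions of this example.

```python
data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]  # the observed sequence of sides


def likelihood(data, theta):
    """P(Data | theta) for independent rolls.

    theta holds theta_1..theta_5; theta_6 is 1 minus their sum.
    """
    probs = list(theta) + [1.0 - sum(theta)]
    result = 1.0
    for outcome in data:
        result *= probs[outcome - 1]
    return result


fair = [1 / 6] * 5
loaded = [0.1, 0.2, 0.1, 0.1, 0.2]  # implies theta_6 = 0.3

print(likelihood(data, fair))
print(likelihood(data, loaded))
```

For the `loaded` setting the product matches the closed form term by term: 0.1² · 0.2³ · 0.1² · 0.1 · 0.2 · 0.3².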

15

Sufficient Statistics

To compute the probability of the data in the die example we only need to record the number of times Ni the die fell on side i (namely, N1, N2,…,N6):

$$P(\text{Data} \mid \Theta) = \theta_1^{N_1}\,\theta_2^{N_2}\,\theta_3^{N_3}\,\theta_4^{N_4}\,\theta_5^{N_5} \left(1 - \sum_{i=1}^{5}\theta_i\right)^{N_6}$$

We do not need to recall the entire sequence of outcomes.

{Ni | i=1…6} is called the sufficient statistics for the multinomial sampling.

16

Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.

Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data’

s(Data) = s(Data’) ⇒ P(Data|Θ) = P(Data’|Θ)

[Figure: a mapping from datasets to statistics.]
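A quick way to see sufficiency in the die example is to check that two different orderings of the same outcomes give the same counts, and that the likelihood can be computed from the counts alone. This is a minimal Python sketch; the function names and the parameter values are assumptions of this example.

```python
from collections import Counter


def sufficient_statistics(data):
    """Return the counts (N_1, ..., N_6) for a sequence of die outcomes."""
    counts = Counter(data)
    return tuple(counts.get(side, 0) for side in range(1, 7))


def likelihood_from_counts(counts, theta):
    """P(Data | theta) computed from the counts only.

    theta holds theta_1..theta_5; theta_6 is 1 minus their sum.
    """
    probs = list(theta) + [1.0 - sum(theta)]
    result = 1.0
    for n_i, p_i in zip(counts, probs):
        result *= p_i ** n_i
    return result


data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]
reordered = sorted(data)  # a different dataset with the same counts

print(sufficient_statistics(data))       # (2, 3, 2, 1, 1, 2)
print(sufficient_statistics(reordered))  # (2, 3, 2, 1, 1, 2)
print(likelihood_from_counts(sufficient_statistics(data),
                             [0.1, 0.2, 0.1, 0.1, 0.2]))
```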

17

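Maximum Likelihood Estimate

The maximum likelihood estimate is an assignment to the parameters that maximizes the probability of the data (i.e., the likelihood function).

Usually one maximizes the log-likelihood function, which is easier to do and gives an identical answer:

$$\log P(\text{Data} \mid \Theta) = \log\left[\theta_1^{N_1}\,\theta_2^{N_2}\,\theta_3^{N_3}\,\theta_4^{N_4}\,\theta_5^{N_5} \left(1 - \sum_{i=1}^{5}\theta_i\right)^{N_6}\right] = \sum_{i=1}^{5} N_i \log\theta_i + N_6 \log\left(1 - \sum_{i=1}^{5}\theta_i\right)$$

A sufficient condition for a maximum is:

$$\frac{\partial \log P(\text{Data} \mid \Theta)}{\partial \theta_i} = \frac{N_i}{\theta_i} - \frac{N_6}{1 - \sum_{j=1}^{5}\theta_j} = 0, \qquad i = 1,\dots,5$$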

18

Finding the Maximum

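We have just found that:

$$\frac{N_i}{\theta_i} = \frac{N_6}{1 - \sum_{j=1}^{5}\theta_j}$$

Divide the jth equation by the ith equation:

$$\theta_j = \frac{N_j}{N_i}\,\theta_i$$

Sum from j=1 to 6:

$$1 = \frac{\sum_{j=1}^{6} N_j}{N_i}\,\theta_i$$

Hence the MLE is given by:

$$\theta_i = \frac{N_i}{N}, \qquad i = 1,\dots,6$$

where N = N1 + … + N6.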
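The closed form can be checked numerically on the counts from the die example. The Python sketch below computes θi = Ni/N and verifies the condition derived above, namely that the ratio Ni/θi is the same for every side; the function name `mle` is an assumption of this example.

```python
def mle(counts):
    """Maximum likelihood estimate theta_i = N_i / N for multinomial counts."""
    total = sum(counts)
    return [n_i / total for n_i in counts]


counts = [2, 3, 2, 1, 1, 2]  # N_1..N_6 from the die example
theta_hat = mle(counts)
print(theta_hat)

# At the maximum, N_i / theta_i takes the same value (N) for every i.
ratios = [n_i / t_i for n_i, t_i in zip(counts, theta_hat)]
print(ratios)
```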

19

Adding Pseudo Counts

The MLE given by

$$\theta_i = \frac{N_i}{N}, \qquad i = 1,\dots,6$$

can be misleading for small data sets because it could happen that a small data set is not typical. For example, it might be that we know that the die is manufactured to be loaded but the small dataset we examined does not show this property.

The MAP estimate is given by

$$\theta_i = \frac{N_i + N'_i}{N + N'}, \qquad i = 1,\dots,6$$

The six pseudo counts N’i sum to N’. They express one’s assessment regarding the frequencies for each side prior to seeing the data. A large N’ indicates high confidence. Values smaller than 1 are possible.

The MAP estimate can be justified as maximizing one’s posterior (namely, after seeing the data) best estimate of the frequencies for each side. The theory formally justifying this formula is called Bayesian Statistics (not covered in this course due to time constraints).
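The effect of pseudo counts is easiest to see on a tiny data set in which some sides never appear. The Python sketch below applies the formula (Ni + N’i)/(N + N’); the three-roll data set and the uniform pseudo counts are made-up values for illustration, and the function name is an assumption of this example.

```python
def estimate_with_pseudo_counts(counts, pseudo_counts):
    """Return (N_i + N'_i) / (N + N') for each side i."""
    total = sum(counts) + sum(pseudo_counts)
    return [(n_i + p_i) / total for n_i, p_i in zip(counts, pseudo_counts)]


counts = [0, 1, 0, 0, 0, 2]   # three rolls only: sides 2, 6, 6
uniform_prior = [1] * 6       # one pseudo count per side, N' = 6

print([n_i / sum(counts) for n_i in counts])              # MLE: zeros for unseen sides
print(estimate_with_pseudo_counts(counts, uniform_prior))  # no zero estimates
```

With only three rolls, the plain MLE assigns probability zero to four of the six sides, while the pseudo counts keep every estimate strictly positive.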

20

Example: The ABO locus

Recall that a locus is a particular place on the chromosome. Each locus’ state (called genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.

The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O.

We wish to estimate the proportion in a population of the 6 genotypes.

Suppose we randomly sampled N individuals and found that Na/a have genotype a/a, Na/b have genotype a/b, etc. Then, the MLE is given by:

$$\theta_{a/a} = \frac{N_{a/a}}{N},\quad \theta_{a/o} = \frac{N_{a/o}}{N},\quad \theta_{b/b} = \frac{N_{b/b}}{N},\quad \theta_{b/o} = \frac{N_{b/o}}{N},\quad \theta_{a/b} = \frac{N_{a/b}}{N},\quad \theta_{o/o} = \frac{N_{o/o}}{N}$$

21

The ABO locus (Cont.)

However, testing individuals for their genotype is a very expensive test. Can we estimate the proportions of genotype using the common cheap blood test with outcome being one of the four blood types (A, B, AB, O)?

The problem is that among individuals measured to have blood type A, we don’t know how many have genotype a/a and how many have genotype a/o. So what can we do?

We use the Hardy-Weinberg equilibrium rule that tells us that in equilibrium the frequencies of the three alleles a, b, o in the population determine the frequencies of the genotypes as follows: θa/b = 2θaθb, θa/o = 2θaθo, θb/o = 2θbθo, θa/a = [θa]², θb/b = [θb]², θo/o = [θo]². So now we have three parameters that we need to estimate.
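The Hardy-Weinberg rule is a direct mapping from three allele frequencies to six genotype frequencies. The Python sketch below applies it; the allele frequencies 0.3, 0.1, 0.6 are made-up values used only to show that the genotype frequencies sum to one, and the function name is an assumption of this example.

```python
def hardy_weinberg(theta_a, theta_b, theta_o):
    """Genotype frequencies implied by allele frequencies at equilibrium."""
    return {
        "a/a": theta_a ** 2,
        "a/o": 2 * theta_a * theta_o,
        "b/b": theta_b ** 2,
        "b/o": 2 * theta_b * theta_o,
        "a/b": 2 * theta_a * theta_b,
        "o/o": theta_o ** 2,
    }


genotypes = hardy_weinberg(0.3, 0.1, 0.6)
for name, freq in genotypes.items():
    print(name, round(freq, 4))
print("total", round(sum(genotypes.values()), 12))
```

The total is one because the six terms are exactly the expansion of (θa + θb + θo)².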

22

The Likelihood Function

Let X be a random variable with 6 values xa/a, xa/o, xb/b, xb/o, xa/b, xo/o denoting the six genotypes. The parameters are Θ = {θa, θb, θo}.

The probability P(X = xa/b | Θ) = 2θaθb.

The probability P(X = xo/o | Θ) = θoθo. And so on for the other four genotypes.

What is the probability of Data = {B,A,B,B,O,A,B,A,O,B,AB}?

$$P(\text{Data} \mid \Theta) = \left(\theta_a^{2} + 2\theta_a\theta_o\right)^{3} \left(\theta_b^{2} + 2\theta_b\theta_o\right)^{5} \left(2\theta_a\theta_b\right)^{1} \left(\theta_o^{2}\right)^{2}$$

Obtaining the maximum of this function yields the MLE. This can be done by a multidimensional Newton’s algorithm.
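The blood-type likelihood is straightforward to evaluate for any candidate allele frequencies. The Python sketch below works on the log scale for the eleven observations listed above; the function names and the trial value θa = θb = θo = 1/3 are assumptions of this example.

```python
import math
from collections import Counter

data = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]


def blood_type_probabilities(theta_a, theta_b, theta_o):
    """P(blood type | allele frequencies) under Hardy-Weinberg equilibrium."""
    return {
        "A": theta_a ** 2 + 2 * theta_a * theta_o,
        "B": theta_b ** 2 + 2 * theta_b * theta_o,
        "AB": 2 * theta_a * theta_b,
        "O": theta_o ** 2,
    }


def log_likelihood(data, theta_a, theta_b, theta_o):
    """log P(Data | theta) using only the blood-type counts."""
    probs = blood_type_probabilities(theta_a, theta_b, theta_o)
    counts = Counter(data)
    return sum(n * math.log(probs[t]) for t, n in counts.items())


print(Counter(data))  # 3 x A, 5 x B, 1 x AB, 2 x O
print(log_likelihood(data, 1 / 3, 1 / 3, 1 / 3))
```

The counts 3, 5, 1 and 2 are the exponents that appear in the closed-form likelihood.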

23

Computing MLE

Finding MLE parameters: nonlinear optimization problem

[Figure: plot of the likelihood P(Data|Θ).]

Gradient Ascent (Newton-like methods): Follow the gradient of the likelihood w.r.t. the parameters (as taught in your favorite Numerical Analysis course). Improve by adding line search methods to determine the step size and get faster convergence. Start at several random locations.
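As one possible illustration of this approach, the Python sketch below runs plain gradient ascent (not Newton's method) on the ABO log-likelihood with blood-type counts nA = 3, nB = 5, nAB = 1, nO = 2. Several choices here are assumptions of this example rather than anything prescribed by the slides: the allele frequencies are reparameterized through a softmax so that they stay positive and sum to one, the gradient is approximated by central finite differences, and a simple step-halving rule stands in for a proper line search.

```python
import math

COUNTS = {"A": 3, "B": 5, "AB": 1, "O": 2}  # blood-type counts


def softmax(z):
    """Map unconstrained values to positive frequencies that sum to one."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def log_likelihood(z):
    """ABO log-likelihood as a function of the unconstrained parameters z."""
    a, b, o = softmax(z)
    probs = {"A": a * a + 2 * a * o, "B": b * b + 2 * b * o,
             "AB": 2 * a * b, "O": o * o}
    return sum(n * math.log(probs[t]) for t, n in COUNTS.items())


def numerical_gradient(f, z, h=1e-6):
    """Central finite-difference approximation of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def gradient_ascent(z, step=0.1, iterations=500):
    """Follow the gradient, halving the step until the likelihood improves."""
    z = list(z)
    for _ in range(iterations):
        grad = numerical_gradient(log_likelihood, z)
        current = log_likelihood(z)
        t = step
        while t > 1e-12:
            candidate = [v + t * g for v, g in zip(z, grad)]
            if log_likelihood(candidate) > current:
                z = candidate
                break
            t /= 2
        else:
            break  # no improving step was found; stop
    return softmax(z)


theta = gradient_ascent([0.0, 0.0, 0.0])  # start at 1/3, 1/3, 1/3
print(theta)
```

Because a step is accepted only when it increases the log-likelihood, the objective never decreases; restarting from several initial values of `z` would correspond to the "start at several random locations" advice.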

24

Gene Counting

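Had we known the counts na/a and na/o (blood type A individuals), we could have estimated θa from n individuals as follows (and similarly estimate θb and θo):

$$\theta_a = \frac{2n_{a/a} + n_{a/o} + n_{a/b}}{2n}$$

Can we compute what na/a and na/o are expected to be? Using the current estimates of θa and θo we can, as follows:

$$n_{a/a} = n_A\,\frac{\theta_a^{2}}{\theta_a^{2} + 2\theta_a\theta_o}, \qquad n_{a/o} = n_A\,\frac{2\theta_a\theta_o}{\theta_a^{2} + 2\theta_a\theta_o}$$

where nA is the number of individuals with blood type A.

We repeat these two steps until the parameters converge.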

25

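Gene Counting (example of EM)

Input: Counts of each blood type nA, nB, nO, nAB of n people.

Desired Output: ML estimate of allele frequencies θa, θb, θo.

Initialization: Set θa, θb, and θo to arbitrary values (say, 1/3).

Repeat

E-step (Expectation):

$$n_{a/a} = n_A\,\frac{\theta_a^{2}}{\theta_a^{2} + 2\theta_a\theta_o}, \qquad n_{a/o} = n_A\,\frac{2\theta_a\theta_o}{\theta_a^{2} + 2\theta_a\theta_o}$$

$$n_{b/b} = n_B\,\frac{\theta_b^{2}}{\theta_b^{2} + 2\theta_b\theta_o}, \qquad n_{b/o} = n_B\,\frac{2\theta_b\theta_o}{\theta_b^{2} + 2\theta_b\theta_o}$$

M-step (Maximization):

$$\theta_a = \frac{2n_{a/a} + n_{a/o} + n_{AB}}{2n}, \qquad \theta_b = \frac{2n_{b/b} + n_{b/o} + n_{AB}}{2n}, \qquad \theta_o = \frac{2n_O + n_{a/o} + n_{b/o}}{2n}$$

Until θa, θb, and θo converge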
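The gene-counting iteration translates almost line by line into code. The following Python sketch is one way to write it; the function name `gene_counting`, the convergence tolerance, the iteration cap, the guard against a zero denominator, and the example counts (3 A, 5 B, 1 AB, 2 O) are assumptions of this example.

```python
def gene_counting(n_A, n_B, n_AB, n_O, tolerance=1e-10, max_iterations=1000):
    """EM (gene counting) estimate of the allele frequencies (a, b, o)."""
    n = n_A + n_B + n_AB + n_O
    theta_a = theta_b = theta_o = 1 / 3  # arbitrary starting point
    for _ in range(max_iterations):
        # E-step: expected genotype counts among type A and type B people.
        denom_a = theta_a ** 2 + 2 * theta_a * theta_o
        denom_b = theta_b ** 2 + 2 * theta_b * theta_o
        n_aa = n_A * theta_a ** 2 / denom_a if denom_a > 0 else 0.0
        n_ao = n_A * 2 * theta_a * theta_o / denom_a if denom_a > 0 else 0.0
        n_bb = n_B * theta_b ** 2 / denom_b if denom_b > 0 else 0.0
        n_bo = n_B * 2 * theta_b * theta_o / denom_b if denom_b > 0 else 0.0

        # M-step: count alleles, two per person.
        new_a = (2 * n_aa + n_ao + n_AB) / (2 * n)
        new_b = (2 * n_bb + n_bo + n_AB) / (2 * n)
        new_o = (2 * n_O + n_ao + n_bo) / (2 * n)

        change = max(abs(new_a - theta_a), abs(new_b - theta_b),
                     abs(new_o - theta_o))
        theta_a, theta_b, theta_o = new_a, new_b, new_o
        if change < tolerance:
            break
    return theta_a, theta_b, theta_o


print(gene_counting(3, 5, 1, 2))
```

Each M-step divides by 2n because every one of the n people carries two alleles, so the three estimates always sum to one.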
