
TWO PAPERS ON MONTE CARLO ESTIMATION OF

MODELS FOR COMPLEX GENETIC TRAITS

by

Sun Wei Guo    Elizabeth A. Thompson

TECHNICAL REPORT No. 229

April 1992

Department of Statistics, GN-22

University of Washington

Seattle, Washington 98195 USA

Two Papers on Monte Carlo Estimation of

Models for Complex Genetic Traits *

Sun Wei Guo Elizabeth A. Thompson

April, 1992

Abstract

In human quantitative genetics, computational complexity restricts the current methods for estimation of models for complex genetic traits. The two papers in this technical report continue the development of Markov chain Monte Carlo methods to accomplish this estimation. The papers here have been submitted for publication. They are based on work developed in Sun Wei Guo's Ph.D. dissertation (Guo, 1991), and continuing under a grant to E. A. Thompson from the National Institutes of Health (Thompson and Wijsman, 1990). Two previous papers have been published: one on the Monte Carlo estimation of likelihood ratios (Thompson and Guo, 1991), the second on Monte Carlo estimation of variance component models for quantitative traits (Guo and Thompson, 1991).


Contents

Monte Carlo Estimation of Mixed Models for Large Complex Pedigrees

A Monte Carlo Method for Combined Segregation and Linkage Analysis

References

Guo, S. W. (1991) Monte Carlo methods in the quantitative genetics. Unpublished Ph.D. dissertation. Dept. of Biostatistics, University of Washington.

Guo, S. W. and Thompson, E. A. (1991) Monte Carlo estimation of variance component models. IMA J. Math. Appl. Med. Biol. 8: 171-189.

Thompson, E. A. and Guo, S. W. (1991) Evaluation of likelihood ratios for complex genetic models. IMA J. Math. Appl. Med. Biol. 8: 149-169.

Thompson, E. A. and Wijsman, E. M. (1990) A Gibbs sampler approach to the likelihood analysis of complex models. Technical report 193, Department of Statistics, University of Washington.

Monte Carlo Estimation of Mixed Models for

Large Complex Pedigrees *

Sun Wei Guo (1)    Elizabeth A. Thompson (1,2)

(1) Department of Biostatistics, SC-32

(2) Department of Statistics, GN-22

University of Washington

Seattle, Washington 98195

U. S. A.

Abstract

In human quantitative genetics, computational complexity restricts the current methods for estimation of mixed models which include major gene effects to data on small pedigrees. However, large complex pedigrees are not uncommon in practice. Also, large pedigrees tend to provide more information on genetic transmission and are more genetically homogeneous than a pooled sample of many nuclear families. [...] Gibbs sampler, [...] EM algorithm and [...] for estimation [...] mixed models. The approach also provides a [...] asymptotic variance-[...]. The methods are conceptually [...], easy to implement, and can handle multiple heritable/non-heritable random components. A numerical example [...] to illustrate [...].

Key words: EM algorithm, [...]

1 Introduction

The mixed model has a [...] in [...] (Elston and Stewart, 1971; Morton and MacLean, 1974). This model partitions variation in a quantitative trait into three sources: the effect of a single major gene of large effect, residual additive heritable effects of polygenic loci, and the independent random effects of the environment. Numerous applications of the mixed model in human genetics have been published (see, for example, Leppert et al. 1986). In the field of plant/animal [...], the mixed model can be used to [...] the genetic [...] quantitative traits such as disease resistance.

Sample sizes being equal, a single large pedigree tends to provide more information

on major gene transmission than a pooled sample of many nuclear families. For those

traits that involve mitochondrial (matrilineal) inheritance or other effects that provide

long-range dependence, a single large pedigree is more suitable for study, not only

because it provides more information on the transmission but also because these

effects can be obscured by other familial correlations in nuclear families. Moreover,

large pedigrees tend to be more genetically and environmentally homogeneous than

a pooled sample of nuclear families. However, most of the methods proposed so far for estimating mixed models are restricted mainly to data on nuclear families or small pedigrees. This is due to the formidable computational burden in the evaluation of the likelihood. Morton and MacLean (1974) proposed a numerical method of [...] is not [...] model to incorporate [...] similar to [...] is an important [...] likelihoods, a formidable [...]. Hasstedt (1982) proposed an [...] method of [...] likelihood [...] for [...] by [...] the likelihood surface. Although Hasstedt (1982) showed that the approximation works on small pedigrees, Thompson and Wijsman (1990) pointed out that the approximation can be sensitive to parameters of the procedure and thus warrants further investigation. Furthermore, the approximation is not general, in the sense that it can only deal with a mixed model with major gene(s) and additive polygenic effects. Since most complex quantitative traits are believed to be [...] by a number of heritable/non-heritable random effects, a more general method is needed.

In a recent paper, Thompson and Guo (1991) proposed Monte Carlo evaluation of

the likelihood ratios of mixed models on complex pedigrees by using the Gibbs sampler

(Geman and Geman, 1984). Their method provides a tractable and efficient approach

to likelihood ratio evaluation for mixed models and other complex genetic models.

In this paper, we shall show how the Gibbs sampler, in conjunction with the EM

algorithm, can be used to estimate the parameters of mixed models and their standard

errors. This method, coupled with the likelihood ratio evaluation of Thompson and Guo (1991), provides an integrated approach to estimation and hypothesis testing for mixed models.

[...] A simulated data set on a [...] is described in section 3. In section 4 we provide a summary discussion.

2 The Method

2.1 Notation and assumptions

$$P_\theta(G) = \prod_{\text{founders } i} P_\theta(G_i) \prod_{\text{non-founders } j} P(G_j \mid G_f, G_m)$$

[...] parameter $p$ is involved, although [...] or [...] depending on whether [...]. $1_F(j)$ is an indicator function, 1 or 0 depending on whether $j$ is a founder or non-founder, [...] of genotypic configuration $G$ can [...]

$$P_\theta(G) = h(G)\, p^{\,2\,1_F' l_1 + 1_F' l_2}\, (1-p)^{\,1_F' l_2 + 2\,1_F' l_3} = h(G) \exp\left[ (2\,1_F' l_1 + 1_F' l_2) \log\frac{p}{1-p} + 2\,1_F' 1 \log(1-p) \right] \qquad (1)$$

where $1$ is a vector of ones, and $h(G)$ does not depend on $p$. The vectors $l_i$ ($i = 1, 2, 3$) are (of course) functions of $G$, but for notational convenience we leave this dependence implicit.

For given $G$, the simple mixed model can be specified, in vector notation, as

$$y = \sum_{i=1}^{3} \mu_i l_i + a + e \qquad (2)$$

where $a$ is a vector of additive genetic effects, normally distributed as $N_n(0, \sigma_a^2 A)$, where $A$ is the numerator relationship matrix (Henderson, 1976), and $e$ is a vector of error effects (or individual environmental effects) normally distributed as $N_n(0, \sigma_e^2 I)$.
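As a small numerical check of the exponential-family form (1), the sketch below evaluates the founder part of the genotype probability. It is an illustration only: the coding of genotypes 1, 2, 3 as two, one and zero copies of the allele with frequency $p$, Hardy-Weinberg proportions for founders, and the function name `founder_log_prob` are assumptions introduced here, and the non-founder segregation terms that also belong to $h(G)$ are left out.

```python
import math


def founder_log_prob(founder_genotypes, p):
    """Log-probability of founder genotypes in the form of equation (1).

    Assumed coding: 1, 2, 3 = two, one, zero copies of the allele with
    frequency p; founders are taken to be in Hardy-Weinberg proportions.
    """
    n1 = sum(1 for g in founder_genotypes if g == 1)
    n2 = sum(1 for g in founder_genotypes if g == 2)
    n_founders = len(founder_genotypes)
    log_h = n2 * math.log(2.0)      # heterozygote factor, free of p
    allele_count = 2 * n1 + n2      # the statistic 2*1_F'l_1 + 1_F'l_2
    return (log_h
            + allele_count * math.log(p / (1.0 - p))
            + 2 * n_founders * math.log(1.0 - p))
```

The value agrees with the direct product of the Hardy-Weinberg genotype probabilities, which is what makes $2\,1_F' l_1 + 1_F' l_2$ the only founder statistic that carries information on $p$.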

2.2 The EM equations

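The parameters to be estimated are the gene frequency $p$, the major gene effects $\mu_i$ ($i = 1, 2, 3$), and the polygenic and environmental variances, $\sigma_a^2$ and $\sigma_e^2$; we write $\theta$ for the full parameter vector. [...] likelihood for model (2) is [...] major-gene genotypic configuration on [...] is [...] over [...]. Throughout, $f$ [...] $P$ a discrete probability distribution. [...] $G$, error components [...] (Ott, [...]). Estimation of $\theta$ is a "missing data" [...] $G$ [...]. The likelihood based on the "complete data" $(y, G, a)$ is:

$$f_\theta(y \mid G, a)\, P_\theta(G)\, f_\theta(a) = c(\theta)\, h(G) \exp\left[ -\frac{1}{2\sigma_e^2} \Big(y - \sum_{i=1}^{3} \mu_i l_i - a\Big)' \Big(y - \sum_{i=1}^{3} \mu_i l_i - a\Big) \right] \times \exp\left[ (2\,1_F' l_1 + 1_F' l_2) \log\frac{p}{1-p} - \frac{1}{2\sigma_a^2}\, a' A^{-1} a \right] \qquad (3)$$

where $c(\theta)$ is a function of $\theta$ but not $G$.

Thus the natural sufficient statistics for $\theta$ are $2\,1_F' l_1 + 1_F' l_2$, $l_i'(y - a)$, $l_i' 1$ ($i = 1, 2, 3$), $a' A^{-1} a / n$ and $e'e/n$, whose unconditional expectations are $2 n_F p$, $n_i \mu_i$, $n_i$ ($i = 1, 2, 3$), $\sigma_a^2$ and $\sigma_e^2$, respectively, where $n_F$ is the total number of founders in the pedigree, and $n_i$ is the expected number of individuals with genotype $i$ ($i = 1, 2, 3$).

Hence, if we denote new values of parameters by *, we obtain the following EM equations for estimating $\theta$:

$$p^* = \frac{E_\theta(2\,1_F' l_1 + 1_F' l_2 \mid y)}{2 n_F} \qquad (4)$$

$$\mu_i^* = \frac{E_\theta(l_i'(y - a) \mid y)}{E_\theta(l_i' 1 \mid y)} \quad (i = 1, 2, 3) \qquad (5)$$

$$\sigma_a^{2*} = E_\theta(a' A^{-1} a \mid y)/n \qquad (6)$$

$$\sigma_e^{2*} = E_\theta(e'e \mid y)/n \qquad (7)$$

[...] expectations [...] on the [...]

$$f_\theta(G, a \mid y) = \frac{f_\theta(y \mid G, a)\, P_\theta(G)\, f_\theta(a)}{\sum_G \int_a f_\theta(y \mid G, a)\, P_\theta(G)\, f_\theta(a)\, da} \qquad (8)$$

[...] there is no practical way to evaluate the denominator of the above equation for a pedigree of more than about ten individuals (Ott, 1979).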
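To make the role of the sufficient statistics concrete, the sketch below forms one EM update from a set of sampled configurations by equating each statistic's unconditional expectation to its Monte Carlo average. It is a hedged illustration, not the authors' C program: the function name `m_step`, the genotype coding (1, 2, 3 = two, one, zero copies of the allele with frequency $p$), the nested-list form of $A^{-1}$, and the assumption that every pedigree member has an observed trait value are all introduced here.

```python
def m_step(samples, y, is_founder, a_inv):
    """One EM update from Monte Carlo samples of (genotypes, polygenotypes).

    samples    : list of (G, a); G holds codes 1, 2, 3 and a holds floats
    y          : trait values (all individuals assumed observed)
    is_founder : list of booleans
    a_inv      : inverse numerator relationship matrix as nested lists
    """
    n = len(y)
    n_founders = sum(1 for f in is_founder if f)
    n_samples = len(samples)
    allele_sum = 0.0
    resid_sum = [0.0, 0.0, 0.0]   # running sums of l_i'(y - a)
    count_sum = [0.0, 0.0, 0.0]   # running sums of l_i'1
    quad_sum = 0.0                # running sum of a'A^{-1}a
    for genotypes, a in samples:
        for g, f in zip(genotypes, is_founder):
            if f:
                allele_sum += {1: 2, 2: 1, 3: 0}[g]
        for g, y_i, a_i in zip(genotypes, y, a):
            resid_sum[g - 1] += y_i - a_i
            count_sum[g - 1] += 1.0
        quad_sum += sum(a[i] * a_inv[i][j] * a[j]
                        for i in range(n) for j in range(n))
    p_new = allele_sum / (2.0 * n_founders * n_samples)
    # A genotype that never occurs in the samples gives an undefined mean.
    mu_new = [r / c if c > 0 else float("nan")
              for r, c in zip(resid_sum, count_sum)]
    sigma_a2_new = quad_sum / (n * n_samples)
    ee_sum = 0.0                  # residuals use the updated means
    for genotypes, a in samples:
        ee_sum += sum((y_i - mu_new[g - 1] - a_i) ** 2
                      for g, y_i, a_i in zip(genotypes, y, a))
    sigma_e2_new = ee_sum / (n * n_samples)
    return p_new, mu_new, sigma_a2_new, sigma_e2_new
```

The genotype means are computed as a ratio of averaged statistics, and the error variance uses residuals about the updated means; both are choices of this sketch.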

2.3 Monte Carlo estimation

We [...] the major genotypes and polygenotypes given the data and estimating the conditional expectations required in the [...] by a Monte Carlo method. However, a classical Monte Carlo method, providing independent realizations of major genotypes and polygenotypes given the data, is precluded because the conditional distribution is intractable and because there is no known efficient algorithm to generate independent realizations from the distribution (8).

To sample the unknown major genotypes and polygenotypes from the conditional distribution, we use the Gibbs sampler, an iterative procedure that generates multiple dependent realizations of the unobserved variables conditional on observed data (Hastings, 1970; Geman and Geman, 1984; Gelfand and Smith, 1990). Beginning from any realization of major genotypes and polygenotypes, the genotype and polygenotype [...] current estimate of parameters, and current configuration of [...] one [...] a Markov [...] $t = 1, [\ldots]$ is a stationary distribution [...] the major genotypes can move [...] polygenotypes can move to [...] probability [...] one step by positivity of [...] distribution.

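Any irreducible Hastings algorithm is also Harris recurrent (Tierney, 1991). Hence $P_\theta(G, a \mid y)$ is the unique invariant distribution of the Markov chain and averages over the Markov chain [...] the [...]. For [...] integrable function $V(G, a)$,

$$\bar{V}_t \to E(V), \quad [\ldots]$$

[...] genotypes and [...], where

$$\bar{V}_t = \frac{1}{t} \sum_{l=1}^{t} V(G^{(l)}, a^{(l)}).$$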

The chain is also geometrically ergodic by an argument of Chan (1991). This is because the major genotype configurations themselves form a Markov chain $\{G^{(t)}\}$, which has a finite state space and so is geometrically ergodic (Chung, 1967), but then so is the joint chain because the rate for the joint chain is dominated by the rate for the major genotype chain (equation 2.2 in Chan, 1991). Geometric ergodicity implies a central limit theorem

$$\sqrt{t}\,\big(\bar{V}_t - E(V)\big) \to N(0, \sigma_V^2)$$

where the asymptotic variance $\sigma_V^2$ depends on the function $V$ and on the autocorrelation [...] are correlated. To reduce autocorrelation, one can sample [...] required to [...] a [...] computational efficiency is discussed [...] one runs [...] realizations are [...] to [...] the conditional expectations and new $\theta^{(l+1)}$. This completes one iteration [...] until the likelihood of the model [...] no trend. With a reasonably large sample size ($N = 200$, [...]) [...] in equations [...] can be accurately estimated [...]

[...] implement the [...] individual, the conditional distribution of his genotype given his trait value $y_i$, polygenotype $a_i$ and genotypes and polygenotypes of other members in the pedigree, and the conditional distribution of his polygenotype given his trait value $y_i$, his genotype $G_i$, and genotypes and polygenotypes of other members in the pedigree. For individual $i$, we specify a neighborhood consisting of (if present [...]) the [...] his spouse(s) [...] (if [...]) [...] $y_i$, polygenotype $a_i$, and genotypes of the neighborhood, the [...] and polygenotypes on other pedigree members do not [...] information about $G_i$. [...] if we let $G_f$, $G_m$ [...] of [...], $\{G_j\}$ his [...], $\{G_{jl}\}$ his offspring, [...]

$$P_\theta(G_i \mid G_{-i}, a, y) = P_\theta(G_i \mid \{G_j\}, \{G_{jl}\}, G_f, G_m, y_i, a_i) \propto \Big[ \prod_{j,l} P(G_{jl} \mid G_i, G_j) \Big]\, P(G_i \mid G_f, G_m)\, f(y_i \mid G_i, a_i) \qquad (9)$$

[...] is [...]. If [...] is missing, [...] to be 1. If $i$ is a founder, the segregation probability is just the [...] frequency. Similarly, if we denote [...] mother, spouse(s) [...] polygenotypes of [...]

$$f(a_i \mid a_{-i}, y, G) = f(a_i \mid a_{-i}, y_i, G_i) \propto \Big[ \prod_{j,l} f(a_{jl} \mid a_i, a_j) \Big]\, f(a_i \mid a_f, a_m)\, f(y_i \mid G_i, a_i) \qquad (10)$$

where $a_{-i}$ denotes the breeding values of all the members in the pedigree except $i$, and $f(a_i \mid a_f, a_m)$ is the polygenic segregation probability density; that is, given $a_f$, $a_m$, $a_i \sim N((a_f + a_m)/2, \sigma_a^2/2)$. If $i$ is a founder, $a_i \sim N(0, \sigma_a^2)$.

After some algebra, we find that the conditional distribution of $a_i$ given $y_i$, $G_i$, $a_f$, $a_m$, $\{a_j\}$ and $\{a_{jl}\}$ is normal with mean

[...] (11)

and variance

[...] (12)

[...] is [...] or [...] number of offspring. Equations (11) and (12) are [...] $a_i$ given $y_i$, $a_f$, $a_m$, $\{a_j\}$ and $\{a_{jl}\}$ for the polygenic [...] (Thompson and Shaw, 1990; Guo and Thompson, 1991). Thus, generation of [...] the (local) conditional distributions (9) and (10) is straightforward.
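The polygenotype update implied by (10) can be sketched by completing the square in the normal factors. The code below is a derivation made for illustration from the stated segregation density $N((a_f + a_m)/2, \sigma_a^2/2)$, the founder density $N(0, \sigma_a^2)$ and a normal error term with variance $\sigma_e^2$; it is not a transcription of equations (11) and (12), and the function names and argument layout are hypothetical.

```python
import random


def polygenotype_conditional(y_i, mu_gi, parents, offspring, sigma_a2, sigma_e2):
    """Mean and variance of the normal full conditional of a_i.

    y_i       : trait value, or None if unobserved
    mu_gi     : major-gene mean for the current genotype of i
    parents   : None for a founder, else (a_father, a_mother)
    offspring : list of (a_child, a_spouse), one pair per child of i
    """
    precision = 0.0
    weighted = 0.0
    if y_i is not None:                       # data term
        precision += 1.0 / sigma_e2
        weighted += (y_i - mu_gi) / sigma_e2
    if parents is None:                       # founder prior N(0, sigma_a2)
        precision += 1.0 / sigma_a2
    else:                                     # segregation from own parents
        precision += 2.0 / sigma_a2
        weighted += (parents[0] + parents[1]) / sigma_a2
    for a_child, a_spouse in offspring:       # each child acts as a noisy
        precision += 0.5 / sigma_a2           # observation 2*a_child - a_spouse
        weighted += (2.0 * a_child - a_spouse) * 0.5 / sigma_a2
    return weighted / precision, 1.0 / precision


def draw_polygenotype(y_i, mu_gi, parents, offspring, sigma_a2, sigma_e2,
                      rng=random):
    """Draw a_i from its full conditional (one Gibbs update)."""
    mean, var = polygenotype_conditional(y_i, mu_gi, parents, offspring,
                                         sigma_a2, sigma_e2)
    return rng.gauss(mean, var ** 0.5)
```

Each term adds a precision and a precision-weighted value, so missing data, founders and childless individuals are handled by simply omitting the corresponding term.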

2.4 Choice of starting genotypic configurations (G, a)

Although the ergodic theorem ensures that the realizations generated via the Gibbs

sampler will converge in distribution to the true joint distribution, it is important to

choose a good starting genotypic configuration to avoid an unnecessarily prolonged period before realizations can be collected. Since the observed data contain information [...] effects, we use an approach we refer to as "posterior gene-dropping", as opposed to the simple gene-dropping method (MacCluer et al., 1986). Basically, we drop the major genes from the top of the pedigree down to the bottom, using the current estimate of gene frequency and also the data. To each founder $i$ in the pedigree, we [...] a [...] the probability

[...] (13)

[...] each non-founder $i$, a genotype is randomly sampled [...]

$$P_\theta(G_i = k \mid [\ldots]) \propto [\ldots] \propto P(G_i = k \mid G_f, G_m) \exp[\,\ldots\,] \qquad (14)$$

Once the major genes have been dropped through the pedigree, we drop the polygenes. We first randomly draw a polygenotype for each founder $i$ according to the following distribution:

[...] (15)

Once all the founders are assigned breeding values, the breeding value for each non-founder $i$ is randomly drawn from the following distribution:

$$f_\theta(a_i \mid y_i, G_i, a_f, a_m) \propto f_\theta(y_i \mid G_i, a_i)\, f_\theta(a_i \mid a_f, a_m) \propto \exp\left[ -\frac{(y_i - \mu_{G_i} - a_i)^2}{2\sigma_e^2} \right] \exp\left\{ -\frac{\big(a_i - (a_f + a_m)/2\big)^2}{\sigma_a^2} \right\} \qquad (16)$$
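The major-gene part of gene-dropping can be sketched as follows. The Mendelian segregation probabilities are standard; the data weighting is only a stand-in, because the exponent of (14) is not legible here. In particular the parameter `spread` (the variance used in the normal weight), the genotype coding 1, 2, 3 and all function names are assumptions of this sketch.

```python
import math
import random

# Chance that a parent of genotype 1, 2 or 3 transmits the allele with
# frequency p (assumed coding: two, one or zero copies).
TRANSMIT = {1: 1.0, 2: 0.5, 3: 0.0}


def segregation_probs(g_father, g_mother):
    """Mendelian probabilities of offspring genotypes 1, 2, 3."""
    tf, tm = TRANSMIT[g_father], TRANSMIT[g_mother]
    return [tf * tm, tf * (1 - tm) + (1 - tf) * tm, (1 - tf) * (1 - tm)]


def posterior_drop_probs(g_father, g_mother, y_i, mu, spread):
    """Segregation probabilities reweighted by the individual's own data.

    `spread` is a hypothetical variance for the normal weight; with
    y_i = None this reduces to simple gene-dropping.
    """
    prior = segregation_probs(g_father, g_mother)
    if y_i is None:
        return prior
    weights = [pr * math.exp(-(y_i - m) ** 2 / (2.0 * spread))
               for pr, m in zip(prior, mu)]
    total = sum(weights)
    return [w / total for w in weights]


def drop_genotype(g_father, g_mother, y_i, mu, spread, rng=random):
    """Sample a non-founder genotype given its parents and trait value."""
    probs = posterior_drop_probs(g_father, g_mother, y_i, mu, spread)
    return rng.choices([1, 2, 3], weights=probs)[0]
```

Working down the pedigree in generation order guarantees that both parental genotypes are available before each non-founder is sampled.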

2.5 Extension to include multiple random effects

[...] algorithm can [...] can [...] dominance effects:

[...]

[...] dominance effects, distributed as $N(0, \sigma_d^2 D)$, and [...] notation is as before. On a zero-loop pedigree, $D$ has elements $d_{ij} = 1$ if $i = j$, $d_{ij} = 1/4$ if $i$ and $j$ are full-siblings, or 0 otherwise.
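The dominance matrix described above is simple to build from parent records. The sketch below assumes a zero-loop pedigree, as in the text, and a hypothetical input format in which each individual carries a `(father_id, mother_id)` pair or `None` for a founder.

```python
def dominance_matrix(parents):
    """Dominance relationship matrix D for a zero-loop pedigree.

    parents[i] is (father_id, mother_id), or None if i is a founder.
    d_ij = 1 if i == j, 1/4 if i and j are full siblings, 0 otherwise.
    """
    n = len(parents)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 1.0
        for j in range(i + 1, n):
            # Full siblings share both recorded parents; founders never match.
            if parents[i] is not None and parents[i] == parents[j]:
                d[i][j] = d[j][i] = 0.25
    return d
```

Because only full sibships produce off-diagonal entries, $D$ is block diagonal after reordering by sibship, which is one reason its inverse stays sparse.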

To estimate this model, EM equations similar to (4-7) are needed. In addition, the EM equation for $\sigma_d^2$ is

[...] (17)

To evaluate the expectations in the EM equations, we need samples from the joint conditional distribution $f_\theta(G, a, d \mid y)$, but

$$f_\theta(G, a, d \mid y) = \frac{f_\theta(y \mid G, a, d)\, P_\theta(G)\, f_\theta(a)\, f_\theta(d)}{\sum_G \int_a \int_d f_\theta(y \mid G, a, d)\, P_\theta(G)\, f_\theta(a)\, f_\theta(d)\, dd\, da}$$

is a distribution more complicated than equation (8). Using the Gibbs sampler, however, we can generate realizations from this conditional distribution. It is sufficient to calculate the conditional distributions:

$$f(a_i \mid a_{-i}, y, G, d) = f(a_i \mid a_f, a_m, \{a_{jl}\}, \{a_j\}, y_i, [\ldots]) \propto [\ldots]$$

[...] neighbors are now his [...] two [...] as before. [...] mean and variance similar to (11) and (12), respectively, except $y_i$ is now replaced by $y_i - d_i$. If we denote by $S_i$ the set of $i$'s full siblings, excluding himself, and $s = |S_i| + 1$, then distribution (20) is a normal distribution: [...]

More generally, a complex mixed model has the following form:

$$y = \sum_{i=1}^{3} \mu_i l_i + z_1 + \cdots + z_k + e \qquad (21)$$

where the $z_i$'s are $k$ random components, with $z_i \sim N_n(0, \sigma_i^2 \Sigma_i)$; $e$ is the environmental effect, with $e \sim N_n(0, \sigma_e^2 I)$. $z_1, \ldots, z_k$, $e$ and the major gene are assumed mutually independent. The $\Sigma_i$'s are known and are positive semi-definite. Without loss of generality, we can assume [...] are invertible, [...] we can always re-parametrize the model so that, [...] enough [...] $= \sigma_i^2 c$, where $c > 0$ is [...] which is invertible. [...] and [...] equations for estimating [...] ($i = 1, \ldots, k$) [...]

In genetic models, direct inversion of matrices can be avoided [...] taking [...]; the inverses are often sparse, facilitating efficient storage and computation.

To obtain realizations from the conditional distribution given the data, again, the Gibbs sampler can be used, which requires only the conditional distributions:

$$P_\theta(G_i \mid \{G_j\}, \{G_{jl}\}, G_f, G_m, y_i, z_{1i}, \ldots, z_{ki}) \propto \Big[ \prod_{j,l} P(G_{jl} \mid G_i, G_j) \Big]\, P(G_i \mid G_f, G_m) \exp\left[ -\frac{(y_i - \mu_{G_i} - z_{1i} - \cdots - z_{ki})^2}{2\sigma_e^2} \right] \qquad (26)$$

and

$$f_\theta(z_{ji} \mid z_1, \ldots, [\ldots], z_k, (z_j)_{-i}, G, y) = f_\theta(z_{ji} \mid z_1, \ldots, [\ldots], z_k, (z_j)_{-i}, G_i, y_i) \propto [\ldots] \qquad (27)$$

[...] Since generation of variables [...] conditional distributions is straightforward, [...] implementation to estimate parameters of model (2).

2.6 Estimation of the information matrix

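[...] as [...] other settings, [...] reason for using [...] algorithm is that the likelihood function is difficult or impractical to evaluate, but, if the data are viewed as a function of some missing random variables, the evaluation of MLEs based on the "complete data" is relatively straightforward. However, the EM algorithm does not immediately yield asymptotic standard errors of parameter estimates. Yet it is often of practical interest to know the variance-covariance matrix, or to construct confidence intervals for estimated parameters. In this section, we present a Monte Carlo method for estimating the observed information matrix. For notational convenience, let $u = (G, z_1, z_2, \ldots, z_k)$ (i.e. $u$ is the "missing data" vector). Then

$$l(\theta; \theta_0) = \frac{L(\theta)}{L(\theta_0)} = \int_u \frac{P_\theta(u)\, f_\theta(y \mid u)}{P_{\theta_0}(u)\, f_{\theta_0}(y \mid u)}\, dP_{\theta_0}(u \mid y) = \int_u \frac{P_\theta(u, y)}{P_{\theta_0}(u, y)}\, dP_{\theta_0}(u \mid y)$$

(Thompson and Guo, 1991), which [...] is [...] on [...] is [...] second is [...] conditional distribution of $u$ [...] $y$. [...] equation can [...] as shown in [...] previous section, $u$ can be sampled [...] a Monte [...] $P_\theta(u \mid y)$ via the Gibbs sampler. [...] the conditional distribution $P_\theta(u \mid y)$,

$$\int_u \frac{\partial^2 \log P_\theta(u, y)}{\partial\theta_i\, \partial\theta_j}\, dP_\theta(u \mid y) \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\partial^2 \log P_\theta(u^{(k)}, y)}{\partial\theta_i\, \partial\theta_j} \qquad (29)$$

Similarly,

$$E_\theta\left( \frac{\partial \log P_\theta(u, y)}{\partial\theta_i} \frac{\partial \log P_\theta(u, y)}{\partial\theta_j} \right) \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\partial \log P_\theta(u^{(k)}, y)}{\partial\theta_i} \frac{\partial \log P_\theta(u^{(k)}, y)}{\partial\theta_j} \qquad (30)$$

$$E_\theta\left( \frac{\partial \log P_\theta(u, y)}{\partial\theta_i} \right) \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\partial \log P_\theta(u^{(k)}, y)}{\partial\theta_i} \qquad (31)$$

If the information is to be evaluated at the MLE $\hat\theta$, then

$$\int_u \frac{\partial \log P_\theta(u, y)}{\partial\theta_i}\, dP_\theta(u \mid y) \Big|_{\theta = \hat\theta} = \frac{\partial \log P_\theta(y)}{\partial\theta_i} \Big|_{\theta = \hat\theta} = 0$$

so at $\theta = \hat\theta$ [...]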
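The three Monte Carlo averages (29), (30) and (31) are the ingredients of the standard missing-information identity, in which the observed information equals the conditional mean of the negative complete-data Hessian minus the conditional covariance of the complete-data score. The sketch below assembles them under that identity; the function name and the nested-list inputs (one score vector and one Hessian per Gibbs realization) are assumptions made for illustration.

```python
def observed_information(scores, hessians):
    """Monte Carlo estimate of the observed information matrix.

    scores   : list of complete-data score vectors, one per realization
    hessians : list of complete-data Hessian matrices (nested lists)
    Combines averages of the forms (29), (30) and (31):
        I = -mean(H) - mean(s s') + mean(s) mean(s)'
    """
    n_samples = len(scores)
    dim = len(scores[0])
    mean_score = [sum(s[i] for s in scores) / n_samples for i in range(dim)]
    info = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(dim):
            mean_hess = sum(h[i][j] for h in hessians) / n_samples
            mean_outer = sum(s[i] * s[j] for s in scores) / n_samples
            info[i][j] = -mean_hess - mean_outer + mean_score[i] * mean_score[j]
    return info
```

At the MLE the mean score vanishes in expectation, so the last term only corrects for Monte Carlo error; inverting the returned matrix gives an estimate of the asymptotic variance-covariance matrix.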

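The log-likelihood [...] the effects are assumed to [...] term [...] Hence the [...] $P_\theta(u, y)$ are [...] and

$$P_\theta(u, y) = P_\theta(G, a, y) = f_\theta(y \mid G, a)\, f_\theta(a)\, P_\theta(G)$$

where

$$\log P_\theta(G, a, y) = \log f_\theta(y \mid G, a) + \log f_\theta(a) + \log P_\theta(G)$$

$$\log f_\theta(y \mid G, a) = -\frac{k}{2} \log 2\pi - \frac{k}{2} \log \sigma_e^2 - \sum_{i \in O} \frac{(y_i - \mu_{G_i} - a_i)^2}{2\sigma_e^2} \qquad (33)$$

[...] (34)

[...]

where $O$ is the index set of individuals who are observed and $k = |O|$, $n_{lF}$ is the number of founders with genotype $l$ ($l = 1, 2, 3$), and $C$ is some constant. Therefore, the score vector is simply [...] non-zero components [...] index [...] genotype $l$; [...]

[...] ($l = 1, 2, 3$) (38)

and

$$-\frac{\partial^2 \log f_\theta(a)}{\partial(\sigma_a^2)^2} = -\frac{n}{2\sigma_a^4} + \frac{a' A^{-1} a}{\sigma_a^6} \qquad (39)$$

[...] (40)

$$\frac{\partial^2 \log P_\theta(G)}{\partial p^2} = [\ldots] \qquad (41)$$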

2.7 Assessing the Monte Carlo variance and optimal sample spacing

Since [...] is used to [...] mixed model, [...] variation is introduced. [...] is

$$\sigma^2 = \sum_{t=-\infty}^{\infty} \gamma_t$$

[...] is [...] $t$ assuming [...] can [...] the empirical autocovariance at lag $t$, and can be estimated by

$$\hat\sigma^2 = \sum_{t=-\infty}^{\infty} w(t)\, \hat\gamma_t \qquad (43)$$

where $w$ is some suitable weight function. For example, $w(t) = 1$ [...], $w(t) = 0$ for large $t$, and $w(t)$ makes a smooth monotone transition between [...] values (Geyer, 1991). Once the Monte Carlo variances are estimated, one can estimate also the optimal spacing, $k$, in the Gibbs sampler. The Central Limit Theorem variance for a function $V$ of the chain sampled at spacing $k$ is

$$s_k = \sum_{t=-\infty}^{\infty} \gamma_{kt}$$

and the variance of the average of [...] at $N$ consecutive samples is approximately $s_k/N$. To achieve [...] accuracy, $N$ must be proportional to $s_k$. (This applies to any function $V$, and [...] conditional expectations in [...] to [...] the Monte Carlo estimates of [...]) [...] cost of [...] or proportional [...] proportional to [...] + [...] can [...] can [...] $k$. [...] cost [...] depends on [...] on the autocovariance structure of [...] even different statistics used to estimate the parameters may have different optimal spacings.
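A windowed estimate of the form (43) can be sketched as below. The cosine taper between two cut-off lags is one possible smooth monotone weight function, chosen here for illustration; the names `full_lag` and `zero_lag` and the divisor $n$ in the empirical autocovariance are likewise assumptions.

```python
import math


def autocovariance(x, lag):
    """Empirical autocovariance of the sequence x at a non-negative lag."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[s] - mean) * (x[s + lag] - mean) for s in range(n - lag)) / n


def window(t, full_lag, zero_lag):
    """Weight 1 up to full_lag, 0 from zero_lag, cosine taper in between."""
    if t <= full_lag:
        return 1.0
    if t >= zero_lag:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * (t - full_lag) / (zero_lag - full_lag)))


def mc_variance(x, full_lag, zero_lag):
    """Windowed estimate of the central limit theorem variance for mean(x)."""
    total = autocovariance(x, 0)
    for t in range(1, min(zero_lag, len(x))):
        # Lags t and -t contribute equally, hence the factor 2.
        total += 2.0 * window(t, full_lag, zero_lag) * autocovariance(x, t)
    return total
```

Applying `mc_variance` to every $k$-th value of a recorded sequence gives an estimate of $s_k$, so candidate spacings can be compared once the computing cost per Gibbs cycle is known.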

3 Numerical examples

In this section we illustrate the method proposed above. The programs implementing the method are written in C. All computations were carried out on a DEC3100 UNIX workstation. The random number generator used was the run-time library routine drand48. The program psdraw (Geyer, 1988) was used to draw the pedigree.

Example. Simulated data. We consider the simulated data on a 230-member, six-generation pedigree [...]. There are 67 founders in the pedigree. This pedigree is similar in form to the plant pedigrees of Dr. Mitchell-Olds (personal communication) [...] to study [...] of [...] in Brassica campestris. The model we considered is (1) [...] 2. The simulation values are shown in Table 2. With [...] 0.5, $\mu_1 = \mu_2 = 1.0$, $\mu_3 =$ [...] we performed [...] iteration a sample of 400 Gibbs samples were drawn [...] ten iterations, [...] samples were drawn. [...] 20 [...] estimates are obtained, [...] samples are generated, [...] we [...] 7. [...] model [...] standard errors and 95% confidence intervals are shown in Table 2. The estimated asymptotic variance-covariance matrix is [...] Table 3. In essence, [...] correctly [...] the major gene effects. Note that the estimates of $\sigma_a^2$ and $\sigma_e^2$ [...], asymptotically, a substantial negative correlation.

Figure 2 shows [...] and [...] over [...]. It can be seen from the figures that all the estimates approached the vicinity of their MLEs, and the log-likelihood ratio of the mixed model versus the polygenic model quickly increases. The estimates of major gene effects stabilize quickly. Other parameter estimates, and the log-likelihood, continue to vary within limits, as is to be expected of a Monte Carlo procedure; each EM step was based on only 400 (dependent) samples. Table 4 shows the [...] of log-likelihood ratios with respect to various models. It can be seen from the table that the data strongly support a mixed model with codominant expression of the major [...]. The substantial difference in log-likelihood ratios strongly rejects a [...] model, a pure additive model, a polygenic model, [...] effect [...] estimated log-likelihood [...] model with major [...] plus additive polygenic component, [...] surprising, if we [...] standard error [...] estimated parameters are [...] deviations are [...] errors [...] magnitude are negligible.

[...] we explore the likelihood surface for the two variance [...] $\sigma_a^2$ and $\sigma_e^2$. We considered eight points surrounding the putative MLEs of $\sigma_a^2$ and $\sigma_e^2$ (Table 5), [...] unchanged. From Table 5 it can be seen that the estimated parameters do indeed provide higher [...] likelihood surface [...] between $\sigma_a^2$ and [...] also noticeable, as expected from the negative asymptotic correlation [...] two parameters. The small differences in the likelihood ratios suggest that the likelihood surface with respect to $\sigma_a^2$ and $\sigma_e^2$ is fairly flat, as is also evidenced by the relatively wide confidence intervals for these two parameters.

4 Discussion

We have presented a new method for estimating mixed models for large complex [...]. The method can easily handle [...] with multiple heritable/non-[...] pedigrees do not pose [...]

While it is true that, with the increasing availability of polymorphic DNA markers, [...] absence or existence [...] particular, mixed models can [...] models can [...] once [...] are imputed [...] members, one can do [...] of a potential linkage [...] a [...] of informative meioses [...] informative heterozygotes in [...] can [...] the Monte Carlo method of this paper [...] an imputed genetic configuration on [...] our method can be [...] to include genetic marker [...], combining segregation and linkage analysis (Guo and Thompson, in preparation).

Implementation of the Monte Carlo EM algorithm requires specification of three operational parameters: the number of EM iterations, the Monte Carlo sample size used to estimate conditional expectations at each iteration, and the number of Gibbs cycles between samples. The number of iterations can be determined by monitoring the estimates and log-likelihood values (Figure 2). The Monte Carlo sample size is largely determined by the desired accuracy of the parameter estimates, while the optimal number of cycles for each sample can be investigated by the methods of section 2.7. In general it is the result of a compromise between two conflicting goals: more accuracy in Monte Carlo estimation and less computing cost. More specifically, if the Markov chain constructed via the Gibbs sampler has high autocorrelation between consecutive values of a function on the chain, a larger Monte Carlo sample size is needed in order to achieve the desired [...]. Increasing the number of cycles would reduce [...] between successive [...] and thus the [...] sample. For the example [...] satisfactory for [...] are more random effects.

[...] is inherent to [...] as [...] to [...] errors [...] "accelerated" version of Monte [...] sample size, as well as the number [...] cycles. [...] any eigenvalues of [...] One [...] matrix are [...] problem can be more acute [...] remedy [...] If estimation of the current Hessian is inexpensive, [...] Newton-Raphson method is combined with [...]

Although we have [...] $\mu_i$ for all individuals, this is for simplicity only. In general, mixed models can incorporate other covariates such as age and sex. The effects of these covariates can be estimated by appropriate EM equations as described in Thompson and Shaw (1990). Alternatively, one can sample the major genes and polygenic effects based on the current estimate of covariate effects and, once [...], one can [...] the major [...] poly[...] additional covariates and use standard regression [...] estimate covariate effects. The latter method is based on the fact that, given major genes and polygenic effects, observations on different individuals are independent, as is seen from [...]

[...] area of [...] more complex genetic models [...]

Acknowledgement

[...] models on [...] Charles [...] for sharing [...] was supported in part [...] USDA contract [...] and [...]

References

[...] a quantitative trait: [...]. [...] Journal of Human Genetics [...].

Brockwell, P. J. and Davis, R. A. (1987) Time Series: Theory and Methods. Springer-Verlag, New York.

Cannings, C., Thompson, E. A. and Skolnick, M. H. (1978) Probability functions on complex pedigrees. Adv. Appl. Prob. [...].

Chan, K. S. (1991) Asymptotic behavior of the Gibbs sampler. Technical Report No. 294, Department of Statistics, University of Chicago.

Chung, K. L. (1967) Markov Chains with Stationary Transition Probabilities, 2nd ed. Berlin: Springer-Verlag.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39: 1-38.

Elston, R. C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Human Heredity 21: 523-542.

Gelfand, A. E. and Smith, A. F. M. (1990) [...] marginal densities. Journal of the American Statistical Association [...].

Geyer, [...] (1988) Software for calculating [...]. [...] Department of Statistics, University of Washington.

Geyer, C. J. (1991) Markov chain Monte Carlo maximum likelihood. Computer Science and Statistics: Proceedings of the 23rd Symposium on the [...]. Pp. 156-163. Interface Foundation of North America.

Guo, S. W. (1991) Monte Carlo methods in the quantitative genetics. Unpublished [...] dissertation. Dept. of Biostatistics, University of Washington.

Guo, S. W. and Thompson, E. A. (1991) Monte Carlo estimation of variance component models. IMA J. Math. Appl. Med. Biol. 8: 171-189.

Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97-109.

Hasstedt, S. J. (1982) A mixed-model likelihood approximation on large pedigrees. Computers and Biomedical Research 15: 295-307.

Hasstedt, S. J. and Cartwright, P. (1979) PAP: Pedigree Analysis Package. Technical Report No. 13, Department of Medical Biophysics and Computing, University of Utah.

Henderson, C. R. (1976) A simple method [...] a numerator relationship matrix used in prediction of breeding values. Biometrics 32: 69-83.

[...] J. [...]

[...] a Markov [...] defined on a [...] configurations by a sampling scheme. Biometrics: in press.

Sundberg, R. (1974) Maximum likelihood theory for incomplete data from an exponential family. [...] Statist. [...].

[...] data augmentation. Journal of the American Statistical Association 82: 528-540.

Thompson, E. A. and Shaw, R. G. (1990) Pedigree analysis for quantitative traits: variance components without matrix inversion. Biometrics 46: 399-413.

Thompson, E. A. (1986) Pedigree Analysis in Human Genetics. The Johns Hopkins [...].

Thompson, E. A. and Guo, S. W. (1991) Evaluation of likelihood ratios for complex genetic models. IMA J. Math. Appl. Med. Biol. 8: 149-169.

Tierney, L. (1991) Markov chains for exploring posterior distributions. Technical [...].

Table 1: Algorithm to estimate variance component models.

[...]

3. Compute [...], i = 1, 2, ..., k;

4. Set initial parameter estimates, mu_j (j = 1, 2, 3); sigma_i^2, i = 1, 2, ..., k, and sigma_e^2;

5. Posterior gene-dropping:

   Drop G; Drop z_1, ..., z_k;

6. De-memorization: Gibbs sample (z_1, z_2, ..., z_k, G | y) for a certain number of times;

7. Next EM iteration step

   Set p* = mu_j* = sigma_i^2* = sigma_e^2* = 0, i = 1, 2, ..., k; j = 1, 2, 3;

   For j = 1 to N (the Monte Carlo sample size)

      Gibbs sample (z_1, z_2, ..., z_k, G | y):

         for l = 1 to k (the chosen spacing)

            Randomly permute all individuals in the pedigree;

            In the order indicated by the permutation, update genotypes and random effects;

         next l

      After kth cycle, we have configuration (z_1, z_2, ..., z_k, G);

      compute p* = p* + (2 1_F'l_1 + 1_F'l_2)/(2 1_F'1)/N;

      compute [...]

      compute mu_i* = [...] + [l_i'(y - z_1 - ... - z_k)]/l_i'1 [...];

   next j

   [...] i = 1, [...]
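The inner loop of step 7 in Table 1 (a fresh random permutation of the pedigree members for each of the cycles between recorded samples) can be sketched as follows. The function name, the `state` container and the `update_individual` callback are hypothetical; in a real implementation the callback would redraw the individual's genotype and random effects from their local conditional distributions.

```python
import random


def gibbs_sample(state, individuals, update_individual, spacing, rng):
    """Run `spacing` full Gibbs cycles and return the resulting state.

    Each cycle visits every individual exactly once, in a freshly
    permuted order, calling update_individual(state, i, rng).
    """
    for _ in range(spacing):
        order = list(individuals)
        rng.shuffle(order)            # random scan order for this cycle
        for i in order:
            update_individual(state, i, rng)
    return state
```

Calling this once per Monte Carlo sample, with `spacing` set to the chosen number of cycles, yields the configurations over which the EM sums are accumulated.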

-I..v":"",,,,2.7914)

0.8855)

(0.0633, 0.2890)

2.3169

0.1762 0.0576

2.0

p

0.6

0.2

fi2 0.0

fiI

Table 3: Estimated asymptotic variance-covariance matrix × 10³.

p

-2.2621 58.6270

Table 5: Log-likelihood ratios in the neighborhood of the MLE. The Gibbs sampler is run at $\theta_0$ set to [...]. See text for explanation.

Figure 1: Pedigree structure of the simulated data. Individuals with grey color constitute a [...] and grey colors constitute a [...] black [...] constitute a 230-member six-generation pedigree.

Figure 2: Monte [...] Log-[...] estimate of [...] over iterations; [...] additive polygenic variance vs. the EM [...]; (d) Estimate of error variance vs. the iterations; (e) Estimates of major gene effects. [Panels plotted against iteration, 0 to 200.]

A Monte Carlo Method for Combined

Segregation and Linkage Analysis

Sun Wei Guo (1)    Elizabeth A. Thompson (2)

(1) Department of Biostatistics

University of Michigan

Ann Arbor, MI 48109-2029

(2) Department of Statistics, GN-22

University of Washington

Seattle, Washington 98195

U. S. A.

Running Head: A Monte Carlo method for genetic [...]

(1), (2): This [...] is based on research completed [...] SWG was a student [...] Department of Biostatistics, [...]

Correspondence to:

Dr. Elizabeth A. Thompson

Department of Statistics, GN-22

University of Washington

Seattle, Washington 98195

Phone: (206) 685-0108

Fax: (206) 685-7419

Abstract

[...] Carlo [...] to [...] of a quantitative trait observed on an [...]. In conjunction with the Monte Carlo method of likelihood ratio evaluation proposed by Thompson and Guo (1991), the method provides for estimation and hypothesis testing. The greatest attraction of this approach is its ability to handle complex genetic models and large pedigrees. Two examples are presented. One is of simulated data on a large pedigree; the other is a reanalysis of published data previously analyzed by other methods. These examples illustrate the practicality of the method.

Introduction

The past decade has seen enormous success [...] successes are [...] of [...] for Huntington's disease, cystic fibrosis and Duchenne's muscular dystrophy. In contrast, progress in mapping quantitative traits has been very slow, despite the fact that many relevant measures of diseases are clinical, physiological and biological traits that vary continuously among individuals. There is no shortage of data. In fact, advances in biology and molecular genetics have generated so much data that the availability of statistical techniques has become a bottleneck in the process of the mapping of quantitative traits.

The currently available techniques for mapping quantitative traits can be grouped into four categories: 1) sib-pair methods (Haseman and Elston, 1972), 2) discrete-type linkage analysis (Ott, 1991; Thomas and Cortessis, 1991), 3) mixed models (Hasstedt, 1982), and 4) regressive models (Bonney, 1984; Bonney et al., 1988). Although sib-pair methods are fairly robust and have the advantage of no need to make ascertainment corrections, their statistical power is low, especially when linkage is loose. In addition, they ignore the interdependency among sib-pairs from the same nuclear [...]. At best, they can only tell whether there is a linkage, and are thus primarily a [...] mapping a quantitative [...] to dichotomize [...] were discrete [...] alternative [...] is to assume [...] but suffers loss of information. Additionally, penetrances must [...] arbitrary cutoff. [...] penetrance functions for a quantitative [...] frequency, the mean and [...] for each genotype (Ott, 1974). Since quantitative traits are probably typically controlled by a number of loci acting in concert with environmental effects, the adequacy of these models is questionable.

The regressive models proposed by Bonney (1984, 1988) represent a new development. The model handles the residual variation unaccounted for by the major gene effects as if it were "noise", without specifying its origin. Furthermore, the model assumes a Markovian dependence structure with regard to the residuals among first-degree relatives. By doing so, the model provides flexibility in incorporation of covariates and efficiency in computation. However, while simple Markovian [...] a [...] Morton, 1981; Ott, [...]). Variation in a quantitative [...] and/or other [...] non-heritable effects, [...] of the environment. Although the model is biologically [...], computational difficulties have limited its use mainly to segregation [...] rather than in conjunction with linkage analysis, and primarily to data on nuclear families or small pedigrees (Ott, 1979; Hasstedt, 1982).

Traditionally, segregation and linkage analyses have been performed separately (Ott, 1991). Historically, with limited marker data and computing power, most linkage analyses were carried out only after sufficient information had been gathered to infer a mode of inheritance for the trait. However, segregation analysis can only, at best, demonstrate the presence of major gene(s). It cannot localize them, and often lacks power to estimate genetic parameters correctly in the presence of multiallelic trait loci or genetic heterogeneity ([...], 1984; Ott, 1990). Violation of the distributional assumptions of the mixed model can lead to spurious support for a major gene (MacLean et al., 1975; Go et al., 1978; Eaves, 1983). Incorporation of linked markers might potentially improve the robustness of the mixed model. Moreover, linkage of a trait to a genetic marker [...] evidence of the existence of a

the

to combine linkage

et

complex, genetic heterogeneity and

both segregation

be able to analyze

pedigrees, which are more homogeneous than a pooled sample of nuclear families. It is also useful to consider more realistic yet more complicated genetic models that can incorporate various heritable/non-heritable random and fixed effects and to develop practical computation.

In this paper, we propose a Monte Carlo approach to combined segregation and linkage analysis for quantitative traits, which extends our previous work on the Monte Carlo estimation of variance component models and mixed models (Guo and Thompson, 1991, 1992). The greatest attraction of the approach is that it can handle complex genetic models and data on large pedigrees. In the next section, we describe the method and computational algorithm. The practicality of the approach is then illustrated by two examples. Finally, we discuss the proposed method in relation to other work in this area, and indicate directions for future research.

Methods

Notation and assumptions

Consider an n-member pedigree [...] a continuous

data may not be available for some individuals. Suppose

For clarity of exposition, we consider a mixed model with a major autosomal locus, an additive polygenic effect, and an independent environmental effect, without fixed or covariate effects. Extension to include fixed covariate effects, and dominance or non-heritable random effects is straightforward (Guo and Thompson, 1991, 1992), but in this paper we focus on the inclusion of marker data rather than on complexities of the trait model.

For technical reasons (see the Discussion section) we consider a diallelic marker locus. To fix notation, let the two alleles of the major gene trait locus be D and d, with gene frequencies p and 1 - p respectively. Let the two alleles at the marker locus be B and b, with gene frequencies q and 1 - q. Let $G_i$ denote the ith individual's combined genotype at trait and marker loci. For a given genotypic configuration G on the pedigree, let $\mathbf{1}_i$ be an indicator vector with jth entry equal to 1 or 0 depending on whether the jth individual has genotype i. Similarly, we let $\mathbf{1}_F$ be an indicator vector with entry i equal to 1 or 0, depending on whether the ith individual is a founder. It is assumed that each of the three genotypes DD, Dd, and dd, denoted as 1, 2 and 3, makes a specific contribution $\mu_i$ (i = 1, 2, 3) to the phenotype. It is also assumed that the trait and marker loci are in equilibrium, [...] locus

$$y = \sum_{i=1}^{3} \mu_i \mathbf{1}_i + a + e \qquad (1)$$

where a is a vector of additive genetic effects, and e is a vector of individual environmental effects. Each of the vectors a and e is assumed Normally distributed with mean 0, e having variance-covariance matrix $\sigma_e^2 I$, and a having variance-covariance matrix $\sigma_a^2 A$, where A is the numerator relationship matrix (Henderson, 1976).
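As an illustration of the matrix A, the following minimal Python sketch builds the numerator relationship matrix by the standard tabular recursion, and its inverse by the simple rules available when no individual is inbred. The function names, the pedigree encoding (a list of parent index pairs, `None` for founders, parents listed before offspring) and the example pedigree are assumptions made for this sketch; they are not taken from the programs described in this paper.

```python
def relationship_matrix(parents):
    """Numerator relationship matrix A by the tabular method.

    parents[i] is None for a founder, or a pair (f, m) of parent indices.
    Assumption: every parent appears before its offspring in the list.
    """
    n = len(parents)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if parents[i] is None:
            A[i][i] = 1.0
        else:
            f, m = parents[i]
            # Diagonal: 1 + inbreeding coefficient = 1 + half the parents' relationship.
            A[i][i] = 1.0 + 0.5 * A[f][m]
        for j in range(i):
            if parents[i] is None:
                a_ij = 0.0
            else:
                # Relationship with an earlier individual: average over the two parents.
                a_ij = 0.5 * (A[j][f] + A[j][m])
            A[i][j] = A[j][i] = a_ij
    return A


def relationship_inverse(parents):
    """Inverse of A built directly from the pedigree.

    Assumption: no individual is inbred, so founders contribute 1 to their
    own diagonal and each non-founder contributes a fixed 3 x 3 pattern.
    """
    n = len(parents)
    Ainv = [[0.0] * n for _ in range(n)]
    for i, par in enumerate(parents):
        if par is None:
            Ainv[i][i] += 1.0
            continue
        Ainv[i][i] += 2.0
        for p in par:
            Ainv[i][p] -= 1.0
            Ainv[p][i] -= 1.0
            for q in par:
                Ainv[p][q] += 0.5
    return Ainv


# Hypothetical pedigree: two founders, two sibs, a married-in founder, a grandchild.
example = [None, None, (0, 1), (0, 1), None, (2, 4)]
A = relationship_matrix(example)
Ainv = relationship_inverse(example)
```

Working with the inverse directly is convenient for single-site updating of polygenic values, because the inverse is sparse: an individual's row is non-zero only for parents, spouses and offspring.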

Monte Carlo estimation

There are a total of eight parameters to be estimated: the allele frequencies p and q, the recombination fraction r, the major gene effects $\mu_i$, i = 1, 2, 3, and the polygenic and residual variances, $\sigma_a^2$ and $\sigma_e^2$. However, estimation of q within the pedigree analysis is often of secondary interest, as considerable information on the marker may have accumulated. Besides, if the marker is co-dominant, as is usually the case, q can be easily estimated from observed marker phenotypes. Therefore, we assume q is known and let $\theta = \{p, \mu_1, \mu_2, \mu_3, \sigma_a^2, \sigma_e^2, r\}$ denote the vector of parameters to be estimated. The likelihood for model (1) is:

$$L(\theta) = P_\theta(y, M) = \sum_G \int_a f_\theta(y \mid a, G)\, P(M \mid G)\, P_\theta(G)\, dP_\theta(a) = \sum_G f_\theta(y \mid G)\, P(M \mid G)\, P_\theta(G) \qquad (2)$$

where G is the combined two-locus genotypic configuration on the pedigree and the sum is over all [...] on marker

1 or 0, depending on whether or not

G, is an

1979). Also,

$$P_\theta(G) = \prod_{\text{founders}} P_\theta(G_i) \prod_{\text{non-founders}} P_\theta(G_j \mid G_{j_m}, G_{j_f}) \qquad (3)$$

where $j_m$ and $j_f$ are the parents of j, $P_\theta(G_i)$ is the genotypic frequency and is a function of p and q, and $P_\theta(G_j \mid G_{j_m}, G_{j_f})$ is the two-locus transmission probability and is, in general, a function of the recombination fraction r.
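The two-locus transmission probability can be made concrete with a short sketch. The Python below is an illustration only: the encoding of a phase-known combined genotype as a pair of two-character haplotype strings (trait allele followed by marker allele, e.g. `"dB"`) and the function names are assumptions of this sketch. Each parent transmits a parental haplotype with probability (1 - r)/2 and a recombinant haplotype with probability r/2.

```python
def gamete_probs(genotype, r):
    """Distribution of the haplotype transmitted by a parent.

    genotype is a pair of haplotypes such as ("db", "DB"); the first
    character is the trait allele, the second the marker allele.
    """
    h1, h2 = genotype
    probs = {}
    candidates = (
        (h1, (1 - r) / 2),            # parental haplotype 1
        (h2, (1 - r) / 2),            # parental haplotype 2
        (h1[0] + h2[1], r / 2),       # recombinant: trait from h1, marker from h2
        (h2[0] + h1[1], r / 2),       # recombinant: trait from h2, marker from h1
    )
    for hap, pr in candidates:
        probs[hap] = probs.get(hap, 0.0) + pr
    return probs


def transmission_prob(child, father, mother, r):
    """P(child genotype | parental genotypes) for an unordered child genotype."""
    c1, c2 = child
    gf = gamete_probs(father, r)
    gm = gamete_probs(mother, r)
    p = gf.get(c1, 0.0) * gm.get(c2, 0.0)
    if c1 != c2:
        # The two child haplotypes may have come from either parent.
        p += gf.get(c2, 0.0) * gm.get(c1, 0.0)
    return p


# Mating db/DB x dB/Db with r = 0.1 (the mating used in Table 3).
r = 0.1
father, mother = ("db", "DB"), ("dB", "Db")
p_dbdb = transmission_prob(("db", "db"), father, mother, r)   # r(1 - r)/4
p_dbdB = transmission_prob(("db", "dB"), father, mother, r)   # [r^2 + (1 - r)^2]/4
```

For a parent who is not doubly heterozygous the recombinant and parental haplotypes coincide, so the result does not depend on r, which is why only doubly heterozygous parents are informative for linkage.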

A framework for estimation of model (1) is as a "missing data problem", with G and a missing. Thus formulation of an EM algorithm is appropriate. The forms of the EM equations for p, $\sigma_a^2$, $\sigma_e^2$ and $\mu_i$ (i = 1, 2, 3) are given by Guo and Thompson (1992). The added feature here is the inclusion of the linked marker, and the estimation of the recombination fraction r. To obtain the EM equation for r, suppose that (a, G) were observed for all n individuals of the pedigree. Then, estimation of r is just a matter of counting. Of course, we can restrict attention to those parent pairs in which at least one parent is doubly heterozygous; only these are informative for linkage (see, for example, Ott, 1991). Let $H_i$ ($H_i$ = 0, 1, 2) be the number of doubly heterozygous parents in the ith parent-offspring trio, and $R_i$ the number of recombinant events in segregation from the doubly heterozygous parents to the offspring ($R_i$ = 0, 1, 2, [...]

the expected number of recombination events (Thomas and Cortessis, 1991). Table 1 provides values of $H_i$ and $R_i$ for [...] informative matings.

genetic configuration G on [...] two

recombination fraction r

the equation [...] r is

complete [...] one

Sex-specific recombination can be estimated with minor modification, by counting separately segregations in males and in females.
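The counting argument can be sketched as follows, under the assumption that the update for r takes the usual ratio form: the (Monte Carlo average of the) number of recombinant events divided by the (Monte Carlo average of the) number of informative meioses. The function name and the input layout, one list of $(H_i, R_i)$ pairs per sampled configuration, are hypothetical choices for this illustration; $R_i$ may be fractional where an expected number of recombinants is used.

```python
def update_recombination_fraction(realizations):
    """Ratio-form update of r from sampled configurations.

    realizations is a list with one entry per sampled configuration G;
    each entry is a list of (H_i, R_i) pairs, one per parent-offspring trio.
    Returns None when no informative meiosis was scored.
    """
    total_h = 0.0
    total_r = 0.0
    for config in realizations:
        for h, rec in config:
            total_h += h      # informative (doubly heterozygous) parents
            total_r += rec    # recombinant events scored for this trio
    if total_h == 0:
        return None
    return total_r / total_h


# Two hypothetical realizations, each with two informative trios.
samples = [[(1, 0), (2, 1)], [(1, 1), (2, 0)]]
r_new = update_recombination_fraction(samples)
```

Because both sums run over the same realizations, dividing each by the number of realizations first would give the same ratio. Sex-specific estimates would keep separate totals for male and female parents.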

Despite the simplicity of the framework, implementation is not immediate, since there is no way to evaluate explicitly conditional expectations such as those in (4): the distribution of major genotypes and polygenic values, given the observed data, is intractable. Guo and Thompson (1991, 1992) have proposed Monte Carlo estimation of the required conditional expectations, using the Gibbs sampler to obtain realizations of the major genotypes and polygenic values given the data. The Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) is an iterative procedure for drawing multiple (but dependent) realizations from the unknown conditional distribution. In our case, it works as follows. Beginning from any realization of polygenic values and combined major genotypes, (a, G), that is consistent with phenotypic and marker observations, the polygenic values and genotypes are updated, for each individual in the pedigree in turn (in random order), conditional on the observed data (if any) and the polygenic values and genotypes of all other members in the pedigree.

is (which in our

case

as a realization

G [...] y, [...] realizations at successive

cycles are dependent, it is not

efficient to use

In practice, we collect realizations

major genotypes and polygenic values for

n of at of 20

polygenic values are stored and used as (dependent) realizations from the conditional distribution $f_\theta(a, G \mid y, M)$. By the ergodic theorem, the mean of any function of (a, G) over the realizations is a consistent estimate of the expectation of that function. Thus, we have estimates of expectations such as those in (4).
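The mechanics of single-site updating in random order, sub-sampling every few cycles, and ergodic averaging can be illustrated on a reduced problem. The sketch below is a simplification and not the sampler of this paper: it holds the major genotypes fixed and updates only the polygenic values, for which the full conditional of $a_i$ is Normal with precision $A^{ii}/\sigma_a^2 + 1/\sigma_e^2$ (where $A^{ij}$ denotes an element of $A^{-1}$). The function name, arguments and the parent-offspring trio used as an example are assumptions of this sketch.

```python
import random


def gibbs_polygenic_means(z, Ainv, var_a, var_e, n_samples, thin,
                          burn_in=100, seed=1):
    """Ergodic averages of polygenic values a, given residuals z_i = y_i - mu_{G_i}.

    z[i] may be None when the trait is unobserved for individual i.
    One cycle updates every individual once, in a fresh random order;
    a realization is stored after every `thin` cycles.
    """
    rng = random.Random(seed)
    n = len(z)
    a = [0.0] * n
    sums = [0.0] * n
    order = list(range(n))
    for cycle in range(burn_in + n_samples * thin):
        rng.shuffle(order)
        for i in order:
            prec = Ainv[i][i] / var_a
            lin = -sum(Ainv[i][j] * a[j] for j in range(n) if j != i) / var_a
            if z[i] is not None:
                prec += 1.0 / var_e
                lin += z[i] / var_e
            # Full conditional of a_i is Normal(lin / prec, 1 / prec).
            a[i] = rng.gauss(lin / prec, (1.0 / prec) ** 0.5)
        done = cycle + 1 - burn_in
        if done > 0 and done % thin == 0:
            for i in range(n):
                sums[i] += a[i]
    return [s / n_samples for s in sums]


# Parent-offspring trio (two founders and their child); inverse relationship matrix.
trio_Ainv = [[1.5, 0.5, -1.0], [0.5, 1.5, -1.0], [-1.0, -1.0, 2.0]]
means = gibbs_polygenic_means([1.0, 1.0, 1.0], trio_Ainv, 1.0, 1.0,
                              n_samples=20000, thin=2)
```

For this small Gaussian example the conditional means can be obtained exactly (4/7 for each parent and 5/7 for the child), so the ergodic averages can be checked directly. In the full problem no such check is available, which is the reason for using the sampler.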

To implement the Gibbs sampler we need only, for each individual, the conditional distribution of his combined major genotype $G_i$ given $y_i$, $M_i$, $a_i$, and the major genotypes of other members in the pedigree, and the conditional distribution of his polygenic value, $a_i$, given $y_i$, his combined genotype $G_i$, and the polygenic values of other members in the pedigree. For individual i, conditioning on the major genotypes of all other pedigree members involves only his immediate neighbors: his parents (if in the pedigree), his spouse(s) (if any), and his offspring (if any). The genotypes of other pedigree members do not contribute further information. Hence, if we let $G_{(i)}$ denote the genotypes of all pedigree members except i, $G_{i_m}$, $G_{i_f}$ the genotypes of the parents of the ith individual, $\{G_j\}$ his spouses', and $\{G_{j_l}\}$ his offspring's, then

$\propto$

as

a founder [...]

the combined two-locus segregation

population genotype frequency

$P_\theta(M_i \mid G_i) = 1$ for all possible $G_i$. [...] updating $G_i$ is straightforward.

Similarly, the value $a_i$ can [...] straightforwardly, given an

observed polygenic values of

Quantitative traits are often affected by covariates such as age and sex. The effects of these covariates can be estimated by appropriate EM equations as described in Thompson and Shaw (1990). Alternatively, one can sample major genotypes and polygenic effects based on current estimates of covariate effects and then use standard regression methods to estimate covariate effects. The latter method is based on the fact that, given major genotypes, polygenic and [...] heritable random effects,

major

observations on different individuals are independent. Hence, for each realization of

polygenic and other heritable random [...] one can treat these

Choice of starting realizations

to choose a good starting genotypic configuration in order to

prolonged iteration. [...] observed data on each individual conditionally on

provides partial information on major gene effects and recombination fraction. This local information can be used in a "posterior gene-dropping" method to provide a sensible starting point, given the current parameter estimates (Guo and Thompson, 1992). Here the procedure is adapted for marker data.

First, the major genes are simulated, from the top of the pedigree down to the bottom, using the information of current estimates of gene frequency, recombination fraction, major gene effects, and data. For each founder i in the pedigree and each possible two-locus genotype g, we calculate the probability

$$P_\theta(G_i = g \mid y_i, M_i) \propto P_\theta(G_i = g)\, P_\theta(M_i \mid G_i = g)\, f_\theta(y_i \mid G_i = g) \propto P_\theta(G_i = g)\, P_\theta(M_i \mid G_i = g) \exp\left[-\frac{(y_i - \mu_{G_i})^2}{2(\sigma_a^2 + \sigma_e^2)}\right]$$

normalizing (for each i) the sum over g to 1. Here $P_\theta(G_i = g)$ is just the frequency of the combined genotype, calculated on the assumption of linkage and Hardy-Weinberg equilibria. If $y_i$ is missing, we let $f_\theta(y_i \mid G_i = g) = 1$ for all g. If $M_i$ is missing, $P_\theta(M_i \mid G_i = g)$ is set to 1 for all g. A genotype is then randomly selected according to the calculated probability. Once all founders are assigned combined genotypes, we can drop the genes to non-founders. For each non-founder i, a genotype is generated from the following probability distribution:

$\propto$

$\propto$

where f [...] m are [...] are already assigned.

With linked markers, however, this "posterior gene-dropping" procedure [...] not

assigned combined genotypes

problem; if

able to carry through because it is [...] $P_\theta(G_i = g \mid [...]) = 0$ for all possible g. This is [...] some combined genotypes assigned to the parents of the ith [...] not be consistent [...] $M_i$. In practice

"posterior gene-dropping" until

compatible with their

this

all the individuals

observed marker phenotypes.

Once the major genes at marker and trait loci have been dropped down the pedigree, we drop the polygenic values similarly, conditioning on individual trait values, major genotypes, and already assigned parental polygenic values (Guo and Thompson, 1992).
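The founder step of posterior gene-dropping can be sketched in a few lines. The Python below is an illustration under stated assumptions rather than the authors' program: genotypes are phase-known pairs of two-character haplotypes, the marker is co-dominant and is recorded as the number of B alleles, the trait means are passed as a dictionary keyed by the number of D alleles, and all names and parameter values are hypothetical.

```python
import math
import random
from itertools import combinations_with_replacement

HAPLOTYPES = ["DB", "Db", "dB", "db"]


def founder_posterior(y, marker, p, q, mu, var_a, var_e):
    """Posterior over combined two-locus genotypes for one founder.

    y: trait value or None; marker: number of B alleles (0, 1, 2) or None;
    mu: dict mapping number of D alleles (2, 1, 0) to the genotype mean.
    Prior frequencies assume linkage and Hardy-Weinberg equilibria.
    """
    freq = {"D": p, "d": 1 - p, "B": q, "b": 1 - q}
    weights = {}
    for h1, h2 in combinations_with_replacement(HAPLOTYPES, 2):
        f1 = freq[h1[0]] * freq[h1[1]]
        f2 = freq[h2[0]] * freq[h2[1]]
        w = f1 * f2 * (1 if h1 == h2 else 2)
        n_b = (h1[1] == "B") + (h2[1] == "B")
        if marker is not None and n_b != marker:
            w = 0.0                      # genotype inconsistent with the marker
        if y is not None:
            n_d = (h1[0] == "D") + (h2[0] == "D")
            w *= math.exp(-(y - mu[n_d]) ** 2 / (2 * (var_a + var_e)))
        weights[(h1, h2)] = w
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def sample_genotype(posterior, rng):
    """Draw one genotype according to the normalized posterior."""
    genotypes = list(posterior)
    return rng.choices(genotypes, weights=[posterior[g] for g in genotypes])[0]


means = {2: 2.0, 1: 0.0, 0: -2.0}        # hypothetical values of mu_1, mu_2, mu_3
post = founder_posterior(2.0, 2, 0.3, 0.4, means, 0.6, 0.2)
start = sample_genotype(post, random.Random(1))
```

The same weights, with the population frequency replaced by the two-locus transmission probability given the already assigned parental genotypes, would serve for non-founders. That is where a non-founder can end up with zero total weight when the parental assignment is incompatible with the observed marker.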

Estimation of variance-covariance matrix

It is important to estimate the [...] errors of estimated parameters

construct confidence intervals

or to

[...] a Monte Carlo

on a

are x [...] are u

cova

where $\theta_i$ and $\theta_j$ are components of $\theta$. In our [...] u = (a, G) is the "missing data" [...] (y, M). Each [...] can be estimated

consists of conditional expectations of simple functions of u = (a, G) given x = (y, M). For example, if N realizations $u^{(1)}, u^{(2)}, \ldots, u^{(N)}$ are drawn from $P_\theta(u \mid x)$, the first term on the RHS can be estimated by

$$\frac{1}{N} \sum_{m=1}^{N} \frac{\partial^2 \log P_\theta(u^{(m)}, x)}{\partial \theta_i\, \partial \theta_j}$$

[...] terms are estimated similarly, using the same realizations.
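One common way to organize such a computation is the missing-information identity, in which the observed information equals the conditional expectation of the complete-data information minus the conditional variance of the complete-data score. Whether this is exactly the decomposition intended above is an assumption of the following sketch, which uses a deliberately tiny stand-in model: one doubly heterozygous parent of unknown phase (the "missing data") in a backcross, with n1 offspring gametes of one parental class and n2 of the other, and r the only parameter. All names are hypothetical.

```python
import math
import random


def score_and_hessian(phase, n1, n2, r):
    """First and second derivatives in r of the complete-data log-likelihood."""
    nonrec, rec = (n1, n2) if phase == 0 else (n2, n1)
    score = rec / r - nonrec / (1 - r)
    hess = -rec / r ** 2 - nonrec / (1 - r) ** 2
    return score, hess


def mc_observed_information(n1, n2, r, n_samples=50000, seed=1):
    """Monte Carlo observed information for r from sampled phases."""
    rng = random.Random(seed)
    w0 = (1 - r) ** n1 * r ** n2
    w1 = (1 - r) ** n2 * r ** n1
    p0 = w0 / (w0 + w1)                 # conditional probability of phase 0
    s_sum = s2_sum = h_sum = 0.0
    for _ in range(n_samples):
        phase = 0 if rng.random() < p0 else 1
        s, h = score_and_hessian(phase, n1, n2, r)
        s_sum += s
        s2_sum += s * s
        h_sum += h
    mean_s = s_sum / n_samples
    complete_info = -h_sum / n_samples              # E[-second derivative | data]
    missing_info = s2_sum / n_samples - mean_s ** 2  # Var[score | data]
    return complete_info - missing_info


def exact_log_likelihood(n1, n2, r):
    """Observed-data log-likelihood (up to a constant), phases equally likely a priori."""
    return math.log((1 - r) ** n1 * r ** n2 + (1 - r) ** n2 * r ** n1)


info = mc_observed_information(4, 2, 0.3)
```

In this toy case the phase can be sampled exactly. In the pedigree setting the same averages would be taken over Gibbs realizations of (a, G).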

The first and second derivatives of the "complete data" log-likelihood, $\log P_\theta(u, x) = \log P_\theta(a, G, y, M)$, are easy to evaluate since

$$P_\theta(a, G, y, M) = f_\theta(y \mid a, G)\, f_\theta(a)\, P_\theta(M \mid G)\, P_\theta(G)$$

and each term has a simple structure, with typically only a subset of the parameters involved in any one component of the model. For example, the recombination fraction r appears only in $P_\theta(G)$ and

$$\frac{\partial \log P_\theta(G)}{\partial r} = \sum_{j} \frac{\partial \log P_\theta(G_j \mid G_{j_m}, G_{j_f})}{\partial r} \qquad (7)$$

where the sum is over all non-founders, [...] the combined-genotype configuration G. [...] 3.

simple formulae can

an estimate [...] information matrix is obtained, [...] can

to a nominal

can

Likelihood ratio evaluation

A general method for Monte Carlo estimation of likelihood ratios was given by Thompson and Guo (1991). For the model (1), the likelihood (2) takes the form

$$L(\theta) = \sum_G f_\theta(y \mid G)\, P_\theta(M \mid G)\, P_\theta(G)$$

where the sum is over all possible [...] in the pedigree. Direct evaluation of the likelihood is impossible on a large pedigree due to the prohibitively large number of terms in the summation (Ott, 1979). However, it can be shown that the likelihood ratio between two parameter values $\theta$ and $\theta_0$ can be written in the form

(Thompson and Guo, 1991). Thus a Monte Carlo estimate is obtained by sampling [...] M).

that M

not depend [...] is a

a be generated alongside those of G. $f_\theta(y \mid G)$ is a polygenic likelihood involving integration over unobserved a values (equation [...]

replaced by Monte [...] sampling, but for a simple polygenic model on a simple pedigree exact evaluation is possible. Moreover, any evaluation may in fact be unnecessary. If $\theta$ and $\theta_0$ differ only in the recombination fraction, $f_\theta(y \mid G) = f_{\theta_0}(y \mid G)$ for all G and these terms also cancel from the likelihood ratio estimator (8). For linkage analysis, one is often interested in computing the LOD score, a log likelihood ratio at given values of the other genetic parameters. If r in $\theta_0$ is the recombination parameter, while r in $\theta$ is 0.5, then the estimated LOD score is:

(9)

with no evaluation of the polygenic likelihood required.
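The idea can be checked on a problem small enough to evaluate exactly. The sketch below assumes the estimator has the usual importance-sampling form, in which the likelihood ratio $L(\theta)/L(\theta_0)$ is the conditional expectation, given the data and under $\theta_0$, of the ratio of complete-data probabilities. It uses a hypothetical toy model (one doubly heterozygous parent of unknown phase in a backcross, with n1 and n2 offspring gametes in the two parental classes) in place of a pedigree, and it samples the missing phase directly rather than by a Gibbs sampler.

```python
import math
import random


def complete_prob(phase, n1, n2, r):
    """Joint probability of the parental phase and the offspring gametes."""
    nonrec, rec = (n1, n2) if phase == 0 else (n2, n1)
    return 0.5 * ((1 - r) / 2) ** nonrec * (r / 2) ** rec


def exact_likelihood(n1, n2, r):
    """Likelihood of the observed gametes, summing over the two phases."""
    return complete_prob(0, n1, n2, r) + complete_prob(1, n1, n2, r)


def mc_lod(n1, n2, r0, n_samples=20000, seed=1):
    """Monte Carlo LOD score at r0 (against r = 0.5), sampling phases at r0."""
    rng = random.Random(seed)
    p0 = complete_prob(0, n1, n2, r0) / exact_likelihood(n1, n2, r0)
    total = 0.0
    for _ in range(n_samples):
        phase = 0 if rng.random() < p0 else 1
        # Ratio of complete-data probabilities: r = 0.5 over r = r0.
        total += complete_prob(phase, n1, n2, 0.5) / complete_prob(phase, n1, n2, r0)
    # The average estimates L(0.5) / L(r0); the LOD score is minus its log10.
    return -math.log10(total / n_samples)


lod_estimate = mc_lod(4, 2, 0.3)
lod_exact = math.log10(exact_likelihood(4, 2, 0.3) / exact_likelihood(4, 2, 0.5))
```

The variance of such an estimator depends on how often configurations with a large probability ratio are visited, so the number of realizations needed can grow quickly when $\theta$ and $\theta_0$ are far apart.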

In the Monte Carlo EM algorithm described above, given the current parameter estimate $\theta^{(k)}$, realizations are obtained from $f_{\theta^{(k)}}(a, G \mid y, M)$ and used to obtain the next parameter estimates $\theta^{(k+1)}$, say. The realized major genotypes G are realizations from the marginal conditional distribution $P_{\theta^{(k)}}(G \mid y, M)$. The same realizations can thus be used to estimate the LOD score at $\theta^{(k)}$; no additional realizations are needed. However, when satisfactory estimates of other parameters are obtained, and the

run

to provide [...] score curve. A

r provides [...] between r [...] same

r' [...]

All the methods above can be used on more than one pedigree; [...] of (a, G) conditional on (y, M) are simply obtained for [...] and [...] required conditional expectations combined in [...] equations. For estimation [...] information matrix, since pedigrees are unrelated, the total observed information is simply the sum of the values for the individual pedigrees. The inverse of the observed information matrix is then an estimate of the asymptotic variance-covariance matrix on the total data set. Likewise, the overall LOD score is the sum of the LOD scores on individual pedigrees.

Results

In this section we provide two examples to illustrate the method proposed in the previous

section. The programs implementing the method proposed in this paper are written in

C. All the computations were carried out on a DECstation 3100. The random number

generator used was the run-time library drand48. The program psdraw (Geyer, 1988)

was used to draw pedigrees.

Example 1. Simulated data.

We consider simulated data on a 230-member, six-generation pedigree (Figure 1).

are 67 [...] was [...] to [...] data;

simulation values are shown

is 0.5.

same

1.0, $\mu_3$ = [...] r = [...] p =

iterations of Monte [...] iteration 200

realizations (a, G) were [...] 20 [...] of pedigree between each sampling. For the [...] EM iterations, 1000 Gibbs realizations were sampled, with 20 [...] between each sampling. Once the final estimates were obtained, 8000 realizations, with 30 cycles between two realizations, were drawn [...] asymptotic variance-covariance matrix and the LOD scores at various recombination fractions.

Figure 2 shows the LOD score and the parameter estimates against the EM iterations. The Monte Carlo samples used in the EM iterations are not large; Figure 2 reflects the continuing random variation in the conditional expectations used for the EM procedure. However, larger samples are unnecessary. Even for this case where the data provide substantial information, the statistical standard errors (Table 4) are much greater than the standard errors in the Monte Carlo sampling. The final estimates, with their estimated standard errors and nominal 95% confidence intervals, are shown in Table 4. [...] 5 [...] the asymptotic [...] matrix of

score curve: estimated maximum LOD

score is [...] is evident,

not seem to

effects are correctly inferred.

errors,

score at

variance component estimates have higher relative standard

addition, the well

95% confidence interval includes

the true parameters; in all cases

simulation value.

nominal

Example 2. Hypercholesterolemia and the LDL receptor gene.

We re-analyze the data on LDL cholesterol levels and LDL receptor genotypes on a 60-member, five-generation pedigree (Leppert et al., 1986). The pedigree is shown in Figure 4. This data set has been extensively studied by several workers; the following analysis is presented to illustrate the methods of this paper, and not to draw conclusions about the genetic mechanisms of the disease.

Using the Pedigree Analysis Package, PAP, Leppert et al. (1986) carried out a segregation analysis under the assumption of a [...] model. Then, they performed a linkage analysis [...] the parameters obtained from the [...] analysis. They

of [...] at r =

ease

al. (1988) performed a combined segregation and linkage analysis using a regressive

To

assumed a dominant

0.

leading to elevated

[...] score [...] a maximum

Thomas and Cortessis [...] a

no

the posterior mean of

found that the [...] ranged from 0.065 to

recombination fraction [...] from 0.076 to 0.318.

We performed a combined segregation and linkage analysis using the methods of this paper. Since individual 7 is unobserved and does not have offspring, and thus contributes no [...] she [...] the [...] evident that the genotypes of individuals 8, 18 and 23 can be [...] the existing data. Following Leppert et al. (1986), we used the model (1), but made no ascertainment correction nor any assumption of dominance. Starting values p = 0.4, $\sigma_a^2$ = 718.0, $\sigma_e^2$ = 3797.0, $\mu_1$ = 375.6, $\mu_2$ = 139.7, and $\mu_3$ = 95.3 were obtained by a Monte Carlo EM of the mixed model without marker data. Then we performed Monte Carlo EM for 200 iterations. At each EM iteration a sample of 400 Gibbs realizations was drawn, with 10 cycles between each sampling. For the last ten iterations, 1000 Gibbs realizations were sampled, with 20 cycles between each sampling. Based on the final estimates, 12,000 realizations were drawn, with 20 cycles between two consecutive samplings. The final estimates, standard errors and confidence

6. The estimated maximum

iterations are shown

score is 7.13 (Figure

no

ascertainment correction was [...] it cannot

Second, because of the number of founders ($n_F$ = 16),

a since

is k

information in these data is not great; the likelihood surface is flat. Although the presence of the major gene is clear, the magnitudes of the major gene effects have wide confidence intervals. As usual, the estimates of additive and error variances are even less precise. In fact, the wide confidence interval for $\sigma_a^2$ means that for these data [...] is no [...] of any polygenic [...] (Table 7).

The results of our analysis of these data are (not surprisingly) consistent with those of previous authors. The current approach provides maximum likelihood estimates of all the parameters in the model, together with standard errors or other measures of precision. The procedure for likelihood ratio evaluation provides a LOD score curve, and also permits exploration of the multiparameter likelihood surface.

Discussion

Almost every function in human biology exhibits continuous variation. Aspects of diabetes and hypertension, predisposition to cancer, drug and alcohol sensitivity, can be measured as quantitative [...] complex behavioral [...] psychological

are

are

quantitative ones focusing on [...] instead

of affection status.

Our approach provides a practical Monte

loci is

approach to combined

to handle complex models and large [...] It [...] simple, numerically stable and computationally [...] In a [...] this approach works quite well. [...] a Monte Carlo EM approach, combined segregation and linkage analysis does not substantially increase the computing time, compared with segregation analysis

Due to the formidable computational [...] of complex segregation analysis and increasing computing power, there is greatly increased interest in employing Monte Carlo methods in pedigree analysis. Ott (1989), Ploughman and Boehnke (1989) and Kong et al. (1991) have independently proposed Monte Carlo methods for sampling from the pedigree genotype distribution conditioned on the trait phenotypes observed in the pedigree. Unlike the current approach, those methods require probability computations at the trait locus in order to simulate data at a linked marker. Thus they are not feasible for complex models or complex pedigrees. Closer to this paper is the work of Lange and Matthysse (1989) and Lange and Sobel (1991), who proposed using a Metropolis [...] to calculate LOD scores [...] location scores.

and a

for two-point linkage analysis combines a Bayesian perspective

information on parameters [...] is no consensus

likelihood surface

methods of [...]

penetrance functions are known

Gibbs sampler.

we

are

a complex

by estimating

and co-workers

on choice of the

and LOD or location scores are [...] generally, all previous [...] have been restricted to relatively simple genetic models for the trait. By contrast, the Monte Carlo EM approach permits estimation of the parameters of complex models, in conjunction with linkage analysis, and exploration of a multi-parameter likelihood

The efficiency and validity of alternative methods of Markov chain Monte Carlo are currently an active research area in the statistical literature (Tierney, 1991; Gelfand and Smith, 1990). For validity, the technical requirement is that of irreducibility (hence ergodicity) of the Markov chain. For the Gibbs sampler employed in this paper, as also in Thomas and Cortessis (1991), irreducibility is only assured for a diallelic marker locus. Lange and Sobel (1991) point to the same requirement for their Metropolis algorithm. However, irreducibility is not the main barrier in practice. Depending on marker phenotypes observed on the pedigree, it may in fact obtain for multi-allelic markers. Further, it can always be assured by modification of the sampling procedure; one modification is the rejection sampling method proposed by Sheehan and

The greater practical problem is computational efficiency

are as

the Metropolis [...] can [...] "sticky").

same [...] true of [...] sampler [...] to rare

recessives on a complex [...] (Thompson, [...] a [...] reason.

By contrast, the [...] of genotype-phenotype correspondence [...] a model [...] a complex quantitative trait results in less "stickiness" for the Markov chain of genotypic configurations. However, marker information, with not all individuals observed, and/or with tight linkage, is likely to create problems of computational efficiency. The occurrence of multiple alleles at marker loci, and the consequent necessity of using rejection sampling or some other method to ensure ergodicity, can only increase these problems. Computational efficiency is an important issue that warrants further investigation.

Finally, it should be pointed out that the approach of this paper is not limited to the mapping of a single quantitative trait in the framework of mixed models. The same approach can be applied to a variety of gene mapping problems, such as power calculation for linkage analysis, and combined segregation and linkage analysis for multivariate traits. It can be developed to incorporate genetic heterogeneity among different pedigrees, and to handle multiple trait and marker loci. It opens up new ways to tackle complicated models for which analytical methods are often lacking.

fruitful discussions and

his

and comments,

for helpful

Sheehan for providing her pedigree neighborhood programs.

References

Bonney GE (1984) On [...] continuous human traits: Regressive models. American Journal of Medical Genetics

Bonney GE, Lathrop GM, Lalouel J-M (1988) Combined linkage and segregation analysis using regressive models. American Journal of Human Genetics 43:29-37

Eaves LJ (1983) Errors of inference in the detection of major gene effects on

psychological test scores. American Journal of Human Genetics 35:1179-1189

Elston RC (1984) Genetic Analysis Workshop II: Sib pair screening tests for linkage.

Genetic Epidemiology 1:175-178

Elston RC, MacCluer JW, Hodge SE, Spence MA, King RH (1989) Genetic Analysis Workshop 6: Linkage analysis based on affected pedigree members. In Multipoint Mapping and Linkage Based on Affected Pedigree Members. RC Elston et al (eds). Alan R. Liss, New York. pp 93-103

Gelfand AE, Smith AFM (1990) Sampling based approaches to calculating marginal

Transactions on

Machine Intelligence 6:

Geyer CJ

of pedigree

and Statistics: Proceedings of the 23rd Symposium on the Interface, pp 156-163. Interface Foundation of North America.

Go RCP, Elston RC, Kaplan EB (1978) Efficiency

segregation analysis. American Journal of Human Genetics 30:28-37

Guo SW, Thompson EA (1991) Monte Carlo estimation of variance component models. IMA J Math Appl in Med & Biol 8:171-189

Guo SW, Thompson EA (1992) Monte Carlo estimation of mixed models for large complex pedigrees. Submitted.

Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3-19

Hasstedt SJ (1982) A mixed-model likelihood approximation on large pedigrees. Computers and Biomedical Research 15:295-307

A simple method [...] a numerator

A method combining peeling

Gene Mapping

Chakravarti A, Cox D, Bishop [...] Bale SJ, and Skolnick

of

quantitative traits [...] nuclear families: Comparison of [...] program packages. Genetic Epidemiology 6:713-726

Lalouel J-M, Morton NE (1981) Complex segregation analysis with pointers. Human

Heredity 31:312-321

Lange K, Matthysse S (1989) Simulation of pedigree genotypes by random walks.

American Journal of Human Genetics 45:959-970

Lange K, Sobel E (1991) A random walk method for computing genetic location

scores. American Journal of Human Genetics 49:1320-1334

Leppert MF, Hasstedt S et al (1986) A DNA probe for the LDL receptor gene is tightly linked to hypercholesterolemia in a pedigree with early coronary disease.

MacLean CJ, Morton NE, Lew R (1975) Analysis of family resemblance, IV. Operational characteristics of segregation analysis. [...] Human Genetics

Ott J (1974)

human linkage studies. American Journal of Human Genetics 26:588-[...]

Ott J (1979)

and mixed models in human pedigrees. American Journal of Human Genetics 31:161-175

Ott J (1989) Computer simulation methods in human linkage analysis. Proc Nat Acad Sci USA 86:4175-4178

Ott J (1990) Cutting a Gordian knot in the linkage analysis of complex human traits.

Ott J (1991) Analysis of Human Genetic Linkage. Revised edition. The Johns Hopkins University Press. Baltimore.

Ploughman LM, Boehnke M (1989) Estimating the power of a proposed linkage study for a complex genetic trait. American Journal of Human Genetics 44:543-551

Risch N (1984) Segregation analysis incorporating linkage markers. I. Single-locus models with an application to type I diabetes. American Journal of Human Genetics 36:363-386

defined on

a [...] press

Thomas [...] Cortessis V (1991) A [...] sampling approach to linkage analysis.

J Statist

for [...] data [...] an

Thompson EA (1991) Probabilities on complex pedigrees; the Gibbs sampler approach. Computer Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp 321-328. Interface Foundation of North America.

Thompson EA, Shaw RG (1990) Pedigree analysis for quantitative traits: variance components without matrix inversion. Biometrics 46:399-413

Thompson EA, Guo SW (1991) Evaluation of likelihood ratios for complex genetic models. IMA J of Math Appl in Med & Biol 8:149-169

Tierney, L. (1991) Markov chains for exploring posterior distributions. Technical

Report No. 560, School of Statistics, University of Minnesota.

Table 1: [...] for estimation of the recombination fraction: number of double-heterozygous parents (H) and number of recombinants (R). Only informative matings are listed; - denotes impossible combinations. $\Phi$ = [...] $(1 - r)^2]$. [...] Cortessis (1991).

Number of

db/db db/dB dB/dB db/DB dB/Db

1 0 1 1 0

1 1 0 - 0 1

1 0 l' 1 1 0 1 0

1 1 l' 0 0 1 0 1

1 - 0 1 - - 1 0

1 1 0 ........., - 0 1

1 0 1 - l' 0 1 - 1 0

1 1 0 - l' 1 0 - 0 1

2 0 1 2 1 0 2 1 2 1 0

2 1 <I> 1 <I> 1 1 <I> 1 <I> 1

1 0 1 0 1 l' 1 0

1 - - 0 - 1 1 0

1 - - - 0 0 1 1 1 l' 0

1 - - - 0 - 1 1 0

2 2 1 0 1 2 0 1 0 1 2

1 - 1 0 1 0 l' - 0 1

1 1 0 0 1

1 - - - 1 1 0 0 0 'r 1

1 - - 1 0 - 0 1

estimate

1. [...] pedigree neighborhood structure; record the number of individuals [...];
2. [...] data; [...] the number of observed individuals, k;
3. Compute $A^{-1}$;
4. Set initial estimate of $\theta$; i.e. of p, r, $\mu_j$ (j = 1, 2, 3), $\sigma_a^2$, and $\sigma_e^2$;
5. Posterior gene-dropping:
   Drop G; Drop a;
6. De-memorization: Gibbs sample (a, G | y, M) for a specified number of times, at current parameter values p, r, $\mu_j$ (j = 1, 2, 3), $\sigma_a^2$, and $\sigma_e^2$;
7. EM iteration
   Set p* = r* = $\sigma_a^{2*}$ = $\sigma_e^{2*}$ = 0, $\mu_j^*$ = 0, j = 1, 2, 3;
   For j = 1 to given number of sample size N
      Gibbs sample (a, G | y, M) at current parameter values, p, r, $\mu_j$ (j = 1, 2, 3), $\sigma_a^2$, and $\sigma_e^2$;
      for l = 1 to given number of cycles C
         Randomly permute all individuals in the pedigree;
         In order indicated above, update genotypes and random effects;
      next l
      After Cth cycle, we obtain one realization from $P_\theta(a, G \mid y, M)$;
      increment p* by $(2\,\mathbf{1}_F'\mathbf{1}_1 + \mathbf{1}_F'\mathbf{1}_2)/(2\,\mathbf{1}_F'\mathbf{1})$;
   next j

Table 3: Example of linkage segregation probabilities P(G_i | G_j, G_k) and the first order
derivatives of the logarithm of the segregation probabilities. The parental genotypes
are db/DB × dB/Db. The recombination fraction is r. The derivative does not exist
at r = 0.

Offspring    Segregation              First order derivative of
genotype     probability              log segregation probability

db/db        r(1 - r)/4               1/r - 1/(1 - r)
db/dB        [r^2 + (1 - r)^2]/4      2(2r - 1)/[r^2 + (1 - r)^2]
dB/dB        r(1 - r)/4               1/r - 1/(1 - r)
db/Db        [r^2 + (1 - r)^2]/4      2(2r - 1)/[r^2 + (1 - r)^2]
dB/DB        [r^2 + (1 - r)^2]/4      2(2r - 1)/[r^2 + (1 - r)^2]
Db/Db        r(1 - r)/4               1/r - 1/(1 - r)
Db/DB        [r^2 + (1 - r)^2]/4      2(2r - 1)/[r^2 + (1 - r)^2]
DB/DB        r(1 - r)/4               1/r - 1/(1 - r)
db/DB        r(1 - r)/2               1/r - 1/(1 - r)
dB/Db        r(1 - r)/2               1/r - 1/(1 - r)
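The entries of Table 3 can be checked by enumerating gametes. The sketch below assumes each haplotype is written as a two-character string (disease-locus allele, then marker allele), that each parent transmits its two non-recombinant haplotypes with probability (1 - r)/2 each and its two recombinant haplotypes with probability r/2 each, and that offspring genotypes are phase-known unordered haplotype pairs. The function names are illustrative only, and the derivative is checked numerically by a central difference.

```python
import math
from itertools import product

ORDER = ["db", "dB", "Db", "DB"]  # haplotype order used to label genotypes

def gametes(hap1, hap2, r):
    """Gamete distribution of a parent carrying haplotypes hap1/hap2."""
    rec1 = hap1[0] + hap2[1]      # recombinant haplotypes
    rec2 = hap2[0] + hap1[1]
    dist = {}
    for hap, prob in ((hap1, (1 - r) / 2), (hap2, (1 - r) / 2),
                      (rec1, r / 2), (rec2, r / 2)):
        dist[hap] = dist.get(hap, 0) + prob
    return dist

def segregation(parent1, parent2, r):
    """Probabilities of phase-known offspring genotypes."""
    probs = {}
    for (h1, p1), (h2, p2) in product(gametes(*parent1, r).items(),
                                      gametes(*parent2, r).items()):
        a, b = sorted((h1, h2), key=ORDER.index)
        key = f"{a}/{b}"
        probs[key] = probs.get(key, 0) + p1 * p2
    return probs

def dlog_numeric(parent1, parent2, genotype, r, h=1e-6):
    """Central-difference derivative of the log segregation probability."""
    up = segregation(parent1, parent2, r + h)[genotype]
    down = segregation(parent1, parent2, r - h)[genotype]
    return (math.log(up) - math.log(down)) / (2 * h)

P1, P2 = ("db", "DB"), ("dB", "Db")   # parental genotypes of Table 3
r = 0.1
probs = segregation(P1, P2, r)
for genotype in sorted(probs):
    print(genotype, probs[genotype], dlog_numeric(P1, P2, genotype, r))
```

The ten probabilities sum to r(1 - r) + [r^2 + (1 - r)^2] + r(1 - r) = 1, which is a quick consistency check on the table.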

... standard errors ...

Parameter   True value   Estimate   S.E.     Interval

μ_1            2.0        2.3400    0.2190   (1.9109, 2.7692)
μ_2            0.0        0...      0.1805   (-0.1936, 0.5141)
μ_3           -2.0       -2.1133    0.1832   (-2.4724, -1.7541)
               0.6        0.6280    0.1557   (..., 0.9332)
               0.2                  0.0694   (0.0174, 0.2895)
r              0.1        0.0385    0.0325   (0.0000, 0.1021)

Table 5: Estimated variance-covariance matrix for simulated data. The actual value
of each element in the matrix is the shown value times a factor of 10^-6.

         p          μ_1       μ_2       μ_3

p
μ_1     -1794.3     47940
μ_2     -2863.1     17731     32591
μ_3     -2506.2     9524.0    25539     33578

.5102.5

Table 6: Estimated parameters, with standard errors and confidence intervals,
for hypercholesterolemia data.

Parameter    Estimate      S.E.        Confidence interval

p               0.3266       0.1066    (0.118, 0.536)
μ_1           378.880       27.133     (325.700, 432.061)
μ_2           157.220       21.851     (1..., 200.049)
μ_3            94.980       21.106     (53.613, ...)
σ_a^2         862.150      847.456     (0.0, 2523.163)
σ_e^2        2933.538     1122.495     (733.449, 5133.627)
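The fully legible rows of Table 6 are numerically consistent with intervals of the form estimate ± 1.96 × S.E., truncated below at zero. That reading is an inference from the printed numbers, not something stated in the surviving text, and the helper name `wald_interval` is illustrative. A short check, using only values that appear in the table:

```python
# (estimate, standard error, reported interval) for the rows that are fully legible
rows = {
    "p":         (0.3266, 0.1066, (0.118, 0.536)),
    "mu_1":      (378.880, 27.133, (325.700, 432.061)),
    "sigma_a^2": (862.150, 847.456, (0.0, 2523.163)),
    "sigma_e^2": (2933.538, 1122.495, (733.449, 5133.627)),
}

def wald_interval(estimate, se, z=1.96, lower_bound=0.0):
    """Assumed form of the interval: estimate +/- z * S.E., truncated at lower_bound."""
    low = max(lower_bound, estimate - z * se)
    high = estimate + z * se
    return low, high

for name, (est, se, reported) in rows.items():
    low, high = wald_interval(est, se)
    print(name, round(low, 3), round(high, 3), "reported:", reported)
```

The recomputed bounds agree with the reported ones to within rounding of the printed estimates and standard errors.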

Table 7: Estimated variance-covariance matrix for hypercholesterolemia data.

p    μ_1    μ_2    μ_3    σ_a^2    σ_e^2

1.136 X 102

-6.159 X 101 X 102

7.725 X 101 3.006 X 102 4.775 X 102

-9.013 X 102 X X 101 X 102

7.067 X 10° X 102 X 7.

Figure 1: ... constitute a ...-member, four-generation pedigree; ... a ...-member, five-generation pedigree; ... constitute a ...-member ...

[Figure: trace plots against EM iteration (0 to 200), with panels labelled (a) to (d) and two further panels. Surviving caption fragments: "Figure ... of the ... LOD score, ... error variance, ..."]

[Figure: ... score curve. x-axis: recombination fraction (0.0 to 0.5).]

[Figure: pedigree diagram annotated with marker genotypes (BB, Bb). Caption not recoverable.]

Figure 5: Monte Carlo LOD score curve. [Plot; x-axis: recombination fraction (0.0 to 0.5).]

[Figure: trace plots against EM iteration (0 to 200), several labelled panels. Surviving caption fragments: "Combined segregation and ... LDL data. Estimate of ... error variance, ..."]
