A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping: Toward...

15
A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping: Toward High-Resolution Mapping of Quantitative Traits Author(s): Xiang-Yang Lou, George Casella, Rory J. Todhunter, Mark C. K. Yang and Rongling Wu Source: Journal of the American Statistical Association, Vol. 100, No. 469 (Mar., 2005), pp. 158-171 Published by: American Statistical Association Stable URL: http://www.jstor.org/stable/27590526 . Accessed: 16/06/2014 06:22 Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp . JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. . American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journal of the American Statistical Association. http://www.jstor.org This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AM All use subject to JSTOR Terms and Conditions

Transcript of A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping: Toward...

A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping:Toward High-Resolution Mapping of Quantitative TraitsAuthor(s): Xiang-Yang Lou, George Casella, Rory J. Todhunter, Mark C. K. Yang and RonglingWuSource: Journal of the American Statistical Association, Vol. 100, No. 469 (Mar., 2005), pp.158-171Published by: American Statistical AssociationStable URL: http://www.jstor.org/stable/27590526 .

Accessed: 16/06/2014 06:22

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .http://www.jstor.org/page/info/about/policies/terms.jsp

.JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range ofcontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new formsof scholarship. For more information about JSTOR, please contact [email protected].

.

American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journalof the American Statistical Association.

http://www.jstor.org

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping: Toward

High-Resolution Mapping of Quantitative Traits Xiang-Yang Lou, George Casella, Rory J. Todhunter, Mark C. K. Yang, and Rongling Wu

The nonrandom association between different genes, termed linkage disequilibrium (LD), provides a powerful tool for high-resolution

mapping of quantitative trait loci (QTL) underlying complex traits. This LD-based mapping approach can be made more efficient when it

is coupled with interval mapping characterizing the genetic distance between markers and QTL. This article describes a general statistical

framework for simultaneously estimating the linkage and LD that are related in a two-stage hierarchical sampling scheme. This framework

is constructed within a maximum likelihood context and can be expanded to fine-scale mapping of complex traits for different population structures and reproductive behaviors. We provide a closed-form solution for joint estimation of quantitative genetic parameters describing

QTL effects, QTL position and residual variances, and population genetic parameters describing al?ele frequencies and QTL-marker LD.

We perform simulation studies to investigate the statistical properties of our joint analysis model for interval and LD mapping. An example

using body weights of dogs from a multifamily outcrossed pedigree illustrates the use of the model.

KEY WORDS: EM algorithm; Interval mapping; Linkage disequilibrium; Maximum liklihood estimate; Quantitative trait loci.

1. INTRODUCTION

The motivation of this article arises from problems in the

high-resolution genetic mapping of quantitative traits with DNA-based polymorphic markers. The poly genie inheritance and environmental sensitivity of quantitative traits complicate the analysis and modeling of their genetic architecture, as spec

ified by the number, distribution, actions, and interactions of the

underlying genes, that is, quantitative trait loci (QTL) (Mackay 2001). Molecular genetic markers, derived from polymorphic sites in the genome, have proven to be powerful tools for char

acterizing and mapping individual QTL for a quantitative trait. The basic principle of such QTL mapping proceeds by count

ing the expected recombination events occurring between the

markers and putative QTL during meiosis. One of the genetic strategies for QTL mapping is to make use of the marker-QTL recombinants created in marker-genotyped generations through

controlled crosses. This strategy is called linkage analysis. In

terval mapping of QTL based on the multilocus linkage analysis of two flanking markers and a QTL bracketed by the markers al lows for a systematic scan for the genome-wide distribution of

QTL (Lander and Botstein 1989). Substantial extensions of in terval mapping into more general situations have been reported (reviewed in Jansen 2000).

Interval mapping utilizes the information about gene segre

gation and transmission in a well-constructed pedigree, and is

Xiang-Yang Lou is Postdoctoral Research Associate, under the guidance of Dr. Rongling Wu, and George Casella is Distinguished Professor and

Chair, Department of Statistics, University of Florida, Gainesville, FL 32611.

Rory J. Todhunter is Associate Professor, The James A. Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY 14853. Mark Yang is Professor, Department of Statistics, University of

Florida, Gainesville, FL 32611. Rongling Wu is Associate Professor, De

partment of Statistics, University of Florida, Gainesville, FL 32611 (E-mail:

[email protected]) and Adjunct Distinguished Professor, Zhejiang Forestry Uni

versity, Lin'An, Zhejiang 311300, China. Rongling Wu conceived the idea of this study, wrote this manuscript, and was responsible for directing the research. The authors thank the associate editor and an anonymous referee for their constructive comments on this manuscript. This work is supported in part by a National Science Foundation grant (9971586) to G. Casella; an

Outstanding Young Investigators Award of the National Science Foundation of China (30128017), a University of Florida Research Opportunity Fund

(02050259); and a University of South Florida Biodefense grant (7222061-12) to R. Wu; and the Morris Animal Foundation contract grant (722206212) to

R. J. Todhunter. The publication of this manuscript is approved as journal series No. R-09584 by the Florida Agricultural Experiment Station.

limited for the fine-scale mapping of QTL unless the size of the pedigree is large. An alternative strategy of QTL mapping is based on linkage disequilibrium (LD) analysis, which uses

recombinant events in a population generated at a historical

time before genotyping starts. Because the extent of disequi

librium between two genes decays exponentially toward at a rate depending on the recombination fraction (Lynch and Walsh

1998), the detection of significant LD after a number of gener ations may imply a tight linkage between the marker and QTL. For this reason, LD-based mapping provides a powerful way of fine-scale mapping of QTL. Although the use of LD mapping has led to successful high-resolution mapping of several human diseases (e.g., H?stbacka et al. 1992, 1994), the prerequisite of this strategy, that LD is due to tight linkage between the QTL and marker, may not hold. In populations where evolutionary

forces such as selection, drift, population structure, and pop

ulation admixture act, the disequilibrium can also be detected

(Lynch and Walsh 1998). Such an LD is regarded as spurious because it provides no information about the marker-QTL link

age.

Many strategies have been developed to combine the advan

tages of linkage and LD analysis traditionally considered sepa rate strategies. For example, in human genetics, a family-based LD strategy, called transmission/disequilibrium testing (TDT), was devised to simultaneously determine the linkage and LD between a marker and a disease gene (Allison 1997). Several studies have considered the joint modeling of linkage and LD

mapping for quantitative traits as well as its advantages com

pared with pure linkage mapping or pure LD mapping (Wu and

Zeng 2001; Wu, Ma, and Casella 2002; Fan and Xiong 2002; Farnir et al. 2002; Meuwissen, Karlsen, Lien, Olsaker, and

Goddard 2002; Lund, Sorensen, Guldbrandtsen, and Sorensen

2003; Perez-Enciso 2003). However, these studies provided a

solution only for some very specific problems, because they were not formulated under a general statistical framework, by

which analysis and modeling of QTL mapping can be read

ily extended to different populations, reproductive behaviors, marker systems, and pedigree structures. Moreover, in many of

? 2005 American Statistical Association Journal of the American Statistical Association

March 2005, Vol. 100, No. 469, Theory and Methods DOI 10.1198/016214504000001295

158

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 159

the cited studies, the statistical properties of the joint linkage LD analysis have not been investigated through either analytical or numerical approaches.

In this article we develop a general framework for integrating interval and LD mapping to enforce the efficiency and effective ness of high-resolution mapping of quantitative traits. Our inte

grative model makes use of genetic information from the family pedigree and population data by adopting a two-stage hierarchi cal sampling scheme as reported by Wu and Zeng (2001) and

Wu et al. (2002). The upper hierarchy of this scheme describes the gene distribution in a natural population from which a sam

ple is randomly drawn (LD analysis), and the lower hierarchy describes gene segregation in the progeny population derived from matings (linkage analysis). These two hierarchies are con

nected through population and Mendelian genetic theories. In a recent study, Lou et al. (2003a) derived a closed-form solution for estimating population genetic parameters, such as al?ele fre

quencies and LD, within the maximum likelihood framework. Here we extend the idea of two-stage hierarchical sampling scheme to jointly model interval and LD mapping by imple

menting the closed-form EM algorithm of Lou et al. (2003a).

Compared with the earlier work by Wu and Zeng (2001) and Wu et al. (2002), this extended model displays increased com

putational efficiency and provides more precise estimates of the

QTL genomic location and genetic effects, because a pair of

flanking markers that bracket the QTL are used simultaneously.

2. TERMINOLOGY AND NOTATION

2.1 Haplotype, Diplotype, and Zygote Genotype

Suppose that there are L loci, Ax,..., AL, on a chromo

some which are segregating in a random mating natural pop

ulation. At locus / (/ = 1,..., L) there are k? al?eles, denoted

by A\,..., Alk . The haplotype, that is, a linear arrangement of

al?eles from the L loci on one member of a pair of homolo

gous chromosomes (Cepellini et al. 1967), can be described by a vector of L dimensions containing one al?ele from each of

the L loci. The al?eles from all of the L loci form nn = ]~]/=i k

haplotypes. During the process of pollination, paternally- and

maternally-derived haplotypes will randomly unite to form n\ combinations, among which there are a total of n^(nn + l)/2 different zygotic configurations, or diplotypes (Morton 1983). Because different diplotypes may have the same genotype, the number of multilocus genotypes, nz = ]~[/=i (^ + ^)h?2, will be less than the number of diplotypes. Let H?, V^> or H?H?', and Zv be the haplotype ? (? = 1,..., nn), diplotype ??' (? <

?' =

1,..., nn), and zygote genotype v (v = 1,..., nz). To dis

tinguish between polymorphic markers and quantitative trait loci composing L-locus haplotypes, we let ?(V^'), Qm(V^'), and

Gq?D??>) denote the many-to-one mapping operators tak

ing the genotypes at all loci, marker loci, and QTL of diplo type V&.

We use an example to demonstrate the concepts of haplo

type, diplotype, and zygote genotype. Consider two biallelic

loci, A and B, on the same chromosome, having al?eles A, a

and B, b. The haplotypes of these two loci are the combinations of al?eles, AB, Ab, aB, and ab. These haplotypes (or gametes) are randomly paired to generate 10 diplotypes, ABI AB, ABI Ab, ABlaB, ABI ab, Abi Ab, AblaB, Abi ab, aBlaB, aBlab, and ablab,

where "/" denotes the separation of the maternally and pater

nally derived gametes. Of the foregoing diplotypes, the fourth and sixth cannot be genotypically distinguished from each an other. Thus we actually have only nine distinguishable geno

types, expressed as AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, and aabb.

2.2 Recombination and Reduced Recombination

During the meiotic stage of life, the diplotype (zygotic con

figuration) arising from the unification of maternal and paternal haplotypes will generate new haplotypes for the next genera tion. Some of these new haplotypes will be different from the

maternal and paternal haplotypes that form the diplotype be cause of the occurrence of recombination events between a pair of contiguous loci. For example, diplotype ABI ab will generate four haplotypes, AB, ab, Ab, and aB, with the first two identi cal to the haplotypes composing the parental diplotype and the second two due to the recombination or crossover between the

two loci. The frequency at which the recombinant types occur

among the total number of haplotypes, termed the recombina

tion fraction, depends on the genetic distance between the two loci. Thus the recombination fraction (and therefore the genetic distance) can be estimated by counting the relative numbers of recombinant and nonrecombinant haplotypes that the zy

gote produces. But whether recombinant and nonrecombinant

types can be genotypically distinguished from each other in a

genotyped family depends on the heterozygosity of a diplotype. For those completely homozygous diplotypes (e.g., ABI AB), the

genotypic distinction between recombinant and nonrecombi

nant haplotypes is not possible, despite the fact that these two

types occur at the same time.

In this study we construct a general framework for integrat

ing the advantages of interval and LD mapping of complex traits. Traditional interval mapping is based on a known link

age group throughout which every two flanking markers are scanned to test whether a putative QTL is located between the two markers. Unlike such a traditional treatment, however, our

integrative model assumes that the recombination fraction be

tween the two markers is unknown a priori, to better specify the relationships among the markers and QTL. Let r\, r2, and

r be the recombination fractions between the left marker (of al?eles A and a) and QTL (of al?eles Q and q), between the

QTL and the right marker (of al?eles B and b), and between the two markers. Thus, if r\ and r2 are estimated, then the

location of the QTL within the marker interval can be deter mined. The estimation of r\ or r2 is based on the observations

of the recombinant and nonrecombinant types. For a completely

heterozygotic diplotype, AQBIaqb, the eight haplotypes that it

produces are distinguishable based on their formation mecha nisms resulting from the recombination or nonrecombination.

For those partially heterozygotic diplotypes, however, different formation mechanisms may lead to the same haplotype. For ex

ample, diplotype, AQBIAQb produces two distinguishable hap lotypes AQB and AQb, each of which may result from either recombination (R) or nonrecombination (N) between each of two pairs of adjacent loci (i.e., A and Q, Q and B) described

by NN, NR, RN, or RR. Assuming that there is no interference, the frequencies of these four genotypically indistinguishable formation mechanisms for each haplotype can be expressed in

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

160 Journal of the American Statistical Association, March 2005

Table 1. Examples of Reducible Recombination

Parental Formation Formation Frequency of State diplotype Haplotype mechanism frequency reduced recombination vector

AQB/AQb AQB NN (1 -

n)(1 -

r2)/2 .5 (0 0) NR (1-^/2 RN r1(1-r2)/2 RR /V2/2

AQb NN (1 -n)(1 -r2)/2 .5 (0 0) NR (1-A?)r2/2

RN r1(1-r2)/2 RR /1/2/2

AQBIAqb AQB NN (1 - n)(1 - r2)/2 (1 -

r2)/2 (0 -1) RN ri(1-r2)/2

AQb NR (1 -ri)r2/2 r2/2 (0 1) RR r1r2/2

AqB NR (1-r,)r2/2 r2/2 (0 1) RR /ir2/2

?<# NN (1 -n)(1 -r2)/2 (1-r2)/2 (0-1) RN Ai(1

- r2)/2

AQBIaQb AQB NN (1 -

n)(1 - r2)/2 (1 - r,)(1 -

r2)/2 (-1 -1) RR r^/2 nr2/2 (11)

AQb NR (1-Ai)r2/2 (1-r-i)r2/2 (-11) RN r^-r2)/2 ri(1-r2)/2 (1-1)

aQB NR (1-ri)r2/2 tf -

r,)r2/2 (-11) RN /i(1-r2)/2 n(1-r2)/2 (1-1)

aQb NN (i_n)(1-r2)/2 (1 - M(1 -

r2)/2 (-1-1) RR r,r2/2 r,r2/2 (11)

NOTE: The state of recombination is expressed as 1 for recombined (R), -1 for nonrecombined (N), and 0 for reducible.

terms of r\ and V2 with the form given in Table 1, which sum to .5 independent of the values of r\ and ri. Thus these four

mechanisms together (or the corresponding haplotype) are suf ficient for the characterization of the recombination and can be

regarded as reduced recombination.

Diplotype AQBIAqb produces four different haplotypes, AQB and Aqb due to either NN or RN and AQb and AqB due to ei ther NR or RR (see Table 1). The sum of the frequencies of the two corresponding formation mechanisms for each haplo

type is dependent on only one term, determined by r\. Thus the two different mechanisms together form the reduced recombi nation. Similarly, diplotype AQBIaQb also forms four different

haplotypes. Yet two formation mechanisms for each haplotype should each be viewed as the reduced recombination, because

the sum of the frequencies of the two mechanisms includes two nonreducible terms (see Table 1). In general, the reduced re

combination for a haplotype can be characterized by three dif ferent states for any two adjacent loci, recombined, nonrecom

bined, and reducible (i.e., the haplotype can be reduced to a recombinant or nonrecombinant), coded by 1, ?1, and 0 (see Table 1). As shown later, the reducible state is trivially suffi cient for the formation of a closed form for estimating r\ or ri.

We use Y? (? = 1,..., 3L_1) to denote a reduced recombina

tion vector for a L-locus haplotype. (There are a total of 3L_1

such vectors because we have three states between two adjacent

genes.) The reduced recombination vector for a diplotype, de

noted by 1Z, is the combination of Y? f?r the maternal parent and Y?' for the paternal parent.

2.3 Population Genetic and Gene Transmission Parameters

We let p? and P^t denote the frequencies of haplotype H% and diplotype D^< in the study population. If the population is

at Hardy-Weinberg equilibrium, then we have

I pi when ? = ?'

2p^p^< otherwise.

The haplotype frequency p% contains different components de termined by al?ele frequencies at each loci and coefficients of linkage disequilibrium of different orders among these loci

(Lynch and Walsh 1998). Lou et al. (2003a) provided a gen eral model for describing p? in terms of al?ele frequencies and

linkage disequilibria. For pure LD mapping (Lou et al. 2003a), we draw samples

in the same generation from a natural population. By testing for the statistical significance of LD, we can make an inference

about a possible tight linkage between markers and QTL. How ever, in a joint analysis of interval mapping and LD mapping, our samples are derived from both the natural (parental) pop ulation and its offspring population, which thus forms a two

stage hierarchical sampling scheme. When the samples from the older (generation t) parental population contain information about al?ele frequencies and LD, the samples from the younger (generation t + 1 ) offspring population contain gene segrega tion and transmission determined by the Mendelian laws and recombination fractions. The estimation of the recombination

fractions is based on gene transmission from the parental to off

spring generation. We let r\ denote the recombination fraction between loci A1 and Al+l.

2.4 Penetrance Function

For a quantitative trait, there is no one-to-one correspon dence between the genotype and phenotype (y). The conditional

probability of observing the corresponding phenotype given a

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 161

specified genotype, termed the penetrance function, is thus for mulated to describe the expression of the genotype. Because

the phenotype of a trait is genetically determined only by the

genotypes at the putative QTL, the penetrance function, given a

diplotype T>??>, can be expressed as

\2"

f(y\V^ = J2l^s\^')-7^^P 2ttg 2o2 (1)

where ?jls is the genotypic mean of QTL genotype gs, a2 is the residual variance, and \(gs\V^>) is the indicator variable,

Kgs\Vw) 1 if gs = Gq(V^i)', that is, QTL genotype (

is compatible with diplotype D^> 0 otherwise.

We ?t?gVL^ttf(ym\V^l),f(yP\Vp^l), and/(/|P|r)

as the

penetrance functions of phenotype given diplotype V^> for the maternal parent (m), paternal parent (p), and offspring (o). There is the same penetrance function between the maternal

and paternal parent if there is no sex effect. In many studies,

the same penetrance may be assumed for the parental and off

spring generations. Table 2 lists the genetic terminology used in our genetic map

ping. It also gives the corresponding definitions and symbols.

3. THE MODEL

3.1 The Complete-Data Likelihood

For our mapping study aimed at estimating haplotype fre

quencies, QTL effects and positions (measured by the recom bination fractions), marker data (M), zygote genotype (Z),

diplotypes CD), haplotype (H), reduced recombination (H), and phenotype data (y) are viewed as the complete data. The marker and phenotypic data, denoted in boldface, are observed

data, whereas zygote genotypes and diplotypes (containing pu tative QTL) of the parents and offspring, and reduced recombi nation configurations of the offspring, denoted in mathematical

face, are unobserved (or missing) data. Suppose that there are TV

unrelated families drawn from a randomly mating population. If there are no phenotypic covariances between parents and off

spring, then the likelihood function of the complete data can be

expressed as

L(il\ym, yp, y?, Mm, M^, M?, Dm, Vp, D?, K)

cxVxYDm,Dp\{Pw}} N ?

xfl \?yT\^T)f(yf\^) 7=1 I Ni ?

x f][Pr(^,^7|Pr,Pf)/(^.|^)] , (2) 7=1 i

where y's are the phenotypic vectors; X>'s are the diplotype vec

tors; 1Z is the reduced recombination matrix for the offspring; f? is the unknown vector containing population genetic para

meters (haplotype frequencies, f?p), quantitative genetic pa rameters (QTL genotypic values and residual variance, SIq), and QTL positions (measured by recombination fractions, ?p); / denotes the ith family derived from the ith maternal and ith pa ternal parent; j denotes they'th offspring within the ith family; and Ni is the number of offspring within family i.

Table 2. Definitions and Symbols Used to Describe Genetic Terminology in QTL Mapping

Terminology Symbol Definition

Al?ele

Al?ele frequency

Diplotype

Maternal diplotype

Paternal diplotype

Offspring diplotype Diplotype frequency

Haplotype Haplotype frequency Hardy-Weinberg equilibrium

Genotype Genotype frequency Linkage Linkage disequilibrium Locus Molecular marker Nonallele Penetrance

Phenotype QTL Recombinant

Nonrecombinant Reduced recombination

Recombination fraction

Zygote H?t?rozygote Homozygote

A1 A1

P\ V&

H'

HWE

Pv

f(yW^) y or y

R N

n r

Zv

Different copies of a gene at the same locus

Relative proportion of an al?ele in a population A set of haplotype pairs (one from the maternal parent and the other from the paternal parent) with the same genotype

Relative proportion of a diplotype in a population A linear arrangement of al?eles on the same chromosome Relative proportion of a haplotype in a population A population state in which genotype frequency is the product of the corresponding al?ele frequen cies The particular al?eles at specified loci present in an organism Relative frequency of a genotype in a population Co-segregation between al?eles at different loci in a progeny population Nonrandom association between al?eles at different loci in a population The chromosomal position at which a gene resides A particular DNA sequence that displays polymorphisms among individuals Al?eles from different loci

Percentage frequency with which a gene exhibits its effect

Physical, visible characteristics of a genotype Quantitative trait locus, a hypothesized region with al?eles triggering an effect on a quantitative trait The exchange of DNA segments between the maternal and paternal chromosomes DNA sequence with no exchange between the homologous chromosomes The recombination that occurs between a pair of loci but cannot be observed Relative proportion of recombinant gametes, that is, gametes that contain al?eles from different pa ternal chromosomes The combination of two al?ele or haplotypes each from a different parent The combination of different al?eles at a locus The combination of the same al?ele at a locus

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

162 Journal of the American Statistical Association, March 2005

The probability Pr[X>m, DP\{P^>}] is the joint probability of

maternal and paternal parental diplotypes, which depends on

a specific sampling scheme, for example, stratified, grouped,

censored, or truncated. Specially, there is a multinomial distri

bution for simple random samples,

nh

?x[Dm, Dp\{Pw}] oc W (Pw)Nl\i+NP^i,

where NnA, and _V?w are the numbers of maternal and paternal

parents with diplotype V^>. The probability Py(V^,7Z?j\V}?\

Vf) is the conditional probability of offspring y of family / hav

ing diplotype V?? and reduced recombination IZij given parental

diplotypes Vf and Vf. Of the three subsets of unknown para meters, ftp is contained within Pr[X>m, DP\{P^<}], SIq is con

tained within the penetration functions, and ?r is contained

within Pr(^.,^|P^,Pf).

The maximum likelihood estimators (MLEs) of the unknown

parameters under the complete-data likelihood function (2) are

given in Appendix A.

3.2 The Incomplete-Data Likelihood

In the genetic mapping of complex traits, only marker and

phenotype data are observed, whereas the data on diplotypes,

recombination events, and QTL genotypes are missing. The ob

served data alone are viewed as incomplete data of the previ

ously specified complete-data model.

The likelihood of the incomplete data including the pheno type (y) and marker information (M) can be formulated as

L(Sl\ym,yp,y?,Mm,Mp,M?)

otPv[Mm,Mp\{P^>}] N nh nh (

x LI E E mV^,\Mf)Vr(V^,|Mf)

xf(y?\v'^)f(y?\v^)

j=\ z<r' ?,?'

x/(^|?rT')]L (3)

where Pr(?> , \M ) and ?r(Vp , |Aff ) are the conditional prob abilities of the maternal (Ww) and paternal diplotypes (Vp,)

given maternal (Mf) and paternal marker genotypes (Mp) for

the ith family, and ?v(Mij,V0Tif,R?g'\V^/,Vp^)

is the con

ditional probability of offspring having marker genotype M[j, diplotype V?TT,

and reduced recombination R??> given maternal

diplotype V ? and paternal diplotype Vp,. It can be seen from (4) that the observed data constitute a

normal mixture model that is difficult to solve for the MLEs.

However, maximizing the expected complexdata likelihood and

detecting the corresponding MLEs is simple. Thus, to obtain the

MLEs of the incomplete-data likelihood, we attempt to maxi mize the completedata likelihood by replacing it by its condi tional expectation given the incomplete data. This expectation

is computed with respect to the conditional distribution of the

complete data given the incomplete data and the current es

timate of the unknown vector ?. Appendix B describes the derivation procedures and the implementation of the EM algo rithm to obtain the estimates.

3.3 Asymptotic Variance-Covariance Matrix

After the point estimates of parameters are obtained by the EM algorithm, it is necessary to derive the variance-covariance

matrix and evaluate the standard errors of the estimates. Be

cause the EM algorithm does not automatically provide the estimates of the asymptotic variance-covariance matrix for

parameters, an additional procedure has been developed to es

timate this matrix (and thereby standard errors) when the EM

algorithm is used (Louis 1982; Meng and Rubin 1991). These

techniques involve calculation of the incomplete-data informa

tion matrix, which is the negative second-order derivative of

the incomplete-data log-likelihood. Louis (1982) established an

important relationship among the information for the complete (Icom), incomplete (Xincom), and missing data (XmissX expressed as 2incom

= Xcom

? Xmiss. Because it is difficult to evaluate

^incom directly, Louis' relationship allows for an indirect eval

uation based on X m and Xmiss. The derivation procedures of

Zcom and Tmiss are given in Appendix C. Meng and Rubin

(1991) proposed a so-called "supplemented" EM algorithm or

SEM algorithm to estimate the asymptotic variance-covariance

matrices, which can also be used for calculating the standard

errors for the MLEs of our QTL parameters.

4. A CASE STUDY FROM A CANINE HIP DYSPLASIA PROJECT

A canine pedigree was developed to map QTL responsible for canine hip dysplasia (CHD) using molecular markers. Seven

founding greyhounds and six founding Labrador retrievers were

intercrossed, followed by backcrossing F\'s to the greyhounds and Labrador retrievers and intercrossing the Fi's. A series of

subsequent intercrosses among the progeny at different genera

tion levels led to a complex network pedigree structure (Fig. 1), which maximized phenotypic ranges in CHD-related quantita tive traits and the chance to detect substantial LD (Todhunter et al. 2003; Bliss et al. 2002). A total of 148 dogs from this out bred population were genotyped for 240 microsatellite markers located on 38 pairs of autosomes and 1 pair of sex chromosome

(Mellersh et al. 2000; Breen et al. 2001). A number of CHD associated morphological traits that display remarkable dis

crepancies between greyhounds and Labrador retrievers were

measured to examine the genetic and developmental basis of

CHD. In this study, body weight at age 8 months is used as an

example to illustrate our approach. The 148 dogs selected from this canine pedigree (see Fig. 1)

include 2-18 offspring and their parents for each of 22 fami lies. Because most of the selected families were derived from crosses between the unrelated founders, the families selected

for a joint linkage and LD analysis can be roughly assumed to

be random samples. However, a more precise analysis should

implement the network structure of the pedigree. All of the 38 autosomes carrying different numbers of markers were

scanned for the existence of QTL affecting body weights us

ing our joint analytical model. The critical value for claim

ing the chromosome-wide existence of QTL was determined

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 163

Greyhound - Labrador Retrieveroa Pedigree

Djdi Sunshine Knucklehead Jay

ft ?_?

g ???? Pollux Poly

H

r C47

?a.

GX16 26 36 46 56 66 76 86 96

??????? FB17 27 37 47 57 67 77

A15 25 35 45

?b S J

C50 Zeta Lambda

?????? B13 23 33 43 53 73 Z

Andy

A14 Castor

O Z

iU~Utiti Ws?t AX16 26 36 46 56

] BX16 26 36 46 56 66 76

?o?n???? AB18 28 38 48 58 68 78 98 I_

FX18 28 38 48 58 68 78 HB27 37 47 57 67 87

J? Ester Isis Sue Who

FX16 26 36 46

GX15 25 35 45 55 65 75 85 95 5 45 55 65 75 85 95

h????iibb D2F18 28 38 48 58 68 78 88

a????????itt] KB17 27 37 47 57 67 77 87 97 107 117 _I

????? EB17 27 37 47 57 67 ????uufa AB10 20 30 40 50 60 70

???o??? ooM????i??? C2B19 29 39 49 59 69 79 AB11 21 31 41 51 61 71 81 91

f??l DB17 27 37 47

?????????? GB17 27 37 47 57 67 77 87 97 107

B2F18 28 38 48 58 68 78 88

????????uuu IB19 29 39 49 5?

?'o?o???

DB19 29 39 49 59 69 79 89 99 109 119

HB18 28 48 58 68 88

?????? CB10 20 30 40 50 60

Figure 1. Diagram of an Outbred Pedigree in Canine. Squares and circles represent males and females. Filled and open portions of each symbol

represent the proportion of greyhound and Labrador retriever al?eles, possessed by that dog.

through permutation tests (Churchill and Doerge 1994). The

log-likelihood ratios calculated from 1,000 permutation tests were approximately chi-squared-distributed, whose .1% per

centile was used as the critical threshold (significance level a =

.001).

By comparing the a = .001 threshold (32.29) determined from permutation tests with the log-likelihood values calcu

lated at every 2 cM on each marker interval composed of two

adjacent markers, we detected four QTL for body weights lo cated on different chromosomes. As an example, Figure 2 illus

trates a profile of the log-likelihood ratios across chromosome

35 of length 39.94 cM carrying four markers, REN172L08, REN94K23, REN214H22, and RENO 1 GO 1. The profile has a

peak of value 35 at 8.9 cM from marker REN172L08, suggest ing the existence of a significant body weight QTL around that

position. As with traditional interval mapping (Lander and Botstein

1989), our joint model can estimate the positions of the QTL and the effects due to their gene actions and interactions. Like usual LD mapping, our joint model provides the MLEs for the al?ele frequencies of QTL and QTL marker and QTL-QTL link

age disequilibria (Lou et al. 2003a). Table 3 gives the MLEs of haplotype frequencies, genotypic means of the QTL, and the residual variance, along with their respective standard er

rors. From the haplotype frequencies we estimate the al?ele fre

quencies and linkage disequilibria (Lou et al. 2003a), whereas

from the genotypic means we estimate the additive and domi

nant genetic effects (Lynch and Walsh 1998). The MLE of the

frequency of the favorable al?ele that increase body weight for the QTL detected on chromosome 35 was pq

? .505, with a

reasonably small standard error .0725 (see Table 3). This QTL,

displaying a high heterozygosity, was found to have strong link

age disequilibria with the two markers flanking it, whose esti mates are reasonably precise.

As expected, the estimate of the additive genetic effect (a =

4.78 kg) exhibits much greater precision than that of the dom

inant effect (d ? ?1.32 kg) (see Table 3). This QTL triggers an effect body weight in an additive manner. The estimated ad

ditive and dominant genetic effects, along the estimated QTL al?ele frequencies, allow for the estimation of average effect,

a = a + d(pq ?

pq), that quantifies the value of the QTL car

ried by a parent and transmitted to its offspring (Lynch and Walsh 1998). The average effect of this QTL was estimated as

4.80, or 1.17 times the phenotypic standard deviation for body

weight. The additive and dominant genetic variances of this

QTL estimated from a2 = 2pQpqa2 and aj

= Ap20p2d2

were

8.29 and .44, accounting for .49 (narrow-sense heritability) and

.03 of the total phenotypic variance. It should be pointed out that the profile of the log-likelihood

ratios is quite flat at the left of the peak in the case of chro

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

164 Journal of the American Statistical Association, March 2005

0C to

cM

Figure 2. The Profile of the Log-Likelihood Ratios for Testing a QTL Affecting Body Weights on Chromosome 35 of Canine. The horizontal line

denotes the critical threshold for claiming the existence of a QTL ata = .001.

mosome 35 (see Fig. 2). This may result from the left marker

REN172L08 being associated through LD with a second body

weight QTL on a different chromosome. In other words, the

possible existence of LD between this marker and any QTL lo

Table 3. MLEs and Their Standard Errors (SEs) of Population and Quantitative Genetic

Parameters for the Body Weight QTL Bracketed by a Pair of Markers on Canine

Chromosome 35 in a Multifounder and Multigeneration Dog Pedigree

Parameters MLE SE

Population genetic parameters Haplotype frequencies

Al?ele frequencies and LD

Quantitative genetic parameters Genotypic means (kg)

Genetic effects (kg)

Average effect (kg) Additive genetic variance

Dominance genetic variance

Residual variance

Phenotypic variance

Narrow-sense heritability Dominance ratio

Paqb PAQb

PAqB PAqb PaQB PaQb

PaqB Paqb

PA Pq Pb

Dab Daq Dqb

daqb

Pqq

P a d

a

a2

h2

d/a

.3731 8.08e-31

.2359

.0422

.0367

.0953

.2168 1.27e-07

.6512

.5051

.8626

.0473

.0442 -.0258

.0442

33.06

23.50

26.96

28.28 3.05

-6.51

3.12

4.86

10.59

8.02

23.47

.21 2.13

.0719 0

.0688

.0232

.0300

.0365

.0525 8.98e-05

.0521

.0725

.0414

.0237

.0379

.0232

.0480

.61

.60

.43

.41

.34

.70

.83

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 165

cated outside chromosome 35 affects the precision of the es

timate of the QTL position. A similar problem occurs in pure linkage analysis, which led Zeng (1994) to propose using mul

tiple regression analysis to remove the influence of all QTLs outside the marker interval considered.

5. MONTE CARLO SIMULATION

To demonstrate the statistical properties of our approach, we

performed a simulation study that includes two different steps: (1) map construction, assuming that marker order is unknown,

and (2) QTL mapping based on a known linkage map. In the second step, the estimates of QTL effect, QTL position, al?ele

frequency, and QTL-marker linkage disequilibrium can be used to assess the estimation precision of these parameters in the real

example of canine mapping described earlier. We mimic the dog pedigree structure described earlier to ran

domly sample 20 independent families and 9 offspring from each family. We also consider a linkage group whose mark

ers are not evenly distributed. This type of linkage group com

prises five markers, with the recombination fractions .200, .020,

.150, and .015, between a pair of two adjacent markers from

the left to right (Table 4). Except for a triallelic third marker, all other markers are biallelic. The hypothesized values for the al ?ele frequencies of different markers and their pairwise linkage disequilibria are given in Table 2, assuming that no high-order linkage disequilibria exist. These five markers segregating in a

natural population are transmitted from parents to offspring.

By assuming that the gene order of these simulated mark

ers is unknown, we calculate the log-likelihood ratios for all

possible markers and choose one having the largest ratio as a

most likely order. In our example, the joint model correctly de

termines the gene order in 60% of 200 simulation replicates. Table 4 lists the MLEs of the population genetic parameters and recombination fractions between different marker pairs av

eraged over all of the 200 replicates. As seen from the square roots of mean squared errors, the joint model proposed can well

estimate these parameters. However, the estimation precision

of the recombination fractions will be reduced when there is a

weak linkage (see Table 2), or when linkage disequilibria do not

exist between different markers (results not shown).

In the second step, we simulated a QTL in the middle of two sparse markers, 1 and 2, and of two tightly linked markers, 4 and 5, on a known marker order given in Table 4. This QTL is segregating in a simulated population with al?ele frequencies .6 versus .4, has the additive effect of 1.22 and dominant effect of .61, and contributes 40% to the total phenotypic variance

(equivalent to the residual variance of 1.0). As illustrated by the

profiles of the log-likelihood ratios, this QTL can be identified, but with better mapping precision when two flanking markers are tightly (Fig. 3) than loosely linked (Fig. 4). Also, the map

ping precision of QTL depends on the magnitude of LD. The

QTL can be better detected when the markers and QTL display LD, in contrast to when there is no LD between these loci. This

advantage is more pronounced for two loosely linked markers

(see Fig. 4) than for tightly linked markers (see Fig. 3). Similar trends were detected for the precision of the MLEs of the addi tive and dominant effects of the QTL and the residual variance

(results not shown). These results suggests that a high-density map is necessary for the precise mapping of QTL when LD is

low, whereas a moderate-density map is adequate when a high

degree of LD exists.

Table 4. MLEs of Population Genetic Parameters and the Linkage Between Five Simulated Markers Segregating Among 20 Full-Sib Families Each of 9 Individuals

Markers on

linkage group rl(l+1)

Al?ele frequencies

Linkage disequilibrium

Ji(i+1) D> '1(1+2) Dl(l+3) Jl(l+4)

Given values

I Mi

I M2

I M3

MA I

M5 I

Estimated values

I MA

I M2

I M3

I MA

I M5

.200

.020

.150

.015

2012(.o615)

0196(.0179)

1530(.o281)

0149(.0136)

.6

.6

[.3, .3]

.6

.6

5996(.o573)

5969(.o549)

i-3028(.0469)> -2975(.oo53)]

[ 6079(.0542)]

.6008(.0526)

.04

.04

[.02, .02]

[.02, .02]

0388( .0280)

.0382(0272)

[ 0196(.0237),-0189(.0219)]

[ 0194(.0227),.0202(.0225)]

.04

[.02, .02]

.04

0396(.0286)

;.0182(.0222)> -01 81 (.0238)]

.0373(.0289)

[.02, .02]

.04 .04

[ 0209(.0264), 0166(.o255)] .0383

.0377j (.0299)

(.0261)

NOTE: The numbers in the round parentheses are the square roots of the MSEs of the MLEs. f/(/+i) is the recombination fraction between two adjacent markers; whereas D/(/+1), D/(/_|_2), D i (/ +3), and D ?(/ +4) are the linkage disequilibria between two adjacent markers and two markers separated by one, two, and three markers.

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

166 Journal of the American Statistical Association, March 2005

Distance (cM)

Figure 3. The Profiles of the Log-Likelihood Ratios for a QTL Located Between Two Tightly Linked Markers (M4 and M5). A peak of the profile is detected, although it is higher in the case where there are LD (?) than in the case where there is no LD between the markers and QTL (-

- -).

Note that there is the same critical threshold between the two cases because they share identical simulation conditions.

6. DISCUSSION

We have developed a general statistical framework for inte

grating different mapping strategies, interval mapping and LD

mapping, for the molecular dissection of complex traits. By tak

ing advantage of each strategy, our integrative approach makes

it possible to perform high-resolution mapping of complex traits based on the principle of LD. The prerequisite of apply ing LD to QTL mapping is that its occurrence is due to a tight linkage rather than to other evolutionary factors (Lynch and

Walsh 1998). A pure LD analysis would result in false positives, as documented in Alzheimer's disease mapping (Emahazion et al. 2001), when evolutionary factors (e.g., selection, muta

tion, population structure, population admixture) are involved

in the generation of disequilibrium. In addition, our mapping

approach, which jointly models the linkage and LD between markers and QTL, can handle the information from multiple segregating families. It will be reduced to the traditional interval

mapping approach (Lander and Botstein 1989) when no infor mation is included about the genetic structure of the population from which the families are randomly sampled.

There are four significant contributions to current QTL map ping studies made by the approach proposed in this article.

First, although joint linkage and LD analysis has received much attention in fine-scale mapping of QTL (Wu and Zeng 2001 ; Wu et al. 2002; Fan and Xiong 2002; Farnir et al. 2002; Meuwissen et al. 2002; Lund et al. 2003; Perez-Enciso 2003), none ofthose studies presented genetic models and statistical methodologies that can accommodate different mapping populations, genetic designs, pedigree structures, and marker systems. Here we have

Distance (cM)

Figure 4. The Profiles of the Log-Likelihood Ratios for a QTL Located Between Two Sparsely Linked Markers (M1 and M2)- When there are

LD between the markers and QTL, the profile has a flat peak on the interval of markers flanking the QTL (?). When no LD exists between different

loci, such a peak is not obvious (- - -).

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 167

generalized all of these approaches within a statistical frame

work into which QTL mappers can easily incorporate variables of interest to fit their problems. This generalized framework can be reduced to various special cases. For example, by set

ting the maternal diplotype (VTt>) equal to the paternal diplo

type (Vp ,), the model is reduced to autogamous plant systems, such as Arabidopsis and rice.

Second, similar to our recent model of pure LD (Lou et al.

2003a), the joint model is sophisticated and robust to an arbi

trary number of markers and al?eles at each locus, and linkage

disequilibria of different orders and epistatic interactions be tween different QTL. Clearly, this will make our joint model

broadly useful in different situations and for different purposes. Third, with simulated data, we have demonstrated the power of high-resolution mapping for complex traits using LD. The use of LD in QTL mapping can increase the high-resolution mapping power of QTL, especially when marker density is not

high (see Figs. 3 and 4), as compared with pure linkage map ping using recombination events in genotyped generations. This result will have immediate implications for the determination of QTL mapping strategies in those organisms in which there are substantial linkage disequilibria throughout the genomes. For example, according to our previous observations (Lou et al.

2003b), linkage disequilibria frequently occur within a region of 20-30 cM in canines. Thus a moderately dense linkage map for canine may be adequate for high-resolution mapping of QTL. Finally, our QTL model proposed here does not rely on a linkage map of known marker order. We have also imple

mented the algorithm for map construction, and thus can make

use of markers whose map orders are not known for QTL map

ping. We performed different simulation schemes by changing the

values of model parameters in their space to study the statis

tical properties of our model on one hand, and to pinpoint the

optimal genetic experimental design for the most efficient uti lization of the model on the other hand. The results from these simulations suggest that our model displays adequate power to detect QTL if it exists and provides reasonable estimates of ge netic parameters for the detected QTL. The simulation results

guide us in developing a high-density map for the precise map ping of QTL when LD is low and a moderate-density map when a high degree of LD exists. Such a preference between high and moderate-density maps is consistent with empirical results

from LD analysis of molecular markers (Rafalski 2002). Our approach is unique in that it embeds LD analysis in a

traditional interval mapping process. It relies on a two-stage hi

erarchical sampling strategy at the population level and at the

offspring-within-family level. Theoretically, information about the parents can enhance our analytical power and precision, but

may be missing in practice. The model requires further inves

tigation about the effects of missing data at the parental level on the parameter estimation. The likelihood functions for two

missing-data situations are given in Appendix D. Also, in the current model, we used two flanking markers at a time to lo

cate a QTL bracketed by the markers based on the marker-QTL linkage and disequilibrium. This strategy does not consider the

impact on the estimate of the log-likelihood ratio test statistics of other possible QTL located on the same chromosome. This

question may have contributed to an overestimation of genetic

effects of the detected QTL in the dog example, although other

factors, such as a small size of dog samples and multiple linked

QTL, would also cause this overestimation. A hypothesized QTL in a tested marker interval can be separated from those outside the interval for a pure interval mapping by including the rest of the markers (except for the flanking markers under

consideration) as cofactors (Zeng 1994). For our joint interval and LD mapping analysis, QTL on different chromosomes hav

ing an association with one or two of the flanking markers may

also affect the calculations of log-likelihood ratios (see Fig. 2).

Similarly, by combining our joint mapping analysis and mul

tiple regression analysis, the influence of the QTL outside the interval considered can be removed. An extensive investigation

of this combination analysis is needed.

APPENDIX A: ESTIMATES FOR THE COMPLETE-DATA LIKELIHOOD FUNCTION

This appendix gives the MLEs of the unknown parameters under

the complete-data likelihood function (2). We define the indicators

II if the maternal and paternal parents of family / have

diplotypes V^t and Vrr<

0 otherwise

j

and

?J ^m'tr'){Tx'??')

1 if offspring j of family / derived from

parental diplotypes T>^/ and

V^> pro duces diplotype PrT/ and reduced recombi

nation 7Zr "?? 0 otherwise

Then we rewrite the likelihood function (2) as

L(ym, yp, y?,T>m, VP, T>?\ K\to)

N nh nh (

*=1 ?<?'?<?'

11 E ?^[htt'?niTT'??'

Ni nh 3L

nEi j=\ r<r' ?,?<

The natural logarithm of the likelihood function can be expressed

lnL(ym f, y?, Vm, V?, Vo, 1Z\il)

= constant - N nh

EE InP,

N nh

HT.

tc^r

ln/(3fl^)EW

+ in/(>fi^)E4r^

N N? nh nj-, n^ 3L [

+ LLLLL ?^htt'??'KTT'??')

x [lnPr?^.Tl^l^.^O+ln/?^l^)]

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

168 Journal of the American Statistical Association, March 2005

77,7 ? constant + V] ?? ^nP?

+ ?n Inri +?n ln(l -

r])

+ nn lnr2 + nr2 ln(l -

o) H

+ WrL_i lnrL_! -f ?^lnQ -rL_i)

ZU

77/7

77/7

+ ln/(yf|^)EW

/V Af/ 77/7 77/7 77/7

+EEEEE ln/(^_0?4m,)(rT,i?, ?,?

where

N 77/7 r ? 77/7

Z. (4m'+ 4m} + Z (4r^'+ 4r^/}

Af /V/ 77/7 77/7 nh 3L ] ,

n? "EEEEE E 4r..')(rr'iO

[y^/(i+y?/) + y^(i + y^/)]

/V N; 77/7 77/7 77/7 3L '

z^ZZ Z z_, ?^\zm'^')(^'

[^/(i-M + y^a-^/)]

,?r?0

and }/?/ (/ = 1,..., L ? 1) is element / of yg. The MLEs can be derived

by differentiating the log-likelihood with respect to fi, then setting

each derivative equal to 0 and solving the set of simultaneous equa

tions. The MLEs are

nH

P^2-2__V f

= l,2,...,w?;

n = /=1,2,...,L-1;

M, = Ef=iE??<r[ifel%0>fE^4rc^]

A?

Ef=iEr<r[ifel%0Er<^4rc^l

Ef=1E^r[ifel^r)^E^^4^r]. Ef=iEncA<c^fel^?OE?A<r4r?^]

'

/V Af; 77 h nh nh 3L_1

/=1 ?=1 ?<?'?<?'*<*'_;<_;'

(iV Ni ?/, nh 77/, 3L_1

7=1 /=l?<C'?<C'T<r's-,?'

Ifel^)] -1

om2

a

and

p2.

N

EiiE^iE^ zya/j:s[i(gs\v^)(yf - A?)2]}

N M ?/? ?/? "/? 3L '

-?2=eeei:ee

N

??'??'tt'??' i=lj=l %<!;'?<r't<t' ?,?'

s \/=l

APPENDIX B: ESTIMATES FOR THE INCOMPLETE-DATA LIKELIHOOD FUNCTION

In this Appendix we describe derivation of the MLEs of the un

known parameter vector ? for the incomplete data, followed by the

implementation of the EM algorithm to estimate the MLEs. We rewrite

the likelihood of the incomplete data (y and M) described in (4) as

L(?l\ym,yP,y?,Mm,Ml?,M?)

N nh nh (

Il E E \-WP^p^nyT\^)f(yf\^) ?=1 ?<?'?<?

Ni nh 3L-]

xnEE[> j=\T<T> ?,?>

*f(yijW?TT,)]

'mO(Tr'?i') "^rt" 'v??' '^fr ' ~?f .?Pr(v?TT?nccl\vTl:?vP)

where

vm,.,

and

JJ v{^>t;r>){TT>??i)

'

(B.l)

1 if diplotypes X> , and tP , are compatible with

the marker genotypes of the maternal and paternal

parent-forming family i

0 otherwise

1 ifgm(V^)=M?psin?gm(Vtr)=Mp

0 otherwise

1 if diplotype Vo , and reduced recombi

nation 1Z?r' produced by maternal and

paternal diplotypes Pfw and Vp , are com

patible with the marker genotype of off

spring j from family / 0 otherwise,

where the remaining notation is as described in Section 3.2.

As mentioned earlier, the missing data in our model are zygote

genotypes and diplotypes (containing putative QTL) of the parents and offspring, and reduced recombination configurations of the off

spring. The observed data are the marker genotypes and phenotypes of

the parents and offspring. The conditional distribution of the missing data given the observed data connects the complete-data likelihood,

Lc(a\ym,yP,y?,Mm,MP,M?,T>m,VP,V?,n), and the observed

data likelihood, L(ti\ym,yP,y?,Mm,MP,M?). The conditional dis

tribution of parental diplotypes given the observed data is expressed

ki (T>m, Vp\tt, ym, yP, y?, Mm, MP, M?)

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 169

L-l Ni ? nh 3^~l

^ EEKw)(rr'?') 7=1 It<t:' ?,?'

(nh nh ?

E E w^^.^.p^^.ri^v^^i^r)

M r 77/7 3?-]

Xrt EEKwKrr^) 7=1 It<t' ?.?'

xPr^T^P^

(B.2)

and the conditional distribution of the diplotype and reduced recombi

nation of the offspring given the observed data and parental diplotypes is expressed as

k2(V?, n\to, yn\ yp, y?, Mm, Mp, M?, T>m,Vp, T>?)

nh 3L-] / nh 3"-1

Denote the conditional distributions of the missing data at different

hierarchies [(B.2) and (B.3)] by x*,, and 7r(it/^^)(TT/cc/)-

Maxi

mization of the observed data likelihood is achieved by differentiating the log-likelihood of (B.l), which leads to

? lnL(ft|ym, yp, y?, Mm, Mp, M?) dit

=EE^ E-W + E-W M nh ?dW(ym\Vftl) nh

+ EeM|^E*W

d\nf(yp\VP?,) ?O. dtt Q

nh \

M Ni nh nh nh 3L l

+ ZZZ E E Z ^m'Krr'?T?r')

31nPr(^r/,^^|^r,^) ain/(yg-|PgT,) 9?_!/? 8?n

(B.4)

and solving the corresponding log-likelihood equations. We implement the EM algorithm to estimate population genetic

parameters Up (including marker-QTL haplotype frequencies, al?ele

frequencies, and linkage disequilibria), quantitative genetic parame ters

SIq (including QTL effects and residual variance) and QTL posi tions SIr. In the E step, using (B.2) and (B.3), calculate the conditional

probability 7T?jtt/ of the diplotypes of the maternal

(T>^,) and pater

nal parents (Dp ,) generating family /, and the conditional probability

jr'iw /w , f) of offspring j with diplotype D? , and reduced re

combination R??/

from family i derived from maternal parental diplo

type Ww and paternal parental diplotype Dp ,. In the M step, us

ing (B.4), obtain the MLEs of the unknown parameters by substituting

4?'ff and4rc?')(rT'?i') with4V and4(??r)(rr'fio-These two steps are repeated until convergence.

APPENDIX C: DERIVATIONS OF THE ASYMPTOTIC VARIANCE-COVARIANCE MATRIX

As shown by Louis (1982), the observed information (^ncom) can

be obtained by subtracting the missing information (Xmjss) from the

complete information (Xcom). Using Louis' notation, we denote the

complete data by x = (x\,x2, ...,xn)T, the observed data by y =

(y\,y2, . ,yn)T, and the parameter vector by 0. The likelihood of

the incomplete data is expressed as

fY(y\0)= [ fx(*\0)dri(x), (C.l)

where R = (x : y(x)

= y) and ?i(x) is a dominating measure. For (C. 1 ),

we have the score,

31n/y(y|?) f &(x|0) 3\nfx(x\0) - dfi(x), (C.2)

r /x(x|i " Jr /r/x(x|?) 3? JR/R/x(x|?)dM(x) 90

and the Hessian matrix,

32lnfr(y|tf) sede7

fx(x\0) d2infx(x\0)

-L RfRfx(x\0)dii(x) 30 801 d?(x)

. JAyx\0) d\nfx(x\0) d\nfx(x\0) + l '-x\0) de ^W^?r-?W r fxWQ) JR Jr/x(x|i

/. fx(x\0) d\nfx(x\0) ̂

dfi(x) RfRfx(x\0)dii(x) dO

fx(x\0) d\nfx(x\0)

1* y d?i(x). (C.3) lfx(x\0)dfi(x) dB1

When x\, x2,..., xn are independent but not necessarily identically distributed and y/(x)

= y/(x?), we have R =

R\ x R2 x x Rn. To

make the score (C.2) and the Hessian matrix (C.3) more tractable, we

need to use Fubini's theorem. By incorporating Fubini's theorem, the

score and the Hessian matrix can now be expressed as

91n/r(y|0) ^ f fx(Xi\0) 31nfr(*f|?) . , .

'hJR>y de ?^JR?JrMx?WMx?)

de (C.4)

and

92ln/F(y|fl) d6d0T

fX(xi\0) d2lnfx(xj\0)

f^jRifR/xiximd?ixi) d0 30T d?(Xi)

A r fxU.m d\nfx(xi\0)d\nfx(xi\0) d

n r

JRi ?R?

fx(xiW) d\nfx(xj\0) dix(xj)

Mxi\0)d?(xi) 90

fx(xi\0) d\nfx{xi\0) Mxi\9)dti(x?) 30'

d?(xj) (C.5)

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

170 Journal of the American Statistical Association, March 2005

Based on (C.4) and (C.5), the complete XCom and missing informa

tion Xmjss can be calculated using Louis' formulas, from which the

observed information TmQOm is estimated.

In our situation, the observed information matrix can be readily es

timated for haplotype frequencies, genotypic means, the residual vari

ance, and the QTL position (measured by the recombination fractions

between the QTL and flanking markers). A further step is needed to de

rive the variance-covariance matrices for al?ele frequencies and link

age disequilibria from the haplotype frequencies (see Lou et al. 2003a) and for the additive and dominant genetic effects from the genotypic means. It is not difficult to obtain the variance-covariance matrices for

al?ele frequencies and genetic effects, because they are linear functions

of haplotype frequencies and genotypic means.

Because the analytical expression for estimating the variances and

covariances of linkage disequilibria does not exist, we derive an ap

proximate expression for these variances and covariances using the

delta method derived from a Taylor' series expansion. If a parameter vector 0 is a function of the characteristic parameter 6 [i.e., 0 =f(0)], then the approximate variance-covariance of 0 =f(0) is

,?, 3/(0) ?.MB) var(0) = on var(0)

d0 do1 (C.6)

In our situation, we have

' Dab \ DAQ

pBQ \DAqbJ

/] -PA -PB 1 -PA -PQ ! -PB-PQ -PB

I -PA -PB -PB -PA

0 V -PA

1 -PA -PQ -PQ -PQ -PA -PA

0

-PB

-PQ 0

1-PB-PQ -PB -PQ

1 - DAB -

bAQ -

DBQ -

PA -

PB -

PQ + IpAPB + ?-PaPQ + 2-PBPQ \ IPAPB + ^PBPQ

- DAB -D?Q- PB

IPAPQ + 2PBPQ -DAQ-DBQ-PQ 2pBPQ-DBQ ^

2pAPB + IPAPQ -

DAB ~DAQ-PA IpAPB

- Dab

^papq-Daq )

(PAQB \ PAQb PAqB PAqb PaQB PaQb

\ PaqB '

/I -PA -PB 1 -PA -PQ ! -PB-PQ -PB 1 -PA- PO -PB

PQ 0

1 -PA -

-PB -PA

0 -PA

PA -

PQ PB -PQ

-PQ -PA -PA

0

1 -PB-PQ -PB -PQ

1 - DAB -DaQ-Dbq-Pa- PB -

PQ + 2PAPB + 2pAPQ + 2PBPQ \ 2-PAPB + 2PBPQ

- Dab -Dbq-pb

IPAPQ + 2-PBPQ -DaQ-DbQ-PQ

2PBPQ-b?Q^ lPA?>B + 2pAPQ

- dab -Daq-Pa

IPAPB -

Dab

2-PAPQ -DAQ )

APPENDIX D: LIKELIHOOD FUNCTIONS FOR SOME SPECIAL CASES

We have derived a general likelihood function for joint interval and

LD mapping. However, in practice there are many special cases. If

parent phenotypes are not available, then the likelihood can be reduced

to a marginal form,

L(f?|y?, M?, X>m, T>p, Vo, 71)

N Ni

aPr[X^2>P|{P?r}]nn 7=17=1

The marginal likelihoods for two other special cases can also be for

mulated as

L(tt\ym,yp,y?,V0,K)

a^Pr[X>m,D^|{P^/}] N (

*Y\\f(y?wr)f(y?Wp) 7=11

Ni 1

x Y[[Pv(V^nij\Vip,Vp)f(y^J\Vp] (D.l) 7=1 i

when parental diplotypes are not available, and as

L(ti\y?,V?,11) (xJ2MT>m,Vp\{P^,}] N Ni

^n^'^i^'^^1^1 (d-2) '=17=1

when both diplotypes and phenotypes of the parents are not available.

In both (D.l) and (D.2), ? denotes the summation over [Vm, T>p] e

{[T>,T>]: compatible with Vo}.

[Received June 2003. Revised June 2004.]

REFERENCES Allison, D. B. (1997), "Transmission-Disequilibrium Tests for Quantitative

Traits," American Journal of Human Genetics, 60, 676-690.

Bliss, S., Todhunter, R. J., Quaas, R., Casella, G., Wu, R. L., Lust, G., Williams, A. J., Hamilton, S., Dykes, N. L., Yeager, A., Gilbert, R. O., Burton-Wurster, N. I., and Acland, G. M. (2002), "Quantitative Genetics of Traits Associated With Hip Dysplasia in a Canine Pedigree Constructed by Mating Dysplastic Labrador Retrievers With Unaffected Greyhounds," Amer ican Journal of Veterinary Research, 63, 1029-1035.

Breen, M., Jouquand, S., Renier, C, Mellersh, C. S. et al. (2001),

"Chromosome-Specific Single-Locus FISH Probes Allow Anchorage of an

1800-Marker Integrated Radiation-Hybrid/Linkage Map of the Domestic Dog Genome to All Chromosomes," Genome Research, 11, 1784-1795.

Cepellini, R., Curtoni, E. S., Mattiuz, P. L., Miggiano, V., Scudeller, G., and

Serra, A. (1967), Histocompatibility Testing, Copenhagen: Munksgaard. Churchhill, G. A., and Doerge, R. W. (1994), "Empirical Threshold Values for

Quantitative Trait Mapping," Genetics, 138, 963-971.

Emahazion, T., Feuk, L., Jobs, M., Sawyer, S. L., Fredman, D. et al. (2001), "SNP Associations Studies in Alzheimer's Disease Highlight Problems for

Complex Disease Analysis," Trends in Genetics, 17, 407-413.

Fan, R. Z., and Xiong, M. M. (2002), "Combined High-Resolution Linkage and Association Mapping of Quantitative Trait Loci," European Journal of Human Genetics, 11, 125-137.

Farnir, F., Grisart, B., Coppieters, W., Riquet, J., Berzi, P., Cambisano, N., Karim, L., Mni, M., Moisio, S., Simon, P., Wagenaar, D., Vilkki, J., and

Georges, M. S. (2002), "Simultaneous Mining of Linkage and Linkage Dis

equilibrium to Fine Map Quantitative Trait Loci in Outbred Half-Sib Pedi

grees: Revisiting the Location of a Quantitative Trait Locus With Major Effect on Milk Production on Bovine Chromosome 14," Genetics, 161, 275-287.

H?stbacka, J., de la Chapelle, A., Kaitila, I., Sistonen, P., Weaver, A., and

Lander, E. (1992), "Linkage Disequilibrium Mapping in Isolated Founder

Populations: Diastropic Dysplasia in Finland," Nature Genetics, 2, 204-221.

H?stbacka, J., de la Chapelle, A., Mahtani, M., Clines, G., Reeve-Daly, M. P.,

Daly, M. et al. (1994), "The Diastropic Dysplasia Gene Encodes a Novel Sul fate Transporter: Positional Cloning by Fine-Structure Linkage Disequilib rium Mapping," Cell, 78, 1073-1087.

Jansen, R. C. (2000), "Quantitative Trait Loci in Inbred Lines," in Handbook

of Statistical Genetics, eds. D. J. Balding, M. Bishop, and C. Cannings, New York: Wiley, pp. 567-597.

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions

Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 171

Lander, E. S., and Botstein, D. (1989), "Mapping Mendelian Factors Underly ing Quantitative Traits Using RFLP Linkage Maps," Genetics, 121, 185-199.

Lou, X.-Y, Casella, G., Littell, R. C, Yang, M. C. K., Johnson, J. A., and

Wu, R. L. (2003a), "A Haplotype-Based Algorithm for Multilocus Linkage

Disequilibrium Mapping of Quantitative Trait Loci With Epistasis," Genet

ics, 163,1533-1548.

Lou, X.-Y, Todhunter, R. L, Lin, M., Lu, Q., Liu, T., Wang, Z. H., Bliss, S., Casella, G., Acland, A. C., Lust, G., and Wu, R. L. (2003b), "The Extent and

Distribution of Linkage Disequilibrium in Canine," Mammalian Genome, 14, 555-564.

Louis, T. A. (1982), "Finding the Observed Information Matrix When Using the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 44, 226-233.

Lund, M. S., Sorensen, R, Guldbrandtsen, B., and Sorensen, D. A. (2003), "Multitrait Fine Mapping of Quantitative Trait Loci Using Combined Linkage Disequilibria and Linkage Analysis," Genetics, 163, 405^-10.

Lynch, M., and Walsh, B. (1998), Genetics and Analysis of Quantitative Traits, Sunderland, MA: Sinauer.

Mackay, T. F. C. (2001), "The Genetic Architecture of Quantitative Traits," Annual Reviews of Genetics, 35, 303-339.

Mellersh, C. S., Hitte, C, Richman, M., Vignaux, F., Pri?t, C, Jouquand, S., Werner, R, Andre, C, DeRose, S., Patterson, D. F., Ostrander, E. A., and

Galibert, F. (2000), "An Integrated Linkage-Radiation Hybrid Map of the Ca nine Genome," Mammalian Genome, 11, 120-130.

Meng, X.-L., and Rubin, D. B. (1991), "Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm," Journal of the Ameri can Statistical Association, 86, 899-909.

Meuwissen, T. H. E., Karlsen, A., Lien, S., Olsaker, I., and Goddard, M. E.

(2002), "Fine Mapping of a Quantitative Trait Locus for Twinning Rate Using Combined Linkage and Linkage Disequilibrium Mapping," Genetics, 161, 373-379.

Morton, N. E. (1983), Methods in Genetic Epidemiology, Basel: Karger. P?rez-Enciso, M. (2003), "Fine Mapping of Complex Trait Genes Combin

ing Pedigree and Linkage Disequilibrium Information: A Bayesian Unified

Framework," Genetics, 163, 1497-1510.

Rafaski, A. (2002), "Applications of Single Nucleotide Polymorphisms in Crop Genetics," Current Opinions in Plant Biology, 5, 94-100.

Todhunter, R. J., Bliss, S. P., Casella, G., Wu, R., Lust, G., Burton

Wurster, N. I., Williams, A. J., Gilbert, R. O., and Acland, G. M. (2003), "Ge netic Structure of Susceptibility Traits for Hip Dysplasia and Microsatellite Informativeness of an Outcrossed Canine Pedigree," Journal of Heredity, 94, 39-48.

Wu, R. L., and Zeng, Z.-B. (2001), "Joint Linkage and Linkage Disequilibrium Mapping in Natural Populations," Genetics, 157, 899-909.

Wu, R. L., Ma, C.-X., and Casella, G. (2002), "Joint Linkage and Linkage Dis

equilibrium Mapping of Quantitative Trait Loci in Natural Populations," Ge

netics, 160,779-792.

Zeng, Z.-B. (1994), "Precision Mapping of Quantitative Trait Loci," Genetics, 136,1457-1468.

This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions