A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping: Toward...
-
Upload
mark-c-k-yang-and-rongling-wu -
Category
Documents
-
view
212 -
download
0
Transcript of A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping: Toward...
A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping:Toward High-Resolution Mapping of Quantitative TraitsAuthor(s): Xiang-Yang Lou, George Casella, Rory J. Todhunter, Mark C. K. Yang and RonglingWuSource: Journal of the American Statistical Association, Vol. 100, No. 469 (Mar., 2005), pp.158-171Published by: American Statistical AssociationStable URL: http://www.jstor.org/stable/27590526 .
Accessed: 16/06/2014 06:22
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .http://www.jstor.org/page/info/about/policies/terms.jsp
.JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range ofcontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new formsof scholarship. For more information about JSTOR, please contact [email protected].
.
American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journalof the American Statistical Association.
http://www.jstor.org
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
A General Statistical Framework for Unifying Interval and Linkage Disequilibrium Mapping: Toward
High-Resolution Mapping of Quantitative Traits Xiang-Yang Lou, George Casella, Rory J. Todhunter, Mark C. K. Yang, and Rongling Wu
The nonrandom association between different genes, termed linkage disequilibrium (LD), provides a powerful tool for high-resolution
mapping of quantitative trait loci (QTL) underlying complex traits. This LD-based mapping approach can be made more efficient when it
is coupled with interval mapping characterizing the genetic distance between markers and QTL. This article describes a general statistical
framework for simultaneously estimating the linkage and LD that are related in a two-stage hierarchical sampling scheme. This framework
is constructed within a maximum likelihood context and can be expanded to fine-scale mapping of complex traits for different population structures and reproductive behaviors. We provide a closed-form solution for joint estimation of quantitative genetic parameters describing
QTL effects, QTL position and residual variances, and population genetic parameters describing al?ele frequencies and QTL-marker LD.
We perform simulation studies to investigate the statistical properties of our joint analysis model for interval and LD mapping. An example
using body weights of dogs from a multifamily outcrossed pedigree illustrates the use of the model.
KEY WORDS: EM algorithm; Interval mapping; Linkage disequilibrium; Maximum liklihood estimate; Quantitative trait loci.
1. INTRODUCTION
The motivation of this article arises from problems in the
high-resolution genetic mapping of quantitative traits with DNA-based polymorphic markers. The poly genie inheritance and environmental sensitivity of quantitative traits complicate the analysis and modeling of their genetic architecture, as spec
ified by the number, distribution, actions, and interactions of the
underlying genes, that is, quantitative trait loci (QTL) (Mackay 2001). Molecular genetic markers, derived from polymorphic sites in the genome, have proven to be powerful tools for char
acterizing and mapping individual QTL for a quantitative trait. The basic principle of such QTL mapping proceeds by count
ing the expected recombination events occurring between the
markers and putative QTL during meiosis. One of the genetic strategies for QTL mapping is to make use of the marker-QTL recombinants created in marker-genotyped generations through
controlled crosses. This strategy is called linkage analysis. In
terval mapping of QTL based on the multilocus linkage analysis of two flanking markers and a QTL bracketed by the markers al lows for a systematic scan for the genome-wide distribution of
QTL (Lander and Botstein 1989). Substantial extensions of in terval mapping into more general situations have been reported (reviewed in Jansen 2000).
Interval mapping utilizes the information about gene segre
gation and transmission in a well-constructed pedigree, and is
Xiang-Yang Lou is Postdoctoral Research Associate, under the guidance of Dr. Rongling Wu, and George Casella is Distinguished Professor and
Chair, Department of Statistics, University of Florida, Gainesville, FL 32611.
Rory J. Todhunter is Associate Professor, The James A. Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY 14853. Mark Yang is Professor, Department of Statistics, University of
Florida, Gainesville, FL 32611. Rongling Wu is Associate Professor, De
partment of Statistics, University of Florida, Gainesville, FL 32611 (E-mail:
[email protected]) and Adjunct Distinguished Professor, Zhejiang Forestry Uni
versity, Lin'An, Zhejiang 311300, China. Rongling Wu conceived the idea of this study, wrote this manuscript, and was responsible for directing the research. The authors thank the associate editor and an anonymous referee for their constructive comments on this manuscript. This work is supported in part by a National Science Foundation grant (9971586) to G. Casella; an
Outstanding Young Investigators Award of the National Science Foundation of China (30128017), a University of Florida Research Opportunity Fund
(02050259); and a University of South Florida Biodefense grant (7222061-12) to R. Wu; and the Morris Animal Foundation contract grant (722206212) to
R. J. Todhunter. The publication of this manuscript is approved as journal series No. R-09584 by the Florida Agricultural Experiment Station.
limited for the fine-scale mapping of QTL unless the size of the pedigree is large. An alternative strategy of QTL mapping is based on linkage disequilibrium (LD) analysis, which uses
recombinant events in a population generated at a historical
time before genotyping starts. Because the extent of disequi
librium between two genes decays exponentially toward at a rate depending on the recombination fraction (Lynch and Walsh
1998), the detection of significant LD after a number of gener ations may imply a tight linkage between the marker and QTL. For this reason, LD-based mapping provides a powerful way of fine-scale mapping of QTL. Although the use of LD mapping has led to successful high-resolution mapping of several human diseases (e.g., H?stbacka et al. 1992, 1994), the prerequisite of this strategy, that LD is due to tight linkage between the QTL and marker, may not hold. In populations where evolutionary
forces such as selection, drift, population structure, and pop
ulation admixture act, the disequilibrium can also be detected
(Lynch and Walsh 1998). Such an LD is regarded as spurious because it provides no information about the marker-QTL link
age.
Many strategies have been developed to combine the advan
tages of linkage and LD analysis traditionally considered sepa rate strategies. For example, in human genetics, a family-based LD strategy, called transmission/disequilibrium testing (TDT), was devised to simultaneously determine the linkage and LD between a marker and a disease gene (Allison 1997). Several studies have considered the joint modeling of linkage and LD
mapping for quantitative traits as well as its advantages com
pared with pure linkage mapping or pure LD mapping (Wu and
Zeng 2001; Wu, Ma, and Casella 2002; Fan and Xiong 2002; Farnir et al. 2002; Meuwissen, Karlsen, Lien, Olsaker, and
Goddard 2002; Lund, Sorensen, Guldbrandtsen, and Sorensen
2003; Perez-Enciso 2003). However, these studies provided a
solution only for some very specific problems, because they were not formulated under a general statistical framework, by
which analysis and modeling of QTL mapping can be read
ily extended to different populations, reproductive behaviors, marker systems, and pedigree structures. Moreover, in many of
? 2005 American Statistical Association Journal of the American Statistical Association
March 2005, Vol. 100, No. 469, Theory and Methods DOI 10.1198/016214504000001295
158
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 159
the cited studies, the statistical properties of the joint linkage LD analysis have not been investigated through either analytical or numerical approaches.
In this article we develop a general framework for integrating interval and LD mapping to enforce the efficiency and effective ness of high-resolution mapping of quantitative traits. Our inte
grative model makes use of genetic information from the family pedigree and population data by adopting a two-stage hierarchi cal sampling scheme as reported by Wu and Zeng (2001) and
Wu et al. (2002). The upper hierarchy of this scheme describes the gene distribution in a natural population from which a sam
ple is randomly drawn (LD analysis), and the lower hierarchy describes gene segregation in the progeny population derived from matings (linkage analysis). These two hierarchies are con
nected through population and Mendelian genetic theories. In a recent study, Lou et al. (2003a) derived a closed-form solution for estimating population genetic parameters, such as al?ele fre
quencies and LD, within the maximum likelihood framework. Here we extend the idea of two-stage hierarchical sampling scheme to jointly model interval and LD mapping by imple
menting the closed-form EM algorithm of Lou et al. (2003a).
Compared with the earlier work by Wu and Zeng (2001) and Wu et al. (2002), this extended model displays increased com
putational efficiency and provides more precise estimates of the
QTL genomic location and genetic effects, because a pair of
flanking markers that bracket the QTL are used simultaneously.
2. TERMINOLOGY AND NOTATION
2.1 Haplotype, Diplotype, and Zygote Genotype
Suppose that there are L loci, Ax,..., AL, on a chromo
some which are segregating in a random mating natural pop
ulation. At locus / (/ = 1,..., L) there are k? al?eles, denoted
by A\,..., Alk . The haplotype, that is, a linear arrangement of
al?eles from the L loci on one member of a pair of homolo
gous chromosomes (Cepellini et al. 1967), can be described by a vector of L dimensions containing one al?ele from each of
the L loci. The al?eles from all of the L loci form nn = ]~]/=i k
haplotypes. During the process of pollination, paternally- and
maternally-derived haplotypes will randomly unite to form n\ combinations, among which there are a total of n^(nn + l)/2 different zygotic configurations, or diplotypes (Morton 1983). Because different diplotypes may have the same genotype, the number of multilocus genotypes, nz = ]~[/=i (^ + ^)h?2, will be less than the number of diplotypes. Let H?, V^> or H?H?', and Zv be the haplotype ? (? = 1,..., nn), diplotype ??' (? <
?' =
1,..., nn), and zygote genotype v (v = 1,..., nz). To dis
tinguish between polymorphic markers and quantitative trait loci composing L-locus haplotypes, we let ?(V^'), Qm(V^'), and
Gq?D??>) denote the many-to-one mapping operators tak
ing the genotypes at all loci, marker loci, and QTL of diplo type V&.
We use an example to demonstrate the concepts of haplo
type, diplotype, and zygote genotype. Consider two biallelic
loci, A and B, on the same chromosome, having al?eles A, a
and B, b. The haplotypes of these two loci are the combinations of al?eles, AB, Ab, aB, and ab. These haplotypes (or gametes) are randomly paired to generate 10 diplotypes, ABI AB, ABI Ab, ABlaB, ABI ab, Abi Ab, AblaB, Abi ab, aBlaB, aBlab, and ablab,
where "/" denotes the separation of the maternally and pater
nally derived gametes. Of the foregoing diplotypes, the fourth and sixth cannot be genotypically distinguished from each an other. Thus we actually have only nine distinguishable geno
types, expressed as AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, and aabb.
2.2 Recombination and Reduced Recombination
During the meiotic stage of life, the diplotype (zygotic con
figuration) arising from the unification of maternal and paternal haplotypes will generate new haplotypes for the next genera tion. Some of these new haplotypes will be different from the
maternal and paternal haplotypes that form the diplotype be cause of the occurrence of recombination events between a pair of contiguous loci. For example, diplotype ABI ab will generate four haplotypes, AB, ab, Ab, and aB, with the first two identi cal to the haplotypes composing the parental diplotype and the second two due to the recombination or crossover between the
two loci. The frequency at which the recombinant types occur
among the total number of haplotypes, termed the recombina
tion fraction, depends on the genetic distance between the two loci. Thus the recombination fraction (and therefore the genetic distance) can be estimated by counting the relative numbers of recombinant and nonrecombinant haplotypes that the zy
gote produces. But whether recombinant and nonrecombinant
types can be genotypically distinguished from each other in a
genotyped family depends on the heterozygosity of a diplotype. For those completely homozygous diplotypes (e.g., ABI AB), the
genotypic distinction between recombinant and nonrecombi
nant haplotypes is not possible, despite the fact that these two
types occur at the same time.
In this study we construct a general framework for integrat
ing the advantages of interval and LD mapping of complex traits. Traditional interval mapping is based on a known link
age group throughout which every two flanking markers are scanned to test whether a putative QTL is located between the two markers. Unlike such a traditional treatment, however, our
integrative model assumes that the recombination fraction be
tween the two markers is unknown a priori, to better specify the relationships among the markers and QTL. Let r\, r2, and
r be the recombination fractions between the left marker (of al?eles A and a) and QTL (of al?eles Q and q), between the
QTL and the right marker (of al?eles B and b), and between the two markers. Thus, if r\ and r2 are estimated, then the
location of the QTL within the marker interval can be deter mined. The estimation of r\ or r2 is based on the observations
of the recombinant and nonrecombinant types. For a completely
heterozygotic diplotype, AQBIaqb, the eight haplotypes that it
produces are distinguishable based on their formation mecha nisms resulting from the recombination or nonrecombination.
For those partially heterozygotic diplotypes, however, different formation mechanisms may lead to the same haplotype. For ex
ample, diplotype, AQBIAQb produces two distinguishable hap lotypes AQB and AQb, each of which may result from either recombination (R) or nonrecombination (N) between each of two pairs of adjacent loci (i.e., A and Q, Q and B) described
by NN, NR, RN, or RR. Assuming that there is no interference, the frequencies of these four genotypically indistinguishable formation mechanisms for each haplotype can be expressed in
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
160 Journal of the American Statistical Association, March 2005
Table 1. Examples of Reducible Recombination
Parental Formation Formation Frequency of State diplotype Haplotype mechanism frequency reduced recombination vector
AQB/AQb AQB NN (1 -
n)(1 -
r2)/2 .5 (0 0) NR (1-^/2 RN r1(1-r2)/2 RR /V2/2
AQb NN (1 -n)(1 -r2)/2 .5 (0 0) NR (1-A?)r2/2
RN r1(1-r2)/2 RR /1/2/2
AQBIAqb AQB NN (1 - n)(1 - r2)/2 (1 -
r2)/2 (0 -1) RN ri(1-r2)/2
AQb NR (1 -ri)r2/2 r2/2 (0 1) RR r1r2/2
AqB NR (1-r,)r2/2 r2/2 (0 1) RR /ir2/2
?<# NN (1 -n)(1 -r2)/2 (1-r2)/2 (0-1) RN Ai(1
- r2)/2
AQBIaQb AQB NN (1 -
n)(1 - r2)/2 (1 - r,)(1 -
r2)/2 (-1 -1) RR r^/2 nr2/2 (11)
AQb NR (1-Ai)r2/2 (1-r-i)r2/2 (-11) RN r^-r2)/2 ri(1-r2)/2 (1-1)
aQB NR (1-ri)r2/2 tf -
r,)r2/2 (-11) RN /i(1-r2)/2 n(1-r2)/2 (1-1)
aQb NN (i_n)(1-r2)/2 (1 - M(1 -
r2)/2 (-1-1) RR r,r2/2 r,r2/2 (11)
NOTE: The state of recombination is expressed as 1 for recombined (R), -1 for nonrecombined (N), and 0 for reducible.
terms of r\ and V2 with the form given in Table 1, which sum to .5 independent of the values of r\ and ri. Thus these four
mechanisms together (or the corresponding haplotype) are suf ficient for the characterization of the recombination and can be
regarded as reduced recombination.
Diplotype AQBIAqb produces four different haplotypes, AQB and Aqb due to either NN or RN and AQb and AqB due to ei ther NR or RR (see Table 1). The sum of the frequencies of the two corresponding formation mechanisms for each haplo
type is dependent on only one term, determined by r\. Thus the two different mechanisms together form the reduced recombi nation. Similarly, diplotype AQBIaQb also forms four different
haplotypes. Yet two formation mechanisms for each haplotype should each be viewed as the reduced recombination, because
the sum of the frequencies of the two mechanisms includes two nonreducible terms (see Table 1). In general, the reduced re
combination for a haplotype can be characterized by three dif ferent states for any two adjacent loci, recombined, nonrecom
bined, and reducible (i.e., the haplotype can be reduced to a recombinant or nonrecombinant), coded by 1, ?1, and 0 (see Table 1). As shown later, the reducible state is trivially suffi cient for the formation of a closed form for estimating r\ or ri.
We use Y? (? = 1,..., 3L_1) to denote a reduced recombina
tion vector for a L-locus haplotype. (There are a total of 3L_1
such vectors because we have three states between two adjacent
genes.) The reduced recombination vector for a diplotype, de
noted by 1Z, is the combination of Y? f?r the maternal parent and Y?' for the paternal parent.
2.3 Population Genetic and Gene Transmission Parameters
We let p? and P^t denote the frequencies of haplotype H% and diplotype D^< in the study population. If the population is
at Hardy-Weinberg equilibrium, then we have
I pi when ? = ?'
2p^p^< otherwise.
The haplotype frequency p% contains different components de termined by al?ele frequencies at each loci and coefficients of linkage disequilibrium of different orders among these loci
(Lynch and Walsh 1998). Lou et al. (2003a) provided a gen eral model for describing p? in terms of al?ele frequencies and
linkage disequilibria. For pure LD mapping (Lou et al. 2003a), we draw samples
in the same generation from a natural population. By testing for the statistical significance of LD, we can make an inference
about a possible tight linkage between markers and QTL. How ever, in a joint analysis of interval mapping and LD mapping, our samples are derived from both the natural (parental) pop ulation and its offspring population, which thus forms a two
stage hierarchical sampling scheme. When the samples from the older (generation t) parental population contain information about al?ele frequencies and LD, the samples from the younger (generation t + 1 ) offspring population contain gene segrega tion and transmission determined by the Mendelian laws and recombination fractions. The estimation of the recombination
fractions is based on gene transmission from the parental to off
spring generation. We let r\ denote the recombination fraction between loci A1 and Al+l.
2.4 Penetrance Function
For a quantitative trait, there is no one-to-one correspon dence between the genotype and phenotype (y). The conditional
probability of observing the corresponding phenotype given a
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 161
specified genotype, termed the penetrance function, is thus for mulated to describe the expression of the genotype. Because
the phenotype of a trait is genetically determined only by the
genotypes at the putative QTL, the penetrance function, given a
diplotype T>??>, can be expressed as
\2"
f(y\V^ = J2l^s\^')-7^^P 2ttg 2o2 (1)
where ?jls is the genotypic mean of QTL genotype gs, a2 is the residual variance, and \(gs\V^>) is the indicator variable,
Kgs\Vw) 1 if gs = Gq(V^i)', that is, QTL genotype (
is compatible with diplotype D^> 0 otherwise.
We ?t?gVL^ttf(ym\V^l),f(yP\Vp^l), and/(/|P|r)
as the
penetrance functions of phenotype given diplotype V^> for the maternal parent (m), paternal parent (p), and offspring (o). There is the same penetrance function between the maternal
and paternal parent if there is no sex effect. In many studies,
the same penetrance may be assumed for the parental and off
spring generations. Table 2 lists the genetic terminology used in our genetic map
ping. It also gives the corresponding definitions and symbols.
3. THE MODEL
3.1 The Complete-Data Likelihood
For our mapping study aimed at estimating haplotype fre
quencies, QTL effects and positions (measured by the recom bination fractions), marker data (M), zygote genotype (Z),
diplotypes CD), haplotype (H), reduced recombination (H), and phenotype data (y) are viewed as the complete data. The marker and phenotypic data, denoted in boldface, are observed
data, whereas zygote genotypes and diplotypes (containing pu tative QTL) of the parents and offspring, and reduced recombi nation configurations of the offspring, denoted in mathematical
face, are unobserved (or missing) data. Suppose that there are TV
unrelated families drawn from a randomly mating population. If there are no phenotypic covariances between parents and off
spring, then the likelihood function of the complete data can be
expressed as
L(il\ym, yp, y?, Mm, M^, M?, Dm, Vp, D?, K)
cxVxYDm,Dp\{Pw}} N ?
xfl \?yT\^T)f(yf\^) 7=1 I Ni ?
x f][Pr(^,^7|Pr,Pf)/(^.|^)] , (2) 7=1 i
where y's are the phenotypic vectors; X>'s are the diplotype vec
tors; 1Z is the reduced recombination matrix for the offspring; f? is the unknown vector containing population genetic para
meters (haplotype frequencies, f?p), quantitative genetic pa rameters (QTL genotypic values and residual variance, SIq), and QTL positions (measured by recombination fractions, ?p); / denotes the ith family derived from the ith maternal and ith pa ternal parent; j denotes they'th offspring within the ith family; and Ni is the number of offspring within family i.
Table 2. Definitions and Symbols Used to Describe Genetic Terminology in QTL Mapping
Terminology Symbol Definition
Al?ele
Al?ele frequency
Diplotype
Maternal diplotype
Paternal diplotype
Offspring diplotype Diplotype frequency
Haplotype Haplotype frequency Hardy-Weinberg equilibrium
Genotype Genotype frequency Linkage Linkage disequilibrium Locus Molecular marker Nonallele Penetrance
Phenotype QTL Recombinant
Nonrecombinant Reduced recombination
Recombination fraction
Zygote H?t?rozygote Homozygote
A1 A1
P\ V&
H'
HWE
Pv
f(yW^) y or y
R N
n r
Zv
Different copies of a gene at the same locus
Relative proportion of an al?ele in a population A set of haplotype pairs (one from the maternal parent and the other from the paternal parent) with the same genotype
Relative proportion of a diplotype in a population A linear arrangement of al?eles on the same chromosome Relative proportion of a haplotype in a population A population state in which genotype frequency is the product of the corresponding al?ele frequen cies The particular al?eles at specified loci present in an organism Relative frequency of a genotype in a population Co-segregation between al?eles at different loci in a progeny population Nonrandom association between al?eles at different loci in a population The chromosomal position at which a gene resides A particular DNA sequence that displays polymorphisms among individuals Al?eles from different loci
Percentage frequency with which a gene exhibits its effect
Physical, visible characteristics of a genotype Quantitative trait locus, a hypothesized region with al?eles triggering an effect on a quantitative trait The exchange of DNA segments between the maternal and paternal chromosomes DNA sequence with no exchange between the homologous chromosomes The recombination that occurs between a pair of loci but cannot be observed Relative proportion of recombinant gametes, that is, gametes that contain al?eles from different pa ternal chromosomes The combination of two al?ele or haplotypes each from a different parent The combination of different al?eles at a locus The combination of the same al?ele at a locus
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
162 Journal of the American Statistical Association, March 2005
The probability Pr[X>m, DP\{P^>}] is the joint probability of
maternal and paternal parental diplotypes, which depends on
a specific sampling scheme, for example, stratified, grouped,
censored, or truncated. Specially, there is a multinomial distri
bution for simple random samples,
nh
?x[Dm, Dp\{Pw}] oc W (Pw)Nl\i+NP^i,
where NnA, and _V?w are the numbers of maternal and paternal
parents with diplotype V^>. The probability Py(V^,7Z?j\V}?\
Vf) is the conditional probability of offspring y of family / hav
ing diplotype V?? and reduced recombination IZij given parental
diplotypes Vf and Vf. Of the three subsets of unknown para meters, ftp is contained within Pr[X>m, DP\{P^<}], SIq is con
tained within the penetration functions, and ?r is contained
within Pr(^.,^|P^,Pf).
The maximum likelihood estimators (MLEs) of the unknown
parameters under the complete-data likelihood function (2) are
given in Appendix A.
3.2 The Incomplete-Data Likelihood
In the genetic mapping of complex traits, only marker and
phenotype data are observed, whereas the data on diplotypes,
recombination events, and QTL genotypes are missing. The ob
served data alone are viewed as incomplete data of the previ
ously specified complete-data model.
The likelihood of the incomplete data including the pheno type (y) and marker information (M) can be formulated as
L(Sl\ym,yp,y?,Mm,Mp,M?)
otPv[Mm,Mp\{P^>}] N nh nh (
x LI E E mV^,\Mf)Vr(V^,|Mf)
xf(y?\v'^)f(y?\v^)
j=\ z<r' ?,?'
x/(^|?rT')]L (3)
where Pr(?> , \M ) and ?r(Vp , |Aff ) are the conditional prob abilities of the maternal (Ww) and paternal diplotypes (Vp,)
given maternal (Mf) and paternal marker genotypes (Mp) for
the ith family, and ?v(Mij,V0Tif,R?g'\V^/,Vp^)
is the con
ditional probability of offspring having marker genotype M[j, diplotype V?TT,
and reduced recombination R??> given maternal
diplotype V ? and paternal diplotype Vp,. It can be seen from (4) that the observed data constitute a
normal mixture model that is difficult to solve for the MLEs.
However, maximizing the expected complexdata likelihood and
detecting the corresponding MLEs is simple. Thus, to obtain the
MLEs of the incomplete-data likelihood, we attempt to maxi mize the completedata likelihood by replacing it by its condi tional expectation given the incomplete data. This expectation
is computed with respect to the conditional distribution of the
complete data given the incomplete data and the current es
timate of the unknown vector ?. Appendix B describes the derivation procedures and the implementation of the EM algo rithm to obtain the estimates.
3.3 Asymptotic Variance-Covariance Matrix
After the point estimates of parameters are obtained by the EM algorithm, it is necessary to derive the variance-covariance
matrix and evaluate the standard errors of the estimates. Be
cause the EM algorithm does not automatically provide the estimates of the asymptotic variance-covariance matrix for
parameters, an additional procedure has been developed to es
timate this matrix (and thereby standard errors) when the EM
algorithm is used (Louis 1982; Meng and Rubin 1991). These
techniques involve calculation of the incomplete-data informa
tion matrix, which is the negative second-order derivative of
the incomplete-data log-likelihood. Louis (1982) established an
important relationship among the information for the complete (Icom), incomplete (Xincom), and missing data (XmissX expressed as 2incom
= Xcom
? Xmiss. Because it is difficult to evaluate
^incom directly, Louis' relationship allows for an indirect eval
uation based on X m and Xmiss. The derivation procedures of
Zcom and Tmiss are given in Appendix C. Meng and Rubin
(1991) proposed a so-called "supplemented" EM algorithm or
SEM algorithm to estimate the asymptotic variance-covariance
matrices, which can also be used for calculating the standard
errors for the MLEs of our QTL parameters.
4. A CASE STUDY FROM A CANINE HIP DYSPLASIA PROJECT
A canine pedigree was developed to map QTL responsible for canine hip dysplasia (CHD) using molecular markers. Seven
founding greyhounds and six founding Labrador retrievers were
intercrossed, followed by backcrossing F\'s to the greyhounds and Labrador retrievers and intercrossing the Fi's. A series of
subsequent intercrosses among the progeny at different genera
tion levels led to a complex network pedigree structure (Fig. 1), which maximized phenotypic ranges in CHD-related quantita tive traits and the chance to detect substantial LD (Todhunter et al. 2003; Bliss et al. 2002). A total of 148 dogs from this out bred population were genotyped for 240 microsatellite markers located on 38 pairs of autosomes and 1 pair of sex chromosome
(Mellersh et al. 2000; Breen et al. 2001). A number of CHD associated morphological traits that display remarkable dis
crepancies between greyhounds and Labrador retrievers were
measured to examine the genetic and developmental basis of
CHD. In this study, body weight at age 8 months is used as an
example to illustrate our approach. The 148 dogs selected from this canine pedigree (see Fig. 1)
include 2-18 offspring and their parents for each of 22 fami lies. Because most of the selected families were derived from crosses between the unrelated founders, the families selected
for a joint linkage and LD analysis can be roughly assumed to
be random samples. However, a more precise analysis should
implement the network structure of the pedigree. All of the 38 autosomes carrying different numbers of markers were
scanned for the existence of QTL affecting body weights us
ing our joint analytical model. The critical value for claim
ing the chromosome-wide existence of QTL was determined
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 163
Greyhound - Labrador Retrieveroa Pedigree
Djdi Sunshine Knucklehead Jay
ft ?_?
g ???? Pollux Poly
H
r C47
?a.
GX16 26 36 46 56 66 76 86 96
??????? FB17 27 37 47 57 67 77
A15 25 35 45
?b S J
C50 Zeta Lambda
?????? B13 23 33 43 53 73 Z
Andy
A14 Castor
O Z
iU~Utiti Ws?t AX16 26 36 46 56
] BX16 26 36 46 56 66 76
?o?n???? AB18 28 38 48 58 68 78 98 I_
FX18 28 38 48 58 68 78 HB27 37 47 57 67 87
J? Ester Isis Sue Who
FX16 26 36 46
GX15 25 35 45 55 65 75 85 95 5 45 55 65 75 85 95
h????iibb D2F18 28 38 48 58 68 78 88
a????????itt] KB17 27 37 47 57 67 77 87 97 107 117 _I
????? EB17 27 37 47 57 67 ????uufa AB10 20 30 40 50 60 70
???o??? ooM????i??? C2B19 29 39 49 59 69 79 AB11 21 31 41 51 61 71 81 91
f??l DB17 27 37 47
?????????? GB17 27 37 47 57 67 77 87 97 107
B2F18 28 38 48 58 68 78 88
????????uuu IB19 29 39 49 5?
?'o?o???
DB19 29 39 49 59 69 79 89 99 109 119
HB18 28 48 58 68 88
?????? CB10 20 30 40 50 60
Figure 1. Diagram of an Outbred Pedigree in Canine. Squares and circles represent males and females. Filled and open portions of each symbol
represent the proportion of greyhound and Labrador retriever al?eles, possessed by that dog.
through permutation tests (Churchill and Doerge 1994). The
log-likelihood ratios calculated from 1,000 permutation tests were approximately chi-squared-distributed, whose .1% per
centile was used as the critical threshold (significance level a =
.001).
By comparing the a = .001 threshold (32.29) determined from permutation tests with the log-likelihood values calcu
lated at every 2 cM on each marker interval composed of two
adjacent markers, we detected four QTL for body weights lo cated on different chromosomes. As an example, Figure 2 illus
trates a profile of the log-likelihood ratios across chromosome
35 of length 39.94 cM carrying four markers, REN172L08, REN94K23, REN214H22, and RENO 1 GO 1. The profile has a
peak of value 35 at 8.9 cM from marker REN172L08, suggest ing the existence of a significant body weight QTL around that
position. As with traditional interval mapping (Lander and Botstein
1989), our joint model can estimate the positions of the QTL and the effects due to their gene actions and interactions. Like usual LD mapping, our joint model provides the MLEs for the al?ele frequencies of QTL and QTL marker and QTL-QTL link
age disequilibria (Lou et al. 2003a). Table 3 gives the MLEs of haplotype frequencies, genotypic means of the QTL, and the residual variance, along with their respective standard er
rors. From the haplotype frequencies we estimate the al?ele fre
quencies and linkage disequilibria (Lou et al. 2003a), whereas
from the genotypic means we estimate the additive and domi
nant genetic effects (Lynch and Walsh 1998). The MLE of the
frequency of the favorable al?ele that increase body weight for the QTL detected on chromosome 35 was pq
? .505, with a
reasonably small standard error .0725 (see Table 3). This QTL,
displaying a high heterozygosity, was found to have strong link
age disequilibria with the two markers flanking it, whose esti mates are reasonably precise.
As expected, the estimate of the additive genetic effect (a =
4.78 kg) exhibits much greater precision than that of the dom
inant effect (d ? ?1.32 kg) (see Table 3). This QTL triggers an effect body weight in an additive manner. The estimated ad
ditive and dominant genetic effects, along the estimated QTL al?ele frequencies, allow for the estimation of average effect,
a = a + d(pq ?
pq), that quantifies the value of the QTL car
ried by a parent and transmitted to its offspring (Lynch and Walsh 1998). The average effect of this QTL was estimated as
4.80, or 1.17 times the phenotypic standard deviation for body
weight. The additive and dominant genetic variances of this
QTL estimated from a2 = 2pQpqa2 and aj
= Ap20p2d2
were
8.29 and .44, accounting for .49 (narrow-sense heritability) and
.03 of the total phenotypic variance. It should be pointed out that the profile of the log-likelihood
ratios is quite flat at the left of the peak in the case of chro
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
164 Journal of the American Statistical Association, March 2005
0C to
cM
Figure 2. The Profile of the Log-Likelihood Ratios for Testing a QTL Affecting Body Weights on Chromosome 35 of Canine. The horizontal line
denotes the critical threshold for claiming the existence of a QTL ata = .001.
mosome 35 (see Fig. 2). This may result from the left marker
REN172L08 being associated through LD with a second body
weight QTL on a different chromosome. In other words, the
possible existence of LD between this marker and any QTL lo
Table 3. MLEs and Their Standard Errors (SEs) of Population and Quantitative Genetic
Parameters for the Body Weight QTL Bracketed by a Pair of Markers on Canine
Chromosome 35 in a Multifounder and Multigeneration Dog Pedigree
Parameters MLE SE
Population genetic parameters Haplotype frequencies
Al?ele frequencies and LD
Quantitative genetic parameters Genotypic means (kg)
Genetic effects (kg)
Average effect (kg) Additive genetic variance
Dominance genetic variance
Residual variance
Phenotypic variance
Narrow-sense heritability Dominance ratio
Paqb PAQb
PAqB PAqb PaQB PaQb
PaqB Paqb
PA Pq Pb
Dab Daq Dqb
daqb
Pqq
P a d
a
a2
h2
d/a
.3731 8.08e-31
.2359
.0422
.0367
.0953
.2168 1.27e-07
.6512
.5051
.8626
.0473
.0442 -.0258
.0442
33.06
23.50
26.96
28.28 3.05
-6.51
3.12
4.86
10.59
8.02
23.47
.21 2.13
.0719 0
.0688
.0232
.0300
.0365
.0525 8.98e-05
.0521
.0725
.0414
.0237
.0379
.0232
.0480
.61
.60
.43
.41
.34
.70
.83
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 165
cated outside chromosome 35 affects the precision of the es
timate of the QTL position. A similar problem occurs in pure linkage analysis, which led Zeng (1994) to propose using mul
tiple regression analysis to remove the influence of all QTLs outside the marker interval considered.
5. MONTE CARLO SIMULATION
To demonstrate the statistical properties of our approach, we
performed a simulation study that includes two different steps: (1) map construction, assuming that marker order is unknown,
and (2) QTL mapping based on a known linkage map. In the second step, the estimates of QTL effect, QTL position, al?ele
frequency, and QTL-marker linkage disequilibrium can be used to assess the estimation precision of these parameters in the real
example of canine mapping described earlier. We mimic the dog pedigree structure described earlier to ran
domly sample 20 independent families and 9 offspring from each family. We also consider a linkage group whose mark
ers are not evenly distributed. This type of linkage group com
prises five markers, with the recombination fractions .200, .020,
.150, and .015, between a pair of two adjacent markers from
the left to right (Table 4). Except for a triallelic third marker, all other markers are biallelic. The hypothesized values for the al ?ele frequencies of different markers and their pairwise linkage disequilibria are given in Table 2, assuming that no high-order linkage disequilibria exist. These five markers segregating in a
natural population are transmitted from parents to offspring.
By assuming that the gene order of these simulated mark
ers is unknown, we calculate the log-likelihood ratios for all
possible markers and choose one having the largest ratio as a
most likely order. In our example, the joint model correctly de
termines the gene order in 60% of 200 simulation replicates. Table 4 lists the MLEs of the population genetic parameters and recombination fractions between different marker pairs av
eraged over all of the 200 replicates. As seen from the square roots of mean squared errors, the joint model proposed can well
estimate these parameters. However, the estimation precision
of the recombination fractions will be reduced when there is a
weak linkage (see Table 2), or when linkage disequilibria do not
exist between different markers (results not shown).
In the second step, we simulated a QTL in the middle of two sparse markers, 1 and 2, and of two tightly linked markers, 4 and 5, on a known marker order given in Table 4. This QTL is segregating in a simulated population with al?ele frequencies .6 versus .4, has the additive effect of 1.22 and dominant effect of .61, and contributes 40% to the total phenotypic variance
(equivalent to the residual variance of 1.0). As illustrated by the
profiles of the log-likelihood ratios, this QTL can be identified, but with better mapping precision when two flanking markers are tightly (Fig. 3) than loosely linked (Fig. 4). Also, the map
ping precision of QTL depends on the magnitude of LD. The
QTL can be better detected when the markers and QTL display LD, in contrast to when there is no LD between these loci. This
advantage is more pronounced for two loosely linked markers
(see Fig. 4) than for tightly linked markers (see Fig. 3). Similar trends were detected for the precision of the MLEs of the addi tive and dominant effects of the QTL and the residual variance
(results not shown). These results suggests that a high-density map is necessary for the precise mapping of QTL when LD is
low, whereas a moderate-density map is adequate when a high
degree of LD exists.
Table 4. MLEs of Population Genetic Parameters and the Linkage Between Five Simulated Markers Segregating Among 20 Full-Sib Families Each of 9 Individuals
Markers on
linkage group rl(l+1)
Al?ele frequencies
Linkage disequilibrium
Ji(i+1) D> '1(1+2) Dl(l+3) Jl(l+4)
Given values
I Mi
I M2
I M3
MA I
M5 I
Estimated values
I MA
I M2
I M3
I MA
I M5
.200
.020
.150
.015
2012(.o615)
0196(.0179)
1530(.o281)
0149(.0136)
.6
.6
[.3, .3]
.6
.6
5996(.o573)
5969(.o549)
i-3028(.0469)> -2975(.oo53)]
[ 6079(.0542)]
.6008(.0526)
.04
.04
[.02, .02]
[.02, .02]
0388( .0280)
.0382(0272)
[ 0196(.0237),-0189(.0219)]
[ 0194(.0227),.0202(.0225)]
.04
[.02, .02]
.04
0396(.0286)
;.0182(.0222)> -01 81 (.0238)]
.0373(.0289)
[.02, .02]
.04 .04
[ 0209(.0264), 0166(.o255)] .0383
.0377j (.0299)
(.0261)
NOTE: The numbers in the round parentheses are the square roots of the MSEs of the MLEs. f/(/+i) is the recombination fraction between two adjacent markers; whereas D/(/+1), D/(/_|_2), D i (/ +3), and D ?(/ +4) are the linkage disequilibria between two adjacent markers and two markers separated by one, two, and three markers.
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
166 Journal of the American Statistical Association, March 2005
Distance (cM)
Figure 3. The Profiles of the Log-Likelihood Ratios for a QTL Located Between Two Tightly Linked Markers (M4 and M5). A peak of the profile is detected, although it is higher in the case where there are LD (?) than in the case where there is no LD between the markers and QTL (-
- -).
Note that there is the same critical threshold between the two cases because they share identical simulation conditions.
6. DISCUSSION
We have developed a general statistical framework for inte
grating different mapping strategies, interval mapping and LD
mapping, for the molecular dissection of complex traits. By tak
ing advantage of each strategy, our integrative approach makes
it possible to perform high-resolution mapping of complex traits based on the principle of LD. The prerequisite of apply ing LD to QTL mapping is that its occurrence is due to a tight linkage rather than to other evolutionary factors (Lynch and
Walsh 1998). A pure LD analysis would result in false positives, as documented in Alzheimer's disease mapping (Emahazion et al. 2001), when evolutionary factors (e.g., selection, muta
tion, population structure, population admixture) are involved
in the generation of disequilibrium. In addition, our mapping
approach, which jointly models the linkage and LD between markers and QTL, can handle the information from multiple segregating families. It will be reduced to the traditional interval
mapping approach (Lander and Botstein 1989) when no infor mation is included about the genetic structure of the population from which the families are randomly sampled.
There are four significant contributions to current QTL map ping studies made by the approach proposed in this article.
First, although joint linkage and LD analysis has received much attention in fine-scale mapping of QTL (Wu and Zeng 2001 ; Wu et al. 2002; Fan and Xiong 2002; Farnir et al. 2002; Meuwissen et al. 2002; Lund et al. 2003; Perez-Enciso 2003), none ofthose studies presented genetic models and statistical methodologies that can accommodate different mapping populations, genetic designs, pedigree structures, and marker systems. Here we have
Distance (cM)
Figure 4. The Profiles of the Log-Likelihood Ratios for a QTL Located Between Two Sparsely Linked Markers (M1 and M2)- When there are
LD between the markers and QTL, the profile has a flat peak on the interval of markers flanking the QTL (?). When no LD exists between different
loci, such a peak is not obvious (- - -).
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 167
generalized all of these approaches within a statistical frame
work into which QTL mappers can easily incorporate variables of interest to fit their problems. This generalized framework can be reduced to various special cases. For example, by set
ting the maternal diplotype (VTt>) equal to the paternal diplo
type (Vp ,), the model is reduced to autogamous plant systems, such as Arabidopsis and rice.
Second, similar to our recent model of pure LD (Lou et al.
2003a), the joint model is sophisticated and robust to an arbi
trary number of markers and al?eles at each locus, and linkage
disequilibria of different orders and epistatic interactions be tween different QTL. Clearly, this will make our joint model
broadly useful in different situations and for different purposes. Third, with simulated data, we have demonstrated the power of high-resolution mapping for complex traits using LD. The use of LD in QTL mapping can increase the high-resolution mapping power of QTL, especially when marker density is not
high (see Figs. 3 and 4), as compared with pure linkage map ping using recombination events in genotyped generations. This result will have immediate implications for the determination of QTL mapping strategies in those organisms in which there are substantial linkage disequilibria throughout the genomes. For example, according to our previous observations (Lou et al.
2003b), linkage disequilibria frequently occur within a region of 20-30 cM in canines. Thus a moderately dense linkage map for canine may be adequate for high-resolution mapping of QTL. Finally, our QTL model proposed here does not rely on a linkage map of known marker order. We have also imple
mented the algorithm for map construction, and thus can make
use of markers whose map orders are not known for QTL map
ping. We performed different simulation schemes by changing the
values of model parameters in their space to study the statis
tical properties of our model on one hand, and to pinpoint the
optimal genetic experimental design for the most efficient uti lization of the model on the other hand. The results from these simulations suggest that our model displays adequate power to detect QTL if it exists and provides reasonable estimates of ge netic parameters for the detected QTL. The simulation results
guide us in developing a high-density map for the precise map ping of QTL when LD is low and a moderate-density map when a high degree of LD exists. Such a preference between high and moderate-density maps is consistent with empirical results
from LD analysis of molecular markers (Rafalski 2002). Our approach is unique in that it embeds LD analysis in a
traditional interval mapping process. It relies on a two-stage hi
erarchical sampling strategy at the population level and at the
offspring-within-family level. Theoretically, information about the parents can enhance our analytical power and precision, but
may be missing in practice. The model requires further inves
tigation about the effects of missing data at the parental level on the parameter estimation. The likelihood functions for two
missing-data situations are given in Appendix D. Also, in the current model, we used two flanking markers at a time to lo
cate a QTL bracketed by the markers based on the marker-QTL linkage and disequilibrium. This strategy does not consider the
impact on the estimate of the log-likelihood ratio test statistics of other possible QTL located on the same chromosome. This
question may have contributed to an overestimation of genetic
effects of the detected QTL in the dog example, although other
factors, such as a small size of dog samples and multiple linked
QTL, would also cause this overestimation. A hypothesized QTL in a tested marker interval can be separated from those outside the interval for a pure interval mapping by including the rest of the markers (except for the flanking markers under
consideration) as cofactors (Zeng 1994). For our joint interval and LD mapping analysis, QTL on different chromosomes hav
ing an association with one or two of the flanking markers may
also affect the calculations of log-likelihood ratios (see Fig. 2).
Similarly, by combining our joint mapping analysis and mul
tiple regression analysis, the influence of the QTL outside the interval considered can be removed. An extensive investigation
of this combination analysis is needed.
APPENDIX A: ESTIMATES FOR THE COMPLETE-DATA LIKELIHOOD FUNCTION
This appendix gives the MLEs of the unknown parameters under
the complete-data likelihood function (2). We define the indicators
II if the maternal and paternal parents of family / have
diplotypes V^t and Vrr<
0 otherwise
j
and
?J ^m'tr'){Tx'??')
1 if offspring j of family / derived from
parental diplotypes T>^/ and
V^> pro duces diplotype PrT/ and reduced recombi
nation 7Zr "?? 0 otherwise
Then we rewrite the likelihood function (2) as
L(ym, yp, y?,T>m, VP, T>?\ K\to)
N nh nh (
*=1 ?<?'?<?'
11 E ?^[htt'?niTT'??'
Ni nh 3L
nEi j=\ r<r' ?,?<
The natural logarithm of the likelihood function can be expressed
lnL(ym f, y?, Vm, V?, Vo, 1Z\il)
= constant - N nh
EE InP,
N nh
HT.
tc^r
ln/(3fl^)EW
+ in/(>fi^)E4r^
N N? nh nj-, n^ 3L [
+ LLLLL ?^htt'??'KTT'??')
x [lnPr?^.Tl^l^.^O+ln/?^l^)]
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
168 Journal of the American Statistical Association, March 2005
77,7 ? constant + V] ?? ^nP?
+ ?n Inri +?n ln(l -
r])
+ nn lnr2 + nr2 ln(l -
o) H
+ WrL_i lnrL_! -f ?^lnQ -rL_i)
ZU
77/7
77/7
+ ln/(yf|^)EW
/V Af/ 77/7 77/7 77/7
+EEEEE ln/(^_0?4m,)(rT,i?, ?,?
where
N 77/7 r ? 77/7
Z. (4m'+ 4m} + Z (4r^'+ 4r^/}
Af /V/ 77/7 77/7 nh 3L ] ,
n? "EEEEE E 4r..')(rr'iO
[y^/(i+y?/) + y^(i + y^/)]
/V N; 77/7 77/7 77/7 3L '
z^ZZ Z z_, ?^\zm'^')(^'
[^/(i-M + y^a-^/)]
,?r?0
and }/?/ (/ = 1,..., L ? 1) is element / of yg. The MLEs can be derived
by differentiating the log-likelihood with respect to fi, then setting
each derivative equal to 0 and solving the set of simultaneous equa
tions. The MLEs are
nH
P^2-2__V f
= l,2,...,w?;
n = /=1,2,...,L-1;
M, = Ef=iE??<r[ifel%0>fE^4rc^]
A?
Ef=iEr<r[ifel%0Er<^4rc^l
Ef=1E^r[ifel^r)^E^^4^r]. Ef=iEncA<c^fel^?OE?A<r4r?^]
'
/V Af; 77 h nh nh 3L_1
/=1 ?=1 ?<?'?<?'*<*'_;<_;'
(iV Ni ?/, nh 77/, 3L_1
7=1 /=l?<C'?<C'T<r's-,?'
Ifel^)] -1
om2
a
and
p2.
N
EiiE^iE^ zya/j:s[i(gs\v^)(yf - A?)2]}
N M ?/? ?/? "/? 3L '
-?2=eeei:ee
N
??'??'tt'??' i=lj=l %<!;'?<r't<t' ?,?'
s \/=l
APPENDIX B: ESTIMATES FOR THE INCOMPLETE-DATA LIKELIHOOD FUNCTION
In this Appendix we describe derivation of the MLEs of the un
known parameter vector ? for the incomplete data, followed by the
implementation of the EM algorithm to estimate the MLEs. We rewrite
the likelihood of the incomplete data (y and M) described in (4) as
L(?l\ym,yP,y?,Mm,Ml?,M?)
N nh nh (
Il E E \-WP^p^nyT\^)f(yf\^) ?=1 ?<?'?<?
Ni nh 3L-]
xnEE[> j=\T<T> ?,?>
*f(yijW?TT,)]
'mO(Tr'?i') "^rt" 'v??' '^fr ' ~?f .?Pr(v?TT?nccl\vTl:?vP)
where
vm,.,
and
JJ v{^>t;r>){TT>??i)
'
(B.l)
1 if diplotypes X> , and tP , are compatible with
the marker genotypes of the maternal and paternal
parent-forming family i
0 otherwise
1 ifgm(V^)=M?psin?gm(Vtr)=Mp
0 otherwise
1 if diplotype Vo , and reduced recombi
nation 1Z?r' produced by maternal and
paternal diplotypes Pfw and Vp , are com
patible with the marker genotype of off
spring j from family / 0 otherwise,
where the remaining notation is as described in Section 3.2.
As mentioned earlier, the missing data in our model are zygote
genotypes and diplotypes (containing putative QTL) of the parents and offspring, and reduced recombination configurations of the off
spring. The observed data are the marker genotypes and phenotypes of
the parents and offspring. The conditional distribution of the missing data given the observed data connects the complete-data likelihood,
Lc(a\ym,yP,y?,Mm,MP,M?,T>m,VP,V?,n), and the observed
data likelihood, L(ti\ym,yP,y?,Mm,MP,M?). The conditional dis
tribution of parental diplotypes given the observed data is expressed
ki (T>m, Vp\tt, ym, yP, y?, Mm, MP, M?)
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 169
L-l Ni ? nh 3^~l
^ EEKw)(rr'?') 7=1 It<t:' ?,?'
(nh nh ?
E E w^^.^.p^^.ri^v^^i^r)
M r 77/7 3?-]
Xrt EEKwKrr^) 7=1 It<t' ?.?'
xPr^T^P^
(B.2)
and the conditional distribution of the diplotype and reduced recombi
nation of the offspring given the observed data and parental diplotypes is expressed as
k2(V?, n\to, yn\ yp, y?, Mm, Mp, M?, T>m,Vp, T>?)
nh 3L-] / nh 3"-1
Denote the conditional distributions of the missing data at different
hierarchies [(B.2) and (B.3)] by x*,, and 7r(it/^^)(TT/cc/)-
Maxi
mization of the observed data likelihood is achieved by differentiating the log-likelihood of (B.l), which leads to
? lnL(ft|ym, yp, y?, Mm, Mp, M?) dit
=EE^ E-W + E-W M nh ?dW(ym\Vftl) nh
+ EeM|^E*W
d\nf(yp\VP?,) ?O. dtt Q
nh \
M Ni nh nh nh 3L l
+ ZZZ E E Z ^m'Krr'?T?r')
31nPr(^r/,^^|^r,^) ain/(yg-|PgT,) 9?_!/? 8?n
(B.4)
and solving the corresponding log-likelihood equations. We implement the EM algorithm to estimate population genetic
parameters Up (including marker-QTL haplotype frequencies, al?ele
frequencies, and linkage disequilibria), quantitative genetic parame ters
SIq (including QTL effects and residual variance) and QTL posi tions SIr. In the E step, using (B.2) and (B.3), calculate the conditional
probability 7T?jtt/ of the diplotypes of the maternal
(T>^,) and pater
nal parents (Dp ,) generating family /, and the conditional probability
jr'iw /w , f) of offspring j with diplotype D? , and reduced re
combination R??/
from family i derived from maternal parental diplo
type Ww and paternal parental diplotype Dp ,. In the M step, us
ing (B.4), obtain the MLEs of the unknown parameters by substituting
4?'ff and4rc?')(rT'?i') with4V and4(??r)(rr'fio-These two steps are repeated until convergence.
APPENDIX C: DERIVATIONS OF THE ASYMPTOTIC VARIANCE-COVARIANCE MATRIX
As shown by Louis (1982), the observed information (^ncom) can
be obtained by subtracting the missing information (Xmjss) from the
complete information (Xcom). Using Louis' notation, we denote the
complete data by x = (x\,x2, ...,xn)T, the observed data by y =
(y\,y2, . ,yn)T, and the parameter vector by 0. The likelihood of
the incomplete data is expressed as
fY(y\0)= [ fx(*\0)dri(x), (C.l)
where R = (x : y(x)
= y) and ?i(x) is a dominating measure. For (C. 1 ),
we have the score,
31n/y(y|?) f &(x|0) 3\nfx(x\0) - dfi(x), (C.2)
r /x(x|i " Jr /r/x(x|?) 3? JR/R/x(x|?)dM(x) 90
and the Hessian matrix,
32lnfr(y|tf) sede7
fx(x\0) d2infx(x\0)
-L RfRfx(x\0)dii(x) 30 801 d?(x)
. JAyx\0) d\nfx(x\0) d\nfx(x\0) + l '-x\0) de ^W^?r-?W r fxWQ) JR Jr/x(x|i
/. fx(x\0) d\nfx(x\0) ̂
dfi(x) RfRfx(x\0)dii(x) dO
fx(x\0) d\nfx(x\0)
1* y d?i(x). (C.3) lfx(x\0)dfi(x) dB1
When x\, x2,..., xn are independent but not necessarily identically distributed and y/(x)
= y/(x?), we have R =
R\ x R2 x x Rn. To
make the score (C.2) and the Hessian matrix (C.3) more tractable, we
need to use Fubini's theorem. By incorporating Fubini's theorem, the
score and the Hessian matrix can now be expressed as
91n/r(y|0) ^ f fx(Xi\0) 31nfr(*f|?) . , .
'hJR>y de ?^JR?JrMx?WMx?)
de (C.4)
and
92ln/F(y|fl) d6d0T
fX(xi\0) d2lnfx(xj\0)
f^jRifR/xiximd?ixi) d0 30T d?(Xi)
A r fxU.m d\nfx(xi\0)d\nfx(xi\0) d
n r
JRi ?R?
fx(xiW) d\nfx(xj\0) dix(xj)
Mxi\0)d?(xi) 90
fx(xi\0) d\nfx{xi\0) Mxi\9)dti(x?) 30'
d?(xj) (C.5)
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
170 Journal of the American Statistical Association, March 2005
Based on (C.4) and (C.5), the complete XCom and missing informa
tion Xmjss can be calculated using Louis' formulas, from which the
observed information TmQOm is estimated.
In our situation, the observed information matrix can be readily es
timated for haplotype frequencies, genotypic means, the residual vari
ance, and the QTL position (measured by the recombination fractions
between the QTL and flanking markers). A further step is needed to de
rive the variance-covariance matrices for al?ele frequencies and link
age disequilibria from the haplotype frequencies (see Lou et al. 2003a) and for the additive and dominant genetic effects from the genotypic means. It is not difficult to obtain the variance-covariance matrices for
al?ele frequencies and genetic effects, because they are linear functions
of haplotype frequencies and genotypic means.
Because the analytical expression for estimating the variances and
covariances of linkage disequilibria does not exist, we derive an ap
proximate expression for these variances and covariances using the
delta method derived from a Taylor' series expansion. If a parameter vector 0 is a function of the characteristic parameter 6 [i.e., 0 =f(0)], then the approximate variance-covariance of 0 =f(0) is
,?, 3/(0) ?.MB) var(0) = on var(0)
d0 do1 (C.6)
In our situation, we have
' Dab \ DAQ
pBQ \DAqbJ
/] -PA -PB 1 -PA -PQ ! -PB-PQ -PB
I -PA -PB -PB -PA
0 V -PA
1 -PA -PQ -PQ -PQ -PA -PA
0
-PB
-PQ 0
1-PB-PQ -PB -PQ
1 - DAB -
bAQ -
DBQ -
PA -
PB -
PQ + IpAPB + ?-PaPQ + 2-PBPQ \ IPAPB + ^PBPQ
- DAB -D?Q- PB
IPAPQ + 2PBPQ -DAQ-DBQ-PQ 2pBPQ-DBQ ^
2pAPB + IPAPQ -
DAB ~DAQ-PA IpAPB
- Dab
^papq-Daq )
(PAQB \ PAQb PAqB PAqb PaQB PaQb
\ PaqB '
/I -PA -PB 1 -PA -PQ ! -PB-PQ -PB 1 -PA- PO -PB
PQ 0
1 -PA -
-PB -PA
0 -PA
PA -
PQ PB -PQ
-PQ -PA -PA
0
1 -PB-PQ -PB -PQ
1 - DAB -DaQ-Dbq-Pa- PB -
PQ + 2PAPB + 2pAPQ + 2PBPQ \ 2-PAPB + 2PBPQ
- Dab -Dbq-pb
IPAPQ + 2-PBPQ -DaQ-DbQ-PQ
2PBPQ-b?Q^ lPA?>B + 2pAPQ
- dab -Daq-Pa
IPAPB -
Dab
2-PAPQ -DAQ )
APPENDIX D: LIKELIHOOD FUNCTIONS FOR SOME SPECIAL CASES
We have derived a general likelihood function for joint interval and
LD mapping. However, in practice there are many special cases. If
parent phenotypes are not available, then the likelihood can be reduced
to a marginal form,
L(f?|y?, M?, X>m, T>p, Vo, 71)
N Ni
aPr[X^2>P|{P?r}]nn 7=17=1
The marginal likelihoods for two other special cases can also be for
mulated as
L(tt\ym,yp,y?,V0,K)
a^Pr[X>m,D^|{P^/}] N (
*Y\\f(y?wr)f(y?Wp) 7=11
Ni 1
x Y[[Pv(V^nij\Vip,Vp)f(y^J\Vp] (D.l) 7=1 i
when parental diplotypes are not available, and as
L(ti\y?,V?,11) (xJ2MT>m,Vp\{P^,}] N Ni
^n^'^i^'^^1^1 (d-2) '=17=1
when both diplotypes and phenotypes of the parents are not available.
In both (D.l) and (D.2), ? denotes the summation over [Vm, T>p] e
{[T>,T>]: compatible with Vo}.
[Received June 2003. Revised June 2004.]
REFERENCES Allison, D. B. (1997), "Transmission-Disequilibrium Tests for Quantitative
Traits," American Journal of Human Genetics, 60, 676-690.
Bliss, S., Todhunter, R. J., Quaas, R., Casella, G., Wu, R. L., Lust, G., Williams, A. J., Hamilton, S., Dykes, N. L., Yeager, A., Gilbert, R. O., Burton-Wurster, N. I., and Acland, G. M. (2002), "Quantitative Genetics of Traits Associated With Hip Dysplasia in a Canine Pedigree Constructed by Mating Dysplastic Labrador Retrievers With Unaffected Greyhounds," Amer ican Journal of Veterinary Research, 63, 1029-1035.
Breen, M., Jouquand, S., Renier, C, Mellersh, C. S. et al. (2001),
"Chromosome-Specific Single-Locus FISH Probes Allow Anchorage of an
1800-Marker Integrated Radiation-Hybrid/Linkage Map of the Domestic Dog Genome to All Chromosomes," Genome Research, 11, 1784-1795.
Cepellini, R., Curtoni, E. S., Mattiuz, P. L., Miggiano, V., Scudeller, G., and
Serra, A. (1967), Histocompatibility Testing, Copenhagen: Munksgaard. Churchhill, G. A., and Doerge, R. W. (1994), "Empirical Threshold Values for
Quantitative Trait Mapping," Genetics, 138, 963-971.
Emahazion, T., Feuk, L., Jobs, M., Sawyer, S. L., Fredman, D. et al. (2001), "SNP Associations Studies in Alzheimer's Disease Highlight Problems for
Complex Disease Analysis," Trends in Genetics, 17, 407-413.
Fan, R. Z., and Xiong, M. M. (2002), "Combined High-Resolution Linkage and Association Mapping of Quantitative Trait Loci," European Journal of Human Genetics, 11, 125-137.
Farnir, F., Grisart, B., Coppieters, W., Riquet, J., Berzi, P., Cambisano, N., Karim, L., Mni, M., Moisio, S., Simon, P., Wagenaar, D., Vilkki, J., and
Georges, M. S. (2002), "Simultaneous Mining of Linkage and Linkage Dis
equilibrium to Fine Map Quantitative Trait Loci in Outbred Half-Sib Pedi
grees: Revisiting the Location of a Quantitative Trait Locus With Major Effect on Milk Production on Bovine Chromosome 14," Genetics, 161, 275-287.
H?stbacka, J., de la Chapelle, A., Kaitila, I., Sistonen, P., Weaver, A., and
Lander, E. (1992), "Linkage Disequilibrium Mapping in Isolated Founder
Populations: Diastropic Dysplasia in Finland," Nature Genetics, 2, 204-221.
H?stbacka, J., de la Chapelle, A., Mahtani, M., Clines, G., Reeve-Daly, M. P.,
Daly, M. et al. (1994), "The Diastropic Dysplasia Gene Encodes a Novel Sul fate Transporter: Positional Cloning by Fine-Structure Linkage Disequilib rium Mapping," Cell, 78, 1073-1087.
Jansen, R. C. (2000), "Quantitative Trait Loci in Inbred Lines," in Handbook
of Statistical Genetics, eds. D. J. Balding, M. Bishop, and C. Cannings, New York: Wiley, pp. 567-597.
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions
Lou et al.: Unifying Interval and Linkage Disequilibrium Mapping 171
Lander, E. S., and Botstein, D. (1989), "Mapping Mendelian Factors Underly ing Quantitative Traits Using RFLP Linkage Maps," Genetics, 121, 185-199.
Lou, X.-Y, Casella, G., Littell, R. C, Yang, M. C. K., Johnson, J. A., and
Wu, R. L. (2003a), "A Haplotype-Based Algorithm for Multilocus Linkage
Disequilibrium Mapping of Quantitative Trait Loci With Epistasis," Genet
ics, 163,1533-1548.
Lou, X.-Y, Todhunter, R. L, Lin, M., Lu, Q., Liu, T., Wang, Z. H., Bliss, S., Casella, G., Acland, A. C., Lust, G., and Wu, R. L. (2003b), "The Extent and
Distribution of Linkage Disequilibrium in Canine," Mammalian Genome, 14, 555-564.
Louis, T. A. (1982), "Finding the Observed Information Matrix When Using the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 44, 226-233.
Lund, M. S., Sorensen, R, Guldbrandtsen, B., and Sorensen, D. A. (2003), "Multitrait Fine Mapping of Quantitative Trait Loci Using Combined Linkage Disequilibria and Linkage Analysis," Genetics, 163, 405^-10.
Lynch, M., and Walsh, B. (1998), Genetics and Analysis of Quantitative Traits, Sunderland, MA: Sinauer.
Mackay, T. F. C. (2001), "The Genetic Architecture of Quantitative Traits," Annual Reviews of Genetics, 35, 303-339.
Mellersh, C. S., Hitte, C, Richman, M., Vignaux, F., Pri?t, C, Jouquand, S., Werner, R, Andre, C, DeRose, S., Patterson, D. F., Ostrander, E. A., and
Galibert, F. (2000), "An Integrated Linkage-Radiation Hybrid Map of the Ca nine Genome," Mammalian Genome, 11, 120-130.
Meng, X.-L., and Rubin, D. B. (1991), "Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm," Journal of the Ameri can Statistical Association, 86, 899-909.
Meuwissen, T. H. E., Karlsen, A., Lien, S., Olsaker, I., and Goddard, M. E.
(2002), "Fine Mapping of a Quantitative Trait Locus for Twinning Rate Using Combined Linkage and Linkage Disequilibrium Mapping," Genetics, 161, 373-379.
Morton, N. E. (1983), Methods in Genetic Epidemiology, Basel: Karger. P?rez-Enciso, M. (2003), "Fine Mapping of Complex Trait Genes Combin
ing Pedigree and Linkage Disequilibrium Information: A Bayesian Unified
Framework," Genetics, 163, 1497-1510.
Rafaski, A. (2002), "Applications of Single Nucleotide Polymorphisms in Crop Genetics," Current Opinions in Plant Biology, 5, 94-100.
Todhunter, R. J., Bliss, S. P., Casella, G., Wu, R., Lust, G., Burton
Wurster, N. I., Williams, A. J., Gilbert, R. O., and Acland, G. M. (2003), "Ge netic Structure of Susceptibility Traits for Hip Dysplasia and Microsatellite Informativeness of an Outcrossed Canine Pedigree," Journal of Heredity, 94, 39-48.
Wu, R. L., and Zeng, Z.-B. (2001), "Joint Linkage and Linkage Disequilibrium Mapping in Natural Populations," Genetics, 157, 899-909.
Wu, R. L., Ma, C.-X., and Casella, G. (2002), "Joint Linkage and Linkage Dis
equilibrium Mapping of Quantitative Trait Loci in Natural Populations," Ge
netics, 160,779-792.
Zeng, Z.-B. (1994), "Precision Mapping of Quantitative Trait Loci," Genetics, 136,1457-1468.
This content downloaded from 62.122.79.21 on Mon, 16 Jun 2014 06:22:04 AMAll use subject to JSTOR Terms and Conditions