Copyright © 2010 by the Genetics Society of America
DOI: 10.1534/genetics.110.115055

A Hierarchical Bayesian Model for a Novel Sparse Partial Diallel Crossing Design

Anthony J. Greenberg,*,1 Sean R. Hackett,* Lawrence G. Harshman† and Andrew G. Clark*

*Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853 and †School of Biological Sciences, University of Nebraska, Lincoln, Nebraska 68588

Manuscript received February 1, 2010
Accepted for publication February 6, 2010

ABSTRACT

Partial diallel crossing designs are in common use among evolutionary geneticists, as well as among plant and animal breeders. When the goal is to make statements about populations represented by a given set of lines, it is desirable to maximize the number of lines sampled given a set number of crosses among them. We propose an augmented round-robin design that accomplishes this. We develop a hierarchical Bayesian model to estimate quantitative genetic parameters from our scheme. For example, we show how to partition genetic effects into specific and general combining abilities, and the method provides estimates of heritability, dominance, and genetic correlations in the face of complex and unbalanced designs. We test our approach with simulated and real data. We show that although the models slightly overestimate genetic variances, main effects are assessed accurately and precisely. We also illustrate how our approach allows the construction of posterior distributions of combinations of parameters by calculating narrow-sense heritability and a genetic correlation between activities of two enzymes.

CROSSES among inbred lines (diallel crosses) have a long history in quantitative genetics (Sprague and Tatum 1942; Lynch and Walsh 1998). When one is interested in easily measurable properties of a particular, and limited, set of lines, the full diallel provides an efficient means of estimation of many parameters of interest to, for example, animal and plant breeders (Lynch and Walsh 1998). In evolutionary genetics, however, a common aim is to estimate the properties of populations the inbred lines represent. This requires a large sample of lines from a given population, and the crosses have to be replicated to control for environmental variance. Furthermore, the phenotypes assayed can be complicated or expensive to determine. Because the number of crosses in the full diallel grows very rapidly with the number of lines, this design is not practical for such purposes (Kempthorne and Curnow 1961; Lynch and Walsh 1998).

A variety of designs where only a subset of possible crosses is performed have been developed over the years (Lynch and Walsh 1998). Ideally, one would like a scheme where the number of crosses grows linearly with the number of lines sampled, while maintaining the ability to estimate important parameters relevant to evolutionary processes, such as narrow-sense heritability. The round-robin design (Wayne et al. 2004), where each line participates in one cross as a male and one as a female, comes close. This design is related to the circulant crosses proposed by Kempthorne and Curnow (1961). The round-robin scheme is attractive because the number of crosses is equal to the number of lines, and thus the number of lines sampled with a given number of crosses is maximized. However, this design provides low power to estimate the main effects of lines [general combining ability (Sprague and Tatum 1942; Lynch and Walsh 1998)], and for X-linked traits assayed in males the task is impossible (Wayne et al. 2004).
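To make the scaling argument concrete, the following Python sketch counts crosses under a full diallel and builds a simple round-robin. It is only an illustration: the function names and the choice to count reciprocal crosses separately are assumptions made here, not part of the published design.

```python
def full_diallel_crosses(n_lines, reciprocals=True, selfs=False):
    """Number of crosses in a full diallel among n_lines inbred lines."""
    pairs = n_lines * (n_lines - 1)        # ordered (female, male) pairs
    if not reciprocals:
        pairs //= 2                        # count each unordered pair once
    return pairs + (n_lines if selfs else 0)


def round_robin(lines):
    """Each line serves once as the female parent and once as the male parent."""
    n = len(lines)
    return [(lines[i], lines[(i + 1) % n]) for i in range(n)]


lines = [f"L{i}" for i in range(1, 16)]    # 15 hypothetical line names
crosses = round_robin(lines)
print(len(crosses), full_diallel_crosses(len(lines)))   # 15 210
```

With 15 lines the round-robin needs 15 crosses, whereas a full diallel with reciprocals needs 210, which is the quadratic growth the text refers to.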

Another possibility is to use only inbred lines themselves (e.g., Ayroles et al. 2009a). While useful for association mapping, this approach precludes separate estimates of general and specific (i.e., pertaining to a particular cross) combining ability. Furthermore, the effect of inbreeding on the genetic architectures of traits is largely unknown (Lynch and Walsh 1998; Charlesworth and Charlesworth 1999). Recent studies show that mRNA levels of many genes, especially those involved in metabolism and stress response, are appreciably altered in inbred lines of Drosophila melanogaster (Kristensen et al. 2006; Ayroles et al. 2009b), potentially changing the associations of nucleotide variants with phenotypes. Inbreeding may thus confound the results of quantitative genetic experiments.

While partial diallel designs are an attractive option when the number of crosses is limited, estimation of quantitative genetic parameters from such schemes is in general difficult (Lynch and Walsh 1998). Traditional ANOVA and maximum-likelihood approaches can deal with relatively simple designs, especially in the absence of replication (Lynch and Walsh 1998). Development of Bayesian methods, coupled to Markov chain Monte Carlo (MCMC) sampling, greatly expanded the flexibility to estimate parameters from crossing schemes of arbitrary complexity (Sorensen and Gianola 2002). Moreover, Bayesian posterior distributions of parameters of interest take account of uncertainty in estimates of all other elements of a model. Finally, MCMC methods allow sampling from distributions of arbitrary combinations of parameters. This feature greatly expands the set of questions that can be probed statistically. Animal breeders have long embraced Bayesian statistics (Gianola and Fernando 1986; Blasco 2001; Sorensen and Gianola 2002; Thompson et al. 2005) and these methods are popular in population genetics and quantitative trait mapping (Beaumont and Rannala 2004). In contrast, applications in evolutionary quantitative genetics are still rare (Walsh 2001; O’Hara et al. 2008). We therefore set out to explore the utility of a Bayesian approach in estimating parameters of interest to evolutionary quantitative geneticists. In particular, we examined hierarchical modeling, a method that is popular in the social sciences (Gelman and Hill 2007) but has not been widely adopted by geneticists. Our purpose is not to demonstrate the superiority of the Bayesian perspective; we believe that the choice between Bayesian and maximum-likelihood approaches should be dictated by pragmatic considerations. Rather, we seek to demonstrate that Bayesian modeling can provide an accurate and useful alternative to traditional methods, even in the face of a complicated experimental design.

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.115055/DC1.

1Corresponding author: 221 Biotechnology Bldg., Cornell University, Ithaca, NY 14853. E-mail: [email protected]

Genetics 185: 361–373 (May 2010)

Downloaded from https://academic.oup.com/genetics/article/185/1/361/6063783 by guest on 23 December 2021

We propose an augmented round-robin crossing design and a hierarchical Bayesian model to estimate quantitative genetic parameters from it. This approach is motivated by our interest in assessing within- and between-population variation in a variety of metabolic traits, building on the work of Clark (1989). We augment the round-robin design with crosses between populations and measurements of the inbred lines themselves. Thus, each line participates in four crosses and is also "selfed," enabling us to separate general and specific combining ability. We replicate the crosses to measure environmental effects.
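One way to picture the augmented design is the Python sketch below. The within-population round-robin and the "selfed" lines follow the description above, but the rule used here to pair lines across populations is a hypothetical stand-in, since the actual layout is given in Figure 1A, which is not reproduced in this text.

```python
from collections import Counter


def augmented_round_robin(pops):
    """pops maps a population name to its list of line names (equal sizes)."""
    crosses = []                                    # (female, male, cross_type)
    names = list(pops)
    for lines in pops.values():
        n = len(lines)
        for i, ln in enumerate(lines):
            crosses.append((ln, ln, "inbred"))                   # the line itself
            crosses.append((ln, lines[(i + 1) % n], "within"))   # round-robin
    for a, pa in enumerate(names):
        pb = names[(a + 1) % len(names)]
        for fem, mal in zip(pops[pa], pops[pb]):
            crosses.append((fem, mal, "between"))   # hypothetical pairing rule
    return crosses


pops = {p: [f"{p}{i}" for i in range(1, 16)] for p in "ABCD"}
design = augmented_round_robin(pops)
outcrosses = Counter()
for fem, mal, kind in design:
    if kind != "inbred":
        outcrosses[fem] += 1
        outcrosses[mal] += 1
print(len(design), set(outcrosses.values()), len(design) * 2 * 3)   # 180 {4} 1080
```

Under these assumptions, four populations of 15 lines give 180 crosses, every line takes part in four crosses plus its own inbred measurement, and two blocks of three replicates yield 1080 measurements, the size of each simulated data set described in the Methods.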

The nesting of environmental replicates within crosses, and lines within populations, is naturally modeled using a hierarchical Bayesian approach (Gelman et al. 2004; Gelman and Hill 2007). We implement such a scheme and perform simulations to show that it performs well in estimating quantitative genetic parameters of interest to us. We illustrate the utility of our approach with an example using real data. The techniques we describe should be useful in a variety of settings in evolutionary quantitative genetics, as well as in animal and plant breeding.

METHODS

Simulated data: The crossing scheme we simulated is depicted in Figure 1A. We deterministically set mean values for four populations (population means: A = 8.0, B = 9.0, C = 8.5, and D = 6.0) and generated 15 line means per population from a normal distribution with the mean equal to the corresponding population mean. We ran two sets of simulations, one with high and one with low narrow-sense heritability ($h^2$) (Lynch and Walsh 1998). We list the parameters of these simulations in Table 1. The standard deviations (SD) for the normal distributions we used to generate line means were 1.5 for the high- and 0.5 for the low-heritability sets. To simulate crosses, we drew values from normal distributions with means equal to midparental values and standard deviations 1.0 (high heritability) and 0.4 (low heritability). Cross values were then adjusted for inbreeding and outcrossing by subtracting a value drawn from a normal distribution with mean 2.0 and SD 0.25 for inbred lines and adding a value drawn from a normal distribution with mean 1.0 and SD 0.1 for the between-population crosses. In keeping with the real data sets we are generating in the lab, we hierarchically simulated two kinds of environmental effects (Figure 1B). First, we generated two "blocks" of replicates by drawing two values for each cross from a normal distribution with mean equal to the cross mean and SD 0.7 (high heritability) and 1.0 (low heritability). Next, we generated "replicates" from blocks by drawing three values for each block from a normal distribution centered on the block mean, with SD 0.5 (high heritability) and 1.5 (low heritability).
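The nested sampling just described can be sketched in Python as follows. This is not the authors' R script from File S1. The names are invented for illustration, and drawing a selfed line around its own line mean with the cross SD is an assumption, because the text does not spell out that case.

```python
import random

# Standard deviations quoted in the text for the two simulation sets.
PARAMS = {
    "high": {"line": 1.5, "cross": 1.0, "block": 0.7, "rep": 0.5},
    "low":  {"line": 0.5, "cross": 0.4, "block": 1.0, "rep": 1.5},
}


def simulate_cross(mu_female, mu_male, kind, sd, rng, n_blocks=2, n_reps=3):
    """Return n_blocks lists of n_reps replicate values for one cross."""
    cross = rng.gauss((mu_female + mu_male) / 2, sd["cross"])   # midparent + deviation
    if kind == "inbred":
        cross -= rng.gauss(2.0, 0.25)       # inbreeding adjustment
    elif kind == "between":
        cross += rng.gauss(1.0, 0.1)        # outcrossing adjustment
    blocks = [rng.gauss(cross, sd["block"]) for _ in range(n_blocks)]
    return [[rng.gauss(b, sd["rep"]) for _ in range(n_reps)] for b in blocks]


rng = random.Random(1)
sd = PARAMS["high"]
line_means = [rng.gauss(8.0, sd["line"]) for _ in range(15)]    # population A
reps = simulate_cross(line_means[0], line_means[1], "within", sd, rng)
```

Each call returns two blocks of three replicate values, mirroring the block-within-cross and replicate-within-block structure of Figure 1B.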

In the absence of epistasis, narrow-sense heritability can be estimated by the formula

$$h^2 = \frac{\sigma^2_A}{\sigma^2_A + \sigma^2_D + \sigma^2_E}$$

(Lynch and Walsh 1998). For crosses of inbred lines, $\sigma^2_A$ can be estimated by calculating $\sigma^2_{GCA}$ ($\sigma^2_A = 2\sigma^2_{GCA}$) and $\sigma^2_D$ by $\sigma^2_{SCA}$ ($\sigma^2_D = \sigma^2_{SCA}$), assuming epistatic interactions are weak (Lynch and Walsh 1998). $\sigma^2_A$ (additive genetic variance) is the variance among line means from a given population. In our case, the true $\sigma^2_A$ is the square of the SD of the distribution from which our simulation drew the line means for each population (see supporting information, File S1 for details). Dominance is the deviation of each cross from the mean of the two parental lines. Thus, the true $\sigma^2_D$ in our simulations is the square of the SD of the distribution used to pick cross values from parental line means. The environmental variance, $\sigma^2_E$, is the sum of the block and replicate variances. Plugging in the values for these parameters enumerated above, $h^2$ for high- and low-heritability cases is 0.56 and 0.07, respectively.
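The two heritability values can be checked directly from the simulation standard deviations. The short Python sketch below does the arithmetic; the function name is an illustrative choice.

```python
def narrow_sense_h2(sd_line, sd_cross, sd_block, sd_rep):
    var_a = sd_line ** 2                    # additive variance among line means
    var_d = sd_cross ** 2                   # dominance: deviation from midparent
    var_e = sd_block ** 2 + sd_rep ** 2     # environmental: block + replicate
    return var_a / (var_a + var_d + var_e)


h2_high = narrow_sense_h2(1.5, 1.0, 0.7, 0.5)   # 2.25 / 3.99
h2_low = narrow_sense_h2(0.5, 0.4, 1.0, 1.5)    # 0.25 / 3.66
print(round(h2_high, 2), round(h2_low, 2))      # 0.56 0.07
```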

In our experiments, we measure enzyme kinetics on each replicate. We do this by providing a substrate to fly extracts and measuring changes in optical density of a substrate or product of an enzymatic reaction with time. The slope of the resulting line is the maximum enzyme rate, $V_{max}$. This rate is thus measured with some error, which is in practice variable from reaction to reaction. To investigate the effect of this on the inferences of model parameters, we simulated $V_{max}$ values from normal distributions centered on the corresponding replicate values. To mimic the real-world variation in assay quality, standard deviations for these distributions were drawn from truncated normal distributions (constraining SD to be above zero) with mean 20 times the replicate mean and SD 4 times the replicate value. We then generated points from these distributions and reestimated the means (replicate values) and their standard deviations.
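As a rough illustration of how a rate and its uncertainty come out of a kinetic trace, the Python sketch below fits an ordinary least-squares line to optical density against time and reports the slope with its standard error. The readings are made up, and ordinary least squares is an assumption about the fitting method, not a description of the authors' assay software.

```python
import math


def slope_with_se(times, readings):
    """Least-squares slope of readings on times and its standard error."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(readings) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, readings))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    rss = sum((y - intercept - slope * t) ** 2 for t, y in zip(times, readings))
    se = math.sqrt(rss / (n - 2) / sxx)     # residual variance / Sxx
    return slope, se


times = [0, 1, 2, 3, 4, 5]
od = [0.10, 0.31, 0.49, 0.72, 0.90, 1.09]   # made-up optical densities
vmax, vmax_se = slope_with_se(times, od)
```

The pair `(vmax, vmax_se)` corresponds to one estimated enzyme rate and its standard deviation, the two quantities carried forward for each replicate.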

In measuring enzyme activity, some reactions run aberrantly and produce outliers. To assess the impact of outlier observations on inference, we duplicated each data set, randomly drew 1% of observations, and multiplied each by a value drawn from a uniform distribution between 3 and 6. The regression standard deviations were multiplied by the same number. Thus, each data set has a "regular" version and a version with outliers.
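The outlier step amounts to a few lines of code. The Python sketch below is one possible reading of it, assuming the 1% is rounded to a whole number of observations; the function and variable names are illustrative.

```python
import random


def add_outliers(rates, rate_sds, fraction=0.01, rng=None):
    """Return copies of rates and SDs with a random fraction inflated 3- to 6-fold."""
    rng = rng or random.Random()
    rates, rate_sds = list(rates), list(rate_sds)
    n_out = round(fraction * len(rates))
    for i in rng.sample(range(len(rates)), n_out):
        factor = rng.uniform(3, 6)
        rates[i] *= factor          # inflate the rate estimate
        rate_sds[i] *= factor       # and its regression SD by the same factor
    return rates, rate_sds


clean = [10.0] * 1080               # placeholder rates for one data set
sds = [0.5] * 1080
dirty, dirty_sds = add_outliers(clean, sds, rng=random.Random(7))
```

Because the function copies its inputs, the "regular" version of the data set is left untouched alongside the version with outliers.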

Each simulated data set consists of 1080 sets of estimated enzyme rates with their standard deviations. We implemented the simulations in R (R Development Core Team 2008) and generated 500 data sets for each of the heritability values. The R script we used, with detailed annotations, is included in File S1.

Real data: To check the performance of our model on real data, we chose two well-characterized enzymes from the pentose phosphate pathway, glucose-6-phosphate dehydrogenase (G6PD, EC 1.1.1.49) and 6-phosphogluconate dehydrogenase (6PGD, EC 1.1.1.44). We performed the assays as described in Clark and Keith (1989). We chose 92 D. melanogaster lines from five populations and inbred them for 12 generations. Nineteen lines came from each of The Netherlands [courtesy of Z. Bochdanovits (Bochdanovits and Jong 2003)], Ithaca, New York (collected in 2004 by E. M. Hill-Burns and B. P. Lazzaro), and Tasmania (sent in 2003 courtesy of A. A. Hoffman); 17 were from Beijing [provided by C. Aquadro (Begun and Aquadro 1995)]; and 18 were from Zimbabwe [ZH lines are from Harare and ZS lines are from Sengawa, provided by C. Aquadro (Begun and Aquadro 1993), and ZW lines are from Victoria Falls, provided by W. Ballard]. The names of the lines are listed in the data file (enz_data.tsv) in File S4. We crossed the lines in a scheme similar to the one depicted in Figure 1A and replicated the crosses in two blocks, with three replicates for each block. Five male flies were collected per replicate, weighed, crushed, and resuspended in buffer (Clark and Keith 1989). We performed the kinetic assays in 96-well plates, each plate containing extracts from one replicate of a particular kind of cross (i.e., inbred lines or within- or between-population crosses). The complete data set is provided in the enz_data.tsv file in File S4.

Normal model for simulated data: We describe the important features of our Bayesian hierarchical model here. A full description of the distributions of parameters and the Gibbs sampling scheme can be found in File S2.

Traditionally, data similar to ours are analyzed using linear mixed models. The effects of, say, line are either "random," if one is interested in the population the lines come from, or "fixed," if one is interested in the particular sample of lines (Box and Tiao 1973), although different interpretations of these terms are also in use (Gelman 2005). Bayesians consider all parameters as random variables; thus the terms fixed and random effects are confusing. We use the terms "sample" and "population" parameters, when referring to groups of variables, or specifically identify them as means or variances. Depending on the biological question we are considering, we may be interested in either or both kinds of variables, and the strength of the Bayesian approach is that it allows us to simultaneously estimate them all (Box and Tiao 1973; Gelman 2005).

Bayesian analysis of mixed-effects models, together with MCMC sampling from posterior distributions, is generally successful (Box and Tiao 1973; Sorensen and Gianola 2002; Gelman et al. 2004). However, a number of computational difficulties exist (Gilks and Roberts 1996; Gelman et al. 2004), largely due to posterior correlations between the population and sample parameters (Gelfand et al. 1995; Gilks and Roberts 1996; Gelman et al. 2004). Bayesian hierarchical models make use of the data structure to alleviate these difficulties using two basic techniques. First, the population parameters are centered on means for the corresponding level [i.e., the sample parameter values; this is hierarchical centering (Gelfand et al. 1995)], rather than on zero as is customary in mixed-effects models (Lynch and Walsh 1998; Sorensen and Gianola 2002). Thus, for example, in our model the block values $\mu^{bl}_j$ are distributed normally with the mean among all blocks derived from the same cross ($\mu^{cross}_{k[j]}$) and variance $\sigma^2_{bl}$.

Second, for a given level in the hierarchy, the likelihood for a sample parameter depends on the information from the level below, while the prior comes from the level above (Gelman et al. 2004; Gelman and Hill 2007). To illustrate this, we consider estimation of a block mean, $\mu^{bl}_j$. The likelihood is

$$N(\mu^{rep}_{i \in j}, \sigma^2_{rep}),$$

where $\mu^{rep}_{i \in j}$ is the mean of all the replicate values that belong to the same block $j$ (hence the notation $i \in j$), and $\sigma^2_{rep}$ is the variance among replicates. The prior is

$$N(\mu^{cross}_{k[j]}, \sigma^2_{bl}),$$

where $\mu^{cross}_{k[j]}$ is the value for cross $k$ that the block belongs to, and $\sigma^2_{bl}$ is the among-block variance. Given that the normal prior is conjugate to the normal likelihood, the posterior distribution of $\mu^{bl}_j$ is

$$\mu^{bl}_j \sim N\left(\frac{(n_j/\sigma^2_{rep})\,\mu^{rep}_{i \in j} + (1/\sigma^2_{bl})\,\mu^{cross}_{k[j]}}{n_j/\sigma^2_{rep} + 1/\sigma^2_{bl}},\ \frac{1}{n_j/\sigma^2_{rep} + 1/\sigma^2_{bl}}\right) \qquad (1)$$

(Box and Tiao 1973; Gelman et al. 2004; Gelman and Hill 2007), where $n_j$ is the number of replicates in each block $j$ and the other parameters are as before. Thus, each value $\mu^{bl}_j$ is pulled toward the relevant cross mean, and the strength of the pulling is dependent on the precision of the estimate of $\mu^{bl}_j$ from the data on one hand and the precision of the estimate of the cross mean on the other (Gelman et al. 2004; Gelman and Hill 2007). Note also that unbalanced designs (where $n_j$ are not the same across blocks, for example) are easily dealt with in this framework. The grand mean does not have a level above it, and therefore we assign it an improper flat prior in the model. It has been shown that with the types of model we are considering here, this choice of prior for the grand mean still leads to a proper posterior (Gelman et al. 2004). An additional complication is that this parameter is estimated from a small number of populations and thus the assumption of normality seems unrealistic. Instead, we assumed that population means come from a Student’s t distribution with 3 d.f.
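The precision weighting in Equation 1 is easy to see numerically. The Python sketch below computes the posterior mean and variance of one block mean; the function name and the example numbers are illustrative, and a Gibbs step would additionally draw a value from the resulting normal distribution.

```python
def block_posterior(rep_values, mu_cross, var_rep, var_bl):
    """Posterior mean and variance of a block mean under Equation 1."""
    n_j = len(rep_values)
    rep_mean = sum(rep_values) / n_j
    prec_data = n_j / var_rep        # precision contributed by the replicates
    prec_prior = 1.0 / var_bl        # precision contributed by the cross level
    post_var = 1.0 / (prec_data + prec_prior)
    post_mean = post_var * (prec_data * rep_mean + prec_prior * mu_cross)
    return post_mean, post_var


# Same replicate mean (10.0) and cross mean (8.0), different numbers of replicates.
mean3, var3 = block_posterior([9.0, 10.0, 11.0], mu_cross=8.0, var_rep=0.25, var_bl=0.49)
mean1, var1 = block_posterior([10.0], mu_cross=8.0, var_rep=0.25, var_bl=0.49)
```

With three replicates the posterior mean stays close to the replicate mean of 10; with a single replicate it is pulled further toward the cross mean of 8 and its variance is larger. This is also why unequal $n_j$ need no special handling.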

To estimate line means, we consider all the crosses a given line participates in. Because there are three types of crosses (the selfed inbred lines and the within- and between-population crosses), we first have to correct for cross type effects. In principle, this is straightforward. Taking the within-population crosses as the baseline, we regress the cross means on two indicator variables: one takes the value of 1 when the cross is a selfed inbred line and 0 otherwise; the other takes the value of 1 only when the cross is between populations. The intercept is then the specific combining ability, or the cross mean after correcting for cross type. Since the coefficients of inbreeding and outcrossing ($\beta_{inbr}$ and $\beta_{outc}$) are not further modeled (and are given flat priors), this is the well-known fixed-slope, variable-intercept regression (Gelman and Hill 2007). However, in practice it turns out that this approach leads to posterior correlations between the regression coefficients and the among-block variance (Gelman et al. 2004), leading to poor mixing and often to convergence of $\sigma^2_{bl}$ estimates to zero (not shown).

To alleviate this problem, we implemented parameter expansion (Gelman et al. 2004, 2008) when estimating the cross means. To do this, we modeled block means as

$$\mu^{bl}_j \sim N\left(\frac{\mu^{ln}_{\text{♀}l[j]} + \mu^{ln}_{\text{♂}l[j]}}{2} + X\beta + \beta^{SCA}_{k[j]},\ \sigma^2_{bl}\right),$$

where $\mu^{ln}_{\text{♀}l[j]}$ and $\mu^{ln}_{\text{♂}l[j]}$ are the line means for the female and male parents of the cross, $X$ is the matrix of indexes for inbreeding and outcrossing, $\beta = (\beta_{inbr}, \beta_{outc})^T$, and $\beta^{SCA}_{k[j]}$ is the coefficient for specific combining ability, which we model as normally distributed with mean 0 and variance $\sigma^2_{SCA}$. Note that $(\mu^{ln}_{\text{♀}l[j]} + \mu^{ln}_{\text{♂}l[j]})/2 + X\beta + \beta^{SCA}_{k[j]} = \mu^{cross}_{k[j]}$ of Equation 1. Thus, we abandon hierarchical centering for this level and use a mixed model. To break the posterior correlation among $\beta$, $\beta^{SCA}_{k[j]}$, and $\sigma^2_{bl}$, we introduce parameter expansion by replacing $\beta^{SCA}_{k[j]}$ with a product of two values:

$$\beta^{SCA}_{k[j]} = \alpha^{SCA}\gamma_{k[j]}$$

$$\gamma_k \sim N(0, \sigma^2_\gamma).$$

The three values, $\alpha^{SCA}$, $\gamma_k$, and $\sigma_\gamma$, are not independently identified in the model and thus do not converge on any value (Gelman et al. 2004, 2008). However, the parameters we are interested in can be recovered at each iteration of the Markov chain:

$$\mu^{SCA}_k = \frac{\mu^{ln}_{\text{♀}l[k]} + \mu^{ln}_{\text{♂}l[k]}}{2} + \beta^{SCA}_k = \frac{\mu^{ln}_{\text{♀}l[k]} + \mu^{ln}_{\text{♂}l[k]}}{2} + \alpha^{SCA}\gamma_k$$

and

$$\sigma^2_{SCA} = (\alpha^{SCA})^2\sigma^2_\gamma.$$

These variables are defined in the model and converge on the appropriate values, but because they are products of undefined parameters that are free to wander in the sample space, the posterior correlations are broken, leading to faster and more accurate convergence (Gelman et al. 2004, 2008).
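The sense in which the working parameters are not identified, while their products are, can be checked with a small Python sketch. The function name and the numbers are illustrative only.

```python
def recover_sca(alpha, gammas, var_gamma):
    """Map the working parameters to the identified quantities."""
    beta_sca = [alpha * g for g in gammas]   # beta^SCA_k = alpha^SCA * gamma_k
    var_sca = alpha ** 2 * var_gamma         # sigma^2_SCA = (alpha^SCA)^2 * sigma^2_gamma
    return beta_sca, var_sca


gammas = [0.4, -1.2, 0.7]
b1, v1 = recover_sca(2.0, gammas, var_gamma=0.9)

# Rescale the working parameters: alpha * c, gamma / c, sigma^2_gamma / c^2.
c = 5.0
b2, v2 = recover_sca(2.0 * c, [g / c for g in gammas], var_gamma=0.9 / c ** 2)
```

Both parameterizations give the same $\beta^{SCA}_k$ and the same $\sigma^2_{SCA}$ (up to floating-point rounding), so the chain can drift along the scale $c$ without changing any quantity of interest.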

In our implementation of the Markov chain, we update the $\beta$-vector as a batch by first calculating the mean of $\mu^{bl}$ values that belong to the same cross $k$, subtracting the current values for $(\mu^{ln}_{\text{♀}l[k]} + \mu^{ln}_{\text{♂}l[k]})/2$ and $\alpha^{SCA}\gamma_k$, and regressing the resulting value on $X$ without an intercept as suggested in section 18.4 of Gelman and Hill (2007).
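Because the two cross-type indicators are never both 1 for the same cross, the no-intercept regression has a particularly simple least-squares solution, sketched below in Python. This gives only the point estimate around which a Gibbs draw would be centered; the sampling step itself, and the names used here, are not taken from File S2.

```python
def regress_on_indicators(resid, is_inbred, is_between):
    """Least-squares fit of resid = b_inbr*is_inbred + b_outc*is_between, no intercept."""
    # With mutually exclusive 0/1 indicators, X'X is diagonal, so each
    # coefficient is the mean residual among crosses of that type.
    inbr = [r for r, a in zip(resid, is_inbred) if a]
    outc = [r for r, b in zip(resid, is_between) if b]
    return sum(inbr) / len(inbr), sum(outc) / len(outc)


# Made-up residuals: cross-level block means minus midparent and alpha*gamma terms.
resid = [-2.1, -1.9, 0.1, -0.1, 1.2, 0.8]
is_inbred = [1, 1, 0, 0, 0, 0]
is_between = [0, 0, 0, 0, 1, 1]
b_inbr, b_outc = regress_on_indicators(resid, is_inbred, is_between)
```

Within-population crosses have both indicators equal to 0, so they serve as the baseline and do not enter either coefficient.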

Once the inbreeding and outcrossing effects have been controlled for, we use the values of $\mu^{SCA}$ to calculate line effects or general combining abilities. For each line $l$, we examine all the crosses it participates in (five in total). We take the $\mu^{SCA}_{k \in l}$ value of the selfed cross as is. For the other crosses, we subtract the line mean for the opposite parent from twice the $\mu^{SCA}_{k \in l}$ value for each cross. Thus, we have five estimates of the line mean from five different crosses. The likelihood for the line effect is a normal distribution with mean equal to the average of the five estimates and variance $\sigma^2_{SCA}$. The variance among line effects is twice the general combining ability variance ($2 \times \sigma^2_{GCA}$).

Likelihoods for estimates of variances are inverse-$\chi^2$. We allow the three kinds of crosses to have different $\sigma^2_{rep}$ and $\sigma^2_{bl}$, on the basis of a priori biological considerations and some preliminary analyses of real data. We assume that $\sigma^2_{SCA}$ and $\sigma^2_{GCA}$ each take a single value. We used flat improper priors on $\sigma$ [$p(\sigma) \propto 1$ (Gelman 2006; Gelman and Hill 2007)]. We also tried proper Gamma priors, but these tend to shrink variance estimates to zero (Gelman 2006; Gelman and Hill 2007), leading to slow mixing and often to convergence failure (not shown).
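For a single variance with a flat prior on $\sigma$ and a normal likelihood around a known center, the conditional posterior of $\sigma^2$ is a scaled inverse-$\chi^2$ distribution with $n - 1$ d.f., which can be sampled as the sum of squares divided by a $\chi^2$ draw. The Python sketch below illustrates that generic update. It is not the sampling scheme of File S2, and the names and test data are made up.

```python
import random


def draw_variance(values, center, rng):
    """One draw of sigma^2 given values ~ N(center, sigma^2) and p(sigma) flat."""
    n = len(values)                              # requires n >= 2
    ss = sum((v - center) ** 2 for v in values)
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)    # chi-square with n - 1 d.f.
    return ss / chi2


rng = random.Random(3)
vals = [rng.gauss(0.0, 2.0) for _ in range(200)]     # true variance is 4
draws = sorted(draw_variance(vals, 0.0, rng) for _ in range(2000))
posterior_median = draws[len(draws) // 2]
```

With 200 observations the draws concentrate near the true variance; with only a handful of values per group the same update produces a much wider, right-skewed posterior.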

Each replicate value in our data set is the slope of the corresponding kinetic curve or the enzyme maximal rate ($V_{max}$). We treated these values in two ways. First, we simply took the point estimates of $V_{max}$. Second, we modeled $V_{max}$ values hierarchically with the prior being the mean value for the relevant block of replicates and the likelihood based on estimates of replicate values and their standard deviations computed as described at the end of the Simulated data section.

Student’s t models for outliers: For data sets with outliers, constructed as discussed above, we implemented a further extension of our model. Instead of a normal, we used Student’s t distributions for replicate means:

$$\mu^{rep}_i \sim t_{\nu_{rep}}(\mu^{bl}_{j[i]}, \sigma^2_{rep}).$$

Student’s t distributions have fatter tails than the normal, and therefore estimates of $\mu^{bl}_j$ should be more robust to outliers (Gelman et al. 2004). For a given value of the variance, samples generated from a t distribution are expected to contain more observations that are far from the mean compared to samples generated from a normal. Thus, outlier observations do not inflate estimates of the variance, and the extent to which this is true depends on the degrees of freedom. As the degrees of freedom increase, a t distribution approaches the normal with the corresponding mean and variance.

We report analyses of the simulated data sets using t distributions with 3 and 6 d.f. Three degrees of freedom is the smallest integer value for which both the mean and the variance of the t distribution are finite.

We implemented sampling from t distributions on the basis of their interpretation as mixtures of normals (Gelman et al. 2004). To improve mixing, we used parameter expansion for variance estimates (as suggested in section 11.8 of Gelman et al. 2004). Details of the sampling scheme are in File S2.
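The mixture-of-normals view can be sketched in a few lines of Python: each observation gets its own variance drawn from a scaled inverse-$\chi^2$ distribution and is then drawn from a normal with that variance. This is the standard construction; the function name and the tail-count comparison are illustrative and do not reproduce the sampler in File S2.

```python
import random


def draw_t_via_mixture(mu, sigma2, nu, rng):
    """Draw from t_nu(mu, sigma2) as a normal with a random variance."""
    v = nu * sigma2 / rng.gammavariate(nu / 2, 2.0)   # V ~ scaled Inv-chi2(nu, sigma2)
    return rng.gauss(mu, v ** 0.5)                    # y | V ~ N(mu, V)


rng = random.Random(11)
t_draws = [draw_t_via_mixture(0.0, 1.0, 3, rng) for _ in range(20000)]
z_draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]
far_t = sum(abs(x) > 4 for x in t_draws)    # far-from-center draws under t with 3 d.f.
far_z = sum(abs(x) > 4 for x in z_draws)    # same threshold under the normal
```

The t draws include many more values beyond four scale units than the normal draws, which is the fat-tail behavior that lets the model absorb outlying replicates.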

Computation: Because we assume normal or t distributions for sample parameters throughout, the posterior distributions for all variables are available in closed form. Thus, we can take advantage of the efficiency of Gibbs sampling (Gilks et al. 1996) to construct the marginal posterior distributions for all parameters. We used blockwise updating for variables of the same level, with the exception of line effects, which had to be updated one at a time. The full sampling schemes for both the normal and the Student’s t model can be found in File S2.

To set initial values, we first calculated approximate point estimates of parameters. For example, block means are means of the corresponding replicates, cross means are means of blocks, and so on. We then picked starting values from overdispersed distributions centered on these approximate estimates. For example, starting values for block means were picked from normal distributions with means equal to the approximate block estimates and variances 4 times the approximate estimate of $\sigma^2_{rep}$. We picked initial values for standard errors from uniform distributions with lower bounds 0.2 times the approximate estimate and upper bounds 5 times the estimate.

We implemented the sampling algorithms in R. We provide an example R script that implements the Student’s t model, with detailed annotations, in File S2. We ran three independent chains for each data set, with 500 iterations of burn-in followed by 1000 sampling iterations. We processed the chains using the coda package in R (version 0.13-4) (Plummer et al. 2009). If we suspected lack of convergence, we looked at time-series graphs of parameters. In a number of cases, lack of convergence was clearly due to insufficient burn-in because an initial value happened to be far from the eventual estimate. In these cases, we reran all three chains with new initial values.

For each variable, we compiled several statistics across the 500 high-heritability and 500 low-heritability data sets. For each data set, we noted whether the true value fell within the posterior 95% credibility interval, the fractional difference (i.e., the absolute difference divided by the true value) between the median of the posterior distribution and the true value, the time-series estimate of the coefficient of variation (CVR) (the standard error divided by the true value; time-series SE calculated by the summary function in coda) of the posterior distribution, and the Gelman–Rubin convergence diagnostic (Gelman and Rubin 1992, as implemented in coda).
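Two of these summaries are simple enough to sketch in Python. The code below computes the basic Gelman–Rubin potential scale reduction factor and a crude interval-coverage check. It omits the degrees-of-freedom correction that coda applies, and the quantile rule and function names are simplifications chosen for illustration.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    w = mean([variance(c) for c in chains])          # mean within-chain variance
    b = n * variance([mean(c) for c in chains])      # between-chain variance
    var_plus = (n - 1) / n * w + b / n
    return (var_plus / w) ** 0.5


def covers(samples, truth, level=0.95):
    """True if truth lies inside a crude central credibility interval."""
    s = sorted(samples)
    lo = s[int((1 - level) / 2 * len(s))]
    hi = s[int((1 + level) / 2 * len(s)) - 1]
    return lo <= truth <= hi


rng = random.Random(5)
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
bad = [good[0], good[1], [x + 3.0 for x in good[2]]]   # one chain stuck elsewhere
rhat_good, rhat_bad = gelman_rubin(good), gelman_rubin(bad)
pooled = [x for chain in good for x in chain]
```

Three chains sampling the same distribution give a factor close to 1, while shifting one chain pushes it well above 1, the pattern that would prompt a rerun with new initial values.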

Restricted maximum-likelihood estimates of parameters: As an alternative to our Bayesian hierarchical models, we used restricted maximum likelihood (REML) to estimate parameters using a traditional mixed-effects model. We employed the REML implementation provided by the lmer() function in the lme4 package in R (Bates and Maechler 2009). Gelman and Hill (2007) provide several examples of the use of this function for the analysis of hierarchical models. We used point estimates of $V_{max}$ as the response variable and treated the replicate, block, cross (SCA), and line (GCA) effects as random. Population and cross-type effects were fixed. We describe the details of our model, the R code, and the results in File S3.


Model and computation for real data: To analyze real kinetic data for G6PD and 6PGD, we implemented a version of the Student’s t model with 3 d.f. described above. The model was programmed in R. We provide the annotated script, along with the raw data, in File S4. Three complications arise in the real data that we did not incorporate into our simulations. First, a small subset (0.4%) of the data is missing. Second, because the number of lines in each population is unequal, the design is somewhat unbalanced. Third, we performed the enzyme assays in 96-well plates and inspection of slope estimates clearly indicated the presence of plate effects.

In the analyses presented here, we did not implement missing data imputation, treating the missing data as not collected. Thus, the problem reduces to slight imbalance in the replication structure. The model detailed above does not assume equal sample sizes at any level of the hierarchy, and thus we did not have to make any changes in our algorithms to accommodate the missing data and the imbalance in the crossing design. We corrected for plate effects within the Gibbs sampler, as described in File S3.

We ran the Gibbs sampler for both enzymes simultaneously, updating variables at each level in blocks as described above for simulated data, one enzyme at a time. We generated five chains, sampling 20,000 iterations per chain after a 2000-iteration burn-in. We then thinned the output, retaining 2000 values per chain.
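As a small illustration of the burn-in and thinning arithmetic, the sketch below keeps every tenth post-burn-in draw so that 20,000 iterations reduce to 2000. The text does not say which iterations were retained, so the regular interval and the function name are assumptions.

```python
def discard_burn_in_and_thin(chain, burn_in, n_keep):
    """Drop the burn-in draws, then keep n_keep evenly spaced draws."""
    kept = chain[burn_in:]
    step = len(kept) // n_keep  # 20000 // 2000 gives a thinning interval of 10
    return kept[step - 1::step][:n_keep]


# Iteration indices stand in for sampled values in this toy chain.
chain = list(range(22000))
thinned = discard_burn_in_and_thin(chain, burn_in=2000, n_keep=2000)
```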

RESULTS

Cross design and analysis: Our choice of cross design and analysis methods is motivated by our empirical research. We study evolution of metabolic function within and between populations, using D. melanogaster as a model organism. Looking at within-population variation, we are interested in the potential for adaptation to new conditions (and hence in narrow-sense heritability), as well as the effects of deleterious mutations that are maintained by mutation–selection balance (and hence in the effects of inbreeding). Between-population variation should be informative about local adaptation, and thus we want to compare population means and look for potential genetic incompatibility by examining between-population crosses. The regulatory network that binds enzymes together can be probed by measuring genetic correlations among enzyme activities (Clark 1989). For this, we need estimates of line effects (general combining abilities).

Given these considerations, we developed a modified round-robin crossing design (Figure 1A). It starts from round-robin (Wayne et al. 2004) crosses within populations, where each line participates in one cross as a male and in another cross as a female. To these, we added the inbred lines themselves and crosses between populations. Each line participates in two interpopulation crosses, once as a male and once as a female, each with a different population. This makes the between-population crosses balanced both at the line level (all lines participate in an equal number of crosses) and at the population level (each population is crossed to all other populations an equal number of times). As in the regular round-robin crossing design, the number of crosses grows linearly with the number of lines evaluated, allowing adequate sampling to estimate population parameters. Since each line is crossed five times, we can additionally separate specific and general combining ability.
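The linear growth in the number of crosses can be made concrete with a short sketch. The within-population pairing below (each line as a male with the next line as a female, wrapping around) is one way to satisfy the round-robin rule and is an assumption; the between-population pairing is not reproduced, only counted. The 60-line example and the function names are illustrative.

```python
def within_population_round_robin(lines):
    """Return (male, female) pairs: each line is a male once and a female once."""
    n = len(lines)
    return [(lines[i], lines[(i + 1) % n]) for i in range(n)]


def cross_counts(n_lines):
    """Count crosses in the augmented round-robin versus a full diallel."""
    selfed = n_lines    # the inbred lines themselves
    within = n_lines    # round-robin: one cross per line acting as a male
    between = n_lines   # each line is the male in one interpopulation cross
    return {
        "augmented_round_robin": selfed + within + between,
        # full diallel counted here with reciprocals and inbred lines
        "full_diallel": n_lines * n_lines,
    }


crosses = within_population_round_robin(["A1", "A2", "A3", "A4"])
counts = cross_counts(60)
```

Under these assumptions each line appears in five crosses (one inbred, two within its population, two between populations), and the total is three crosses per line.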

To estimate environmental effects, we perform structured replications of our crosses (Figure 1B). Each cross is carried out on two batches of food (block level) and three samples are taken from each batch (replicate level). This allows us to capture two kinds of random environmental effects: variations in food quality from batch to batch and fluctuations within vials where we rear the flies. We then perform enzyme kinetic assays on each replicate, obtaining estimates of enzyme maximal rates with some error (Clark 1989; Clark and Keith 1989).

As is clear from Figure 1B, our data are structured hierarchically. Each level is completely nested in the level above, with the exception of the cross level that is not perfectly nested within lines (two different lines participate in within- and between-population crosses) or populations. To make use of this structure in the data, it is natural to employ hierarchical models (Gelman and Hill 2007) to estimate the parameters of interest. Furthermore, a Bayesian approach is attractive because it allows for models of arbitrary complexity and permits a full account of uncertainty in all parameters (Sorensen and Gianola 2002; Gelman et al. 2004; Gelman and Hill 2007). However, the limitation of this approach is that the posterior distributions of variables are not available in closed form, and thus one has to resort to numerical methods to estimate them (MCMC in this case). While in theory MCMC methods should accurately estimate posterior distributions given an infinite chain (Gilks et al. 1996), in practice one has to demonstrate that they do so after a reasonable number of iterations. Therefore, we generated simulated data and used them to test the performance of our models.

Figure 1. Data structure. (A) The diallel table. Solid and shaded squares mark the crosses that we performed. The table is subdivided by population. The order of the lines is from left to right for the female axis and from top to bottom for the male axis; i.e., the top left corner of the table represents the cross of the A1 male to the A1 female. Within each population, the F1 round-robin crosses between lines (off-diagonal) are depicted as solid squares, whereas the inbred lines themselves (diagonal) are shown as shaded squares. (B) Hierarchical levels of the data. For the ‘‘crosses’’ level, the solid arrow represents a ‘‘selfed’’ inbred line, dashed arrows represent within-population crosses, and dotted arrows show between-population crosses.

Analysis of simulated data: We constructed simulated data sets to answer three main questions. First, we wanted to know if our combination of crossing design and modeling approach can be used to estimate parameters of interest to us and to quantify the accuracy and precision of such estimates. Second, we wished to know whether taking full account of uncertainty in enzyme activity estimates was necessary or if we could save on computational time and use point estimates of slope values. Third, because high-throughput processing of enzymatic assays yields a small but significant number of outliers, we were interested in gauging their effect on inference and testing strategies to minimize the influence of outliers.

To answer these questions, we generated two sets of simulated crosses with environmental replication: one with low narrow-sense heritability ($h^2 = 0.07$) and one with high ($h^2 = 0.56$; see Methods for details). We also added outlier observations to each group of data sets. We chose values of fixed parameters (Table 1) in the simulations to fall in the range we typically see with real data. We generated 500 data sets with each type of simulation and analyzed them under three models. The first used point estimates of kinetic curve slopes and assumed normal distributions of replicates. The second also assumed normal distributions for replicates, but modeled the uncertainty in their estimation. For data sets with outliers, we also used a third model that assumed Student’s t distributions for replicates. Because t distributions have fatter tails than the normal distribution, we expect estimates of sample parameters from this model to be robust to outliers (Gelman et al. 2004). On the basis of a priori biological considerations, we modeled environmental variances ($\sigma^2_{rep}$ and $\sigma^2_{bl}$) separately for each type of cross, although they were the same in the simulations. The results we show are for the variances from the within-population crosses. The results for the variances from other cross types were indistinguishable.
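As a quick check on the two heritability settings, the sketch below computes narrow-sense heritability from the standard deviations listed in Table 1. It assumes that the last two standard deviations in that table are the dominance and additive terms, and that heritability is the additive variance divided by the sum of the additive, dominance, block, and replicate variances. Both are assumptions made here; they reproduce the stated values of 0.07 and 0.56.

```python
def narrow_sense_heritability(sd_a, sd_d, sd_bl, sd_rep):
    """Additive variance as a fraction of the summed variance components."""
    var_a = sd_a ** 2
    total = var_a + sd_d ** 2 + sd_bl ** 2 + sd_rep ** 2
    return var_a / total


# Standard deviations as listed in Table 1 (low and high heritability columns).
low = narrow_sense_heritability(sd_a=0.5, sd_d=0.4, sd_bl=1.0, sd_rep=1.5)
high = narrow_sense_heritability(sd_a=1.5, sd_d=1.0, sd_bl=0.7, sd_rep=0.5)
print(round(low, 2), round(high, 2))  # prints: 0.07 0.56
```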

We first turn to estimates of population parameters. For technical reasons, it is easier to estimate standard deviations rather than variances when programming in R, and this has an added advantage of bringing the values for these parameters to the same scale as sample variables. Population parameters are notoriously difficult to estimate accurately (Lynch and Walsh 1998; Sorensen and Gianola 2002; O’Hara et al. 2008). We quantified the quality of our estimates in several ways. To determine the accuracy of our models, we plotted the range of differences between the estimated medians of posterior distributions and the true values (Figure 2, top panels). Precision of the estimates is reflected in the extent of the spread of their posterior distributions, quantified by standard errors of the sampling Markov chains (Figure 2, bottom panels). We scaled these statistics by true values to enable comparisons among variables. Another measure of accuracy is the fraction of the time the true value falls within the posterior 95% credibility interval (95% C.I.) (Table 2). Formally, under the Bayesian paradigm this statistic does not carry the same straightforward interpretation as for frequentist confidence intervals (Box and Tiao 1973). However, from a practical standpoint this measure still reflects the accuracy of parameter estimation, as well as the degree to which the uncertainty of the estimate is captured by a model. Finally, we assessed convergence properties of the various models using the Gelman–Rubin diagnostic (Gelman and Rubin 1992). Values of this statistic close to 1.0 indicate that the particular set of chains has converged. We considered a particular run to have failed to converge if the value of the diagnostic was >1.5 and counted the number of such instances for every model–data set combination (listed as percentage of failure to converge in Table 2). We tried a range of cutoff values (not shown). Our conclusions are not sensitive to the particular value chosen.
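The convergence criterion can be illustrated with the basic form of the Gelman–Rubin statistic, which compares between-chain and within-chain variance. This is a simplified sketch in Python. The coda implementation used in the paper applies further corrections that are not reproduced here, so its values will differ somewhat. The toy chains are made up.

```python
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


def failed_to_converge(chains, cutoff=1.5):
    """Apply the cutoff used in the text (diagnostic greater than 1.5)."""
    return gelman_rubin(chains) > cutoff


mixed = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]           # chains overlap
stuck = [[0, 1, 0, 1], [10, 11, 10, 11], [20, 21, 20, 21]]   # chains far apart
```

Chains that explore the same region give a value near 1, whereas chains that sit in different regions give a large value and are flagged.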

Overall, our model and cross combination yields estimates of population parameters that are close to reality (Figures 2 and 3). When outliers are absent, full treatment of uncertainty in slope estimates does not make an appreciable difference, with the exception of among-replicate standard deviations (Figure 2). In the latter case, however, the model yields overestimates that are falsely precise. In contrast, in the presence of outliers the model with point estimates yields results that are both incorrect and imprecise (Figure 2 and left set of columns in Table 2). Standard deviations are generally overestimated (with the exception of $\sigma_{bl}$), leading to drastic underestimation of heritability when true $h^2$ is high (model 3 in Figure 3A). Furthermore, the model fails to converge more frequently than any other data set–model combination (Table 2).

TABLE 1

Parameters common to all simulations

| Parameter | Low $h^2$ | High $h^2$ |
|---|---|---|
| Population parameters | | |
| A | 8.0 | 8.0 |
| B | 9.0 | 9.0 |
| C | 8.5 | 8.5 |
| D | 6.0 | 6.0 |
| Cross type parameters | | |
| $\beta_{inbr}$ | −2.0 | −2.0 |
| $\beta_{outc}$ | 1.0 | 1.0 |
| Standard deviations | | |
| $\sigma_{rep}$ | 1.5 | 0.5 |
| $\sigma_{bl}$ | 1.0 | 0.7 |
| $\sigma_{d}$ | 0.4 | 1.0 |
| $\sigma_{a}$ | 0.5 | 1.5 |

The deficiencies of the model with point estimates can be corrected to a large degree by modeling the uncertainty of slope estimates (model 4 in Figure 2 and Table 2). Most of the estimates (with the exception of $\sigma_{bl}$) are further improved by using t models for replicates (models 5 and 6 in Figure 2). However, in every case our models overestimate the genetic standard deviations ($\sigma_{SCA}$ and $\sigma_{GCA}$) by approximately 20%. This does not result in misestimates of $h^2$ when true heritability is high (Figure 3A), although for data sets with low heritability, where these patterns are similar but exaggerated, we tend to overestimate $h^2$ by almost twofold (Figure 3B).

Modeling the slopes also improves the probability of convergence (right set of columns in Table 2), although problems still persist for estimates of $\sigma_{bl}$ when heritability is low. Convergence problems are essentially eliminated with Student’s t models (columns 5 and 6 in the right set of Table 2).

Despite problems with estimating population parameters, our models do well when evaluating sample variables (Figure 4 for high-heritability data sets; the results for low-heritability data sets and for population means in both groups of data sets are indistinguishable). For data sets with outliers, the model with point estimates still produces unbiased estimates, but they are significantly less precise. However, although the credibility intervals are wider (model 3 in Figure 4), they are only slightly more likely to include the true observation (Table 2). We can improve the precision of estimates by modeling the slopes. Even more precise estimates can be obtained with t models. As is the case with standard deviations, the model with point estimates often fails to converge when outliers are present. Modeling slopes alleviates the problem, and using t models results in further improvements (Table 2).

Figure 2. Accuracy and precision of standard deviation estimates. For each parameter, we show a pair of graphs. The top one plots the fractional difference between medians of estimated posterior distributions of a given parameter and true values (that is, the difference divided by the true value). The bottom plot shows coefficients of variation of the estimated posterior distributions (time-series SE divided by the true value; see Methods). The box plots represent data across 500 simulated data sets with high heritability. Each plot represents results under six scenarios: 1 and 2 are for simulations without outliers and 3–6 are for those with outliers (see Methods). 1 and 3 were analyzed using point estimates of slopes; 2 and 4 were analyzed modeling the uncertainty in slope values and assuming that replicates are normally distributed; and 5 and 6 were analyzed modeling the slopes, but with $t_3$ (5) and $t_6$ (6) distributions for replicates.

As we detail in the Methods section, posterior distributions of sample parameters at a given level in the hierarchy incorporate information from the higher levels through prior distributions. Thus, if our models perform correctly, extreme observations at a given level should be pulled toward the values for corresponding variables above them. For example, line effects of lines with true means below the corresponding population means should be overestimated. Conversely, line means for lines with values higher than their population values should be underestimated. Furthermore, the more extreme the true value is for a line, the more severe the discrepancy, with opposite signs for values that are low or high. To see if this indeed occurs, we plotted differences between estimated and true values of line effects against differences between true line effects and corresponding population means (bottom right panel of Figure 4; both differences scaled by the corresponding true population mean). As expected, we see an appreciable negative correlation between the two variables (Pearson’s $r = -0.18$).
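The pull toward the population mean described above can be reproduced in a toy setting. The sketch below uses a normal–normal model with known variances, where the posterior mean of a line is a precision-weighted average of its observed mean and the population mean. This is a deliberate simplification of the full hierarchical model; all numerical values are hypothetical, and the size of the resulting correlation is not comparable to the value reported in the text.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partially_pooled_mean(observed, pop_mean, se, sd_lines):
    """Precision-weighted average of the observed mean and the population mean."""
    w_obs = 1.0 / se ** 2
    w_pop = 1.0 / sd_lines ** 2
    return (w_obs * observed + w_pop * pop_mean) / (w_obs + w_pop)


rng = random.Random(7)
pop_mean, sd_lines, se = 8.0, 1.5, 1.0  # hypothetical values
true_means = [rng.gauss(pop_mean, sd_lines) for _ in range(2000)]
observed = [rng.gauss(t, se) for t in true_means]
estimates = [partially_pooled_mean(y, pop_mean, se, sd_lines) for y in observed]

# Same axes as the scatterplot described in the text, scaled by the population mean.
x = [(t - pop_mean) / pop_mean for t in true_means]
y = [(e - t) / pop_mean for e, t in zip(estimates, true_means)]
r = pearson(x, y)
```

Lines whose true means lie above the population mean tend to be underestimated and those below tend to be overestimated, so `r` comes out negative.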

It is not our main purpose to compare maximum-likelihood and Bayesian methods of inference. Nevertheless, we analyzed our simulated data sets using REML estimates of a traditional mixed-effects model (see File S3 and Methods for details). As expected, we find that for estimates of environmental standard deviations and fixed effects, this model behaves similarly to our normal model with point estimates of slopes. However, the genetic effect variances (specific and general combining ability) are frequently underestimated, especially when heritability is low (see the figures in File S3), leading to underestimates of heritability. This problem is particularly severe when heritability is low and outliers are present. Furthermore, when heritability is low, the estimates of $\sigma_{GCA}$, and particularly $\sigma_{SCA}$, become unstable, with deviations from true values highly dependent on the particular data set.

Analysis of real kinetic data: Simulated data sets are very useful when assessing model accuracy because the correct results are known a priori. However, simulations are necessarily idealized, and although we did our best to make them realistic (by, for example, including outliers), it is impossible to simulate every conceivable scenario. We therefore wanted to assess the performance of our model on a real data set. We analyzed data for two well-characterized enzymes in the pentose phosphate pathway, G6PD and 6PGD of D. melanogaster. Assays for these enzymes are well established (Clark and Keith 1989) and a number of studies have documented genetic variation in their activities (e.g., Bijlsma 1980; Wilton et al. 1982; Clark 1989).

TABLE 2

Credibility interval coverage and convergence rates

Columns CI-1 to CI-6 give the % in posterior 95% C.I. for models 1–6; columns CF-1 to CF-6 give the % convergence failure for models 1–6.

| Parameter | CI-1 | CI-2 | CI-3 | CI-4 | CI-5 | CI-6 | CF-1 | CF-2 | CF-3 | CF-4 | CF-5 | CF-6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $h^2 = 0.56$ | | | | | | | | | | | | |
| $\sigma_{rep}$ | 0.0 | 81.8 | 0.0 | 18.8 | 23.8 | 60.0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.8 | 0.8 |
| $\sigma_{bl}$ | 96.0 | 94.8 | 97.0 | 95.0 | 6.4 | 12.0 | 0.0 | 0.0 | 36.6 | 6.2 | 0.0 | 0.4 |
| $\sigma_{SCA}$ | 86.6 | 87.4 | 33.0 | 81.8 | 90.4 | 88.0 | 0.2 | 0.0 | 13.0 | 3.2 | 0.0 | 0.4 |
| $\sigma_{GCA}$ | 70.8 | 71.6 | 47.6 | 67.8 | 63.4 | 62.4 | 0.0 | 0.0 | 3.2 | 0.2 | 0.0 | 0.4 |
| $h^2$ | 96.2 | 85.6 | 11.2 | 78.6 | 93.8 | 89.8 | 0.0 | 0.0 | 2.8 | 0.2 | 0.0 | 0.2 |
| $\beta_{inbr}$ | 93.0 | 93.2 | 96.2 | 93.8 | 93.2 | 94.2 | 0.0 | 0.0 | 24.6 | 4.0 | 0.0 | 0.0 |
| $\beta_{outc}$ | 96.4 | 97.0 | 97.2 | 96.6 | 95.4 | 94.8 | 0.4 | 0.0 | 24.2 | 5.8 | 0.0 | 0.2 |
| $\mu_{line}$ | 88.6 | 88.7 | 89.2 | 88.5 | 89.6 | 89.1 | 0.0 | 0.0 | 14.2 | 2.0 | 0.0 | 0.2 |
| $h^2 = 0.07$ | | | | | | | | | | | | |
| $\sigma_{rep}$ | 72.0 | 95.6 | 2.0 | 28.6 | 0.2 | 28.6 | 0.4 | 0.4 | 1.0 | 0.0 | 0.0 | 0.8 |
| $\sigma_{bl}$ | 95.8 | 95.6 | 96.6 | 94.6 | 6.2 | 18.4 | 1.8 | 2.6 | 24.8 | 11.0 | 0.0 | 0.0 |
| $\sigma_{SCA}$ | 28.0 | 36.0 | 0.4 | 12.0 | 89.6 | 79.6 | 0.6 | 1.8 | 9.4 | 3.2 | 0.2 | 0.0 |
| $\sigma_{GCA}$ | 0.8 | 1.0 | 0.4 | 0.4 | 0.0 | 0.0 | 0.6 | 0.0 | 4.2 | 1.2 | 0.2 | 0.0 |
| $h^2$ | 5.8 | 2.2 | 65.2 | 22.2 | 1.4 | 2.4 | 0.6 | 0.2 | 3.6 | 0.6 | 0.2 | 0.0 |
| $\beta_{inbr}$ | 96.4 | 96.2 | 98.0 | 96.6 | 95.6 | 95.2 | 0.6 | 1.0 | 12.4 | 3.4 | 0.0 | 0.0 |
| $\beta_{outc}$ | 98.8 | 98.8 | 98.4 | 97.2 | 96.0 | 95.6 | 0.8 | 2.0 | 12.0 | 3.4 | 0.2 | 0.0 |
| $\mu_{line}$ | 91.5 | 91.3 | 92.9 | 92.5 | 92.2 | 91.8 | 0.5 | 0.6 | 6.7 | 1.2 | 0.2 | 0.0 |

Models 1–6 are as in Figure 2. The set of columns on the left gives fractions of all simulations where the true value of a given parameter lies in the posterior 95% credibility interval of its estimate. The right set of columns reflects how often the Gelman–Rubin diagnostic (see Methods) exceeds 1.5. Other threshold values lead to the same conclusions. Data for line means of all lines were pooled.

Figure 3. Accuracy and precision of narrow-sense heritability estimates. The plots are arranged and labeled as in Figure 2. Estimates are from high-heritability (A) and low-heritability (B) simulations.

We used a version of our Student’s t model with 3 d.f. (model 5 discussed above), but with correction for assay–plate effects (see Methods for details). The choice of model was dictated by the presence of outliers in data sets for both enzymes. Furthermore, the outliers inflated the estimates of $\sigma_{rep}$ to the extent that problems arose with evaluation of block standard deviations: estimates of $\sigma_{bl}$ came close to zero, and this created numerical errors and lack of mixing in the posterior distributions of other parameters.
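Why a Student’s t model with 3 d.f. tolerates outliers can be seen by comparing how strongly each likelihood penalizes an extreme replicate. The sketch below evaluates log densities directly. The location, the scale, and the two observation values are hypothetical, and the code is an illustration in Python rather than part of the authors' sampler.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def student_t_logpdf(x, mu, sigma, df):
    """Log density of a location-scale Student's t distribution."""
    z = (x - mu) / sigma
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(sigma))
    return log_norm - (df + 1) / 2 * math.log1p(z * z / df)


typical, outlier = 0.5, 6.0  # hypothetical replicate values, in SD units

# Drop in log density when moving from a typical value to the outlier.
normal_penalty = normal_logpdf(typical, 0, 1) - normal_logpdf(outlier, 0, 1)
t3_penalty = student_t_logpdf(typical, 0, 1, 3) - student_t_logpdf(outlier, 0, 1, 3)
```

The drop in log density is much smaller under the t distribution, so a single extreme replicate has less leverage on the estimated location and scale.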

We present estimates of selected parameters in Table 3. Because previous studies employed only inbred lines and different replication strategies to control for environmental effects, it is difficult to make direct comparisons between our results and preceding findings. However, we can say that the relative magnitudes of $\sigma_{rep}$ and $\sigma_{bl}$ we see are broadly similar to those reported by Clark (1989), although we find a larger genetic component of variation in G6PD and 6PGD than in that study. Furthermore, we find that the magnitudes of environmental standard deviations vary among cross types (Table 3).

One way to assess the quality of our line effect estimates is to calculate the genetic correlation between the activities of the two enzymes. A positive correlation has been repeatedly documented before (Bijlsma 1980; Wilton et al. 1982; Clark 1989). Since we were estimating the parameters for each enzyme simultaneously in our Gibbs sampler, we were able to sample from the distribution of the correlation between line effects. We indeed see the expected positive genetic correlation (Table 3). Strikingly, our estimate is almost identical to the one obtained by Clark (1989): his value is 0.31 (0.18, 0.43), whereas ours is 0.30 (0.17, 0.42).
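One way to obtain a posterior distribution for such a derived quantity is to compute it once per saved iteration and then summarize the resulting values. The sketch below does this for a correlation between two sets of line effects. The draws are synthetic stand-ins generated on the spot, not sampler output, and the function names, the 60-line dimension, and the seed are illustrative assumptions.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def posterior_correlation(draws_a, draws_b):
    """One correlation per iteration; each draw is a list of line effects."""
    return sorted(pearson(a, b) for a, b in zip(draws_a, draws_b))


def central_interval(sorted_values, prob=0.95):
    """Simple empirical central interval from sorted values."""
    n = len(sorted_values)
    lower = sorted_values[int((1 - prob) / 2 * (n - 1))]
    upper = sorted_values[int((1 + prob) / 2 * (n - 1))]
    return lower, upper


# Synthetic stand-ins for saved line-effect draws of two enzymes.
rng = random.Random(3)
n_iter, n_lines = 500, 60
shared = [rng.gauss(0, 1) for _ in range(n_lines)]
draws_a = [[s + rng.gauss(0, 0.5) for s in shared] for _ in range(n_iter)]
draws_b = [[0.5 * s + rng.gauss(0, 1.0) for s in shared] for _ in range(n_iter)]

correlations = posterior_correlation(draws_a, draws_b)
median_r = statistics.median(correlations)
lower_r, upper_r = central_interval(correlations)
```

The same pattern applies to any deterministic combination of parameters, such as heritability computed from the sampled standard deviations.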

DISCUSSION

We set out to develop a combination of a sparse partial diallel crossing scheme and modeling approach to estimate a number of parameters of interest in evolutionary quantitative genetics. For evolutionary applications, we are interested in learning about population-level processes, and therefore large samples of lines are required. The full diallel, which incorporates crosses among all lines, even when excluding reciprocal crosses and inbred lines, is not ideal for such applications because the number of crosses grows quickly with the number of lines assessed (Kempthorne and Curnow 1961; Lynch and Walsh 1998). We settled on a modified round-robin design that incorporates inbred lines and crosses within and between populations. Each line is selfed and also crossed four times. Thus, it is possible to partition genetic variation into specific and general combining abilities (Sprague and Tatum 1942; Lynch and Walsh 1998) and still have the number of crosses grow linearly with the number of lines.

Figure 4. Accuracy and precision of estimates of sample parameters. The box plots are arranged as in Figure 2. For line means ($\mu_{line}$), we pooled information for all 60 lines across 500 simulated data sets. The scatterplot shows the relationship of fractional differences between true line means ($\mu_{line}$) and the corresponding true population means ($\mu_{pop}$) on the x-axis and the fractional difference between the posterior estimates of line means ($\hat{\mu}_{line}$) and the corresponding true line means ($\mu_{line}$) on the y-axis.

While in principle this crossing design should be useful for our purposes, it is not straightforward to analyze using traditional methods. An important stumbling block is the inclusion of three different types of crosses, making it necessary to control for cross-type effects when estimating SCA. Other difficulties include measurement errors for phenotypes (in our case, enzyme activities), two levels of environmental replication, and slight imbalances in the crossing scheme caused by unequal sampling of lines from populations. Furthermore, we are interested in evaluating distributions of deterministic combinations of parameters (for example, narrow-sense heritability and genetic correlations between enzyme activities).

Bayesian approaches can fulfill these requirements (Blasco 2001; Sorensen and Gianola 2002) and have been extensively used in animal breeding and quantitative trait mapping (Gianola and Fernando 1986; Sorensen and Gianola 2002; Beaumont and Rannala 2004). Taking advantage of the structure of our data, we implemented a set of hierarchical Bayesian models (Gelman et al. 2004; Gelman and Hill 2007). We made extensive use of recently developed computational techniques, namely parameter expansion (Gelman et al. 2004, 2008) and hierarchical centering (Gelfand et al. 1995; Gilks and Roberts 1996), to speed and improve convergence. Our emphasis is on estimating parameters given a model, rather than significance testing. Although model comparison techniques (Raftery 1996) and methods for integrating over various models (Phillips and Smith 1996) are available, in our case the structure of the data and biological considerations drive the selection of the basic modeling framework. We consider three types of models that differ only in the statistical approach to dealing with replicates and choose among these approaches on the basis of their performance on simulated data.

We used four simulated data sets: two with low heritability and with and without outliers among slope values and two with high heritability (also with and without outliers). We then assessed precision and accuracy of estimates from our models, compared with true values. As expected (Gelman 2005), estimates of sample-level parameters (e.g., line effects and cross-type coefficients) are more accurate than evaluations of population-level variables (e.g., standard deviations and heritability). In particular, genetic standard deviations tend to be overestimated by our models, although narrow-sense heritability is still accurately estimated when it is high. This is consistent with an analysis of an approximate circulant cross of Scots pine data (Waldmann and Ericsson 2006). This data set had only one level of replication, one type of cross, and one population. Thus, overestimation of genetic effects is probably not due to the complexity of our experimental scheme. We find our combination of crossing design and model can effectively partition genetic components of variation and adequately estimate and control for cross-type effects. The latter is particularly important, because large inbreeding and outcrossing effects are present in the real data sets we seek to model (Table 3).

Overall, we find that when no outliers are present, taking point estimates of slopes instead of modeling them does not appreciably affect inference. However, in the presence of outliers, modeling the slopes is essential.

TABLE 3

Analysis of G6PD and 6PGD kinetic data

| Parameter (a) | G6PD estimate | G6PD 95% C.I. | 6PGD estimate | 6PGD 95% C.I. |
|---|---|---|---|---|
| $\sigma^{1}_{rep}$ | 0.69 | (0.62, 0.75) | 0.24 | (0.21, 0.28) |
| $\sigma^{2}_{rep}$ | 0.77 | (0.70, 0.85) | 0.19 | (0.17, 0.21) |
| $\sigma^{3}_{rep}$ | 0.28 | (0.25, 0.32) | 0.16 | (0.14, 0.18) |
| $\sigma^{1}_{bl}$ | 0.61 | (0.52, 0.73) | 0.29 | (0.22, 0.59) |
| $\sigma^{2}_{bl}$ | 0.68 | (0.55, 0.83) | 0.14 | (0.10, 0.17) |
| $\sigma^{3}_{bl}$ | 0.56 | (0.49, 0.64) | 0.18 | (0.15, 0.21) |
| $\sigma_{SCA}$ | 0.24 | (0.12, 0.33) | 0.11 | (0.08, 0.15) |
| $\sigma_{GCA}$ | 0.33 | (0.28, 0.41) | 0.12 | (0.10, 0.15) |
| $h^2$ | 0.20 | (0.15, 0.27) | 0.17 | (0.09, 0.24) |
| $\beta_{inbr}$ | 0.26 | (0.09, 0.44) | −0.17 | (−0.24, −0.09) |
| $\beta_{outc}$ | −0.42 | (−0.58, −0.26) | −0.55 | (−0.62, −0.47) |
| $r_{G6PD;6PGD}$ (b) | 0.30 | (0.17, 0.42) | | |

(a) Environmental standard deviation indexes are as follows: 1, within-population crosses; 2, inbred lines; 3, between-population crosses. Narrow-sense heritability is calculated with $\sigma^{1}_{rep}$ and $\sigma^{1}_{bl}$.

(b) Genetic correlation (between-line effects); a single value for the two enzymes.


Further improvements in sample parameter estimates can be achieved when we use Student’s t models for replicates. While analyzing empirical data, we further found that t models are often necessary to prevent Markov chains from getting stuck on estimates of $\sigma_{bl}$ close to zero, a behavior that leads to numerical problems (values too small for computers to handle) and failure of proper mixing (Markov chains not moving adequately through the values that belong to the posterior distribution of a given parameter).

As a comparison, we used REML to estimate the parameters of a mixed-effects model to analyze our simulated data sets. We found that in general, this approach gives results that are similar to our normal model with point estimates of slopes. However, in contrast to our Bayesian estimates, the genetic variances are underestimated by the mixed-effects model. This is particularly pronounced when outliers are present and the heritability is low. In this case, estimates of $\sigma_{GCA}$ and particularly $\sigma_{SCA}$ become fairly unstable, with the deviations from true values highly dependent on the data set (see File S3). This problem does not appear to arise in our Bayesian estimates.

An important feature of the Bayesian approach is the emphasis on parameter estimation over hypothesis testing. In the traditional non-Bayesian framework, one would, for example, test the significance of the additive genetic effect. If the effect is significant, the line means are estimated separately, and if it is not, line effects are pooled to estimate, say, population parameters. In the hierarchical Bayesian framework, we use the relative information from the population and line levels to partially pool line effects, with the degree of pooling determined by the data (Gelman and Hill 2007). We illustrate this behavior in the bottom right plot of Figure 4, where we see that estimates of means for lines with extreme values are shrunk to their respective population means.

Another example of this approach is our separate treatment of environmental variances for each cross type. Thus, instead of testing for the significance of the gene-by-environment interaction and seeking a biological interpretation if it is significant, we are able to ask a biologically driven question while building our model. We find differences among some of the cross types (Table 3). However, even when differences are not ‘‘significant’’ (for example, posterior intervals for $\sigma_{rep}$ of 6PGD in within- and between-population crosses overlap substantially), we do not eliminate the extra parameters from the model. While this treatment does not capture all the possible kinds of gene-by-environment interactions, it provides an example of a biologically driven attempt to dissect them.

Because it is easy to construct posterior distributions of any deterministic combination of parameters, such tests can be driven by biological interest rather than statistical convenience. We provide examples of estimates of narrow-sense heritability and genetic correlations between activities of two enzymes: G6PD and 6PGD. We show that our approach recovers the well-known positive correlation between these two enzymes (Bijlsma 1980; Wilton et al. 1982; Clark 1989) that is likely due to the toxicity of an intermediate compound (Hughes and Lucchesi 1977, 1978). Because the previous studies used only inbred lines to calculate this correlation, concerns have been raised (Zera and Harshman 2001) that this type of observation is an artifact of making recessive deleterious alleles homozygous through inbreeding. Since we estimate our correlation values from general combining abilities on the basis of heterozygous as well as inbred lines, we can exclude this caveat.

The combination of crossing scheme and modeling approach we propose provides a powerful framework for estimating parameters important in evolutionary quantitative genetics. Although our primary motivation was the assessment of natural variation in metabolic attributes including enzyme activities, it can be readily extended to any situation when phenotypes are measured with error and are laborious or expensive to assess.

We thank B. Logsdon and two anonymous reviewers for helpful comments on the manuscript and A. Coventry for help with computational methods. This research was funded by a National Institutes of Health grant to A.G.C. and L.H.

LITERATURE CITED

Ayroles, J. F., M. A. Carbone, E. A. Stone, K. W. Jordan, R. F. Lyman et al., 2009a Systems genetics of complex traits in Drosophila melanogaster. Nat. Genet. 41: 299–307.

Ayroles, J. F., K. A. Hughes, K. C. Rowe, M. M. Reedy, S. L. Rodriguez-Zas et al., 2009b A genomewide assessment of inbreeding depression: gene number, function, and mode of action. Conserv. Biol. 23: 920–930.

Bates, D., and M. Maechler, 2009 lme4: Linear Mixed-Effects Models Using S4 Classes.

Beaumont, M. A., and B. Rannala, 2004 The Bayesian revolution in genetics. Nat. Rev. Genet. 5: 251–261.

Begun, D. J., and C. F. Aquadro, 1993 African and North American populations of Drosophila melanogaster are very different at the DNA level. Nature 365: 548–550.

Begun, D. J., and C. F. Aquadro, 1995 Molecular variation at the vermilion locus in geographically diverse populations of Drosophila melanogaster and D. simulans. Genetics 140: 1019–1032.

Bijlsma, R., 1980 Polymorphism at the G6pd and 6Pgd loci in Drosophila melanogaster. IV. Genetic factors modifying enzyme activity. Biochem. Genet. 18: 699–715.

Blasco, A., 2001 The Bayesian controversy in animal breeding. J. Anim. Sci. 79: 2023–2046.

Bochdanovits, Z., and G. Jong, 2003 Temperature dependent larval resource allocation shaping adult body size in Drosophila melanogaster. J. Evol. Biol. 16: 1159–1167.

Box, G. E. P., and G. C. Tiao, 1973 Bayesian Inference in Statistical Analysis. Wiley Classics, New York.

Charlesworth, B., and D. Charlesworth, 1999 The genetic basis of inbreeding depression. Genet. Res. 74: 329–340.

Clark, A. G., 1989 Causes and consequences of variation in energy storage in Drosophila melanogaster. Genetics 123: 131–144.

Clark, A. G., and L. E. Keith, 1989 Rapid enzyme kinetic assays of individual Drosophila and comparisons of field-caught D. melanogaster and D. simulans. Biochem. Genet. 27: 263–277.


Gelfand, A. E., S. K. Sahu and B. P. Carlin, 1995 Efficient parametrisations for normal linear mixed models. Biometrika 82: 479–488.

Gelman, A., 2005 Analysis of variance: why it is more important than ever. Ann. Stat. 33: 1–31.

Gelman, A., 2006 Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1: 514–534.

Gelman, A., and J. Hill, 2007 Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge, UK.

Gelman, A., and D. B. Rubin, 1992 Inference from iterative simulation using multiple sequences. Stat. Sci. 7: 457–472.

Gelman, A., J. B. Carlin, H. S. Stern and D. B. Rubin, 2004 Bayesian Data Analysis, Ed. 2. CRC Press, London.

Gelman, A., D. A. van Dyk, Z. Huang and W. J. Boscardin, 2008 Using redundant parameterizations to fit hierarchical models. J. Comput. Graph. Stat. 17: 95–122.

Gianola, D., and R. L. Fernando, 1986 Bayesian methods in animal breeding theory. J. Anim. Sci. 63: 217–244.

Gilks, W. R., and G. O. Roberts, 1996 Strategies for improving MCMC, pp. 89–114 in Markov Chain Monte Carlo in Practice, edited by W. R. Gilks, S. Richardson and D. J. Spiegelhalter. Chapman & Hall, London.

Gilks, W. R., S. Richardson and D. J. Spiegelhalter, 1996 Introducing Markov chain Monte Carlo, pp. 1–20 in Markov Chain Monte Carlo in Practice, edited by W. R. Gilks, S. Richardson and D. J. Spiegelhalter. Chapman & Hall, London.

Hughes, M. B., and J. C. Lucchesi, 1977 Genetic rescue of a lethal "null" activity allele of 6-phosphogluconate dehydrogenase in Drosophila melanogaster. Science 196: 1114–1115.

Hughes, M. B., and J. C. Lucchesi, 1978 Dietary rescue of a lethal "null" activity allele of 6-phosphogluconate dehydrogenase in Drosophila melanogaster. Biochem. Genet. 16: 469–475.

Kempthorne, O., and R. N. Curnow, 1961 The partial diallel cross. Biometrics 17: 229–250.

Kristensen, T. N., P. Sørensen, K. S. Pedersen, M. Kruhøffer and V. Loeschcke, 2006 Inbreeding by environmental interactions affect gene expression in Drosophila melanogaster. Genetics 173: 1329–1336.

Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.

O'Hara, R. B., J. M. Cano, O. Ovaskainen, C. Teplitsky and J. S. Alho, 2008 Bayesian approaches in evolutionary quantitative genetics. J. Evol. Biol. 21: 949–957.

Phillips, D. B., and A. F. M. Smith, 1996 Bayesian model comparison via jump diffusions, pp. 215–240 in Markov Chain Monte Carlo in Practice, edited by W. R. Gilks, S. Richardson and D. J. Spiegelhalter. Chapman & Hall, London/New York.

Plummer, M., N. Best, K. Cowles and K. Vines, 2009 coda: Output Analysis and Diagnostics for MCMC.

R Development Core Team, 2008 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.

Raftery, A. E., 1996 Hypothesis testing and model selection, pp. 163–188 in Markov Chain Monte Carlo in Practice, edited by W. R. Gilks, S. Richardson and D. J. Spiegelhalter. Chapman & Hall, London/New York.

Sorensen, D., and D. Gianola, 2002 Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. Springer, New York.

Sprague, G. F., and L. A. Tatum, 1942 General vs. specific combining ability in single crosses of corn. J. Am. Soc. Agron. 34: 923–932.

Thompson, R., S. Brotherstone and I. M. White, 2005 Estimation of quantitative genetic parameters. Philos. Trans. R. Soc. B 360: 1469–1477.

Waldmann, P., and T. Ericsson, 2006 Comparison of REML and Gibbs sampling estimates of multi-trait genetic parameters in Scots pine. Theor. Appl. Genet. 112: 1441–1451.

Walsh, B., 2001 Quantitative genetics in the age of genomics. Theor. Popul. Biol. 59: 175–184.

Wayne, M. L., Y.-J. Pan, S. V. Nuzhdin and L. M. McIntyre, 2004 Additivity and trans-acting effects on gene expression in male Drosophila simulans. Genetics 168: 1413–1420.

Wilton, A. N., C. C. Laurie-Ahlberg, T. H. Emigh and J. W. Curtsinger, 1982 Naturally occurring enzyme activity variation in Drosophila melanogaster. II. Relationships among enzymes. Genetics 102: 207–221.

Zera, A. J., and L. G. Harshman, 2001 The physiology of life history trade-offs in animals. Annu. Rev. Ecol. Syst. 32: 95–126.

Communicating editor: K. W. Broman


Supporting Information http://www.genetics.org/cgi/content/full/genetics.110.115055/DC1

A Hierarchical Bayesian Model for a Novel Sparse Partial Diallel Crossing Design

Anthony J. Greenberg, Sean R. Hackett, Lawrence G. Harshman and Andrew G. Clark

Copyright © 2010 by the Genetics Society of America DOI: 10.1534/genetics.110.115055


FILE S1 Generating Simulated Data

File S1 is available for download as a compressed file (.zip) at http://www.genetics.org/cgi/content/full/genetics.110.115055/DC1. This file includes:

simulation.Rnw simulation.pdf


FILE S2 Student-t Models for the Analysis of Simulated Data

File S2 is available for download as a compressed file (.zip) at http://www.genetics.org/cgi/content/full/genetics.110.115055/DC1. This file includes:

sim_analysis.Rnw sim_analysis.pdf models.pdf


REML Analysis of Simulated Data

A. J. Greenberg, S. R. Hackett, L. G. Harshman, and A. G. Clark

As a comparison with our Bayesian approach, we ran a simple traditional mixed-effects model. We define the model as follows:

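$$V_{\max(plkji)} = \mu^{\text{♂pop}}_{p} + \mu^{\text{♀pop}}_{p} + \beta^{\mathrm{inbr}} x^{\mathrm{inbr}}_{k} + \beta^{\mathrm{outc}} x^{\mathrm{outc}}_{k} + b^{\text{♂ln}}_{pl} + b^{\text{♀ln}}_{pl} + b^{\mathrm{sca}}_{plk} + b^{\mathrm{bl}}_{plkj} + \varepsilon_{plkji},$$

where the random effects are

$$\begin{aligned}
\varepsilon_{plkji} &\sim N(0, \sigma^2_{\mathrm{rep}}) \\
b^{\mathrm{bl}}_{plkj} &\sim N(0, \sigma^2_{\mathrm{bl}}) \\
b^{\mathrm{sca}}_{plk} &\sim N(0, \sigma^2_{\mathrm{sca}}) \\
b^{\text{♂ln}}_{pl} &\sim N(0, \sigma^2_{\text{♂ln}}) \\
b^{\text{♀ln}}_{pl} &\sim N(0, \sigma^2_{\text{♀ln}})
\end{aligned}$$

$$\sigma^2_{a} \approx 2\sigma^2_{\mathrm{gca}} = \sigma^2_{\text{♀ln}} + \sigma^2_{\text{♂ln}}$$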

The effects of the male parent's ($\mu^{\text{♂pop}}_{p}$) and female parent's ($\mu^{\text{♀pop}}_{p}$) population are treated as fixed, as well as the effects of inbreeding ($\beta^{\mathrm{inbr}}$) and outcrossing ($\beta^{\mathrm{outc}}$). The notation follows that for the rest of the paper. The effects of the male and female parent are assumed independent. Therefore, twice the general combining ability variance is the sum of variances among male and female parental lines ($\sigma^2_{\text{♂ln}}$ and $\sigma^2_{\text{♀ln}}$), and is equal to additive genetic variance ($\sigma^2_{a}$) if epistasis is ignored. Note also that the integers p and l index population and line pairs, respectively.
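The step from line variances to additive genetic variance can be written out as a short worked relation. This is a sketch under assumptions that go beyond the text above: parental lines are taken to be fully inbred (inbreeding coefficient F = 1), epistasis is ignored, and the variances among male and female parental lines are taken to be equal to a common general combining ability variance. The symbol F and the half-sib covariance are standard quantitative-genetic notation rather than part of the model definition.

```latex
\begin{align}
\sigma^2_{\mathrm{gca}} &= \mathrm{Cov}(\text{half sibs})
   = \frac{1+F}{4}\,\sigma^2_{a}
   = \frac{1}{2}\,\sigma^2_{a} \qquad (F = 1), \\
\sigma^2_{a} &\approx 2\,\sigma^2_{\mathrm{gca}}
   = \sigma^2_{\text{♀ln}} + \sigma^2_{\text{♂ln}}
   \qquad \text{when } \sigma^2_{\text{♀ln}} = \sigma^2_{\text{♂ln}} = \sigma^2_{\mathrm{gca}}.
\end{align}
```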

To estimate the parameters of the model, we used the lmer() function from the lme4 package in R (see main text for references). We load a simulated data set (generated with code from Supplement 1) and run the function, redirecting the summary to a variable:

    library(lme4)          # provides lmer()
    source("sim_data.R")   # loads the simulated data set enz.reg.mle
    # "- 1" removes the intercept; terms in parentheses are random effects
    res <- summary(
        lmer(v.max.est ~ m.pop + f.pop + inbr + outc +
                 (1|m.ln) + (1|f.ln) + (1|ln.pair) + (1|ln.pair.blck) - 1,
             data = enz.reg.mle)
    )

FILE S3 REML Analysis of Simulated Data


Subtracting 1 in the right-hand side of the formula definition results in a model with no intercept, allowing separate estimates of each $\mu^{\text{♂pop}}_{p}$. The res variable is an S4 object. Printing it displays all the necessary information. Estimated random effect standard deviations and variances are in the res@REmat slot, while the fixed effects are in the res@coefs slot and can be accessed for further processing as needed.
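The slot names above reflect the lme4 version cited in the main text. In more recent lme4 releases the same quantities are usually obtained through accessor functions such as VarCorr() and fixef(), so readers rerunning the analysis may find the following sketch useful. It is an illustration under stated assumptions, not the analysis code: the toy crossing design, the simulated response, and the intercept-only fixed part are all hypothetical, and only the grouping-factor names are borrowed from the formula above.

```r
library(lme4)

set.seed(1)
# Hypothetical toy design: 12 lines, each used as a male parent in crosses
# with the next two lines, measured in 2 blocks with 3 replicates each
n.ln <- 12
crosses <- data.frame(m.ln = rep(1:n.ln, each = 2))
crosses$f.ln <- ((crosses$m.ln + c(0, 1)) %% n.ln) + 1
# No shared columns, so merge() returns every cross x block x replicate row
dat <- merge(crosses, expand.grid(blck = 1:2, rep = 1:3))
dat$ln.pair <- interaction(dat$m.ln, dat$f.ln, drop = TRUE)
dat$ln.pair.blck <- interaction(dat$ln.pair, dat$blck, drop = TRUE)

# Simulate the response from independent normal random effects
b.m   <- rnorm(n.ln, sd = 1.0)
b.f   <- rnorm(n.ln, sd = 1.0)
b.sca <- rnorm(nlevels(dat$ln.pair), sd = 0.5)
b.bl  <- rnorm(nlevels(dat$ln.pair.blck), sd = 0.5)
dat$v.max.est <- 10 + b.m[dat$m.ln] + b.f[dat$f.ln] +
  b.sca[as.integer(dat$ln.pair)] + b.bl[as.integer(dat$ln.pair.blck)] +
  rnorm(nrow(dat), sd = 1.0)
dat$m.ln <- factor(dat$m.ln)
dat$f.ln <- factor(dat$f.ln)

fit <- lmer(v.max.est ~ 1 + (1 | m.ln) + (1 | f.ln) + (1 | ln.pair) +
              (1 | ln.pair.blck), data = dat)

vc <- as.data.frame(VarCorr(fit))  # one row per variance component
fe <- fixef(fit)                   # named vector of fixed-effect estimates
fe.table <- coef(summary(fit))     # estimates with standard errors

# Additive variance estimate: sum of male- and female-line variances
var.a <- sum(vc$vcov[vc$grp %in% c("m.ln", "f.ln")])
print(vc[, c("grp", "vcov", "sdcor")])
```

With a design this small, lmer() may report a singular fit for some components. That is a message rather than an error, and the accessors work in the same way.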

We ran this model on the same simulated data sets we used for the Bayesian analysis we describe in the main text. We plot the results in Figures S1–S3. See the main text for discussion.


[Figure S1: plot not recoverable from the extracted text. Vertical axes are labeled "frac deviation"; horizontal axes show analyses 1 and 2 within the high- and low-heritability data sets.]

FIGURE S1. Accuracy of standard deviation estimates. The layout is similar to that of Figure 2 of the main text. For each parameter, we present analyses of the data with (1) and without (2) outliers, for high- and low-heritability data sets.


[Figure S2: plot not recoverable from the extracted text. The vertical axis is labeled "frac deviation"; the horizontal axis shows analyses 1 and 2 within the high- and low-heritability data sets.]

FIGURE S2. Accuracy of heritability estimates. The layout is the same as in Figure S1.


[Figure S3: plot not recoverable from the extracted text. Vertical axes are labeled "frac deviation" and "cvr x 10"; horizontal axes show analyses 1 and 2 within the high- and low-heritability data sets.]

FIGURE S3. Accuracy and precision of fixed effect estimates. The layout is the same as in Figure S1, except we also include the plots of coefficients of variation (cvr x 10) as measures of precision of estimates.


FILE S4 Student-t models for the Analysis of Real Data

File S4 is available for download as a compressed file (.zip) at http://www.genetics.org/cgi/content/full/genetics.110.115055/DC1. This file includes:

enz_analysis.Rnw enz_analysis.pdf enz_data.tsv
