Dedicated to my family in Shiyan, Hubei, China, especially to my


Dedicated to my family in Shiyan, Hubei, China,

especially to my grandparents for their indispensable guidance all these years.


List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Rönnegård, L., Shen, X. and Alam, M. (2010). hglm: a package for fitting hierarchical generalized linear models. The R Journal. 2(2):20-28.

II Shen, X., Rönnegård, L. and Carlborg, Ö. (2011). How to deal with genotype uncertainty in variance component quantitative trait loci analyses. Genetics Research, Cambridge. 93(5):333-342.

III Shen, X., Rönnegård, L. and Carlborg, Ö. (2011). Hierarchical likelihood opens a new way of estimating genetic values using genome-wide dense marker maps. BMC Proceedings. 5(Suppl 3):S14.

IV Nelson, R., Shen, X. and Carlborg, Ö. (2011). qtl.outbred: interfacing outbred line cross data with the R/qtl mapping software. BMC Research Notes. 4:154.

V Shen, X., Pettersson, M., Rönnegård, L. and Carlborg, Ö. (2012). Inheritance beyond plain heritability: variance controlling genes in Arabidopsis thaliana. Submitted.

VI Shen, X., Alam, M., Fikse, F. and Rönnegård, L. (2012). Fast generalized ridge regression for models including heteroscedastic effects in quantitative genetics. Manuscript.

Reprints were made with permission from the publishers.


Contents

Part I: Introduction

1 Background & Aims

2 Discovering Genetic Architecture
    2.1 Single-predictor analysis
        2.1.1 QTL analysis
        2.1.2 Genome-wide association study
    2.2 Multiple-predictor analysis
        2.2.1 Polygenic effects estimation
        2.2.2 Interaction effects and variance heterogeneity

3 Predictive Modeling

Part II: Summary of Papers

4 Implementing Hierarchical Generalized Linear Models (Paper I)

5 Quantitative Trait Loci Interval Mapping
    5.1 Variance component QTL model (Paper II)
    5.2 QTL regression model (Paper IV)

6 Fitting The Entire Genome
    6.1 Double HGLM (Paper III)
    6.2 Heteroscedastic effects model (Paper VI)

7 Beyond Plain Heritability: Variance-Controlling Genes (Paper V)

Part III: Discussion & Conclusion

8 Discussion
    8.1 Heritability: How much can we explain?
    8.2 New data types: How to integrate information?
    8.3 Future development

9 Conclusion

References


Nomenclature

AMD            age-related macular degeneration
ANOVA          analysis of variance
BLUP           best linear unbiased predictor
BMI            body mass index
cM             centi-Morgan
CRP            C-reactive protein
DHGLM          double HGLM
DNA            deoxyribonucleic acid
EBV            estimated breeding value
EM             expectation-maximization
FPR            false positive rate
GBLUP          genomic BLUP
GC             genomic control
GEBV           genomic EBV
GLM            generalized linear model
GLMM           generalized linear mixed model
GM             genetically modified
GO             gene ontology
GRAMMAR        genome-wide rapid association using mixed model and regression
GWAS           genome-wide association study
h-likelihood   hierarchical likelihood
HDL            high-density lipoprotein
HEM            heteroscedastic effects model
HGLM           hierarchical generalized linear model
HMM            hidden Markov model
IBD            identity-by-descent
IBS            identity-by-state
IWLS           iterative weighted least squares
LAF            low-variance allele frequency
LASSO          least absolute shrinkage and selection operator
LD             linkage disequilibrium
LM             (normal) linear model (regression)
LMM            (normal) linear mixed model
LOD            logarithm of odds
LRT            likelihood ratio test
MCMC           Markov chain Monte Carlo
ML             maximum likelihood
MME            mixed model equations
PCA            principal component analysis
PQL            penalized quasi-likelihood
QTL            quantitative trait locus/loci
REML           restricted maximum likelihood
RR             ridge regression
SNP            single nucleotide polymorphism
SSR            simple sequence repeat
TBV            true breeding value
VC             variance component
vGWAS          variance-heterogeneity GWAS
vQTL           variance-controlling QTL


PART I: INTRODUCTION


1. Background & Aims

“We have now entered a new era of large-scale genetics unthinkable even a few years ago.”

—— Peter Donnelly

It was by investigating genetics that modern statistics was founded. GALTON (1886) regressed children’s heights on those of their mid-parents, which opened the gate to a science that filters out knowledge from chaos, i.e. statistics. FISHER (1918) studied correlation in Mendelian inheritance, which brought analysis of variance (ANOVA) into the field of probability theory and statistics. Genetics and statistics seem destined to meet each other because of the different kinds of uncertainty in heredity that we still do not understand.

Another classic example of genetics driving the development of statistics is the mixed model, a.k.a. random effects model or variance component model, which is the basis of most studies described in this thesis. Millions of poultry and livestock are evaluated every year via what are called Henderson’s mixed model equations (HENDERSON 1953, 1984; VAN VLECK 1998). Although some statisticians initially objected that one should not calculate point estimates for random effects, the mixed model equations have been proven to give the so-called best linear unbiased prediction, or BLUP (ROBINSON 1991). It is in fact a general method for estimating random effects and dealing with correlated observations, and it has great predictive power. Nowadays, BLUP is widely applied not only in animal breeding and genetics but also in technology and the social sciences.

The goal of statistical genetics is to analyze genetic data and explain how the variation in observed traits is affected by genetic variation. Since one of the essential attributes of statistical analysis in genetics is the relatedness among individuals, the key to achieving this goal is to trace inheritance among individuals. Before molecular markers were applied, inheritance could be traced only through kinship constructed from a pedigree structure. Such a kinship can be used to model inheritance in e.g. farm animals and human families. After molecular markers became available, statistical tools for genetic analyses became more diverse. Because the inheritance of a specific segment of DNA can be traced, statistical tests can help identify genes or quantitative trait loci (QTL) that regulate quantitative traits. Analyzing most traits is difficult because of the complexity of the genetic architecture underlying them. Due to the fast development of genotyping and sequencing techniques, statistical analyses for different genetics problems now start using the same kind of high-dimensional genomic data. Both QTL interval mapping and genome-wide association studies (GWAS) utilize the genetic information carried by high-density SNPs to identify major genetic loci that affect complex traits. Using the same high-density SNPs, breeders also reconstruct the kinship of the studied individuals directly by comparing their genomes and give better evaluations of the animals, i.e. GBLUP. It is now an exciting moment in quantitative genetics: a consistent type of high-dimensional data can be used to address so many of the problems regarding inheritance that we are interested in.

The aim of this thesis is to develop new statistical tools in both QTL analysis and genomic evaluation. Most of the proposed statistical methods focus on modeling genetic variance, especially using random effects models. Properly modeling genetic variance could help us better understand the genetic contribution to various types of complex traits.

The thesis contains three parts: the introduction, the summary of papers, and the discussion. Part I briefly introduces the background knowledge related to my research. Chapter 2 introduces the basics of several statistical tools for gene mapping, including QTL analysis, GWAS, whole-genome shrinkage estimation methods, and interactions with the connection to the idea of variance-controlling genes. Chapter 3 introduces the use of genomic dense markers in prediction and discusses model selection. Part II summarizes the papers that this thesis is based on. Chapter 4 describes the implementation of the hglm package in R (Paper I), which is a fundamental tool for fitting random effects models in most of my other papers. Chapter 5 covers both a theoretical investigation (Paper II) and an application tool (Paper IV) in QTL interval mapping, where the former is about variance component models and the latter about linear regression. Chapter 6 extends the analysis strategy from repeated single marker testing to fitting multiple markers simultaneously. A double-layer random effects model allowing heteroscedastic marker effects was found to be powerful in both QTL identification and genomic evaluation (Paper III). By simplifying such a double-layer model, the classic Henderson’s mixed model equations were successfully utilized to re-weight markers according to their uneven contribution to the variation of the studied trait (Paper VI). Chapter 7, at the end, brings up a different way of looking at genetic variance through variance heterogeneity (Paper V), which recently has become a compelling topic in quantitative genetics research. Part III discusses some topics related to this thesis. Chapter 8 discusses two major challenges in current statistical genetics research, missing heritability and new data types, and foresees what should be done in the near future. Chapter 9 concludes the thesis by summarizing the contributions of the work.


2. Discovering Genetic Architecture

“Simple vs. Complex traits: the real definition - Simple: things we have deluded ourselves into thinking we understand. Complex: things we’re pretty sure we don’t understand.”

—— Eleanor Feingold

From Mendelian factors to appearance, from genotypes to phenotypes, from what is concealed to what is revealed, understanding genetic architectures underlying our appearance, diseases, fitness, and so on, has always been an ultimate goal of genetics research. Once a complete genetic function or pathway is understood, one can tell the story about a phenotypic phenomenon, so that a valid prediction or even genetic modification can be made.

Some “simple” traits, for instance some traits of appearance, can be explained quite well through the variation of a single gene. One of our most essential traits, sex, is determined by the gene SRY (sex-determining region Y) on the Y chromosome (WALLIS et al. 2008). This gene, existing in all placental mammals and marsupials, encodes a transcription factor (therian testis determining factor, TDF) that initiates male sex determination. In other words, this single gene SRY genetically modifies a female into a male. Besides sex, for example, tongue rolling is a dominant trait determined by a single gene, and woolly hair is a recessive trait also under the regulation of a single gene. These traits, a.k.a. Mendelian traits, can be explained simply by a gene that follows Mendel’s law of segregation.

However, most of the traits we see are rather complicated. They are affected, usually partially, by several genes. The genes influence but rarely determine the observed traits; therefore, complex traits are not 100% hereditary. For instance, human height has recently been shown to be rather polygenic, affected by plenty of genes along the entire genome that have very small effects (YANG et al. 2011a). Furthermore, there is no doubt that a trait like human height is modified by nutrition and other environmental factors. Hence, most of the time, when a complex trait is studied, one can expect to detect several genes or loci in the genome that have relatively strong effects, while many other genes with small effects remain undetectable because of the polygenic nature of the studied trait and the limited detection power of the statistical tests.

Genes have very different magnitudes of effects, and the underlying genetic pathways and networks are fairly complex. Unfortunately, the sample sizes we can obtain are too limited to model or reveal the complete pathways underlying a particular complex trait. What we often do is focus on the additive marginal effect of each locus, trying to extract as much genetic variance as possible from the genome.


There are two main drawbacks in the current routine that are difficult to overcome. First, only the loci that show additive effects can be discovered. Additive effects contribute the most to phenotype prediction; however, non-additive effects such as effects from gene-gene interactions, a.k.a. epistasis, are missed. The second drawback could be more worrying, i.e. only the loci with sufficiently big effects can be detected. As mentioned above, there are complex traits that are polygenic, which indicates that many loci with small effects contribute to the trait. Even when the sample size is large, statistical power only allows mapping loci with strong effects. The detection power restricts us to a certain level of effect size.

In this chapter, I classify the popular statistical genetics methods for gene mapping into two categories: single- and multiple-predictor analyses. The single-predictor analysis covers QTL interval mapping in experimental designs and genome-wide association studies. The section on multiple-predictor analysis starts with multiple additive effects modeling, which fits the entire genome simultaneously. Thereafter, gene-gene and gene-environment interactions are briefly introduced, together with their connection to the idea of detecting genes that show variance heterogeneity.

2.1 Single-predictor analysis

This section briefly covers two types of single-predictor analysis strategies: QTL analysis and genome-wide association study (GWAS). I call them “single-predictor” because the predictor is either a genetic marker (GWAS) or a single position between flanking markers (QTL analysis). This type of method focuses on one position on the genome at a time, fits a parametric model, and performs a hypothesis test on the genetic parameter. Both QTL analysis and GWAS require a genome-wide screening for significant loci. Single-predictor analysis is the simplest and most direct way to detect potential causal genes for quantitative traits.

2.1.1 QTL analysis

A QTL (quantitative trait locus) is a chromosomal segment containing one or more genes that contribute to the variation observed for a quantitative trait. The genetic effect of a QTL is the combined effect of the genes located in the segment. GARDNER and LATTA (2007) have summarized that the average confidence interval around a QTL is about 15.6 cM, based on more than 200 mapped QTL. Generally speaking, QTL analysis is a statistical method that links two types of information, phenotypic data (trait measurements) and genotypic data (DNA variants), in an attempt to explain the genetic basis of variation in complex traits. QTL analysis allows researchers in fields as diverse as agriculture, evolution, and medicine to link complex phenotypes to specific regions of chromosomes. The goal of a QTL analysis is to identify the action, interaction, number, and precise location of the regions (QTL) affecting a certain complex trait of interest.

Two things are required in order to conduct a QTL analysis in an experimental population: 1. Two or more strains of organisms that differ genetically with regard to the trait of interest; 2. Genetic markers that distinguish between these parental lines. By typing genetic markers as tags along the genome, DNA information of a population can be obtained. The QTL interval mapping strategy was developed when genetic markers were not as dense across the genome as today’s SNP arrays. When people started using micro-satellite markers (SSRs), every marker, regardless of its information content, could be very precious. Techniques such as interval mapping were invented to mine as much information as possible from the limited number of sparse markers, trying to detect QTL harbored between flanking markers.

An example of a powerful experimental design is the F2 intercross. To carry out the QTL analysis, the parental strains are crossed, resulting in heterozygous (F1) individuals. These individuals are then mated to produce F2 individuals. The phenotypes and genotypes of the derived (F2) population are scored. Such an experimental design makes it possible to do QTL interval mapping, which is a linkage analysis technique that infers genotypes between flanking markers in order to identify QTL not lying on the marker positions. The possibility of doing such inference comes from recombination/crossing-over between homologous chromosomes. During meiosis, chromosome segments are shuffled, so that the pieces, including the genetic markers therein, that are genetically linked to a QTL influencing the trait of interest will segregate more frequently with trait values, whereas unlinked markers will not show significant association with the phenotype (Figure 2.1).

Statistical modeling is of central importance for identifying QTL. It is an open question how the phenotypic and genotypic data should be associated. The simplest linear model (2.1) considers the phenotypic value $y_{ik}$ of the $i$-th individual with marker genotype $k$ as a mean value $\mu$ plus a marker effect $b_k$ and a residual error $e_{ik}$ that is usually assumed to be normally distributed.

$$ y_{ik} = \mu + b_k + e_{ik} \tag{2.1} $$

This is a one-way ANOVA model, with the presence of a QTL being indicated by a significant between-genotype variance. Instead of using only marker genotypes, for interval mapping with genotype uncertainty, the maximum likelihood (ML) method uses the full information from the marker-trait distribution, so it is expected to be more powerful. Assuming that the distribution of the phenotype for an individual with QTL genotype $Q_k$ is normal with mean $\mu_k$ and variance $\sigma^2$, the likelihood for individual $i$ with phenotypic value $y_{ik}$ given marker genotype $M_j$ is

$$ \mathcal{L}(y_{ik} | M_j) = \sum_{k=1}^{N} \varphi(y_{ik}, \mu_k, \sigma^2) P(Q_k | M_j) \tag{2.2} $$

where $\varphi(y_{ik}, \mu_k, \sigma^2)$ denotes the density function for a normal distribution with mean $\mu_k$ and variance $\sigma^2$, and a total number of $N$ QTL genotypes is assumed (see also LYNCH and WALSH 1998). In the likelihood framework, testing whether a QTL is associated with the trait of interest is based on the LRT statistic,

$$ \lambda = -2 \log\left( \frac{\max \mathcal{L}_0(y)}{\max \mathcal{L}(y)} \right) \tag{2.3} $$

where $\mathcal{L}_0(y)$ is the likelihood under the null hypothesis assuming no QTL. By plotting the likelihood-ratio statistic (or a closely related quantity, e.g. the LOD score) as a function of the map position of the putative QTL, the likelihood profile displays graphically the amount of support for a QTL at a particular map position.
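As a rough illustration of the mixture likelihood (2.2) and the likelihood-ratio test (2.3), the following Python sketch fits the genotype means and a common variance at one putative QTL position with a basic EM iteration. The toy phenotypes, the genotype probabilities, and all function names are assumptions made for this example; none of them come from the papers or from any mapping software.

```python
import math

def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_loglik(ys, probs, means, var):
    """Log of (2.2) summed over individuals; probs[i][k] plays the role of P(Q_k | M_j)."""
    return sum(
        math.log(sum(p * normal_pdf(y, m, var) for p, m in zip(row, means)))
        for y, row in zip(ys, probs)
    )

def null_fit(ys):
    """ML mean and variance of a single normal distribution (no QTL)."""
    n = len(ys)
    mean = sum(ys) / n
    return mean, sum((y - mean) ** 2 for y in ys) / n

def em_fit(ys, probs, n_iter=200):
    """EM estimates of genotype means and common variance, started at the null fit."""
    n, n_geno = len(ys), len(probs[0])
    mean0, var = null_fit(ys)
    means = [mean0] * n_geno
    for _ in range(n_iter):
        # E-step: posterior genotype weights given phenotype and marker information
        weights = []
        for y, row in zip(ys, probs):
            joint = [p * normal_pdf(y, m, var) for p, m in zip(row, means)]
            total = sum(joint)
            weights.append([j / total for j in joint])
        # M-step: weighted genotype means and pooled residual variance
        for k in range(n_geno):
            wk = sum(w[k] for w in weights)
            if wk > 0:
                means[k] = sum(w[k] * y for w, y in zip(weights, ys)) / wk
        var = sum(w[k] * (y - means[k]) ** 2
                  for w, y in zip(weights, ys) for k in range(n_geno)) / n
    return means, var

# Toy data (hypothetical): 10 individuals, genotype order (QQ, Qq, qq)
ys = [12.1, 11.8, 12.4, 10.2, 9.9, 10.1, 8.0, 7.9, 8.3, 10.0]
probs = ([[0.9, 0.1, 0.0]] * 3 + [[0.1, 0.8, 0.1]] * 3 +
         [[0.0, 0.1, 0.9]] * 3 + [[0.25, 0.5, 0.25]])

means, var = em_fit(ys, probs)
mean0, var0 = null_fit(ys)
loglik_full = mixture_loglik(ys, probs, means, var)
loglik_null = mixture_loglik(ys, probs, [mean0] * 3, var0)

lrt_stat = -2.0 * (loglik_null - loglik_full)   # equation (2.3)
lod_score = lrt_stat / (2.0 * math.log(10.0))   # LOD is the LRT on a log10 scale
print(lrt_stat, lod_score)
```

Starting the EM iteration at the null fit means the fitted likelihood cannot fall below the null likelihood, so the statistic stays non-negative. In a real scan the same computation would be repeated at each map position to obtain the likelihood profile.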


Figure 2.1. A schematic diagram for a QTL mapping process in an F2 intercross design. a, a pair of homologous chromosomes in the F1 population recombine to produce gametic chromosomes that exist in the F2 population. b, a linkage analysis results in a test statistic (e.g. likelihood ratio) profile along the chromosome, where, after a permutation test, the peak above the significance threshold is considered a significant QTL.

ML estimation was originally computationally demanding and required specialized algorithms, such as the EM algorithm (LANDER and BOTSTEIN 1989; DEMPSTER et al. 1977). Fortunately, a creative use of regression was shown to often provide an excellent approximation to the ML solution. HALEY and KNOTT (1992) proposed a simple regression directly on the genotype probabilities that approximates the likelihood profile very well for ML interval mapping. Using the parameterization of FALCONER and MACKAY (1996), assuming two alleles $Q$ and $q$ at the QTL, the genotypic means are

$$ \mu_{QQ} = \mu + a, \quad \mu_{Qq} = \mu + d, \quad \mu_{qq} = \mu - a $$

where $a$ and $d$ are the additive and dominance effects, respectively. The regression is conducted as

$$ y_{ik} = \mu + a \cdot x_a(Q_k) + d \cdot x_d(Q_k) + e_{ik} \tag{2.4} $$

Taking the expectation over all individuals with QTL genotype $Q_k$ gives

$$ \mu_{Q_k} = \mu + a \cdot x_a(Q_k) + d \cdot x_d(Q_k) \tag{2.5} $$


and given marker genotype $M_j$, we also have

$$ \mu_{Q_k} = (\mu + a) P(QQ|M_j) + (\mu + d) P(Qq|M_j) + (\mu - a) P(qq|M_j) \tag{2.6} $$

$$ \mu_{Q_k} = \mu + a \cdot [P(QQ|M_j) - P(qq|M_j)] + d \cdot P(Qq|M_j) \tag{2.7} $$

so that comparing (2.5) with (2.7) gives

$$ x_a(Q_k) = P(QQ|M_j) - P(qq|M_j) \tag{2.8} $$

$$ x_d(Q_k) = P(Qq|M_j) \tag{2.9} $$

Hence, as long as the genotype probabilities are inferred using the recombination frequencies and flanking marker information, a QTL scan can be done by directly regressing the phenotypic records on $P(QQ|M_j) - P(qq|M_j)$ and $P(Qq|M_j)$ for each map position. When it comes to an outbred line cross, calculating genotype probabilities can be more complicated than in an inbred line cross. However, in any case, the same QTL regression model can be applied. NELSON et al. (2011) used this fact to connect two pieces of software: a fast program for calculating genotype probabilities in outbred line crosses (NETTELBLAD et al. 2009) and a popular tool for mapping QTL in inbred line crosses (BROMAN 2003).
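The regression on genotype probabilities in (2.4), with covariates (2.8) and (2.9), can be sketched in a few lines of Python. This is only a minimal illustration at a single map position, assuming the genotype probabilities have already been computed elsewhere; the function names, the toy probabilities, and the noise-free phenotypes are hypothetical.

```python
def hk_covariates(p_QQ, p_Qq, p_qq):
    """Regression covariates (2.8) and (2.9) from the three genotype probabilities."""
    return p_QQ - p_qq, p_Qq

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def hk_regression(ys, genoprobs):
    """Least-squares estimates (mu, a, d) of model (2.4) at one map position."""
    X = [[1.0, *hk_covariates(*p)] for p in genoprobs]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(3)]
    return solve_linear(XtX, Xty)

# Toy genotype probabilities (P(QQ|M), P(Qq|M), P(qq|M)) for five individuals
genoprobs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
             (0.5, 0.5, 0.0), (0.25, 0.5, 0.25)]

# Noise-free phenotypes generated with mu = 10, a = 2, d = 0.5
ys = []
for p in genoprobs:
    xa, xd = hk_covariates(*p)
    ys.append(10.0 + 2.0 * xa + 0.5 * xd)

mu_hat, a_hat, d_hat = hk_regression(ys, genoprobs)
print(mu_hat, a_hat, d_hat)
```

Because the toy phenotypes contain no residual noise, the fit recovers the generating values of the mean, additive, and dominance effects. With real data the residual sum of squares at each position would feed an F or likelihood-ratio statistic for the scan.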

It should be noted that “ghost” QTL can show up when two or more linked QTL are located on the same chromosome, because the chromosome segment between two linked QTL can carry information from both sides. ZENG (1994) showed that when mapping QTL between markers $j$ and $j+1$, including markers $j-1$ and $j+2$ as “guards” could properly account for the QTL information outside the current interval, which is known as composite interval mapping.

Another popular way of modeling QTL effects is to treat the effects of QTL alleles as random instead of fixed. Fixed effects models assume a pre-defined number of QTL alleles and try to estimate their individual allele substitution effects, whereas random effects models assume that the founder alleles of the population are drawn from a distribution, and the properties of this distribution are inferred in the analysis. To treat the allele substitution effects as random, one can apply a variance component model at each chromosomal position, which has the form

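$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e} \tag{2.10} $$

where $\mathbf{y}$ is the trait response vector, $\boldsymbol{\beta}$ is the fixed effect vector, $\mathbf{u}$ is the random QTL effect that has a zero mean and variance $\mathrm{Var}(\mathbf{u}) = \frac{1}{2}\sigma_g^2 \mathbf{I}_q$, and $\mathbf{e}$ is the error term with a zero mean and variance $\mathrm{Var}(\mathbf{e}) = \sigma_e^2 \mathbf{I}_N$. $\mathbf{u}$ includes the founder allele substitution effects, and the incidence matrix $\mathbf{Z}$ relates individuals and their inherited allele substitution effects. Instead of $\mathbf{Z}$, genetic information at a certain chromosomal position is often obtained by calculating the IBD matrix $\boldsymbol{\Pi}$ such that (RÖNNEGÅRD and CARLBORG 2007)

$$ \boldsymbol{\Pi} = \frac{1}{2} \mathbf{Z}\mathbf{Z}' \tag{2.11} $$

The variance-covariance matrix of $\mathbf{y}$ is then

$$ \mathbf{V} = \boldsymbol{\Pi}\sigma_g^2 + \mathbf{I}\sigma_e^2 \tag{2.12} $$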

where $\sigma_g^2$ is the genotypic variance and $\sigma_e^2$ is the residual variance. Since the QTL effects are modeled as random, the IBD matrix $\boldsymbol{\Pi}$ is actually a correlation matrix for the correlated random effects. Each correlation element is inferred not only from the pedigree structure but also from the information of the markers flanking the tested position. The existence of a QTL is therefore determined by testing the genetic variance component $\sigma_g^2$. To adjust for the bias in variance component estimation, REML is commonly used, where the adjusted profile likelihood (PAWITAN 2001; LEE et al. 2006) of $\boldsymbol{\theta} = (\sigma_g^2, \sigma_e^2)'$

$$ \mathcal{L}(\boldsymbol{\theta} | \boldsymbol{\Pi}, \mathbf{y}) = f(\mathbf{y} | \boldsymbol{\Pi}, \boldsymbol{\theta}) = |2\pi\mathbf{V}|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' \mathbf{V}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right) \left| \frac{\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}}{2\pi} \right|^{-1/2} \tag{2.13} $$

is maximized to estimate and test $\boldsymbol{\theta}$. Such a variance component QTL model is a powerful tool in QTL interval mapping, especially in outbred crosses. For instance, RÖNNEGÅRD et al. (2008) developed flexible intercross analysis based on variance component QTL models, where by including information about the population structure, i.e. line-origin, in the IBD matrix, it is possible to model and test the magnitude of segregation within each parental line. SHEN et al. (2011b) considered the complex uncertainty in the IBD matrix itself and showed that a variance component QTL model gains more power in QTL identification when the full likelihood is applied.
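To make (2.11) to (2.13) concrete, the sketch below builds the IBD matrix from a small incidence matrix and evaluates the logarithm of the adjusted profile likelihood for given variance components. It is a simplified illustration only: it assumes that the fixed effects consist of a single intercept (so $\mathbf{X}$ is a column of ones) and that $\boldsymbol{\beta}$ is replaced by its generalized least squares estimate. The incidence matrix, phenotypes, and function names are hypothetical.

```python
import math

def ibd_matrix(Z):
    """IBD matrix Pi = 0.5 * Z Z' as in (2.11)."""
    n = len(Z)
    return [[0.5 * sum(a * b for a, b in zip(Z[i], Z[j])) for j in range(n)]
            for i in range(n)]

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L') x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def reml_loglik(y, Pi, sg2, se2):
    """Log of (2.13) for an intercept-only model (assumption: X is a column of ones)."""
    n = len(y)
    # V = Pi * sigma_g^2 + I * sigma_e^2, equation (2.12)
    V = [[Pi[i][j] * sg2 + (se2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(V)
    logdet_V = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    xtvx = sum(chol_solve(L, [1.0] * n))        # 1' V^{-1} 1
    beta = sum(chol_solve(L, y)) / xtvx         # GLS estimate of the intercept
    r = [yi - beta for yi in y]
    quad = sum(ri * vi for ri, vi in zip(r, chol_solve(L, r)))
    return (-0.5 * (n * math.log(2.0 * math.pi) + logdet_V)
            - 0.5 * quad
            - 0.5 * math.log(xtvx / (2.0 * math.pi)))

# Toy example: four individuals, four founder alleles, two alleles per individual
Z = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]]
y = [10.2, 9.8, 11.1, 10.5]
Pi = ibd_matrix(Z)

# Evaluate the criterion on a coarse grid of sigma_g^2 with sigma_e^2 held at 0.2
grid = [0.0, 0.1, 0.2, 0.4, 0.8]
profile = [reml_loglik(y, Pi, sg2, 0.2) for sg2 in grid]
print(list(zip(grid, profile)))
```

In practice both variance components would be optimized jointly with a numerical optimizer, and the test of the QTL would compare the maximized criterion with and without $\sigma_g^2$. The grid above only shows how the criterion is evaluated.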

In general, QTL analysis is regarded as a low-power method, especially in terms of mapping accuracy. Large sample sizes are required to map QTL with precision. Roughly, with 200-300 F2 individuals, a QTL accounting for 5% of total variation can be mapped into a 40 cM interval, and over 10 000 F2 individuals are required to map this QTL into a 1 cM interval (LYNCH and WALSH 1998). Most of the time, even 1 cM is not a short interval at all in terms of the number of genes lying under the significant QTL peak. Therefore, one would not consider using an F2 population to fine-map a QTL. When a dense marker map is available for a certain population, association mapping can offer better mapping precision.

2.1.2 Genome-wide association study

GWAS is a simple idea that many researchers did not believe in even a few years ago. Unlike in a QTL analysis, the markers themselves are tested instead of the intervals flanked by markers. In order to obtain good power, a large sample size is needed in a population-based association study. Since the LD blocks are so small compared to e.g. an F2 population, in order to map a causal locus, a very dense marker map is required so that a certain marker can be in strong linkage with the causal gene. In a cover letter of Science published seven years ago, KLEIN et al. (2005) reported a causal polymorphism of complement factor H that regulates age-related macular degeneration (AMD). Although the detection of this gene was statistically very lucky, it successfully indicated the inherent potential of the GWAS strategy. From then on, more and more loci were mapped via GWAS, regulating, for instance, human disease-related traits like blood pressure (LEVY et al. 2009), blood lipids (AULCHENKO et al. 2009; TESLOVICH et al. 2010), coronary heart disease (WANG et al. 2011), breast cancer (TURNBULL et al. 2010), uterine fibroids (CHA et al. 2011), etc., as well as complex traits in other species such as mice (VALDAR et al. 2006), maize (TIAN et al. 2011), Arabidopsis (ATWELL et al. 2010) and so on.


Figure 2.2. A schematic flowchart for a genome-wide association study. A narrow GWAS data analysis basically includes the 3rd and 4th steps, where statistical tests are performed throughout the genome at each SNP marker, and the LD pattern under the significant association signals is examined.

Unlike QTL analyses in experimental designs, GWAS does not require an artificial cross of parental lines. Only a randomly sampled population is needed. This certainly sacrifices the advantage of localizing QTL between flanking markers. However, if the SNPs are dense enough, given a sufficient sample size, a group of polymorphisms that are linked to the causal gene would show significant association. Such an association signal generally has much better resolution than a QTL analysis in an intercross design. Figure 2.2 shows how a GWAS is usually conducted, where the association scan in the middle is the step to which statistical modeling contributes the most. The LD pattern can be used to derive haplotypes for testing window-wise associations. Bonferroni correction is commonly used to determine the genome-wide significance threshold. However, it can be too conservative because of LD, or too liberal because of confounding in the population. A significance threshold from permutations can be a convincing alternative. Nevertheless, in order to achieve “overwhelming” significance, replicating the detected association is the ultimate solution, which is especially important in a drug discovery process (see also KINGSMORE et al. 2008). When the same analysis routine is done in several different studies, a further meta-analysis can accumulate power from the combined results.
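The two ways of setting a genome-wide threshold mentioned above can be compared in a small Python sketch. It assumes a simple per-marker statistic, $n r^2$ between phenotype and genotype, which is only approximately $\chi^2$ with 1 degree of freedom under the null; the simulated genotypes, the number of permutations, and all function names are choices made for this illustration.

```python
import math
import random

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold controlling the family-wise error at alpha."""
    return alpha / n_tests

def assoc_stat(y, x):
    """n * r^2 between phenotype y and genotype x (assumed test statistic)."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # monomorphic marker or constant phenotype carries no signal
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return n * sxy * sxy / (sxx * syy)

def permutation_threshold(y, genotypes, n_perm=200, alpha=0.05, seed=1):
    """(1 - alpha) quantile of the genome-wide maximum statistic under permutation."""
    rng = random.Random(seed)
    y_perm = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks the genotype-phenotype link, keeps LD among markers
        maxima.append(max(assoc_stat(y_perm, g) for g in genotypes))
    maxima.sort()
    return maxima[min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)]

# Simulated toy data: 20 markers coded 0/1/2 on 50 individuals, null phenotype
rng = random.Random(0)
genotypes = [[rng.choice([0, 1, 2]) for _ in range(50)] for _ in range(20)]
y = [rng.gauss(0.0, 1.0) for _ in range(50)]

p_cut = bonferroni_threshold(0.05, len(genotypes))
stat_cut = permutation_threshold(y, genotypes)
print(p_cut, stat_cut)
```

Shuffling the phenotypes while leaving the genotype matrix intact preserves the correlation among markers, which is why the permutation threshold adapts to LD whereas the Bonferroni threshold treats all tests as independent.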

Many statistical methods have been developed particularly for GWAS (see BALDING 2006; CANTOR et al. 2010), together with quite a few computational tools (e.g. AULCHENKO et al. 2007b; PURCELL et al. 2007; YANG et al. 2010b). Most GWAS basically only look at the additive effect of each single SNP (the dominance effect may be included, but it is often avoided because of the extra degree of freedom). So the most common parametric model is

$$ y_{ij} = \mu + x_j \beta + e_{ij} \tag{2.14} $$

for individual $i$ with SNP genotype $x_j$ (coded as 0, 1 and 2 for instance), where $y_{ij}$ is the phenotype, $\mu$ is the overall mean, $e_{ij}$ is the residual, and $\beta$ is the additive SNP effect. A p-value is obtained by performing a t- or Wald test on $\beta$. It is important for a GWAS analysis to store p-values (or equivalently, standard errors), which makes future meta-analysis possible. In a meta-analysis combining results from $N$ different studies, a pooled estimate of $\beta$ can be calculated as

$$ \beta = \frac{\sum_{i=1}^{N} w_i \beta_i}{\sum_{i=1}^{N} w_i} \tag{2.15} $$

where $w_i = 1/s_i^2$ and $s_i$ is the standard error of $\beta_i$. The standard error of $\beta$ is computed as

$$ s = \sqrt{\frac{H}{N}} \tag{2.16} $$

where $H$ is the harmonic mean of all the $s_i^2$'s. Hence, a meta-analysis reduces the standard error of the estimated effect by a factor of $\sqrt{N}$.

Among many issues in GWAS (see WANG et al. 2005b; MCCARTHY et al. 2008), population stratification might be the most worrying one that requires sophisticated statistical methods to handle (PRICE et al. 2010). A simple solution is genomic control (GC) (DEVLIN and ROEDER 1999). GC is used to shrink inflation of the test scores ($-\log_{10}$ p-values). When testing for a single genetic effect, say the additive effect, in GWAS, the null distribution of the test statistic for the nominal p-values is $\chi^2$ with 1 degree of freedom. Since most of the SNPs are not expected to be associated with the trait, the sample distribution of the $\chi^2$'s across the genome is expected to resemble the null distribution. If inflation exists, an observed $\chi^2$ value becomes $\lambda \cdot \chi^2$. Therefore, the $\chi^2$'s can be adjusted using $\lambda$, i.e. the inflation factor estimated by comparing the distribution of the observed $\chi^2$'s with the $\chi^2$ distribution with 1 degree of freedom. $\lambda$ can be estimated in different ways. In the R package GenABEL (AULCHENKO et al. 2007b), $\lambda$ is estimated as the regression slope of the observed $\chi^2$'s on the null. The original $\lambda$ estimator proposed by DEVLIN and ROEDER (1999) is the ratio of the observed median of the $\chi^2$'s to the theoretical median of the $\chi^2$ distribution with 1 degree of freedom, $\chi^2(1)$, i.e.

$$ \lambda = \frac{\mathrm{median}\{\chi^2_1, \chi^2_2, \ldots, \chi^2_p\}}{\mathrm{median}\{\chi^2(1)\}} \tag{2.17} $$

where $p$ is the number of tested positions or a big subset of them. In practice, many publications used the number 0.456 to approximate $\mathrm{median}\{\chi^2(1)\}$. However, $\mathrm{median}\{\chi^2(1)\} \approx 0.455$, and actually even slightly less than 0.455. This is a curiosity, since a denominator that is too large gives a $\lambda$ that is slightly too small, so many GWAS results have been published slightly over-liberal.
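As a rough illustration of equations (2.15) to (2.17), the sketch below pools per-study estimates by inverse-variance weighting and computes the median-based inflation factor. The function names and the toy numbers are made up for this example and are not taken from GenABEL or any other package mentioned here.

```python
from math import sqrt
from statistics import NormalDist, median


def pooled_estimate(betas, ses):
    """Inverse-variance weighted estimate (2.15) and its standard error (2.16)."""
    weights = [1.0 / s ** 2 for s in ses]            # w_i = 1 / s_i^2
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    harmonic_mean = len(ses) / sum(weights)          # H, harmonic mean of the s_i^2
    return beta, sqrt(harmonic_mean / len(ses))      # s = sqrt(H / N)


# Median of the chi-square distribution with 1 df, obtained as the square of
# the 75% quantile of a standard normal (slightly below 0.455).
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(chi2_values):
    """Median-based inflation factor, equation (2.17)."""
    return median(chi2_values) / CHI2_1_MEDIAN


# Hypothetical toy inputs: two studies and five test statistics.
beta, se = pooled_estimate([0.10, 0.20], [0.1, 0.1])
observed = [0.1, 0.3, 0.5, 0.9, 2.0]
lam = genomic_control_lambda(observed)
adjusted = [x / lam for x in observed]               # GC-corrected statistics
```

With two studies of equal precision, the pooled standard error is the single-study standard error divided by the square root of two, in line with (2.16).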

More sophisticated than GC, the principal component analysis (PCA) method proposed by PRICE et al. (2006) has been widely adopted, and more recently, mixed-model-based methods have become popular as well (e.g. AULCHENKO et al. 2007a; KANG et al. 2008, 2010; LIPPERT et al. 2011). The mixed model generally has the form

$$ y_{ij} = \mu + x_j \beta + u_i + e_{ij} \tag{2.18} $$

where, compared with (2.14), the extra random effect term $u_i$ is the polygenic effect, and $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}\sigma_u^2)$. $\mathbf{G}$ is a genomic kinship matrix estimated from the SNP data (e.g. VANRADEN 2008; KANG et al. 2008; YANG et al. 2010a). The fundamental problem of population stratification is due to the similarity in the DNA sequence between individuals who also have similar phenotypes, so that the inflated signals caused by genetic background will cause false discoveries. Therefore, the correlation due to population structure needs to be removed in the analysis, but certainly, the loci with small effects simultaneously become undetectable.

2.2 Multiple-predictor analysis

Since genetic variation is a combined effect of multiple genes, looking at only one locus at a time should not be the optimal way to detect complex genetic architecture. Multiple regression types of models have been used to fit more genetic variants or even the whole genome simultaneously. Moreover, gene-gene and gene-environment interactions could also be interesting phenomena that may help us understand complex genetic networks. In this section, some popular approaches with basic theories for estimating polygenic effects across the genome are introduced. Interaction analysis is included, and due to its connection to variance heterogeneity, recent developments on mapping genes affecting phenotypic variability are introduced as well.

2.2.1 Polygenic effects estimation

Estimation of polygenic effects does not necessarily require genotyping “polygenes” (genome-wide genotyping). In variance component QTL analysis, and also GWAS, a linear model with polygenic effects is commonly used to address the genetic background information other than the major QTL effects. Especially in QTL analysis, polygenic effects can be addressed by modeling random effects that have a correlation structure derived from the pedigree information, without knowing which genes are causing the polygenic effects. With dense SNP markers, one can estimate the polygenic effects of all the available markers using a “super-saturated” model. This kind of model is very useful in current quantitative genetics since it provides a powerful unified framework for both QTL mapping and genomic evaluation (see Chapter 3).

The model is “super-saturated” because it fits more markers ($p$) than the number of individuals ($n$). It generally uses a linear predictor like

$$ \boldsymbol{\mu} + \sum_{j=1}^{p} \mathbf{Z}_j u_j \tag{2.19} $$

to model the phenotype, where $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$ may include some fixed effects, and $\mathbf{Z}_j$ and $u_j$ are the genotype coding and random effect for the $j$:th SNP, respectively. When modeling the $u_j$'s as a random sample drawn from a normal distribution, the model is an LMM, or in the terminology used in genomic prediction, a GBLUP model (MEUWISSEN et al. 2001). By solving such a random effects model, one can obtain the shrinkage estimates for all the SNPs (which are actually identical to those obtained from a ridge regression). A good property of ridge regression is that it was originally developed for overcoming collinearity in linear regression problems (e.g. HASTIE et al. 2009), so that LD in the genomic data will have little impact on the estimation of the effects of the strongly linked SNPs.
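The behaviour of ridge-type shrinkage under collinearity can be seen in a small sketch. It solves $(\mathbf{Z}'\mathbf{Z} + \lambda\mathbf{I})\mathbf{u} = \mathbf{Z}'\mathbf{y}$ for a toy "super-saturated" data set with more SNPs than individuals, in which two SNPs are in complete LD (identical columns). The helper names, the data, and the choice $\lambda = 1$ (which would correspond to $\sigma_e^2/\sigma_u^2 = 1$) are assumptions made for illustration only, and fixed effects are left out.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def ridge(Z, y, lam):
    """Shrinkage estimates of SNP effects: solve (Z'Z + lam * I) u = Z'y."""
    n, p = len(Z), len(Z[0])
    lhs = [[sum(Z[i][a] * Z[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    rhs = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(lhs, rhs)


# Toy data: 3 individuals, 4 SNPs (p > n); SNPs 0 and 1 are identical.
Z = [[-1.0, -1.0, 1.0, 0.0],
     [0.0, 0.0, -1.0, 1.0],
     [1.0, 1.0, 0.0, -1.0]]
y = [-1.0, 0.2, 0.8]
u = ridge(Z, y, lam=1.0)   # u[0] and u[1] share the effect equally
```

Without the penalty the system is singular here, so ordinary least squares has no unique solution, whereas the penalized system splits the effect evenly between the two SNPs in complete LD.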

XU (2003) claimed that a GBLUP model is not proper for QTL mapping because of the equally strong shrinkage for each SNP assumed in the model. SHEN et al. (2012a) validated this point by trying a randomization test on a GBLUP model. Therefore, in order to shrink the effects of non-QTL positions more strongly and meanwhile highlight the QTL, different methods, most of which are Bayesian, have been proposed to assign unequal weights to different SNPs (e.g. MEUWISSEN et al. 2001; XU 2003; WANG et al. 2005a; XU 2007; YI and XU 2008; VERBYLA et al. 2009; RÖNNEGÅRD and LEE 2010; HABIER et al. 2011; SHEN et al. 2011a, 2012a). All these methods basically allow heteroscedastic effects of the SNPs, such as

$$ u_j \sim \mathcal{N}(0, \sigma_{u_j}^2) \tag{2.20} $$

instead of $u_j \sim \mathcal{N}(0, \sigma_u^2)$. The $\sigma_{u_j}^2$'s differ between different ways of penalization. The results from this kind of model can be impressive in terms of QTL mapping profile. Major QTL can be identified clearly by just looking at their effects. However, although Bayesian analysis does not quite emphasize significance testing, significance testing is still an essential problem to solve before these whole-genome models can be applied more widely.

2.2.2 Interaction effects and variance heterogeneity

Epistasis is always there, interesting to discover and understand, even though additive effects often play the most important role in population genetics (HILL et al. 2008). Epistasis is complicated to model and explain, and there is some difference between epistasis and statistical interactions. Epistasis, in its simplest form, refers to an interaction between a pair of loci, where the phenotypic effect of one locus depends on the genotype at the other (CARLBORG and HALEY 2004). From a statistical point of view, such an interaction happens when the combined effects of a pair of loci are not additive (COX 1984).

Let us assume two loci A and B that affect a particular trait y. A linear model that contains only the interaction between A and B does not make sense, because “Now since the presence of the interaction places no restrictions at all on how A varies as B changes (and vice versa), we ought to be very surprised if either margin were null. Why should it be?”, wrote NELDER (1994). Therefore, in order to test interaction in a linear model, given the genotypes at two loci, one has to perform a two-way ANOVA comparing

$$ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk} \tag{2.21} $$

with

$$ y_{ijk} = \mu + \alpha_i + \beta_j + e_{ijk} \tag{2.22} $$

where $\mu$ is the overall mean, $\alpha_i$ is the effect of locus A, $\beta_j$ is the effect of locus B, $\gamma_{ij}$ is the interaction effect between loci A and B, $e_{ijk}$ is the residual, and $i$, $j$ and $k$ are the indices for locus A genotypes, locus B genotypes and individuals, respectively. Here, $\gamma_{ij}$ does not contain any additive margin of either locus. Testing $\gamma_{ij}$ genome-wide requires a two-dimensional scan. For $p$ loci on the genome, $p(p-1)/2$ analyses of variance need to be performed, which is computationally intensive. Unfortunately, such a pairwise scan has low power because so many tests are done, making it difficult for an epistatic pair of loci to stand out. When epistasis exists, it cannot always be statistically proved (because of low power and the removal of additive margins), and vice versa, a statistically significant interaction does not necessarily indicate a molecular interaction between two proteins (because of false discoveries and because the interaction can be indirect).
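To give a feeling for the size of a two-dimensional scan, the sketch below counts the $p(p-1)/2$ pairwise tests and the per-test level that a Bonferroni correction would require. The number of loci (1000) and the 5% level are arbitrary illustrative choices, and Bonferroni is used only because it is the simplest correction to write down.

```python
from math import comb


def pairwise_scan_size(p, alpha=0.05):
    """Number of locus pairs and the Bonferroni per-test level for a 2D scan."""
    n_tests = comb(p, 2)            # p (p - 1) / 2 pairs of loci
    return n_tests, alpha / n_tests


# Hypothetical marker count.
p = 1000
n_tests, threshold = pairwise_scan_size(p)
single_scan_threshold = 0.05 / p    # one-dimensional scan, for comparison
```

Even for this modest marker count, the pairwise scan involves roughly 500 times as many tests as the one-dimensional scan, which is why an epistatic pair needs a much stronger signal to stand out.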

As a phenomenon, an epistatic pathway (G×G), as well as gene-environment interactions (G×E), can be exciting to find. Relatively speaking, looking for a significant G×E interaction is less difficult if the environmental factor is well measured. In fact, knowing only one locus that is potentially interacting with some other loci or factors can also be exciting. Interestingly, several recent studies (PARÉ et al. 2010; STRUCHALIN et al. 2010; RÖNNEGÅRD and VALDAR 2011), developed at almost the same time, noticed that variance heterogeneity at a locus can be caused by interactions that involve this locus. Simply searching for significant loci showing variance heterogeneity can, to use PARÉ et al. (2010)'s word, prioritize such loci since they have the potential to be involved in G×G or G×E interactions.

The power of detecting potentially interacting loci using a variance heterogeneity test seems to be good (PARÉ et al. 2010). However, for such a variance heterogeneity test (Levene or Brown-Forsythe test, see Chapter 7), STRUCHALIN et al. (2010) discovered an interesting, and curious, behavior in power. Assuming the model causing variance heterogeneity of the tested locus is (2.21), and there is no main effect of this locus, we would like to know how the power of the variance heterogeneity test varies as the other effects change. Given a certain amount of the interaction effect, what STRUCHALIN et al. (2010) found is an “M”-shaped curve for the power against the main effect of the other interacting factor, which looks surprising. In order to visualize the point completely, I made the 2D Figure 2.3 to show the power pattern against both the main effect of the other interacting factor and the interaction effect. Plotted in Figure 2.3 is the Brown-Forsythe test statistic, which is proportional to the non-centrality parameter that represents power (HEWITT and HEATH 1988; LIU and RAUDENBUSH 2004).
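For readers unfamiliar with the statistic plotted in Figure 2.3, a minimal sketch of the Brown-Forsythe statistic follows: a one-way ANOVA F statistic computed on the absolute deviations from the group medians, to be compared with an F distribution with $k-1$ and $N-k$ degrees of freedom. The function name and the toy phenotype values grouped by genotype are assumptions for illustration, not the code used to produce the figure.

```python
from statistics import mean, median


def brown_forsythe(groups):
    """Brown-Forsythe statistic: ANOVA F on |y - group median|."""
    z = [[abs(y - median(g)) for y in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand_mean = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in z)
    within = sum((v - mean(g)) ** 2 for g in z for v in g)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical phenotypes for two genotype classes.
same_spread = brown_forsythe([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]])
different_spread = brown_forsythe([[1.0, 2.0, 3.0], [0.0, 2.0, 4.0]])
```

A pure shift in the mean between genotype classes leaves the statistic at zero, whereas a difference in spread increases it, which is what makes the test suitable for detecting variance-controlling loci.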

As expected, the power of the variance heterogeneity test varies as the interaction effect changes. When the interaction effect is null, no heterogeneity of variance is generated, so the test has no power. However, even when there is a certain amount of interaction effect, if it happens to be a half of the main effect of the other interacting factor, the test loses all the power (see the discussion of STRUCHALIN et al. 2010). This peculiar trend in Figure 2.3 makes variance heterogeneity a maybe-sufficient but unnecessary condition for interaction effects. Namely, if variance heterogeneity is caused by interaction,

$$ \text{variance heterogeneity} \Rightarrow \text{interaction effect}, $$

but

$$ \text{variance heterogeneity} \nLeftarrow \text{interaction effect}. $$

Researchers such as JIMENEZ-GOMEZ et al. (2011) discovered QTL with varianceheterogeneity as QTL controlling stochastic noise. They do not claim that the vari-ance heterogeneity is necessarily generated by interaction effects. Nonetheless, theunderlying model that generates variance heterogeneity can be an interaction model.

[Figure 2.3 image: Brown-Forsythe test statistic (scale 0 to 1000) plotted against the main effect of the interacting factor (−300 to 300) and the interaction effect (−300 to 300).]

Figure 2.3. The Brown-Forsythe test statistic in a two-way interaction model. A two-way interaction model with standard normal residuals is assumed, including the main effect of the tested locus, the main effect of the other interacting factor, and their interaction effect. The value of the Brown-Forsythe test statistic depends on both the main effect of the factor and the interaction effect. Since the test statistic is proportional to the non-centrality parameter, this figure shows the power of the variance heterogeneity test in an interaction model.

3. Predictive Modeling

“However beautiful the strategy, you should occasionally look at the results.”

—— Winston Churchill

Because of the complexity in the underlying genetic architecture for the quantitative traits, there is usually little confidence that our effect estimate for an individual locus is correct. However, the situation is better when predicting individual phenotypes. In fact, prediction is an essential part of statistical analysis. Since we cannot yet understand the functions of all the genetic variants, summing all their effects together could help more in prediction than using only the detected individual loci. In Chapter 2, we have already seen the use of the whole-genome models for QTL identification. Here I introduce the predictive usage of the models that fit all the available genome-wide markers.

Genomic evaluation or genomic selection basically means predicting individual breeding values and performing selection based on genetic markers. Before genomic dense markers were used, farm animals were evaluated according to their pedigree kinship, from which linear mixed models (LMMs) were developed. Assuming an LMM for a particular phenotype $\mathbf{y}$,

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_a \mathbf{a} + \mathbf{e} \tag{3.1} $$

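where $\boldsymbol{\beta}$ are the fixed effects with design matrix $\mathbf{X}$, $\mathbf{a}$ are the animal effects with incidence matrix $\mathbf{Z}_a$, and $\mathbf{e}$ are the residuals. The LMM in animal breeding can contain other random effects such as maternal effects as well, but to illustrate the basic idea, I focus on the simple animal model (3.1). If we assume that both $\mathbf{a}$ and $\mathbf{e}$ are multivariate normally distributed, where $\mathbf{a} \sim \mathcal{N}(\mathbf{0}, \mathbf{A}\sigma_a^2)$ and $\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}\sigma_e^2)$, the BLUP for $\mathbf{a}$ can be solved via Henderson's mixed model equations (MME), i.e.

$$
\begin{pmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z}_a \\
\mathbf{Z}_a'\mathbf{X} & \mathbf{Z}_a'\mathbf{Z}_a + \frac{\sigma_e^2}{\sigma_a^2}\mathbf{A}^{-1}
\end{pmatrix}
\begin{pmatrix} \boldsymbol{\beta} \\ \mathbf{a} \end{pmatrix}
=
\begin{pmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}_a'\mathbf{y} \end{pmatrix}
\tag{3.2}
$$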

The relatedness of the animals whose breeding values are to be estimated is given by the kinship matrix $\mathbf{A}$ derived from the pedigree (see e.g. LYNCH and WALSH 1998).

Nowadays, since genome-wide dense SNPs are available for typing many farm animals, the kinship between each pair of individuals can be derived by comparing their DNA information directly. Instead of $\mathbf{A}$, one can calculate a genomic kinship matrix $\mathbf{G}$. A commonly used $\mathbf{G}$ matrix is the IBS-like matrix proposed by VANRADEN (2008), adjusted for allele frequencies at each SNP. If we construct an incidence matrix $\mathbf{Z}$ for $n$ individuals and $p$ SNPs, so that $\mathbf{Z}$ has $n$ rows and $p$ columns, then by scaling the codings in $\mathbf{Z}$ using allele frequencies, we obtain the $\mathbf{G}$ matrix (YANG et al. 2010a). We have a general random effects model for genomic evaluation,

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e} \tag{3.3} $$

where $\mathbf{u}$ are the allele substitution effects for each SNP, and $\mathbf{G} \propto \mathbf{Z}\mathbf{Z}'$. If we assume $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}\sigma_u^2)$, (3.3) is a GBLUP model (MEUWISSEN et al. 2001), which is equivalent to a ridge regression. If we assign different variance components or “weights” as hyper-parameters to different SNPs, (3.3) becomes a BayesA-like model (MEUWISSEN et al. 2001) or the DHGLM (RÖNNEGÅRD and LEE 2010; SHEN et al. 2011a). There are quite a few other models (see Chapter 2), but basically all share the form of (3.3) to model the mean of $\mathbf{y}$. Properly re-weighting the markers along the genome has the potential to improve prediction of breeding values compared to GBLUP (see e.g. PSZCZOLA et al. 2011; SHEN et al. 2012a), because the distribution of the SNP effects contributing to a particular trait is usually much more heavy-tailed than Gaussian. The key to improving the genomic evaluation model, as well as its use in QTL mapping, is variable selection (or prioritization), i.e. adding more weight to the functional QTL.
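As a concrete illustration of building $\mathbf{G}$ from scaled genotype codings, the sketch below follows the first method of VanRaden (2008): genotypes coded 0/1/2 are centred by twice the allele frequency and the cross-product is divided by $2\sum_j p_j(1-p_j)$. The function name and the toy genotypes are hypothetical, allele frequencies are simply estimated from the sample, and at least one SNP is assumed to be polymorphic so that the scaling constant is non-zero.

```python
def genomic_kinship(genotypes):
    """G = W W' / (2 * sum_j p_j (1 - p_j)), with W = genotypes - 2 p_j."""
    n, p = len(genotypes), len(genotypes[0])
    # Sample allele frequency of the counted allele at each SNP.
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(p)]
    W = [[row[j] - 2.0 * freqs[j] for j in range(p)] for row in genotypes]
    scale = 2.0 * sum(f * (1.0 - f) for f in freqs)
    return [[sum(W[a][j] * W[b][j] for j in range(p)) / scale
             for b in range(n)] for a in range(n)]


# Toy data: 4 individuals (rows) typed at 3 SNPs (columns), coded 0/1/2.
genotypes = [[0, 1, 2],
             [1, 1, 0],
             [2, 0, 1],
             [1, 2, 1]]
G = genomic_kinship(genotypes)
```

Because the columns are centred with the sample allele frequencies, every row of the resulting matrix sums to zero, so relationships are expressed relative to the average of the genotyped sample.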

Figure 3.1 shows an example comparing the shrinkage magnitudes of three different methods for re-weighting markers, where the heteroscedastic effects model (HEM; SHEN et al. 2012a) shrinks the small effects more strongly than the ridge regression, and the LASSO (with 10-fold cross validation) shrinks most of the small effects to exactly zero. When the genetic effects are normally distributed, HEM does not have an advantage compared to ridge regression, but for genetic effects with a skewed distribution, stronger shrinkage estimates have better predictive power (SHEN et al. 2012a). Thus, no single model is best for all traits. In order to capture the genetic information well and predict the phenotypic outcome, proper variable selection routines should be chosen considering the nature of the phenotype. However, all these methods are purely “mathematical”, ignoring the biological information underlying each typed marker. Variable selection and shrinkage with respect to e.g. gene annotation information would be useful in future studies (see Chapter 8).

Figure 3.1. Comparison of SNP effect estimates from ridge regression (RR), heteroscedastic effects model (HEM) and LASSO. The blue cloud and red scatters compare HEM and LASSO estimates, respectively, against RR estimates. The analyzed quantitative trait is days to flowering time under long day (18°C, 16 hrs daylight) published by ATWELL et al. (2010). 167 inbred lines were phenotyped for this trait, where a 250K SNP array was used for genotyping, and 216 130 SNPs were available for analysis.


PART II: SUMMARY OF PAPERS


4. Implementing Hierarchical Generalized Linear Models (Paper I)

“A unified framework is provided for viewing and extending many existing methods.”

—— Youngjo Lee & John A. Nelder1

Hierarchical generalized linear models (HGLMs; LEE and NELDER 1996), implemented in the R (R DEVELOPMENT CORE TEAM 2011) package hglm (RÖNNEGÅRD et al. 2010), are a fundamental tool throughout this thesis. The original major advantage of HGLMs compared to normal/generalized linear mixed models is the ability to fit non-normal random effects. However, because of the flexibility in the fitting algorithm and its internal connection with Henderson's mixed model equations, our implementation is additionally capable of: 1. Estimating variance components when we have correlated random effects; 2. Including fixed effects in a model for the residual variance. In fact, the second point has been extended to model any variance component so that HGLMs can be used as a powerful tool in both multiple QTL mapping and genomic evaluation (RÖNNEGÅRD and LEE 2010; SHEN et al. 2011a, 2012a) (Paper III & VI). The HGLMs provide a general unified framework in statistics that can be applied in many random effects problems. All the papers that this thesis is based on, except Paper IV, have used the hglm package for different purposes.

The story of HGLMs starts from the normal linear mixed model (LMM) as follows, where $\boldsymbol{\beta}$ and $\mathbf{u}$ are the fixed and random effects, respectively, and the distribution of the response $\mathbf{y}$ is determined by $\boldsymbol{\beta}$ and the variance components $\boldsymbol{\theta} = (\sigma_u^2, \sigma_e^2)'$.

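$$ \mathbf{y} \mid \boldsymbol{\beta}, \mathbf{u}, \boldsymbol{\theta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u}, \mathbf{I}\sigma_e^2) \tag{4.1} $$

$$ \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}\sigma_u^2) \tag{4.2} $$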

so that the variance-covariance matrix of $\mathbf{y}$ is

$$ \mathrm{Var}(\mathbf{y}) = \mathbf{Z}\mathbf{Z}'\sigma_u^2 + \mathbf{I}\sigma_e^2 \tag{4.3} $$

If we define $\mathbf{A} = \mathbf{Z}\mathbf{Z}'$, this implies that a linear mixed model with correlated random effects, e.g. the animal model, can be re-formulated as an ordinary linear mixed model by decomposing the correlation/relationship matrix $\mathbf{A}$. For fitting random effects models, at the time of writing, hglm is the only package in R that allows an arbitrary user-defined design matrix $\mathbf{Z}$ for the random effects.
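The re-formulation can be illustrated with a small sketch. A Cholesky factorization is used here as one possible decomposition of $\mathbf{A}$ (an assumption of this example, not a statement about what any particular package does): if $\mathbf{A} = \mathbf{L}\mathbf{L}'$ and $\mathbf{u}^* \sim \mathcal{N}(\mathbf{0}, \mathbf{I}\sigma_u^2)$, then $\mathbf{L}\mathbf{u}^*$ has covariance $\mathbf{A}\sigma_u^2$, so $\mathbf{L}$ can serve as the design matrix $\mathbf{Z}$ of independent random effects. The toy relationship matrix below is hypothetical.

```python
from math import sqrt


def cholesky(A):
    """Lower-triangular L with L L' = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = sqrt(s) if i == j else s / L[j][j]
    return L


# Hypothetical relationship matrix: two unrelated parents and one offspring.
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
Z = cholesky(A)   # design matrix for independent random effects

# Check that Z Z' reproduces A, as required by (4.3).
ZZt = [[sum(Z[i][k] * Z[j][k] for k in range(3)) for j in range(3)]
       for i in range(3)]
```

When individuals have repeated records or missing phenotypes, the same idea applies with the incidence matrix multiplied by the factor, i.e. $\mathbf{Z} = \mathbf{Z}_a\mathbf{L}$.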

1 Page 800, LEE and NELDER (1996)

Before digging into the algorithm implemented in the hglm package, I illustrate the likelihood theory underlying the models. First of all, there are several reasons for introducing random effects into a statistical model, the most fundamental of which is to predict unobservables. The classic Fisher likelihood (FISHER 1922; EDWARDS 1972; PRATT 1976; PAWITAN 2001) was designed for estimation but not prediction, so that when unobservable/uncertain/random factors exist and need to be predicted, the classic likelihood theory becomes powerless. With the fast development of computing tools, another school of thought in statistics happens to be able to do such predictions for random effects, i.e. the Bayesian school. One of the methods that Bayesians utilize is Markov chain Monte Carlo (MCMC), which allows sampling posterior distributions for random components in the model so that further inference can be done.

When $\mathbf{u}$ is normal, the flexibility of HGLMs for handling arbitrary distributions of $\mathbf{y}$ is the same as that of generalized linear mixed models (GLMMs; see BRESLOW and CLAYTON 1993). The fitting procedure is relatively straightforward since HGLMs can be formulated as inter-connected GLMs (LEE and NELDER 2001; LEE et al. 2006). Interestingly, the inter-connected GLMs include weighted gamma GLMs for estimating variance components or dispersion parameters in HGLMs. The flow of the HGLM algorithm is demonstrated in Figure 4.1. Since each part of the estimation procedure can be executed as a GLM, the name hierarchical GLMs makes more sense when described in this way.

Figure 4.1. An illustration of the iterative weighted least squares (IWLS) algorithm for fitting HGLMs based on a normal linear mixed model (LMM). VC = variance components; MME = mixed model equations; Coef. = coefficients (fixed and random effects); GLM = generalized linear model; LRT = likelihood ratio test.

When $\mathbf{y}$ is non-normal, the algorithm simply applies a link function as a GLM does (MCCULLAGH and NELDER 1989) during the step of solving the MME. The essential difference between HGLMs and GLMMs is the flexibility of the distribution of the random effects. For non-normal random effects, in order to apply a link function for the random effects, the original LMM can be re-formulated as an augmented model with response (LEE and NELDER 2001; LEE et al. 2006)

$$ \mathbf{y}_a = \begin{pmatrix} \mathbf{y} \\ \boldsymbol{\psi} \end{pmatrix} \tag{4.4} $$

where in the estimation procedure $\boldsymbol{\psi} = E[\mathbf{u}] = \mathbf{0}$ if the random effects are assumed to be normally distributed with a zero mean. Generally, viewing the h-likelihood estimation as an augmented GLM, we have $E[y_i] = \mu_i$, $\mathrm{Var}(y_i) = \phi_i V(\mu_i)$, $E[\psi_i] = u_i$, and $\mathrm{Var}(\psi_i) = \lambda_i V_a(u_i)$, where $V(\cdot)$ and $V_a(\cdot)$ are GLM variance functions, and the $\phi_i$'s and $\lambda_i$'s are the dispersion parameters, which are the variance components in LMMs. In cases where either the response or the random effects are non-normal, such an augmented response is replaced by the adjusted response

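$$ \mathbf{z}_a = \begin{pmatrix} \mathbf{z} \\ \boldsymbol{\zeta} \end{pmatrix} \tag{4.5} $$

where the elements are

$$ z_i = \eta_i + (y_i - \mu_i)\frac{\partial \eta_i}{\partial \mu_i} \tag{4.6} $$

and

$$ \zeta_i = v_i + (\psi_i - u_i)\frac{\partial v_i}{\partial u_i} \tag{4.7} $$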

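In the linearization equations (4.6) and (4.7), $\boldsymbol{\eta}$ and $\mathbf{v}$ are the linear predictors for the original response $\mathbf{y}$ and random effects $\mathbf{u}$, respectively. Namely, $\boldsymbol{\eta} = g(\boldsymbol{\mu}) = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{v}$ and $g_a(\mathbf{u}) = \mathbf{v}$, where $g(\cdot)$ and $g_a(\cdot)$ are two link functions. Constructing the model matrix as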

$$ \mathbf{T} = \begin{pmatrix} \mathbf{X} & \mathbf{Z} \\ \mathbf{0} & \mathbf{I} \end{pmatrix} \tag{4.8} $$

the effects can be estimated by iterative weighted least squares (IWLS) for the GLM

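$$ \mathbf{T}'\boldsymbol{\Sigma}^{-1}\mathbf{T} \begin{pmatrix} \boldsymbol{\beta} \\ \mathbf{v} \end{pmatrix} = \mathbf{T}'\boldsymbol{\Sigma}^{-1}\mathbf{z}_a \tag{4.9} $$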

where $\boldsymbol{\Sigma} = \boldsymbol{\Gamma}\mathbf{W}^{-1}$ with $\boldsymbol{\Gamma} = \mathrm{diag}(\boldsymbol{\Phi}, \boldsymbol{\Lambda})$, $\boldsymbol{\Phi} = \mathrm{diag}(\phi_i)$, $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_i)$, and the iterative weight matrix $\mathbf{W} = \mathrm{diag}(\mathbf{W}_0, \mathbf{W}_1)$ has elements $W_{0i} = (\partial \mu_i/\partial \eta_i)^2 V(\mu_i)^{-1}$ and $W_{1i} = (\partial u_i/\partial v_i)^2 V(u_i)^{-1}$. For a normal-normal HGLM, i.e. a linear mixed model, one can show that equation (4.9) is identical to Henderson's MME. The most interesting part of the IWLS fitting algorithm is the update of the estimates for the dispersion parameters or variance components. The $\phi_i$'s and $\lambda_i$'s can be estimated via weighted gamma GLMs, where the responses are the squared deviance residuals and the prior weights are $(1 - h_{ii})/2$. $h_{ii}$ is the $i$:th hat-value or leverage from (4.9). A gamma GLM family is fitted since one can show that the variance of the squared deviance residuals is proportional to the square of their mean, which is the only assumption we need for fitting gamma GLMs. Since GLMs are used for estimating the dispersion parameters, further modeling of the variance components becomes straightforward. At the time of writing, the most advanced normal-normal HGLM that our hglm package is capable of fitting can be formulated as

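$$ \mathbf{y} \mid \boldsymbol{\beta}, \mathbf{u}_k, \boldsymbol{\theta} \sim \mathcal{N}\Big(\mathbf{X}\boldsymbol{\beta} + \sum_{k=1}^{K} \mathbf{Z}_k \mathbf{u}_k,\ \mathrm{diag}(\exp(\mathbf{X}_d \boldsymbol{\beta}_d))\Big) \tag{4.10} $$

$$ \mathbf{u}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{A}_k \sigma_{u_k}^2) \tag{4.11} $$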

where $\boldsymbol{\theta} = (\boldsymbol{\beta}_d', \sigma_{u,1}^2, \ldots, \sigma_{u,K}^2)'$, and the subscript $d$ stands for “dispersion”. The other common distributions for the response variable and the random effects that can be handled by hglm, with common link functions, are listed in Table 4.1 (reproduced from Table 1 in Paper I).

Table 4.1. Commonly used distributions and link functions possible to fit with hglm.

Model name            y|u family   Link g(µ)      u family     Link ga(u)
Linear mixed model    Gaussian     identity       Gaussian     identity
Binomial conjugate    Binomial     logit          Beta         logit
Binomial GLMM         Binomial     logit          Gaussian     identity
Binomial frailty      Binomial     comp-log-log   Gamma        log
Poisson GLMM          Poisson      log            Gaussian     identity
Poisson conjugate     Poisson      log            Gamma        log
Gamma GLMM            Gamma        log            Gaussian     identity
Gamma conjugate       Gamma        inverse        Inv-Gamma    inverse
Gamma-Gamma           Gamma        log            Gamma        log
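The statement that (4.9) is identical to Henderson's MME for a normal-normal HGLM can be checked in a few lines. The sketch below assumes identity links, a single random effect term with homogeneous dispersion parameters ($\phi_i = \sigma_e^2$ and $\lambda_i = \sigma_u^2$ for all $i$) and $\boldsymbol{\psi} = \mathbf{0}$, so that $\mathbf{W} = \mathbf{I}$, $\mathbf{v} = \mathbf{u}$ and $\mathbf{z}_a = (\mathbf{y}', \mathbf{0}')'$.

```latex
\begin{align*}
\boldsymbol{\Sigma}^{-1}
  &= \begin{pmatrix} \sigma_e^{-2}\mathbf{I} & \mathbf{0} \\
                     \mathbf{0} & \sigma_u^{-2}\mathbf{I} \end{pmatrix},
\\[4pt]
\mathbf{T}'\boldsymbol{\Sigma}^{-1}\mathbf{T}
  &= \begin{pmatrix} \mathbf{X}' & \mathbf{0} \\ \mathbf{Z}' & \mathbf{I} \end{pmatrix}
     \boldsymbol{\Sigma}^{-1}
     \begin{pmatrix} \mathbf{X} & \mathbf{Z} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}
   = \frac{1}{\sigma_e^2}
     \begin{pmatrix}
       \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\
       \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \frac{\sigma_e^2}{\sigma_u^2}\mathbf{I}
     \end{pmatrix},
\\[4pt]
\mathbf{T}'\boldsymbol{\Sigma}^{-1}\mathbf{z}_a
  &= \frac{1}{\sigma_e^2}
     \begin{pmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{pmatrix}.
\end{align*}
```

Multiplying both sides of (4.9) by $\sigma_e^2$ then gives the same system as (3.2), with $\mathbf{Z}_a$ replaced by $\mathbf{Z}$, $\mathbf{a}$ by $\mathbf{u}$, $\sigma_a^2$ by $\sigma_u^2$, and $\mathbf{A}^{-1}$ by $\mathbf{I}$.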

The augmented-GLM way of presenting the MME turns out to allow much more flexibility in fitting different kinds of sophisticated hierarchical models, especially when GLMs are used for estimating variance components. For instance, in Paper III, we model the variance component for genetic random effects, i.e. $\sigma_u^2$ in (4.2), instead of the residual variance component $\sigma_e^2$, using a second-layer random effects model, so that the model becomes a particular double HGLM (DHGLM) for genomic data, which is different from the DHGLM described in the literature (LEE and NELDER 2006; LEE et al. 2006).

Together with professor Yurii Aulchenko, based on the hglm package, we have implemented the function polygenic_hglm as an alternative for the function polygenic in the current version of GenABEL, a popular R package for GWAS (AULCHENKO et al. 2007b). polygenic_hglm estimates the polygenic effects based on the IBS matrix more efficiently than the original numerical algorithm. Since REML estimation is done by hglm, polygenic_hglm also produces standard error estimates for the included fixed effects, which makes it possible to test covariates in a polygenic effects model.

The hglm package implements a unified statistical inference framework for random effects models. Its capacity in modeling variance components is fairly flexible, so it has great potential in large-scale genetics studies as well as in other scientific fields.

5. Quantitative Trait Loci Interval Mapping

“The generation of such high-density maps is not possible for a majority of species in practice... a marker analysis cannot unambiguously separate the genetic effects of a QTL from the recombination fraction between the markers and QTL.”

—— Rongling Wu, Chang-Xing Ma & George Casella1

Nowadays mapping potentially functional loci is often done by simply associating a large number of SNPs to different complex traits in a population (GWAS). Such an association study strategy has become very popular during the last 4-5 years. However, although dense marker maps have been developed, the technique of QTL interval mapping in experimental crosses is still useful and should never be abandoned. One reason is that dense marker maps are certainly not available for all species. Another reason is that we have definitely not obtained complete information from the DNA sequence, nor have we fully understood the haplotypes along the genome.

The background theory of QTL interval mapping is introduced in Chapter 2. Here, I demonstrate the contributions in Paper II & IV that are related to interval mapping techniques.

5.1 Variance component QTL model (Paper II)

Paper II (SHEN et al. 2011b) extended an “old” idea in QTL analysis that the uncertainty in genotypes should be properly considered when modeling the genetic effects as either fixed effects (ELSTON and STEWART 1971; MORTON and MACLEAN 1974; LANDER and BOTSTEIN 1989) or random effects (SCHORK 1993; KRUGLYAK and LANDER 1995).

When fitting the QTL effects as fixed effects, the distribution method (full likelihood method taking the genotype uncertainty into account) has been proved to be approximated well using a simple linear regression on the genotype probabilities (HALEY and KNOTT 1992), which has become a widely used tool for QTL mapping. For instance, Paper IV provided a convenient interface for using such a regression method (NELSON et al. 2011). However, when modeling the QTL effects as random, there has not been a universal solution to integrate the information about uncertain genotypes into the analysis. The difficulty comes from the “correlation matrix” inferred by combining the genetic marker information and the pedigree structure, i.e. the IBD (identity-by-descent) matrix. The usual way of using the IBD matrix is to calculate the average amount of shared alleles between relatives and plug it into a linear mixed model as the correlation of random effects, referred to as the expectation method. But the expectation method throws away all the information except the mean value in the distribution of the IBD matrix. For small human full-sib families, previous studies have derived the full likelihood function considering the uncertainty in the IBD matrix (XU 1996; GESSLER and XU 1996). This is doable because the family sizes are small. Unfortunately, when one moves from human families to an animal pedigree and from full-sibs to F2 intercross designs, with possible inbreeding as well, it is almost impossible to analytically derive the distribution of a big complex IBD matrix.

1 Page 223, WU et al. (2007)

Statistical inference for the random QTL effects is done by testing the corresponding variance component. Restricted maximum likelihood (REML) is used to correct bias in variance component estimation. Let us denote the phenotype vector as $\mathbf{y}$, the IBD matrix as $\boldsymbol{\Pi}$, and the other parameters in the random effects model as $\boldsymbol{\theta}$. The traditional expectation method conducts likelihood-ratio tests (LRT) through the likelihood

$$L_E = L(\boldsymbol{\theta} \mid \mathbf{y}, E[\boldsymbol{\Pi}]) \tag{5.1}$$

Often people just call E[Π] the IBD matrix, regardless of the uncertainty in the matrix itself. Instead, the distribution method utilizes the joint likelihood for both θ and Π, so that the inference on θ is done directly through its marginal likelihood

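$$
\begin{aligned}
L_D = L(\boldsymbol{\theta} \mid \mathbf{y}) &= \sum_{\boldsymbol{\Pi}} L(\boldsymbol{\theta}, \boldsymbol{\Pi} \mid \mathbf{y}) \\
&= \sum_{\boldsymbol{\Pi}} L(\boldsymbol{\theta} \mid \mathbf{y}, \boldsymbol{\Pi})\, P(\boldsymbol{\Pi}) \\
&= E_{\boldsymbol{\Pi}}\left[L(\boldsymbol{\theta} \mid \mathbf{y}, \boldsymbol{\Pi})\right]
\end{aligned}
\tag{5.2}
$$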

Therefore, from equations (5.1) and (5.2), we clearly see that the exact way of constructing the likelihood function is to average all the possible likelihood functions over the probability space of Π.

As suggested by earlier studies (XU 1996), one can apply a Monte Carlo sampling strategy to the probability space of Π and obtain an empirical realization of the IBD matrix distribution. Drawing m imputes of Π, the marginal likelihood of the parameters, L_D, can be approximated as

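$$L_D(\boldsymbol{\theta} \mid \mathbf{y}) \approx \frac{1}{m} \sum_{i=1}^{m} L(\boldsymbol{\theta} \mid \mathbf{y}, \boldsymbol{\Pi}_i) \tag{5.3}$$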

according to (5.2). However, calculating these imputed likelihood functions is not an easy task, because the likelihood functions generally have extremely small values, even at their maxima, which are beyond the numeric precision of current computers. Nor can one simply calculate the average of the corresponding log-likelihoods and transform back, since the logarithm of an average is not the average of the logarithms. Hence, one contribution of Paper II is to derive and implement a Newton-Raphson-based EM algorithm for solving this problem. We found that the derived algorithm uses the log-likelihood values of the individual imputes, for which closed-form solutions are already available (HARVILLE 1977) (see Paper II for details).
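The numerical issue can be illustrated with a minimal Python sketch. It is not the Newton-Raphson-based EM algorithm of Paper II; it only uses the standard log-sum-exp device to evaluate the logarithm of the average in (5.3) at a fixed θ. The log-likelihood values and the function name `log_mean_exp` are hypothetical.

```python
import math

def log_mean_exp(log_values):
    """Return log(mean(exp(v) for v in log_values)), computed on the log scale."""
    peak = max(log_values)
    # After subtracting the peak, the largest term is exp(0) = 1,
    # so the sum cannot underflow to zero.
    scaled_sum = sum(math.exp(v - peak) for v in log_values)
    return peak + math.log(scaled_sum / len(log_values))

# Hypothetical log-likelihoods log L(theta | y, Pi_i) for m = 3 imputed IBD matrices.
imputed_loglik = [-1050.0, -1052.0, -1049.5]

# Direct average of the likelihoods: every exp() underflows to 0.0 in double precision.
naive_average = sum(math.exp(v) for v in imputed_loglik) / len(imputed_loglik)

# Logarithm of the averaged likelihood, i.e. the log of the right-hand side of (5.3).
log_LD = log_mean_exp(imputed_loglik)

# Averaging the log-likelihoods gives a different quantity
# (never larger than log_LD, by Jensen's inequality).
mean_loglik = sum(imputed_loglik) / len(imputed_loglik)

print(naive_average, log_LD, mean_loglik)
```

The sketch only shows why the likelihoods must be handled on the log scale; maximizing the averaged likelihood over θ is the part addressed by the algorithm in Paper II.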


In Paper II, the performance of the distribution method compared to the expectation method was examined in two small examples and also using real experimental data. Interestingly, the comparison on real pig intercross data (ANDERSSON et al. 1994) showed better QTL mapping precision for the distribution method. This is another finding of Paper II that contributes to the literature. Figure 5.1 (reproduced from Figure 3b,c in Paper II) compares the maximized REML likelihood function values from both methods at a putative QTL with rather low marker information (see Figure 4 in Paper II). The two methods gave similar likelihood values. However, when a causal QTL exists, the distribution method has better power. Even when no QTL exists at all, the distribution method has a significantly lower tendency to generate false positives (p-value = 2.3 × 10^−14 from a Wilcoxon test for this particular simulation, testing whether the log-likelihood values from the two methods differ).

Figure 5.1. Simulation results for power of interval mapping using the distribution method compared to the expectation method. A QTL was simulated between two flanking markers on pig chromosome 6 (see Paper II for details about the data). 1000 simulations were executed for comparing the log-likelihoods from the expectation and the distribution methods. The points above/below the diagonal are in red/blue, indicating that the distribution method has larger power than the expectation method (left panel), or that the distribution method has a lower false positive rate (FPR) than the expectation method (right panel). The numbers in color show the corresponding percentages of the sets of points.

The study in Paper II gives us a better understanding of how genotype uncertainty affects the classic variance component QTL analysis. The difference between the distribution method and the traditional expectation method might not be substantial in many cases, especially when the population size is sufficiently large. However, for partially informative markers, the distribution method tends to improve the profile of a QTL scan, so that it contributes to QTL fine mapping.


5.2 QTL regression model (Paper IV)

In contrast to the theoretical research in Paper II, Paper IV contributes to the QTL interval mapping technique by providing an application tool. The idea is to use Karl Broman’s qtl package (BROMAN 2003) in R to perform QTL analysis in outbred line crosses. However, R/qtl is designed for intercrosses of inbred lines and was not developed for outbred line crosses. The solution is to import pre-calculated genotype probabilities from outbred line cross data into a fake R/qtl object. Thereafter, one can use R/qtl as a “slave” to do all the jobs, including 1D/2D QTL mapping and permutation tests, which have already been efficiently implemented in R/qtl.

The qtl.outbred routine works because the underlying statistical models for analyzing inbred and outbred line crosses are identical. No matter how the genotype probabilities are inferred for each chromosomal locus, a simple linear regression (HALEY and KNOTT 1992) is used to fit the data. Therefore, the functions in R/qtl for both 1D and 2D QTL scans, together with the permutation test, work in exactly the same way for outbred line cross data, as long as the correct genotype probabilities are given.
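The regression on genotype probabilities can be sketched in a few lines of Python with NumPy. This is an illustration of the Haley-Knott idea at a single tested position, not the code of R/qtl or qtl.outbred; the function names, the column order of the probability matrix (QQ, Qq, qq) and the simulated data are assumptions made for the example.

```python
import numpy as np

def hk_design(prob):
    """Regression covariates from an n x 3 matrix of genotype probabilities.

    The columns of prob are assumed to be P(QQ), P(Qq), P(qq) at one position.
    """
    additive = prob[:, 0] - prob[:, 2]
    dominance = prob[:, 1]
    return np.column_stack([np.ones(len(prob)), additive, dominance])

def hk_scan_position(y, prob):
    """Least-squares fit at one position; returns (coefficients, LOD score)."""
    X = hk_design(prob)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss_full = float(np.sum((y - X @ beta) ** 2))
    rss_null = float(np.sum((y - y.mean()) ** 2))   # intercept-only model
    lod = 0.5 * len(y) * np.log10(rss_null / rss_full)
    return beta, lod

# Simulated example: uncertain genotypes at one locus with an additive effect.
rng = np.random.default_rng(2012)
n = 200
prob = rng.dirichlet([1.0, 1.0, 1.0], size=n)            # each row sums to 1
genotype = np.array([rng.choice(3, p=row) for row in prob])
true_additive = np.array([1.0, 0.0, -1.0])[genotype]
y = 10.0 + 0.8 * true_additive + rng.normal(scale=1.0, size=n)

beta, lod = hk_scan_position(y, prob)
print(beta, lod)
```

Only the probabilities enter the fit, which is why it does not matter whether they were computed for an inbred or an outbred cross.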

Table 5.1. Functions in the R package qtl.outbred.

Function    Description
calc.prob   Calculating genotype probabilities using the triM algorithm
impo.prob   Importing calculated genotype probabilities to R/qtl

The qtl.outbred package consists of two main functions (Table 5.1, reproduced from Figure 1 in Paper IV). The function impo.prob is necessary for any analysis, since it does the trick of turning R/qtl into a “slave”. First of all, a dataset for R/qtl should be prepared in its required format, for instance, an Excel sheet. R/qtl handles F2 inbred line cross data, so a pedigree structure needs to be given, and the same holds for F2 outbred line crosses. In the Excel sheet, the user just needs to insert the pedigree structure as if it were an inbred line cross, and the remaining fake genotypes can be given arbitrarily. This creates a data format that R/qtl can read into R, with the correct number of individuals in each generation. The R/qtl package itself is capable of calculating genotype probabilities for inbred line crosses, and the results are stored as a component in the list of a qtl object. impo.prob in qtl.outbred creates such a component for outbred line cross data.

Certainly, impo.prob requires pre-calculated genotype probabilities from other existing tools. The package can import the output format of the popular web-based tool GridQTL (SEATON et al. 2006). Besides, qtl.outbred provides the alternative of using the triM algorithm, implemented in the C++ software cnF2freq (NETTELBLAD et al. 2009), to calculate genotype probabilities for outbred line crosses. The function calc.prob in the qtl.outbred package does this calculation as a user-friendly R interface to the C++ software.

The package qtl.outbred is a useful tool that provides a simple and fast interface in R for QTL analyses in outbred populations. Using qtl.outbred, one is also able to produce neat QTL scan results and nice figures in the style of R/qtl (see Figure 1 in Paper IV).


6. Fitting The Entire Genome

“Because of the high dimensionality of the model, it violates the usual rule of parsimony in model fitting. Fortunately, we were able to penalize the small effects and give them negligible weights so that their inclusion should have negligible effects on the analysis.”

—— Shizhong Xu1

SINGLE-MARKER analysis is good and single-marker analysis is not good. It is good because of the convenience in statistical model fitting and testing. By focusing on a single locus on the genome, the result is easy to explain. From a practical point of view, for instance in the GWAS context, the reported single-marker p-values from a common regression model can easily be used in further meta-analysis (THOMPSON et al. 2011), which makes it possible for a consortium to conduct big studies by combining results from many groups. Single-marker analysis also gives a clear indication of where to invest, clone and potentially make drugs or GM products from. It is not good because, by looking at a limited part of the genome, the power of gene detection becomes fairly limited, since genes contribute together to a certain phenotype or even create complex biochemical networks. After a multiple testing procedure, the revealed loci explain a limited amount of phenotypic variance, far less than the heritability estimated in many other studies. This poor capacity to capture genetic variance also makes the predictive power of the identified loci rather limited.

Using a multiple regression model, one can include several loci in the same model, trying to understand the genetic effects and the underlying biology better. However, multiple linear regression has two vital drawbacks that make such analyses difficult to proceed with. First, the regression model has limited degrees of freedom, depending on the number of observations, so that too many covariates lead to over-fitting. This makes it impossible to fit the whole genome in one unified model, since there are usually many more genetic markers than individuals in the studied population. Second, statistical testing of covariates in a multiple linear regression can be strongly affected by multi-collinearity, which is common in genomic studies since a small region on a chromosome often shows a certain magnitude of linkage disequilibrium (LD) (FALCONER and MACKAY 1996). Therefore, fitting all the marker effects as random effects became a natural way to model the whole genome. For example, fitting a linear mixed model, or equivalently a ridge regression, not only saves degrees of freedom, but also deals with multi-collinearity in the model matrix.

1Page 800, XU (2003)


Denote the number of observations or individuals as n and the number of explanatory variables or genetic markers as p. There are plenty of statistical methods or perspectives to handle such p ≫ n problems in quantitative genetics, for instance, ridge regression (e.g. MALO et al. 2008), linear mixed models (GBLUP; MEUWISSEN et al. 2001), partial regression (e.g. ZENG 1993, 1994), LASSO (TIBSHIRANI 1996), Bayesian model selection (e.g. MEUWISSEN et al. 2001; XU 2003; YI and XU 2008), etc. Generally, all these methods apply some amount of shrinkage to each of the estimated effects, resulting in high-dimensional models that are useful for predicting phenotypes and potentially also for identifying QTL.

In this chapter, the two papers about whole-genome models in this thesis are summarized. Instead of the popular Bayesian routines, Paper III (SHEN et al. 2011a) showed that the DHGLM works no worse than the BayesA method (MEUWISSEN et al. 2001) and is even computationally faster. Paper VI (SHEN et al. 2012a) presented a generalized ridge regression method that is a non-iterative simplified version of the double-layer model used in Paper III, and which has a substantial computational advantage in fitting general p ≫ n problems.

6.1 Double HGLM (Paper III)

Paper III analyzed the common dataset (SZYDLOWSKI and PACZYNSKA 2011) issued for the participants of the 14th QTLMAS workshop in Poznan, Poland, 2010 (footnote 2). The simulated dataset consists of 3 226 individuals in 5 generations, where neither the phenotypic records for the 900 individuals in the F4 generation nor the true QTL coordinates and their effects were given. Two traits, one quantitative and the other binary, sharing pleiotropic QTL, were simulated. The very first analysis attempted QTL mapping for both traits using variance component models. The results were reasonable but with low precision in QTL mapping, which is not surprising for a linkage analysis. Therefore, an alternative analysis that fits a double HGLM (DHGLM; LEE and NELDER 2006) was tried instead. The DHGLM was first implemented by RÖNNEGÅRD and LEE (2010) as an extension of our hglm package.

The reported results were compared during the conference, and the performance of DHGLM was good in both QTL detection (MUCHA et al. 2011) and genomic prediction (PSZCZOLA et al. 2011) compared to the other methods. DHGLM had the best QTL mapping accuracy for the simulated quantitative trait (Figure 6.1). In Figure 6.1, the Bayesian methods (BOUWMAN et al. 2011; SUN et al. 2011; CALUS et al. 2011) performed well, as expected. BOUWMAN et al. (2011) used Gibbs sampling for variable selection (GEORGE and MCCULLOCH 1993), which is implemented in the iBay software (JANSS 2009). SUN et al. (2011) used their BayesCπ method (HABIER et al. 2011), whereas CALUS et al. (2011) used BayesC (VERBYLA et al. 2009). Besides, NETTELBLAD (2011) performed haplotype inference based on hidden Markov models (HMMs) for the multi-generation pedigree, and KARACAÖREN et al. (2011) reported their results from the GRAMMAR algorithm (AULCHENKO et al. 2007a). What can be seen is that the whole-genome methods outperform the single-marker analyses. Unlike the Bayesian methods, DHGLM, based on the extended likelihood (BJØRNSTAD

2URL: http://jay.up.poznan.pl/qtlmas2010/


1996) or h-likelihood (LEE and NELDER 1996), is deterministic in its fitting algorithm; therefore, no intensive sampling such as MCMC is required.

[Figure 6.1: bar chart. Horizontal axis: # Mapped QTL / # Reported QTL (ticks 0.2 to 1.2). Bars: Coster and Calus; Karacaören et al.; Calus et al.; Nettelblad; Sun & Dekkers; Bouwman et al.; Shen et al.]

Figure 6.1. Comparison of QTL mapping accuracy of all reported results for the simulated quantitative trait. The figure is reproduced from Sebastian Mucha’s presentation at the 14th QTLMAS workshop in Poznan, Poland, 2010. The horizontal axis shows the ratio of the number of mapped QTL to the number of reported QTL. One reported location could map more than one simulated QTL position if the simulated QTL are very close to each other.

The DHGLM used in Paper III is different from the original version given by LEE and NELDER (2006). Instead of modeling the residual variance in an LMM, we model the marker-specific genetic variance of the random effects (see also RÖNNEGÅRD and LEE 2010). Suppose we have the following random effects model for the entire genome,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{g} + \mathbf{e} \tag{6.1}$$

where y is the vector of phenotypic records, g ∼ N(0, diag(λ)) are the SNP effects, λ = (λ1, λ2, ..., λm)′ are the variances of the SNP effects, and the residuals e ∼ N(0, σ²I). In Paper III, the fixed effects β include an intercept and the sex effect to reduce the residual errors. If all elements of λ are identical, model (6.1) is just an ordinary LMM, from which one can obtain GBLUP (MEUWISSEN et al. 2001). Instead of assigning different prior distributions to the variance of each SNP effect (as the Bayesian methods do), we further model the effect-specific variance using a second layer of random effects model, namely another layer of HGLM,

$$\log \boldsymbol{\lambda} = \mathbf{1}a + \mathbf{b} \tag{6.2}$$


with an intercept a and normally distributed random effects b ∼ N(0, σ_b² I) as the linear predictor. In the fitting algorithm (see Paper III), model (6.2) is actually fitted as a gamma GLMM. As we have already seen in Chapter 4, the HGLM algorithm estimates the fixed and random effects by solving Henderson’s MME, which is an LM itself (e.g. equation 4.9), and it estimates the variance components or dispersion parameters using gamma GLMs. Therefore, simply by iterating several LMs and GLMs, the algorithm quickly converges to the ML or REML estimates of the parameters in the DHGLM. At convergence, the inference for each part of the DHGLM is based on the h-likelihood

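$$
\begin{aligned}
h(\mathbf{y}, \mathbf{g}, \mathbf{b} \mid \boldsymbol{\beta}, \sigma^2, a, \sigma_b^2)
&= \log f(\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{g}, \sigma^2) + \log f(\mathbf{g} \mid a, \mathbf{b}) + \log f(\mathbf{b} \mid \sigma_b^2) \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{g})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{g}) \\
&\quad - \frac{1}{2}\sum_{j=1}^{m}\log\left(2\pi e^{a + b_j}\right) - \frac{1}{2}\sum_{j=1}^{m}\frac{g_j^2}{e^{a + b_j}} \\
&\quad - \frac{m}{2}\log(2\pi\sigma_b^2) - \frac{1}{2\sigma_b^2}\mathbf{b}'\mathbf{b}
\end{aligned}
$$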

where j is the marker index, m is the number of markers, and n is the number of individuals. The h-likelihood is simply the logarithm of the joint density of the data (observed information) and the random effects (unobserved information), given the parameters in the model. Since adjacent loci on the same chromosome are generally linked, we made the model more flexible by adding a correlation structure for b. The correlated random effects b_j follow a multivariate normal distribution with a mean of zero and a variance-covariance matrix

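$$
\mathbf{A} = \sigma_b^2
\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{m-2} & \rho^{m-1} \\
\rho & 1 & \rho & \cdots & \rho^{m-3} & \rho^{m-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{m-4} & \rho^{m-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\rho^{m-2} & \rho^{m-3} & \rho^{m-4} & \cdots & 1 & \rho \\
\rho^{m-1} & \rho^{m-2} & \rho^{m-3} & \cdots & \rho & 1
\end{pmatrix}
\tag{6.3}
$$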

where ρ ∈ [0, 1). When ρ = 0, the loci are assumed to be independent; when 0 < ρ < 1, the correlation between two loci is a monomial function of ρ (RÖNNEGÅRD and LEE 2010). In this work, ρ is pre-defined before the model fitting. This effectively smooths the marker-specific variance λ, which reduces the noise in the profile and highlights the QTL signals (Figure 6.2). A better estimate or “guess” of the correlation might improve the power of the analysis, e.g. using LD information in the genotype data to estimate such a correlation (feedback from Professor Yurii Aulchenko during the conference presentation).
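The structure of (6.3) is easy to state in code: element (i, j) equals σ_b² ρ^|i−j|. The following plain-Python sketch builds the matrix for a small number of markers; the function name `ar1_covariance` and the example values are chosen for illustration only and are not taken from Paper III.

```python
def ar1_covariance(m, rho, sigma2_b):
    """Return the m x m matrix of (6.3) as a list of rows.

    Element (i, j) is sigma2_b * rho ** |i - j|, so the correlation decays
    with the distance between marker indices.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return [[sigma2_b * rho ** abs(i - j) for j in range(m)] for i in range(m)]

# Four markers with a pre-defined correlation parameter of 0.5.
A = ar1_covariance(4, 0.5, 2.0)
for row in A:
    print(row)

# rho = 0 gives independent loci: only the diagonal is non-zero.
A_independent = ar1_covariance(3, 0.0, 1.0)
```

For a real marker map, a dense m × m list is wasteful; the sketch is only meant to make the pattern of (6.3) concrete.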

According to the extended likelihood principle (BJØRNSTAD 1996), inference on the random genetic effects g should be done through the h-likelihood, on the fixed effects β through the marginal likelihood, and on the variance components σ² and σ_b² through the adjusted profile likelihood (LEE et al. 2007). For the genomic model here, what we care about is the g term, which tells how the genetic effects differ among markers. The estimation of the genetic random effects g together with their prediction errors (PAWITAN 2001) λ can be done using the inter-connected GLM algorithm (see Paper


I & III). This provides h-likelihood estimates of the marker-specific genetic random effects, i.e. g. Such an estimation procedure based on the h-likelihood is deterministic and computationally efficient compared to Bayesian MCMC routines, without losing flexibility in model construction. Although conducting proper statistical tests for g directly from their prediction errors λ is not straightforward, a randomization test for the entire genome could be a solution (see Paper VI). In Figure 6.2, the “threshold” is given by the overall genetic variance component estimate from an LMM, which can be regarded as a genome-wide average of the marker-specific variance estimates λ.

[Figure 6.2: plot of Variance of Marker Effect (vertical axis, 0.0 to 0.4) against Marker index (horizontal axis, 0 to 10000); legend: Additive QTL, Epistatic QTL, Imprinted QTL]

Figure 6.2. Marker-specific variance profile along the genome for QTL identification of the simulated quantitative trait of the 14th QTLMAS workshop common dataset. The horizontal dashed line gives the variance component estimate of the random genetic effects if LMM/GBLUP is fitted. The colors separate the five simulated chromosomes. The simulated true QTL positions are indicated by vertical bars.

Regarding the QTL mapping precision in Figure 6.1, we reported the peaks above the overall genetic variance component estimate from the LMM as detected QTL, and some other smaller ones as suggestive. One reason that our method is able to provide clearly centered QTL coordinates could be the smoothing we add to b, which generates a smoothed curve instead of scattered dots as the marker-specific variance profile.

The Bayesian methods (see review by SORENSEN 2009) assume the additive genetic effect a of each marker to be random with a certain mean E[a] and variance Var(a). However, the marker-specific variance Var(a) is a “hyper-parameter” that is not straightforward to interpret. One Bayesian interpretation of Var(a), clarified by GIANOLA et al. (2009), is the uncertainty about the unknown additive effect a. According to this interpretation, Var(a) = 0 means that a = E[a] without uncertainty, but it does not necessarily mean a = 0, since E[a] can differ from zero. Nevertheless, for models that assume a normally distributed random effect per marker, such as DHGLM (RÖNNEGÅRD and LEE 2010; SHEN et al. 2011a), E[a] = 0. Therefore, in models such as DHGLM, markers with zero variance have no effect, and those with large variance also have large effects.


The random effects part of the linear predictor, Zg, was used directly for calculating genomic estimated breeding values (GEBVs). The results were not the best but were promising compared to other sophisticated models used by breeders (PSZCZOLA et al. 2011). We had good prediction for the binary trait of the 900 young individuals in the F4 generation, with a correlation of 0.72 between true breeding values (TBVs) and GEBVs, whereas a correlation of 0.60 between TBVs and GEBVs was obtained for the quantitative trait. The major reason for the worse performance on the quantitative trait was that a more complex genetic architecture was simulated for it. Since we did not include a maternal effect or family effect in the analysis, some genetic effects, such as imprinting QTL, were not optimally considered.

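6.2 Heteroscedastic effects model (Paper VI)

In a memoir of Charles Henderson published by the National Academies Press (VAN VLECK 1998), the author describes that, around half a century ago, Henderson’s idea was simply to modify the least squares equations by plugging in the genetic correlation matrix G, i.e.

$$
\begin{pmatrix}
\mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\
\mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1}
\end{pmatrix}
\begin{pmatrix} \boldsymbol{\beta} \\ \mathbf{u} \end{pmatrix}
=
\begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \end{pmatrix}
\tag{6.4}
$$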

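which is now well known as Henderson’s MME. Given the estimates of the variance components θ in R and G, the MME actually solve the following LMM,

$$\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{u}, \boldsymbol{\theta} \sim N(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u}, \mathbf{R}) \tag{6.5}$$

$$\mathbf{u} \sim N(\mathbf{0}, \mathbf{G}) \tag{6.6}$$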

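If R = Iσ_e² and G = Iσ_u², the LMM simplifies to (4.1) and (4.2), and the MME can be written as

$$
\begin{pmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\
\mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \lambda\mathbf{I}
\end{pmatrix}
\begin{pmatrix} \boldsymbol{\beta} \\ \mathbf{u} \end{pmatrix}
=
\begin{pmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{pmatrix}
\tag{6.7}
$$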

where λ = σ_e²/σ_u² is the ratio of the estimated residual variance to the estimated genetic variance component. As we have shown in Paper I (see also RÖNNEGÅRD and CARLBORG 2007), an LMM with correlated random effects can be solved via the simplified MME (6.7). We can see from (6.7) that Henderson modified the least squares equations by adding a common shrinkage to the random effects, which is exactly what ridge regression does. This creates a very important advantage of using random effects: more covariates than the number of individuals can be fitted in a linear model. Since Henderson’s form of the MME (6.4) has the G matrix in it, the MME can actually handle unequal shrinkage for different random effects.

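In fact, what the DHGLM in Paper III and also the Bayesian methods (e.g. MEUWISSEN et al. 2001; XU 2003) generally do is to estimate different variance components or shrinkage parameters for different markers along the genome. Pre-defining a vector λ for all the markers in the MME (6.7), with its elements placed on the diagonal, we have the “re-weighted” MME

$$
\begin{pmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\
\mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \boldsymbol{\lambda}\mathbf{I}
\end{pmatrix}
\begin{pmatrix} \boldsymbol{\beta} \\ \mathbf{u} \end{pmatrix}
=
\begin{pmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{pmatrix}
\tag{6.8}
$$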


The contribution of Paper VI is to provide a heteroscedastic effects model (HEM) that basically answers two questions:

1. Based on DHGLM, what kind of quantity can be assigned to λ?
2. How can such a huge least squares problem be solved, given that the dimension of the matrix Z′Z + λI equals the number of markers?

The DHGLM used in Paper III suggests that the “working response” for λ in the second layer (6.2) could be a useful predictor of λ, i.e. simply λ_j = σ_e²/σ_{u_j}² and

$$\sigma_{u_j}^2 = \frac{u_j^2}{1 - h_{jj}} \tag{6.9}$$

where u_j is the estimated effect for marker j from GBLUP, and h_jj is the corresponding hat value. This novel quantity answers question 1 above. Also, we know that the individual effects a can be transformed to the marker effects u as follows,

$$\mathbf{u} = \mathbf{Z}'\mathbf{G}^{-1}\mathbf{a} \tag{6.10}$$

where the pre-defined weights for the markers (6.9) go into G. Equation (6.10) gets rid of the huge matrix Z′Z + λI when solving for u, but in order to apply the quantity (6.9), huge matrix manipulations also need to be avoided in the calculation of h_jj. In the following fitting algorithm, we show that h_jj can also be obtained via a transformation technique. In the algorithm, steps 1-4 fit GBLUP and steps 5-8 fit a generalized RR. The algorithm also includes a Cholesky decomposition of the genomic relationship matrix G to simplify the computations and the transformation of leverages (step 5). Step 5 is a new derivation that answers question 2 above.

Algorithm. (Fitting the heteroscedastic effects model, HEM) Given a phenotype vector y (size n × 1) that belongs to any GLM family (MCCULLAGH and NELDER 1989), a fixed effects design matrix X (size n × k) and the SNP genotype matrix Z (size n × p), the GBLUP (ridge regression) and HEM (generalized ridge regression) can be computed as:

1. Calculate G = ZZ′, its inverse G⁻¹ and its Cholesky decomposition L s.t. LL′ = G;

2. Fit a GLMM (generalized linear mixed model) with response y, fixed effects X and random effects design matrix L. This fits the animal model as a GLMM with correlated random effects;

3. From step 2, store the estimated variance components σ_b², σ_e² and the animal effects a. Calculate λ = σ_e²/σ_b²;

4. Transform a back to the SNP effects b = Z′G⁻¹a;

5. Define

$$
\mathbf{C}_v = \frac{1}{\sigma_e^2}
\begin{pmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{L} \\
\mathbf{L}'\mathbf{X} & \mathbf{L}'\mathbf{L} + \lambda\mathbf{I}_n
\end{pmatrix}
$$

and divide the inverse of C_v into blocks

$$
\mathbf{C}_v^{-1} =
\begin{pmatrix}
\mathbf{C}_v^{11} & \mathbf{C}_v^{12} \\
\mathbf{C}_v^{21} & \mathbf{C}_v^{22}
\end{pmatrix}
$$

Define a transformation matrix M = Z′G⁻¹L. Calculate the leverage for each random SNP effect as

$$h_{jj} = 1 - \mathbf{M}_j\left(\mathbf{I}_n - \mathbf{C}_v^{22}/\sigma_b^2\right)\mathbf{M}_j'$$

where M_j is the j:th row of the transformation matrix M;

6. Define a diagonal matrix W with each diagonal element

$$w_{jj} = \frac{b_j^2}{1 - h_{jj}}$$

and update G to be G* = ZWZ′, which is vectorized in the implementation since W is diagonal. Calculate G*⁻¹, and L* s.t. L*L*′ = G*;

7. Fit a GLMM with response y, fixed effects X and random effects design matrix L*;

8. From step 7, transform the updated individual effects a back to the SNP effects b = Z′G*⁻¹a.

In this algorithm, GLMMs are estimated based on penalized quasi-likelihood (PQL) for MME (see the R package hglm and its algorithm in Paper I).
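The leverage transformation in step 5 can be checked numerically. The Python/NumPy sketch below is not the bigRR implementation: it assumes a Gaussian response, treats the variance components as known instead of estimating them with hglm, and uses hypothetical function names. It compares the leverages from the n-dimensional formula in step 5 with those obtained from the full marker-level coefficient matrix, where h_jj is the j-th diagonal element of the SNP block of the inverse divided by σ_b².

```python
import numpy as np

def leverage_direct(X, Z, sigma2_e, sigma2_b):
    """h_jj from the (k + p)-dimensional coefficient matrix (expensive for large p)."""
    k, p = X.shape[1], Z.shape[1]
    lam = sigma2_e / sigma2_b
    T = np.hstack([X, Z])
    C = T.T @ T
    C[k:, k:] += lam * np.eye(p)
    C22 = np.linalg.inv(C / sigma2_e)[k:, k:]    # block for the SNP effects
    return np.diag(C22) / sigma2_b

def leverage_transformed(X, Z, sigma2_e, sigma2_b):
    """h_jj via step 5: only n-dimensional matrices are inverted."""
    n, k = X.shape
    lam = sigma2_e / sigma2_b
    G = Z @ Z.T                                   # step 1
    L = np.linalg.cholesky(G)                     # lower triangular, L @ L.T == G
    T = np.hstack([X, L])
    Cv = T.T @ T
    Cv[k:, k:] += lam * np.eye(n)
    Cv22 = np.linalg.inv(Cv / sigma2_e)[k:, k:]
    M = Z.T @ np.linalg.solve(G, L)               # M = Z' G^{-1} L, size p x n
    inner = np.eye(n) - Cv22 / sigma2_b
    # Row-wise quadratic forms M_j inner M_j' without forming the p x p product.
    return 1.0 - np.einsum("ij,jk,ik->i", M, inner, M)

# Small simulated example with p > n so that G = ZZ' is invertible.
rng = np.random.default_rng(6)
n, p = 12, 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, p))

h_direct = leverage_direct(X, Z, sigma2_e=1.0, sigma2_b=0.5)
h_fast = leverage_transformed(X, Z, sigma2_e=1.0, sigma2_b=0.5)
print(np.max(np.abs(h_direct - h_fast)))
```

The transformed version never builds a p × p matrix, which is the point of step 5 when p is in the hundreds of thousands. Note that the sketch uses an uncentered Z; with column-centered genotypes G is singular and the inverse would need special treatment.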

I will not repeat the detailed results of Paper VI here. In short, the marker-specific shrinkage quantity (6.9) that we developed can highlight strong QTL nicely and improve genomic evaluation by a significant amount. Regarding efficiency, our R package bigRR (footnote 3) has a big computational advantage when the number of covariates greatly exceeds the number of individuals. For instance, for a dataset with 100 individuals and 1 million markers, bigRR fits a ridge regression in about 1 minute on my laptop, and the generalized ridge regression proposed in the paper is fitted in another minute.

By simplifying the DHGLM in Paper III, we sacrificed the nice smoothing (6.3) we added to the marker-specific variance components. In order to guarantee a fast and stable DHGLM algorithm with a nice QTL mapping profile, further research is required to incorporate some smoothing technique into the simplified algorithm.

3Package URL: https://r-forge.r-project.org/R/?group_id=1301


7. Beyond Plain Heritability: Variance-Controlling Genes (Paper V)

“The robustness, or uniformity, of an individual’s character is not only of great practical importance in medical genetics and food production but is also of scientific and evolutionary interest.”

—— Lars Rönnegård & William Valdar1

SIMPLE methods often give us promising results. This is not because the sophisticated methods are worse, but due to the unbeatable robustness of the simple methods, such as linear regression, Student’s t-test, ANOVA, etc. In Paper V (SHEN et al. 2012b), it was a simple test for variance heterogeneity that revealed promising and interesting results.

The publicly available Arabidopsis dataset was re-analyzed by screening the genome using the Brown-Forsythe test (BROWN and FORSYTHE 1974) for variance heterogeneity. The Brown-Forsythe test used for such a “vGWAS” is a robust statistical test for the equality of group variances of the phenotypic distribution (STRUCHALIN et al. 2010). If the phenotypic value is y_ij for individual i with genotype j, where i = 1, ..., n_j and j = 1, ..., m, the absolute deviation from the median of each genotype is

$$y^*_{ij} = |y_{ij} - \tilde{y}_j| \tag{7.1}$$

where ỹ_j is the median of the phenotypic values of the individuals that have genotype j. Performing a one-way ANOVA on y*_ij, we have the ANOVA F statistic

$$F = \frac{(N - m)\sum_{j=1}^{m} n_j\left(\bar{y}^*_{\cdot j} - \bar{y}^*_{\cdot\cdot}\right)^2}{(m - 1)\sum_{j=1}^{m}\sum_{i=1}^{n_j}\left(y^*_{ij} - \bar{y}^*_{\cdot j}\right)^2} \tag{7.2}$$

where n_j is the number of observations in group j, N is the total number of observations, ȳ*_·j is the mean of y*_ij in group j, and ȳ*_·· is the overall mean. Under the null hypothesis of equal variances, this F statistic approximately follows an F distribution with m − 1 and N − m degrees of freedom. When N is large enough, one can approximate (m − 1)F by a χ² statistic with m − 1 degrees of freedom. The nominal p-values calculated using such χ² statistics were used in the vGWAS with a Bonferroni-corrected significance threshold.
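For a single marker, the test amounts to a few lines of plain Python. The sketch below follows (7.1) and (7.2) directly; the function names and the toy phenotype values are hypothetical, and the p-value helper only covers the special case of one degree of freedom (two genotype classes), where the χ² tail probability can be written with the complementary error function.

```python
import math
from statistics import mean, median

def brown_forsythe_F(groups):
    """F statistic (7.2) for a list of phenotype lists, one list per genotype."""
    m = len(groups)
    N = sum(len(g) for g in groups)
    # Absolute deviations from each group's median, equation (7.1).
    deviations = []
    for g in groups:
        centre = median(g)
        deviations.append([abs(y - centre) for y in g])
    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / N
    between = sum(len(d) * (gm - grand_mean) ** 2
                  for d, gm in zip(deviations, group_means))
    within = sum((v - gm) ** 2
                 for d, gm in zip(deviations, group_means) for v in d)
    return (N - m) * between / ((m - 1) * within)

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Toy example: two genotype classes with equal medians-to-spread ratio broken.
low_variance = [1.0, 2.0, 3.0]
high_variance = [0.0, 4.0, 8.0]
F = brown_forsythe_F([low_variance, high_variance])
p_value = chi2_sf_1df(F)    # large-sample approximation for m = 2
print(F, p_value)
```

With only three observations per group the χ² approximation is of course not trustworthy; the toy data are only there to make the arithmetic easy to follow by hand.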

After conservatively filtering the detected associations (genomic control was applied, see Chapter 2) and database checking, the most striking example is the molybdenum transporter MOT1 in the plant (Figure 3 in Paper V). The analysis of the molybdenum content trait shows a clean genome-wide scan profile, where the only significant peak is located around the candidate gene MOT1, validated by BAXTER et al. (2008)

1Page 435, RÖNNEGÅRD and VALDAR (2011)


via a QTL analysis in an F2 intercross with further molecular experiments. There is one SNP typed right in the exon segment of MOT1. From a quantitative genetics point of view, this molecular evidence is sufficiently exciting to draw more attention to the variance-control topic.

Another clear and significant signal is a locus that we named variation in serration, or VS, which regulates the variability in leaf serration of the plant (Figure 4 in Paper V). It is interesting that the position of the VS locus coincides with an important candidate transcription factor, ANAC13 (RIECHMANN et al. 2000). However, the association peak does not seem to be caused by the candidate ANAC13, since the signal at ANAC13 is quite low and, according to the SNP data, ANAC13 is not in strong linkage with the significant segments around it. Therefore, the VS locus requires further fine-mapping or molecular experiments to be understood better.

In contrast to the traditional dissection of phenotypic variance (see Chapter 8), Paper V proposes a suggestive way of re-dissecting the phenotypic variance V_P that takes genetic variance heterogeneity effects into account. For a single locus, we dissect the phenotypic variance into the variance due to the mean shift between genotypes, V_M, the variance due to the variance heterogeneity, V_V, and the remaining residual variance V_R, i.e.

$$V_P = V_M + V_V + V_R \tag{7.3}$$

Since inbred lines are analyzed in Paper V, there is no dominance and consequently V_M = V_A. Comparing to (8.1), we have V_R ≤ V_E, where equality holds if and only if V_V = 0, i.e. V_V captures a part of V_E that is not pure stochastic noise but is actually due to genetics. For 52 quantitative phenotypes, we compared the proportions V_M/V_P and V_V/V_P for all the available SNPs across the genome (Figure 1 in Paper V). The genetic contribution to the variance of the phenotypes seems to be as common as that to the mean.

Figure 7.1. A schematic example illustrating how a high-variance allele can arise from a “restriction” type of interaction. When a high level of trait expression has a negative effect, a restrictor is in operation to reduce the trait level. In a studied population, different individuals react differently to the restriction. Due to the mixture of over-restricted and under-restricted individuals in the population, the group of individuals that have the high-expression (radical) allele will show a larger group variance than the individuals that have the normal stable allele.


As shown by both PARÉ et al. (2010) and STRUCHALIN et al. (2010), and also discussed by RÖNNEGÅRD and VALDAR (2011), variance heterogeneity at a single locus is possibly due to interaction effects (epistasis or gene-by-environment interaction). Paper V reports that the candidate gene FRI for flowering-time-related traits shows an interesting interaction with the vernalization condition (Figure 6 in Paper V). Even though FRI mainly has a strong mean-controlling effect, its variance heterogeneity is also substantial due to this interaction.

Here, instead of repeating the results of Paper V in detail, I continue the discussion in Chapter 2 on the relationship between interaction and variance heterogeneity. For example, both PARÉ et al. (2010) and STRUCHALIN et al. (2010) reported a variance-heterogeneity signal associated with C-reactive protein (CRP) levels. PARÉ et al. (2010) noticed a significant interaction between the variance-prioritized locus and BMI (p-value = 7.2 × 10^-10). However, many other factors could potentially affect CRP and might therefore interact with the detected locus. In particular, the CRP level responds to inflammation, which is usually treated with medication taken by the individuals. Since individuals react differently to the treatment, a larger variance may be created. Figure 7.1 shows a general, and maybe common, situation in which a radical allele could also become the high-variance allele because of the diverse responses to a certain “restriction”. The restriction could come from another protein, a treatment, a different environment, etc. Therefore, for quantitative traits with known factors that restrict the outcome, mapping variance-controlling loci can be particularly useful for finding interactions.

Let us again assume the two-way interaction model (2.21). Since true positive variance heterogeneity at a certain locus is a sign of interaction effects, can we predict anything about the interaction effect via the variance-controlling locus? The answer seems to be yes. Based on the theory and definitions in Paper V, Figure 7.2 shows some simulation results under model (2.21). With the main effect of the other interacting factor fixed to null, for a particular low-variance allele frequency (LAF)2, the portion of the phenotypic variance due to variance heterogeneity (VV/VP, see Paper V) is a function of the broad sense heritability (H2). This suggests that one could potentially predict H2 from VV/VP if the variance heterogeneity of a single marker comes from a two-way interaction model. Therefore, starting from loci that show variance heterogeneity could potentially reveal more of the broad sense heritability that is due to interaction.

2 LAF is the frequency of the allele that generates low phenotypic variance. Genotypic data from inbred lines were analyzed in Paper V, so the low-variance allele frequency was equivalent to the low-variance genotype frequency.


[Figure 7.2 plot: VV/VP (vertical axis, 0.0 to 1.0) against H2 (horizontal axis, 0.0 to 1.0), with one curve for each of LAF = 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2 and 0.1.]

Figure 7.2. The relationship between the portion of the phenotypic variance (VP) due to variance heterogeneity (VV) and the broad sense heritability (H2), assuming a two-way interaction model. The two-way interaction model includes the main effect of the tested locus, the main effect of the other interacting factor (set to null), and their interaction effect. The broad sense heritability is estimated as the coefficient of determination of the full interaction model. LAF = low-variance allele frequency (see Paper V).
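The idea behind Figure 7.2 can be sketched analytically under strong simplifying assumptions. Neither model (2.21) nor the simulation settings are reproduced in this chapter, so everything below is hypothetical: the tested locus x1 and the interacting factor x2 are coded 0/1 and independent, x2 has frequency 0.5 and no main effect, and VV is defined as the average excess of within-genotype variance over the residual variance. The sketch is not expected to reproduce the curves of Figure 7.2. It only shows that, for a fixed LAF, VV/VP moves together with H2.

```python
def interaction_model_portions(laf, interaction, residual_var=1.0,
                               main_effect=0.0, factor_freq=0.5):
    """Return (H2, VV/VP) for y = a*x1 + 0*x2 + c*x1*x2 + e.

    x1 = 1 marks the high-variance allele (frequency 1 - laf); x2 is an
    independent 0/1 factor with frequency factor_freq and no main effect.
    """
    p_high = 1.0 - laf
    q = factor_freq
    # Within x1 = 0 the phenotype is pure noise; within x1 = 1 the
    # unobserved factor x2 adds both a mean shift and extra spread.
    mean_high = main_effect + interaction * q
    var_high = residual_var + interaction ** 2 * q * (1.0 - q)
    v_m = laf * p_high * mean_high ** 2        # mean shift between genotypes
    v_v = p_high * (var_high - residual_var)   # ASSUMED definition of VV
    v_p = v_m + v_v + residual_var
    h2_broad = 1.0 - residual_var / v_p        # R^2 of the full interaction model
    return h2_broad, v_v / v_p


for laf in (0.1, 0.5, 0.9):
    curve = [interaction_model_portions(laf, c) for c in (0.5, 1.0, 2.0, 4.0)]
    print(laf, [(round(h, 3), round(r, 3)) for h, r in curve])
```

With these particular settings and no main effect of the tested locus, VV/VP is simply proportional to H2 for a given LAF, so H2 could be read off from VV/VP. Other codings or a non-zero main effect change the shape of the relationship.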


PART III: DISCUSSION & CONCLUSION


8. Discussion

“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”

—— Aaron Levenstein1

MANY topics and challenges have arisen in current quantitative genetics (e.g. TIWARI and SCHORK 2011). One should notice that the development of analytical tools in quantitative genetics is always driven by data coming from recent biotechnologies. Experimental design is important; however, the common routine of genetic analysis is more like “get the data and see what we can do”. “Old” methods can become less useful as soon as the data type changes. For instance, many researchers have developed methods for variance component QTL mapping, which are now rarely used in populations where dense SNPs are available for GWAS. We should therefore think more about how to deal with upcoming issues in genetic analysis. Predictive modeling is discussed in Chapter 3. Here, I will mainly discuss two big challenges in statistical genetics, missing heritability (MAHER 2008) and new data types such as rare variants (GIBSON 2012), and how they relate to the statistical methods developed in this thesis. Some potential future developments are also discussed.

8.1 Heritability: How much can we explain?

Variation in a phenotype has both genetic and environmental components. The phenotypic variance VP can be dissected as

VP = VG + VE (8.1)

where VG and VE are the genetic and environmental variance, respectively (FALCONER and MACKAY 1996). The portion of VP that is determined by genetics is called the broad sense heritability, i.e.

H2 = VG / VP (8.2)

1 Quoted by Nature Genetics 24, page 11, January 2000.

VG is further dissected into VA + VD + VI, where VA, VD, and VI are the additive, dominance, and interaction variance, respectively. In this formulation, VI includes both the gene-by-gene and gene-by-environment interactions. The additive component VA plays a central role in phenotypic prediction, so the narrow sense heritability

h2 = VA / VP (8.3)

has drawn the most attention in quantitative genetics research. It is generally known that traits related to fitness tend to be less heritable than others, e.g. litter size in pigs (h2 ≈ 5%) and egg production in fruit flies (h2 ≈ 20%). Traits like human height, mentioned above, and back fat in pigs (h2 ≈ 70%) have very high heritability but also fairly complex inheritance.
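Definitions (8.1) to (8.3) amount to simple ratios of variance components, as the following minimal sketch shows. The function name and the variance components in the example are hypothetical and chosen only for illustration.

```python
def heritabilities(v_a, v_d, v_i, v_e):
    """Broad (H2) and narrow (h2) sense heritability from variance components."""
    v_g = v_a + v_d + v_i   # genetic variance, VG = VA + VD + VI
    v_p = v_g + v_e         # phenotypic variance, equation (8.1)
    return {"H2": v_g / v_p,   # equation (8.2)
            "h2": v_a / v_p}   # equation (8.3)


# Hypothetical trait: additive effects dominate the genetic variance.
example = heritabilities(v_a=4.0, v_d=1.0, v_i=1.0, v_e=4.0)
print(example)
```

Because VA is only one part of VG, h2 can never exceed H2. The two coincide when there is no dominance or interaction variance.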

The mappable loci often explain only a small proportion of the heritability of a complex trait, whereas the whole genome explains much more. An interesting trait that has been studied since GALTON (1886) is human height. Many studies have concluded that it is highly heritable (h2 ≈ 80%), but the discovered influential genes account for only a limited amount of the variation in height; for instance, even the major locus HMGA2 explains only ∼0.3%. Based on the GBLUP model, a recent study conducted by a research group in Australia found that about 45% of the variation in human height is explained by around 300 thousand common variants typed in our genome (YANG et al. 2010a). Still, a substantial amount of heritability (h2 ≈ 35%) is not captured by the common SNPs. Taking blood lipids as another example, they are around 30% heritable; however, the major locus APOE explains only ∼0.5% of the variation in total cholesterol, and the highly significant locus CETP explains just ∼2.5% of the variation in HDL (AULCHENKO et al. 2009; TESLOVICH et al. 2010). Even all the 95 loci reported by TESLOVICH et al. (2010) together explain only about 5% of the cholesterol variation.
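The gap for human height can be written out as plain arithmetic. The sketch below uses only the approximate figures quoted in the paragraph above. The variable and function names are hypothetical.

```python
H2_NARROW_HEIGHT = 0.80    # approximate narrow sense heritability of height
EXPLAINED_BY_SNPS = 0.45   # explained jointly by common SNPs (YANG et al. 2010a)
EXPLAINED_BY_HMGA2 = 0.003 # explained by the major locus HMGA2


def missing_heritability(h2_total, h2_explained):
    """Portion of the phenotypic variance that is heritable but unexplained."""
    if h2_explained > h2_total:
        raise ValueError("explained portion cannot exceed total heritability")
    return h2_total - h2_explained


missing_after_snps = missing_heritability(H2_NARROW_HEIGHT, EXPLAINED_BY_SNPS)
missing_after_hmga2 = missing_heritability(H2_NARROW_HEIGHT, EXPLAINED_BY_HMGA2)
print(round(missing_after_snps, 3), round(missing_after_hmga2, 3))
```

Even the single best locus leaves almost all of the heritable variation unexplained, whereas fitting all common SNPs jointly roughly halves the gap.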

What are the reasons that cause heritability to be missing? For instance, we might have missed alleles with rather small effects; most causal variants are not actually observed; sex chromosomes are not fully considered; and chromosomal rearrangements and rare variants are not considered. From the statistical modeling point of view, dominance, epistasis, gene-environment interaction, etc., contribute to broad sense heritability but are not addressed in the currently popular predictive models. The HEM developed in Paper VI has the potential to fit more regressors in a computationally efficient way, so it is worth trying to include dominance and epistasis in the analysis. Rare variants might be treated as random effects, assuming a particular random effects distribution in HGLMs, but further investigation is certainly required.

The findings of variance-controlling loci in Paper V do not directly contribute to the “missing heritability”, at least when the phenotypic distribution per genotype is normal. This can easily be shown using KENT (1983)’s information gain2. The information gain is computed by comparing the model assuming constant residual variance with the one assuming residual variance heterogeneity. When the residuals follow a normal distribution, the information gain equals zero. However, as discussed in Chapter 2, variance-controlling loci can potentially contribute to broad sense heritability through interaction (e.g. PARÉ et al. 2010), which remains to be revealed by further studies.

2 The information gain is defined as the difference in FRASER (1965)’s information, which is determined by the likelihood of the model.


8.2 New data types: How to integrate information?

Current biotechnology is generating new assays for DNA and even RNA sequences. As sequencing develops, modeling such sequence data is a big challenge for statistical analysis tools. On a SNP array, what we see are the typed tags, mostly common in the population, rather than the complete sequence. When the individuals are sequenced, we not only get information on haplotypes but also a number of rare variants (BODMER and BONILLA 2008). The rare variants in a DNA segment are sparse like stars in the sky; however, instead of shining, they can be rather annoying. Rare variants do have effects, but how to aggregate these effects at the population level is a difficult question to answer.

Both common and rare variants are typed in different kinds of DNA segments, for which plenty of annotation or gene ontology (GO) information is available in databases, but it is disappointing that we do not use such information in gene mapping and predictive modeling. A hierarchical modeling framework is required to re-weight typed variants according to where they are typed, for instance, whether a particular SNP is located in an exon of an annotated gene or in an intron. Certainly, SNP arrays are not designed for typing causal variants but just tags; nevertheless, as sequencing data become available, treating all the variants the same regardless of where they are located seems to be incorrect. The DHGLM in Paper III (or a similar HGLM) gives us an option to model such biological information. In random effects models, the annotation information could enter a dispersion model for the marker-specific genetic variance component.
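One way to picture such a dispersion model is a log-linear model for the marker-specific variance, log(sigma_j^2) = gamma_0 + sum_k gamma_k z_jk, where z_jk are annotation covariates of marker j. The sketch below only evaluates this relationship for given coefficients. It is an illustration under assumed names and values, not the DHGLM of Paper III, in which the coefficients would be estimated from data.

```python
import math


def marker_variances(annotations, gamma):
    """Marker-specific variances from a log-linear dispersion model.

    annotations: one tuple of annotation covariates per marker.
    gamma: (intercept, coefficient_1, ..., coefficient_K).
    """
    variances = []
    for z in annotations:
        if len(z) != len(gamma) - 1:
            raise ValueError("each marker needs one covariate per coefficient")
        log_var = gamma[0] + sum(g * zk for g, zk in zip(gamma[1:], z))
        variances.append(math.exp(log_var))
    return variances


# HYPOTHETICAL annotation: a single indicator, 1 = exon, 0 = intron.
annotations = [(1,), (0,), (0,), (1,)]
# HYPOTHETICAL coefficients: baseline variance 0.01, four times larger in exons.
gamma = (math.log(0.01), math.log(4.0))
variances = marker_variances(annotations, gamma)
print([round(v, 4) for v in variances])
```

The log link keeps every variance positive, and each coefficient acts multiplicatively, so markers in exons receive a larger prior variance and are shrunk less than markers in introns.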

8.3 Future development

Future development of the work in this thesis would be based mainly on the novelties in Papers III, V, and VI. The double-layer random effects model (DHGLM) in Paper III has good potential to incorporate complex biological information into gene mapping and genomic evaluation. In short, genetic variants typed in different segments of DNA should be weighted differently, and this can be achieved using a DHGLM or an HGLM with structured genetic variance. Such a double-layer model can be used to model genetic variants in a sliding window and test their joint genetic effects. As sequencing data become available, sliding window analysis seems to be a way to score and test rare variants. The generalized ridge regression developed in Paper VI can be regarded as a fast approximation of the DHGLM in Paper III. Although it is not as flexible as the DHGLM or similar Bayesian methods, its computational advantage should help to make hierarchical models more useful in fitting high-dimensional genomic data. The statistical test and quantitative theory in Paper V are currently quite simple; however, the results are already striking enough for us to pay attention to variance-controlling genes. There are quite a few open issues in mapping variance-controlling QTL (vQTL; RÖNNEGÅRD and VALDAR 2011). For instance, population stratification cannot be solved by simply using a linear mixed model, and a double-layer random effects model could be a solution (YANG et al. 2011b); the contribution of vQTL to genomic selection is not yet clear; and the relationship between variance heterogeneity and heritability needs to be investigated further. Some of the topics mentioned here are ongoing at the time of writing.
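The core idea of a generalized ridge regression, namely one shrinkage parameter per marker instead of a single common one, can be sketched as follows. This is the textbook estimator b = (X'X + diag(lambda))^-1 X'y solved by plain Gaussian elimination. It is not the algorithm implemented in the bigRR package of Paper VI, and the function names and toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def generalized_ridge(x_rows, y, penalties):
    """Minimise ||y - X b||^2 + sum_j penalties[j] * b_j^2."""
    p = len(penalties)
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(p)]
    for j in range(p):
        xtx[j][j] += penalties[j]  # marker-specific shrinkage on the diagonal
    return solve_linear_system(xtx, xty)


# Toy data: two markers, four individuals (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 4.0, 2.0, 4.0]
unpenalised = generalized_ridge(X, y, [0.0, 0.0])
shrunk = generalized_ridge(X, y, [0.0, 2.0])
print(unpenalised, shrunk)
```

In a random effects formulation, a natural choice for the penalty of marker j is the ratio of the residual variance to the marker-specific genetic variance, so a marker with a small variance component is shrunk strongly towards zero.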


9. Conclusion

“In population genetics there is usually little reason for confidence that an estimate is correct even to within an order of magnitude, but reaching it faster is definitely progress.”

—— Rosalind Harding1

THIS thesis develops analytical methods that cover different branches of statistical genetics, including QTL analysis, GWAS, and genomic evaluation. More information has been taken into account so that the accuracy of QTL mapping using variance component models can be improved (Paper II). Mapping variance heterogeneity in GWAS has been shown to discover significant loci missed in earlier studies (Paper V). Double hierarchical generalized linear models and a new generalized ridge regression have been implemented and applied to high-dimensional genomic data analysis, showing promising results in both QTL mapping and genomic prediction (Papers III & VI). In conclusion, the work presented here provides novel insights into modeling genetic variance, which is the major contribution of this thesis.

Most of the tools developed in this thesis have been implemented as R packages and are available online, including a general statistical tool for fitting random effects models (package hglm, Paper I), an efficient generalized ridge regression for high-dimensional data (package bigRR, Paper VI), a double-layer mixed model for genomic data analysis (package iQTL, Paper III), a stochastic IBD matrix calculator (package MCIBD, used in Paper II), a computational interface for QTL mapping (package qtl.outbred, Paper IV), and a GWAS analysis tool for mapping variance-controlling loci (package vGWAS, Paper V).

1 In discussion on STEPHENS and DONNELLY (2000), Journal of the Royal Statistical Society, Series B 62(4), page 638.


Sammanfattning på Svenska (Summary in Swedish)

This thesis develops and evaluates statistical methods for different types of genetic analyses, including quantitative trait loci (QTL) analysis, genome-wide association studies (GWAS), and genomic evaluation. The most important result of the thesis is to provide new insights into modeling genetic variation, especially via models with random effects.

A method for QTL analysis was developed in which uncertainty in inheritance is included in the model. This model was shown to correct, to some extent, for bias in the estimates and to increase the precision of QTL mapping.

Double hierarchical generalized linear models, together with a simplified version, were developed and applied in whole-genome analysis. The methods showed high accuracy in QTL mapping and genomic prediction.

An analysis of publicly available GWAS data identified significant loci in Arabidopsis that control the phenotypic variance rather than the mean. This study strengthens the evidence for the existence of variance-controlling genes.

The studies in the thesis are accompanied by R packages that are available online. These include a statistical tool for fitting models with random effects (hglm), generalized ridge regression for high-dimensional data (bigRR), an analysis tool for genomic data (iQTL), a stochastic IBD matrix calculator (MCIBD), a computational program and interface for QTL mapping (qtl.outbred), and an analysis tool for mapping variance-controlling loci in GWAS (vGWAS).


Acknowledgements

The work in this thesis was conducted in both Uppsala (Swedish University of Agricultural Sciences) and Borlänge (Dalarna University). The Swedish Foundation for Strategic Research is acknowledged for financial support. Thanks to the National Graduate School in Scientific Computing for organizing and funding many useful courses.

This thesis would not have been possible without support from other people. First of all, I owe my deepest gratitude to my two supervisors, Lars Rönnegård and Örjan Carlborg. They have supported me in many different ways. I am grateful to Lars for guiding me into the field of statistical genetics. Lars has contributed selfless support to each of my projects, with both responsibility and talent. Not only has it always been interesting to discuss ideas and work with him, but playing table tennis and billiards with him has also been a lot of fun. I would like to thank Örjan for interesting discussions about creative scientific ideas and also some outlook on life. In particular, the nice, open-minded computational genetics group that Örjan leads has been a wonderful environment for my research.

It is a pleasure to thank the previous and current members of the computational genetics group in Uppsala. Thanks to Ronnie Nelson for working together on our published package and for his friendly smile that makes people feel happy. Thanks to Mats Pettersson for co-authoring with plenty of technical and intellectual support. Thanks to Marcin Kierczak for sharing the office with me and growing Arabidopsis for me. Thanks to François Besnier for discussions on variance component QTL analysis and his generous support on IBD matrix calculation. Thanks to Weronica Ek for challenging me with her data and traveling sleepily with me back from the US. Thanks to Lucy Crooks for interesting discussions and language support with her Queen's English. Many thanks also to Anna Johansson, Stefan Marklund, Xidan Li, Jiazhong Guo, Zheya Sheng and Muhammad Ahsan for their help with science and life.

I am indebted to my colleagues in Borlänge for both their support in statistical knowledge and the nice working environment that they have created. Thanks to Moudud Alam for sharing the office with me and helping me a lot to understand the inference theory of mixed models. Thanks to Kenneth Carling for providing funding for several conferences that I attended and also for letting me help in the skiing race in Falun. Thanks to Changli He for being my teacher since my master's studies and enlightening me with his experience. Also, many thanks to Majbritt Felleki, Richard Stridbeck, Dao Li, Xiangli Meng, Mengjie Han and Ola Nääs for their help and interesting discussions. I would also like to thank Fan Yang Wallentin, who lectured in Borlänge as well, for her instruction and guidance in my studies and for the opportunity to present my work at the statistics department of Uppsala University. Thanks to Johan Bring, who worked in Borlänge, for his helpful supervision during my master's studies. Also, thanks to Mikael Möller, who also worked in Borlänge, for interesting discussions and his technical support with LaTeX.


It is a great honor to thank Yurii Aulchenko for the opportunity to visit Rotterdam and learn from him. Yurii is a nice and talented person to work with, and is very pedagogical and patient in discussions. Not only did we discuss several ideas to work on, but he also introduced me to a lot of common knowledge about human populations and genome-wide association studies that I was not aware of. I would also like to acknowledge him for buying me coffee during our discussions in Rotterdam.

I am grateful to Freddy Fikse for co-authoring, contributing his knowledge of animal breeding and some very useful suggestions. I would also like to thank Carl Nettelblad for his support via his software for genotype probability calculation and for other interesting discussions.

Without the advice from the lecturers of the courses that I have taken, it would not have been possible for me to build up my scientific knowledge. I would like to thank Hossein Jorjani for offering his quantitative genetics course and for interesting discussions in Texas. Thanks to Youngjo Lee for lecturing at the winter course in Vålådalen on hierarchical generalized linear models, and also for a lot of interesting discussions about both research and skiing. Thanks to Maya Neytcheva for her useful lectures on numerical methods and matrix computation. Thanks to Dietrich von Rosen not only for his lectures but also for quite a few interesting discussions regarding science and other matters. Also, thanks to other lecturers, including Bruce Walsh, Martin Berggren, Jonas Lindemann, Lars Eldén, Jarrod Hadfield and Jeffrey S. Racine, for their lectures on both theory and software packages.

In addition, I would like to express my gratitude to many friends from the same master's program at Dalarna University as me, who are currently also doing their doctoral research in Sweden. These include Ying Li at the Swedish University of Agricultural Sciences; Jianxin Wei, Xingwu Zhou, and Xijia Liu at Uppsala University; and Feng Li, Yuli Liang, Ying Pang, and Chengcheng Hao at Stockholm University. All of them have been good companions on the way towards a PhD.

Thanks to Jinzhi Hu, Liang Tian, and other friends for enjoying tennis together with me. Thanks to Kaweng Ieong and his family (especially “Abı”) for being my friendly neighbors. Thanks to Liang Tian again and his family (especially “Pípi”) for being my friends in Uppsala; traveling with them has always been great fun.

Thanks to all my friends in China, especially to the couple Zirui Yu and Xiaoling Tan who, at the time of writing, are planning a trip to Sweden to attend my dissertation defense. Although the distance is long, their support has never faded.

Thanks to Linda, for sharing life and goals with me, and also for her help with my Sammanfattning på Svenska.

Finally, I am very grateful to my family and would like to dedicate all my achievements to them, especially to my grandparents for everything that they have given so selflessly.


References

ANDERSSON, L., C. HALEY, H. ELLEGREN, S. KNOTT, M. JOHANSSON, et al., 1994 Genetic mapping of quantitative trait loci for growth and fatness in pigs. Science 263: 1771–1774.

ATWELL, S., Y. S. HUANG, B. J. VILHJALMSSON, G. WILLEMS, M. HORTON, et al., 2010 Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465: 627–631.

AULCHENKO, Y. S., D.-J. DE KONING, and C. HALEY, 2007a Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177: 577–585.

AULCHENKO, Y. S., S. RIPATTI, I. LINDQVIST, D. BOOMSMA, I. M. HEID, et al., 2009 Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nature Genetics 41: 47–55.

AULCHENKO, Y. S., S. RIPKE, A. ISAACS, and C. VAN DUIJN, 2007b GenABEL: an R package for genome-wide association analysis. Bioinformatics 23: 1294–1296.

BALDING, D. J., 2006 A tutorial on statistical methods for population association studies. Nature Reviews Genetics 7: 781–791.

BAXTER, I., B. MUTHUKUMAR, H. C. PARK, P. BUCHNER, B. LAHNER, et al., 2008 Variation in molybdenum content across broadly distributed populations of Arabidopsis thaliana is controlled by a mitochondrial molybdenum transporter (MOT1). PLoS Genetics 4: e1000004.

BJØRNSTAD, J. F., 1996 On the generalization of the likelihood function and the likelihood principle. Journal of the American Statistical Association 91: 791–806.

BODMER, W., and C. BONILLA, 2008 Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics 40: 695–701.

BOUWMAN, A. C., L. L. G. JANSS, and H. C. M. HEUVEN, 2011 A Bayesian approach to detect QTL affecting a simulated binary and quantitative trait. BMC Proceedings 5(Suppl 3): S2.

BRESLOW, N. E., and D. G. CLAYTON, 1993 Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88: 9–25.

BROMAN, K. W., 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics 19: 889–890.

BROWN, M. B., and A. B. FORSYTHE, 1974 Robust tests for equality of variances. Journal of the American Statistical Association 69: 364–367.

CALUS, M. P. L., H. A. MULDER, and R. F. VEERKAMP, 2011 Estimating genomic breeding values and detecting QTL using univariate and bivariate models. BMC Proceedings 5(Suppl 3): S5.

CANTOR, R. M., K. LANGE, and J. S. SINSHEIMER, 2010 Prioritizing GWAS results: A review of statistical methods and recommendations for their application. American Journal of Human Genetics 86: 6–22.


CARLBORG, Ö., and C. S. HALEY, 2004 Epistasis: too often neglected in complex trait studies? Nature Reviews Genetics 5: 618–625.

CHA, P.-C., A. TAKAHASHI, N. HOSONO, S.-K. LOW, N. KAMATANI, et al., 2011 A genome-wide association study identifies three loci associated with susceptibility to uterine fibroids. Nature Genetics 43: 447–450.

COX, D. R., 1984 Interaction. International Statistical Review / Revue Internationale de Statistique 52: 1–31.

DEMPSTER, A. P., N. M. LAIRD, and D. B. RUBIN, 1977 Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39: 1–38.

DEVLIN, B., and K. ROEDER, 1999 Genomic control for association studies. Biometrics 55: 997–1004.

EDWARDS, A., 1972 Likelihood. Cambridge University Press, Cambridge.

ELSTON, R., and J. STEWART, 1971 A general model for the genetic analysis of pedigree data. Human Heredity 21: 523–542.

FALCONER, D. S., and T. F. MACKAY, 1996 Introduction to Quantitative Genetics. Longman: London.

FISHER, R. A., 1918 The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 52: 399–433.

FISHER, R. A., 1922 On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222: 309–368.

FRASER, D. A. S., 1965 On information in statistics. The Annals of Mathematical Statistics 36: 890–896.

GALTON, F., 1886 Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–263.

GARDNER, K. M., and R. G. LATTA, 2007 Shared quantitative trait loci underlying the genetic correlation between continuous traits. Molecular Ecology 16: 4195–4209.

GEORGE, E. I., and R. E. MCCULLOCH, 1993 Variable selection via Gibbs sampling. Journal of the American Statistical Association 88: 881–889.

GESSLER, D. D. G., and S. XU, 1996 Using the expectation or the distribution of the identity by descent for mapping quantitative trait loci under the random model. American Journal of Human Genetics 59: 1382–1390.

GIANOLA, D., G. DE LOS CAMPOS, W. G. HILL, E. MANFREDI, and R. FERNANDO, 2009 Additive genetic variability and the Bayesian alphabet. Genetics 183: 347–363.

GIBSON, G., 2012 Rare and common variants: twenty arguments. Nature Reviews Genetics 13: 135–145.

HABIER, D., R. L. FERNANDO, K. KIZILKAYA, and D. J. GARRICK, 2011 Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12.

HALEY, C., and S. KNOTT, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69: 315–324.

HARVILLE, D. A., 1977 Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72: 320–338.

HASTIE, T., R. TIBSHIRANI, and J. FRIEDMAN, 2009 The elements of statistical learning. Springer.

HENDERSON, C. R., 1953 Estimation of variance and covariance components. Biometrics 9: 226–252.

HENDERSON, C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, ON, 3rd edition.

HEWITT, J. K., and A. C. HEATH, 1988 A note on computing the chi-square noncentrality parameter for power analyses. Behavior Genetics 1: 105–108.

HILL, W. G., M. E. GODDARD, and P. M. VISSCHER, 2008 Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics 4: e1000008.

JANSS, L. L. G., 2009 iBay manual version 1.47. Janss Biostatistics, Leiden, the Netherlands.

JIMENEZ-GOMEZ, J. M., J. A. CORWIN, B. JOSEPH, J. N. MALOOF, and D. J. KLIEBENSTEIN, 2011 Genomic analysis of QTLs and genes altering natural variation in stochastic noise. PLoS Genetics 7: e1002295.

KANG, H. M., J. H. SUL, S. K. SERVICE, N. A. ZAITLEN, S.-Y. KONG, et al., 2010 Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42: 348–354.

KANG, H. M., N. A. ZAITLEN, C. M. WADE, A. KIRBY, D. HECKERMAN, et al., 2008 Efficient control of population structure in model organism association mapping. Genetics 178: 1709–1723.

KARACAÖREN, B., T. SILANDER, J. M. ÁLVAREZ-CASTRO, C. S. HALEY, and D. J. DE KONING, 2011 Association analyses of the MAS-QTL data set using GRAMMAR, principal components and Bayesian network methodologies. BMC Proceedings 5(Suppl 3): S8.

KENT, J. T., 1983 Information gain and a general measure of correlation. Biometrika 70: 163–173.

KINGSMORE, S. F., I. E. LINDQUIST, J. MUDGE, D. D. GESSLER, and W. D. BEAVIS, 2008 Genome-wide association studies: progress and potential for drug discovery and development. Nature Reviews Drug Discovery 7: 221–230.

KLEIN, R. J., C. ZEISS, E. Y. CHEW, J.-Y. TSAI, R. S. SACKLER, et al., 2005 Complement factor H polymorphism in age-related macular degeneration. Science 308: 385–388.

KRUGLYAK, L., and E. LANDER, 1995 Complete multipoint sib-pair analysis of qualitative and quantitative traits. American Journal of Human Genetics 57: 439–454.

LANDER, E., and D. BOTSTEIN, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.

LEE, Y., and J. A. NELDER, 1996 Hierarchical generalized linear models (with discussion). Journal of the Royal Statistical Society. Series B (Methodological) 58: 619–678.

LEE, Y., and J. A. NELDER, 2001 Hierarchical generalised linear models: A synthesis of generalised linear models, random-effect models and structured dispersions. Biometrika 88: 987–1006.

LEE, Y., and J. A. NELDER, 2006 Double hierarchical generalized linear models (with discussion). Applied Statistics 55: 139–185.

LEE, Y., J. A. NELDER, and M. NOH, 2007 H-likelihood: problems and solutions. Statistics and Computing 17: 49–55.

LEE, Y., J. A. NELDER, and Y. PAWITAN, 2006 Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Chapman & Hall/CRC.

LEVY, D., G. B. EHRET, K. RICE, G. C. VERWOERT, L. J. LAUNER, et al., 2009 Genome-wide association study of blood pressure and hypertension. Nature Genetics.

LIPPERT, C., J. LISTGARTEN, Y. LIU, C. M. KADIE, R. I. DAVIDSON, et al., 2011 Fast linear mixed models for genome-wide association studies. Nature Methods 8: 833–835.

LIU, X., and S. RAUDENBUSH, 2004 A note on the noncentrality parameter and effect size estimates for the F test in ANOVA. Journal of Educational and Behavioral Statistics 29: 251–255.

LYNCH, M., and B. WALSH, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Inc.

MAHER, B., 2008 The case of the missing heritability. Nature 456: 18–21.

MALO, N., O. LIBIGER, and N. J. SCHORK, 2008 Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. American Journal of Human Genetics 82: 375–385.

MCCARTHY, M. I., G. R. ABECASIS, L. R. CARDON, D. B. GOLDSTEIN, J. LITTLE, et al., 2008 Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics 9: 356–369.

MCCULLAGH, P., and J. A. NELDER, 1989 Generalized Linear Models. Chapman & Hall/CRC.

MEUWISSEN, T., B. HAYES, and M. GODDARD, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.

MORTON, N., and C. MACLEAN, 1974 Analysis of family resemblance. III. Complex segregation of quantitative traits. American Journal of Human Genetics 26: 489–503.

MUCHA, S., M. PSZCZOLA, T. STRABEL, A. WOLC, P. PACZYNSKA, et al., 2011 Comparison of analyses of the QTLMAS XIV common dataset. II: QTL analysis. BMC Proceedings 5(Suppl 3): S2.

NELDER, J. A., 1994 The statistics of linear models: back to basics. Statistics and Computing 4: 221–234.

NELSON, R. M., X. SHEN, and Ö. CARLBORG, 2011 qtl.outbred: Interfacing outbred line cross data with the R/qtl mapping software. BMC Research Notes 4: 154.

NETTELBLAD, C., 2011 Haplotype inference based on hidden Markov models in the QTL-MAS 2010 multi-generational dataset. BMC Proceedings 5(Suppl 3): S10.

NETTELBLAD, C., S. HOLMGREN, L. CROOKS, and Ö. CARLBORG, 2009 cnF2freq: Efficient determination of genotype and haplotype probabilities in outbred populations using Markov models. Lecture Notes in Bioinformatics (LNBI) 5462: 307–319.

PARÉ, G., N. R. COOK, P. M. RIDKER, and D. I. CHASMAN, 2010 On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women’s Genome Health Study. PLoS Genetics 6: e1000981.

PAWITAN, Y., 2001 In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford Science Publications.

PRATT, J. W., 1976 F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation. The Annals of Statistics 4: 501–514.

PRICE, A. L., N. J. PATTERSON, R. M. PLENGE, M. E. WEINBLATT, N. A. SHADICK, et al., 2006 Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38: 904–909.

PRICE, A. L., N. A. ZAITLEN, D. REICH, and N. PATTERSON, 2010 New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11: 459–463.

PSZCZOLA, M., T. STRABEL, A. WOLC, S. MUCHA, and M. SZYDLOWSKI, 2011 Comparison of analyses of the QTLMAS XIV common dataset. I: genomic selection. BMC Proceedings 5(Suppl 3): S1.

PURCELL, S., B. NEALE, K. TODD-BROWN, L. THOMAS, M. A. R. FERREIRA, et al., 2007 PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81: 559–575.

R DEVELOPMENT CORE TEAM, 2011 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

RIECHMANN, J. L., J. HEARD, G. MARTIN, L. REUBER, C.-Z. JIANG, et al., 2000 Arabidopsis transcription factors: Genome-wide comparative analysis among eukaryotes. Science 290: 2105–2110.

ROBINSON, G. K., 1991 That BLUP is a good thing: The estimation of random effects. Statistical Science 6: 15–32.

RÖNNEGÅRD, L., F. BESNIER, and Ö. CARLBORG, 2008 An improved method for quantitative trait loci detection and identification of within-line segregation in F2 intercross designs. Genetics 178: 2315–2326.

RÖNNEGÅRD, L., and Ö. CARLBORG, 2007 Separation of base allele and sampling term effects gives new insights in variance component QTL analysis. BMC Genetics 8: 1.

RÖNNEGÅRD, L., and Y. LEE, 2010 Hierarchical generalized linear models have a great potential in genetics and animal breeding. In Proceedings World Congress on Genetics Applied to Livestock Production, Leipzig, Germany.

RÖNNEGÅRD, L., X. SHEN, and M. ALAM, 2010 hglm: A package for fitting hierarchical generalized linear models. The R Journal 2: 20–28.

RÖNNEGÅRD, L., and W. VALDAR, 2011 Detecting major genetic loci controlling phenotypic variability in experimental crosses. Genetics 188: 435–447.

SCHORK, N. J., 1993 Extended multipoint identity-by-descent analysis of human quantitative traits: Efficiency, power, and modeling considerations. American Journal of Human Genetics 53: 1306–1319.

SEATON, G., J. HERNANDEZ, J. GRUNCHEC, I. WHITE, J. ALLEN, et al., 2006 GridQTL: A grid portal for QTL mapping of compute intensive datasets. In Proceedings of the 8th World Congress on Genetics Applied to Livestock Production, August 13-18, 2006. Belo Horizonte, Brazil.

SHEN, X., M. ALAM, F. FIKSE, and L. RÖNNEGÅRD, 2012a Fast generalized ridge regression for models including heteroscedastic effects in quantitative genetics. Manuscript.

SHEN, X., M. PETTERSSON, L. RÖNNEGÅRD, and Ö. CARLBORG, 2012b Inheritance beyond plain heritability: variance-controlling genes in Arabidopsis thaliana. Submitted.

SHEN, X., L. RÖNNEGÅRD, and Ö. CARLBORG, 2011a Hierarchical likelihood opens a new way of estimating genetic values using genome-wide dense marker maps. BMC Proceedings 5(Suppl 3): S14.

SHEN, X., L. RÖNNEGÅRD, and Ö. CARLBORG, 2011b How to deal with genotype uncertainty in variance component quantitative trait loci analyses. Genetics Research, Cambridge 93: 333–342.

SORENSEN, D., 2009 Developments in statistical analysis in quantitative genetics. Genetica 136: 319–332.

STEPHENS, M., and P. DONNELLY, 2000 Inference in molecular population genetics (with discussion). Journal of the Royal Statistical Society, Series B 62: 605–655.

STRUCHALIN, M. V., A. DEHGHAN, J. C. M. WITTEMAN, C. V. DUIJN, and Y. S. AULCHENKO, 2010 Variance heterogeneity analysis for detection of potentially interacting genetic loci: method and its limitations. BMC Genetics 11: 92.

SUN, X., D. HABIER, R. L. FERNANDO, D. J. GARRICK, and J. C. M. DEKKERS, 2011 Genomic breeding value prediction and QTL mapping of QTLMAS2010 data using Bayesian methods. BMC Proceedings 5(Suppl 3): S13.

SZYDLOWSKI, M., and P. PACZYNSKA, 2011 QTLMAS 2010: simulated dataset. BMC Proceedings 5(Suppl 3): S3.

TESLOVICH, T. M., K. MUSUNURU, A. V. SMITH, A. C. EDMONDSON, I. M. STYLIANOU, et al., 2010 Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707–713.

THOMPSON, J. R., J. ATTIA, and C. MINELLI, 2011 The meta-analysis of genome-wide association studies. Briefings in Bioinformatics 12: 259–269.

TIAN, F., P. J. BRADBURY, P. J. BROWN, H. HUNG, Q. SUN, et al., 2011 Genome-wide association study of leaf architecture in the maize nested association mapping population. Nature Genetics 43: 159–162.

TIBSHIRANI, R., 1996 Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58: 267–288.

TIWARI, H. K., and N. J. SCHORK, 2011 Grand challenges in statistical genetics/genomics methodology. Frontiers in Genetics 2: 1–2.

TURNBULL, C., S. AHMED, J. MORRISON, D. PERNET, A. RENWICK, et al., 2010 Genome-wide association study identifies five new breast cancer susceptibility loci. Nature Genetics 42: 504–507.

VALDAR, W., L. C. SOLBERG, D. GAUGUIER, S. BURNETT, P. KLENERMAN, et al., 2006 Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics 38: 879–887.

VAN VLECK, L. D., 1998 Charles Roy Henderson: A Biographical Memoir. National Academies Press, Washington D.C.

VANRADEN, P. M., 2008 Efficient methods to compute genomic predictions. Journal of Dairy Science 91: 4414–4423.

VERBYLA, K. L., B. J. HAYES, P. J. BOWMAN, and M. E. GODDARD, 2009 Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genetics Research, Cambridge 91: 307–311.

WALLIS, M., P. WATERS, and J. GRAVES, 2008 Sex determination in mammals - before and after the evolution of SRY. Cellular and Molecular Life Sciences 65: 3182–3195.

WANG, F., C.-Q. XU, Q. HE, J.-P. CAI, X.-C. LI, et al., 2011 Genome-wide association identifies a susceptibility locus for coronary artery disease in the Chinese Han population. Nature Genetics 43: 345–349.

WANG, H., Y.-M. ZHANG, X. LI, G. L. MASINDE, S. MOHAN, et al., 2005a Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics 170: 465–480.

WANG, W. Y. S., B. J. BARRATT, D. G. CLAYTON, and J. A. TODD, 2005b Genome-wide association studies: theoretical and practical concerns. Nature Reviews Genetics 6: 109–118.

WU, R., C. MA, and G. CASELLA, 2007 Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. Springer Science + Business Media, LLC.

XU, S., 1996 Computation of the full likelihood function for estimating variance at a quantitative trait locus. Genetics 144: 1951–1960.

XU, S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163: 789–801.

XU, S., 2007 An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics 63: 513–521.

YANG, J., B. BENYAMIN, B. P. MCEVOY, S. GORDON, A. K. HENDERS, et al., 2010a Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42: 565–569.

YANG, J., S. H. LEE, M. E. GODDARD, and P. M. VISSCHER, 2010b GCTA: A tool for genome-wide complex trait analysis. American Journal of Human Genetics 88: 76–82.

YANG, J., T. A. MANOLIO, L. R. PASQUALE, E. BOERWINKLE, N. CAPORASO, et al., 2011a Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 43: 519–525.

YANG, Y., O. F. CHRISTENSEN, and D. SORENSEN, 2011b Analysis of a genetically structured variance heterogeneity model using the Box-Cox transformation. Genetics Research, Cambridge 93: 33–46.

YI, N., and S. XU, 2008 Bayesian LASSO for quantitative trait loci mapping. Genetics 179: 1045–1055.

ZENG, Z.-B., 1993 Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proceedings of the National Academy of Sciences, USA 90: 10972–10976.

ZENG, Z.-B., 1994 Precision mapping of quantitative trait loci. Genetics 136: 1457–1468.
