
Copyright © 2010 by the Genetics Society of America
DOI: 10.1534/genetics.110.116426

Graph-Based Data Selection for the Construction of Genomic Prediction Models

Steven Maenhout,*,1 Bernard De Baets§ and Geert Haesaert*

*Department of Biosciences and Landscape Architecture, University College Ghent, B-9000 Gent, Belgium, §Department of Applied Mathematics, Biometrics and Process Control, Ghent University, B-9000 Gent, Belgium

Manuscript received March 8, 2010
Accepted for publication May 11, 2010

ABSTRACT

Efficient genomic selection in animals or crops requires the accurate prediction of the agronomic performance of individuals from their high-density molecular marker profiles. Using a training data set that contains the genotypic and phenotypic information of a large number of individuals, each marker or marker allele is associated with an estimated effect on the trait under study. These estimated marker effects are subsequently used for making predictions on individuals for which no phenotypic records are available. As most plant and animal breeding programs are currently still phenotype driven, the continuously expanding collection of phenotypic records can only be used to construct a genomic prediction model if a dense molecular marker fingerprint is available for each phenotyped individual. However, as the genotyping budget is generally limited, the genomic prediction model can only be constructed using a subset of the tested individuals and possibly a genome-covering subset of the molecular markers. In this article, we demonstrate how an optimal selection of individuals can be made with respect to the quality of their available phenotypic data. We also demonstrate how the total number of molecular markers can be reduced while a maximum genome coverage is ensured. The third selection problem we tackle is specific to the construction of a genomic prediction model for a hybrid breeding program where only molecular marker fingerprints of the homozygous parents are available. We show how to identify the set of parental inbred lines of a predefined size that has produced the highest number of progeny. These three selection approaches are put into practice in a simulation study where we demonstrate how the trade-off between sample size and sample quality affects the prediction accuracy of genomic prediction models for hybrid maize.

DESPITE the numerous studies devoted to molecular marker-based breeding, the genetic progress of most complex traits in today's plant and animal breeding programs still heavily relies on phenotypic selection. Most breeding companies have established dedicated databases that store the vast number of phenotypic records that are being routinely collected throughout the course of their breeding programs. These phenotypic records are, however, gradually being complemented by various types of molecular marker scores and it is to be expected that effective marker-based selection schemes will eventually allow current phenotyping efforts to be reduced (Bernardo 2008; Hayes et al. 2009). The available marker and phenotypic databases already allow for the construction and validation of marker-based selection schemes. Mining the phenotypic databases of a breeding company is, however, quite different from analyzing the data that is generated by a carefully designed experiment. Genetic evaluation data is often severely unbalanced as elite individuals are usually tested many times on their way to becoming a commercial variety or sire, while less performing individuals are often disregarded after a single trial. Furthermore, the different phenotypic evaluation trials are separated in time and space and as such, subjected to different environmental conditions. Therefore, ranking the performance of individuals that were evaluated in different phenotypic trials is usually a nontrivial task.

Animal breeders are well experienced when it comes to handling unbalanced genetic evaluation data. The best linear unbiased predictor or BLUP approach (Henderson 1975) presented a major breakthrough in this respect, especially when combined with restricted maximum-likelihood or REML estimation of the needed variance components (Patterson and Thompson 1971). Somewhat later on, this linear mixed modeling approach was also adopted by plant breeders as the de facto standard for handling unbalanced phenotypic data. The more recent developments in genomic selection (Bernardo 1995; Meuwissen et al. 2001; Gianola and van Kaam 2008) and marker-trait association studies (Yu et al. 2006) are, at least partially, BLUP-based and are therefore, in theory, perfectly suited for mining the large marker and phenotypic databases that back each breeding program. In practice, however, the unbalancedness of the available genetic evaluation data often reduces its total information content and the construction of a marker-based selection model is limited to a more balanced subset of the data.

1 Corresponding author: Department of Biosciences and Landscape Architecture, University College Ghent, Voskenslaan 270, B-9000 Gent, Belgium. E-mail: [email protected]

Genetics 185: 1463–1475 (August 2010)

As phenotypic data are available, genotyping costs limit the total number of individuals that can be included in the construction of a genomic prediction model. The best results will be obtained by selecting a subset of individuals for which the phenotypic evaluation data exhibits the least amount of unbalancedness. In this article we demonstrate how this phenotypic subset selection problem can be translated into a standard graph theory problem that can be solved with exact algorithms or less-time-consuming heuristics.

In most plant and animal species, the number of available molecular markers is rapidly increasing, while the genotyping cost per marker is decreasing. Nevertheless, as budgets are always limited, genotyping all mapped markers for a small number of individuals might be less efficient than genotyping a restricted set of well-chosen markers on a wider set of individuals. One should therefore be able to select a subset of molecular markers that covers the entire genome as uniformly as possible. We demonstrate how this marker selection problem can also be translated into a well-known graph theory problem that has an exact solution.

The third problem we tackle by means of graph theory is more specific to hybrid breeding programs where the parental individuals are nearly or completely homozygous. This implies that we can deduce the molecular marker fingerprint of a hybrid individual from the marker scores of its parents. As the phenotypic data are collected on the hybrids, genotyping costs can be reduced by selecting a subset of parental inbreds that have produced the maximum number of genetically distinct offspring among themselves. Obviously, the phenotypic data on these offspring should be as balanced as possible.

Besides solving the above-mentioned selection problems by means of graph theory algorithms, we demonstrate their use in a simulation study that allows us to determine the optimum trade-off between the number of individuals and the size of the genotyped molecular marker fingerprint for predicting the phenotypic performance of hybrid maize by means of ε-insensitive support vector machine regression (ε-SVR) (Maenhout et al. 2007, 2008, 2010) and best linear prediction (BLP) (Bernardo 1994, 1995, 1996).

SELECTING INDIVIDUALS FROM UNBALANCED PHENOTYPIC EVALUATION DATA

In most plant or animal breeding programs, all phenotypic measurements that were recorded during genetic evaluation trials are stored for future reference. The all-encompassing experimental design that generated the data is likely to be very unbalanced. The most extreme case of an unbalanced design is a disconnected design. Table 1 gives an example of a disconnected sire evaluation design taken from Kennedy and Trus (1993). The breeding values of four sires are evaluated by measuring the performance of their offspring in three different herds. Sires having offspring in different herds provide vertical connections between herds while herds containing offspring of different sires provide horizontal connections. In a perfectly balanced design, each sire would have the same number of offspring tested in each herd. In the presented scenario, however, sires s1 and s2 are disconnected from sires s3 and s4 as there is no possible path between these groups. This means that if we analyze the phenotypic data from this design with an ordinary least-squares model, contrasts involving sires that belong to the disconnected groups would be inestimable. However, if we fit a linear mixed model to the data in which we assume herds as fixed and sires as random effects, contrasts involving sire BLUPs belonging to these disconnected groups are perfectly estimable.
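The notion of connectedness used here can be checked mechanically. The following is a minimal sketch, not part of the original article: it assumes the design is given as a herd-by-sire table of offspring counts and groups sires that are linked through shared herds with a small union-find. The function name `sire_groups` is hypothetical.

```python
from collections import defaultdict


def sire_groups(counts):
    """Group sires into connected sets.

    counts[h][s] is the number of offspring of sire s evaluated in herd h.
    Two sires are linked when they share a herd; the groups returned are
    the connected components of that relation (sires as 0-based indices).
    """
    n_sires = len(counts[0])
    parent = list(range(n_sires))

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for row in counts:
        present = [s for s, c in enumerate(row) if c > 0]
        for s in present[1:]:
            parent[find(s)] = find(present[0])

    groups = defaultdict(list)
    for s in range(n_sires):
        groups[find(s)].append(s)
    return sorted(groups.values())


# Offspring counts of Table 1 (rows: herds h1..h3, columns: sires s1..s4).
table1 = [
    [3, 6, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 7, 5],
]
print(sire_groups(table1))  # [[0, 1], [2, 3]]: {s1, s2} and {s3, s4}
```

For the Table 1 design this recovers the two groups named in the text, s1 and s2 on one side and s3 and s4 on the other.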

Ignoring connectivity issues by treating sire effects as random variables is, however, not without consequence. This approach implicitly assumes that all evaluated individuals originate from the same population and as such have the same breeding value expectation. This assumption is generally not valid in animal breeding programs as the better sires are usually evaluated in the better herds (Foulley et al. 1990). A similar stratification can be observed in genetic evaluation trials performed by plant breeders where late and therefore higher-yielding individuals are generally tested in geographical regions with longer growing seasons. As a consequence, BLUP-based genomic selection routines will be less efficient, while marker-trait association studies will suffer from increased false positive rates and reduced power. A very unbalanced but nevertheless connected design will also reduce the effectiveness of marker-based selection approaches as the prediction error variance of the estimated breeding values increases substantially. Furthermore, the estimated breeding values will be regressed toward the mean and will not account for the true genetic trend.

TABLE 1

Example of a disconnected sire × herd design taken from Kennedy and Trus (1993)

      s1   s2   s3   s4
h1     3    6    0    0
h2     3    4    0    0
h3     0    0    7    5

The cell numbers indicate how many offspring of the sire pertaining to that particular column were evaluated in the herd pertaining to that particular row.


We can assume that the available data set contains unbalanced phenotypic measurements on t individuals, where t is generally a very large number. The available phenotypic data allow the breeder to try out one or more of the more recent BLUP-based genomic selection approaches without setting up dedicated trials. Given his financial limit for genotyping, he wants to select exactly p individuals from this data set. The selection of p individuals should be optimal in the sense that the precision of the BLUPs of the p breeding values that are obtained from a linear mixed model analysis of the full set of phenotypic records is superior to the precision of any other set of BLUPs with cardinality p. This optimality criterion requires a measure of precision of a subset of BLUPs obtained from a linear mixed model analysis. To introduce this criterion, we will make the general assumption that the applied linear mixed model takes the form

$$y = Xb + Zu + e, \tag{1}$$

where y is a column vector containing n phenotypic measurements on the t individuals, b is a vector of fixed nuisance effects like trial, herd, and replication effects and u is a vector containing random genetic effects for each of the t individuals. For ease of explanation, we assume that u contains only t breeding values, but the presented approach can easily be generalized to cases where u is made up from general combining abilities (GCA) and specific combining abilities (SCA) and possibly the different levels of various genotype-by-environment (G × E) interaction factors. Vector e contains n random residuals. Matrices X and Z link the appropriate phenotypic records to the effects in b and u, respectively. Furthermore we assume that we can represent the variance of u and e as

$$\operatorname{Var}\begin{bmatrix} u \\ e \end{bmatrix} = \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}.$$

G can contain an assumed covariance structure for the t individuals, typically a scaled numerator relationship matrix calculated from available pedigree or marker data. It is, however, important to realize that fitting a covariance between breeding values allows the BLUPs from individuals that have little phenotypic information themselves, to borrow strength from phenotypic records on closely related individuals. As a result, the p individuals with highest BLUP precision will most likely be close relatives, which is detrimental for the generalizing capabilities of the marker-based selection model. If we want the selection of p individuals to rely completely on the amount of information and the structure (balancedness) of their phenotypic records, G should be a scaled identity matrix. Once the p individuals have been selected, a pedigree or marker-based covariance structure can be incorporated in G for the construction of the actual marker-based prediction model. The covariance structure of the residuals in matrix R can contain heterogeneous variances for the different production environments or, in case that data originates from actual field trials, spatial information like row or column correlations. The BLUPs in vector $\hat{u}$ are obtained by solving the mixed model equations (Henderson 1984)

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix} \begin{bmatrix} \hat{b} \\ \hat{u} \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}.$$

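The inverse of the coefficient matrix allows us to obtain the prediction error variance (PEV) matrix of vector $\hat{u}$ as

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix}^{-1} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix},$$

where

$$\operatorname{PEV}(\hat{u}) = \operatorname{Var}(\hat{u} - u) = C_{22}.$$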
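To make the role of $C_{22}$ concrete, here is a minimal sketch (not the authors' implementation) that builds the mixed model coefficient matrix for the simplified case of herds as fixed effects, sires as random effects, $G = I\sigma_g^2$ and $R = I\sigma_e^2$, and returns the sire block of its inverse. The function names and the choice $\sigma_g^2 = \sigma_e^2 = 1$ are assumptions made for illustration, and every herd is assumed to hold at least one record. A small Gauss-Jordan routine is used so that the example needs no external library.

```python
def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def pev_matrix(counts, var_g, var_e):
    """Return C22 for herds fixed, sires random, G = I*var_g, R = I*var_e.

    counts[h][s] is the number of records of sire s in herd h.
    """
    n_h, n_s = len(counts), len(counts[0])
    size = n_h + n_s
    m = [[0.0] * size for _ in range(size)]
    for h in range(n_h):
        for s in range(n_s):
            c = counts[h][s] / var_e
            m[h][h] += c                      # X'R^-1 X (diagonal)
            m[h][n_h + s] = c                 # X'R^-1 Z
            m[n_h + s][h] = c                 # Z'R^-1 X
            m[n_h + s][n_h + s] += c          # Z'R^-1 Z (diagonal)
    for s in range(n_s):
        m[n_h + s][n_h + s] += 1.0 / var_g    # + G^-1
    c_full = invert(m)
    return [row[n_h:] for row in c_full[n_h:]]


# Table 1 design with assumed unit variances (illustrative values only).
table1 = [[3, 6, 0, 0], [3, 4, 0, 0], [0, 0, 7, 5]]
c22 = pev_matrix(table1, var_g=1.0, var_e=1.0)
for row in c22:
    print(["%.3f" % v for v in row])
```

Under these assumptions the entries of $C_{22}$ that link s1 or s2 to s3 or s4 come out as zero, while every diagonal element stays below $\sigma_g^2$.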

A logical choice of measure to express the precision of a selection of p BLUPs from the t candidates in vector $\hat{u}$ would be some function of the p × p principal submatrix $C_{22}^{p}$, obtained by removing the rows and columns of $C_{22}$ that pertain to individuals that are not in that particular selection. As a good design is strongly associated with the precision of pairwise contrasts (Bueno Filho and Gilmour 2003), we use the lowest precision of all possible pairwise contrast vectors between the p selected individuals as optimization criterion. A pairwise contrast vector $q^{ij}$ for the individuals i and j is a vector where $q^{ij}_i = 1$ and $q^{ij}_j = -1$, while all other elements of $q^{ij}$ are zeros. Laloe (1993) and Laloe et al. (1996) propose expressing the precision of a linear contrast vector q by means of the generalized coefficient of determination (CD), which is defined as

$$CD(q) = \frac{q'(G - C_{22})q}{q'Gq},$$

where $CD(q)$ always lies within the unit interval. They indicate that $CD(q)$ can be obtained as a weighted average of the t – 1 nonzero eigenvalues $\mu_i$ of the generalized eigenvalue problem

$$\left((G - C_{22}) - \mu_i G\right) v_i = 0, \tag{2}$$

as

$$CD(q) = \frac{\sum_{i=2}^{t} a_i^2 \mu_i}{\sum_{i=2}^{t} a_i^2}, \tag{3}$$

where $a_i^2$ is the weight for eigenvalue i and the first eigenvalue $\mu_1$ always equals zero as a consequence of the well-known summation constraint $1'G^{-1}\hat{u} = 0$ (see, e.g., Foulley et al. 1990). Each linear contrast vector q can be expressed as a linear combination of the t – 1 nonzero eigenvectors $v_i$ as

$$q = \sum_{i=2}^{t} a_i v_i.$$

In fact, all linear contrast vectors that are estimable in the least-squares sense are linear combinations of the eigenvectors $v_i$ of Equation 2 that are associated to nonzero eigenvalues $\mu_i$, while those contrasts that are not estimable in a least-squares sense are linear combinations of eigenvectors for which at least one associated eigenvalue is zero. This implies that the CD of a pairwise contrast vector involving two individuals that were evaluated in two disconnected groups does not necessarily become zero as several eigenvalues $\mu_i$ in Equation 3 might be nonzero. This might bias the selection procedure to favor a disconnected set of individuals with a high information content (i.e., a high level of replication) instead of a connected set of individuals with low information content. To avoid this situation, the CD of pairwise contrast vectors between disconnected individuals should be forced to zero. In case Equation 1 represents a simplified animal model where $G = I\sigma_g^2$ and $R = I\sigma_e^2$, disconnected pairs of individuals can be easily identified by examining the block diagonal structure of the PEV matrix $C_{22}$ as explained in Appendix A. In Appendix B we show how disconnected pairs of individuals can be identified by means of the transitive closure of the adjacency matrix of the t individuals.
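The effect of the correction can be illustrated with a short sketch. It is not taken from the article: the helper names are hypothetical, and the numeric $C_{22}$ used below is the one that the mixed model equations yield for the Table 1 design under the assumed values $\sigma_g^2 = \sigma_e^2 = 1$ (herds fixed, sires random). Exact fractions are used so the results can be read off without rounding.

```python
from fractions import Fraction as F


def quad(q, m):
    """Return the quadratic form q' m q for a list-of-lists matrix m."""
    n = len(q)
    return sum(q[i] * m[i][j] * q[j] for i in range(n) for j in range(n))


def cd(q, g, c22):
    """Generalized coefficient of determination q'(G - C22)q / q'Gq."""
    n = len(q)
    diff = [[g[i][j] - c22[i][j] for j in range(n)] for i in range(n)]
    return quad(q, diff) / quad(q, g)


def corrected_pairwise_cd(i, j, g, c22, group):
    """CD of the contrast u_i - u_j, forced to zero for disconnected pairs."""
    if group[i] != group[j]:
        return F(0)
    q = [0] * len(g)
    q[i], q[j] = 1, -1
    return cd(q, g, c22)


# Assumed setting: G = I and the C22 of Table 1 with unit variances.
g = [[F(int(i == j)) for j in range(4)] for i in range(4)]
c22 = [
    [F(33, 59), F(26, 59), F(0), F(0)],
    [F(26, 59), F(33, 59), F(0), F(0)],
    [F(0), F(0), F(47, 82), F(35, 82)],
    [F(0), F(0), F(35, 82), F(47, 82)],
]
group = [0, 0, 1, 1]  # s1, s2 in one connected group; s3, s4 in the other

print(corrected_pairwise_cd(0, 1, g, c22, group))  # connected pair
print(cd([1, 0, -1, 0], g, c22))                   # uncorrected, disconnected pair
print(corrected_pairwise_cd(0, 2, g, c22, group))  # corrected to zero
```

In this setting the uncorrected CD of the s1 versus s3 contrast is clearly positive even though the two sires are disconnected, which is exactly the situation the correction is meant to rule out.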

Now that we have the corrected CD for each of the $p(p-1)/2$ pairwise contrast vectors, we can represent the t individuals as vertices (also called nodes) of a weighted complete graph where the edge between individual i and individual j carries the weight $CD(q^{ij})$, expressing the precision of the pairwise contrast as a number between zero and one. We need to select exactly p vertices such that the minimum edge weight in the selected subgraph is maximized. This problem is equivalent to the "discrete p-dispersion problem" from the field of graph theory. This problem setting is encountered when locating facilities that should not be clustered in one location, like nuclear plants or franchises belonging to the same fast-food chain. This problem is nondeterministic polynomial-time (NP)-hard even when the distance matrix satisfies the triangle inequality. Erkut (1990) describes two exact algorithms that are based on a branch and bound strategy and compares 10 different heuristics (Erkut et al. 1994).

An interesting solution lies in the connection between the discrete p-dispersion problem and the maximum clique problem. A clique in a graph is a set of pairwise adjacent vertices or, in other words, a complete subgraph. The corresponding optimization problem, the maximum clique problem, is to find the largest clique in a graph. This problem is also NP-hard (Carraghan and Pardalos 1990). The idea is to decompose the discrete p-dispersion problem in a number of maximum clique problems by assigning different values to the minimum required contrast precision $CD_{\min}$. Initially, $CD_{\min}$ is low (e.g., $CD_{\min} = 0.1$) and we define a graph $G'(V, E')$, where the edges of the original graph G are removed when their edge weight is smaller than $CD_{\min}$. This implies that there will be no edges between disconnected pairs of individuals in the derived graph as these edge weights have been set to zero by the CD correction procedure. Solving the maximum clique problem in $G'(V, E')$ allows us to identify a complete subgraph for which all edge weights are guaranteed to be greater than $CD_{\min}$. The number of vertices in this complete subgraph is generally smaller than t but greater than p. By repeating this procedure with increasing values of $CD_{\min}$ one can make a trade-off between sample size and sample quality as demonstrated in Figure 1 for a representative sample of size t = 4236 individuals for which genetic evaluation data were recorded as part of the grain maize breeding program of the private company RAGT R2n. Each dot represents the largest possible selection of individuals where $CD_{\min}$ ranges from 0 to 0.97. The data used in this example are connected as there is no sudden drop in the number of individuals when $CD_{\min}$ is raised from 0.0 to 0.1. In general, the surface below the curve represents a measure of data quality. If one is interested only in obtaining the optimal selection of exactly p individuals from a set of t candidates, one can implement the described maximum clique-based procedure in a binary search.
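The binary search over $CD_{\min}$ can be sketched as follows. This is a toy illustration under stated assumptions rather than the procedure used in the article: the maximum clique routine is a plain branch-and-bound in the spirit of the exact algorithms cited above, suitable only for small graphs, and the candidate thresholds are taken to be the distinct positive edge weights. Edges are kept when their weight is at least the threshold. The function names and the 5 × 5 weight matrix are hypothetical.

```python
def max_clique(adj):
    """Return one maximum clique of a graph given as {vertex: set(neighbours)}."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique[:]
        # Bound: stop once the remaining candidates cannot beat the incumbent.
        while candidates and len(clique) + len(candidates) > len(best):
            v = candidates.pop()
            expand(clique + [v], [u for u in candidates if u in adj[v]])

    expand([], sorted(adj))
    return best


def p_dispersion(weights, p):
    """Select p vertices maximizing the minimum pairwise weight.

    weights[i][j] is the corrected CD between individuals i and j (zero for
    disconnected pairs). Returns (cd_min, vertices), or None when no p
    individuals are mutually joined by positive weights.
    """
    n = len(weights)
    if not 2 <= p <= n:
        raise ValueError("p must lie between 2 and the number of individuals")
    levels = sorted({weights[i][j] for i in range(n) for j in range(i + 1, n)
                     if weights[i][j] > 0})
    lo, hi, answer = 0, len(levels) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        cd_min = levels[mid]
        adj = {i: {j for j in range(n) if j != i and weights[i][j] >= cd_min}
               for i in range(n)}
        clique = max_clique(adj)
        if len(clique) >= p:           # feasible: try a stricter threshold
            answer = (cd_min, sorted(clique)[:p])
            lo = mid + 1
        else:                          # infeasible: relax the threshold
            hi = mid - 1
    return answer


# Hypothetical corrected CD values; individual 4 is disconnected from the rest.
w = [
    [0.0, 0.9, 0.8, 0.2, 0.0],
    [0.9, 0.0, 0.7, 0.3, 0.0],
    [0.8, 0.7, 0.0, 0.4, 0.0],
    [0.2, 0.3, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(p_dispersion(w, 3))  # (0.7, [0, 1, 2])
print(p_dispersion(w, 5))  # None
```

The search relies on the fact that the size of the maximum clique cannot grow when the threshold is raised, so feasibility is monotone in $CD_{\min}$.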

The presented approach for solving the discrete p-dispersion problem requires an efficient algorithm that allows the maximum clique from a graph to be obtained. Several exact algorithms and heuristics have been published, but comparing these is often difficult as the dimensions and densities of the provided example graphs as well as computational platforms tend to differ between articles. The exact algorithm of Carraghan and Pardalos (1990) is, however, considered as the basis for most later algorithms. Although the efficiency of this algorithm has been superseded by that of more recent developments (Ostergard 2002; Tomita and Seki 2003), its easy implementation often makes it the method of choice. If the available run-time is limited, a time-constrained heuristic like the reactive local search approach presented by Battiti and Protasi (2001) might be more appropriate. Bomze et al. (1999) give an overview of several other heuristic approaches found in literature, in particular greedy construction and stochastic local search, including simulated annealing, genetic algorithms, and tabu search.

Figure 1. Graph of the trade-off between the selection size and the selection quality for a sample of the RAGT grain maize breeding pool. For each examined level of $CD_{\min}$, ranging from 0.0 to 0.97, the dot represents the maximum cardinality selection of individuals for which the minimum precision of a pairwise contrast is at least $CD_{\min}$.

SELECTING MARKERS FROM A DENSE MOLECULAR MARKER FINGERPRINT

The construction of a genomic prediction model requires genotypic information on each of the p selected individuals. Generally it is assumed that a good prediction accuracy can only be achieved by maximizing the genome coverage, which implies genotyping a large number of molecular markers. This approach seems particularly attractive as genotyping costs are decreasing rapidly. However, as shown by Maenhout et al. (2010), the relation between the number of genotyped markers and the obtained prediction accuracy seems to be subject to the law of diminishing marginal returns. This means that it might be more efficient to construct the genomic prediction model using a larger number of individuals in combination with a smaller molecular marker fingerprint. There is obviously an upper limit to the sparsity of the applied molecular marker fingerprint and its genome coverage should be as uniform as possible such that the probability of detecting a marker-trait association is maximized.

We start by solving this selection problem on a single chromosome for which t candidate molecular markers have been mapped. We want to select exactly q of these markers such that the chromosome coverage is optimal compared to all other possible selections of q markers. Maximizing the chromosome coverage could mean several things, including maximizing the average intermarker distance and maximizing the minimum marker distance. We prefer the latter definition as it implies a one-dimensional version of the discrete p-dispersion problem. In this restricted setting, a reduction to a series of maximum clique problems is not necessary as Ravi et al. (1991) have published an algorithm that obtains the optimal solution in an overall running time of $O(\min(t^2, qt \log t))$.
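As an illustration of the one-dimensional problem, the sketch below selects q map positions that maximize the minimum gap between neighbouring selected markers. It deliberately uses a different and simpler technique than the algorithm of Ravi et al. (1991): a binary search over the candidate gap values combined with a greedy left-to-right feasibility check. The function name and the marker positions (in cM) are hypothetical.

```python
def select_markers(positions, q):
    """Choose q map positions that maximize the minimum gap between neighbours."""
    pos = sorted(positions)
    if not 2 <= q <= len(pos):
        raise ValueError("q must lie between 2 and the number of markers")

    def greedy(d):
        # Sweep left to right, keeping a marker whenever it lies at least d
        # beyond the previously kept one; stop as soon as q are kept.
        chosen = [pos[0]]
        for x in pos[1:]:
            if x - chosen[-1] >= d:
                chosen.append(x)
                if len(chosen) == q:
                    break
        return chosen

    # The optimal minimum gap is always one of the pairwise distances.
    gaps = sorted({b - a for i, a in enumerate(pos) for b in pos[i + 1:]})
    lo, hi, best = 0, len(gaps) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        chosen = greedy(gaps[mid])
        if len(chosen) == q:           # feasible: try a larger minimum gap
            best, lo = chosen, mid + 1
        else:                          # infeasible: try a smaller minimum gap
            hi = mid - 1
    return best


# Hypothetical map positions in cM on a single chromosome.
markers = [0, 5, 12, 20, 31, 33, 50]
print(select_markers(markers, 4))  # [0, 20, 33, 50], minimum gap 13 cM
```

Enumerating all pairwise distances makes this sketch quadratic in the number of markers, which is acceptable for a single chromosome but is not meant to match the running time quoted above.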

The extension to c > 1 chromosomes is again dependent on the interpretation of uniform genome coverage. For example, we can use the above-mentioned algorithm to select $q\ell_i / \sum_{i=1}^{c} \ell_i$ markers on each chromosome i with length $\ell_i$. As these fractions will generally not result in integers, the remainder after division could be attributed to each of the different chromosomes in decreasing order of their minimum intermarker distance after the addition of one marker. A more intuitive interpretation of a uniform genome coverage entails a selection of markers such that the minimum intermarker distance over all chromosomes is maximized. This can be achieved by linking all chromosomes head to tail as if all markers were located on a single chromosome. To be able to use the above-mentioned algorithm, the distance between the last marker of the first chromosome and the first marker of the second chromosome of each linked chromosome pair should be set to infinity.
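A small worked example of the proportional allocation may help. The chromosome lengths and the budget are assumed values chosen for illustration only: take c = 3 chromosomes with $\ell_1 = 200$, $\ell_2 = 150$ and $\ell_3 = 50$ cM, and q = 41 markers.

```latex
\begin{align*}
\sum_{i=1}^{3} \ell_i &= 200 + 150 + 50 = 400, \\
\frac{q\,\ell_1}{\sum_i \ell_i} &= \frac{41 \cdot 200}{400} = 20.5, \qquad
\frac{q\,\ell_2}{\sum_i \ell_i} = \frac{41 \cdot 150}{400} = 15.375, \qquad
\frac{q\,\ell_3}{\sum_i \ell_i} = \frac{41 \cdot 50}{400} = 5.125, \\
\lfloor 20.5 \rfloor + \lfloor 15.375 \rfloor + \lfloor 5.125 \rfloor &= 20 + 15 + 5 = 40 .
\end{align*}
```

One marker is left over after rounding down. Following the rule stated above, it would go to the chromosome that retains the largest minimum intermarker distance once the extra marker has been added.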

SELECTING PARENTAL INBRED LINES

In hybrid breeding programs, the molecular marker fingerprint of a single-cross hybrid can be easily deduced from the fingerprints of its two homozygous parents. This allows us to reduce the total genotyping cost of the genomic prediction model considerably. If we assume we have a budget for fingerprinting exactly k parental inbred lines, we can maximize the number of genotyped single-cross hybrids by selecting the set of lines that have produced the maximum number of single-cross hybrids among themselves. We approach this selection problem by representing the total set of parental inbred lines as the vertices of an unweighted pedigree graph where an edge between two vertices represents an offspring individual (i.e., a single-cross hybrid) for which genetic evaluation data are available. Figure 2 shows such a graph representation of the sample used in Figure 1 containing 487 inbred lines and 4236 hybrids. We need to select a k-vertex subgraph that has the maximum number of edges. In graph theory parlance, this problem is called the "densest k-subgraph problem," which is shown to be NP-hard. Several approximation algorithms have been published, including the heuristic based on semidefinite programming relaxation presented by Feige and Seltser (1997) and the greedy approach of Asahiro et al. (2000). The basic idea of the latter is to repeatedly remove the vertex with minimum degree (i.e., minimum number of edges) from the graph until there are exactly k vertices left. This approach has been shown to perform almost as well as the much more complicated alternative based on semidefinite programming.
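The greedy idea described above, repeatedly removing a minimum-degree vertex until k vertices remain, can be sketched in a few lines. This is an illustrative reading of that idea rather than a reproduction of the published algorithm: the tie-breaking rule (lowest name among equal degrees), the function name and the toy pedigree are assumptions.

```python
def greedy_densest_k(edges, k):
    """Greedy heuristic for the densest k-subgraph problem.

    edges is an iterable of (parent_a, parent_b) pairs, one per single-cross
    hybrid. Returns the retained set of parents and the number of hybrids
    whose two parents are both retained.
    """
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore selfings; only crosses between two lines count
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    while len(adj) > k:
        # Drop a vertex of minimum degree, breaking ties by name.
        v = min(adj, key=lambda x: (len(adj[x]), x))
        for u in adj.pop(v):
            adj[u].discard(v)

    n_edges = sum(len(nb) for nb in adj.values()) // 2
    return set(adj), n_edges


# Hypothetical pedigree: each pair is a hybrid with evaluation data.
hybrids = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "D")]
print(greedy_densest_k(hybrids, 3))  # retains A, C and D with 3 hybrids
```

Being a heuristic, the procedure returns one good selection, not necessarily the only one: in this toy pedigree the lines A, B and C would cover three hybrids as well.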

The presented selection procedure does not consider the quality of the genetic evaluation data that are available for the hybrids. As a result, the optimal selection with respect to the maximization of the number of training examples might turn out to be a very poor selection with respect to the quality of the phenotypic data. To enforce these data quality constraints, the described inbred line selection procedure should be performed after a preselection of the hybrids on the basis of the precision of pairwise contrasts. If we select k inbred lines where k ranges from the total number of candidate parents to 3 for each level of $CD_{\min}$ ranging from 0.1 to 0.97 we get a 3-dimensional representation of the data quality as shown in Figure 3. Similarly to Figure 1, each dot on the surface represents the size of the optimal selection of hybrids under the constraints of a genotyping budget for k parental inbred lines and a minimum pairwise contrast precision of $CD_{\min}$. We can see that for high levels of $CD_{\min}$ and high levels of k, the cardinality of the resulting selection becomes 0, indicating that there are no hybrids that comply with both constraints. As soon as the constraint on $CD_{\min}$ is relaxed, the selection cardinality increases gradually as more parental inbred lines are being genotyped.

SIMULATION STUDY

The construction of a hybrid maize prediction model based on ε-SVR (Maenhout et al. 2008, 2010) or BLP (Bernardo 1995) requires a combination of genotypic and phenotypic data on a predefined number of inbred lines and their hybrid offspring, respectively. As phenotypic data are available from past genetic evaluation trials, the number of training examples that is used for the construction of this prediction model is constrained by the total genotyping cost. If we reduce the size of the fingerprint, more inbred lines can be genotyped and more training examples become available, which should result in a better prediction accuracy of the model. However, reducing the size of the molecular marker fingerprint comes at the price of a reduced genome coverage and an increased number of selected hybrids results in a reduced precision of BLUP contrasts due to connectivity issues (e.g., Figure 1). Therefore, it is to be expected that within the constraints of a fixed genotyping budget, maximum prediction accuracy can be achieved by finding the optimal balance between the fingerprint size and the number of training examples. The location of this optimum is obviously highly dependent on the information content of the available phenotypic data and the applied linkage map, but can be estimated by means of the aforementioned graph theory algorithms for each specific data set.

Figure 2.—Graph representation of a sample of the RAGT grain maize breeding pool. The blue vertices represent inbred lines and the gray edges are single-cross hybrids.

Figure 3.—Graph of the trade-off between the selection size and the selection quality when only k parental inbred lines are being genotyped. For each examined level of CDmin ranging from 0.0 to 0.97 the number of genotyped inbred lines k is reduced from 487 to 3. Each dot in the plotted surface represents the maximum cardinality selection of hybrid individuals for which the minimum precision of a pairwise contrast is at least CDmin and the number of parents is exactly k.

Simulation setup: To demonstrate the approach, we use the phenotypic data that were generated as part of the grain maize breeding program of the private breeding company RAGT R2n and their proprietary SSR linkage map. We assume a limited budget for genotyping 101 SSR markers on 200 inbred lines or 20,200 marker scores in total. We also assume that we can limit the number of candidate inbred lines to 400 by restricting the prediction model to a specific heterotic group combination, a specific environment (i.e., maturity rating, irrigation, and fertilizer treatments) and the set of inbred lines that are available at the moment of genotyping. This intensive preselection of candidate lines is mainly needed to keep the simulations tractable. In a more realistic setting, calculations are performed only once so the set of initial candidate lines can be larger. Table 2 gives a schematic overview of the different steps that are performed at each iteration of the simulation routine.
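The consequence of a fixed budget of 101 × 200 = 20,200 marker scores can be made explicit with a few lines of Python. The sketch below assumes that the fingerprint size is obtained by rounding the budget per line down to a whole number of markers; the function and variable names are hypothetical.

```python
BUDGET = 101 * 200  # 20,200 marker scores: 101 SSRs on 200 inbred lines

def fingerprint_size(n_lines, budget=BUDGET):
    """Largest whole number of markers per inbred line that fits the budget
    (rounding down is an assumption of this sketch)."""
    return budget // n_lines

# Number of genotyped lines reduced from 400 to 50 in steps of 50.
sizes = {n: fingerprint_size(n) for n in range(400, 0, -50)}
for n_lines, n_markers in sizes.items():
    print(f"{n_lines:>3} inbred lines -> {n_markers:>3} markers per line")
```

Under this rounding rule, 400 lines leave room for 50 markers each, 200 lines for 101, and 50 lines for 404.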

Again we make use of the pedigree graph representation where inbred lines are represented as vertices and each single-cross hybrid is represented as an edge between two vertices as shown in Figure 2. In this graph, the degree of a vertex (i.e., the number of edges incident to the vertex) therefore equals the number of distinct single-cross hybrids of which the inbred line is a parent. Figure 4 shows the empirical distribution of these degrees on a log scale for the entire RAGT grain maize breeding pool. The observed long-tailed behavior of the empirical distribution is not unexpected as most inbred lines only have a limited number of children, while inbred lines with higher progeny numbers (i.e., the tester lines) are rare. In an attempt to parametrize the underlying distribution from which the observed vertex degrees were drawn, several candidate distributions, among which the Poisson, geometric, discrete log-normal, and discrete power-law distributions, were fitted by means of likelihood maximization. The best fit was observed for the discrete power-law distribution with a left threshold value of 6 that is indicated as a straight line in Figure 4. The fit of this distribution is, however, insufficient as indicated by the significantly large Kolmogorov–Smirnov D-statistic, where significance is determined by means of the parametric bootstrap procedure described by Clauset et al. (2009).
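As a rough illustration of the degree computation and the power-law fit, the following Python sketch counts vertex degrees from an edge list and estimates the power-law exponent for the tail above a given threshold. It uses the closed-form approximation to the discrete maximum-likelihood estimator given by Clauset et al. (2009), not a full numerical maximization, and it does not reproduce the goodness-of-fit bootstrap. All function names are hypothetical.

```python
import math
from collections import Counter

def vertex_degrees(edges):
    """Number of distinct hybrids per inbred line, from (parent, parent) pairs."""
    deg = Counter()
    for u, v in set(edges):
        deg[u] += 1
        deg[v] += 1
    return deg

def powerlaw_alpha(degrees, x_min):
    """Approximate ML exponent of a discrete power law for degrees >= x_min.

    Uses alpha ~ 1 + n / sum(ln(x_i / (x_min - 0.5))), an approximation
    that is reported to work well only for thresholds of about 6 or more.
    """
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no degrees at or above x_min")
    return 1.0 + len(tail) / sum(math.log(d / (x_min - 0.5)) for d in tail)

deg = vertex_degrees([("A", "B"), ("A", "C")])
alpha = powerlaw_alpha([6, 7, 8, 12, 30], x_min=6)
```

The degree values in the last line are invented for the example and do not come from the RAGT data.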

As no conclusive evidence on the underlying distribution of the observed vertex degrees was found, we prefer to sample inbred lines from the full RAGT graph directly. However, taking a representative sample from a large graph is not a trivial task. The sample quality of various published graph sampling algorithms seems to be highly dependent on the properties of the graph.

TABLE 2

Description of each step that is performed during a single iteration of the simulation routine

Step  Description
1     Sample 400 vertices from the pedigree graph by means of the "forest fire" algorithm:
        Indirect sampling of hybrids
        Indirect sampling of multi-environment trials
2     Partition sampled inbred lines in c heterotic groups by means of the Dsatur vertex coloring algorithm
3     Simulate 8 breeding cycles on each of the c heterotic groups
4     Simulate phenotypic records on the sampled hybrids
5     Reduce the number of sampled hybrids by gradually increasing CDmin
6     Reduce the number of genotyped inbred lines by means of the greedy densest k-subgraph algorithm
7     Select q SSR markers with maximal genome coverage
8     Determine the prediction accuracy of ε-SVR and BLP using the reduced set of training examples

The goal is to find the optimal trade-off between the number of genotyped inbred lines and the size of their molecular fingerprint, when the total genotyping budget is fixed.

Figure 4.—Log-scaled degree distribution of the graph created from part of the RAGT R2n grain maize breeding program. In this undirected, unweighted graph, parental inbred lines are represented as vertices and single-cross hybrids as edges. Each dot represents a unique log-scaled vertex degree (horizontal axis) and the log of its frequency in the graph (vertical axis). The red line represents the fitted power law distribution by means of likelihood maximization. The threshold value of 6 was determined by minimizing the Kolmogorov–Smirnov statistic as described by Clauset et al. (2009).


To decide which sampling routine is optimal for the RAGT data, we first need to decide on a measure of sample quality. We compare the empirical cumulative distribution (ECD) of the vertex degrees in the full graph with the ECDs of 100 samples, each containing 400 vertices. From these ECDs, we calculate the average Kolmogorov–Smirnov D-statistic for each examined sampling routine. For the RAGT data, the "forest fire" vertex sampling approach resulted in the smallest average D-statistic compared to the alternative methods described by Lescovec and Faloutsos (2006). This sampling routine starts by selecting a vertex v0 uniformly at random from the graph. Vertex v0 now spreads "the fire" to a random selection of its neighbors, which are then in turn allowed to infect a random selection of their own neighbors. This process is continued until exactly 400 vertices are selected. If the fire dies out before the sample is complete, a new starting vertex is selected uniformly at random. The number of neighbors that is infected at each selected vertex is obtained as a random draw from a geometric distribution where the parameter p was set to 0.62, as this value resulted in the best average sample quality. All hybrids for which both parents were sampled (i.e., the edges of the subgraph) have associated phenotypic records and as such indirectly sample a set of multi-environment trials (METs). All hybrids that were not indirectly selected by the inbred line sample, but do have phenotypic records in the sample of METs, are included in the selection as data connecting check varieties. Despite the fact that the RAGT data already provide phenotypic records for the selected hybrids and check varieties, we replace these by simulated measurements as we want to be able to assess the actual prediction accuracy of ε-SVR and BLP under various levels of data quality.
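The sampling routine and the sample quality measure can be sketched in Python as follows. This is an illustrative reading of the description above, not the code used for the simulations. Two points are assumptions: the geometric draw is taken as the number of failures before the first success with success probability p (the text does not state the parameterization), and a restart picks a vertex among those not yet sampled so that every restart makes progress. All function names are hypothetical.

```python
import random
from bisect import bisect_right

def forest_fire_sample(adj, n_target, p=0.62, rng=None):
    """Sample n_target vertices; adj maps each vertex to its set of neighbours."""
    rng = rng or random.Random()
    if n_target > len(adj):
        raise ValueError("sample larger than graph")
    burned = set()
    while len(burned) < n_target:
        # (Re)ignite at a vertex drawn uniformly from the unsampled ones.
        start = rng.choice(sorted(v for v in adj if v not in burned))
        burned.add(start)
        frontier = [start]
        while frontier and len(burned) < n_target:
            v = frontier.pop(0)
            candidates = sorted(adj[v] - burned)
            # Geometric draw on {0, 1, 2, ...} (assumed parameterization).
            n_burn = 0
            while rng.random() > p:
                n_burn += 1
            for w in rng.sample(candidates, min(n_burn, len(candidates))):
                if len(burned) == n_target:
                    break
                burned.add(w)
                frontier.append(w)
    return burned

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the two ECDs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        d = max(d, abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)))
    return d

# Toy graph: 10 mutually crossed lines.
toy = {i: {j for j in range(10) if j != i} for i in range(10)}
sample = forest_fire_sample(toy, 5, rng=random.Random(1))
```

Comparing `ks_statistic` values between the degrees of the full graph and the degrees of each sample, averaged over many samples, gives the kind of quality score described above.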

The simulation of these phenotypic records for the sampled hybrids starts by partitioning the selected inbred lines into heterotic groups. This partitioning should ensure that the two parents of each single-cross hybrid always belong to distinct heterotic groups, while the total number of groups needs to be minimized. The graph theory equivalent of this problem is called the "vertex coloring problem," which, like all previously described graph theory problems, belongs to the complexity class of NP-hard problems. The minimum number of colors (i.e., heterotic groups) is called the chromatic number of the graph. The vertex coloring problem has been extensively studied in graph theory literature (Jensen and Toft 1995) and several efficient heuristics are available. The greedy desaturation algorithm or Dsatur published by Brelaz (1979) is often used as a benchmark method to assess the efficiency and precision of newly developed vertex coloring algorithms. Its good performance on a variety of graphs and easy implementation make it the method of choice for designating inbred lines to heterotic groups at each iteration of the simulation routine.
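A compact Python sketch of a Dsatur-style coloring is given below for illustration. It follows the general idea of the heuristic (color the vertex whose neighbors already use the most distinct colors) but simplifies the tie-breaking to the degree in the full graph followed by alphabetical order, which is an assumption of this sketch. The function name and input format are hypothetical.

```python
def dsatur(adj):
    """Greedy saturation-based colouring.

    adj maps each inbred line to the set of lines it has been crossed with.
    Returns {line: group index}; parents of a hybrid never share a group.
    """
    colour = {}
    uncoloured = set(adj)
    while uncoloured:
        def key(v):
            saturation = len({colour[w] for w in adj[v] if w in colour})
            return (saturation, len(adj[v]))
        # Highest saturation first, then highest degree, then label order.
        v = max(sorted(uncoloured), key=key)
        used = {colour[w] for w in adj[v] if w in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
        uncoloured.remove(v)
    return colour

# Two testers crossed with two lines: a bipartite graph, so 2 groups suffice.
crosses = {"T1": {"A", "B"}, "T2": {"A", "B"},
           "A": {"T1", "T2"}, "B": {"T1", "T2"}}
groups = dsatur(crosses)
```

Being a heuristic, the number of colors returned is in general an upper bound on the chromatic number rather than the chromatic number itself.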

Once the chromatic number c has been determined for the sampled set of inbred lines, an entire breeding program is simulated starting from c open-pollinated varieties and resulting in c unrelated heterotic groups. The simulation of this breeding program mimics the maize breeding program of the University of Hohenheim as described by Stich et al. (2007) and modified by Maenhout et al. (2009). In short, the simulation routine uses the proprietary linkage map of the breeding company RAGT R2n containing 101 microsatellites and adds an additional 303 evenly distributed, simulated SSRs. It also generates 250 QTL for the selection trait (e.g., yield), which are randomly positioned on the genetic map. The number of alleles for each SSR or QTL is drawn from a Poisson distribution with an expected value of 7. Each simulation starts by generating an initial base population in Hardy–Weinberg equilibrium. Allele frequencies for each locus are drawn from a Dirichlet distribution and used to calculate the allele frequencies in each of the c subpopulations assuming an Fst value of 0.14. We perform 8 breeding cycles where each cycle consists of 6 generations of inbreeding and subsequent phenotypic selection based on line per se or testcross performance as described by Stich et al. (2007). The result is a set of 400 highly selected inbred lines partitioned in c unrelated heterotic groups. Within each of these groups, the simulated inbred lines are randomly assigned to the sampled inbred lines and a genotypic value is generated for each interheterotic hybrid by summing the effects of the 250 QTL alleles of both parents and adding a normally distributed SCA value. The size of the SCA variance component depends on the heritability of the trait under consideration, but is assumed to be only 1/8 of the total nonadditive variance (SCA + G × E and residual error) as this was the average of observed ratios for the traits grain yield, grain moisture content, and days until flowering in the actual RAGT data. The genotypic values of the check varieties are generated from a single normal distribution where the variance is the sum of the additive variance and SCA variance of the sampled hybrids. The simulated genotypic values of hybrids and check varieties are used to generate phenotypic records according to the sampled MET data structure, assuming a single replication in each location of a MET. This implies that G × E effects are confounded with the residual error and only a single effect is drawn from a normal distribution where the variance is 7/8 of the total nonadditive variance. The main environmental effect of each location is also drawn from a normal distribution for which the variance is twice the additive variance of the hybrids.
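The partitioning of the variance and the generation of phenotypic records can be sketched as follows. The text does not spell out how heritability fixes the total nonadditive variance, so the sketch assumes h² = σ²A / (σ²A + σ²NA), with σ²A the additive variance of the hybrids and σ²NA the total nonadditive variance; this definition, like the function names and the (hybrid, location) layout format, is an assumption made for illustration only.

```python
import random

def variance_components(var_additive, h2):
    """Split the nonadditive variance as described: 1/8 SCA, 7/8 G x E + error."""
    var_nonadd = var_additive * (1.0 - h2) / h2  # ASSUMED heritability definition
    return {
        "sca": var_nonadd / 8.0,
        "gxe_residual": 7.0 * var_nonadd / 8.0,
        "location": 2.0 * var_additive,
    }

def simulate_records(genotypic, layout, comps, rng=None):
    """One plot per (hybrid, location) pair in layout.

    genotypic maps each hybrid or check variety to its simulated genotypic value.
    """
    rng = rng or random.Random()
    loc_effect = {}
    records = []
    for hybrid, loc in layout:
        if loc not in loc_effect:
            loc_effect[loc] = rng.gauss(0.0, comps["location"] ** 0.5)
        noise = rng.gauss(0.0, comps["gxe_residual"] ** 0.5)
        records.append((hybrid, loc, genotypic[hybrid] + loc_effect[loc] + noise))
    return records

comps = variance_components(var_additive=1.0, h2=0.5)
records = simulate_records({"H1": 0.3, "H2": -0.1},
                           [("H1", "loc1"), ("H2", "loc1"), ("H1", "loc2")],
                           comps, rng=random.Random(0))
```

With an additive variance of 1 and h² = 0.5, the assumed definition yields a nonadditive variance of 1, of which 0.125 is SCA and 0.875 is the confounded G × E and residual term.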

The simulated phenotypic records that are associated with the sampled data structure allow us to estimate the genotypic value of each hybrid by means of a linear mixed model analysis. We fit individuals (hybrids and check varieties) as random and locations as fixed effects as this approach should result in an ε-SVR model with a superior prediction accuracy (Maenhout et al. 2010). To avoid selections of closely related hybrids, the variance–covariance matrix of the genotypic effects is fitted as a scaled identity matrix. The resulting PEV matrix of the random genotypic effects is used to iteratively select a smaller subset of the sampled hybrids by gradually increasing the minimum required precision of each pairwise contrast in the selection. Initially, the required CDmin value is set to 0, which implies that all hybrids are selected. The next examined level of precision requires CDmin > 0, which effectively excludes selections containing disconnected individuals. More stringent levels of precision are enforced by requiring CDmin > qp, where qp is the pth quantile of the observed distribution of CD values in the complete sample and p ranges from 0 to 0.875 in steps of 0.125. Defining CDmin values as quantiles allows us to compare the obtained prediction accuracies over the different samples of the simulation routine.
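The quantile-based CDmin levels can be illustrated with a small Python sketch. It assumes that the pairwise CD values have already been computed and are supplied as a dictionary keyed by pairs of individuals, and it uses linear interpolation between order statistics as the quantile definition; both choices, and all names, are assumptions of this sketch.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile of an ascending list, 0 <= p <= 1."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def cd_min_levels(pairwise_cd, step=0.125):
    """Thresholds q_p for p = 0, step, ..., 1 - step."""
    vals = sorted(pairwise_cd.values())
    n_levels = int(round(1.0 / step))
    return [quantile(vals, i * step) for i in range(n_levels)]

def pairs_above(pairwise_cd, threshold):
    """Pairs whose contrast precision strictly exceeds the threshold."""
    return {pair for pair, cd in pairwise_cd.items() if cd > threshold}

# Invented CD values for three individuals; A and B are disconnected.
cd = {("A", "B"): 0.0, ("A", "C"): 0.4, ("B", "C"): 0.8}
levels = cd_min_levels(cd)
admissible = pairs_above(cd, 0.0)
```

The admissible pairs at each level define the graph in which the maximum cardinality selection is subsequently sought.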

For each level of CDmin, the number of genotyped inbred lines is reduced from 400 to 50 in steps of 50, while at the same time the number of markers in the molecular marker fingerprint is increased from 50 to 404. For each combination of CDmin and number of genotyped inbred lines, the BLUPs of the selected hybrids are used to construct an ε-SVR and a BLP-based prediction model. In fact, the prediction accuracy of both methods is verified by randomly assigning the BLUPs to one of five groups. For each of these groups, a separate ε-SVR and BLP prediction model is constructed using all BLUPs in the remaining four groups as training data. The resulting prediction model is then used to make predictions on the hybrids in the selected group (i.e., the validation data). Combining the predictions of all five models allows us to obtain a measure of prediction accuracy by correlating them against the simulated genotypic values.
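The five-group validation scheme can be sketched generically in Python. The sketch takes any model-fitting function as an argument; the nearest-neighbor predictor used here is only a stand-in so that the example is self-contained, and is neither ε-SVR nor BLP. All names are hypothetical.

```python
import random

def pearson_r2(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def cross_validated_accuracy(features, blups, true_values, fit, n_folds=5, rng=None):
    """Train on BLUPs of four groups, predict the fifth, and correlate the
    pooled predictions with the simulated genotypic values."""
    rng = rng or random.Random()
    idx = list(range(len(blups)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    predicted = [None] * len(blups)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([features[i] for i in train], [blups[i] for i in train])
        for i in fold:
            predicted[i] = predict(features[i])
    return pearson_r2(predicted, true_values)

def fit_nearest_neighbour(train_x, train_y):
    """Stand-in predictor: return the BLUP of the closest training fingerprint."""
    def predict(x):
        dist = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        return train_y[dist.index(min(dist))]
    return predict

features = [[float(i)] for i in range(20)]
values = [2.0 * i for i in range(20)]
acc = cross_validated_accuracy(features, values, values,
                               fit_nearest_neighbour, rng=random.Random(3))
```

Note that the models are trained on BLUPs while accuracy is measured against the simulated genotypic values, which is only possible because the data are simulated.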

Simulation results: We expect that enforcing a minimum required pairwise contrast precision CDmin > 0 results in a selection of BLUPs that has greater accuracy compared to the full set of hybrids. In Figure 5 this BLUP accuracy is plotted against CDmin and the maximum number of inbred lines for each of the three examined heritability levels. Each point on these wireframe surfaces represents the squared Pearson correlation between the BLUPs and the actual, simulated genotypic values of the selected hybrids at that particular level of CDmin and number of parental inbred lines, averaged over 100 iterations of the simulation routine. We can see that an increase in CDmin results in an almost linear increase in BLUP precision for each heritability level. This effect is especially pronounced for the lowest heritability level h² = 0.25. As expected, the BLUP precision is not influenced by the number of parental inbred lines.

Figure 6 presents the prediction accuracy of both ε-SVR and BLP for increasing values of the minimum required contrast precision CDmin and a decreasing number of genotyped inbred lines. The height of each point in the wireframes represents the average prediction accuracy, expressed as a squared Pearson correlation, over 100 iterations of the simulation routine. For each of the examined heritability levels, ε-SVR generally performs better than BLP. The negative effect of disconnected hybrids in the selection of training examples is visualized as the sharp increase in prediction accuracy when the minimum required contrast precision is slightly constrained from CDmin = 0 to CDmin > 0. This effect is more pronounced for BLP than for ε-SVR. Increasing CDmin any further generally decreases the prediction accuracy, especially for traits with lower heritability. This observation implies that, at least for the RAGT data set, a larger number of training examples of lower data quality is to be preferred over a smaller selection of hybrids for which more and better connected phenotypic information is available, as long as disconnected individuals are excluded.

Figure 5.—Accuracy of the genotypic value BLUPs of the hybrids selected using the described graph-based procedures. The three examined heritability levels h² = 0.25, h² = 0.5, and h² = 0.75 are represented by the bottom, middle, and top wireframe surfaces, respectively. Each point on a surface is the squared Pearson correlation between the BLUPs and the actual (simulated) genotypic values of the selected hybrids under the constraints of a minimum required contrast precision CDmin, expressed as a percentile of the sampled CD values, and the number of genotyped inbred lines, averaged over 100 iterations of the simulation routine.

BLP and ε-SVR do not agree on the optimal number of genotyped inbred lines. For BLP, the optimum seems to lie somewhere around 100 inbred lines for h² = 0.25 and 150 for h² = 0.50 and h² = 0.75, the equivalent of fingerprint sizes of 202 and 134 SSR markers, respectively. This optimum is, however, less pronounced for the higher heritability levels. For ε-SVR, the optimal number of inbred lines is 150, 200, and 350 for h² = 0.25, h² = 0.5, and h² = 0.75, respectively. At the highest heritability level, ε-SVR seems to prefer training sets of maximum size, at the cost of a very small molecular marker fingerprint size. The observed behavior of both BLP and ε-SVR is consistent with the results of a previous study (Maenhout et al. 2010) where it was shown that BLP is less sensitive to a reduction of the number of training examples compared to ε-SVR, as long as the molecular marker fingerprint is dense. ε-SVR on the other hand, although requiring a training set of considerable size, handles smaller or less informative molecular marker fingerprints better than BLP.

Figure 6.—Average prediction accuracy of ε-SVR and BLP prediction models over 100 iterations of the simulation routine for varying levels of the minimum required contrast precision CDmin, expressed as a percentile of the sampled CD values ranging from 0 to 0.875, and the number of genotyped inbred lines. The height of each point in the wireframe represents the prediction accuracy obtained by ε-SVR and BLP when training on the optimal selection of hybrids under the constraints imposed by the levels of the two independent variables. Prediction accuracy is expressed as the average squared Pearson correlation between the simulated and the predicted genotypic values of the hybrids. The interval at the bottom of each wireframe provides the minimum and maximum standard error of the mean. The scales of the vertical axes are comparable only within the same heritability level.

DISCUSSION

This article presents three selection problems that are relevant to the budget-constrained construction of a genomic prediction model from available genetic evaluation data. The first problem considers the selection of exactly p individuals from a set of t candidates that will be genotyped to serve as training examples for the construction of the prediction model. This selection should be optimal in the sense that a linear mixed model analysis of the associated phenotypic records should result in a set of p BLUPs of genotypic values that have the highest precision of all possible selections. By defining the precision of a selection as the minimum generalized coefficient of determination of a pairwise contrast, this selection problem can be translated to the discrete p-dispersion problem from the field of graph theory. The reduction of this problem to a set of maximum clique problems allows us to visualize the trade-off between selection size and selection quality. The greedy nature of a breeding program does unfortunately bias the presented selection approach toward high-performing individuals. These are generally tested more thoroughly than their low-performing counterparts. As the latter generally have only a few associated phenotypic records, the pairwise contrasts involving these individuals have a low precision, which in turn makes their selection by the described procedure very unlikely. As a consequence, the resulting genomic prediction model is likely to overestimate the capabilities of the low-performing individuals. To avoid this bias, the selection procedure should optimize two objectives simultaneously: (1) maximizing the minimum precision of all pairwise contrasts in the selection and (2) maximizing the genetic variance in the selection. Even if one would succeed in finding an acceptable trade-off between these conflicting objectives, the estimates of the genotypic value of low-performing individuals will always suffer from large standard errors, which makes them unreliable training examples.

The second problem we discuss deals with the selection of exactly q molecular markers from a set of t candidates for which the relative positions on a genetic map are known. To guarantee that the selection has an optimal genome coverage, we maximize the minimum intermarker distance. We show that this problem can be translated to a one-dimensional discrete p-dispersion problem for which an exact algorithm is available.
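For a single linkage group, the max-min marker selection can be solved exactly with a simple search, sketched below in Python. This is not necessarily the exact algorithm referred to above: it is a substitute technique that performs a binary search over the candidate intermarker distances and checks each candidate with a left-to-right greedy scan, which is exact for points on a line. Handling several chromosomes at once is outside the scope of the sketch, and the function name is hypothetical.

```python
def max_min_marker_subset(positions, q):
    """Pick q map positions (one chromosome) maximising the minimum gap."""
    pts = sorted(positions)
    if not 2 <= q <= len(pts):
        raise ValueError("need 2 <= q <= number of markers")

    def greedy(d):
        # Leftmost marker first, then every next marker at least d away.
        chosen = [pts[0]]
        for x in pts[1:]:
            if x - chosen[-1] >= d:
                chosen.append(x)
        return chosen

    # The optimal minimum gap is always one of the pairwise distances.
    gaps = sorted({b - a for i, a in enumerate(pts) for b in pts[i + 1:]})
    lo, hi = 0, len(gaps) - 1
    best = greedy(gaps[0])[:q]  # the smallest gap keeps every marker
    while lo <= hi:
        mid = (lo + hi) // 2
        selection = greedy(gaps[mid])
        if len(selection) >= q:
            best = selection[:q]
            lo = mid + 1
        else:
            hi = mid - 1
    return best

# Invented map positions in cM.
chosen = max_min_marker_subset([0, 10, 12, 25, 40, 41, 60], q=4)
```

For these invented positions the best achievable minimum distance is 16 cM, obtained with the markers at 0, 25, 41, and 60 cM.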

The third problem is specific to hybrid breeding programs and entails the selection of exactly k parental inbred lines such that the number of single-cross hybrids in the selection is maximized. If we represent the inbred lines as vertices of a graph and each single-cross hybrid as an edge between its parental vertices, this problem can be translated to the densest k-subgraph problem, which we solve by using a greedy heuristic.

The presented solutions to the three selection problems are put into practice in a simulation study in which the goal is to find the optimal number of training examples for the construction of ε-SVR and BLP prediction models with maximal prediction accuracy under a fixed genotyping budget. At each iteration of the simulation routine, inbred lines, hybrids, and their associated phenotypic data structure are sampled from actual genetic evaluation data. The number of training examples is gradually reduced by putting constraints on the data quality and the number of genotyped inbred lines. The results indicate that selections of training examples containing disconnected individuals are detrimental to the prediction accuracy of both ε-SVR and BLP. More stringent data quality constraints are, however, not necessary. ε-SVR performs best if the number of parental inbred lines (i.e., the number of training examples) is maximized at the cost of a reduced genome coverage. BLP on the other hand performs best when trained on a smaller set of training examples for which a dense fingerprint is available.

Although these conclusions are most likely specific to maize breeding programs and possibly even specific to the heterotic groups and breeding methods used by the data-providing breeding company, the presented graph-based data selection algorithms should prove useful for the construction of genomic prediction models in other plant and animal species as well. Evidently, more species-specific case studies are required to verify this claim.

The authors thank the people from RAGT R2n for their unreserved and open-minded scientific contribution to this research.

LITERATURE CITED

Asahiro, Y., K. Iwama, H. Tamaki and T. Tokuyama, 2000 Greedily finding a dense subgraph. Algorithmica 34: 203–221.

Battiti, R., and M. Protasi, 2001 Reactive local search for the maximum clique problem. Algorithmica 29: 610–637.

Bernardo, R., 1994 Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci. 34: 20–25.

Bernardo, R., 1995 Genetic models for predicting maize single-cross performance in unbalanced yield trial data. Crop Sci. 35: 141–147.

Bernardo, R., 1996 Best linear unbiased prediction of the performance of crosses between untested maize inbreds. Crop Sci. 36: 50–56.

Bernardo, R., 2008 Molecular markers and selection for complex traits in plants: learning from the last 20 years. Crop Sci. 48: 1649–1664.

Bomze, M., M. Budinich, P. Pardalos and M. Pelillo, 1999 The maximum clique problem, pp. 1–74 in Handbook of Combinatorial Optimization, Supplement Vol. A, edited by D.-Z. Du and P. M. Pardalos. Kluwer Academic, Dordrecht, The Netherlands.

Brelaz, D., 1979 New methods to color the vertices of a graph. Commun. Assoc. Comput. Mach. 22: 251–256.

Bueno Filho, S., and S. G. Gilmour, 2003 Planning incomplete block experiments when treatments are genetically related. Biometrics 59: 375–381.

Carraghan, R., and P. M. Pardalos, 1990 An exact algorithm for the maximum clique problem. Oper. Res. Lett. 9: 375–382.

Chakrabarti, M. C., 1964 On the C-matrix in design of experiments. J. Indian. Statist. Assoc. 1: 8–23.

Clauset, A., C. R. Shalizi and M. E. J. Newman, 2009 Power-law distributions in empirical data. SIAM Rev. 51: 661–703.

De Meyer, H., H. Naessens and B. De Baets, 2004 Algorithms for computing the min-transitive closure and associated partition tree of a symmetric fuzzy relation. Eur. J. Oper. Res. 155: 226–238.

Erkut, E., 1990 The discrete p-dispersion problem. Eur. J. Oper. Res. 46: 48–60.

Erkut, E., Y. Ulkusal and O. Yenicerioglu, 1994 A comparison of p-dispersion heuristics. Comput. Oper. Res. 21: 1103–1113.

Feige, U. and M. Seltser, 1997 On the densest k-subgraph problem. Technical Report CS97-16, Weizmann Institute, Rehovot, Israel.

Foulley, J., J. Bouix, B. Goffinet and J. M. Elsen, 1990 Connectedness in genetic evaluation, pp. 277–308 in Advances in Statistical Methods for Genetic Improvement of Livestock, edited by D. Gianola and K. Hammond, Springer-Verlag, Heidelberg.

Gianola, D., and J. B. C. H. M. van Kaam, 2008 Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178: 2289–2303.

Hayes, B. J., P. J. Bowman, A. J. Chamberlain and M. E. Goddard, 2009 Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92: 433–443.

Heiligers, B., 1991 A note on connectedness of block designs. Metrika 38: 377–381.

Henderson, C. R., 1975 Best linear unbiased estimation and prediction under a selection model. Biometrics 31: 423–447.

Henderson, C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph Press, Guelph, Ontario, Canada.

Jensen, T., and B. Toft, 1995 Graph Coloring Problems. Wiley, New York.

Kennedy, B., and D. Trus, 1993 Considerations on genetic connectedness between management units under an animal model. J. Anim. Sci. 71: 2341–2352.

Laloe, D., 1993 Precision and information in linear models of genetic evaluation. Genet. Sel. Evol. 25: 557–576.

Laloe, D., F. Phocas and F. Menissier, 1996 Considerations about measures of precision and connection in mixed linear models of genetic evaluation. Genet. Sel. Evol. 28: 359–378.

Lescovec, J., and C. Faloutsos, 2006 Sampling from large graphs, pp. 631–636 in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association of Computing Machinery, New York.

Maenhout, S., B. De Baets, G. Haesaert and E. Van Bockstaele, 2007 Support vector machine regression for the prediction of maize hybrid performance. Theor. Appl. Genet. 115: 1003–1013.

Maenhout, S., B. De Baets, G. Haesaert and E. Van Bockstaele, 2008 Marker-based screening of maize inbred lines using support vector machine regression. Euphytica 161: 123–131.

Maenhout, S., B. De Baets and G. Haesaert, 2009 Marker-based estimation of the coefficient of coancestry in hybrid breeding programmes. Theor. Appl. Genet. 118: 1181–1192.

Maenhout, S., B. De Baets and G. Haesaert, 2010 Prediction of maize single-cross hybrid performance: support vector machine regression versus best linear prediction. Theor. Appl. Genet. 120: 415–427.

Meuwissen, T. H. E., B. J. Hayes and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.

Naessens, H., H. De Meyer and B. De Baets, 2002 Algorithms for the computation of T-transitive closures. IEEE Trans. Fuzzy Syst. 10: 541–551.

Ostergard, P. R. J., 2002 A fast algorithm for the maximum clique problem. Discrete Appl. Math. 120: 197–207.

Patterson, H. D., and R. Thompson, 1971 Recovery of inter-block information when block sizes are equal. Biometrika 58: 545–554.

Ravi, S., D. Rosenkrantz and G. Tayi, 1991 Facility dispersion problems: heuristics and special cases. Lecture Notes Comput. Sci. 519: 355–366.

Stich, B., A. E. Melchinger, H. P. Piepho, S. Hamrit, W. Schipprack et al., 2007 Potential causes of linkage disequilibrium in a european maize breeding program investigated with computer simulations. Theor. Appl. Genet. 115: 529–536.

Tomita, E., and T. Seki, 2003 An efficient branch-and-bound algorithm for finding a maximum clique. DIMACS Ser. Discrete Math. Theoret. Comput. Sci. 2731: 278–289.

Warshall, S., 1962 A theorem on Boolean matrices. J. Assoc. Comput. Mach. 9: 11–12.

Yu, J., G. Pressoir, W. H. Briggs, I. Vroh Bi, M. Yamasaki et al., 2006 A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38: 203–208.

Communicating editor: I. Hoeschele

APPENDIX: IDENTIFYING DISCONNECTED PAIRS OF INDIVIDUALS

Appendix A: By examination of the PEV matrix

If we analyze the available genetic evaluation data with a linear mixed model according to Equation 1, where the variance structure is simplified to

$$\mathrm{Var}\begin{bmatrix} u \\ e \end{bmatrix} = \begin{bmatrix} I\sigma_g^2 & 0 \\ 0 & I\sigma_e^2 \end{bmatrix},$$

we can express the prediction error variance matrix as (Henderson 1984)

$$\mathrm{PEV}(u) = (Z'MZ + I\lambda)^{-1}\sigma_e^2,$$

where $\lambda = \sigma_e^2/\sigma_g^2$ and $M = I - X(X'X)^{-1}X'$ is the orthogonal projector onto the orthogonal complement of the column space of matrix $X$. The matrix product $Z'MZ$ is in fact the information matrix of the genotypic effects if we would consider both environments and genotypic effects as fixed and analyze the data as a block design in a linear least-squares setting. Chakrabarti (1964) proves that if such a block design is fully connected (i.e., all elementary contrasts are estimable in a least-squares sense), the rank of this information matrix equals $t - 1$, where $t$ is the number of fitted genotypic effects. Furthermore, Heiligers (1991) proves that in case the information matrix has a lower rank $t - p$, where $p \geq 2$, the design is disconnected and the symmetric matrix $Z'MZ$ can always be put in a block diagonal form with $p$ distinct blocks around the principal diagonal, by simply permuting the appropriate rows and columns. Each of these blocks represents a set of fully connected individuals that are disconnected from all other individuals that are not represented in that particular block. If we assume that we are dealing with a disconnected design and that the columns of $Z$ (i.e., the individuals) are ordered in such a way that $Z'MZ$ is in block diagonal form, it follows that $\mathrm{PEV}(u)$ is also block diagonal, as both the addition of $I\lambda$ and matrix inversion preserve this property. As most linear mixed model packages provide the PEV matrix, the identification of disconnected pairs of individuals can be performed by recovering this block diagonal structure by appropriate row and column permutations.
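The block diagonal structure of the PEV matrix can be checked numerically on a toy design. The Python sketch below (using NumPy) builds the PEV matrix from the expression given in this appendix and then groups individuals that are linked through nonzero PEV entries. The pseudoinverse is used in place of the ordinary inverse of X'X merely to tolerate rank-deficient X; the tolerance, the toy design, and the function names are assumptions made for illustration.

```python
import numpy as np

def pev_matrix(X, Z, lam, sigma2_e=1.0):
    """PEV(u) = (Z'MZ + I*lambda)^{-1} * sigma_e^2, with M = I - X (X'X)^- X'."""
    n = X.shape[0]
    M = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
    t = Z.shape[1]
    return np.linalg.inv(Z.T @ M @ Z + lam * np.eye(t)) * sigma2_e

def connected_groups(pev, tol=1e-10):
    """Label individuals by the diagonal block they belong to."""
    t = pev.shape[0]
    label = [-1] * t
    n_groups = 0
    for s in range(t):
        if label[s] != -1:
            continue
        label[s] = n_groups
        stack = [s]
        while stack:
            i = stack.pop()
            for j in range(t):
                if label[j] == -1 and abs(pev[i, j]) > tol:
                    label[j] = n_groups
                    stack.append(j)
        n_groups += 1
    return label

# Toy design: individuals 0 and 1 tested in environment 1 only,
# individuals 2 and 3 in environment 2 only, one record each.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Z = np.eye(4)
pev = pev_matrix(X, Z, lam=1.0)
labels = connected_groups(pev)
```

For this design the PEV entries linking the two environments vanish, so individuals 0 and 1 form one group and individuals 2 and 3 another.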


Appendix B: By computing the transitive closure

If $\mathrm{Var}(u)$ is not a diagonal matrix, the block diagonal structure of $Z'MZ$ is not preserved in the PEV matrix. One could of course examine the structure of $Z'MZ$ instead, but this matrix is generally not available. It might therefore be easier to identify disconnected pairs of individuals by determining the transitive closure of their adjacency matrix. This is a symmetric, Boolean $t \times t$ matrix where the element on row $i$ and column $j$ is set to 1 if individuals $i$ and $j$ have been evaluated in a common environment and 0 otherwise. For the example in Figure 1, this adjacency matrix looks like

$$A = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}.$$

The transitive closure of this matrix is again a symmetric, block-diagonalizable, Boolean matrix that can be interpreted in a similar way as $Z'MZ$. Warshall (1962) describes a concise and efficient algorithm for computing the transitive closure of an adjacency matrix that has a worst-case complexity of $O(t^3)$. More advanced algorithms are described by Naessens et al. (2002) and De Meyer et al. (2004).
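A direct Python transcription of Warshall's algorithm, applied to the adjacency matrix shown above, is given below as an illustration. The function names and the list-of-lists representation are choices made for this sketch.

```python
def transitive_closure(adjacency):
    """Warshall's algorithm on a Boolean matrix given as a list of lists."""
    t = len(adjacency)
    closure = [[bool(x) for x in row] for row in adjacency]
    for k in range(t):
        for i in range(t):
            if closure[i][k]:
                for j in range(t):
                    if closure[k][j]:
                        closure[i][j] = True
    return closure

def disconnected_pairs(adjacency):
    """Index pairs (i, j), i < j, that are not linked through any chain of
    common environments."""
    closure = transitive_closure(adjacency)
    t = len(closure)
    return [(i, j) for i in range(t) for j in range(i + 1, t) if not closure[i][j]]

A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
pairs = disconnected_pairs(A)

# A chain 0-1-2: individuals 0 and 2 never meet but are linked through 1.
chain = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]
```

For the matrix A the closure equals A itself, so every pair formed by one individual from each block is reported as disconnected, whereas the chain example contains no disconnected pairs.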
