
Genetic Epidemiology RESEARCH ARTICLE

An Object-Oriented Regression for Building Disease Predictive Models with Multiallelic HLA Genes

Lue Ping Zhao,1,2∗ Hamid Bolouri,3 Michael Zhao,4 Daniel E. Geraghty,5 Åke Lernmark,6∗ and The Better Diabetes Diagnosis Study Group7

1Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America; 2Department of Biostatistics, University of Washington School of Public Health, Seattle, Washington, United States of America; 3Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America; 4Bellevue High School, Seattle, Washington, United States of America; 5Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America; 6Department of Clinical Sciences, Lund University/CRC, Skåne University Hospital, Malmö, Sweden; 7Members of the Better Diabetes Diagnosis Study Are Listed in the Appendix

Received 3 December 2015; Revised 11 February 2016; accepted revised manuscript 17 February 2016.

Published online 12 April 2016 in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/gepi.21968

ABSTRACT: Recent genome-wide association studies confirm that human leukocyte antigen (HLA) genes have the strongest associations with several autoimmune diseases, including type 1 diabetes (T1D), providing an impetus to reduce this genetic association to practice through an HLA-based disease predictive model. However, conventional model-building methods tend to be suboptimal when predictors are highly polymorphic, with many rare alleles combined with complex patterns of sequence homology within and between genes. To circumvent this challenge, we describe an alternative methodology: treating complex genotypes of HLA genes as “objects” or “exemplars,” one focuses on systemic associations of disease phenotype with “objects” via similarity measurements. Conceptually, this approach assigns disease risks based on complex genotype profiles instead of specific disease-associated genotypes or alleles. Effectively, it transforms large, discrete, and sparse HLA genotypes into a matrix of similarity-based covariates. Drawing on the kernel representer theorem and machine learning techniques, it uses a penalized likelihood method to select disease-associated exemplars in building predictive models. To illustrate this methodology, we apply it to a T1D study with eight HLA genes (HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1) to build a predictive model. The resulting predictive model has an area under the curve of 0.92 in the training set and 0.89 in the validation set, indicating that this methodology is useful for building predictive models with complex HLA genotypes. Genet Epidemiol 40:315–332, 2016. © 2016 Wiley Periodicals, Inc.

KEY WORDS: generalized linear model; kernel machine; multiallelic genotypes; penalized regression; prediction; similarity regression; statistical learning

Introduction

Following the successes of next-generation sequencing technologies, a goal of future biotechnology innovation is to produce fully phased diploids of human genomes, that is, a pair of phased haplotypes with multiple single nucleotide polymorphisms (SNPs) [Tewhey et al., 2011; Yang et al., 2011]. Within a functional gene, multiple phased SNP alleles, together with all monomorphic nucleotides, represent fully phased sequences that are useful for deciphering functional transcript or protein sequences. Indeed, it is of great interest to turn multiple phased SNPs into multiallelic polymorphisms. Probably the best examples are the human leukocyte antigen (HLA) genes, located between 6p22.1 and 6p21.3 on chromosome 6. For example, the HLA-DRB1 gene consists of a pair of alleles in each individual, and each allele corresponds to a phased sequence [Marsh, 2000]. By the most recent counting statistics (http://www.ebi.ac.uk/ipd/imgt/hla/), HLA-DRB1 has over 1,868 alleles that code for 1,364 proteins. The exceptional polymorphism and presumed multifunctionality have presented challenges to studying its associations with diseases, most prominently autoimmune disorders such as type 1 diabetes (T1D) [Noble, 2015]. Further, the high polymorphism also hinders the translation of positive discoveries from bench to bedside, because of limited sample sizes associated with many less common alleles and multiple testing with numerous alleles.

Supporting Information is available in the online issue at wileyonlinelibrary.com.

∗Correspondence to: Lue Ping Zhao, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA. E-mail: [email protected]; Åke Lernmark, Department of Clinical Sciences, Lund University/CRC, Skåne University Hospital, Malmö, Sweden. E-mail: [email protected]

There is a need for a new analytic approach to overcome this challenge. In genetic epidemiology, as perhaps in most scientific endeavors, the most commonly used data analysis tools are regression-based methods focusing on individual covariates or elements. For example, genetic analysis tends to focus on individual alleles and/or genotypes in candidate gene studies or individual alleles of SNPs in genome-wide association studies. In other words, our reductionist approach is to discover “needles in the haystack” (see Fig. 1 for illustration). Inevitably, the covariate-specific regression approach encounters challenges in dealing with too many covariates, as is the case with the many alleles in HLA genes. In recent years of omics research, the scientific community has paid increasing attention to a “systems biology” approach, focusing on profiles of multiple genes and their joint associations with phenotype, that is, “systemic” or “holistic” associations. Rather than reducing to specific elements, a systemic approach tends to address whether a variation at the system level is functional. If two systems have similar variation profiles, they likely have a similar phenotype, and similarity patterns are often visually displayed [Cardinal-Fernandez et al., 2014; Smith et al., 2014]. Extending the idea of systemic analysis to complex HLA genotypes, our interest is to develop an analytic framework to quantify systemic associations of a disease phenotype with genotype profiles, rather than with individual alleles or genotypes. Toward this goal, we treat genotype profiles of multiple genes as “objects” or “exemplars,” measure the similarity of subjects’ genotypes with those of exemplars, and assess systemic associations of the phenotype with similarity measurements via the kernel machine [Hastie et al., 2015]. Conceptually, this approach focuses on “objects” in the kernel machine, rather than individual elements in “objects,” leading to object-oriented regression (OOR). Hereafter, “genotype profiles,” “objects,” and “exemplars” are used interchangeably. As illustrated in Figure 1, the systemic approach identifies a set of exemplars, representing systemic variations, and assesses which exemplars are associated with phenotype.

Figure 1. An illustration of typical reductionist and holistic approaches: the conventional reductionist approach reduces a complex object into an array of individual elements (genes, SNPs, or alleles), and discovers specific elements associated with outcome. The systemic approach identifies a set of commonly observed data patterns (arrays of individual elements, combinations of genotypes, haplotypes, or interesting objects) as exemplars (or systems), and discovers specific exemplars associated with outcome.

Although independently motivated, OOR has a close connection with three recent applications of the kernel machine. Wu et al. [2010, 2011] have published two high-impact papers introducing kernel-based methods for testing gene-set associations and for assessing rare variants in case-control studies [Kwee et al., 2008; Wu et al., 2010, 2011]. The key idea is to encapsulate genetic associations of main effects and interactions by modeling their kernel function. Extending the same idea, Minnier et al. [2015] have described a method for risk classification with multiple independent gene sets, in which they model gene-specific kernel functions via singular value decomposition.

In terms of the analytic objective, OOR has a closer connection with Zhu and Hastie’s [2005] kernel logistic regression and support vector machine. Without modeling the kernel function, kernel logistic regression proposes using “import points” to reduce the kernel space. Taking the idea of “import points” further, OOR introduces exemplars that can be derived internally or externally. More importantly, we assume that coefficients associated with many exemplars are approximately zero, so that OOR uses the penalized likelihood to deselect uninformative exemplars that are not associated with the disease phenotype, leaving a set of “informative exemplars” for the predictive model.

In the remainder of the manuscript, the Methodology section provides statistical motivations for OOR, lays out the OOR framework, identifies approaches for choosing exemplars, and builds up predictive models. It also describes the general flow from converting complex genotypes into similarity measures to building predictive models. Besides detailing choices of exemplars and selection of predictors, the Methodology section describes how to assess the stability of the chosen penalty parameter and how to assess the concordance of informative exemplars through the bootstrap. To illustrate OOR, the Application section describes the T1D study and illustrates the utility of OOR for exploring disease associations with HLA genes and for building a T1D predictive model. The Results section describes associations of HLA-DRB1 with T1D and a T1D predictive model with six HLA genes. The manuscript ends with conclusions and discussions on OOR and related results.

Methodology

Motivation

Consider genotypes arising from studies of highly polymorphic genes. To be specific, the motivating study is a case-control study of T1D and eight class II HLA genes (HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1) [Delli et al., 2010, 2012]. Because of their structural polymorphisms, only one of the HLA-DRB3, HLA-DRB4, and HLA-DRB5 alleles appears on any single chromosome, and hence HLA-DRB345 is used to denote genotypes of all three genes hereafter. Each individual carries two alleles of each gene, and each allele represents a fully phased nucleotide sequence. If the $j$th gene has, say, $m_j$ possible sequence variations, a genotype with a pair of alleles can take one of $m_j(m_j + 1)/2$ possible genotypic polymorphisms under Hardy-Weinberg equilibrium (HWE), that is, statistical independence of alleles within a locus. An array of genotypes at multiple gene loci is referred to as a genotype profile. If these genes were in linkage equilibrium (LE), that is, statistical independence between loci, the total number of genotype profiles could theoretically be as large as the product $\prod_j m_j(m_j + 1)/2$. This total could easily exceed the typical sample sizes of most population-based studies. In practice, however, the observed number of genotype profiles is much smaller than the theoretical total, due to biological constraints: (1) HLA genetic polymorphisms are highly selected within populations; (2) paired HLA gene alleles within loci tend to deviate from HWE; (3) genotype profiles of multiple HLA genes tend to deviate from LE because of physical adjacency or gene-gene interactions; and (4) the genetic region covering HLA genes is known to have a relatively lower recombination rate than the rest of the genome, despite the presence of “recombination hot spots” in the region [Cullen et al., 1997; Jeffreys & May, 2004; Marsh, 2000]. These complexities lead to the typical observation that some genotype profiles are overrepresented and many others are completely absent. Such a phenomenon presents challenges to HLA association analysis, if one aims to examine disease associations with individual alleles/genotypes within a single gene, to investigate one genetic association after stratifying over genotypes of another gene, or to carry out haplotype analysis with two or more genes.
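As a quick check of the counting argument above, the following Python sketch evaluates the number of possible genotypes for one gene and the product over several genes. The function names and the three-gene allele counts are illustrative assumptions, not values from the study; only the 44-allele case corresponds to the HLA-DRB1 allele count reported later in the Results.

```python
from math import prod


def n_genotypes(m: int) -> int:
    """Number of unordered allele pairs for a gene with m alleles: m(m+1)/2."""
    return m * (m + 1) // 2


def n_profiles(allele_counts) -> int:
    """Theoretical upper bound on multi-gene genotype profiles,
    i.e. the product of per-gene genotype counts."""
    return prod(n_genotypes(m) for m in allele_counts)


# 44 alleles, as observed for HLA-DRB1 in the training set
print(n_genotypes(44))

# Hypothetical allele counts for three genes (illustration only)
print(n_profiles([44, 10, 8]))
```

Even with these modest hypothetical allele counts, the bound runs into the millions, which is why the observed profiles, rather than the theoretical ones, are the practical starting point.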

Instead of focusing on individual alleles or genotypes, a complementary approach is a systemic approach, that is, focusing on genotype profiles and examining their overall associations with outcome [Bell & Koithan, 2006; Fang & Casadevall, 2011]. In other words, treating observed genotype profiles as exemplars, one computes similarities of subjects’ genotypes with exemplars, and assesses whether similarity to exemplars is associated with the disease phenotype. Given the sample size $n$ in a case-control study, the total number of possible exemplars, if derived internally from the study, is at most $n$. As noted above, the actual number of unique genotype patterns formed by eight HLA genes is less than the sample size $n$. Treating all unique genotype profiles as exemplars, one can directly assess the T1D association with subjects’ similarity measures to all these exemplars. Formalization of these observations motivates the proposal of OOR.

Consider a profile of $m$ genotypes, denoted by $\mathbf{g}_i = (g_{i1}, g_{i2}, \ldots, g_{im})$, observed on the $i$th subject ($i = 1, 2, \ldots, n$). Over all subjects, unique genotype profiles are identified, and the $k$th is denoted by $\mathbf{g}^*_k$ as the $k$th exemplar ($k = 1, 2, \ldots, q$). Using the observed genotypes, one measures the subject’s similarity with the $k$th exemplar via a similarity function, denoted by $K(\mathbf{g}_i, \mathbf{g}^*_k)$, which is referred to as a kernel function [Cristianini & Shawe-Taylor, 2000; Hastie et al., 2009; Minnier et al., 2015; Zhu & Hastie, 2005]. The analytic objective of OOR is to assess genetic association with the disease phenotype, denoted by $y_i = 0$ for a control and $y_i = 1$ for a case. Without imposing any parametric assumptions, one can use the representer theorem [Kimeldorf & Wahba, 1971] to capture this genetic association by the following representation:

$$\operatorname{logit}[\Pr(y_i = 1 \mid \mathbf{g}_i)] = \alpha + \sum_{k=1}^{n} \theta_k K(\mathbf{g}_i, \mathbf{g}_k), \tag{1}$$

where the summation is over all observed samples, the kernel function $K(\mathbf{g}_i, \mathbf{g}_k)$ measures similarity between the $i$th and $k$th genotype profiles, the $\theta_k$ are kernel parameters, and $\alpha$ is an intercept so that the summation is centered around zero. Wu et al. [2010, 2011] assume a Gaussian distribution to model the kernel parameters, as a way to test rare-variant associations and to examine SNP-set associations. More recently, Minnier et al. [2015] use singular value decomposition to simplify the similarity matrix $[K(\mathbf{g}_i, \mathbf{g}_k)]$ as a way to approximate the above representation. Zhu and Hastie [2005] suggest reducing the above summation over all observed samples to a summation over a set of “import points,” with fewer kernel parameters to be estimated.

In the current context, we note that the number of unique genotype profiles tends to be smaller than the sample size. Let $\mathbf{g}^*_k$ ($k = 1, 2, \ldots, q$) denote the $q$ unique genotype profiles. Many terms in the above representation share identical genotype profiles and can be merged, so that a more compact representation of Equation (1) may be written as:

$$\operatorname{logit}[\Pr(y_i = 1 \mid \mathbf{g}_i)] = \alpha + \sum_{k=1}^{q} \left( \sum_{l:\, \mathbf{g}_l = \mathbf{g}^*_k} \theta_l \right) K(\mathbf{g}_i, \mathbf{g}^*_k), \tag{2}$$

where the inner summation $\sum_{l:\, \mathbf{g}_l = \mathbf{g}^*_k} \theta_l$ is over kernel parameters with the same unique $\mathbf{g}^*_k$, and this sum of parameters can be reparametrized as a new parameter $\beta_k$. This observation leads us to propose the following logistic regression model:

$$\operatorname{logit}[\Pr(y_i = 1 \mid \mathbf{g}_i)] = \alpha + \sum_{k=1}^{q} \beta_k K(\mathbf{g}_i, \mathbf{g}^*_k), \tag{3}$$


where the regression coefficient $\beta_k$ quantifies the disease association with the $k$th similarity measure of $\mathbf{g}_i$ with $\mathbf{g}^*_k$, and $\mathbf{g}^*_k$ denotes an exemplar. If $\beta_k$ does not equal zero ($\beta_k \neq 0$), subjects similar to the $k$th exemplar are at either increased or decreased risk. Similarly, $\beta_k = 0$ suggests that being similar to the $k$th exemplar is inconsequential to their disease risk. Rather than modeling the kernel parameters $\beta_k$, we assume that many $\beta_k$ are probably equal to zero. Identifying the nonzero $\beta_k$ and estimating their values are the primary analytic objectives of OOR. Because the focus is on estimating the regression coefficients $\beta_k$, their interpretation hinges directly on the similarity of genotype profiles with exemplars, rather than on specific alleles or genotypes as in traditional covariate-specific regression analysis.
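The merging step that leads from Equation (1) to Equation (3) can be checked numerically. The Python sketch below sums hypothetical kernel parameters over subjects with identical genotype profiles and confirms that both forms give the same linear predictor. The toy kernel, the single-locus profiles, the parameter values, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def merge_kernel_parameters(profiles, theta):
    """Sum theta_l over subjects sharing the same genotype profile.
    Returns the unique profiles (exemplars) and their merged beta_k."""
    beta = defaultdict(float)
    for profile, t in zip(profiles, theta):
        beta[profile] += t
    exemplars = list(beta)
    return exemplars, [beta[e] for e in exemplars]


def linear_predictor(g, points, coefs, kernel, alpha=0.0):
    """alpha + sum_k coef_k * K(g, point_k)."""
    return alpha + sum(c * kernel(g, p) for p, c in zip(points, coefs))


def toy_kernel(g1, g2):
    """Toy similarity for single-locus genotypes (illustration only)."""
    if sorted(g1) == sorted(g2):
        return 1.0
    return 0.5 if set(g1) & set(g2) else 0.0


# Five subjects, three unique profiles; theta values are made up.
profiles = [("A", "B"), ("A", "B"), ("A", "C"), ("C", "C"), ("A", "C")]
theta = [0.2, -0.5, 1.0, 0.3, 0.4]

exemplars, beta = merge_kernel_parameters(profiles, theta)
new_subject = ("B", "C")
eta_full = linear_predictor(new_subject, profiles, theta, toy_kernel)
eta_merged = linear_predictor(new_subject, exemplars, beta, toy_kernel)
print(exemplars, beta)
print(eta_full, eta_merged)  # equal up to floating-point rounding
```

The number of coefficients drops from the sample size to the number of unique profiles without changing any fitted probability.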

An OOR Framework

The motivation for OOR is straightforward, and its presentation in Equation (3) is deceptively simple. In reality, to use OOR, we have to address three distinct methodological issues: (1) choice of a similarity measure, (2) choice of exemplars, and (3) selection of informative exemplars with nonzero $\beta_k$ coefficients. Different choices for these three methodological components lead to different versions of methods in the OOR framework.

Similarity Measures

Purely from a theoretical consideration, the choice of similarity measure needs to ensure that the kernel function is symmetric and positive semidefinite [Kimeldorf & Wahba, 1971; Zhu & Hastie, 2005]. In practice, many similarity measures are suitable, but the choice is context dependent. Here, we use a similarity measure suitable for genetic analysis. Suppose the exemplar is $\mathbf{g}^*_k = (a^*_{k1}/a^{*\prime}_{k1}, \ldots, a^*_{k6}/a^{*\prime}_{k6})$ for six HLA gene loci, where the pair of alleles $a^*_{kj}/a^{*\prime}_{kj}$ denotes the genotype at the $j$th gene locus, and let $a_{ij}/a'_{ij}$ denote the $i$th subject’s genotype at the same locus. When measuring the similarity with the exemplar, we consider the following function:

$$K(\mathbf{g}_i, \mathbf{g}^*_k) = \frac{1}{6} \sum_{j=1}^{6} K(a_{ij}/a'_{ij},\; a^*_{kj}/a^{*\prime}_{kj})$$

and

$$K(a_{ij}/a'_{ij},\; a^*_{kj}/a^{*\prime}_{kj}) = \frac{1}{2} \max\left[ I(a_{ij} = a^*_{kj}) + I(a'_{ij} = a^{*\prime}_{kj}),\; I(a_{ij} = a^{*\prime}_{kj}) + I(a'_{ij} = a^*_{kj}) \right], \tag{4}$$

where $I(\cdot)$ is an indicator function and each $K(a_{ij}/a'_{ij},\; a^*_{kj}/a^{*\prime}_{kj})$ is the identity-by-state measure commonly used in genetic analysis [Bishop & Williamson, 1990]. The above similarity measure takes values between 0 and 1, ranging from no similarity to identity, respectively. However, the current measure does not accommodate potentially different functional significance of individual genes or individual alleles. One way to generalize the above similarity measure is to introduce gene- or allele-specific weights in the calculation.
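A minimal Python sketch of the identity-by-state similarity in Equation (4), assuming each genotype is stored as a pair of allele labels and each profile as a list of such pairs. The function names are hypothetical, and the number of loci is taken from the input rather than fixed at six.

```python
def ibs_genotype(genotype, exemplar_genotype):
    """Identity-by-state similarity of two single-locus genotypes.
    Each genotype is a pair of alleles; the result is 0, 0.5, or 1."""
    (a1, a2), (b1, b2) = genotype, exemplar_genotype
    direct = (a1 == b1) + (a2 == b2)    # alleles matched in the given order
    crossed = (a1 == b2) + (a2 == b1)   # alleles matched in swapped order
    return 0.5 * max(direct, crossed)


def ibs_profile(profile, exemplar):
    """Average the locus-specific similarities over all loci."""
    if len(profile) != len(exemplar):
        raise ValueError("profile and exemplar must cover the same loci")
    total = sum(ibs_genotype(g, e) for g, e in zip(profile, exemplar))
    return total / len(profile)


# Single-locus checks: identical, one shared allele, no shared allele
print(ibs_genotype(("04:01", "03:01"), ("03:01", "04:01")))
print(ibs_genotype(("04:01", "03:01"), ("04:01", "15:01")))
print(ibs_genotype(("04:01", "03:01"), ("15:01", "07:01")))

# Two-locus profile with made-up allele labels
subject = [("04:01", "03:01"), ("X", "X")]
exemplar = [("03:01", "04:01"), ("Y", "Z")]
print(ibs_profile(subject, exemplar))
```

Taking the maximum over the two allele pairings makes the measure independent of the order in which the two alleles of a genotype are recorded.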

Choice of Exemplars

In the derivation of the OOR formulation (3), we have used observed unique genotype profiles as exemplars. However, from the application perspective, one can choose exemplars either externally or internally, depending on the research questions of interest. For example, one may choose exemplars externally from the literature. Alternatively, one can choose exemplars internally, which is the focus of this manuscript. When selecting exemplars internally, one may choose all unique genotype profiles as exemplars, as described above. When dealing with a large number of alleles or sequence variations, as with HLA, one may consider a clustering analysis of the $n \times n$ similarity matrix $K = [K(\mathbf{g}_i, \mathbf{g}_k)]_{n \times n}$ of pairwise similarity measurements, and choose the “centroids” of clusters as exemplars.
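When all unique genotype profiles serve as exemplars, the similarity-based covariates form a matrix with one row per subject and one column per exemplar. The Python sketch below illustrates one way to build it, assuming profiles are lists of allele pairs and using an identity-by-state kernel. The function names and the single-locus toy data are hypothetical, and the clustering-based choice of centroids is not shown.

```python
def canonical(profile):
    """Sort the two alleles within each genotype so that A/B equals B/A."""
    return tuple(tuple(sorted(genotype)) for genotype in profile)


def unique_exemplars(profiles):
    """Unique genotype profiles, in order of first appearance."""
    seen = {}
    for profile in profiles:
        seen.setdefault(canonical(profile), None)
    return list(seen)


def ibs(profile, exemplar):
    """Identity-by-state similarity averaged over loci."""
    total = 0.0
    for (a1, a2), (b1, b2) in zip(profile, exemplar):
        total += 0.5 * max((a1 == b1) + (a2 == b2), (a1 == b2) + (a2 == b1))
    return total / len(profile)


def similarity_covariates(profiles, exemplars, kernel=ibs):
    """Rows are subjects, columns are exemplars."""
    return [[kernel(canonical(p), e) for e in exemplars] for p in profiles]


# Four subjects typed at one locus (made-up alleles)
profiles = [[("A", "B")], [("B", "A")], [("A", "C")], [("C", "C")]]
exemplars = unique_exemplars(profiles)
X = similarity_covariates(profiles, exemplars)
print(len(profiles), "subjects,", len(exemplars), "exemplars")
for row in X:
    print(row)
```

The resulting matrix is what a logistic regression of the form in Equation (3) would take as its covariates.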

Selections of Informative Exemplars

Following the identification of exemplars $\mathbf{g}^*_k$, the next analytic objective of OOR is to select those exemplars whose similarity measures are significantly associated with the disease phenotype of interest. As noted above, one expects that many regression coefficients $\beta_k$ equal zero, and the corresponding exemplars should be deselected from the logistic regression model (3), leaving only those informative exemplars whose similarity measures are associated with the disease phenotype. Because the number of exemplars can still be relatively large, we consider penalized likelihood methods to select informative exemplars and avoid overfitting. Using the same notation as above, the penalized likelihood estimates may be written as:

$$(\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_q) = \arg\min_{\alpha, \beta_1, \ldots, \beta_q} \left( -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i(\alpha + K_i'\beta) - \log\left(1 + e^{\alpha + K_i'\beta}\right) \right] + \lambda \left[ 0.5(1 - \eta)\|\beta\|_2^2 + \eta\|\beta\|_1 \right] \right), \tag{5}$$

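where $K_i = (K(\mathbf{g}_i, \mathbf{g}^*_1), \ldots, K(\mathbf{g}_i, \mathbf{g}^*_q))'$ is the vector of similarity measures for the $i$th subject, $\beta = (\beta_1, \ldots, \beta_q)'$, $\lambda$ is a tuning parameter that determines the penalty level, and $\|\beta\|_1$ and $\|\beta\|_2$ are the $l_1$ and $l_2$ norms, respectively. By setting $\eta$ to 1, 0, or an intermediate value such as 0.5, the above penalized likelihood method corresponds to the least absolute shrinkage and selection operator (LASSO), ridge regression, and the elastic net, respectively. The tuning parameter $\lambda$ is chosen by cross-validation to minimize the prediction error.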
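To make the roles of λ and η concrete, the following Python sketch evaluates the quantity inside the parentheses of Equation (5) for given parameter values. It is an illustration under assumed inputs (a list-of-lists similarity matrix `K` and 0/1 outcomes `y`), not the GLMNET implementation used in the paper, and the function name is hypothetical. With η = 1 only the l1 term contributes to the penalty, and with η = 0 only the squared l2 term does.

```python
import math


def penalized_objective(alpha, beta, K, y, lam, eta):
    """Negative average log-likelihood of the logistic model plus the
    elastic-net style penalty lam * [0.5*(1-eta)*||beta||_2^2 + eta*||beta||_1]."""
    n = len(y)
    loglik = 0.0
    for k_i, y_i in zip(K, y):
        lin = alpha + sum(b * k for b, k in zip(beta, k_i))
        loglik += y_i * lin - math.log1p(math.exp(lin))
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    penalty = lam * (0.5 * (1.0 - eta) * l2_squared + eta * l1)
    return -loglik / n + penalty


# Toy similarity covariates for three subjects and two exemplars
K = [[1.0, 0.5], [0.5, 1.0], [0.0, 0.5]]
y = [1, 0, 0]
beta = [1.0, -2.0]

for eta in (1.0, 0.5, 0.0):
    print(eta, penalized_objective(-0.5, beta, K, y, lam=0.1, eta=eta))
```

Minimizing this function over the intercept and coefficients for a fixed λ and η is the optimization that a penalized regression solver carries out.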

In contrast, the best-known traditional strategy for selecting variables is a hybrid of forward and backward stepwise selection of predictors based upon information criterion (IC) measures such as Akaike’s IC (AIC). Given the extensive literature on likelihood-based estimation, it suffices to note that under the logistic regression model (3), one can use a log-likelihood function similar to Equation (5), except that the penalty component involving $\lambda$ is replaced by $2(1 + q)$.


Penalty Parameter and Selection of Informative Exemplars

It is known in the literature that the tuning parameter in penalized likelihood methods imposes a penalty on parameter estimation, trading bias in the estimated regression coefficients against their variance [Cox & O’Sullivan, 1989; Friedman et al., 2010; Sun et al., 2013; Tibshirani et al., 2012]. Cross-validation is commonly recommended for estimating the penalty parameter. However, cross-validation is a random process and results in a random estimate, and this randomness may affect the selection of exemplars. Here, we recommend repeating the cross-validation process multiple times to estimate the empirical distribution of the penalty parameter, based on which we then evaluate the stability of variable selection with a fixed penalty parameter (see discussion below). Computationally, we estimate the penalty parameter with 10-fold cross-validation (a default recommendation in cv.glmnet, an R implementation of GLMNET) and repeat the computation, say, 100 times. All estimated penalty parameters are used to construct an empirical distribution.
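The repeated cross-validation procedure can be sketched in Python as follows. This is not cv.glmnet: to stay self-contained it swaps in a toy ridge-penalized logistic regression with a single covariate fitted by gradient descent, and the simulated data, the grid of penalty values, and the function names are all assumptions made for illustration. The point is only the outer loop, which reruns k-fold cross-validation with fresh random fold assignments and collects the selected penalty values into an empirical distribution.

```python
import math
import random
from collections import Counter


def fit_ridge_logistic(xs, ys, lam, iters=50, lr=0.5):
    """Toy stand-in for a penalized fit: gradient descent for a logistic
    model with one covariate and ridge penalty 0.5 * lam * b**2."""
    a = b = 0.0
    n = len(xs)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (p - y) / n
            grad_b += (p - y) * x / n
        a -= lr * grad_a
        b -= lr * (grad_b + lam * b)
    return a, b


def neg_log_lik(xs, ys, a, b):
    """Negative Bernoulli log-likelihood, used as the prediction error."""
    return sum(math.log1p(math.exp(a + b * x)) - y * (a + b * x)
               for x, y in zip(xs, ys))


def cv_lambda(xs, ys, grid, k, rng):
    """One run of k-fold cross-validation; returns the grid value with
    the smallest held-out error for this random fold assignment."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    errors = []
    for lam in grid:
        err = 0.0
        for fold in folds:
            held = set(fold)
            train = [i for i in idx if i not in held]
            a, b = fit_ridge_logistic([xs[i] for i in train],
                                      [ys[i] for i in train], lam)
            err += neg_log_lik([xs[i] for i in fold],
                               [ys[i] for i in fold], a, b)
        errors.append(err)
    return grid[errors.index(min(errors))]


def repeated_cv(xs, ys, grid, repeats=100, k=10, seed=0):
    """Repeat cross-validation and collect the selected penalty values."""
    rng = random.Random(seed)
    return [cv_lambda(xs, ys, grid, k, rng) for _ in range(repeats)]


# Simulated similarity covariate in [0, 1] and case-control status.
data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(40)]
ys = [1 if data_rng.random() < 1 / (1 + math.exp(1 - 2 * x)) else 0 for x in xs]

lambdas = repeated_cv(xs, ys, grid=[0.0, 0.01, 0.1, 1.0], repeats=10)
print(Counter(lambdas))  # empirical distribution of the selected penalty
```

If the selected value varies widely across repetitions, a summary of this distribution (for example its median) is a more defensible fixed penalty than the result of any single run.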

Assessing Stability of Exemplar Selection with Fixed Penalty Parameter (λ)

A major challenge facing practically all variable selection procedures that deal with complex or high-dimensional data is the stability of the selected variables [Meinshausen & Bühlmann, 2010]. Selection of informative exemplars is no exception. Having assessed the empirical distribution of penalty parameter estimates above, we are interested in the stability of selecting informative exemplars. To address this issue, we use the bootstrap method. Briefly, we randomly draw sample observations from the study population with replacement, while keeping the same sample size. On each bootstrap sample, we perform a penalized likelihood analysis with fixed penalty parameters. Then, we compute Kappa statistics to measure whether exemplars are consistently selected [Agresti, 1988; Cohen, 1968].
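A minimal Python sketch of two ingredients of this stability check: drawing a bootstrap sample of the same size, and computing a kappa statistic between two sets of selected exemplars. It assumes selections are coded as 0/1 vectors over the exemplars and uses the unweighted Cohen's kappa; the penalized fit on each bootstrap sample is omitted, and the function names and example vectors are hypothetical.

```python
import random


def bootstrap_indices(n, rng):
    """Indices of a bootstrap sample: n draws with replacement from 0..n-1."""
    return [rng.randrange(n) for _ in range(n)]


def cohen_kappa(sel_a, sel_b):
    """Unweighted Cohen's kappa for two 0/1 selection vectors."""
    if len(sel_a) != len(sel_b) or not sel_a:
        raise ValueError("selection vectors must be non-empty and equal length")
    n = len(sel_a)
    observed = sum(a == b for a, b in zip(sel_a, sel_b)) / n
    p_a, p_b = sum(sel_a) / n, sum(sel_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both vectors are constant and identical
    return (observed - expected) / (1 - expected)


rng = random.Random(0)
sample = bootstrap_indices(10, rng)
print(sample)

# Exemplars selected (1) or not (0) in two hypothetical bootstrap fits
fit_1 = [1, 1, 0, 0, 0, 0, 1, 0]
fit_2 = [1, 0, 0, 0, 0, 0, 1, 0]
print(cohen_kappa(fit_1, fit_2))
```

In a full analysis, each bootstrap index list would be used to resample subjects, refit the penalized model at the fixed penalty, and record which coefficients are nonzero before the kappa statistics are computed.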

T1D Case-Control Study

A case-control study of T1D and HLA genes motivates the OOR development, and is reported elsewhere [Zhao et al., 2016]. Briefly, this study identified 970 T1D patients aged 1 to 18 years from geographically diverse clinics in Sweden, and treated them as cases. From comparable regions, the study identified 448 controls who were free from T1D. From all study subjects, the study collected blood samples, following Human Subjects review and approval, and extracted their DNA. Focusing on HLA genes, this study used next-generation sequencing technologies to assess high-resolution genotypes of HLA genes (HLA-DRB1, HLA-DRB345, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1). The analytic objective of this study is to investigate T1D associations with HLA genes, and to build a predictive model of T1D status with these HLA genotypes. To be inclusive, we randomly chose 479 cases and 226 controls as a training set and the remainder as a validation set (222 controls and 483 cases). Allelic frequencies of all genes among controls and cases are largely comparable between the training and validation sets (for illustration, supplementary Table S1 includes allelic frequencies of HLA-DRB1 among controls and cases from the training and validation sets).

Results

Application to HLA-DRB1

To illustrate OOR in dealing with complex HLA data, we first consider the T1D association with the HLA-DRB1 gene alone. Table 1 lists the genotypic distribution of HLA-DRB1 among controls and cases, below and above the diagonal line, respectively. For homozygous genotypes, which fall on the diagonal line, genotypic frequencies among controls and cases are given as numerator and denominator (#/#), respectively. The first impression from reading this genotype frequency table is that the genotypic distribution, with only 44 alleles, is sparse and has only 159 unique genotypes, far fewer than the theoretically possible number of genotypes, that is, 990 (= 44 × 45/2) computed under HWE. Second, certain genotypes exhibit substantially different frequencies between cases and controls, implying their associations with T1D. For example, the homozygote DRB1∗04:01:01/04:01:01 has frequencies of 0.4 and 9.3 among controls and cases, respectively, implying a rate ratio of 23.25 (= 9.3/0.4). At the other extreme, the heterozygote DRB1∗15:01:01/07:01:01 has frequencies of 4.4 among controls and 0 among cases, implying that this heterozygote appears to be protective against T1D. For common genotypes, direct assessment of T1D associations is practical with the current sample size, and has often been reported. Even in such cases, the analysis has to deal with zero frequencies, such as that of DRB1∗15:01:01/07:01:01, with special statistical treatment such as the Firth correction [Firth, 1993]. For many less common genotypes, rigorous assessment is difficult because of the sparseness, small sample sizes, and large number of comparisons. Given the desire to examine T1D associations with the gene as a whole, we are motivated to seek an alternative analytic approach.

Consider the OOR model for capturing the T1D association with HLA-DRB1 via Equation (3), without invoking any assumptions. Because of varying allelic frequencies and deviations from HWE for some alleles, many theoretically possible genotypes are absent, that is, they have zero frequency in both cases and controls (Table 1). Consequently, by treating all unique genotypes as exemplars, we have a total of q = 159 exemplars in Equation (3). Among these 159 regression coefficients ($\beta_k^{*}$), we expect that most equal zero, leaving a few informative exemplars.

In this application, each element of the similarity matrix takes a value of 1 for identity, 0.5 for sharing one allele, and 0 for sharing no alleles, between a pair of subjects. The similarity matrix among the 705 subjects, in which a pair of subjects shares both alleles (red), shares one allele (black), or shares no alleles (green), is shown as a heat map (Fig. 2). From an HLA-DRB1 perspective, it is possible to identify a

Genetic Epidemiology, Vol. 40, No. 4, 315–332, 2016 319


Table 1. Estimated genotypic frequencies of HLA-DRB1 in the training set among controls (below the diagonal line) and among cases (above the diagonal line)

Row and column alleles (HLA-DRB1): ∗01:01:01, ∗01:02:01, ∗01:03, ∗03:01:01, ∗03:05:01, ∗04:01:01, ∗04:02:01, ∗04:03:01, ∗04:04:01, ∗04:05:01, ∗04:05:04, ∗04:06:01, ∗04:07:01, ∗04:08:01, ∗04:10:01, ∗04:13, ∗07:01:01, ∗08:01:01, ∗08:02:01, ∗08:03:02, ∗08:04:01, ∗09:01:02, ∗10:01:01, ∗11:01:01, ∗11:01:02, ∗11:02:01, ∗11:03, ∗11:04:01, ∗12:01:01, ∗13:01:01, ∗13:02:01, ∗13:03:01, ∗13:05:01, ∗14:01:01, ∗14:02, ∗14:04, ∗14:12:01, ∗14:54:01, ∗15:01:01, ∗15:02:01, ∗15:03:01, ∗16:01:01, ∗16:02:01, ∗16:09

[Table body: 44 × 44 matrix of genotypic frequencies indexed by the alleles above; individual cell values are not legible in this copy.]

Genotypic frequencies of homozygous genotypes among controls and cases are denoted in numerator/denominator, respectively.


Figure 2. The computed similarity matrix (705 × 705) among 705 subjects in the training set. Each element takes the value 0 (green), 0.5 (black), or 1 (red) to indicate that the paired subjects share zero alleles, one allele, or both alleles, respectively.

group of subjects who are identical (red squares falling on the diagonal line), and another group of subjects who share only one allele (black rectangles). Note that for genotypes with extreme frequencies in cases and controls, such as the exemplar DRB1∗15:01:01/07:01:01, many controls share one allele with this exemplar, even though no control is identical to this exemplar.
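The single-gene similarity described above (1 for identical genotypes, 0.5 for one shared allele, 0 for none) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and treating a genotype as an unordered pair of alleles matched as a multiset is an assumption.

```python
from collections import Counter


def ibs_similarity(g1, g2):
    """Fraction of alleles shared by two unordered genotypes: 0, 0.5, or 1.

    Assumption: alleles are matched as a multiset, so a homozygote A/A
    compared with a heterozygote A/B shares exactly one allele (0.5).
    """
    shared = sum((Counter(g1) & Counter(g2)).values())
    return shared / 2.0


def similarity_matrix(subjects, exemplars):
    """Rows index subjects, columns index exemplars."""
    return [[ibs_similarity(s, e) for e in exemplars] for s in subjects]


# Toy genotypes built from allele names that appear in Table 1.
subjects = [
    ("04:01:01", "04:01:01"),
    ("15:01:01", "07:01:01"),
    ("07:01:01", "03:01:01"),
]
exemplars = [("15:01:01", "07:01:01"), ("04:01:01", "03:01:01")]

for row in similarity_matrix(subjects, exemplars):
    print(row)
```

In the single-gene analysis every unique genotype serves as an exemplar, so the columns would run over all 159 observed genotypes.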

Following the above OOR formulation, we perform a univariate association analysis by regressing the disease outcome on one exemplar-specific similarity measurement at a time, to gain insight into its marginal association. Detailed results from the univariate analysis, including estimated log odds ratios, standard errors, Z scores, and P values, are listed in supplementary Table S2, together with exemplars and associated genotypes. For more intuitive interpretation, Z scores rounded to integers are shown in a matrix format (Table 2), where only absolute values of 2 or greater, corresponding to a significance level of 0.05 or better (without correcting for multiple comparisons), are presented for simplicity. Results from these univariate analyses reveal that HLA-DRB1∗03:01:01 and ∗04:01:01 are positively associated with T1D; these are colored in red stripes. On the other hand, six alleles, HLA-DRB1∗07:01:01, ∗11:01:01, ∗11:04:01, ∗12:01:01, ∗13:01:01, and ∗15:01:01, are protective against T1D; these are colored in green stripes. It is interesting to note that a heterozygous genotype with one risk and one protective allele tends to have a positive association with T1D.
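The univariate step described above, regressing case/control status on the similarity to one exemplar at a time, amounts to a two-parameter logistic regression per exemplar. The sketch below fits such a model by Newton-Raphson in plain Python and returns the log odds ratio, standard error, Z score, and two-sided P value. The function name and the toy data are hypothetical; the paper's own software is not described in this section.

```python
import math


def fit_univariate_logistic(x, y, n_iter=25):
    """Fit logit P(y = 1) = a + b * x by Newton-Raphson.

    Returns (b, se_b, z, p_value) for the slope, i.e. the log odds ratio
    per unit of similarity. Assumes the data are not perfectly separated.
    """
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: information^-1 * score
        b += (h00 * g1 - h01 * g0) / det
    se_b = math.sqrt(h00 / det)            # slope variance from inverse information
    z = b / se_b
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    return b, se_b, z, p_value


# Made-up data: similarity to one exemplar (0, 0.5, or 1) and case status.
x = [0.0] * 8 + [0.5] * 8 + [1.0] * 8
y = [0] * 6 + [1] * 2 + [0] * 4 + [1] * 4 + [0] * 2 + [1] * 6
b, se, z, p = fit_univariate_logistic(x, y)
print(round(b, 3), round(se, 3), round(z, 2), round(p, 3))
```

Repeating the fit over every exemplar column of the similarity matrix would produce the kind of per-exemplar summary that the text attributes to supplementary Table S2.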

The next step of OOR is to select informative exemplars. For empirical comparison, we use four different estimation methods: LASSO, Ridge, Elastic Net, and stepwise selection, discussed above. All estimated regression coefficients are listed in supplementary Table S3. LASSO selects 18 predictors from the 159 exemplars, with estimated coefficients, that is, log odds ratios. Interestingly, positive coefficients tend to be associated with exemplars from patients, whereas negative coefficients tend to be associated with exemplars from controls.
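To make the selection step concrete, the following sketch fits an L1-penalized (LASSO) logistic regression by proximal gradient descent in plain Python. It is a stand-in for whatever penalized-regression software was actually used, which this section does not name; the function names, the step-size rule (which assumes all similarities lie in [0, 1]), and the toy data are assumptions made for illustration.

```python
import math


def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def lasso_logistic(X, y, lam, n_iter=2000):
    """Minimize mean logistic loss + lam * sum(|beta_k|); intercept unpenalized.

    X holds similarities in [0, 1], one row per subject and one column per
    exemplar. Under that assumption the step size 4 / (1 + q) does not
    exceed the inverse Lipschitz constant of the smooth part.
    """
    n, q = len(X), len(X[0])
    step = 4.0 / (1 + q)
    b0, beta = 0.0, [0.0] * q
    for _ in range(n_iter):
        resid = []
        for row, yi in zip(X, y):
            eta = b0 + sum(bk * xk for bk, xk in zip(beta, row))
            resid.append(1.0 / (1.0 + math.exp(-eta)) - yi)
        b0 -= step * sum(resid) / n
        for k in range(q):
            grad_k = sum(r * row[k] for r, row in zip(resid, X)) / n
            beta[k] = soft_threshold(beta[k] - step * grad_k, step * lam)
    return b0, beta


# Toy similarity columns: exemplar 0 resembles cases, exemplar 1 resembles
# controls, exemplar 2 carries no information (constant column).
X = [
    [1.0, 0.0, 0.5], [1.0, 0.5, 0.5], [0.5, 0.0, 0.5], [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5], [0.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.0, 1.0, 0.5],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

b0, beta = lasso_logistic(X, y, lam=0.05)
print([round(bk, 3) for bk in beta])
```

The uninformative exemplar receives a coefficient of exactly zero, while the case-like and control-like exemplars receive positive and negative coefficients, mirroring the pattern described in the text. Ridge regression would replace the soft-thresholding step with proportional shrinkage and therefore retain every exemplar.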

In contrast, the Ridge regression produces estimated coefficients for all exemplars, without deselecting any exemplar. For interpretation, all exemplars in supplementary Table S3 are sorted by their corresponding coefficients. Unlike LASSO estimates, estimated coefficients by Ridge regression take


Table 2. Estimated Z scores (rounded to integers; absolute values equal to or above 2) extracted from marginal association analysis by OOR

Row and column alleles (HLA-DRB1): ∗01:01:01, ∗01:02:01, ∗01:03, ∗03:01:01, ∗03:05:01, ∗04:01:01, ∗04:02:01, ∗04:03:01, ∗04:04:01, ∗04:05:01, ∗04:05:04, ∗04:06:01, ∗04:07:01, ∗04:08:01, ∗04:10:01, ∗04:13, ∗07:01:01, ∗08:01:01, ∗08:02:01, ∗08:03:02, ∗08:04:01, ∗09:01:02, ∗10:01:01, ∗11:01:01, ∗11:01:02, ∗11:02:01, ∗11:03, ∗11:04:01, ∗12:01:01, ∗13:01:01, ∗13:02:01, ∗13:03:01, ∗13:05:01, ∗14:01:01, ∗14:02, ∗14:04, ∗14:12:01, ∗14:54:01, ∗15:01:01, ∗15:02:01, ∗15:03:01, ∗16:01:01, ∗16:02:01, ∗16:09

[Table body: matrix of rounded Z scores indexed by the alleles above; individual cell values are not legible in this copy.]

Two major alleles (HLA-DRB1∗03:01:01 and ∗04:01:01) assert strong risk associations (red strips). Six alleles (HLA-DRB1∗07:01:01, ∗11:01:01, ∗11:04:01, ∗12:01:01, ∗13:01:01, and ∗15:01:01) assert strong protective associations with T1D.


Figure 3. Estimated sensitivity, 1-specificity, and area under curve (AUC) for predictive models with exemplars selected by LASSO, Ridge regression, the Elastic Net, and Stepwise regression in the training set (colored solid line) and in the validating set (dashed black line). The colored bar indicates corresponding values of risk scores ($\sum_{k=1}^{159} \beta_k^{*} K(\underset{\sim}{g}_i, \underset{\sim}{g}_k^{*})$) under the respective models.

modest values around zero. As expected, directionalities of estimated coefficients tend to be consistent with the case/control sources of all exemplars. Further, for those exemplars selected by LASSO, Ridge estimates are consistent in directionality with those obtained by LASSO. The third column of supplementary Table S3 shows coefficients estimated by the Elastic Net, which selected 39 exemplars. The majority of selected exemplars overlap with those selected by LASSO. Quantitatively, estimated coefficients are highly correlated between the Elastic Net and LASSO (not shown). Lastly, the stepwise regression selected 14 exemplars, 10 of which overlap with those selected by LASSO. Despite this seemingly high concordance, many stepwise coefficient estimates tend to be large in comparison with their counterparts obtained by LASSO.

To gain insight into the performance of predictive models with exemplars selected by these four procedures, we carried out receiver operating characteristic (ROC) curve analysis, evaluating sensitivity, specificity, and area under curve (AUC) for all four predictive models (Fig. 3). ROC curves and associated AUC values in the training set and in the validating set are shown for LASSO (Fig. 3A), Ridge (Fig. 3B), Elastic Net (Fig. 3C), and Stepwise (Fig. 3D). In the training set, estimated AUC values are around 0.9, and the ROC curves are largely comparable across the four methods. In the validation set, estimated AUC values decrease slightly to 0.866, as expected. Interestingly, differences among AUC values from the three penalized methods (LASSO, Ridge, and Elastic Net) are less than 0.001. The comparability of the ROC curve analysis across these three methods suggests that there are probably many predictive models, with different exemplars, that have comparable predictive performance.
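For readers less familiar with ROC analysis, the sketch below computes ROC points and the AUC directly from risk scores and case/control labels. It uses the rank-based definition of the AUC (the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half). The function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per distinct threshold.

    A subject is called positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties = 1/2)."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy risk scores for three cases (label 1) and three controls (label 0).
scores = [2.1, 0.3, -0.5, 1.2, -1.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels))
print(round(auc(scores, labels), 3))
```

An AUC of 0.5 corresponds to the null value mentioned in the text; values below 0.5 mean that the scores rank controls above cases more often than not.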

In contrast, for the model from the stepwise regression analysis, the estimated AUC in the validation set falls below 0.5, the null value. This result suggests that the stepwise procedure probably overfits the training data set, with excessively large regression coefficient estimates.

Application to all Class II HLA Genes

The next task is to build a T1D predictive model by applying OOR to all eight class II HLA genes: HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1. We use the same training set to explore exemplars and to build a predictive model, and then validate the predictive model in the validation set. With respect to the similarity measure, we use the unweighted mean similarity measure defined in Equation (4), denoted by $|K(\underset{\sim}{g}_i, \underset{\sim}{g}_k^{*})|_{n \times n}$, where n = 705 and each element takes a value


Figure 4. Estimated similarity matrix with each element measuring the unweighted identity-by-state at HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1. Colors ranging from green, through dark, to red correspond to low, medium, and high similarity, as denoted in the legend.

between 0 and 1. For easy visualization, we use a hierarchical clustering algorithm to organize this similarity matrix, and present its heatmap (Fig. 4). The central diagonal cluster (red square, highlighted by an annotation arrow) indicates the presence of many subjects who are either identical or highly similar to each other. Additionally, there are multiple smaller clusters of highly similar subjects, indicated by annotation arrows. Clustering patterns indicate that subjects in the lower right corner tend to carry more common genotype profiles, because more individuals carry common genotype profiles and their pairwise similarity measures tend to be high. On the other hand, those in the upper left corner tend to form smaller clusters of individuals with relatively high similarity measures, probably because their genotype profiles have relatively low frequencies and relatively smaller groups of individuals carry similar genotype profiles. Interestingly, there are subjects in the upper right corner with relatively low similarity measures, probably because individuals with common genotype profiles tend to segregate from those with less common genotype profiles. Patterns of clusters are intuitive and are useful to guide data exploration, even though it is difficult to quantify these patterns or to draw definitive conclusions from them.

The visual display of the similarity matrix indicates that there are duplicated genotype profiles and multiple clusters. Without relying on the clustering result in this application, we have chosen all unique genotype profiles in the training set as exemplars. In total, there are 499 exemplars in the training set. As part of the descriptive analysis, we apply OOR to perform univariate association analysis of T1D with all exemplars; estimated coefficients, standard errors, Z scores, and their P values are listed along with HLA genotypes (supplementary Table S4). Exemplars are sorted by Z scores, the directionalities of which are concordant with case and control status.
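The multi-gene similarity and the choice of unique genotype profiles as exemplars can be sketched as below. This is an illustration under stated assumptions, not the authors' code: Equation (4) is not reproduced in this section, so the unweighted mean is taken here as a plain average of per-gene identity-by-state values, and the six-locus layout (with DRB3/4/5 combined as "DRB345") follows the columns of Table 3. The function names are hypothetical.

```python
from collections import Counter


def gene_ibs(g1, g2):
    """Per-gene identity-by-state: number of shared alleles / 2."""
    return sum((Counter(g1) & Counter(g2)).values()) / 2.0


def mean_similarity(profile_a, profile_b):
    """Unweighted mean of per-gene IBS across all genes in the profile."""
    genes = sorted(profile_a)
    return sum(gene_ibs(profile_a[g], profile_b[g]) for g in genes) / len(genes)


def unique_exemplars(profiles):
    """Keep the first occurrence of each distinct multi-gene genotype profile."""
    seen, exemplars = set(), []
    for p in profiles:
        key = tuple((g, tuple(sorted(p[g]))) for g in sorted(p))
        if key not in seen:
            seen.add(key)
            exemplars.append(p)
    return exemplars


# Two profiles transcribed from Table 3 (exemplars D1612 and D1437);
# they differ only in one DPB1 allele.
d1612 = {
    "DRB1": ("04:01:01", "04:01:01"),
    "DRB345": ("DRB4*01:03:01", "DRB4*01:03:01"),
    "DQA1": ("03:01:01", "03:01:01"),
    "DQB1": ("03:02:01", "03:02:01"),
    "DPA1": ("01:03:01", "01:03:01"),
    "DPB1": ("02:01:02", "04:01:01"),
}
d1437 = dict(d1612, DPB1=("03:01:01", "04:01:01"))

print(round(mean_similarity(d1612, d1437), 3))
print(len(unique_exemplars([d1612, d1437, dict(d1612)])))
```

Because the mean is taken over several genes, the similarity is no longer restricted to 0, 0.5, and 1, which is why the multi-gene heatmap shows a continuous color range.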

The next task is to select informative exemplars and to build a predictive model. We have chosen not to use the stepwise procedure, because of the overfitting problem. The Ridge regression is not considered, because it tends to retain most exemplars and is not consistent with the prior assumption that there are only a small number of informative exemplars. Finally, we have not included the Elastic Net, because it has a performance comparable to that of LASSO. To focus our analytic exploration, we have chosen LASSO to build a T1D predictive model with estimated regression coefficients (Table 3). There are 26 informative exemplars selected by LASSO. Estimated coefficients are positive for exemplars derived from cases, and negative for exemplars derived from controls. For example, a subject who is highly similar to an exemplar such as D1612 is at relatively high risk for T1D. At the other extreme, a subject who is similar to N000982 would have relatively low risk.


Table 3. Estimated coefficients for informative exemplars for predicting T1D with class II HLA genes (HLA-DRB1, HLA-DRB345, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1) using LASSO

| ID | Exemplar | DRB1 | DRB345 | DQA1 | DQB1 | DPA1 | DPB1 | Coef. |
|---|---|---|---|---|---|---|---|---|
| 1 | D1612 | ∗04:01:01 / ∗04:01:01 | DRB4∗01:03:01 / DRB4∗01:03:01 | ∗03:01:01 / ∗03:01:01 | ∗03:02:01 / ∗03:02:01 | ∗01:03:01 / ∗01:03:01 | ∗02:01:02 / ∗04:01:01 | 3.05 |
| 2 | D1868 | ∗07:01:01 / ∗04:01:01 | DRB4∗01:01:01 / DRB4∗01:03:01 | ∗02:01 / ∗03:02 | ∗02:02:01 / ∗03:02:01 | ∗01:03:01 / ∗02:01:01 | ∗02:01:02 / ∗03:01:01 | 2.69 |
| 3 | D1214 | ∗04:01:01 / ∗13:02:01 | DRB3∗03:01:01 / DRB4∗01:03:01 | ∗03:01:01 / ∗01:02:01 | ∗03:02:01 / ∗06:09 | ∗01:03:01 / ∗01:03:01 | ∗02:01:02 / ∗03:01:01 | 2.54 |
| 4 | D1209 | ∗04:05:01 / ∗01:01:01 | DRB4∗01:01:01 / Nonamplified | ∗03:02 / ∗01:01:01 | ∗03:02:01 / ∗05:01:01 | ∗01:03:01 / ∗02:01:01 | ∗02:01:02 / ∗09:01:01 | 1.80 |
| 5 | D405 | ∗03:01:01 / ∗03:01:01 | DRB3∗02:02:01 / DRB3∗02:02:01 | ∗05:01:01 / ∗05:01:01 | ∗02:01:01 / ∗02:01:01 | ∗01:03:01 / ∗01:03:01 | ∗04:01:01 / ∗30:01 | 1.74 |
| 6 | D1344 | ∗03:01:01 / ∗03:01:01 | DRB3∗02:02:01 / DRB3∗02:02:01 | ∗05:01:01 / ∗05:01:01 | ∗02:01:01 / ∗02:01:01 | ∗01:03:01 / ∗01:03:01 | ∗02:01:02 / ∗03:01:01 | 1.52 |
| 7 | D1569 | ∗03:01:01 / ∗03:01:01 | DRB3∗01:01:02 / DRB3∗01:01:02 | ∗05:01:01 / ∗05:01:01 | ∗02:01:01 / ∗02:01:08 | ∗01:03:01 / ∗02:01:02 | ∗01:01:01 / ∗04:01:01 | 1.29 |
| 8 | D2102 | ∗04:05:01 / ∗09:01:02 | DRB4∗01:03:01 / DRB4∗01:03:01 | ∗03:02 / ∗03:02 | ∗03:02:01 / ∗03:03:02 | ∗01:03:01 / ∗02:01:01 | ∗04:01:01 / ∗13:01:01 | 1.04 |
| 9 | D1437 | ∗04:01:01 / ∗04:01:01 | DRB4∗01:03:01 / DRB4∗01:03:01 | ∗03:01:01 / ∗03:01:01 | ∗03:02:01 / ∗03:02:01 | ∗01:03:01 / ∗01:03:01 | ∗03:01:01 / ∗04:01:01 | 0.89 |
| 10 | D1596 | ∗03:01:01 / ∗03:01:01 | DRB3∗01:01:02 / DRB3∗01:01:02 | ∗05:01:01 / ∗05:01:01 | ∗02:01:01 / ∗02:01:08 | ∗01:03:01 / ∗01:03:01 | ∗04:01:01 / ∗04:01:01 | 0.83 |
| 11 | D704 | ∗04:01:01 / ∗09:01:02 | DRB4∗01:03:01 / DRB4∗01:03:01 | ∗03:01:01 / ∗03:02 | ∗03:02:01 / ∗03:03:02 | ∗01:03:01 / ∗02:02:02 | ∗04:01:01 / ∗05:01:01 | 0.77 |
| 12 | D459 | ∗04:01:01 / ∗04:01:01 | DRB4∗01:03:01 / DRB4∗01:03:01 | ∗03:01:01 / ∗03:01:01 | ∗03:02:01 / ∗03:02:01 | ∗01:03:01 / ∗01:03:01 | ∗04:01:01 / ∗04:01:01 | 0.75 |
| 13 | D624 | ∗03:01:01 / ∗03:01:01 | DRB3∗01:01:02 / DRB3∗01:01:02 | ∗05:01:01 / ∗05:01:01 | ∗02:01:01 / ∗02:01:01 | ∗01:03:01 / ∗02:01:04 | ∗02:01:02 / ∗13:01:01 | 0.68 |
| 14 | D2055 | ∗03:01:01 / ∗04:01:01 | DRB3∗01:01:02 / DRB4∗01:03:01 | ∗05:01:01 / ∗03:02 | ∗02:01:01 / ∗03:02:01 | ∗01:03:01 / ∗02:01:02 | ∗01:01:01 / ∗04:01:01 | 0.57 |
| 15 | D1499 | ∗04:01:01 / ∗01:01:01 | DRB4∗01:03:01 / Nonamplified | ∗03:02 / ∗01:01:01 | ∗03:02:01 / ∗05:01:01 | ∗01:03:01 / ∗02:02:01 | ∗04:02:01 / ∗19:01 | 0.51 |
| 16 | N005872 | ∗07:01:01 / ∗07:01:01 | DRB4∗01:01:01 / DRB4∗01:01:01 | ∗02:01 / ∗02:01 | ∗02:02:01 / ∗02:02:01 | ∗02:01:01 / ∗02:01:01 | ∗10:01 / ∗11:01:01 | –0.37 |
| 17 | N001991 | ∗15:01:01 / ∗13:01:01 | DRB3∗02:02:01 / DRB5∗01:01:01 | ∗01:02:01 / ∗01:03:01 | ∗06:02:01 / ∗06:03:01 | ∗01:03:01 / ∗02:02:01 | ∗02:01:02 / ∗19:01:01 | –0.52 |
| 18 | N002842 | ∗03:01:01 / ∗16:01:01 | DRB3∗02:02:01 / DRB5∗02:02 | ∗05:01:01 / ∗01:02:02 | ∗02:01:01 / ∗05:02:01 | ∗01:03:01 / ∗02:01:01 | ∗04:02:01 / ∗14:01 | –0.65 |
| 19 | N005182 | ∗07:01:01 / ∗15:01:01 | DRB4∗01:03:01 / DRB5∗01:01:01 | ∗02:01 / ∗01:02:01 | ∗03:03:02 / ∗06:02:01 | ∗01:03:01 / ∗02:01:01 | ∗04:01:01 / ∗13:01:01 | –0.70 |
| 20 | N003698 | ∗12:01:01 / ∗15:01:01 | DRB3∗02:02:01 / DRB5∗01:01:01 | ∗05:05:01 / ∗01:02:01 | ∗03:01:01 / ∗06:02:01 | ∗01:03:01 / ∗02:02:01 | ∗04:02:01 / ∗19:01:01 | –0.97 |
| 21 | N001707 | ∗04:04:01 / ∗14:54:01 | DRB3∗02:02:01 / DRB4∗01:03:01 | ∗03:01:01 / ∗01:01:01 | ∗03:02:01 / ∗05:03:01 | ∗01:03:01 / ∗01:04 | ∗04:01:01 / ∗15:01 | –1.41 |
| 22 | N002460 | ∗04:07:01 / ∗13:01:01 | DRB3∗01:01:02 / DRB4∗01:03:01 | ∗03:02 / ∗01:03:01 | ∗03:01:01 / ∗06:03:01 | ∗01:03:01 / ∗01:03:01 | ∗04:01:01 / ∗04:02:01 | –1.91 |
| 23 | N002709 | ∗07:01:01 / ∗15:01:01 | DRB4∗01:03:01 / DRB5∗01:01:01 | ∗02:01 / ∗01:02:01 | ∗02:02:01 / ∗06:02:01 | ∗01:03:01 / ∗01:03:01 | ∗04:01:01 / ∗04:02:01 | –2.96 |
| 24 | N004319 | ∗04:07:01 / ∗07:01:01 | DRB4∗01:03:01 / DRB4∗01:03:01 | ∗03:01:01 / ∗02:01 | ∗03:01:01 / ∗03:03:02 | ∗01:03:01 / ∗01:03:01 | ∗02:01:02 / ∗04:01:01 | –3.61 |
| 25 | N004385 | ∗14:02 / ∗15:01:01 | DRB3∗01:01:02 / DRB5∗01:01:01 | ∗05:03 / ∗01:02:01 | ∗03:01:01 / ∗06:02:01 | ∗01:03:01 / ∗01:03:01 | ∗03:01:01 / ∗06:01 | –3.95 |
| 26 | N000982 | ∗14:54:01 / ∗15:02:01 | DRB3∗02:02:01 / DRB5∗01:02 | ∗01:01:01 / ∗01:03:01 | ∗05:03:01 / ∗06:01:01 | ∗01:03:01 / ∗02:01:01 | ∗02:01:02 / ∗14:01 | –4.18 |


Figure 5. Evaluation of the T1D predictive model with class II HLA genes (HLA-DRB1, HLA-DRB345, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1) in the training set (top panels) and in the validating set (bottom panels). The box plots show distributions of risk scores in the training and validating sets. ROC curves are shown on the left-hand panels.

With estimated coefficients from the training set as weights, we now construct a risk score as a weighted sum:

$$R(\underset{\sim}{g}) = \sum_{k=1}^{26} \beta_k K(\underset{\sim}{g}, \underset{\sim}{g}_k^{*}), \tag{6}$$

where the summation is over the 26 selected exemplars and $\beta_k$ is the estimated coefficient listed in Table 3. To evaluate empirical distributions of risk scores, we draw boxplots for risk scores among controls and cases in the training set (Fig. 5). Clearly, risk scores among cases are generally greater than those among controls in the training set, and their difference is statistically significant (P-value < 0.001, not shown). Risk scores among controls have a symmetric distribution, while those among patients are somewhat skewed. With risk scores ranging from –5.52 to 4.1, computed sensitivity (y-axis of the ROC curve) and 1-specificity (x-axis) form an ROC curve with AUC = 0.92 in the training set.
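Equation (6) is a plain weighted sum, which the sketch below makes explicit. The three coefficients are taken from Table 3 (exemplars D1612, D1868, and N000982), but the similarity values are hypothetical, and a real score would sum over all 26 selected exemplars. The function name is likewise an assumption.

```python
def risk_score(similarities, coefficients):
    """R(g) = sum_k beta_k * K(g, g_k*), given precomputed similarities."""
    if len(similarities) != len(coefficients):
        raise ValueError("one similarity is required per exemplar coefficient")
    return sum(b * k for b, k in zip(coefficients, similarities))


# Coefficients from Table 3 for D1612, D1868, and N000982.
coefficients = [3.05, 2.69, -4.18]

# Hypothetical similarities of one new subject to those three exemplars.
similarities = [1.0, 0.5, 0.0]

print(round(risk_score(similarities, coefficients), 3))
```

Because the exemplars and weights are fixed after training, scoring a subject in the validating set only requires that subject's similarities to the selected exemplars.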

To validate the above predictive model, we computed risk scores for all samples in the validating set with fixed exemplars and associated weights in Equation (6). On the lower left panel, the boxplot shows the distribution of risk scores among controls and cases (Fig. 5). Clearly, empirical distributions of risk scores in the validating set are largely comparable to those in the training set. Furthermore, ROC analysis on the validating set reveals a comparable sensitivity-specificity curve with AUC = 0.89 (Fig. 5).


Figure 6. Empirical distribution of estimated penalty parameter (λ) from repeated cross-validation estimate (top panel) with the profile deviancefunction with varying penalty parameter.

Stability in Selecting Exemplars

It is known that the choice of the penalty parameter (λ) has a direct and profound impact on the selection of informative exemplars [Friedman et al., 2010; Tibshirani et al., 2012]. Conventional cross-validation is used to determine the penalty value that achieves the minimum deviance (or, alternatively, the best misclassification error or AUC). The top panel of Figure 6 shows an XY plot of deviance versus different penalty parameter values (on the log scale). The deviance achieves its minimum with the logarithm of the estimated penalty parameter somewhere between –6.0 and –5.5. The flatness of this function implies that the random partitioning used in cross-validation likely has a major influence on estimated penalty parameters. To assess its influence, we repeated the estimation of the penalty parameter 1,000 times, and recorded the corresponding values. The lower panel of Figure 6 shows the empirical distribution of the estimated penalty parameter. Interestingly, estimated penalty values in the training set are discrete, with a total of 15 unique values, probably because of discreteness in the similarity matrix.

Given that the value of the penalty parameter affects variable selection, it is of interest, first, whether the selected variables are stable with different penalty parameter values, and, second, whether the selection is stable even with

Genetic Epidemiology, Vol. 40, No. 4, 315–332, 2016 327

Page 14: Genetic RESEARCH ARTICLE Epidemiology Object-Oriented...challenges to HLA association analysis, if one aims to ex-amine disease associations with individual alleles/genotypes within

Table 4. Averaged Kappa values over 1,000 bootstrap samples, to measure concordance of selected informative exemplars acrossLASSO estimations with different penalty parameter values (on the logarithmic scale)

–Log(lamda) – 6.5

12

– 6.3

26

– 5.6

75

– 5.5

82

– 5.4

89

– 5.3

96

– 5.3

03

– 5.2

10

– 5.1

17

– 5.0

24

– 4.9

31

– 4.8

38

– 4.7

45

– 4.6

52

– 4.5

59

–6.512 0.87 0.54 0.50 0.46 0.43 0.39 0.37 0.34 0.31 0.29 0.27 0.25 0.23 0.21–6.326 0.04 0.61 0.56 0.53 0.49 0.45 0.42 0.39 0.36 0.34 0.31 0.29 0.27 0.25–5.675 0.06 0.06 0.92 0.86 0.80 0.75 0.70 0.65 0.60 0.56 0.52 0.48 0.45 0.42–5.582 0.07 0.06 0.03 0.92 0.86 0.80 0.75 0.70 0.65 0.60 0.56 0.52 0.48 0.45–5.489 0.07 0.07 0.05 0.04 0.93 0.86 0.80 0.75 0.70 0.65 0.60 0.56 0.52 0.49–5.396 0.07 0.07 0.05 0.05 0.03 0.93 0.86 0.80 0.75 0.70 0.65 0.60 0.56 0.53–5.303 0.07 0.07 0.06 0.06 0.05 0.03 0.93 0.86 0.80 0.75 0.70 0.65 0.61 0.56–5.210 0.07 0.07 0.06 0.06 0.06 0.05 0.04 0.93 0.86 0.80 0.74 0.69 0.65 0.61–5.117 0.07 0.07 0.07 0.06 0.06 0.05 0.05 0.04 0.93 0.86 0.80 0.75 0.70 0.65–5.024 0.07 0.07 0.07 0.07 0.07 0.06 0.06 0.05 0.04 0.93 0.86 0.80 0.75 0.70–4.931 0.06 0.07 0.07 0.07 0.07 0.07 0.06 0.06 0.05 0.04 0.93 0.86 0.81 0.75–4.838 0.06 0.06 0.07 0.07 0.07 0.07 0.07 0.06 0.06 0.05 0.04 0.93 0.87 0.81–4.745 0.06 0.06 0.07 0.07 0.07 0.07 0.07 0.07 0.06 0.06 0.05 0.04 0.93 0.87–4.652 0.06 0.06 0.07 0.07 0.07 0.07 0.07 0.07 0.07 0.06 0.06 0.05 0.04 0.93–4.559 0.06 0.06 0.07 0.07 0.07 0.07 0.07 0.07 0.07 0.06 0.06 0.06 0.05 0.04

Averaged Kappa values are shown above the diagonal and are highlighted yellow if the value is greater than 0.8. Standard deviations for estimated Kappa values are recordedbelow the diagonal line.

a fixed penalty parameter. To address this question, we per-formed a bootstrap analysis with these 15 different penaltyparameter values. On each of 1,000 bootstrap samples, weperformed LASSO with, respectively, fixed lambda values,and selected informative exemplars by the penalized likeli-hood. For qualitative comparisons, we choose to use Kappastatistics to measure overlaps of selected exemplars [Agresti,1988; Cohen, 1968]; the large Kappa value corresponds tothe greater overlap of selected exemplars by two LASSO es-timations with different penalty parameter values. AveragedKappa values, along with their standard deviations, are com-puted across all Bootstrap samples (Table 4, Kappa valuesin the upper triangle, and standard deviations in the lowertriangle). The results indicate that concordance among these15 penalty values is around 80% for adjacent penalty val-ues. Concordances degrade as the differences of penalty pa-rameter values increases, as expected. To gain further in-sight into quantitative concordance of estimated coefficientswith different penalty values, we compute averaged coeffi-cient estimates over all bootstraps, and plot a pairwise XYplot for mean coefficients with different penalty values (in-dicated in the diagonal boxes; Fig. 7). It is obvious that es-timated coefficient averages are highly correlated with eachother, if their penalty values of the pair are close. Otherwise,estimated coefficients, with different penalty values, coulddiffer.
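The concordance measure summarized in Table 4 can be sketched as follows. This is a minimal illustration that assumes an unweighted Cohen's Kappa computed on binary "selected / not selected" indicators for each exemplar; the selection vectors are hypothetical, and the LASSO fitting that would produce them is omitted.

```python
from statistics import mean, stdev
from typing import Sequence


def cohen_kappa(selected_a: Sequence[bool], selected_b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary selection indicators over the same exemplars."""
    n = len(selected_a)
    if n == 0 or n != len(selected_b):
        raise ValueError("both indicators must cover the same non-empty exemplar list")
    observed = sum(a == b for a, b in zip(selected_a, selected_b)) / n
    p_a = sum(selected_a) / n
    p_b = sum(selected_b) / n
    # Agreement expected by chance if the two selections were independent.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both indicators are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical selections from two penalty values on three bootstrap samples,
# each over the same six candidate exemplars (True = nonzero coefficient).
selections_lambda_1 = [
    [True, True, False, False, True, False],
    [True, True, False, False, False, False],
    [True, False, False, False, True, False],
]
selections_lambda_2 = [
    [True, True, False, False, False, False],
    [True, True, False, False, False, False],
    [True, False, True, False, True, False],
]

kappas = [cohen_kappa(a, b) for a, b in zip(selections_lambda_1, selections_lambda_2)]
mean_kappa, sd_kappa = mean(kappas), stdev(kappas)
print(f"mean Kappa = {mean_kappa:.2f}, SD = {sd_kappa:.2f}")
```

Repeating this for every pair of penalty values would fill one upper-triangle cell (the mean) and one lower-triangle cell (the standard deviation) per pair.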

As noted earlier, it appears that multiple predictive models on a single training set have comparable performance. The question is whether predictive models with different penalty parameter values have similar performance, even though the selected informative exemplars and associated coefficients vary. For this purpose, we use LASSO, with a fixed penalty parameter value, to select informative exemplars and to construct corresponding predictive models. On each predictive model, we perform ROC analysis on the training set as well as on the validating set (Fig. 8). The fifteen ROC curves, with their estimated AUC values, are largely comparable. AUC values in the training set vary from 0.91 to 0.93, while these values in the validation set are around 0.89.

In light of comparable performance and also high concordances of selected exemplars across different penalty parameter values, we chose the median penalty parameter value (log(λ) = –5.21) to evaluate the stability of individual coefficient estimates across 1,000 bootstrap samples. The estimated coefficients for all 499 exemplars across 1,000 bootstrap samples, after performing two-way clustering analysis, are shown in Fig. 9. All values are truncated at –2 and 2 for easy visualization. Interestingly, estimated coefficients, with a fixed penalty value, are remarkably consistent across all 1,000 bootstrap samples, despite some subtle variations.
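The bookkeeping behind such a stability check might look like the sketch below: draw bootstrap samples of subjects, refit at the fixed penalty value (the fit itself is omitted here), then summarize the coefficient matrix. The coefficient values, function names, and the selection-frequency summary are illustrative assumptions; only the truncation at –2 and 2 comes from the text.

```python
import random
from typing import List


def bootstrap_indices(n: int, rng: random.Random) -> List[int]:
    """Draw n subject indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def truncate(value: float, lower: float = -2.0, upper: float = 2.0) -> float:
    """Clip a coefficient to [lower, upper] for display purposes only."""
    return max(lower, min(upper, value))


def selection_frequency(coefficients: List[List[float]]) -> List[float]:
    """Fraction of bootstrap fits in which each exemplar has a nonzero coefficient."""
    n_fits = len(coefficients)
    n_exemplars = len(coefficients[0])
    return [sum(1 for fit in coefficients if fit[j] != 0.0) / n_fits
            for j in range(n_exemplars)]


# One bootstrap sample of 5 subjects; a LASSO fit on these rows would follow.
sample = bootstrap_indices(5, random.Random(0))

# Hypothetical coefficients: 4 bootstrap fits (rows) by 3 exemplars (columns).
fits = [
    [2.7, 0.0, -0.4],
    [2.1, 0.0, -0.6],
    [3.0, 0.1, -0.5],
    [1.8, 0.0, 0.0],
]
frequencies = selection_frequency(fits)
display_matrix = [[truncate(c) for c in fit] for fit in fits]
print(frequencies)
print(display_matrix[0])
```

Exemplars with selection frequencies near 0 or 1 and tightly clustered coefficients would correspond to the consistent columns described for Fig. 9.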

Discussion

In this paper, we describe a new approach, termed OOR, to explore disease associations or to build predictive models with highly polymorphic genes. To circumvent the challenge with complex genotypes, OOR transforms the usual genotype-specific (or allele-specific) regression approach to a regression problem with similarity measurements of subjects with exemplars. By using this new “metric,” one regresses disease phenotype on similarities of all subjects with exemplars to explore disease associations with exemplars. Through applying LASSO, OOR allows one to construct a parsimonious predictive model with informative exemplars.

Although OOR is closely connected with well-established kernel machine methods, there are differences worth noting. First, exemplars of OOR can be derived internally or externally. Second, OOR focuses on disease associations with exemplar-specific similarity measures and interprets association results accordingly, which is different from earlier applications of the kernel machine. Third, by LASSO, OOR builds a parsimonious predictive model with informative exemplars. Fourth, it is quite convenient to apply exemplar-based predictive models to large databases, provided that similarity with exemplars can be measured. Fifth, OOR analyzes genotype profiles, alleviating the need for haplotypes of multiple HLA genes.

Figure 7. Pairwise XY plots for averaged coefficient estimates over 1,000 bootstrap samples with one penalty value (X-axis) vs. another penalty value (Y-axis). Logarithmic values of penalty parameters are shown on the diagonal line.

To illustrate the construction of predictive models by OOR, we built a predictive model with all HLA genes (DRB1, DRB345, DQA1, DQB1, DPA1, and DPB1), followed by assessing its performance and the stability of the selected predictors with varying penalty parameter values. On the training set, OOR selected 26 informative exemplars as predictors, and the predictive model had an admirable sensitivity and specificity profile with an AUC of 0.92. Fixing exemplars and regression coefficients, we applied our predictive model to an independently selected validating set, and ROC analysis revealed comparable sensitivity and specificity to those in the training data set, with an AUC of 0.89. If further validated by external data sets, this predictive model could be useful for screening T1D in the general population. Note that our analysis has not adjusted for population stratification, which is a potential concern because there are two major ethnic populations in Sweden (Swedish and Finnish). Given the genetic closeness between Swedish and Finnish populations, this confounding effect, if any, is not likely to alter such a strong predictive association.

Figure 8. Results from ROC analyses on all predictive models with selected exemplars by LASSO, when penalty parameters are fixed to 1 of 15 unique values on the logarithmic scale. AUC values are computed in the training set (colored curve) and also in the validating set (dotted black curve).

An important property worth commenting on is that OOR results are complementary to allele- or genotype-specific results from conventional regression analyses. Recall that genotype-specific regression analysis of HLA genes is typically confined only to those common genotypes, such as HLA-DRB1∗03:01:01/03:01:01 or ∗04:01:01/04:01:01, for which numbers of observations are sufficiently large for meaningful statistical analysis. For genotypes with lower frequencies or uneven distributions between cases and controls, the analysis becomes more difficult or infeasible without special handling. In contrast, OOR assesses disease association with the similarity of subject’s genotypes with exemplars’ genotypes, bypassing the limitation noted above. For example, DRB1∗15:01:01/07:01:01 has frequencies of 4.4 and 0.0 in controls and cases, respectively (Table 1). When examining phenotype association with similarity to this genotype, many cases and controls have variable similarities, taking the values 0, 0.5, and 1, and yielding a robust result with Z = –10 (Table 2).
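To make the similarity values 0, 0.5, and 1 concrete, the sketch below assumes a simple identity-by-state measure for a single gene: the proportion of alleles shared between a subject's unphased genotype and the exemplar's genotype. The function name and the subject genotypes are hypothetical, and this measure is an assumption about, not a restatement of, the similarity measure defined earlier in the paper.

```python
from collections import Counter
from typing import Dict, Tuple

Genotype = Tuple[str, str]


def genotype_similarity(subject: Genotype, exemplar: Genotype) -> float:
    """Proportion of alleles shared between two unphased genotypes (0, 0.5, or 1)."""
    # Multiset intersection counts each shared allele copy once.
    shared = Counter(subject) & Counter(exemplar)
    return sum(shared.values()) / 2


exemplar: Genotype = ("DRB1*15:01:01", "DRB1*07:01:01")
subjects: Dict[str, Genotype] = {
    "s1": ("DRB1*07:01:01", "DRB1*15:01:01"),  # same genotype, allele order ignored
    "s2": ("DRB1*15:01:01", "DRB1*03:01:01"),  # one allele shared
    "s3": ("DRB1*03:01:01", "DRB1*04:01:01"),  # no allele shared
}
similarities = {name: genotype_similarity(g, exemplar) for name, g in subjects.items()}
print(similarities)
```

Even when no case carries the exemplar genotype itself, subjects who share one allele with it still contribute a similarity of 0.5, which is why the regression on similarity remains estimable.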

The idea of using similarity measures has numerous connections with methods developed and used in statistical genetics [Khoury et al., 1993; Vogel, 1997]. Although tracing all of these connections is not intended here, it suffices to note that classical and modern genetics aim to discover outcome-associated susceptibility genes through exploiting correlatedness of subjects within families, because shared disease genes, prior to their discovery, likely lead to increasing similarities among related individuals. In the early days of genetics, segregation and linkage methods were developed to characterize and discover genes through familial aggregations of cases [Khoury et al., 1993; Thompson, 1986a, 1986b]. More recently, some research groups have proposed to assess similarity of genetic markers in cases and controls, and to use similarity regression as a way to discover disease genes [Wessel & Schork, 2006]. Further, Tzeng et al. [2009] have described an association analysis idea of regressing “similarity of traits” on “similarity of genotypes”. Beyond modeling the response, similarity of genotypes has been used in assessing associations of variance components on haplotypes [Tzeng & Zhang, 2007]. Although bearing the same interest in similarities, OOR has a different analytic goal from discovering disease-associated SNPs.

Figure 9. Estimated LASSO coefficients across 1,000 bootstrap samples with the fixed penalty parameter value at log(λ) = −5.21. Color intensity corresponds to the magnitude of coefficients; green denotes positive values, while red denotes negative values.

With respect to the data mining literature in computer sciences, OOR has a close connection with a class of approaches known as the k-nearest neighbor (kNN) methods [Biau et al., 2012; Houle et al., 2010]. The key idea underlying kNN is that objects in a “close neighborhood,” defined by certain characteristics, tend to have similar outcomes. In essence, one can use kNN to make predictions without any modeling assumption. Such approaches are sometimes known as nonparametric predictive methods. However, kNN is less efficient, partly because it does not take advantage of the fact that “distant neighbors” may also inform the disease associations and can thus be combined to improve prediction accuracy. In comparison, OOR utilizes all neighboring information with multiple exemplars. At the conceptual level, OOR could be thought of as an extension to nearest neighbor regression estimates [Devroye et al., 1994].
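For contrast, a bare-bones kNN predictor based on similarities could be sketched as below. It uses only the k most similar training subjects and ignores everyone else, which is the inefficiency noted above. The similarity values, labels, and function name are hypothetical.

```python
from typing import Sequence


def knn_case_fraction(similarity_to_training: Sequence[float],
                      training_labels: Sequence[int],
                      k: int) -> float:
    """Fraction of cases (label 1) among the k most similar training subjects."""
    if len(similarity_to_training) != len(training_labels):
        raise ValueError("one similarity is needed per training subject")
    if not 1 <= k <= len(training_labels):
        raise ValueError("k must be between 1 and the number of training subjects")
    # Rank training subjects from most to least similar to the new subject.
    ranked = sorted(zip(similarity_to_training, training_labels),
                    key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Hypothetical similarities of one new subject to five training subjects.
similarity_to_training = [1.0, 0.5, 0.5, 0.0, 0.0]
training_labels = [1, 1, 0, 0, 0]  # 1 = case, 0 = control

predicted = knn_case_fraction(similarity_to_training, training_labels, k=3)
print(f"kNN case fraction (k=3) = {predicted:.2f}")
```

Here the two training subjects with similarity 0 never influence the prediction, whereas a regression on similarities to several exemplars can assign informative (possibly negative) weights to dissimilar exemplars as well.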

Another closely related method is the Grade of Membership technique, known as GoM [Kovtun et al., 2004; Manton et al., 1986; Pomarol-Clotet et al., 2010]. Conceptually, GoM aims to model the joint distribution of outcome and covariates through introducing a set of latent membership variables that create clusters of subjects. Under sensible distributional assumption on these latent membership variables, one can derive a marginalized likelihood after integrating over all GoM latent membership variables for estimation and inference. At the end of the GoM analysis, one can interpret parameters as properties associated with individuals, rather than specific marginal interpretation of individual covariates. In this regard, OOR, just like GoM, utilizes similarity information to achieve the analytic goal, but differs on modeling assumptions and related implementations. The primary advantage of OOR is that it requires no distributional assumptions on latent membership and makes inferences purely based on empirical evidence.

OOR has potential for two major extensions. First, OOR is constructed for binary disease phenotype under the logistic regression model (3). Naturally, by extending the logistic regression to the generalized linear model [McCullagh & Nelder, 1989], one can generalize OOR to studies with other types of phenotypes, such as continuous, categorical, or censored phenotypes, with appropriate choice of the link function. The second extension is to consider other covariate types, such as text strings (e.g., from web searches), electronic signals or images. Further, covariates can be high dimensional data, where the number of dimensions is far greater than the sample size [Zhao & Bolouri, March 12, 2016 (online)]. For these diverse applications, the key is to choose a context-dependent similarity measure to define “similarity metrics” between subjects with respect to their covariate profiles.
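One hedged way to write the first extension is given below. It assumes that model (3) has the usual logistic form with exemplar similarities as covariates; the symbols $g$, $s(\cdot,\cdot)$, $z_k$, $\alpha$, $\beta_k$, $q$, and $\ell$ are notation introduced here for illustration and may differ from the paper's own.

```latex
% Generalized linear form: a link function g relates the mean phenotype
% to a linear combination of similarities with q exemplars z_1, ..., z_q.
\[
  g\bigl(\mathrm{E}[Y_i \mid x_i]\bigr)
    = \alpha + \sum_{k=1}^{q} \beta_k \, s(x_i, z_k)
\]
% Binary phenotype (logistic regression): g(\mu) = \log\{\mu / (1 - \mu)\}.
% Continuous phenotype (linear regression): g(\mu) = \mu.
% LASSO estimation with penalty parameter \lambda and log-likelihood \ell:
\[
  (\hat{\alpha}, \hat{\beta})
    = \arg\max_{\alpha, \beta}
      \Bigl\{ \ell(\alpha, \beta) - \lambda \sum_{k=1}^{q} \lvert \beta_k \rvert \Bigr\}
\]
```

Under this notation, exemplars with $\hat{\beta}_k \neq 0$ play the role of the informative exemplars, and the second extension amounts to changing only the similarity function $s$.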

Acknowledgements

The authors thank Dr. Chad He for discussions on variable selection techniques, and thank Dr. Neil Risch for bringing GoM to their attention. The authors also thank two anonymous reviewers whose comments have substantially improved the presentation of the manuscript. This work is supported in part by European Fund for Research on Diabetes (EFSD), the Swedish Child Diabetes Foundation (Barndiabetesfonden), the National Institutes of Health (DK63861, DK26190), the Swedish Research Council including a Linne grant to Lund University Diabetes Centre, an equipment grant from the KA Wallenberg Foundation, the Skane County Council for Research and Development as well as the Swedish Association of Local Authorities and Regions (SKL), and also by the institutional developmental fund at Fred Hutchinson Cancer Research Center (LPZ).

Conflict of Interest

Authors declare that there is no conflict of interest with this work.


References

Agresti A. 1988. A model for agreement between ratings on an ordinal scale. Biometrics 44:539–548.

Bell IR, Koithan M. 2006. Models for the study of whole systems. Integr Cancer Ther 5(4):293–307.

Biau G, Devroye L, Dujmovic V, Krzyzak A. 2012. An affine invariant k-nearest neighbor regression estimate. J Multivar Anal 112:24–34.

Bishop DT, Williamson JA. 1990. The power of identity-by-state methods for linkage analysis. Am J Hum Genet 46:254–265.

Cardinal-Fernandez P, Nin N, Ruiz-Cabello J, Lorente JA. 2014. Systems medicine: a new approach to clinical practice. Arch Bronconeumol 50(10):444–451.

Cohen J. 1968. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 70(4):213–220.

Cox DD, O’Sullivan F. 1989. Generalized nonparametric regression via penalized likelihood. AMS:1–31. Technical Report No. 170, Department of Statistics, University of Washington.

Cristianini N, Shawe-Taylor J. 2000. An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge/New York: Cambridge University Press.

Cullen M, Noble J, Erlich H, Thorpe K, Beck S, Klitz W, Trowsdale J, Carrington M. 1997. Characterization of recombination in the HLA class II region. Am J Hum Genet 60(2):397–407.

Delli AJ, Lindblad B, Carlsson A, Forsander G, Ivarsson SA, Ludvigsson J, Marcus C, Lernmark A, Better Diabetes Diagnosis Study Group. 2010. Type 1 diabetes patients born to immigrants to Sweden increase their native diabetes risk and differ from Swedish patients in HLA types and islet autoantibodies. Pediatr Diabetes 11(8):513–520.

Delli AJ, Vaziri-Sani F, Lindblad B, Elding-Larsson H, Carlsson A, Forsander G, Ivarsson SA, Ludvigsson J, Kockum I, Marcus C and others. 2012. Zinc transporter 8 autoantibodies and their association with SLC30A8 and HLA-DQ genes differ between immigrant and Swedish patients with newly diagnosed type 1 diabetes in the Better Diabetes Diagnosis study. Diabetes 61(10):2556–2564.

Devroye L, Gyorfi L, Krzyzak A, Lugosi G. 1994. On the strong universal consistency of nearest-neighbor regression function estimates. Ann Stat 22(3):1371–1385.

Fang FC, Casadevall A. 2011. Reductionistic and holistic science. Infect Immun 79(4):1401–1404.

Firth D. 1993. Bias reduction of maximum likelihood estimates. Biometrika 80(1):27–38.

Friedman J, Hastie T, Tibshirani R. 2010. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22.

Hastie T, Tibshirani R, Friedman JH. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, Second Edition. Corrected 7th printing edition. New York: Springer, p. 1. Online resource (xxii, 745 pp.).

Hastie T, Tibshirani R, Wainwright M. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: CRC Press, Taylor & Francis Group.

Houle ME, Kriegel HP, Kroger P, Schubert E, Zimek A. 2010. Can shared-neighbor distances defeat the curse of dimensionality? Sci Stat Database Manag 6187:482–500.

Jeffreys AJ, May CA. 2004. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat Genet 36(2):151–156.

Khoury MJ, Beaty TH, Cohen BH. 1993. Fundamentals of genetic epidemiology. New York: Oxford University Press.

Kimeldorf G, Wahba G. 1971. Some results on Tchebycheffian spline functions. J Math Anal Appl 33(1):82–95.

Kovtun M, Akushevich I, Manton KG, Tolley HD. 2004. Grade of Membership Analysis: One Possible Approach to Foundations. Cornell University Library. Ithaca, New York.

Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. 2008. A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 82(2):386–397.

Manton KG, Stallard E, Woodbury MA, Yashin AI. 1986. Applications of the grade of membership technique to event history analysis: Extensions to multivariate unobserved heterogeneity. Math Model 7(9–12):1375–1391.

Marsh S. 2000. The HLA FactsBook. San Diego: Academic Press.

McCullagh P, Nelder JA. 1989. Generalized Linear Models. New York: Chapman and Hall.

Meinshausen N, Buhlmann P. 2010. Stability selection. J R Stat Soc Ser B Stat Methodol 72:417–473.

Minnier J, Yuan M, Liu JS, Cai T. 2015. Risk classification with an adaptive naive Bayes Kernel machine model. J Am Stat Assoc 110(509):393–404.

Noble JA. 2015. Immunogenetics of type 1 diabetes: a comprehensive review. J Autoimmun 64:101–112.

Pomarol-Clotet E, Salvador R, Murray G, Tandon S, McKenna PJ. 2010. Are there valid subtypes of schizophrenia? A grade of membership analysis. Psychopathology 43(1):53–62.

Smith SE, Slaughter BD, Unruh JR. 2014. Imaging methodologies for systems biology. Cell Adh Migr 8(5):468–477.

Sun W, Wang JH, Fang YX. 2013. Consistent selection of tuning parameters via variable selection stability. J Mach Learn Res 14:3419–3440.

Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. 2011. The importance of phase information for human genomics. Nat Rev Genet 12(3):215–223.

Thompson EA. 1986a. Genetic epidemiology: a review of the statistical basis. Stat Med 5(4):291–302.

Thompson EA. 1986b. Pedigree Analysis in Human Genetics. Baltimore, MD: The Johns Hopkins University Press.

Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ. 2012. Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B Stat Methodol 74(2):245–266.

Tzeng JY, Zhang D. 2007. Haplotype-based association analysis via variance-components score test. Am J Hum Genet 81(5):927–938.

Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M. 2009. Gene-trait similarity regression for multimarker-based association analysis. Biometrics 65(3):822–832.

Vogel FMAG. 1997. Human Genetics, Third Edition. New York: Springer-Verlag.

Wessel J, Schork NJ. 2006. Generalized genomic distance-based regression methodology for multilocus association analysis. Am J Hum Genet 79(5):792–806.

Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. 2010. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 86(6):929–942.

Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89(1):82–93.

Yang H, Chen X, Wong WH. 2011. Completely phased genome sequencing through chromosome sorting. Proc Natl Acad Sci USA 108(1):12–17.

Zhao LP, Alshiekh S, Carlsson A, Elding-Larsson H, Forsander G, Ivarsson SA, Ludvigsson J, Kockum I, Marcus C, Persson M and others. 2016. Next generation sequencing reveals that HLA-DRB3, -DRB4 and -DRB5 may be associated with islet autoantibodies and risk for childhood type 1 diabetes. Diabetes 65(3):710–718.

Zhao LP, Bolouri H. Object-Oriented Regression for Building Predictive Models with High Dimensional Omics Data from Translational Studies. J Biomed Inform, March 12, 2016 (online).

Zhu J, Hastie T. 2005. Kernel logistic regression and the import vector machine. J Comput Graph Stat 14(1):185–205.
