Research Article

Detecting Genetic Interactions for Quantitative Traits Using $m$-Spacing Entropy Measure

Jaeyong Yee,1 Min-Seok Kwon,2 Seohoon Jin,3 Taesung Park,4 and Mira Park5

1 Department of Physiology and Biophysics, Eulji University, Daejeon, Republic of Korea
2 Department of Bioinformatics, Seoul National University, Seoul, Republic of Korea
3 Department of Informational Statistics, Korea University, Jochiwon, Republic of Korea
4 Department of Statistics, Seoul National University, Seoul, Republic of Korea
5 Department of Preventive Medicine, Eulji University, Daejeon, Republic of Korea

Correspondence should be addressed to Mira Park; mira@eulji.ac.kr

Received 14 November 2014; Revised 4 February 2015; Accepted 8 March 2015

Academic Editor: Xiang-Yang Lou

Hindawi Publishing Corporation, BioMed Research International, Volume 2015, Article ID 523641, 10 pages, http://dx.doi.org/10.1155/2015/523641

Copyright © 2015 Jaeyong Yee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A number of statistical methods for detecting gene-gene interactions have been developed in genetic association studies with binary traits. However, many phenotype measures are intrinsically quantitative, and categorizing continuous traits may not always be straightforward and meaningful. Association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly without categorization. Information gain based on entropy measure has previously been successful in identifying genetic associations with binary traits. We extend the usefulness of this information gain by proposing a nonparametric evaluation method of conditional entropy of a quantitative phenotype associated with a given genotype. Hence, the information gain can be obtained for any phenotype distribution. Because no functional form, such as Gaussian, is assumed for the entire distribution of a trait or a given genotype, this method is expected to be robust enough to be applied to any phenotypic association data. Here, we show its use to successfully identify the main effect, as well as the genetic interactions, associated with a quantitative trait.

1. Introduction

Recent advances in high-throughput genotyping techniques have produced massive volumes of genetic data. Although it is common to analyze single SNP effects extensively, such approaches cannot adequately explain the intricate genetic contributions to complex diseases such as hypertension, diabetes, and certain psychiatric disorders. Consequently, a large share of the genetic component remains unexplained. Gene-gene interaction analysis may be one way to address this missing heritability problem [1].

For case-control studies, which formulate the measures for a binary trait, a number of statistical methods for detecting gene-gene interactions have been proposed. One of the most popular is multifactor dimensionality reduction (MDR) [2], which converts a high-dimensional contingency table to a one-dimensional model without raising the issue of sparse cells. Several variants of MDR have recently been developed [3–8], while another approach was developed [9–11] from information theory [12, 13]. More recently, an entropy-based approach that utilizes the relative gain of information, as well as its standardized measure, has also been proposed [14].

However, for quantitative traits such as blood pressure, body mass index, and patient survival time, relatively few attempts have been made to analyze genetic interactions. Because many phenotype measures are intrinsically quantitative, and categorizing a continuous trait may not always be straightforward and meaningful, the association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly, without categorization. Introducing a new statistic is one way to tackle the problem [15]. Extensions of the MDR algorithm to continuous traits, such as the generalized MDR (GMDR) and the model-based MDR (MB-MDR), have been proposed [3, 6]. More recently, a quantitative MDR (QMDR) was proposed, which replaces the balanced accuracy metric with a $t$-test statistic [16]. However, these MDR-based approaches may oversimplify the original data to some degree through classification of phenotypes. An entropy-based approach may well be an alternative. Entropy is commonly used in information theory to measure the uncertainty of random variables [12, 13], and information gain, or mutual information, has been shown to be useful for representing association strengths [17–19]. Although the usefulness of such information-theoretic methods is well known, statistical methods based on this approach for analyzing gene-gene interactions of quantitative traits are rare, with the exception of one specific case [20]. That application, however, may be limited by its assumption of a normal distribution.

Here, we extend the usefulness of the information concept to quantitative traits by considering nonparametric estimates based on sample-spacing, or $m$-spacing [22–25], for the conditional entropy of a quantitative phenotype given a genotype. The challenge, therefore, is to couple a nonparametric entropy estimator to correct and stable information gains. We thus developed the information gain standardized (IGS) approach and applied it to datasets composed of several genotypes and a quantitative trait. This approach can be considered an extension of previous work on categorical traits [14] to quantitative phenotypes. Unlike other methods, such as the variants of MDR, the proposed method does not attempt to classify quantitative phenotypes in any way but handles them directly, which provides the intrinsic advantage of removing the chance of misclassification. While a previous entropy-based method for analyzing quantitative traits assumed the distribution to be normal [20], our method does not need to specify the distribution to estimate the association, so regular and irregular distributions alike should cause no difficulty. Although this is also an advantage of GMDR and QMDR, we propose a method that combines the advantageous characteristics of both kinds of methods. We also performed extensive simulation studies to compare the power of the proposed method with that of QMDR and GMDR, demonstrating its advantage in detection power.

In the following sections, after a brief review of nonparametric entropy estimation, we describe a new method for modeling genetic interactions. In Materials and Methods, we show how a nonparametric entropy estimator can be modified so that it couples successfully with genetic datasets. In Results and Discussions, the information gain standardized (IGS) approach is evaluated on both simulated and real datasets.

2. Materials and Methods

2.1. Estimation of the Entropy for a Continuous Variable. If $X$ is a random vector with probability density function $f(x)$, its differential entropy is defined by

$$H(f) = -\int f(x) \ln f(x)\, dx. \tag{1}$$

A well-known approach to estimating this quantity is to use plug-in estimates. In this approach, $f(x)$ is first estimated using a standard density estimation method, such as a histogram or kernel density estimator, and the entropy is then computed. Integral, resubstitution, splitting-data, and cross-validation estimates are among the usual plug-in estimates [22]. Another approach is based on sample-spacing. Let $\{X_k\}$ be a set of independent and identically distributed real-valued random variables with corresponding order statistics $\{X_{n,k}\}$. Here, $n$ represents the total number of measured samples. For arbitrary integers $i$ and $m$ satisfying $1 \le i < i + m \le n$, a spacing of order $m$, or $m$-spacing, is defined as $X_{n,i+m} - X_{n,i}$. A density estimate based on sample-spacing $m$ is then constructed as

$$f_n(x) = \frac{m}{n}\,\frac{1}{X_{n,im} - X_{n,(i-1)m}}, \tag{2}$$

where $x \in [X_{n,(i-1)m}, X_{n,im})$ [14]. This density estimate is consistent if, as $n \to \infty$, $m \to \infty$ and $m/n \to 0$ [22]. Several variations of an entropy estimator with minor differences have been proposed, all based on the above density estimate [23, 24]. Among them, the following was reported to approximate the entropy with lowered variance [25]:

$$H_{m,n} = \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right). \tag{3}$$

The asymptotic bias of this estimator can be corrected by adding terms that include the digamma function [22, 28]:

$$H_{m,n} = \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m. \tag{4}$$

As $m$ increases, the correction terms become negligible and the two estimators coincide. Our evaluation of the entropy $H(P)$ of the phenotype of a quantitative trait is based on this estimator.
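As an illustration only, the estimators (3) and (4) can be written in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `digamma_int` and `m_spacing_entropy` are hypothetical, the digamma function $\Gamma'(m)/\Gamma(m)$ is evaluated through the identity $\psi(m) = -\gamma + \sum_{j=1}^{m-1} 1/j$ that holds for positive integers $m$, and the data are assumed to contain no ties so that every spacing is positive.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(m):
    """Digamma function Gamma'(m)/Gamma(m) for a positive integer m."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))


def m_spacing_entropy(values, m, bias_correct=True):
    """m-spacing entropy estimate in nats.

    Returns estimator (3) when bias_correct is False and (4) otherwise.
    Assumes no ties: a zero spacing would make the logarithm undefined.
    """
    x = sorted(values)  # order statistics X_{n,1} <= ... <= X_{n,n}
    n = len(x)
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    # Sum of ln((n/m) * (X_{n,k+m} - X_{n,k})) for k = 1..n-m (0-based here).
    total = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
    estimate = total / (n - m)
    if bias_correct:
        estimate += math.log(m) - digamma_int(m)
    return estimate


# Example: 400 draws from N(0, 1) with m = 20, i.e. sqrt(n).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
print("estimator (3):", m_spacing_entropy(sample, 20, bias_correct=False))
print("estimator (4):", m_spacing_entropy(sample, 20))
```

The correction term is computed separately from the sum so that the two estimators can be compared directly for the same sample and the same $m$.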

2.2. Modification of the $m$-Spacing Based Entropy Estimator. The estimator in (4) has both $n$ and $m$ as parameters. In genetic association studies, a sample size $n$ of several hundred is common. However, when the conditional entropy is estimated, a minor allele may correspond to a much smaller number of samples. Moreover, the choice of the sample-spacing $m$ affects the resulting entropy estimate. An entropy estimation scheme is therefore required that is independent of the number of samples and does not require choosing a particular value of the sample-spacing. To illustrate this requirement, an ensemble of 3000 sets of random deviates from $N(0, 1^2)$ was generated for each data point in Figure 1, where the mean and standard deviation of the estimates are plotted for each ensemble. In the left panel of Figure 1, $m$ is fixed to 10 and 20 while $n$ is varied. The analytic formula for the entropy of a normal distribution is as follows [20], where $e$ is Euler's number:

$$H = \ln\!\left(\sigma\sqrt{2\pi e}\right). \tag{5}$$
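For reference, formula (5) is easy to evaluate numerically. The short sketch below uses a hypothetical helper name, `normal_entropy`, and evaluates it for the three standard deviations that appear in the figures of this section (0.7, 1.0, and 1.4).

```python
import math


def normal_entropy(sigma):
    """Differential entropy, in nats, of a normal distribution with standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.log(sigma * math.sqrt(2.0 * math.pi * math.e))


for sigma in (0.7, 1.0, 1.4):
    print(f"sigma = {sigma}: H = {normal_entropy(sigma):.4f}")
```

Because $\sigma$ enters only through $\ln \sigma$, doubling $\sigma$ raises the entropy by exactly $\ln 2$, whatever the starting value.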

Figure 1: The $n$-dependence (a) and $m$-dependence (b) of the entropy estimator $H_{m,n}$. An ensemble of 3000 sets of random samples from $N(0, 1^2)$ was constructed and used for each point in the plot. In (a), the sample-spacing $m$ was fixed (at 10 and 20) while the number of samples $n$ was varied to evaluate the $n$-dependence of the entropy estimator. In (b), $n$ was fixed (at 400) and $m$ was varied to show the $m$-dependence. Analytically obtained true values are represented by the arrowed horizontal lines. [Axes: (a) $H$ versus $n$, from $10^1$ to $10^5$; (b) $H$ versus $m$, from 0 to 400.]

The value calculated from (5) is marked on the vertical axis by a horizontal arrow, with the corresponding $\sigma$ above it. The $n$-dependence of the estimator is obvious in this plot: the estimate approaches the analytic value as $n$ increases, with $\sqrt{n}$-consistency as expected [24]. In Figure 1(b), $n$ is fixed to 400 while $m$ is varied. In this plot, the estimated entropy again changes in value throughout the possible range of $m$, and the estimate is always smaller than the analytically calculated value. Therefore, assigning a particular value to $m$, such as the typical choice $\sqrt{n}$ [25], would not be appropriate in this sampling range. Because of these $n$- and $m$-dependences, we modify the entropy estimator in (4) as follows:

$$H_{\langle m \rangle, n} = \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right). \tag{6}$$

In this modification, the entropy estimator is averaged over the possible $m$ values for each $n$, which is denoted by $\langle m \rangle$. This estimator is used to plot the entropy versus the number of samples in Figure 2. Over a wide range of $n$, this entropy estimator yields very stable values, in contrast to Figure 1(a). The increase in the extremely small $n$ range should be within the tolerable error in a genome-wide association application, because the contribution of such a minor allele to the conditional entropy is suppressed by the weighting factor of the marginal probability, which is proportional to the number of corresponding samples. Analytically obtained entropy values for $N(0, \sigma^2)$ with three different values of $\sigma$ are marked on the vertical axis on the right-hand side. Regardless of the value of $\sigma$, the difference $\Delta$ between the analytically obtained value and the value given by the estimator stays essentially the same. Considering that an association study measures the difference between the entropy and the corresponding conditional entropy, stability is a more critical issue than the absolute value of the estimates. Therefore, compensating for this $\Delta$ is not necessary as long as it is stable. Furthermore, the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, we have set up an entropy estimator that should satisfy practical $n$-independence without the need to find a proper sample-spacing.
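The averaging in (6) can be sketched as follows. This is an illustrative reading of the formula rather than the authors' code: the name `averaged_m_spacing_entropy` is hypothetical, the digamma term is accumulated with the recurrence $\psi(m+1) = \psi(m) + 1/m$, tie-free data are assumed, and the plain double loop costs $O(n^2)$ logarithms per estimate.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant; psi(1) = -EULER_GAMMA


def averaged_m_spacing_entropy(values):
    """Bias-corrected m-spacing estimate averaged over m = 1..n-1, as in (6)."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("at least two samples are needed")
    total = 0.0
    psi = -EULER_GAMMA  # digamma(1)
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) + math.log(m) - psi
        psi += 1.0 / m  # digamma(m + 1) = digamma(m) + 1/m
    return total / (n - 1)


# Example: estimates for two sample sizes drawn from N(0, 1),
# next to the analytic value ln(sqrt(2*pi*e)) from (5).
rng = random.Random(0)
analytic = math.log(math.sqrt(2.0 * math.pi * math.e))
for n in (50, 200):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, averaged_m_spacing_entropy(sample), analytic)
```

No sample-spacing has to be chosen by the caller, which is the point of the modification; the price is the quadratic cost of the double sum.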

2.3. Evaluation of a Conditional Entropy. Now let $G$ be a categorical variable assigned to each sample measurement $X_k$. $G$ may be a genotype given by a measured SNP or a combination of SNPs, while $X_k$ represents the measured value of a phenotype. For detecting the main effect of a single SNP, $G$ consists of three categories: $G = 0$, $G = 1$, and $G = 2$. For detecting the interaction between $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$, $G$ consists of 9 categories such that $G = 0 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 0)$, $G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, $\ldots$, and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of higher order interactions can be performed in the same way by expanding the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset of measurements $X_{n_g,k}$ along with an individual sample-spacing $m_g$. Extending (6) with the above argument readily produces the estimators for the entropy of a phenotype and the conditional entropy. Here, $d$ denotes the order of a gene-gene interaction.

$$
\begin{aligned}
H(P) &= \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right), \\
H(P \mid G) &= \sum_{g=0}^{3^d-1} \frac{n_g}{n} \left( \frac{1}{n_g-1} \sum_{m_g=1}^{n_g-1} \left( \frac{1}{n_g-m_g} \sum_{k=1}^{n_g-m_g} \ln\!\left(\frac{n_g}{m_g}\left(X_{n_g,k+m_g} - X_{n_g,k}\right)\right) - \frac{\Gamma'(m_g)}{\Gamma(m_g)} + \ln m_g \right) \right) \\
&= \sum_{g=0}^{3^d-1} \frac{n_g}{n}\, H(P \mid G = g).
\end{aligned}
\tag{7}
$$
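The genotype coding described above and the weighted sum in (7) can be sketched in Python as follows. The function names are hypothetical, and one detail is an assumption of this sketch rather than something stated in the text: a genotype category with fewer than two samples admits no spacing, so it is skipped and contributes nothing to the sum.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def averaged_entropy(values):
    """Estimator (6): bias-corrected m-spacing estimate averaged over m."""
    x = sorted(values)
    n = len(x)
    total, psi = 0.0, -EULER_GAMMA  # psi starts at digamma(1)
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) + math.log(m) - psi
        psi += 1.0 / m
    return total / (n - 1)


def genotype_code(snps):
    """Base-3 code of a SNP combination, e.g. (1, 0) -> 3 and (2, 2) -> 8."""
    code = 0
    for s in snps:
        code = 3 * code + s
    return code


def conditional_entropy(phenotypes, genotypes):
    """H(P | G) as the n_g/n-weighted sum of per-genotype estimates in (7)."""
    n = len(phenotypes)
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    h = 0.0
    for ys in groups.values():
        if len(ys) >= 2:  # assumption: categories with < 2 samples are skipped
            h += len(ys) / n * averaged_entropy(ys)
    return h


# Example: two SNPs, 180 samples, phenotype mean shifted for one genotype pair.
rng = random.Random(0)
pairs = [(i % 3, (i // 3) % 3) for i in range(180)]
codes = [genotype_code(p) for p in pairs]
y = [rng.gauss(1.0 if p == (1, 1) else 0.0, 1.0) for p in pairs]
print("H(P)   =", averaged_entropy(y))
print("H(P|G) =", conditional_entropy(y, codes))
```

Coding the combination as a single integer keeps the main-effect case ($d = 1$) and the interaction cases ($d \ge 2$) on the same code path.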

2.4. Standardized Measure of an Association Strength. Since differential entropy values are scale-dependent, when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the difference is $\ln c$:

$$H_{cX_i} = \ln c + H_{X_i}. \tag{8}$$

For example, if the phenotype is height, it may be measured in meters or centimeters, in which case the scale factor is 100. Nevertheless, the association strength should be the same. Also note that a negative value is perfectly legitimate for a differential entropy. The information gain (IG), in the form defined with discrete entropies [14], satisfies scale independence, because the $\ln c$ terms of $H(P)$ and $H(P \mid G)$ cancel, and it correctly represents an association strength without being affected by negative values. Therefore, it retains its usefulness as a measure of association strength:

$$\mathrm{IG} = H(P) - H(P \mid G). \tag{9}$$

IG is readily estimated with the estimators in (7). The standardized IG (IGS) is set up with the mean and standard deviation of the IGs obtained from repeated shuffling of the phenotypes while all genotypes remain fixed [14].
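The scale argument can be checked numerically. The sketch below uses hypothetical function names and made-up height data; it computes IG for the same phenotype expressed in meters and in centimeters. It assumes every genotype category holds at least two samples, so that the weights $n_g/n$ of the estimated components sum to one and the $\ln c$ shifts of $H(P)$ and $H(P \mid G)$ cancel exactly.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def averaged_entropy(values):
    """Estimator (6): bias-corrected m-spacing estimate averaged over m."""
    x = sorted(values)
    n = len(x)
    total, psi = 0.0, -EULER_GAMMA
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) + math.log(m) - psi
        psi += 1.0 / m
    return total / (n - 1)


def information_gain(phenotypes, genotypes):
    """IG = H(P) - H(P | G), with H(P | G) weighted by n_g / n as in (7)."""
    n = len(phenotypes)
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    # Assumption of this sketch: every genotype category has >= 2 samples.
    if any(len(ys) < 2 for ys in groups.values()):
        raise ValueError("each genotype category needs at least two samples")
    h_cond = sum(len(ys) / n * averaged_entropy(ys) for ys in groups.values())
    return averaged_entropy(phenotypes) - h_cond


# Made-up example: heights in meters for 90 samples and one SNP.
rng = random.Random(0)
genotypes = [i % 3 for i in range(90)]
height_m = [1.60 + 0.05 * g + rng.gauss(0.0, 0.06) for g in genotypes]
height_cm = [100.0 * h for h in height_m]

print("H(P), meters     :", averaged_entropy(height_m))
print("H(P), centimeters:", averaged_entropy(height_cm))
print("IG, meters       :", information_gain(height_m, genotypes))
print("IG, centimeters  :", information_gain(height_cm, genotypes))
```

The two entropies differ by $\ln 100$, while the two IG values agree up to floating-point rounding.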

Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of samplings from $N(0, \sigma^2)$, with $\sigma$ = 0.7, 1.0, and 1.4. While varying $n$, the number of samples, the estimated entropy values were averaged over all the possible sample-spacing values $m$; $\langle m \rangle$ denotes this averaging, which should not depend on weighting because of the virtually identical standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the range of practical sample numbers. Moreover, the almost flat line connecting the symbols shifts up or down following exactly the change of the true value indicated by the horizontal arrows; the offset is $\Delta = 0.183$ for all three values of $\sigma$. The rise in the extremely small $n$ range should be within the tolerable error of any specific application, because the contribution to the conditional entropy by such a case would be suppressed by weighting based on the marginal probability, which is proportional to $n$. [Axes: $H_{\langle m \rangle, n}$ versus $n$, from $10^1$ to $10^5$.]

Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$\overline{\mathrm{IG}}_p = \frac{\sum_{i=1}^{n} \mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n} \left(\mathrm{IG}^{(1)}_i - \overline{\mathrm{IG}}_p\right)^2}{n-1}}, \tag{10}$$

where $n$ is the number of permuted datasets. Now IGS is defined as follows:

$$\mathrm{IGS} = \frac{\mathrm{IG} - \overline{\mathrm{IG}}_p}{S_p}. \tag{11}$$
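A minimal permutation sketch of (10) and (11) is given below. The function `igs_scores` and its arguments are hypothetical, and two points are assumptions of the sketch: the maximum IG of a permuted dataset is taken over exactly the candidate genotype vectors passed in, and genotype categories with fewer than two samples are skipped. Only the Python standard library is used.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def averaged_entropy(values):
    """Estimator (6): bias-corrected m-spacing estimate averaged over m."""
    x = sorted(values)
    n = len(x)
    total, psi = 0.0, -EULER_GAMMA
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) + math.log(m) - psi
        psi += 1.0 / m
    return total / (n - 1)


def conditional_entropy(phenotypes, genotypes):
    """H(P | G) from (7); categories with fewer than 2 samples are skipped."""
    n = len(phenotypes)
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    return sum(len(ys) / n * averaged_entropy(ys)
               for ys in groups.values() if len(ys) >= 2)


def igs_scores(phenotypes, candidates, n_perm=100, rng=None):
    """IGS of each candidate genotype vector, following (10) and (11)."""
    rng = rng or random.Random()
    h_p = averaged_entropy(phenotypes)  # H(P) is unchanged by shuffling
    observed = [h_p - conditional_entropy(phenotypes, g) for g in candidates]
    shuffled = list(phenotypes)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute phenotypes, keep genotypes fixed
        max_null.append(max(h_p - conditional_entropy(shuffled, g)
                            for g in candidates))
    mean_p = statistics.mean(max_null)
    s_p = statistics.stdev(max_null)  # n - 1 in the denominator, as in (10)
    return [(ig - mean_p) / s_p for ig in observed]


# Example: three SNPs on 54 samples; only the first one shifts the phenotype.
rng = random.Random(1)
snp0 = [i % 3 for i in range(54)]
snp1 = [(i // 3) % 3 for i in range(54)]
snp2 = [(i // 9) % 3 for i in range(54)]
y = [rng.gauss(3.0 * g, 1.0) for g in snp0]
print(igs_scores(y, [snp0, snp1, snp2], n_perm=30, rng=random.Random(2)))
```

Because $H(P)$ depends only on the set of phenotype values, it is computed once and reused for every permutation; only $H(P \mid G)$ has to be re-estimated.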

3. Results and Discussions

3.1. Demonstration of the $m$-Spacing Method. To show the plausibility of the proposed $m$-spacing method, a representative result is shown in Figure 3, using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400, with 20 SNPs. In panel (a), the association strengths obtained by $m$-spacing and GMDR are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles are for the 2nd order interactions. Both methods identify the same single SNP pair as having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others; again, both methods agree on this result. $P$ values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with a plus sign. Because no 3rd order interaction was simulated into the dataset, only the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison was made using the results from $m$-spacing and QMDR. Both comparisons show consistent results between the proposed $m$-spacing method and GMDR or QMDR. Note that IGS, instead of IG, was used. The distribution of the IG values from a dataset shifts upward with increased order of interaction: the more conditions applied, the less entropy may be left to find. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare association strengths between genotypes from different orders of interaction. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from the different orders of interaction.

Figure 3: Comparison of the QMDR, GMDR, and $m$-spacing methods. Association strengths obtained by GMDR versus $m$-spacing (a) and by QMDR versus $m$-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair. [Axes: horizontal, IGS (by $m$-spacing); vertical, BA (by GMDR) in (a) and $t$-statistic (by QMDR) in (b). Boxed permutation $P$ values: $P < 0.001$ and $P = 0.003$.]

3.2. Generation of the Simulation Data. To examine the performance of the $m$-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distributions were considered: a normal distribution, a gamma distribution, and a mixture of the two. With a single causal pair designed, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution defined by the penetrance values tabulated for the possible combinations of genotypes, as follows:

$$y \mid (\mathrm{SNP}_1 = i, \mathrm{SNP}_2 = j) \sim N(f_{ij}, \sigma^2). \tag{12}$$

Here, $f_{ij}$ represents the penetrance value, which is tabulated in [21] for each possible pair of genotypes $(i, j)$ in every simulated model. The 70 penetrance models cover 14 combinations of two minor allele frequencies (MAFs) and seven heritability values. Specifically, we considered MAFs of 0.2 and 0.4 and heritabilities of 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated events was set such that the average $y$ of the high-risk group should be larger than or equal to the overall average, and the average $y$ of the low-risk group should be less than the overall average. In Figure 4, the 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400; the high- and low-risk groups have the same number of samples, and both have a variance of 1.0.

Figure 4: Demonstration of the simulation scheme. Phenotype distributions are plotted for the genotypes of two interacting SNPs, denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. On the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines mark the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group. [Panel values, as genotype: penetrance, number of samples. (0, 0): 0.486, 171; (0, 1): 0.960, 78; (0, 2): 0.538, 11; (1, 0): 0.947, 80; (1, 1): 0.004, 30; (1, 2): 0.811, 6; (2, 0): 0.640, 16; (2, 1): 0.606, 8; (2, 2): 0.909, 0.]

For gamma distributions, phenotype values follow the rule below:

$$y \mid (\mathrm{SNP}_1 = i, \mathrm{SNP}_2 = j) \sim \Gamma(k, \theta). \tag{13}$$

The shape and scale parameters $k$ and $\theta$ were determined from $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$. Penetrance models were classified by 7 heritability values (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4), resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, 3 × 70 × 9 = 1890 different conditions were set up, with 100 simulated data files generated for each condition.

Figure 5: Comparison of the hit ratios, or detection probabilities, among the proposed $m$-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values were simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot. [Axes in each panel: hit ratio versus heritability.]
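The sampling rules (12) and (13) can be sketched as follows. This is a simplified illustration, not the simulation code used in the study. The penetrance table holds the values printed in the panels of Figure 4, genotypes are drawn independently for the two SNPs under an assumed Hardy-Weinberg equilibrium, a single $\sigma$ is used for all genotypes, and the high- and low-risk grouping constraint is not implemented. All function names are hypothetical.

```python
import random

# Penetrance values f_ij as printed in the panels of Figure 4 (MAF = 0.2).
PENETRANCE = {
    (0, 0): 0.486, (0, 1): 0.960, (0, 2): 0.538,
    (1, 0): 0.947, (1, 1): 0.004, (1, 2): 0.811,
    (2, 0): 0.640, (2, 1): 0.606, (2, 2): 0.909,
}


def genotype_probs(maf):
    """Probabilities of genotype codes 0, 1, 2 (assumes Hardy-Weinberg equilibrium)."""
    return [(1.0 - maf) ** 2, 2.0 * maf * (1.0 - maf), maf ** 2]


def gamma_params(f, sigma):
    """Shape k and scale theta that solve f = k * theta and sigma = k * theta**2."""
    theta = sigma / f
    k = f / theta
    return k, theta


def simulate(n, maf, sigma, dist="normal", rng=None):
    """Return n tuples (snp1, snp2, y) drawn under rule (12) or rule (13)."""
    rng = rng or random.Random()
    probs = genotype_probs(maf)
    data = []
    for _ in range(n):
        i, j = rng.choices((0, 1, 2), weights=probs, k=2)
        f = PENETRANCE[(i, j)]
        if dist == "normal":
            y = rng.gauss(f, sigma)  # rule (12): N(f_ij, sigma^2)
        elif dist == "gamma":
            k, theta = gamma_params(f, sigma)
            y = rng.gammavariate(k, theta)  # rule (13): Gamma(k, theta)
        else:
            raise ValueError("dist must be 'normal' or 'gamma'")
        data.append((i, j, y))
    return data


# Example: one dataset of 400 samples per distribution type.
normal_data = simulate(400, 0.2, 1.0, "normal", random.Random(0))
gamma_data = simulate(400, 0.2, 1.0, "gamma", random.Random(0))
print(normal_data[:3])
print(gamma_data[:3])
```

With a very small penetrance such as 0.004, the gamma shape parameter becomes tiny and the draws can underflow to exactly zero, so gamma data generated this way may contain ties.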

3.3. Comparison of the Detection Probability and Type I Error. The "hit ratio," or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used; all of them had a single causal pair to identify. In addition to our proposed $m$-spacing method, QMDR and GMDR were applied for comparison. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. The seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of $m$-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, $m$-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of $m$-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch in the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially a dichotomization of the observed values of the quantitative phenotype. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed $m$-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric character is again confirmed in Figure 5(c), which shows that $m$-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, $m$-spacing was the most robust, performing consistently within the range of conditions of the simulation.

To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. Each null dataset therefore contains 20 SNPs, none of whose pairs is expected to be associated with the trait. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
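The permutation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the test statistic `between_group_spread` is a cheap stand-in used only to keep the example fast (it is not the IG), and the numbers of datasets and permutations are much smaller than those used in the paper.

```python
import random

def permutation_p_value(statistic, phenotype, genotype, n_perm, rng):
    """Fraction of phenotype shuffles whose statistic is >= the observed one."""
    observed = statistic(phenotype, genotype)
    shuffled = list(phenotype)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # genotypes stay fixed, phenotypes are permuted
        if statistic(shuffled, genotype) >= observed:
            count += 1
    return count / n_perm

def type_i_error_rate(p_values, alpha=0.05):
    """Ratio of P values smaller than or equal to alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

def between_group_spread(phenotype, genotype):
    """Stand-in statistic (NOT the IG): size-weighted squared deviation of group means."""
    overall = sum(phenotype) / len(phenotype)
    sums, counts = {}, {}
    for y, g in zip(phenotype, genotype):
        sums[g] = sums.get(g, 0.0) + y
        counts[g] = counts.get(g, 0) + 1
    return sum(counts[g] * (sums[g] / counts[g] - overall) ** 2 for g in sums)

rng = random.Random(0)
p_values = []
for _ in range(100):  # 100 null datasets: phenotype is independent of genotype
    genotype = [(rng.random() < 0.2) + (rng.random() < 0.2) for _ in range(60)]
    phenotype = [rng.gauss(0.0, 1.0) for _ in range(60)]
    p_values.append(
        permutation_p_value(between_group_spread, phenotype, genotype, 100, rng)
    )
print(type_i_error_rate(p_values))
```

Any statistic with the same call signature, such as an IG estimate, could be passed in place of the stand-in.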

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the $m$-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327,872, spanning 22 chromosomes. The “height” phenotype was close to a normal distribution, so the $m$-spacing method may not gain an advantage from the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the $m$-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], and two more matching SNPs, rs11989122 and rs1344672, were found when we repeated the analysis with the same tool as in [26] but using the newly imputed dataset. $P$ values were estimated by permuting the phenotype values to construct null distributions. Permutations were iterated 100,000 and 10,000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. Table 3 gives the result for the 2nd order gene-gene interaction. The top selected pair (rs6499786, rs1788421) had the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.

Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.

Type I error rate (%)      Normal   Gamma   Mixed
MAF            0.2         5.0      5.0     5.1
               0.4         5.1      5.0     5.1
Heritability   0.01        5.3      5.0     4.8
               0.02        4.9      5.4     5.2
               0.05        5.3      4.3     5.3
               0.1         5.0      5.3     5.1
               0.2         5.0      5.3     5.1
               0.3         4.8      4.9     4.8
               0.4         5.1      4.7     5.3
Overall                    5.0      5.0     5.1

4. Conclusion

In this paper, we present a modified $m$-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, whereas the original $m$-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distribution while varying the penetrance functions and adjusting the heritability. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. The proposed $m$-spacing method outperformed the others in the lower heritability range regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever performs better in that region. This should lead to versatile applicability of our nonparametric method to quantitative traits with various characteristics. We applied this method to identify the main effects and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution and a higher heritability, our method may be said to have performed successfully without an advantage over other methods. More extensive study of quantitative traits with various characteristics is needed to further demonstrate the expected robustness of our modified $m$-spacing method.


Table 2: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

Main effect
rs ID        Chromosome   IGS       P value              Previous report
rs11989122   8            11.3892   $1 \times 10^{-5}$   *$5.89 \times 10^{-6}$
rs7316119    12           8.7531    $1 \times 10^{-5}$   -
rs936634     18           8.6125    $2 \times 10^{-5}$   -
rs7632381    3            7.8235    $1 \times 10^{-5}$   -
rs2079795    17           7.6542    $1 \times 10^{-5}$   $2.92 \times 10^{-6}$, Ref [26]
rs1344672    3            7.6177    $1 \times 10^{-5}$   *$5.21 \times 10^{-7}$
rs2523865    6            7.6044    $4 \times 10^{-5}$   -
rs3790199    20           7.5362    $2 \times 10^{-5}$   -
rs6440003    3            7.5231    $1 \times 10^{-5}$   $3.87 \times 10^{-7}$, Ref [27]
rs17628655   19           7.5117    $6 \times 10^{-5}$   -

*Identified using the same method as [26] but with imputed data, which is the same dataset we analyzed.

Table 3: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

2nd order interaction
rs ID       Chromosome   rs ID       Chromosome   IGS      P value
rs6499786   16           rs1788421   21           4.6197   $1 \times 10^{-4}$
rs2529232   7            rs1788421   21           4.3869   $1 \times 10^{-4}$
rs2241704   19           rs1788421   21           4.3855   $1 \times 10^{-4}$

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).

References

[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.

[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.

[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.

[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.

[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.

[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.

[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.

[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.

[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.

[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.

[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.


[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.

[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.

[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.

[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.

[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[18] N. N. Schraudolph, “Gradient-based manipulation of nonparametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.

[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.

[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.

[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.

[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by $m$-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.


proposed [3, 6]. More recently, a quantitative MDR (QMDR) was proposed that replaces the balanced accuracy metric with a $t$-test statistic [16]. However, these MDR-based approaches may oversimplify the original data to some degree through classification of phenotypes. An entropy-based approach may well be an alternative model. Entropy is commonly used in information theory to measure the uncertainty of random variables [12, 13], and information gain, or mutual information, has been shown to be useful for representing association strengths [17–19]. Although the usefulness of such information-theoretic methods is well known, statistical methods based on this approach for analyzing gene-gene interactions of quantitative traits are rarely found, with the exception of one specific case [20]. However, the application of that method may be limited by its assumption of a normal distribution.

Here, we extend the usefulness of the information concept to quantitative traits by considering nonparametric estimates based on sample-spacing, or $m$-spacing [22–25], for the conditional entropy of a quantitative phenotype given a genotype. The challenge, therefore, is to couple a nonparametric entropy estimator to correct and stable information gains. We thus developed the information gain standardized (IGS) approach and applied it to datasets composed of several genotypes and a quantitative trait. This approach can be considered an extension of previous work on categorical traits [14] to quantitative phenotypes. The proposed method, however, does not attempt in any way to classify quantitative phenotypes as other methods such as the variants of MDR do, but instead handles them directly, providing the intrinsic advantage of removing the chance of misclassification. While a previous entropy-based method for analyzing quantitative traits assumed the shape of the trait distribution to be normal [20], our method does not need to specify the distribution to estimate the association. Any regular or irregular distribution would not cause any difficulties. Although this is also an advantage of GMDR or QMDR, we propose a method that takes the advantageous characteristics from both of those methods. We also performed extensive simulation studies to compare the power of the proposed method with that of QMDR and GMDR, demonstrating its advantage in detection power.

In the following sections, after a brief review of nonparametric entropy estimation, we describe a new method for modeling genetic interactions. In Materials and Methods, a nonparametric entropy estimator is shown to couple successfully with genetic datasets through our modification. In Results and Discussions, this information gain standardized (IGS) approach is evaluated on both simulated and real datasets.

2. Materials and Methods

2.1. Estimation of the Entropy for a Continuous Variable. If $X$ is a random vector with probability density function $f(x)$, its differential entropy is defined by

$$H(f) = -\int f(x) \ln f(x)\, dx. \tag{1}$$

A well-known approach for estimating this quantity is to use plug-in estimates. In this approach, $f(x)$ is first estimated using a standard density estimation method, such as a histogram or kernel density estimator, and the entropy is then computed from the estimated density. Integral, resubstitution, splitting-data, and cross-validation estimates are among the usual plug-in estimates [22]. Another approach is based on sample-spacing. Let $\{X_k\}$ be a set of independent and identically distributed real-valued random variables with corresponding order statistics $\{X_{n,k}\}$. Here, $n$ represents the total number of measured samples. For arbitrary integers $i$ and $m$ satisfying $1 \le i < i + m \le n$, a spacing of order $m$, or $m$-spacing, is defined as $X_{n,i+m} - X_{n,i}$. A density estimate based on sample-spacing $m$ is then constructed as

$$f_n(x) = \frac{m}{n} \frac{1}{X_{n,im} - X_{n,(i-1)m}}, \tag{2}$$

where $x \in [X_{n,(i-1)m}, X_{n,im})$ [14]. This density estimate is consistent if, as $n \to \infty$, $m \to \infty$ and $m/n \to 0$ [22]. Several variations of an entropy estimator with minor differences have been proposed, all based on the above density estimate [23, 24]. Among them, the following was reported to approximate the entropy with lowered variance [25]:

$$H_{m,n} = \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right). \tag{3}$$

The asymptotic bias of this estimator can be corrected by adding terms that include the digamma function [22, 28]:

$$H_{m,n} = \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m. \tag{4}$$

As $m$ increases, the correction terms become negligible and the two estimators coincide. Our evaluation of the entropy of a phenotype, $H(P)$, of a quantitative trait is based on this estimator.
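Estimators (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the digamma function $\Gamma'(m)/\Gamma(m)$ is evaluated only for integer $m$ through the identity $\psi(m) = -\gamma + \sum_{j=1}^{m-1} 1/j$, and the data are assumed to be continuous with no tied values (a tie would produce $\ln 0$).

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def digamma_int(m):
    """psi(m) = Gamma'(m) / Gamma(m) for a positive integer m."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))

def m_spacing_entropy(samples, m, bias_correction=True):
    """Estimator (3), or estimator (4) when bias_correction is True."""
    x = sorted(samples)  # order statistics X_{n,1} <= ... <= X_{n,n}
    n = len(x)
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    # Mean of ln((n/m) * (X_{n,k+m} - X_{n,k})) over k = 1..n-m (0-based here).
    total = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
    h = total / (n - m)
    if bias_correction:
        h += math.log(m) - digamma_int(m)  # the two correction terms of (4)
    return h

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
m = round(math.sqrt(len(data)))  # the typical choice m = sqrt(n)
print(m_spacing_entropy(data, m), math.log(math.sqrt(2 * math.pi * math.e)))
```

The second printed value is the analytic entropy of a standard normal distribution, against which the estimate can be compared.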

2.2. Modification of the $m$-Spacing Based Entropy Estimator. The estimator in (4) has both $n$ and $m$ as parameters. In genetic association studies, a sample size $n$ of several hundred is common. However, when the conditional entropy is estimated, a genotype carrying the minor allele may correspond to a much smaller number of samples. Moreover, the choice of the sample-spacing $m$ affects the resulting entropy estimate. Therefore, an entropy estimation scheme is required that is independent of the number of samples and does not need a particular choice of sample-spacing. To illustrate this requirement, an ensemble of 3000 sets of random deviates from $N(0, 1^2)$ was generated for each data point in Figure 1, where the mean and standard deviation of the estimates are plotted for each ensemble. In the left panel of Figure 1, $m$ is fixed to 10 and 20 while $n$ is varied. The analytic formula for the entropy of a normal distribution is as follows [20], where $e$ is Euler's number:

$$H = \ln\left(\sigma\sqrt{2\pi e}\right). \tag{5}$$

Figure 1: The $n$-dependence (a) and $m$-dependence (b) of the entropy estimator $H_{m,n}$. An ensemble of 3000 sets of random samples from $N(0, 1^2)$ was constructed and used for each point in the plot. In (a), the sample-spacing $m$ was fixed (at 10 and 20) while the number of samples $n$ was varied to evaluate the $n$-dependence of the entropy estimator. In (b), $n$ was fixed (at 400) and $m$ was varied to show the $m$-dependence. Analytically obtained true values ($\sigma = 1.0$) are represented by the arrowed horizontal lines.

The calculated value of (5) is marked on the vertical axis by a horizontal arrow with the corresponding $\sigma$ above it. The obvious $n$-dependence of the estimator can be seen in this plot, where the estimate approaches the analytic value as $n$ increases, with $\sqrt{n}$-consistency as expected [24]. In Figure 1(b), $n$ is fixed to 400 while $m$ is varied. In this plot, the estimated entropy again changes in value throughout the possible range of $m$, and it is always smaller than the analytically calculated value. Therefore, assigning a particular value to $m$, such as the typical choice of $\sqrt{n}$ [25], would not be appropriate in this sampling range. Because of these $n$- and $m$-dependences, we modify the entropy estimator in (4) as follows:

$$H_{\langle m \rangle, n} = \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right). \tag{6}$$

In this modification, the entropy estimator is averaged over the possible $m$ values for each $n$, which is denoted by $\langle m \rangle$. This estimator is used to plot the entropy versus the number of samples in Figure 2. Over a wide range of $n$, this entropy estimator yields very stable values, in contrast to Figure 1(a). The increase in the extremely small $n$ range should be within the tolerable error in an application to genome-wide association, as the contribution to the conditional entropy from such a minor-allele genotype would be suppressed by the weighting factor of the marginal probability, which is proportional to the number of corresponding samples. Analytically obtained entropy values for $N(0, \sigma^2)$ with three different values of $\sigma$ are marked on the vertical axis on the right-hand side. Regardless of the value of $\sigma$, the difference $\Delta$ between the analytically obtained value and the value given by the estimator stays essentially the same. Considering that an association study measures the difference between the entropy and the corresponding conditional entropy, stability should be a more critical issue than the absolute value of the estimates. Therefore, compensation for this $\Delta$ is not necessary as long as it is stable. Furthermore, the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, we have set up an entropy estimator that satisfies practical $n$-independence without the need to find a proper sample-spacing.
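The averaged estimator (6) can be sketched as below, again as an illustration under assumptions rather than the authors' implementation. The function names are hypothetical, the digamma value is updated with the recurrence $\psi(m+1) = \psi(m) + 1/m$ starting from $\psi(1) = -\gamma$, and tie-free continuous data are assumed. The loop prints the offset between the analytic value (5) and the estimate for the three values of $\sigma$ used in Figure 2, so that its stability across $\sigma$ can be inspected.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def averaged_m_spacing_entropy(samples):
    """Estimator (6): bias-corrected m-spacing estimate averaged over m = 1..n-1."""
    x = sorted(samples)
    n = len(x)
    if n < 2:
        raise ValueError("at least two samples are required")
    total = 0.0
    psi = -EULER_GAMMA  # digamma(1)
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - psi + math.log(m)
        psi += 1.0 / m  # digamma(m + 1) = digamma(m) + 1/m
    return total / (n - 1)

def normal_entropy(sigma):
    """Analytic entropy (5) of a normal distribution."""
    return math.log(sigma * math.sqrt(2 * math.pi * math.e))

rng = random.Random(1)
for sigma in (0.7, 1.0, 1.4):
    x = [rng.gauss(0.0, sigma) for _ in range(400)]
    print(sigma, normal_entropy(sigma) - averaged_m_spacing_entropy(x))
```

The double loop costs on the order of $n^2$ logarithm evaluations, which is acceptable for a few hundred samples per genotype but would need a faster formulation for very large $n$.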

2.3. Evaluation of a Conditional Entropy. Now let $G$ be a categorical variable assigned to each sample measurement $X_k$. $G$ may be a genotype given by a measured SNP or a combination of SNPs, while $X_k$ represents the measured value of a phenotype. For detecting the main effect of a single SNP, $G$ consists of three categories: $G = 0$, $G = 1$, and $G = 2$. For detecting the interaction between $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$, $G$ consists of 9 categories such that $G = 0 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 0)$, $G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, $\ldots$, and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of higher order interactions can be performed in the same way by expanding the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset of measurements $X_{n_g,k}$ along with an individual sample-spacing $m_g$. Extending (6) with the above argument readily produces the estimators for the entropy of a phenotype and for the conditional entropy. Here, $d$ denotes the order of a gene-gene interaction.

$$\begin{aligned}
H(P) &= \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right), \\
H(P \mid G) &= \sum_{g=0}^{3^d-1} \frac{n_g}{n} \left( \frac{1}{n_g-1} \sum_{m_g=1}^{n_g-1} \left( \frac{1}{n_g-m_g} \sum_{k=1}^{n_g-m_g} \ln\left(\frac{n_g}{m_g}\left(X_{n_g,k+m_g} - X_{n_g,k}\right)\right) - \frac{\Gamma'(m_g)}{\Gamma(m_g)} + \ln m_g \right) \right) \\
&= \sum_{g=0}^{3^d-1} \frac{n_g}{n} H(P \mid G = g).
\end{aligned} \tag{7}$$
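The estimators in (7) can be sketched as follows. This is an illustrative reading of the equations, not the authors' implementation. The function names, the encoding $G = 3\,\mathrm{SNP}_i + \mathrm{SNP}_j$, and the toy data are assumptions; a genotype group with fewer than two samples is given an entropy of 0 here, which is a choice of this sketch since (7) is undefined in that case; tie-free continuous phenotypes are assumed.

```python
import math
import random
from collections import defaultdict

EULER_GAMMA = 0.5772156649015329

def entropy_avg_m(values):
    """Estimator (6); returns 0.0 for fewer than 2 values (assumption of this sketch)."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        return 0.0
    total, psi = 0.0, -EULER_GAMMA  # psi starts at digamma(1)
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - psi + math.log(m)
        psi += 1.0 / m  # digamma(m + 1) = digamma(m) + 1/m
    return total / (n - 1)

def conditional_entropy(phenotype, genotype):
    """H(P | G) of (7): per-genotype entropies weighted by n_g / n."""
    groups = defaultdict(list)
    for y, g in zip(phenotype, genotype):
        groups[g].append(y)
    n = len(phenotype)
    return sum(len(ys) / n * entropy_avg_m(ys) for ys in groups.values())

def pair_genotype(snp_i, snp_j):
    """Encode two SNPs (each 0, 1, or 2) as one category G in 0..8."""
    return [3 * a + b for a, b in zip(snp_i, snp_j)]

def information_gain(phenotype, genotype):
    """H(P) - H(P | G)."""
    return entropy_avg_m(phenotype) - conditional_entropy(phenotype, genotype)

def draw_snp(n, maf, rng):
    """Minor-allele counts under Hardy-Weinberg proportions (assumed)."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

rng = random.Random(2)
n = 300
snp_a, snp_b = draw_snp(n, 0.4, rng), draw_snp(n, 0.4, rng)
y = [rng.gauss(2.0 * a, 1.0) for a in snp_a]  # toy trait: only snp_a matters
ig_a = information_gain(y, snp_a)
ig_b = information_gain(y, snp_b)
ig_ab = information_gain(y, pair_genotype(snp_a, snp_b))
print(ig_a, ig_b, ig_ab)
```

In this toy setting the gain for `snp_a` should clearly exceed that for the unrelated `snp_b`, which is the behavior an association measure needs.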

2.4. Standardized Measure of an Association Strength. Since differential entropy values are scale-dependent, when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the difference is $\ln c$:

$$H_{cX_i} = \ln c + H_{X_i}. \tag{8}$$

For example, if the phenotype is height, it may be measured in meters or centimeters, in which case the scale factor is 100. Nevertheless, the association strength should be the same in either unit. Also note that a negative value is perfectly legitimate for a differential entropy. The information gain (IG), in the form defined with discrete entropies [14], should satisfy scale independence while correctly representing an association strength without being affected by negative values. Therefore, it should retain its usefulness as a measure of association strength:

$$\mathrm{IG} = H(P) - H(P \mid G). \tag{9}$$
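The scale independence of IG follows directly from (7) and (8), as the short derivation below shows. It assumes $c > 0$ and uses only the fact that the genotype weights sum to one, $\sum_g n_g / n = 1$; the notation $\mathrm{IG}(cP)$ for the gain computed from the rescaled phenotype is introduced here for illustration.

```latex
\begin{align*}
H(cP \mid G) &= \sum_{g=0}^{3^d-1} \frac{n_g}{n}\, H(cP \mid G = g)
              = \sum_{g=0}^{3^d-1} \frac{n_g}{n}\,\bigl(\ln c + H(P \mid G = g)\bigr)
              = \ln c + H(P \mid G), \\
\mathrm{IG}(cP) &= H(cP) - H(cP \mid G)
              = \bigl(\ln c + H(P)\bigr) - \bigl(\ln c + H(P \mid G)\bigr)
              = \mathrm{IG}(P).
\end{align*}
```

For the height example, $c = 100$ shifts both entropies by $\ln 100 \approx 4.605$, and the shift cancels in the difference.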

IG is readily estimated with the estimators in (7). The standardized IG (IGS) is set up with the mean and standard deviation of the IGs obtained from repeated shuffling of the phenotypes while all genotypes remain fixed [14].

Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of sampling from $N(0, \sigma^2)$, with $\sigma = 0.7$, 1.0, and 1.4. While varying $n$, the number of samples, the estimated entropy values were averaged over all the possible values of the sample-spacing $m$. $\langle m \rangle$ denotes this averaging, which should not depend on weighting because of the virtually identical standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the range of practical sample sizes. Moreover, the almost flat line connecting the symbols shifts up or down following exactly the change of the true value indicated by the horizontal arrows; the offset is $\Delta = 0.183$ for all three values of $\sigma$. The rise in the extremely small $n$ range should be within the tolerable error of any specific application, because the contribution to the conditional entropy from such a case would be suppressed by weighting based on the marginal probability, which is proportional to $n$.

Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$\overline{\mathrm{IG}}_p = \frac{\sum_{i=1}^{n} \mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n} \left(\mathrm{IG}^{(1)}_i - \overline{\mathrm{IG}}_p\right)^2}{n-1}}, \tag{10}$$

where $n$ is the number of permuted datasets. Now IGS is defined as follows:

$$\mathrm{IGS} = \frac{\mathrm{IG} - \overline{\mathrm{IG}}_p}{S_p}. \tag{11}$$
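Equations (10) and (11) can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names and toy data are hypothetical, the maximum IG of each permuted dataset is taken over the candidate genotype vectors passed in (an assumption about how the maximum is formed), and only a small number of permutations is used to keep the example fast. Because shuffling the phenotypes leaves $H(P)$ unchanged, it is computed once.

```python
import math
import random
from collections import defaultdict

EULER_GAMMA = 0.5772156649015329

def entropy_avg_m(values):
    """Estimator (6); returns 0.0 for fewer than 2 values (assumption of this sketch)."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        return 0.0
    total, psi = 0.0, -EULER_GAMMA
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - psi + math.log(m)
        psi += 1.0 / m
    return total / (n - 1)

def conditional_entropy(phenotype, genotype):
    """H(P | G) of (7)."""
    groups = defaultdict(list)
    for y, g in zip(phenotype, genotype):
        groups[g].append(y)
    n = len(phenotype)
    return sum(len(ys) / n * entropy_avg_m(ys) for ys in groups.values())

def igs_scores(phenotype, genotypes, n_perm=20, seed=0):
    """IGS of (11) for every genotype vector in `genotypes`."""
    rng = random.Random(seed)
    h_p = entropy_avg_m(phenotype)  # invariant under phenotype shuffling
    observed = [h_p - conditional_entropy(phenotype, g) for g in genotypes]
    shuffled = list(phenotype)
    max_ig = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # genotypes remain fixed
        max_ig.append(max(h_p - conditional_entropy(shuffled, g) for g in genotypes))
    mean_p = sum(max_ig) / n_perm
    s_p = math.sqrt(sum((v - mean_p) ** 2 for v in max_ig) / (n_perm - 1))
    return [(ig - mean_p) / s_p for ig in observed]

rng = random.Random(3)
n = 150
snps = [[(rng.random() < 0.4) + (rng.random() < 0.4) for _ in range(n)] for _ in range(4)]
y = [rng.gauss(2.0 * a, 1.0) for a in snps[0]]  # toy trait: only SNP 0 matters
scores = igs_scores(y, snps)
print(scores)
```

Standardizing against the permutation maximum puts gains from different interaction orders on a common scale, which is the purpose IGS serves in the comparisons that follow.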

3. Results and Discussions

3.1. Demonstration of the $m$-Spacing Method. To show the plausibility of the proposed $m$-spacing method, a representative result is shown in Figure 3, using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400, with 20 SNPs. In panel (a), the association strengths obtained by $m$-spacing and GMDR are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles represent the 2nd order interactions. Both methods identify the same single SNP pair having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others; again, both methods agree on this result. $P$ values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with plus signs. Because no 3rd order interaction was simulated in the dataset, only the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison is made between $m$-spacing and QMDR. Both comparisons show consistent results between the proposed $m$-spacing method and GMDR or QMDR. Note that IGS, instead of IG, was used. The distribution of the IG values from a dataset shifts toward higher values as the order of interaction increases: the more conditions applied, the less entropy may be left to find. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare association strengths between genotypes from different orders of interaction. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from the different orders of interaction.

Figure 3: Comparison of the QMDR, GMDR, and $m$-spacing methods. Association strengths obtained by GMDR (balanced accuracy, BA) versus $m$-spacing (IGS) (a) and by QMDR ($t$-statistic) versus $m$-spacing (IGS) (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as the 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.

3.2. Generation of the Simulation Data. To examine the performance of the $m$-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distributions were considered: a normal distribution, a gamma distribution, and a mixture of those two types. With a single causal pair designed, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution defined by the penetrance values tabulated for the possible genotype combinations, as follows:

$$y \mid (\mathrm{SNP1} = i, \mathrm{SNP2} = j) \sim N(f_{ij}, \sigma^2). \tag{12}$$
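The sampling rule (12) can be sketched as below. This is a simplified illustration and not the full simulation scheme of the paper: the function names are hypothetical, genotypes are drawn under Hardy-Weinberg proportions (an assumption), a single standard deviation `sigma` is used for all genotypes, and the high- and low-risk grouping constraint is not enforced. The penetrance table uses the values displayed in the panels of Figure 4 for illustration only.

```python
import random

def draw_genotype(maf, rng):
    """Number of minor alleles (0, 1, or 2) under Hardy-Weinberg proportions (assumed)."""
    return (rng.random() < maf) + (rng.random() < maf)

def simulate(n, maf, penetrance, sigma, rng):
    """Draw (SNP1, SNP2, y) triples with y ~ N(f_ij, sigma^2) as in (12)."""
    rows = []
    for _ in range(n):
        i, j = draw_genotype(maf, rng), draw_genotype(maf, rng)
        rows.append((i, j, rng.gauss(penetrance[i][j], sigma)))
    return rows

# penetrance[i][j] = f_ij, taken from the panel labels of Figure 4.
F = [
    [0.486, 0.960, 0.538],
    [0.947, 0.004, 0.811],
    [0.640, 0.606, 0.909],
]
rows = simulate(400, 0.2, F, 1.0, random.Random(0))
group_00 = [y for i, j, y in rows if (i, j) == (0, 0)]
print(len(group_00), sum(group_00) / len(group_00))
```

The printed pair is the size and sample mean of the $(0, 0)$ genotype group, which can be compared with its penetrance value.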

Here, $f_{ij}$ represents the penetrance value, tabulated for each possible pair of genotypes $(i, j)$ in every simulated model; the tables can be found in [21]. The 70 penetrance models cover 14 combinations of two minor allele frequencies (MAFs) and seven heritability values. Specifically, we considered MAFs of 0.2 and 0.4 and heritability values of 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated events was set such that the average $y$ of the high-risk group should be larger than or equal to the overall average, and the average $y$ of the low-risk group should be less than the overall average. In Figure 4, the 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples, and both have a variance of 1.0.

Figure 4: Demonstration of the simulation scheme. Phenotype distributions are plotted for each genotype formed by the two interacting SNPs, as denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. At the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines mark the mean values of the high- and low-risk groups; by constraint, the line on the left is for the low-risk group. Genotype (penetrance; number of samples): (0, 0) (0.486; 171), (0, 1) (0.960; 78), (0, 2) (0.538; 11), (1, 0) (0.947; 80), (1, 1) (0.004; 30), (1, 2) (0.811; 6), (2, 0) (0.640; 16), (2, 1) (0.606; 8), (2, 2) (0.909; 0).

For gamma distributions, phenotype values follow the rule below:

$$y \mid (\mathrm{SNP1} = i, \mathrm{SNP2} = j) \sim \Gamma(k, \theta). \tag{13}$$

The shape and scale parameters $k$ and $\theta$ were determined from $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$. Penetrance models were classified by 7 heritability values

BioMed Research International 7

15

10

05

00

GMDR

00 02 04

Heritability

Hit

ratio

m-spacing

QMDR

(a)

10

05

00

m-spacing

QMDR

GMDR

00 02 04

Heritability

Hit

ratio

(b)

10

05

00

GMDR

00 02 04

Heritability

Hit

ratio

m-spacing

QMDR

(c)

Figure 5 Comparison of the hit ratios or the detection probabilities among the proposed119898-spacing method QMDR and GMDR Genomicdatasets were generated based on 70 different penetrance functions [21] which were in turn classified into 7 distinct values of heritabilityFor each model the phenotype values are simulated with normal (a) gamma (b) and mixed (c) distributions High- and low-risk groupsin a quantitative trait overlapped with 9 different combinations of the standard deviations Considering all of the above 100 data files weregenerated for each case adding up to 9000 simulated files being examined for each point in the plot

001 002 005 01 02 03 and 04 resulting in 10 models foreach heritability The generated data files had a sample sizeof 400 with 20 SNPs In all 3 times 70 times 9 = 1890 differentconditionswere set up with 100 simulated data files generatedfor each condition
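The sampling rules (12) and (13) can be illustrated with a short Python sketch. It is not the simulation code used for this study: the function names are hypothetical, genotypes are assumed to be drawn independently under Hardy-Weinberg proportions, the first two SNPs are arbitrarily taken as the causal pair, and $\sigma$ is treated as the variance, following the stated relationship $\sigma = k\theta^2$. The penetrance table is the one read from the panels of Figure 4. The high- and low-risk grouping constraint and the mixed distribution are not implemented.

```python
import math
import random

# Penetrance values f_ij as read from the panels of Figure 4 (rows: SNP1, columns: SNP2).
PENETRANCE = [
    [0.486, 0.960, 0.538],
    [0.947, 0.004, 0.811],
    [0.640, 0.606, 0.909],
]


def gamma_shape_scale(mean, variance):
    """Solve mean = k * theta and variance = k * theta**2 for (k, theta)."""
    theta = variance / mean
    k = mean / theta
    return k, theta


def draw_genotype(maf, rng):
    """Number of minor alleles (0, 1, or 2), assuming Hardy-Weinberg proportions."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def draw_phenotype(f_ij, variance, dist, rng):
    """Draw one phenotype value around the penetrance f_ij, as in (12) or (13)."""
    if dist == "normal":
        return rng.gauss(f_ij, math.sqrt(variance))
    if dist == "gamma":
        k, theta = gamma_shape_scale(f_ij, variance)
        return rng.gammavariate(k, theta)  # arguments are (shape, scale)
    raise ValueError("dist must be 'normal' or 'gamma'")


def simulate_dataset(penetrance, maf, variance, dist,
                     n_samples=400, n_snps=20, seed=0):
    """Return (genotypes, phenotypes); SNPs 0 and 1 form the assumed causal pair."""
    rng = random.Random(seed)
    genotypes, phenotypes = [], []
    for _ in range(n_samples):
        row = [draw_genotype(maf, rng) for _ in range(n_snps)]
        f_ij = penetrance[row[0]][row[1]]
        genotypes.append(row)
        phenotypes.append(draw_phenotype(f_ij, variance, dist, rng))
    return genotypes, phenotypes
```

Solving the two relationships gives $k = f_{ij}^2 / \sigma$ and $\theta = \sigma / f_{ij}$, which is what `gamma_shape_scale` returns.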

3.3. Comparison of the Detection Probability and Type I Error. The “hit ratio,” or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used. All of them had a single causal pair to identify. In addition to our proposed $m$-spacing method, QMDR and GMDR were used to compare the results. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits of normal, gamma, and mixed distributions, respectively. The seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of $m$-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, $m$-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of $m$-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch of the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially the dichotomization of the observed values of the quantitative phenotypes. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed $m$-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric behavior is again confirmed in Figure 5(c), which shows that $m$-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, $m$-spacing was the most robust, performing consistently within the range of conditions for the simulation.

To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. Each null dataset thus has 20 SNPs, none of whose pairs is expected to have an association. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of the permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
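The permutation procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for Table 1: the function names are hypothetical, the $P$ value uses the common $(b + 1)/(B + 1)$ convention (the text does not say which convention was used), and `group_mean_spread` is only a stand-in score so that the example runs on its own; in the study the score would be the IG-based measure.

```python
import random


def permutation_p_value(statistic, genotype, phenotype, n_perm=1000, rng=None):
    """P value of an association score under shuffling of the phenotype values.

    Assumption: P = (b + 1) / (B + 1), where b counts permuted scores that are
    greater than or equal to the observed score and B is the number of permutations.
    """
    rng = rng or random.Random()
    observed = statistic(genotype, phenotype)
    shuffled = list(phenotype)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # genotypes stay fixed, phenotypes are permuted
        if statistic(genotype, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


def type_one_error_rate(p_values, alpha=0.05):
    """Fraction of null datasets whose permutation P value is <= alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


def group_mean_spread(genotype, phenotype):
    """Stand-in association score: spread of the per-genotype phenotype means."""
    groups = {}
    for g, y in zip(genotype, phenotype):
        groups.setdefault(g, []).append(y)
    means = [sum(v) / len(v) for v in groups.values()]
    return max(means) - min(means)
```

Applying `permutation_p_value` to one SNP pair in each null dataset and passing the resulting list to `type_one_error_rate` gives the kind of ratio reported in Table 1.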

Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.

| Type I error rate (%) | Normal | Gamma | Mixed |
|---|---|---|---|
| MAF 0.2 | 5.0 | 5.0 | 5.1 |
| MAF 0.4 | 5.1 | 5.0 | 5.1 |
| Heritability 0.01 | 5.3 | 5.0 | 4.8 |
| Heritability 0.02 | 4.9 | 5.4 | 5.2 |
| Heritability 0.05 | 5.3 | 4.3 | 5.3 |
| Heritability 0.1 | 5.0 | 5.3 | 5.1 |
| Heritability 0.2 | 5.0 | 5.3 | 5.1 |
| Heritability 0.3 | 4.8 | 4.9 | 4.8 |
| Heritability 0.4 | 5.1 | 4.7 | 5.3 |
| Overall | 5.0 | 5.0 | 5.1 |

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the $m$-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327872, spanning 22 chromosomes. The “height” phenotype was close to a normal distribution, so the $m$-spacing method may not take advantage of the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the $m$-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27]; two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool as in [26] but with the newly imputed dataset. $P$ values were estimated by permutation of the phenotype values to make null distributions. Permutations were iterated 100000 and 10000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.

4. Conclusion

In this paper, we present a modified $m$-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, whereas the original $m$-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. The proposed $m$-spacing method was shown to outperform the others in the range of lower heritability, regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This would lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution and a higher heritability, our method may be said to have performed successfully but with no advantage over other methods. More extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified $m$-spacing method.

Table 2: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

| rs ID | Chromosome | IGS | $P$ value | Previous report |
|---|---|---|---|---|
| rs11989122 | 8 | 11.3892 | $1 \times 10^{-5}$ | *$5.89 \times 10^{-6}$ |
| rs7316119 | 12 | 8.7531 | $1 \times 10^{-5}$ | - |
| rs936634 | 18 | 8.6125 | $2 \times 10^{-5}$ | - |
| rs7632381 | 3 | 7.8235 | $1 \times 10^{-5}$ | - |
| rs2079795 | 17 | 7.6542 | $1 \times 10^{-5}$ | $2.92 \times 10^{-6}$, Ref. [26] |
| rs1344672 | 3 | 7.6177 | $1 \times 10^{-5}$ | *$5.21 \times 10^{-7}$ |
| rs2523865 | 6 | 7.6044 | $4 \times 10^{-5}$ | - |
| rs3790199 | 20 | 7.5362 | $2 \times 10^{-5}$ | - |
| rs6440003 | 3 | 7.5231 | $1 \times 10^{-5}$ | $3.87 \times 10^{-7}$, Ref. [27] |
| rs17628655 | 19 | 7.5117 | $6 \times 10^{-5}$ | - |

*Identified using the same method as [26] but with imputed data, which is the same one we analyzed.

Table 3: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

| rs ID | Chromosome | rs ID | Chromosome | IGS | $P$ value |
|---|---|---|---|---|---|
| rs6499786 | 16 | rs1788421 | 21 | 4.6197 | $1 \times 10^{-4}$ |
| rs2529232 | 7 | rs1788421 | 21 | 4.3869 | $1 \times 10^{-4}$ |
| rs2241704 | 19 | rs1788421 | 21 | 4.3855 | $1 \times 10^{-4}$ |

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).

References

[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.

[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.

[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.

[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.

[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.

[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.

[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.

[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.

[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.

[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.

[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.

[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.

[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.

[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.

[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.

[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[18] N. N. Schraudolph, “Gradient-based manipulation of nonparametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.

[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.

[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.

[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.

[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.


[Figure 1: panel (a) plots the estimated entropy $H$ against the number of samples $n$ ($10^1$ to $10^5$); panel (b) plots $H$ against the sample-spacing $m$ (0 to 400). Both vertical axes span 1.0 to 1.5, and the analytic value for $\sigma = 1.0$ is marked by a horizontal arrow.]

Figure 1: The $n$-dependence (a) and $m$-dependence (b) of the entropy estimator $H_{m,n}$. An ensemble of 3000 sets of random samplings from $N(0, 1^2)$ was constructed and used for each point in the plot. The sample-spacing $m$ was fixed while varying the number of samples $n$ (a) to evaluate the $n$-dependence of the entropy estimator. In (b), $n$ was fixed and $m$ was varied to show the $m$-dependence. Analytically obtained true values are represented by the arrowed horizontal lines.

The calculated value of (5) is marked on the vertical axis with a horizontal arrow, with the corresponding $\sigma$ above it. The obvious $n$-dependence of the estimator can be seen in this plot, where the estimate approaches the analytic value as $n$ increases, with $\sqrt{n}$-consistency as expected [24]. In Figure 1(b), $n$ is fixed to 400 while $m$ is varied. In this plot, the estimated entropy again changes in value throughout the possible range of $m$. It is shown that the estimated value is always smaller than the analytically calculated value. Therefore, assigning a particular value to $m$, such as $\sqrt{n}$, the typical choice [25], would not be appropriate in this sampling range. Because of these $n$- and $m$-dependences, the estimator in (4) may need to be modified. Therefore, we modify the entropy estimator in (4) as follows:

$$H_{\langle m \rangle, n} = \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right). \tag{6}$$

In this modification, an entropy estimator is averaged over the possible $m$ values for each $n$, which is denoted by $\langle m \rangle$. This estimator is used to plot the entropy versus the number of samples in Figure 2. Over a wide range of $n$, this entropy estimator yields very stable values, in contrast to Figure 1(a). An increase in the extremely small $n$ range should be within the tolerable error in an application of genome-wide association, as the contribution to the conditional entropy by such a minor allele would be suppressed by the weighting factor of the marginal probability, which should be proportional to the number of corresponding samples. Analytically obtained entropy values for $N(0, \sigma^2)$ with three different values of $\sigma$ are marked on the vertical axis on the right-hand side. Regardless of the value of $\sigma$, the differences between the analytically obtained value and the values given by the estimator stay essentially the same. Considering that the association study measures the difference between the entropy and the corresponding conditional entropy, the stability should be a more critical issue than the absolute value of the estimates. Therefore, compensation of this $\Delta$ would not be necessary as long as it is stable. Furthermore, the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, an entropy estimator has been set up that should satisfy the practical $n$-independence without the need to find a proper sample-spacing.
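A direct transcription of estimator (6) into Python may help in reading the formula. It is a minimal sketch, not the authors' implementation: the function name is hypothetical, $X_{n,k}$ is assumed to denote the $k$th order statistic (the usual convention for spacing estimators), and tied values, which would give $\ln 0$, are assumed absent. For integer $m$, the term $\Gamma'(m)/\Gamma(m)$ is the digamma function, which equals $-\gamma$ plus the $(m-1)$th harmonic number.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def averaged_m_spacing_entropy(values):
    """Entropy estimate as in (6): the m-spacing estimator averaged over m = 1..n-1."""
    x = sorted(values)  # assumed order statistics X_{n,1} <= ... <= X_{n,n}
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    total = 0.0
    harmonic = 0.0  # H_{m-1}; for integer m, digamma(m) = -gamma + H_{m-1}
    for m in range(1, n):
        digamma_m = -EULER_GAMMA + harmonic
        # Inner sum over k = 1..n-m (zero-based here); ties would raise a math error.
        log_spacings = sum(
            math.log(n / m * (x[k + m] - x[k])) for k in range(n - m)
        )
        total += log_spacings / (n - m) - digamma_m + math.log(m)
        harmonic += 1.0 / m
    return total / (n - 1)
```

The double sum costs on the order of $n^2$ logarithm evaluations, so this plain version is intended for samples of a few hundred values, such as the $n = 400$ used in the simulations.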

2.3. Evaluation of a Conditional Entropy. Now let $G$ be a categorical variable assigned to each sample measurement $X_k$. $G$ may be a genotype given by a measured SNP or a combination of SNPs, while $X_k$ represents the measured value of a phenotype. For detecting the main effect of a single SNP, $G$ consists of three categories: $G = 0$, $G = 1$, and $G = 2$. For detecting the interaction between $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$, $G$ consists of 9 categories such that $G = 0 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 0)$, $G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, $\ldots$, and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of higher order interactions can be performed in the same way with expansion of the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset of measurements $X_{n_g,k}$ along with an individual sample-spacing $m_g$. Extending (6) while applying the above argument readily produces the estimators for the entropy of a phenotype and the conditional entropy. Here, $d$ denotes the order of a gene-gene interaction.

$$\begin{aligned}
H(P) &= \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right), \\
H(P \mid G) &= \sum_{g=0}^{3^d-1} \frac{n_g}{n} \left( \frac{1}{n_g-1} \sum_{m_g=1}^{n_g-1} \left( \frac{1}{n_g-m_g} \sum_{k=1}^{n_g-m_g} \ln\left( \frac{n_g}{m_g} \left( X_{n_g,k+m_g} - X_{n_g,k} \right) \right) - \frac{\Gamma'(m_g)}{\Gamma(m_g)} + \ln m_g \right) \right) \\
&= \sum_{g=0}^{3^d-1} \frac{n_g}{n} H(P \mid G = g).
\end{aligned} \tag{7}$$
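The weighted sum in (7), together with the difference $H(P) - H(P \mid G)$ used as the association strength, can be sketched in Python as follows. The function names are hypothetical, and two points are assumptions of this sketch rather than statements from the text: $X_{n,k}$ is taken to be the $k$th order statistic, and genotype categories with fewer than two samples, for which the estimator is undefined, are simply skipped. A genotype may be a single SNP value or a tuple of SNP values, so the same code covers main effects and higher order interactions.

```python
import math
import random
from collections import defaultdict

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def entropy(values):
    """Averaged m-spacing entropy estimate, as in the first line of (7)."""
    x = sorted(values)
    n = len(x)
    total, harmonic = 0.0, 0.0  # harmonic = H_{m-1}, digamma(m) = -gamma + H_{m-1}
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - (-EULER_GAMMA + harmonic) + math.log(m)
        harmonic += 1.0 / m
    return total / (n - 1)


def conditional_entropy(genotypes, phenotypes):
    """Sum over genotype categories g of (n_g / n) * H(P | G = g)."""
    groups = defaultdict(list)
    for g, y in zip(genotypes, phenotypes):
        groups[g].append(y)
    n = len(phenotypes)
    total = 0.0
    for ys in groups.values():
        if len(ys) >= 2:  # assumption: categories with n_g < 2 are skipped
            total += len(ys) / n * entropy(ys)
    return total


def information_gain(genotypes, phenotypes):
    """Association strength H(P) - H(P | G)."""
    return entropy(phenotypes) - conditional_entropy(genotypes, phenotypes)
```

For a 2nd order interaction, the genotype argument could be built as `list(zip(snp_i, snp_j))`, which yields the 9 categories described above.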

2.4. Standardized Measure of an Association Strength. Since differential entropy values are scale-dependent, when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the difference would be $\ln c$:

$$H_{cX_i} = \ln c + H_{X_i}. \tag{8}$$

For example, if the phenotype is height, it may be measured in meters or centimeters. In this case, the scale factor is 100. Nevertheless, the association strength should be the same. Also note that a negative value is perfectly legitimate for a differential entropy. Information gain (IG), in the form defined with discrete entropies [14], should satisfy scale independence while correctly representing an association strength without being affected by negative values. Therefore, it should retain its usefulness as a measure of an association strength:

$$\mathrm{IG} = H(P) - H(P \mid G). \tag{9}$$

IG can be readily estimated with the above estimator (7). The standardized IG (IGS) is set up with the mean and standard deviation of the IGs obtained from repeated shuffling of the phenotypes while all genotypes remain fixed [14].

[Figure 2: estimated entropy $H_{\langle m \rangle, n}$ (vertical axis, 0.5 to 2.0) plotted against the number of samples $n$ (horizontal axis, $10^1$ to $10^5$) for $\sigma = 0.7$, $1.0$, and $1.4$; the offset from the analytic value is marked as $\Delta = 0.183$ in each case.]

Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of samplings from $N(0, \sigma^2)$. While varying $n$, the number of samples, the estimated entropy values were averaged over all the possible sample-spacing values $m$. $\langle m \rangle$ denotes this averaging, which should not depend on weighting due to the virtually identical standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the range of practical numbers of samples. Moreover, the almost flat line connecting each symbol shifts up or down following exactly the change of the true value indicated by the horizontal arrows. The rise in the extremely small $n$ range should be within the tolerable error of any specific application because the contribution to the conditional entropy by such a case would be suppressed by weighting based on the marginal probability, which should be proportional to $n$.

Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$\overline{\mathrm{IG}}_p = \frac{\sum_{i=1}^{n} \mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n} \left( \mathrm{IG}^{(1)}_i - \overline{\mathrm{IG}}_p \right)^2}{n-1}}, \tag{10}$$

where $n$ is the number of permuted datasets. Now IGS is defined as follows:

$$\mathrm{IGS} = \frac{\mathrm{IG} - \overline{\mathrm{IG}}_p}{S_p}. \tag{11}$$
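Equations (10) and (11) amount to a z-score of the observed IG against the maxima from permuted datasets, which can be sketched in Python as follows. The function names are hypothetical; `score` stands for any association measure taking a genotype column and the phenotype values (the IG of (9) in this paper), and the set of genotype columns over which the maximum is taken is an assumption of the sketch.

```python
import random
import statistics


def permuted_max_scores(score, genotype_columns, phenotypes, n_perm, rng):
    """Maximum score over all genotype columns for each of n_perm phenotype shuffles."""
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # genotypes stay fixed, phenotypes are permuted
        maxima.append(max(score(col, shuffled) for col in genotype_columns))
    return maxima


def standardized_information_gain(observed_ig, permuted_max_igs):
    """IGS as in (11), with the mean and standard deviation of (10)."""
    mean_p = statistics.mean(permuted_max_igs)
    s_p = statistics.stdev(permuted_max_igs)  # sample form, divides by n - 1
    return (observed_ig - mean_p) / s_p
```

Because the permuted maxima are taken within one order of interaction, the resulting IGS values can be compared across orders, which is the use made of them in Section 3.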

3 Results and Discussions

31 Demonstration of the 119898-Spacing Method To showthe plausibility of the proposed 119898-spacing method

BioMed Research International 5

07

06

05

BA (b

y G

MD

R) P lt 0001

P = 0003

Main effect2-order3-order

minus4 0 4 8

IGS (by m-spacing)

(a)

P lt 0001

P = 0003

t-st

atist

ic (b

y Q

MD

R)

6

2

0

4

Main effect2-order3-order

minus4 0 4 8

IGS (by m-spacing)

(b)

Figure 3 Comparison of the QMDR GMDR and119898-spacing methods Association strengths obtained by GMDR versus119898-spacing (a) andby QMDR versus119898-spacing (b) are compared for a simulated dataset All three methods were used to evaluate the main effect as well as 2ndand 3rd order interactions The dataset was designed to have one 2nd order interaction causal pair

a representative result is shown in Figure 3 using adataset whose quantitative trait was generated from anormal distribution with a single causal SNP pair simulatedas described in the next section The sample size of thedataset was 400 with 20 SNPs In panel (a) the associationstrengths obtained by 119898-spacing and GMDR are plottedas horizontal and vertical coordinates respectively Filledtriangles represent the main effects while open circles arefor the 2nd order interactions Both methods identify thesame single SNP pair having a prominent interaction plottedin the upper right corner One of the SNPs was found toproduce the main effect in contrast to others Again theresult is agreed by both methods 119875 values obtained bypermutation are given in the boxes for those selected pointsAssociation strengths of the 3rd order interactions areplotted with a plus sign Because no 3rd order interaction issimulated into the dataset the combinations of SNPs madeby adding a single SNP to the causal pair are expected tohave high association values Those points are clustered nearthe identified causal pair in the upper right corner In panel(b) of Figure 3 the same comparison was made using theresult from 119898-spacing and QMDR Both comparisons showconsistent results between the proposed 119898-spacing methodand GMDR or QMDR Note that IGS instead of IG was usedThe distribution of the IG values from a dataset would shift toa higher direction with increased order of interactionsThusthe more conditions applied the less entropy may be left tofind In other words as the order of interaction increasesthe conditional entropy 119867(119875 | 119866) tends to decrease while

119867(119875) remains the same Therefore IGS is vital if one needsto compare the association strengths between genotypesfrom different orders of interactions Figure 3 shows that thesimulated causal pair has the largest IGS value among allpoints from different orders of interactions

32 Generation of the Simulation Data To examine theperformance of the 119898-spacing method an extensive set ofsimulation data was necessarily generated First three typesof quantitative trait distributions were considered Two ofthem were normal and gamma distributions and anotherone was a mixture of those two types With single causalpair designed 70 different penetrance models based on [21]were incorporated For the case of a normal distribution aphenotype value 119910 associated with two interacting SNPswas selected from a normal distribution as defined bythe penetrance values tabled for possible combinations ofgenotypes associated as follows

119910 | (SNP1 = 119894 SNP2 = 119895) sim 119873(119891119894119895 1205902) (12)

Here 119891119894119895 represents the penetrance values tabled for everymodel simulated and can be found in [21] It is tabulatedfor each possible pair of genotypes (119894 119895) In 70 differentpenetrancemodels 14 different combinations of two differentminor allele frequencies (MAFs) and seven different heri-tability values were considered Specifically we consideredthe cases when the MAFs were 02 and 04 and when theheritability was 001 0025 005 01 02 03 and 04 Three

6 BioMed Research International

(0 0)

0486

171(0 1)

0960

78

minus4 minus2 4

(0 2)

0538

11

(1 0)

0947

80

minus3 minus1

(1 1)

0004

30(1 2)

0811

6

(2 0)

0640

16

minus2 minus1

(2 1)

0606

8

minus2 minus1

(2 2)

0909

0

minus2 0 2 4

00

02

04

minus2 0 2

1 2 3

1 2 3

0 24

00

02

04

00

02

04

minus2 0

0 1 2 30

2 4

00

02

04

minus2 0 2 4

00

02

04

00

02

04

minus2 0 2 4

00

02

04

00

02

04

00

02

04

Figure 4 Demonstration of the simulation scheme Phenotype distributions were plotted to associate with the genotypes by two interactingSNPs as denoted in the parentheses on top of each plot SNPs may take values of 0 1 and 2 or AA Aa and aa For this particular dataset theMAF was set to 0200 On the bottom of each plot the penetrance value for this particular model is given which is taken from [21] Insideeach plot the number of samples generated to satisfy the simulation constraint is given The vertical dotted lines are for the mean values ofthe high- and low-risk groups By constraint the line on the left is for the low-risk group

different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated event was set such that the averaged $y$ of the high-risk group should be larger than or equal to the overall average. The averaged $y$ of the low-risk group should be less than the overall average. In Figure 4, 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples, and both have a variance of 1.0. For gamma distributions, phenotype values follow the rule below:

$$ y \mid (\mathrm{SNP}_1 = i, \mathrm{SNP}_2 = j) \sim \Gamma(k, \theta). \tag{13} $$

The shape and scale parameters, $k$ and $\theta$, were determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$. Penetrance models were classified by 7 heritability values,

[Figure 5: three panels, (a), (b), and (c), each plotting the hit ratio (vertical axis) against heritability (horizontal axis) for m-spacing, QMDR, and GMDR.]

Figure 5: Comparison of the hit ratios, or the detection probabilities, among the proposed m-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values are simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot.

0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, 3 × 70 × 9 = 1890 different conditions were set up, with 100 simulated data files generated for each condition.
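The generation rules in (12) and (13) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' simulation code: genotypes are drawn under Hardy-Weinberg proportions (an assumption), the high-/low-risk grouping constraint and the independent choice of $\sigma$ per group are omitted, and all function names are hypothetical. The penetrance table is the one shown in the Figure 4 panels. Note that the text calls $\sigma$ a variance while (12) writes $\sigma^2$; the sketch follows each formula literally.

```python
import random

# Penetrance values read from the Figure 4 panels
# (rows: SNP1 = 0, 1, 2; columns: SNP2 = 0, 1, 2).
FIGURE4_PENETRANCE = [
    [0.486, 0.960, 0.538],
    [0.947, 0.004, 0.811],
    [0.640, 0.606, 0.909],
]


def gamma_params(f, sigma):
    """Solve f = k * theta and sigma = k * theta**2 for shape k and scale theta."""
    theta = sigma / f
    k = f / theta
    return k, theta


def draw_genotype(maf, rng):
    """Minor-allele count (0, 1, or 2); Hardy-Weinberg proportions are assumed."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def simulate(penetrance, maf, sigma, n, dist="normal", seed=0):
    """Return n tuples (snp1, snp2, y) drawn under rule (12) or rule (13)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        i, j = draw_genotype(maf, rng), draw_genotype(maf, rng)
        f = penetrance[i][j]
        if dist == "normal":
            # N(f_ij, sigma^2): sigma enters as a standard deviation here.
            y = rng.gauss(f, sigma)
        elif dist == "gamma":
            # Gamma(k, theta): sigma enters as the variance k * theta^2 here.
            k, theta = gamma_params(f, sigma)
            y = rng.gammavariate(k, theta)
        else:
            raise ValueError("dist must be 'normal' or 'gamma'")
        rows.append((i, j, y))
    return rows


# Condition count quoted in the text: 3 distributions x 70 models x 9 sigma pairs.
N_CONDITIONS = 3 * 70 * 9
```

Solving the two gamma relationships gives $\theta = \sigma / f_{ij}$ and $k = f_{ij}^2 / \sigma$, which is what `gamma_params` returns.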

3.3. Comparison of the Detection Probability and Type I Error. The “hit ratio,” or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used, each of which had a single causal pair to identify. In addition to our proposed m-spacing method, QMDR and GMDR were used for comparison. Figure 5 shows the results. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. The seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of m-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, m-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of m-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch in the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially the dichotomization of the observed values of the quantitative phenotypes. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed m-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric behavior is again confirmed in Figure 5(c), showing that m-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, m-spacing was the most robust, performing consistently within the range of conditions for the simulation.
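The text does not spell out how a “hit” is scored. One plausible reading, stated here purely as an assumption, is that a simulated file counts as a hit when the top-ranked SNP pair returned by a method is the designed causal pair. Under that reading, the hit ratio for one point in Figure 5 could be computed as in the sketch below; the function name and the pair representation are hypothetical.

```python
def hit_ratio(top_pairs, causal_pair):
    """Fraction of simulated files whose top-ranked SNP pair is the causal pair.

    top_pairs   : one (snp_a, snp_b) tuple per simulated file
    causal_pair : the designed causal pair; SNP order within a pair is ignored
    """
    causal = frozenset(causal_pair)
    hits = sum(1 for pair in top_pairs if frozenset(pair) == causal)
    return hits / len(top_pairs)
```

Comparing unordered sets means that (SNP3, SNP7) and (SNP7, SNP3) are treated as the same pair.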

To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. Each null dataset therefore has 20 SNPs, none of whose pairs is expected to show an association. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
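The two steps of this estimate can be illustrated with a short sketch. The rate follows the definition given in the text (the share of permutation $P$ values at or below $\alpha$). The $P$ value formula itself is not stated in the text, so the one-sided form used here, without any plus-one correction, is an assumption, and both function names are hypothetical.

```python
def permutation_p_value(observed, null_statistics):
    """Share of permuted statistics at least as large as the observed one.

    Assumed form: one-sided, no plus-one correction.
    """
    exceed = sum(1 for s in null_statistics if s >= observed)
    return exceed / len(null_statistics)


def type_one_error_rate(p_values, alpha=0.05):
    """Ratio of null-dataset P values smaller than or equal to alpha."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)
```

With 1000 permutations, a $P$ value computed this way moves in steps of 0.001, that is, 0.1 when expressed in percent.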

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the m-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327,872, spanning 22 chromosomes. The “height” phenotype proved to be close to a normal distribution, such that the m-spacing method may not take advantage of the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the m-spacing method (IGS) that had the strongest main effects. Out of 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], although two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool

Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.

Type I error rate (%)      Normal   Gamma   Mixed
MAF            0.2         5.0      5.0     5.1
               0.4         5.1      5.0     5.1
Heritability   0.01        5.3      5.0     4.8
               0.02        4.9      5.4     5.2
               0.05        5.3      4.3     5.3
               0.1         5.0      5.3     5.1
               0.2         5.0      5.3     5.1
               0.3         4.8      4.9     4.8
               0.4         5.1      4.7     5.3
Overall                    5.0      5.0     5.1

as in [26] but using the newly imputed dataset. $P$ values were estimated by permutation of the phenotype values to make null distributions. Permutations were iterated 100,000 and 10,000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.

4. Conclusion

In this paper, we present a modified m-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, whereas the original m-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. In these simulations, the proposed m-spacing method outperformed the others in the range of lower heritability, regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This would lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution having a higher heritability, our method may be said to have performed successfully with no advantage over other methods. A more extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified m-spacing method.

Table 2: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

rs ID        Chromosome   IGS       P value      Previous report
rs11989122   8            11.3892   1 × 10^-5    5.89 × 10^-6 (*)
rs7316119    12           8.7531    1 × 10^-5    -
rs936634     18           8.6125    2 × 10^-5    -
rs7632381    3            7.8235    1 × 10^-5    -
rs2079795    17           7.6542    1 × 10^-5    2.92 × 10^-6, Ref. [26]
rs1344672    3            7.6177    1 × 10^-5    5.21 × 10^-7 (*)
rs2523865    6            7.6044    4 × 10^-5    -
rs3790199    20           7.5362    2 × 10^-5    -
rs6440003    3            7.5231    1 × 10^-5    3.87 × 10^-7, Ref. [27]
rs17628655   19           7.5117    6 × 10^-5    -

(*) Identified using the same method as [26] but with imputed data, which is the same one we analyzed.

Table 3: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

rs ID        Chromosome   rs ID        Chromosome   IGS      P value
rs6499786    16           rs1788421    21           4.6197   1 × 10^-4
rs2529232    7            rs1788421    21           4.3869   1 × 10^-4
rs2241704    19           rs1788421    21           4.3855   1 × 10^-4

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).

References

[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.

[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.

[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.

[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.

[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.

[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.

[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.

[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.

[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.

[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.

[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.

[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.

[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.

[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.

[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.

[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[18] N. N. Schraudolph, “Gradient-based manipulation of non-parametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.

[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.

[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.

[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.

[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.


$G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, ..., and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of the higher order interaction can be performed in the same way with expansion of the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset measurements $X_{n_g, k}$ along with an individual sample-spacing of $m_g$. Extending (6) while applying the above argument should now readily produce the estimators for the entropy of a phenotype and the conditional entropy. Here $d$ denotes the order of a gene-gene interaction.

$$
H(P) = \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right),
$$

$$
\begin{aligned}
H(P \mid G) &= \sum_{g=0}^{3^d-1} \frac{n_g}{n} \left( \frac{1}{n_g-1} \sum_{m_g=1}^{n_g-1} \left( \frac{1}{n_g-m_g} \sum_{k=1}^{n_g-m_g} \ln\left( \frac{n_g}{m_g} \left( X_{n_g,k+m_g} - X_{n_g,k} \right) \right) - \frac{\Gamma'(m_g)}{\Gamma(m_g)} + \ln m_g \right) \right) \\
&= \sum_{g=0}^{3^d-1} \frac{n_g}{n} H(P \mid G = g).
\end{aligned}
\tag{7}
$$
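The estimators in (7) translate fairly directly into code. The following is a minimal pure-Python sketch, not the authors' implementation. It uses the standard identity $\Gamma'(m)/\Gamma(m) = \psi(m) = -\gamma + \sum_{j=1}^{m-1} 1/j$ for integer $m$, where $\gamma$ is the Euler-Mascheroni constant. Two choices are assumptions of this sketch rather than statements from the text: genotype groups with fewer than two observations are skipped, and the phenotype values are assumed to contain no ties (a tie would give a zero spacing and an undefined logarithm). The function names are hypothetical.

```python
import math
import random
from collections import defaultdict

EULER_GAMMA = 0.5772156649015329  # psi(1) = -EULER_GAMMA


def m_spacing_entropy(values):
    """Entropy estimate H(P) of (7): m-spacing estimates averaged over m = 1..n-1."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = 0.0
    psi_m = -EULER_GAMMA  # digamma at m = 1; psi(m + 1) = psi(m) + 1/m
    for m in range(1, n):
        log_sum = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += log_sum / (n - m) - psi_m + math.log(m)
        psi_m += 1.0 / m
    return total / (n - 1)


def conditional_entropy(phenotypes, genotypes):
    """H(P | G) of (7): group entropies weighted by n_g / n.

    Assumption of this sketch: groups with fewer than two samples are skipped.
    """
    groups = defaultdict(list)
    for y, g in zip(phenotypes, genotypes):
        groups[g].append(y)
    n = len(phenotypes)
    return sum(len(ys) / n * m_spacing_entropy(ys)
               for ys in groups.values() if len(ys) >= 2)


if __name__ == "__main__":
    rng = random.Random(0)
    y = [rng.gauss(0.0, 1.0) for _ in range(200)]
    g = [rng.randrange(9) for _ in y]  # 3**2 genotype categories for d = 2
    print(m_spacing_entropy(y), conditional_entropy(y, g))
```

Averaging over every $m$ costs on the order of $n^2$ logarithm evaluations per group, which is manageable for a sample size of a few hundred but adds up when many SNP combinations are scanned.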

2.4. Standardized Measure of an Association Strength. Since the differential entropy values are scale-dependent, when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the difference would be $\ln c$:

$$ H_{cX_i} = \ln c + H_{X_i}. \tag{8} $$

For example, if the phenotype is height, it may be measured in meters or centimeters. In this case, the scale factor is 100. Nevertheless, the association strength should be the same. Also note that a negative value is perfectly legitimate for a differential entropy. Information gain, IG, as in the form defined with discrete entropies [14], should satisfy scale independence while correctly representing an association strength without being affected by negative values. Therefore, it should retain its usefulness as a measure of an association strength:

$$ \mathrm{IG} = H(P) - H(P \mid G). \tag{9} $$
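The scale independence of IG follows from (7) and (8) in a few steps. As a short check, assume the whole phenotype is rescaled by one factor $c > 0$, so that every genotype group picks up the same $\ln c$, and use the fact that the weights $n_g/n$ sum to one. The subscripts in $\mathrm{IG}_{cP}$ and $\mathrm{IG}_{P}$ are notation introduced only for this check.

```latex
\begin{aligned}
\mathrm{IG}_{cP}
  &= H(cP) - \sum_{g=0}^{3^d-1} \frac{n_g}{n}\, H(cP \mid G = g) \\
  &= \bigl(\ln c + H(P)\bigr)
     - \sum_{g=0}^{3^d-1} \frac{n_g}{n}\,\bigl(\ln c + H(P \mid G = g)\bigr) \\
  &= H(P) - H(P \mid G)
     + \ln c \,\Bigl(1 - \sum_{g=0}^{3^d-1} \frac{n_g}{n}\Bigr) \\
  &= \mathrm{IG}_{P}.
\end{aligned}
```

For the height example, $\ln 100 \approx 4.6$ is added to both $H(P)$ and $H(P \mid G)$ when switching from meters to centimeters, and it cancels in the difference.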

IG would be readily estimated with the above estimator (7). IG standardized (IGS) is set up with the means and standard deviations of IGs obtained from repeated shuffling of the phenotypes while all genotypes remained fixed [14].

[Figure 2: estimated entropy averaged over all $m$, $H_{\langle m \rangle}$ (vertical axis), versus the number of samples $n$ (horizontal axis, logarithmic, $10^1$ to $10^5$), for $\sigma$ = 0.7, 1.0, and 1.4. Each series is annotated with $\Delta$ = 0.183.]

Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of samplings from $N(0, \sigma^2)$. While varying $n$, the number of samples, the estimated entropy values were averaged over all the possible $m$ sample-spacing values. $\langle m \rangle$ denotes this averaging, which should not depend on weighting due to the virtually same standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the range of practical number of sampling. Moreover, the almost flat line connecting each symbol shifts up or down following exactly the change of the true value indicated by the horizontal arrows. The rise in the extremely small $n$ range should be within the tolerable error of any specific application, because the contribution to conditional entropy by such a case would be suppressed by weighting based on the marginal probability that should be proportional to $n$.

Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$ \overline{\mathrm{IG}}_p = \frac{\sum_{i=1}^{n} \mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n} \left( \mathrm{IG}^{(1)}_i - \overline{\mathrm{IG}}_p \right)^2}{n-1}}, \tag{10} $$

where $n$ is the number of permuted datasets. Now IGS is defined as follows:

$$ \mathrm{IGS} = \frac{\mathrm{IG} - \overline{\mathrm{IG}}_p}{S_p}. \tag{11} $$
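The standardization in (10) and (11) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `score` stands for any association measure that takes phenotypes and one genotype assignment (for example, an IG estimator), `genotype_sets` is a hypothetical list of the genotype assignments being scanned, and `mean_gap` is a toy stand-in used only to make the example runnable.

```python
import random
import statistics


def permuted_maxima(phenotypes, genotype_sets, score, n_perm, seed=0):
    """Maximum score over all genotype sets for each of n_perm phenotype shuffles."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # phenotypes are shuffled; genotypes stay fixed
        maxima.append(max(score(shuffled, g) for g in genotype_sets))
    return maxima


def standardize(ig_observed, maxima):
    """IGS of (11): (IG - mean of permuted maxima) / S_p, with S_p as in (10)."""
    mean_p = statistics.mean(maxima)
    s_p = statistics.stdev(maxima)  # sample standard deviation, n - 1 denominator
    return (ig_observed - mean_p) / s_p


def mean_gap(phenotypes, genotypes):
    """Toy stand-in for an IG estimator: spread of the genotype-group means."""
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    means = [sum(v) / len(v) for v in groups.values()]
    return max(means) - min(means)
```

A call such as `standardize(mean_gap(y, snp), permuted_maxima(y, all_snps, mean_gap, 1000))` would then express the observed score in units of the permutation standard deviation, where `y`, `snp`, and `all_snps` are placeholders for the user's own data.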

3. Results and Discussions

3.1. Demonstration of the m-Spacing Method. To show the plausibility of the proposed m-spacing method,

[Figure 3: two scatter plots. (a) BA (by GMDR) versus IGS (by m-spacing); (b) t-statistic (by QMDR) versus IGS (by m-spacing). Markers distinguish main effects, 2nd order interactions, and 3rd order interactions; the selected points are annotated with P < 0.001 and P = 0.003.]

Figure 3: Comparison of the QMDR, GMDR, and m-spacing methods. Association strengths obtained by GMDR versus m-spacing (a) and by QMDR versus m-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.

a representative result is shown in Figure 3 using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400 with 20 SNPs. In panel (a), the association strengths obtained by m-spacing and GMDR are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles are for the 2nd order interactions. Both methods identify the same single SNP pair having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others. Again, both methods agree on this result. $P$ values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with a plus sign. Because no 3rd order interaction was simulated into the dataset, only the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison was made using the results from m-spacing and QMDR. Both comparisons show consistent results between the proposed m-spacing method and GMDR or QMDR. Note that IGS, instead of IG, was used. The distribution of the IG values from a dataset would shift to higher values with increased order of interactions: the more conditions applied, the less entropy may be left to find. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare the association strengths between genotypes from different orders of interactions. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from different orders of interactions.

3.2. Generation of the Simulation Data. To examine the performance of the m-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distributions were considered: normal distributions, gamma distributions, and a mixture of the two. With a single causal pair designed, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution defined by the penetrance values tabled for the possible combinations of genotypes, as follows:

$$ y \mid (\mathrm{SNP}_1 = i, \mathrm{SNP}_2 = j) \sim N(f_{ij}, \sigma^2). \tag{12} $$

Here 119891119894119895 represents the penetrance values tabled for everymodel simulated and can be found in [21] It is tabulatedfor each possible pair of genotypes (119894 119895) In 70 differentpenetrancemodels 14 different combinations of two differentminor allele frequencies (MAFs) and seven different heri-tability values were considered Specifically we consideredthe cases when the MAFs were 02 and 04 and when theheritability was 001 0025 005 01 02 03 and 04 Three

6 BioMed Research International

(0 0)

0486

171(0 1)

0960

78

minus4 minus2 4

(0 2)

0538

11

(1 0)

0947

80

minus3 minus1

(1 1)

0004

30(1 2)

0811

6

(2 0)

0640

16

minus2 minus1

(2 1)

0606

8

minus2 minus1

(2 2)

0909

0

minus2 0 2 4

00

02

04

minus2 0 2

1 2 3

1 2 3

0 24

00

02

04

00

02

04

minus2 0

0 1 2 30

2 4

00

02

04

minus2 0 2 4

00

02

04

00

02

04

minus2 0 2 4

00

02

04

00

02

04

00

02

04

Figure 4 Demonstration of the simulation scheme Phenotype distributions were plotted to associate with the genotypes by two interactingSNPs as denoted in the parentheses on top of each plot SNPs may take values of 0 1 and 2 or AA Aa and aa For this particular dataset theMAF was set to 0200 On the bottom of each plot the penetrance value for this particular model is given which is taken from [21] Insideeach plot the number of samples generated to satisfy the simulation constraint is given The vertical dotted lines are for the mean values ofthe high- and low-risk groups By constraint the line on the left is for the low-risk group

different values (08 10 and 12) of the variance 120590 were usedindependently for the high- and low-risk groups resulting in9 combinations The grouping constraint for the generatedevent was set such that the averaged 119910 of the high-risk groupshould be larger than or equal to the overall average Theaveraged 119910 of the low-risk group should be less than theoverall average In Figure 4 9 possible distributions of agenerated phenotype are shown In this example the sample

size is 400 The high- and low-risk groups have the samenumber of samples and both have a variance of 10 Forgammadistributions phenotype values follow the rule below

119910 | (SNP1 = 119894 SNP2 = 119895) sim Γ (119896 120579) (13)

The shape and scale parameters 119896 and 120579 were determinedby 119891119894119895 and 120590 using the relationship 119891119894119895 = 119896120579 and 120590 = 119896120579

2Penetrance models were classified by 7 heritability values

BioMed Research International 7

15

10

05

00

GMDR

00 02 04

Heritability

Hit

ratio

m-spacing

QMDR

(a)

10

05

00

m-spacing

QMDR

GMDR

00 02 04

Heritability

Hit

ratio

(b)

10

05

00

GMDR

00 02 04

Heritability

Hit

ratio

m-spacing

QMDR

(c)

Figure 5 Comparison of the hit ratios or the detection probabilities among the proposed119898-spacing method QMDR and GMDR Genomicdatasets were generated based on 70 different penetrance functions [21] which were in turn classified into 7 distinct values of heritabilityFor each model the phenotype values are simulated with normal (a) gamma (b) and mixed (c) distributions High- and low-risk groupsin a quantitative trait overlapped with 9 different combinations of the standard deviations Considering all of the above 100 data files weregenerated for each case adding up to 9000 simulated files being examined for each point in the plot

001 002 005 01 02 03 and 04 resulting in 10 models foreach heritability The generated data files had a sample sizeof 400 with 20 SNPs In all 3 times 70 times 9 = 1890 differentconditionswere set up with 100 simulated data files generatedfor each condition

33 Comparison of the Detection Probability and Type IError The ldquohit ratiordquo or detection power of the IGS wasevaluated and compared Simulated data files described inthe previous subsection were used All of them had a singlecausal pair to identify In addition to our proposed119898-spacing

method QMDR and GMDR were used to compare theresults Figure 5 shows the comparison Panels (a) (b) and(c) are for the quantitative trait of normal gamma andmixeddistributions respectively Seventy penetrance models weregrouped into 7 cases of heritability on the horizontal axiswhile all 9 combinations of the variances in high- and low-risk distributions were merged into each heritability caseWith a normal distribution as shown in Figure 5(a) the119898-spacingrsquos performance was in between those of QMDRand GMDR for higher values of penetrance However in therange of penetrance less than 02 the 119898-spacing performs

8 BioMed Research International

best Note that theQMDR shows higher detection probabilitythan the GMDR throughout the range In the case of agamma distribution as shown in Figure 5(b) the QMDRrsquosperformance drops rapidly as the heritability decreases whenthe hit ratios of 119898-spacing as well as the GMDR stay betterthan that of QMDR and are comparable to each other Notethe switch of the GMDR and QMDRrsquos performance rankswith the change of the phenotype distribution What QMDRdoes is essentially the dichotomization of the observed valuesof the quantitative phenotypes Therefore it should do betterwith well-defined symmetric distributions such as a normaldistribution than with an asymmetric one (eg gamma dis-tribution)The proposed119898-spacingmethod is expected to beeffective regardless of the shape of the phenotype distributionbecause it makes no assumptions regarding the distributionand is therefore nonparametric as demonstrated in Figures5(a) and 5(b)This nonparameterization is again confirmed inFigure 5(c) showing that119898-spacing outperforms the QMDRand theGMDR throughout the whole range of heritability inthe case of themixed form of phenotype distribution Amongthe threemethods examined119898-spacing was themost robustperforming consistentlywithin the range of conditions for thesimulation

To estimate the type I error rate the null datasets weregenerated under the same scheme as used for the detectionpower analysis except that there was no causal pair intendedNow there are 20 SNPs that none of the pairs are expectedto have an association Permutation 119875 values for a particularpair were obtained by permuting each dataset 1000 timesWe took the significance level 120572 as 005 to get the ratio ofthe permutation 119875 values smaller than or equal to 120572 Wereport this ratio as the type I error rate in Table 1 whoseaccuracy to one decimal place when expressed in percent wasensured by the number of the permutation Table 1 presentsthe type I error rate for each combination of three traitdistributions two MAFs and seven heritability values alongwith the overall estimates Throughout these conditionsthe type I error rates are gathered tightly around 5 withmaximum and minimum of 54 and 43 respectivelyMoreover there exists no sign of the dependence on thetrait shape heritability and MAF Therefore our proposedmethod preserved the type I error rates on these condi-tions

Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.

Type I error rate (%)      Normal   Gamma   Mixed
MAF            0.2         5.0      5.0     5.1
               0.4         5.1      5.0     5.1
Heritability   0.01        5.3      5.0     4.8
               0.02        4.9      5.4     5.2
               0.05        5.3      4.3     5.3
               0.1         5.0      5.3     5.1
               0.2         5.0      5.3     5.1
               0.3         4.8      4.9     4.8
               0.4         5.1      4.7     5.3
Overall                    5.0      5.0     5.1

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the m-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8,842 from the population-based cohort. The total number of SNPs was 327,872, spanning 22 chromosomes. The “height” phenotype was close to a normal distribution, so the m-spacing method may not gain an advantage from the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the m-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27]. Two more matched SNPs, rs11989122 and rs1344672, were found when we ran the same tool as in [26] on the newly imputed dataset. P values were estimated by permuting the phenotype values to make null distributions. Permutations were iterated 100,000 and 10,000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.
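As a side note on resolution, a permutation P value can only take values on a grid set by the number of permutations. The exact estimator is not stated here, so the following assumes the P value is the fraction of permuted statistics at least as large as the observed one, with $b$ such permutations out of $N$:

```latex
\hat{P} = \frac{b}{N}, \qquad
\min_{b \ge 1} \hat{P} = \frac{1}{N}, \qquad
\frac{1}{100000} = 1 \times 10^{-5}, \qquad
\frac{1}{10000} = 1 \times 10^{-4}.
```

Under that assumption, P values of 1 × 10^-5 for the main effect and 1 × 10^-4 for the interaction would sit at the smallest nonzero value that the respective permutation counts can resolve.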

4. Conclusion

In this paper, we present a modified m-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, whereas the original m-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distribution, while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. In the lower heritability range, the proposed m-spacing method outperformed the others regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method was comparable to that of GMDR or QMDR, whichever showed better performance in that region. This should lead to versatile applicability of our nonparametric method to quantitative traits with various characteristics. We applied this method to identify the main effects and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of the results overlapped with previous reports, new interactions were also found. Because “height” is presumed to be a normally distributed trait with a higher heritability, our method can be said to have performed successfully in a setting where it holds no advantage over other methods. More extensive study of quantitative traits with various characteristics is needed to further demonstrate the expected robustness of our modified m-spacing method.

BioMed Research International 9

Table 2: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

rs ID        Chromosome   IGS       P value      Previous report
rs11989122   8            11.3892   1 × 10^-5    *5.89 × 10^-6
rs7316119    12           8.7531    1 × 10^-5    -
rs936634     18           8.6125    2 × 10^-5    -
rs7632381    3            7.8235    1 × 10^-5    -
rs2079795    17           7.6542    1 × 10^-5    2.92 × 10^-6, Ref [26]
rs1344672    3            7.6177    1 × 10^-5    *5.21 × 10^-7
rs2523865    6            7.6044    4 × 10^-5    -
rs3790199    20           7.5362    2 × 10^-5    -
rs6440003    3            7.5231    1 × 10^-5    3.87 × 10^-7, Ref [27]
rs17628655   19           7.5117    6 × 10^-5    -

*Identified using the same method as [26] but with imputed data, which is the same dataset we analyzed.

Table 3: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

rs ID       Chromosome   rs ID       Chromosome   IGS      P value
rs6499786   16           rs1788421   21           4.6197   1 × 10^-4
rs2529232   7            rs1788421   21           4.3869   1 × 10^-4
rs2241704   19           rs1788421   21           4.3855   1 × 10^-4

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).

References

[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.

[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.

[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.

[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.

[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.

[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.

[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.

[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.

[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.

[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.

[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.


[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.

[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.

[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.

[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.

[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[18] N. N. Schraudolph, “Gradient-based manipulation of nonparametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.

[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.

[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.

[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.

[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.



[Figure 3: panel (a) plots BA (by GMDR) against IGS (by m-spacing); panel (b) plots the t-statistic (by QMDR) against IGS (by m-spacing). Legend: main effect, 2nd order, 3rd order. Boxed permutation P values in each panel: P < 0.001 and P = 0.003.]

Figure 3: Comparison of the QMDR, GMDR, and m-spacing methods. Association strengths obtained by GMDR versus m-spacing (a) and by QMDR versus m-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.

a representative result is shown in Figure 3, using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400, with 20 SNPs. In panel (a), the association strengths obtained by m-spacing and GMDR are plotted as the horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles represent the 2nd order interactions. Both methods identify the same single SNP pair as having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others. Again, both methods agree on this result. P values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with plus signs. Because no 3rd order interaction was simulated into the dataset, only the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison was made using the results from m-spacing and QMDR. Both comparisons show consistent results between the proposed m-spacing method and GMDR or QMDR. Note that IGS, instead of IG, was used. The distribution of the IG values from a dataset shifts toward higher values as the order of interaction increases, because the more conditions are applied, the less entropy may be left to find. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare the association strengths between genotypes from different orders of interaction. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from different orders of interaction.
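The relationship between $H(P)$, $H(P \mid G)$, and the information gain can be sketched in a few lines. This is an illustration under stated assumptions, not the modified estimator proposed in this paper: `spacing_entropy` uses one common form of the m-spacing estimator (as given, for example, in [25]), the default choice of m as the square root of the sample size is an assumption, and the standardization that turns IG into IGS is not reproduced here. The function names are hypothetical.

```python
import numpy as np


def spacing_entropy(x, m=None):
    """m-spacing estimate of the differential entropy of a 1-D sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        return 0.0  # assumption: a cell with fewer than 2 samples contributes nothing
    if m is None:
        m = max(1, int(round(np.sqrt(n))))  # assumption: m close to sqrt(n)
    m = min(m, n - 1)
    spacings = x[m:] - x[:-m]  # x_(i+m) - x_(i), n - m values
    spacings = np.maximum(spacings, np.finfo(float).tiny)  # guard against ties
    return float(np.mean(np.log((n + 1) / m * spacings)))


def information_gain(phenotype, genotype, m=None):
    """IG = H(P) - H(P | G), with H(P | G) a sample-weighted sum over genotypes."""
    phenotype = np.asarray(phenotype, dtype=float)
    genotype = np.asarray(genotype)
    h_total = spacing_entropy(phenotype, m)
    h_conditional = 0.0
    for g in np.unique(genotype):
        mask = genotype == g
        h_conditional += mask.mean() * spacing_entropy(phenotype[mask], m)
    return h_total - h_conditional
```

For a 2nd order interaction, the two SNPs can be merged into one label, for example `3 * snp1 + snp2`, before calling `information_gain`. Each added SNP splits the samples into more cells, which is why a raw IG is not directly comparable across orders.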

3.2. Generation of the Simulation Data. To examine the performance of the m-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distributions were considered: a normal distribution, a gamma distribution, and a mixture of the two. With a single causal pair designed in, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution defined by the penetrance value tabulated for each possible combination of genotypes, as follows:

$$y \mid (\mathrm{SNP1} = i,\ \mathrm{SNP2} = j) \sim N(f_{ij}, \sigma^{2}). \tag{12}$$

Here, $f_{ij}$ represents the penetrance value, which is tabulated for each possible pair of genotypes $(i, j)$ in every simulated model and can be found in [21]. The 70 penetrance models cover 14 combinations of two minor allele frequencies (MAFs) and seven heritability values. Specifically, we considered MAFs of 0.2 and 0.4 and heritability values of 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated data was set such that the average $y$ of the high-risk group should be larger than or equal to the overall average, and the average $y$ of the low-risk group should be less than the overall average. In Figure 4, the 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples, and both have a variance of 1.0.

[Figure 4: a 3 × 3 grid of phenotype distribution plots, one per genotype pair. Genotype pair (SNP1, SNP2), penetrance value, and number of samples for each panel: (0, 0), 0.486, 171; (0, 1), 0.960, 78; (0, 2), 0.538, 11; (1, 0), 0.947, 80; (1, 1), 0.004, 30; (1, 2), 0.811, 6; (2, 0), 0.640, 16; (2, 1), 0.606, 8; (2, 2), 0.909, 0.]

Figure 4: Demonstration of the simulation scheme. Phenotype distributions are plotted for the genotypes of two interacting SNPs, denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. On the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines mark the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group.

For gamma distributions, phenotype values follow the rule below:

$$y \mid (\mathrm{SNP1} = i,\ \mathrm{SNP2} = j) \sim \Gamma(k, \theta). \tag{13}$$

The shape and scale parameters, $k$ and $\theta$, were determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^{2}$. Penetrance models were classified by 7 heritability values, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, 3 × 70 × 9 = 1890 different conditions were set up, with 100 simulated data files generated for each condition.

[Figure 5: three panels, (a), (b), and (c), each plotting the hit ratio (vertical axis) against heritability (horizontal axis) for m-spacing, QMDR, and GMDR.]

Figure 5: Comparison of the hit ratios, or the detection probabilities, among the proposed m-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values were simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot.
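The generation scheme of this subsection can be sketched as below, under several assumptions that are not taken from the text: genotypes are drawn in Hardy-Weinberg proportions, a single variance is shared by all genotype cells (the paper varies it separately for the high- and low-risk groups), the grouping constraint on group means is not enforced, and the mixed distribution is omitted. The penetrance table holds the values read from the panel labels of Figure 4, and the variance argument is treated as a variance throughout, following the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^{2}$. All function names are hypothetical.

```python
import numpy as np

# Penetrance values as read from the panel labels of Figure 4 (MAF 0.2 model);
# rows index SNP1 = 0, 1, 2 and columns index SNP2 = 0, 1, 2.
PENETRANCE = np.array([
    [0.486, 0.960, 0.538],
    [0.947, 0.004, 0.811],
    [0.640, 0.606, 0.909],
])


def simulate_genotypes(n_samples, n_snps, maf, rng):
    """Minor-allele counts (0, 1, 2), assuming Hardy-Weinberg proportions."""
    return rng.binomial(2, maf, size=(n_samples, n_snps))


def gamma_params(mean, variance):
    """Solve mean = k * theta and variance = k * theta**2 for (k, theta)."""
    theta = variance / mean
    k = mean / theta
    return k, theta


def simulate_phenotype(snp1, snp2, penetrance, variance, dist, rng):
    """Draw one phenotype value per sample from the distribution of its genotype cell."""
    mean = penetrance[snp1, snp2]
    if dist == "normal":
        return rng.normal(mean, np.sqrt(variance))
    if dist == "gamma":
        k, theta = gamma_params(mean, variance)
        return rng.gamma(k, theta)
    raise ValueError(f"unknown distribution: {dist}")


rng = np.random.default_rng(2015)
genotypes = simulate_genotypes(n_samples=400, n_snps=20, maf=0.2, rng=rng)
# Columns 0 and 1 play the role of the causal pair (an arbitrary choice).
y_normal = simulate_phenotype(genotypes[:, 0], genotypes[:, 1], PENETRANCE, 1.0, "normal", rng)
y_gamma = simulate_phenotype(genotypes[:, 0], genotypes[:, 1], PENETRANCE, 1.0, "gamma", rng)
```

Solving the two gamma relationships gives $\theta = \sigma / f_{ij}$ and $k = f_{ij}^{2} / \sigma$, which is what `gamma_params` computes. For small penetrance values the shape parameter becomes very small, so the corresponding gamma draws are strongly right-skewed.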

3.3. Comparison of the Detection Probability and Type I Error. The “hit ratio,” or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used. All of them had a single causal pair to identify. In addition to our proposed m-spacing method, QMDR and GMDR were used for comparison. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. The seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of m-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, m-spacing performs best.
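A hit ratio of this kind can be computed as sketched below. The sketch assumes that a "hit" means the causal pair receives the highest score among all SNP pairs in a dataset, which is a plausible reading rather than a definition taken from the text. The scoring function `between_group_ss` is a hypothetical stand-in for IGS or any other association measure.

```python
from itertools import combinations

import numpy as np


def between_group_ss(cell, phenotype):
    """Hypothetical score: between-cell sum of squares of the phenotype."""
    overall = phenotype.mean()
    return sum(
        (cell == c).sum() * (phenotype[cell == c].mean() - overall) ** 2
        for c in np.unique(cell)
    )


def top_pair(genotypes, phenotype, score_fn):
    """Return the SNP pair (a, b), a < b, with the highest score."""
    best_pair, best_score = None, -np.inf
    for a, b in combinations(range(genotypes.shape[1]), 2):
        cell = 3 * genotypes[:, a] + genotypes[:, b]  # 9 joint genotype cells
        score = score_fn(cell, phenotype)
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair


def hit_ratio(datasets, causal_pair, score_fn):
    """Fraction of datasets whose top-scoring pair is the causal pair."""
    hits = sum(top_pair(g, y, score_fn) == tuple(causal_pair) for g, y in datasets)
    return hits / len(datasets)
```

Here `datasets` is a list of `(genotypes, phenotype)` tuples, one per simulated file, so each point in a plot like Figure 5 would correspond to one call of `hit_ratio` over the files generated for that heritability.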

To estimate the type I error rate the null datasets weregenerated under the same scheme as used for the detectionpower analysis except that there was no causal pair intendedNow there are 20 SNPs that none of the pairs are expectedto have an association Permutation 119875 values for a particularpair were obtained by permuting each dataset 1000 timesWe took the significance level 120572 as 005 to get the ratio ofthe permutation 119875 values smaller than or equal to 120572 Wereport this ratio as the type I error rate in Table 1 whoseaccuracy to one decimal place when expressed in percent wasensured by the number of the permutation Table 1 presentsthe type I error rate for each combination of three traitdistributions two MAFs and seven heritability values alongwith the overall estimates Throughout these conditionsthe type I error rates are gathered tightly around 5 withmaximum and minimum of 54 and 43 respectivelyMoreover there exists no sign of the dependence on thetrait shape heritability and MAF Therefore our proposedmethod preserved the type I error rates on these condi-tions

34 Application to Real Data A full-scale real dataset fromthe Korean Association Resource (KARE) project [20] wasanalyzed to investigate the effectiveness of the 119898-spacingmethod Among the available phenotypes ldquoheightrdquo waschosenwith a sample size of 8842 from the population-basedcohortThe total number of SNPs was 327872 spanning over22 chromosomesThe ldquoheightrdquo phenotype showed to be closeto a normal distribution such that the119898-spacingmethodmaynot take advantage of the shape of the phenotype distributionas discussed in the previous subsection Table 2 lists theSNPs selected by the 119898-spacing method (IGS) that had thestrongest main effects Out of 10 selected SNPs rs2079795and rs6440003 coincide with two previous reports [26 27]although twomore matched SNPs rs11989122 and rs1344672could be found as results of our analysis using the same tool

Table 1 Type I error estimation with the significance level 120572 of 005

Type I error rate () Normal Gamma Mixed

MAF 02 50 50 5104 51 50 51

Heritability

001 53 50 48002 49 54 52005 53 43 5301 50 53 5102 50 53 5103 48 49 4804 51 47 53

Overall 50 50 51

as in [26] but using the newly imputed dataset 119875 valueswere estimated by permutation of the phenotype values tomake null distributions Permutations were iterated 100000and 10000 times for the main effect and the interactionrespectively A clear distinction between rs11989122 and theother selected SNPs can be seen in the IGS values In Table 3the 2nd order gene-gene interaction result is given The topselected pair (rs6499786 rs1788421) was found to have thestrongest association with ldquoheightrdquo but the distinction wasnot so obvious compared to the case of the main effect

4 Conclusion

In this paper we present a modified 119898-spacing method forgenome-wide association studies with a quantitative traitThe robustness of this method makes it useful for a widerange of sample sizes while the original 119898-spacing methodyields a reliable result only for datasets with a large samplesize Extensive simulation was performed to produce thedatasets with different shapes of phenotype distributionswhile varying the penetrance functions and adjusting theheritability as well Causal pair detection probability wasunaffected the most by the compared methods based on thedistribution shape and heritability while GMDR and QMDRshowed more dependency The proposed 119898-spacing methodis proven to outperform the others regardless of the shape ofthe trait distribution and also the range of lower heritabilityIn the higher heritability region the performance of theproposedmethod is comparable to that of GMDR or QMDRwhichever shows better performance in that region Thiswould lead to versatile applicability of our nonparametricmethod for quantitative traits with various characteristicsWe applied this method to successfully identify the maineffect and gene-gene interactions for the phenotype ldquoheightrdquowith the full set of KARE samples Although several of themoverlapped with a previous report new interactions werealso found Because ldquoheightrdquo is presumed to be a trait with anormal distribution having a higher heritability our methodmay be said to have performed successfullywith no advantageover other methods More extensive study is needed forquantitative traits having various characteristics to furtherdemonstrate the expected robustness of our modified 119898-spacing method

BioMed Research International 9

Table 2 Application of the119898-spacing method to a full set of KARE samples with the phenotype ldquoheightrdquo main effect

Main effectrs ID Chromosome IGS 119875 value Previous reportrs11989122 8 113892 1 times 10

minus5 lowast589 times 10

minus6

rs7316119 12 87531 1 times 10minus5 mdash

rs936634 18 86125 2 times 10minus5 mdash

rs7632381 3 78235 1 times 10minus5 mdash

rs2079795 17 76542 1 times 10minus5 292 times 10

minus6

Ref [26]rs1344672 3 76177 1 times 10

minus5 lowast521 times 10

minus7

rs2523865 6 76044 4 times 10minus5 mdash

rs3790199 20 75362 2 times 10minus5 mdash

rs6440003 3 75231 1 times 10minus5 387 times 10

minus7

Ref [27]rs17628655 19 75117 6 times 10

minus5 mdashlowastIdentified using the same method as [26] but with imputed data which is the same one we analyzed

Table 3 Application of the119898-spacing method to a full set of KARE samples with the phenotype ldquoheightrdquo 2nd order interaction

2nd order interactionrs ID Chromosome rs ID Chromosome IGS 119875 valuers6499786 16 rs1788421 21 46197 1 times 10

minus4

rs2529232 7 rs1788421 21 43869 1 times 10minus4

rs2241704 19 rs1788421 21 43855 1 times 10minus4

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research was supported by the Basic Science ResearchProgram through the National Research Foundationof Korea (NRF) funded by the Ministry of EducationScience and Technology (NRF-2013R1A1A2062848 NRF-2012R1A3A2026438)

References

[1] O Zuk EHechter S R Sunyaev andE S Lander ldquoThemysteryof missing heritability genetic interactions create phantomheritabilityrdquo Proceedings of the National Academy of Sciences ofthe United States of America vol 109 no 4 pp 1193ndash1198 2012

[2] M D Ritchie L W Hahn N Roodi et al ldquoMultifactor-dimensionality reduction reveals high-order interactionsamong estrogen-metabolism genes in sporadic breast cancerrdquoAmerican Journal of Human Genetics vol 69 no 1 pp 138ndash1472001

[3] X-Y Lou G-B Chen L Yan et al ldquoA generalized combi-natorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine depen-dencerdquo American Journal of Human Genetics vol 80 no 6 pp1125ndash1137 2007

[4] Y Chung S Y Lee R C Elston and T Park ldquoOdds ratiobased multifactor-dimensionality reduction method for detect-ing gene-gene interactionsrdquo Bioinformatics vol 23 no 1 pp 71ndash76 2007

[5] S Yeoun Lee Y Chung R C Elston Y Kim and T ParkldquoLog-linear model-based multifactor dimensionality reductionmethod to detect gene-gene interactionsrdquo Bioinformatics vol23 no 19 pp 2589ndash2595 2007

[6] M L Calle V Urrea G Vellalta N Malats and K V SteenldquoImproving strategies for detecting genetic patterns of diseasesusceptibility in association studiesrdquo Statistics in Medicine vol27 no 30 pp 6532ndash6546 2008

[7] W S Bush T L Edwards SM Dudek B AMcKinney andMD Ritchie ldquoAlternative contingency tablemeasures improve thepower and detection of multifactor dimensionality reductionrdquoBMC Bioinformatics vol 9 article 238 2008

[8] K Kim M-S Kwon S Oh and T Park ldquoIdentification ofmultiple gene-gene interactions for ordinal phenotypesrdquo BMCMedical Genomics vol 6 supplement 2 article S9 2013

[9] G Kang W Yue J Zhang Y Cui Y Zuo and D Zhang ldquoAnentropy-based approach for testing genetic epistasis underlyingcomplex diseasesrdquo Journal ofTheoretical Biology vol 250 no 2pp 362ndash374 2008

[10] C Dong X Chu Y Wang et al ldquoExploration of gene-geneinteraction effects using entropy-based methodsrdquo EuropeanJournal of Human Genetics vol 16 no 2 pp 229ndash235 2008

[11] C Wu S Li and Y Cui ldquoGenetic association studies aninformation content perspectiverdquoCurrent Genomics vol 13 no7 pp 566ndash573 2012

10 BioMed Research International

[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.

[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.

[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.

[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.

[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[18] N. N. Schraudolph, “Gradient-based manipulation of nonparametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.

[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.

[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.

[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.

[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.


6 BioMed Research International

[Figure 4: nine phenotype density plots, one per genotype combination (SNP1, SNP2). Penetrance and number of samples per panel: (0, 0): 0.486, 171; (0, 1): 0.960, 78; (0, 2): 0.538, 11; (1, 0): 0.947, 80; (1, 1): 0.004, 30; (1, 2): 0.811, 6; (2, 0): 0.640, 16; (2, 1): 0.606, 8; (2, 2): 0.909, 0.]

Figure 4: Demonstration of the simulation scheme. Phenotype distributions are plotted for the genotypes formed by two interacting SNPs, as denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. At the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines mark the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group.

different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated event was set such that the averaged $y$ of the high-risk group should be larger than or equal to the overall average, while the averaged $y$ of the low-risk group should be less than the overall average. In Figure 4, 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples and both have a variance of 1.0. For gamma distributions, phenotype values follow the rule below:

$$y \mid (\mathrm{SNP}_1 = i, \mathrm{SNP}_2 = j) \sim \Gamma(k, \theta). \tag{13}$$

The shape and scale parameters $k$ and $\theta$ were determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^{2}$. Penetrance models were classified by 7 heritability values, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, 3 × 70 × 9 = 1890 different conditions were set up, with 100 simulated data files generated for each condition.

[Figure 5: three panels, (a), (b), and (c), each plotting the hit ratio (vertical axis) against heritability (horizontal axis) for $m$-spacing, QMDR, and GMDR.]

Figure 5: Comparison of the hit ratios, or the detection probabilities, among the proposed $m$-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values are simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot.
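The relationships used for the gamma-distributed phenotypes, $f_{ij} = k\theta$ and $\sigma = k\theta^{2}$, can be inverted directly to obtain the shape and scale. The following is a minimal sketch of that inversion, assuming, as the text states, that $f_{ij}$ plays the role of the mean and $\sigma$ the role of the variance. The function names and the example mean of 2.0 are illustrative and are not taken from the simulation code used in the study.

```python
import random


def gamma_shape_scale(mean, variance):
    """Solve mean = k * theta and variance = k * theta**2 for (k, theta)."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    theta = variance / mean      # scale parameter
    k = mean * mean / variance   # shape parameter
    return k, theta


def draw_gamma_phenotypes(mean, variance, n, rng):
    """Draw n phenotype values from Gamma(k, theta) with the requested moments."""
    k, theta = gamma_shape_scale(mean, variance)
    # random.gammavariate(alpha, beta) takes alpha as the shape and beta as the scale.
    return [rng.gammavariate(k, theta) for _ in range(n)]


# Hypothetical cell mean of 2.0 combined with the three variance settings from the text.
rng = random.Random(0)
for variance in (0.8, 1.0, 1.2):
    k, theta = gamma_shape_scale(2.0, variance)
    sample = draw_gamma_phenotypes(2.0, variance, 5, rng)
    print(f"variance={variance}: k={k:.3f}, theta={theta:.3f}, draws={len(sample)}")

# Bookkeeping for the simulation design described in the text.
n_conditions = 3 * 70 * 9             # distributions x penetrance models x variance pairs
files_per_plot_point = 10 * 9 * 100   # models per heritability x variance pairs x replicates
print(n_conditions, files_per_plot_point)  # 1890 9000
```

The last two lines only restate the counts given in the text: 1890 conditions overall, and 9000 simulated files behind each point of Figure 5.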

3.3. Comparison of the Detection Probability and Type I Error. The “hit ratio,” or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used; all of them had a single causal pair to identify. In addition to our proposed $m$-spacing method, QMDR and GMDR were used to compare the results. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. Seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of $m$-spacing was in between those of QMDR and GMDR for higher values of heritability. However, for heritability less than 0.2, $m$-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of $m$-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch of the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially the dichotomization of the observed values of the quantitative phenotypes. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed $m$-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric property is again confirmed in Figure 5(c), which shows that $m$-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, $m$-spacing was the most robust, performing consistently within the range of conditions for the simulation.
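Because every simulated file contains a single causal pair, the hit ratio can be read as the fraction of files in which a method identifies that pair. A minimal sketch of this bookkeeping follows. It assumes that a "hit" means the top-ranked SNP pair equals the causal pair; the function name and the example pairs are hypothetical.

```python
def hit_ratio(top_pairs, causal_pair):
    """Fraction of datasets whose top-ranked SNP pair equals the causal pair.

    Pairs are compared as unordered sets, so (3, 7) and (7, 3) count as the same pair.
    """
    if not top_pairs:
        raise ValueError("at least one dataset is required")
    target = frozenset(causal_pair)
    hits = sum(1 for pair in top_pairs if frozenset(pair) == target)
    return hits / len(top_pairs)


# Hypothetical top-ranked pairs reported by one method for five simulated files.
reported = [(0, 1), (1, 0), (0, 2), (0, 1), (5, 9)]
print(hit_ratio(reported, causal_pair=(0, 1)))  # 3 of 5 files -> 0.6
```

Applied per method and per heritability value, a function of this kind would yield one point on each curve in a plot such as Figure 5.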

To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was intended. Each null dataset thus contains 20 SNPs, none of whose pairs is expected to have an association. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of the permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
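The procedure above (permute, compute a $P$ value, then count how often it falls at or below $\alpha$ on null data) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the score `group_mean_spread` is a simple stand-in for the IGS, the $P$ value is taken as the fraction of permutations scoring at least as high as the observed data, and the dataset sizes and permutation count are scaled down so that the example runs quickly.

```python
import random
from statistics import mean


def group_mean_spread(genotypes, phenotypes):
    """Stand-in association score: range of the per-genotype phenotype means.

    This is NOT the IGS of the paper; it only gives the permutation loop something to score.
    """
    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)
    means = [mean(values) for values in groups.values()]
    return max(means) - min(means)


def permutation_p_value(genotypes, phenotypes, score, n_perm, rng):
    """Share of phenotype permutations scoring at least as high as the observed data."""
    observed = score(genotypes, phenotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(genotypes, shuffled) >= observed:
            exceed += 1
    return exceed / n_perm


def type_one_error_rate(p_values, alpha=0.05):
    """Fraction of null datasets whose permutation P value is <= alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


rng = random.Random(1)
p_values = []
for _ in range(40):  # 40 small null datasets, scaled down for illustration
    genotypes = [rng.choice((0, 1, 2)) for _ in range(60)]
    phenotypes = [rng.gauss(0.0, 1.0) for _ in range(60)]  # no genotype effect
    # 200 permutations here, instead of the 1000 used in the text
    p_values.append(permutation_p_value(genotypes, phenotypes, group_mean_spread, 200, rng))
print(type_one_error_rate(p_values))
```

With this definition, a $P$ value is always a multiple of 1/`n_perm`, so the number of permutations sets the resolution of each individual $P$ value.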

Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.

Type I error rate (%)      Normal   Gamma   Mixed
MAF            0.2         5.0      5.0     5.1
               0.4         5.1      5.0     5.1
Heritability   0.01        5.3      5.0     4.8
               0.02        4.9      5.4     5.2
               0.05        5.3      4.3     5.3
               0.1         5.0      5.3     5.1
               0.2         5.0      5.3     5.1
               0.3         4.8      4.9     4.8
               0.4         5.1      4.7     5.3
Overall                    5.0      5.0     5.1

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the $m$-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327872, spanning 22 chromosomes. The “height” phenotype was close to a normal distribution, so the $m$-spacing method may not take advantage of the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the $m$-spacing method (IGS) that had the strongest main effects. Out of 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], although two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool as in [26] but using the newly imputed dataset. $P$ values were estimated by permutation of the phenotype values to make null distributions. Permutations were iterated 100000 and 10000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.
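Two quantities help put the real-data analysis in perspective: the size of the pairwise search space and the smallest $P$ value that a permutation test can resolve. The sketch below assumes an exhaustive scan over all unordered SNP pairs and a $P$ value computed as a fraction of permutations. Neither detail is spelled out in this section, so both are illustrative assumptions.

```python
from math import comb

n_snps = 327872                  # SNPs in the KARE dataset, from the text
n_pairs = comb(n_snps, 2)        # unordered SNP pairs in an exhaustive second-order scan
print(f"{n_pairs:,} SNP pairs")  # 53,749,860,256 SNP pairs

# With n permutations, a fraction-based P value cannot be resolved below 1/n.
for label, n_perm in (("main effect", 100_000), ("interaction", 10_000)):
    print(f"{label}: resolution {1 / n_perm:.0e}")
```

Under these assumptions, the permutation counts quoted above give floors of 1 × 10^-5 for the main effect and 1 × 10^-4 for the interaction, which is the scale of the smallest $P$ values one would expect such an analysis to report.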

4. Conclusion

In this paper, we present a modified $m$-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, while the original $m$-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. The proposed $m$-spacing method was shown to outperform the others in the range of lower heritability, regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This would lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution having a higher heritability, our method may be said to have performed successfully, though with no advantage over other methods. More extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified $m$-spacing method.

Table 2: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

rs ID        Chromosome   IGS       P value      Previous report
rs11989122   8            11.3892   1 × 10^-5    *5.89 × 10^-6
rs7316119    12           8.7531    1 × 10^-5    -
rs936634     18           8.6125    2 × 10^-5    -
rs7632381    3            7.8235    1 × 10^-5    -
rs2079795    17           7.6542    1 × 10^-5    2.92 × 10^-6, Ref [26]
rs1344672    3            7.6177    1 × 10^-5    *5.21 × 10^-7
rs2523865    6            7.6044    4 × 10^-5    -
rs3790199    20           7.5362    2 × 10^-5    -
rs6440003    3            7.5231    1 × 10^-5    3.87 × 10^-7, Ref [27]
rs17628655   19           7.5117    6 × 10^-5    -

*Identified using the same method as [26] but with imputed data, which is the same one we analyzed.

Table 3: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

rs ID        Chromosome   rs ID       Chromosome   IGS      P value
rs6499786    16           rs1788421   21           4.6197   1 × 10^-4
rs2529232    7            rs1788421   21           4.3869   1 × 10^-4
rs2241704    19           rs1788421   21           4.3855   1 × 10^-4

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).

References

[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.

[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.

[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.

[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.

[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.

[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.

[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.

[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.

[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.

[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.

[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.

[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.

[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.

[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.

[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.

[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[18] N. N. Schraudolph, “Gradient-based manipulation of nonparametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.

[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.

[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.

[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.

[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.



[17] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003

[18] N N Schraudolph ldquoGrandient-based manipulation of non-parametric entropy estimatesrdquo IEEE Transactions on NeuralNetworks vol 14 no 2 pp 1ndash10 2004

[19] K Torkkola ldquoFeature extraction by non-parametric mutualinformation maximizationrdquo Journal of Machine LearningResearch vol 3 no 7-8 pp 1415ndash1438 2003

[20] P Chanda L Sucheston S Liu A Zhang andM RamanathanldquoInformation-theoretic gene-gene and gene-environment inter-action analysis of quantitative traitsrdquo BMC Genomics vol 10article 1471 pp 509ndash530 2009

[21] D R Velez B C White A A Motsinger et al ldquoA balancedaccuracy function for epistasismodeling in imbalanced datasetsusing multifactor dimensionality reductionrdquo Genetic Epidemi-ology vol 31 no 4 pp 306ndash315 2007

[22] J Beirlant E J Dudewicz L Gyorfi and E C van der MeulenldquoNonparametric entropy estimation an overviewrdquo InternationalJournal of Mathematical and Statistical Sciences vol 6 no 1 pp17ndash39 1997

[23] A B Tsybakov and E C van der Meulen ldquoRoot-n consistentestimators of entropy for densities with unbound supportrdquoScandinavian Journal of Statistics vol 23 pp 75ndash83 1992

[24] F El Haje Hussein and Y Golubev ldquoOn entropy estimation by119898-spacing methodrdquo Jounal of Mathematical Sciences vol 163pp 290ndash309 2009

[25] E G Learned-Miller and J W Fisher III ldquoICA using spacingsestimates of entropyrdquo Journal ofMachine Learning Research vol4 pp 1271ndash1295 2004

[26] Y S Cho M J Go Y J Kim et al ldquoA large-scale genome-wideassociation study of Asian populations uncovers genetic factorsinfluencing eight quantitative traitsrdquoNatureGenetics vol 41 no5 pp 527ndash534 2009

[27] M N Weedon H Lango C M Lindgren et al ldquoGenome-wide association analysis identifies 20 loci that influence adultheightrdquo Nature Genetics vol 40 no 5 pp 575ndash583 2008

[28] H Singh NMisra V Hnizdo A Fedorowicz and E DemchukldquoNearest neighbor estimates of entropyrdquo American Journal ofMathematical and Management Sciences vol 23 no 3-4 pp301ndash321 2003



8 BioMed Research International

best. Note that the QMDR shows a higher detection probability than the GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the QMDR's performance drops rapidly as the heritability decreases, whereas the hit ratios of m-spacing and the GMDR stay better than that of the QMDR and are comparable to each other. Note that the performance ranks of the GMDR and the QMDR switch when the phenotype distribution changes. What the QMDR does is essentially a dichotomization of the observed values of the quantitative phenotype. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed m-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric property is again confirmed in Figure 5(c), which shows that m-spacing outperforms the QMDR and the GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, m-spacing was the most robust, performing consistently within the range of conditions for the simulation.

To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. Each null dataset therefore contains 20 SNPs, none of whose pairs is expected to show an association. Permutation P values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level α as 0.05 and computed the ratio of the permutation P values smaller than or equal to α. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
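The estimation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and a simple mean-difference statistic (`abs_mean_diff`) stands in for the information gain so that the example stays self-contained. The P value is taken here as the fraction of permuted statistics at least as large as the observed one, which is one common convention.

```python
import random


def permutation_p_value(phenotype, genotype, statistic, n_perm=1000, rng=None):
    """Fraction of phenotype permutations whose statistic is >= the observed one."""
    rng = rng or random.Random(0)          # fixed seed only for reproducibility
    observed = statistic(phenotype, genotype)
    shuffled = list(phenotype)             # permute a copy, keep genotypes fixed
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled, genotype) >= observed:
            hits += 1
    return hits / n_perm


def type_i_error_rate(p_values, alpha=0.05):
    """Ratio of null-dataset P values that are <= alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


def abs_mean_diff(phenotype, genotype):
    """Hypothetical stand-in statistic: |mean(carriers) - mean(non-carriers)|."""
    g0 = [y for y, g in zip(phenotype, genotype) if g == 0]
    g1 = [y for y, g in zip(phenotype, genotype) if g != 0]
    if not g0 or not g1:
        return 0.0                         # no contrast possible
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))
```

With 1000 permutations, each P value computed this way is a multiple of 0.001. To reproduce the layout of Table 1, one would call `permutation_p_value` once per null dataset and pass the collected values to `type_i_error_rate`.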

Table 1: Type I error estimation with the significance level α of 0.05.

| Type I error rate (%) | Value | Normal | Gamma | Mixed |
|---|---|---|---|---|
| MAF | 0.2 | 5.0 | 5.0 | 5.1 |
| MAF | 0.4 | 5.1 | 5.0 | 5.1 |
| Heritability | 0.01 | 5.3 | 5.0 | 4.8 |
| Heritability | 0.02 | 4.9 | 5.4 | 5.2 |
| Heritability | 0.05 | 5.3 | 4.3 | 5.3 |
| Heritability | 0.1 | 5.0 | 5.3 | 5.1 |
| Heritability | 0.2 | 5.0 | 5.3 | 5.1 |
| Heritability | 0.3 | 4.8 | 4.9 | 4.8 |
| Heritability | 0.4 | 5.1 | 4.7 | 5.3 |
| Overall | | 5.0 | 5.0 | 5.1 |

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the m-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327872, spanning 22 chromosomes. The “height” phenotype was shown to be close to a normal distribution, so the m-spacing method may not benefit from the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the m-spacing method (IGS) that had the strongest main effects. Out of 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27]. Two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool as in [26] but with the newly imputed dataset. P values were estimated by permuting the phenotype values to make null distributions. Permutations were iterated 100000 and 10000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.
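To make the quantity being ranked more concrete, the following sketch computes an entropy-based information gain of a quantitative phenotype given a genotype. It is a minimal illustration under stated assumptions: it uses the classical Vasicek-type m-spacing estimator with boundary clamping, not the modified estimator or the standardized IGS proposed in this paper, and the default m ≈ √n, the `eps` guard against tied values, and all function names are choices made here for illustration.

```python
import math
from collections import defaultdict


def m_spacing_entropy(values, m=None, eps=1e-12):
    """Vasicek-type m-spacing estimate of differential entropy (in nats)."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    if m is None:
        m = round(math.sqrt(n))            # heuristic choice, an assumption here
    m = max(1, min(m, (n - 1) // 2 or 1))  # keep the window inside the sample
    total = 0.0
    for i in range(n):
        lo = x[max(i - m, 0)]              # clamp indices at the sample boundaries
        hi = x[min(i + m, n - 1)]
        spacing = max(hi - lo, eps)        # avoid log(0) when values are tied
        total += math.log(n / (2 * m) * spacing)
    return total / n


def information_gain(phenotype, genotype, m=None):
    """H(Y) minus the genotype-weighted average of H(Y | G = g)."""
    groups = defaultdict(list)
    for y, g in zip(phenotype, genotype):
        groups[g].append(y)
    n = len(phenotype)
    conditional = sum(
        len(ys) / n * m_spacing_entropy(ys, m) for ys in groups.values()
    )
    return m_spacing_entropy(phenotype, m) - conditional
```

For a second-order interaction, the genotype of each sample can be passed as a tuple such as `(g1, g2)`, so that each genotype combination forms its own group. Every group needs at least two observations in this sketch. A permutation P value is then obtained by recomputing the gain on shuffled phenotype values.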

4. Conclusion

In this paper, we present a modified m-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, whereas the original m-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. In the simulations, the proposed m-spacing method outperformed the others in the range of lower heritability, regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This should lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution and a higher heritability, our method may be said to have performed successfully, though with no advantage over other methods. More extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified m-spacing method.


Table 2: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

| rs ID | Chromosome | IGS | P value | Previous report |
|---|---|---|---|---|
| rs11989122 | 8 | 113892 | 1 × 10⁻⁵ | *5.89 × 10⁻⁶ |
| rs7316119 | 12 | 87531 | 1 × 10⁻⁵ | - |
| rs936634 | 18 | 86125 | 2 × 10⁻⁵ | - |
| rs7632381 | 3 | 78235 | 1 × 10⁻⁵ | - |
| rs2079795 | 17 | 76542 | 1 × 10⁻⁵ | 2.92 × 10⁻⁶, Ref. [26] |
| rs1344672 | 3 | 76177 | 1 × 10⁻⁵ | *5.21 × 10⁻⁷ |
| rs2523865 | 6 | 76044 | 4 × 10⁻⁵ | - |
| rs3790199 | 20 | 75362 | 2 × 10⁻⁵ | - |
| rs6440003 | 3 | 75231 | 1 × 10⁻⁵ | 3.87 × 10⁻⁷, Ref. [27] |
| rs17628655 | 19 | 75117 | 6 × 10⁻⁵ | - |

*Identified using the same method as [26] but with imputed data, which is the same one we analyzed.

Table 3: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

| rs ID | Chromosome | rs ID | Chromosome | IGS | P value |
|---|---|---|---|---|---|
| rs6499786 | 16 | rs1788421 | 21 | 46197 | 1 × 10⁻⁴ |
| rs2529232 | 7 | rs1788421 | 21 | 43869 | 1 × 10⁻⁴ |
| rs2241704 | 19 | rs1788421 | 21 | 43855 | 1 × 10⁻⁴ |

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).

References

[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.

[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.

[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.

[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.

[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.

[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.

[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.

[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.

[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.

[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.

[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.

[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.

[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.

[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.

[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.

[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[18] N. N. Schraudolph, “Gradient-based manipulation of non-parametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.

[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.

[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.

[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.

[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbounded support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.



Page 10: Research Article Detecting Genetic Interactions for ...downloads.hindawi.com/journals/bmri/2015/523641.pdf · Research Article Detecting Genetic Interactions for Quantitative Traits

10 BioMed Research International

[12] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 pp 379ndash423 1948

[13] R M Gray Entropy and Information Theory Springer NewYork NY USA 2nd edition 2011

[14] J Yee M-S Kwon T Park and M Park ldquoA modified entropy-based approach for identifying gene-gene interactions in case-control studyrdquo PLoS ONE vol 8 no 7 Article ID e69321 2013

[15] M Li C Ye W Fu R C Elston and Q Lu ldquoDetecting geneticinteractions for quantitative traits with U-statisticsrdquo GeneticEpidemiology vol 35 no 6 pp 457ndash468 2011

[16] J Gui J H Moore S M Williams et al ldquoA simple and com-putationally efficient approach to multifactor dimensionalityreduction analysis of gene-gene interactions for quantitativetraitsrdquo PLoS ONE vol 8 no 6 Article ID e66545 2013

[17] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003

[18] N N Schraudolph ldquoGrandient-based manipulation of non-parametric entropy estimatesrdquo IEEE Transactions on NeuralNetworks vol 14 no 2 pp 1ndash10 2004

[19] K Torkkola ldquoFeature extraction by non-parametric mutualinformation maximizationrdquo Journal of Machine LearningResearch vol 3 no 7-8 pp 1415ndash1438 2003

[20] P Chanda L Sucheston S Liu A Zhang andM RamanathanldquoInformation-theoretic gene-gene and gene-environment inter-action analysis of quantitative traitsrdquo BMC Genomics vol 10article 1471 pp 509ndash530 2009

[21] D R Velez B C White A A Motsinger et al ldquoA balancedaccuracy function for epistasismodeling in imbalanced datasetsusing multifactor dimensionality reductionrdquo Genetic Epidemi-ology vol 31 no 4 pp 306ndash315 2007

[22] J Beirlant E J Dudewicz L Gyorfi and E C van der MeulenldquoNonparametric entropy estimation an overviewrdquo InternationalJournal of Mathematical and Statistical Sciences vol 6 no 1 pp17ndash39 1997

[23] A B Tsybakov and E C van der Meulen ldquoRoot-n consistentestimators of entropy for densities with unbound supportrdquoScandinavian Journal of Statistics vol 23 pp 75ndash83 1992

[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.

[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.

[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.

[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.

[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.
