Transethnic genetic correlation estimates from summary ... · 2/23/2016  · Consortium, Chun...

24
1 Transethnic genetic correlation estimates from summary statistics Brielin C. Brown, Asian Genetic Epidemiology Network-Type 2 Diabetes (AGEN-T2D) Consortium, Chun Jimmie Ye, Alkes L. Price, Noah Zaitlen Abstract The increasing number of genetic association studies conducted in multiple populations provides unprecedented opportunity to study how the genetic architecture of complex phenotypes varies between populations, a problem important for both medical and population genetics. Here we develop a method for estimating the transethnic genetic correlation: the correlation of causal variant effect sizes at SNPs common in populations. We take advantage of the entire spectrum of SNP associations and use only summary-level GWAS data. This avoids the computational costs and privacy concerns associated with genotype-level information while remaining scalable to hundreds of thousands of individuals and millions of SNPs. We apply our method to gene expression, rheumatoid arthritis, and type-two diabetes data and overwhelmingly find that the genetic correlation is significantly less than 1. Our method is implemented in a python package called popcorn. Introduction Many complex human phenotypes vary dramatically in their distributions between populations due to a combination of genetic and environmental differences. For example, northern Europeans are on average taller than southern Europeans 1 and African Americans have an increased rate of hypertension relative to European Americans 2 . The genetic contribution to population phenotypic differentiation is driven by differences in causal allele frequencies, effect sizes, and genetic architectures. Understanding the root causes of phenotypic differences worldwide has profound implications for biomedical and clinical practice in diverse populations, the transferability of epidemiological results, aiding multi- ethnic disease mapping 3,4 , assessing the contribution of non-additive and rare variant effects, and modeling the genetic architecture of complex traits. In this work we consider a central question in the global study of phenotype: do genetic variants have the same phenotypic effects in different populations? While the vast majority of GWAS have been conducted in European populations 5 , the growing number of non-European and multi-ethnic studies 4,6,7 provide an opportunity to study genetic effect distributions across populations. For example, one recent study used mixed-model based methods to show that the genome-wide genetic correlation of . CC-BY-NC-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted February 23, 2016. ; https://doi.org/10.1101/036657 doi: bioRxiv preprint

Transcript of Transethnic genetic correlation estimates from summary ... · 2/23/2016  · Consortium, Chun...

  • 1

    Transethnicgeneticcorrelationestimatesfromsummarystatistics

    BrielinC.Brown,AsianGeneticEpidemiologyNetwork-Type2Diabetes(AGEN-T2D)Consortium,ChunJimmieYe,AlkesL.Price,NoahZaitlen

    Abstract

    Theincreasingnumberofgeneticassociationstudiesconductedinmultiple

    populationsprovidesunprecedentedopportunitytostudyhowthegeneticarchitectureof

    complexphenotypesvariesbetweenpopulations,aproblemimportantforbothmedicaland

    populationgenetics.Herewedevelopamethodforestimatingthetransethnicgenetic

    correlation:thecorrelationofcausalvarianteffectsizesatSNPscommoninpopulations.We

    takeadvantageoftheentirespectrumofSNPassociationsanduseonlysummary-level

    GWASdata.Thisavoidsthecomputationalcostsandprivacyconcernsassociatedwith

    genotype-levelinformationwhileremainingscalabletohundredsofthousandsof

    individualsandmillionsofSNPs.Weapplyourmethodtogeneexpression,rheumatoid

    arthritis,andtype-twodiabetesdataandoverwhelminglyfindthatthegeneticcorrelationis

    significantlylessthan1.Ourmethodisimplementedinapythonpackagecalledpopcorn.

    Introduction

    Manycomplexhumanphenotypesvarydramaticallyintheirdistributionsbetween

    populationsduetoacombinationofgeneticandenvironmentaldifferences.Forexample,

    northernEuropeansareonaveragetallerthansouthernEuropeans1andAfricanAmericans

    haveanincreasedrateofhypertensionrelativetoEuropeanAmericans2.Thegenetic

    contributiontopopulationphenotypicdifferentiationisdrivenbydifferencesincausal

    allelefrequencies,effectsizes,andgeneticarchitectures.Understandingtherootcausesof

    phenotypicdifferencesworldwidehasprofoundimplicationsforbiomedicalandclinical

    practiceindiversepopulations,thetransferabilityofepidemiologicalresults,aidingmulti-

    ethnicdiseasemapping3,4,assessingthecontributionofnon-additiveandrarevariant

    effects,andmodelingthegeneticarchitectureofcomplextraits.Inthisworkweconsidera

    centralquestionintheglobalstudyofphenotype:dogeneticvariantshavethesame

    phenotypiceffectsindifferentpopulations?

    WhilethevastmajorityofGWAShavebeenconductedinEuropeanpopulations5,

    thegrowingnumberofnon-Europeanandmulti-ethnicstudies4,6,7provideanopportunity

    tostudygeneticeffectdistributionsacrosspopulations.Forexample,onerecentstudyused

    mixed-modelbasedmethodstoshowthatthegenome-widegeneticcorrelationof

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 2

    schizophreniabetweenEuropeanandAfricanAmericansisnonzero8.Whilepowerful,

    computationalcostsandprivacyconcernslimittheutilityofgenotype-basedmethods.In

    thiswork,wemaketwosignificantcontributionstostudiesoftransethnicgenetic

    correlation.First,weexpandthedefinitionofgeneticcorrelationtobetteraccountfora

    transethniccontext.Second,wedevelopanapproachtoestimatinggeneticcorrelation

    acrosspopulationsthatusesonlysummarylevelGWASdata.Similartootherrecent

    summarystatisticsbasedmethods9–20,ourapproachsupplementssummaryassociation

    datawithlinkagedisequilibrium(LD)informationfromexternalreferencepanels,avoids

    privacyconcerns,andisscalabletohundredsofthousandsofindividualsandmillionsof

    markers.UnliketraditionalapproachesthatfocusonthesimilarityofGWASresults21–25we

    utilizetheentirespectrumofGWASassociationswhileaccountingforLDinordertoavoid

    filteringcorrelatedSNPs.

    Inasinglepopulation,thegeneticcorrelationoftwophenotypesisdefinedasthe

    correlationcoefficientofSNPeffectsizes18,26.Inmultiplepopulations,differencesinallele

    frequencymotivatemultiplepossibledefinitionsofgeneticcorrelation.Becauseavariant

    mayhaveahighereffectsizebutlowerfrequencyinonepopulation,weconsiderboththe

    correlationofalleleeffectsizesaswellasthecorrelationofallelicimpact.Wedefinethe

    transethnicgeneticeffectcorrelation(ρge,previouslydefinedbyLeeetal26andimplemented

    inGCTA)asthecorrelationcoefficientoftheper-alleleSNPeffectsizes,andthetransethnic

    geneticimpactcorrelation(ρgi)asthecorrelationcoefficientofthepopulation-specificallele

    variancenormalizedSNPeffectsizes.

    Intuitively,thegeneticeffectcorrelationmeasurestheextenttowhichthesame

    varianthasthesamephenotypicchange,whilethegeneticimpactcorrelationgivesmore

    weighttocommonallelesthanrareonesseparatelyineachpopulation.Considerthecaseof

    aSNPthatisrareinpopulation1butcommoninpopulation2,andhasanidenticaleffect

    sizeinbothpopulations.Inthiscase,thecorrelationofeffectsizes(thegeneticeffect

    correlationρge)is1,butthisprovidesanincompletepictureoftherelationshipbetweenthe

    twopopulations,astheallelehasamuchbiggerimpactonthedistributionofthephenotype

    inpopulation2.Therefore,wedefinethegeneticimpactcorrelationρgiasthecorrelationof

    effectsizesafternormalizinggenotypestohavemean0andvariance1.Inourhypothetical

    caseρgi

  • 3

    twopopulationswillbesimilarandthereforeρgiwillbecloseto1.Whileotherdefinitionsof

    thegeneticcorrelationarepossible(seediscussion),thesequantitiescapturetwoimportant

    questionsaboutthestudyofdiseaseinmultiplepopulations:towhatextentdothesame

    mutationsinmultiplepopulationsdifferintheirphenotypiceffects?And,towhatextentare

    thesedifferencesmitigatedorexacerbatedbydifferencesinallelefrequency?

    Toestimategeneticcorrelation,wetakeaBayesianapproachwhereinweassume

    genotypesaredrawnseparatelyfromwithineachpopulationandeffectsizeshaveanormal

    prior(theinfinitesimalmodel27).Whileunlikelytorepresentreality,thismodelhasbeen

    usedsuccessfullyinpractice8,16,17,28,29.Theinfinitesimalassumptionyieldsamultivariate

    normaldistributionontheobservedteststatistics(Z-scores),whichisafunctionofthe

    heritabilityandgeneticcorrelation.RatherthanpruningSNPsinLD10,30,31,thisallowsusto

    explicitlymodeltheresultinginflationofZ-scores.Wethenmaximizeanapproximate

    weightedlikelihoodfunctiontofindtheheritabilityandgeneticcorrelation.Thismethodis

    implementedinapythonpackagecalledpopcorn.Thoughderivedforquantitative

    phenotypes,popcornextendseasilytobinaryphenotypesundertheliabilitythreshold

    model.Weshowviaextensivesimulationthatpopcornproducesunbiasedestimatesofthe

    geneticcorrelationandthepopulationspecificheritabilities,withastandarderrorthat

    decreasesasthenumberofSNPsandindividualsinthestudiesincreases.Furthermore,we

    showthatourapproachisrobusttoviolationsoftheinfinitesimalassumption.

    WeapplypopcorntoEuropeanandYorubangeneexpressiondata32aswellasGWAS

    summarystatisticsfromEuropeanandEastAsianrheumatoidarthritisandtype-two

    diabetescohorts,33,34.OuranalysisofgEUVADISshowsthatoursummarystatisticbased

    estimatorisconcordantwiththemixedmodelbasedestimator.Wefindthatthemean

    transethnicgeneticcorrelationacrossallgenesislow(ρge=0.320(0.009)),butincreases

    substantiallywhenthegeneishighlyheritableinbothpopulations(ρge=0.772(0.017)).in

    RAandT2D,wefindthegeneticeffectcorrelationtobe0.463(0.058)and0.621(0.088),

    respectively.

    Acrossallphenotypesconsidered,weoverwhelminglyfindthatthetransethnic

    geneticcorrelationissignificantlylessthanone.Thisobservationhighlightstheneedto

    studyphenotypesinmultiplepopulationsasitimpliesthat,uptotheeffectsofun-observed

    variants,effectsizesatcommonSNPstendtodifferbetweenpopulations.Thisindicates

    thatGWASresultsmaynottransferbetweenpopulations,andthereforediseaserisk

    predictioninnon-EuropeansbasedoncurrentGWASresultsmaybeproblematic,

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 4

    necessitatingamulti-populationapproachtogaininsightintointer-populationdifferences

    inthegeneticarchitectureofcomplextraits.

    Methods

    Ourmethodtakesasinputsummaryassociationstatisticsfromtwostudiesofa

    phenotypeintwodifferentpopulations,alongwithtwosetsofreferencegenotypeseach

    matchingoneofthepopulationsinthestudy.Ourmethodhastwosteps:first,weestimate

    thediagonalelementsoftheLDmatrixproductsΣ12,Σ22andΣ1Σ2,then,usingthese

    estimates,wefindthemaximumlikelihoodvaluesandestimatestandarderrorsofthe

    parametersofinterest:h12,h22andρgeorρgi.Thedetailsfollow.

    ConsiderGWASofaphenotypeconductedintwodifferentpopulations.Assumewe

    haveN1individualsgenotypedonMSNPsinstudyoneandN2individualsgenotypedonthe

    sameSNPsstudytwo.LetX1,X2bethematricesofmean-centeredgenotypesinstudyone

    andstudytwo,respectively,andletY1,Y2betheirnormalizedphenotypes.Letf1,f2be

    vectorsoftheallelefrequenciesoftheMSNPscommontobothpopulations.Assuming

    Hardy-Weinbergequilibriumwithineachpopulationseparately,theallelevariancesareσ12

    =2f1(1–f1),σ22=2f2(1–f2).Letβ1,β2bethe(unobserved)per-alleleeffectsizesforeachSNP

    instudiesoneandtwo,respectively.Theheritabilityinstudyoneisthenh12=Σiσ1i2β1i2

    (andlikewiseforstudytwo).Theobjectiveofthisworkistoestimatetransethnicgenetic

    correlationfromsummarystatisticsofcommonvariants (andlikewise

    forstudytwo)andestimatesofpopulationLDmatrices(Σ1andΣ2)fromexternalreference

    panels.Definethegeneticeffectcorrelationρge=Cor(β1,β2)andthegeneticimpact

    correlationρgi=Cor(σ1β1,σ2β2).

    Weassumethegenotypesaredrawnrandomlyfromeachpopulationandthat

    phenotypesaregeneratedbythelinearmodelY1=X1β1+ε1(likewiseforphenotypetwo).

    Wheneffectsizesβareassumedinverselyproportionaltoallelefrequency,asiscommonly

    done16,29,weshow(Appendix)thatunderthelinearinfinitesimalgeneticarchitecture,the

    jointdistributionoftheZ-scoresfromeachstudyisasymptoticallymultivariatenormalwith

    mean andvariance:

    (1)

    Z1 =(X1/�1)>Y1p

    N1

    �!0

    Var(Z) =

    "⌃1 +

    N1+1M h

    21⌃

    21 ⇢gi

    ph21h

    22

    pN1N2M ⌃1⌃2

    ⇢gip

    h21h22

    pN1N2M ⌃2⌃1 ⌃2 +

    N2+1M h

    22⌃

    22

    #

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 5

    However,wheneffectsizesareassumedindependentofallelefrequencyweshow:

    (2)

    Giventheseequationsforvariance,thequantitiesρgiorρgeandh12,h22canbeestimatedby

    maximizingthemultivariatenormallikelihood,

    ,whereCiseitheroftheabovecovariance

    matrices(1)and(2).BecauseΣ1andΣ2areestimatedfromfiniteexternalreferencepanels,

    maximumlikelihoodestimationoftheabovemultivariatenormalleadstoover-fitting.We

    employtwooptimizationstoavoidthisproblem.First,wemaximizeanapproximate

    weightedlikelihoodthatusesonlythediagonalelementsofeachblockofVar(Z).This

    allowsustoaccountfortheLD-inducedinflationoftestsstatistics,butdiscardscovariance

    informationbetweenpairsofZ-scores,andthereforeleadstoover-countingZ-scoresof

    SNPsinhighLD.Tocompensateforthis,wedownweightZ-scoresofSNPsinproportion

    theirLD.Second,ratherthancomputethefullproductsΣ12,Σ22andΣ1Σ2overallMSNPsin

    thegenome,wechooseawindowsizeWandapproximatetheproductby

    .TheseoptimizationsaresimilartothoseemployedbyLDscore

    regression16.Thefulldetailsofthederivationandoptimizationareprovidedinthe

    appendix.

    Results

    SimulatedGenotypesandSimulatedPhenotypes

    Wesimulated50,000European-like(EUR)and50,000EastAsian-like(EAS)

    individualsat248,953SNPsfromchromosomes1-3withallelefrequencyabove1%inboth

    EuropeanandEastAsianHapMap3populationswithHapGen235.HapGen2implementsa

    haplotyperecombinationwithmutationmodelthatresultsinexcesslocalrelatedness

    amongthesimulatedindividuals.Toaccountforthislocalstructure,weusedPlink236to

    filterindividualswithgeneticrelatednessabove0.05,resultingin4499EUR-likeindivudals

    and4837EAS-likeindividuals.Fromthesesimulatedindividuals,500perpopulationwere

    chosenuniformlyatrandomtoserveasanexternalreferencepanelforestimatingΣ1andΣ2.

    l(⇢g{i,e}, h21, h

    22|Z,⌃,�) / �ln(|C|)� Z>C�1Z

    (⌃a⌃b)ii =w=i+WX

    w=i�Wraiwrbiw

    Var(Z) =

    2

    4⌃1 +

    N1+1k�21k1

    h21⌃1�21⌃1 ⇢ge

    ph21h

    22

    pN1N2p

    k�21k1k�22k1⌃1

    p�21�

    22⌃2

    ⇢gep

    h21h22

    pN1N2p

    k�21k1k�22k1⌃2

    p�22�

    21⌃1 ⌃2 +

    N2+1k�22k1

    h22⌃2�22⌃2

    3

    5

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 6

    Ineachsimulationeffectsizesweredrawnfroma“spikeandslab”model,where

    withprobabilitypand with

    probability1-p.ρgiwasanalyticallycomputedfromthesimulatedeffectsizesandallele

    frequenciesinthesimulatedreferencegenotypes.Quantitativephenotypesweregenerated

    underalinearmodelwithi.i.d.noiseandnormalizedtohavemean0andvariance1,while

    binaryphenotypesweregeneratedunderaliabilitythresholdmodelwhereindividualsare

    labeledcaseswhentheirliabilityexceedsathreshold ,withKthe

    populationdiseaseprevalence37.

    Wevariedh12,h22,ρge,andρgi,aswellasthenumberofindividualsineachstudy(N1,

    N2),thenumberofSNPs(M),thepopulationprevalenceK,andproportionofcausalvariants

    (p)inthesimulatedGWASandgeneratedsummarystatisticsforeachstudy.Theresults

    showninFigure1andFigureS1demonstratethattheestimatorsarenearlyunbiasedasthe

    geneticcorrelationandheritabilitiesvary.Furthermore,byvaryingtheproportionofcausal

    variantspweshowthatourestimatorisrobusttoviolationsoftheinfinitesimalassumption

    (FigureS2).InfigureS3,weshowthatthestandarderroroftheestimatordecreasesasthe

    numberofSNPsandindividualsinthestudyincreases.Finally,weshowinTableS1thatour

    estimatesoftheheritabilityofliabilityincasecontrolstudiesarenearlyunbiased.

    Simulationswithnonstandarddiseasemodels

    Ourapproach,aswellasgenotype-basedmethodssuchasGCTA,makes

    assumptionsaboutthegeneticarchitectureofcomplextraits.Previousworkhasshownthat

    violationsoftheseassumptionscanleadtobiasinheritabilityestimation38,thereforewe

    soughttoquantifytheextentthatthisbiasmayeffectourestimates.Wesimulated

    phenotypesundersixdifferentdiseasemodels.Independent:effectsizeindependentof

    allelefrequency.Inverse:effectsizeinverselyproportionaltoallelefrequency.Rare:only

    SNPswithallelefrequencyunder10%affectthetrait.Common:onlySNPswithallele

    frequencybetween40%and50%affectthetrait.Difference:effectsizeproportionalto

    differenceinallelefrequency.Adversarial:differencemodelwithsignofbetasettoincrease

    thephenotypeinthepopulationwherethealleleismostcommon.Additionalgenetic

    architecturesarepossible,includingoneswhereeffectsizesarenotadirectfunctionof

    MAF39.

    �1i,�2i ⇠ N✓0,

    h21 ⇢ge

    ph21h

    22

    ⇢geph21h

    22 h

    22

    �◆

    �1i,�2i = (0, 0)

    ⌧ = ��1(1�K)

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 7

    Wesimulatedphenotypesusinggenotypeswithallelefrequencyabove1%or5%

    andcomparedthetrueandestimatedgeneticimpactandeffectcorrelationamongall

    models(Table1).WefindthatwhenonlySNPswithfrequencyabove5%inboth

    populationsareused,thedifferenceinρgeandρgiisminimalexceptinthemostadversarial

    cases.Evenintheadversarialmodel,thetruedifferenceisonly7%.Thoughunlikelyto

    representreality,thefournonstandarddiseasemodelsresultinsubstantialbiasinour

    estimators.WhenSNPswithallelefrequencyabove1%inbothpopulationsareincluded,

    thedifferencesaremorepronounced.Thisisbecausethenormalizingconstant1/σrapidly

    increasesastheSNPbecomesmorerare.Indeed,asSNPsbecomemorerarehavingan

    accuratediseasemodelbecomesincreasinglyimportant.Thereforeweproceedwitha5%

    MAFcutoffinouranalysisofrealdata,andusethenotationhc2torefertotheheritabilityof

    SNPswithallelefrequencyabove5%inbothpopulations(thecommon-SNPheritability).

    Note,however,thatoneoftheadvantagesofmaximumlikelihoodestimationingeneralis

    thatthelikelihoodcanbereformulatedtomimicthediseasemodelofinterest.

    ValidationofPopcornusinggeneexpressioninGEUVADIS

    Wecomparedthecommon-SNPheritability(hc2)andgeneticcorrelationestimates

    ofpopcorntoGCTAinthegEUVADISdatasetforwhichrawgenotypesarepubliclyavailable.

    gEUVADISconsistsofRNA-seqdatafor464lymphoblastoidcellline(LCL)samplesfrom

    fivepopulationsinthe1000genomesproject.Ofthese,375areofEuropeanancestry(CEU,

    FIN,GBR,TSI)and89areofAfricanancestry(YRI).RawRNA-sequencingreadsobtained

    fromtheEuropeanNucleotideArchivewerealignedtothetranscriptomeusingUCSC

    annotationsmatchinghg19coordinates.RSEMwasusedtoestimatetheabundancesofeach

    annotatedisoformandtotalgeneabundanceiscalculatedasthesumofallisoform

    abundancesnormalizedtoonemilliontotalcountsortranscriptspermillion(TPM).For

    eQTLmapping,CaucasianandYorubansampleswereanalyzedseparately.Foreach

    population,TPMsweremedian-normalizedtoaccountfordifferencesinsequencingdepth

    ineachsampleandstandardizedtomean0andvariance1.Ofthe29763totalgenes,9350

    withTPM>2inbothpopulationswerechosenforthisanalysis.

    Foreachgene,weconductedacis-eQTLassociationstudyatallSNPswithin1

    megabaseofthegenebodywithallelefrequencyabove5%inbothpopulationsusing30

    principalcomponentsascovariates.WefoundthatGCTAandpopcornagreeontheglobal

    distributionofheritability(FigureS3),andthatGCTA’sestimatesofgeneticcorrelation

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 8

    haveasimilardistributiontopopcorn’sgeneticeffectandgeneticimpactcorrelation

    estimates(Figure2).WhilethenumberofSNPsandindividualsincludedineachgene

    analysisaretoosmalltoobtainaccuratepointestimatesofthegeneticcorrelationonaper-

    genebasis(N=464,M=4279.5),thelargenumberofgenesenablesaccurateestimationof

    theglobalmeanheritabilityandgeneticcorrelation.

    Common-SNPheritabilityandgeneticcorrelationofgeneexpressioningEUVADIS

    Wefindthattheaveragecis-hc2oftheexpressionofthegenesweanalyzedwas

    0.093(0.002)inEURand0.088(0.002)inYRI.Ourestimatesarehigherthanpreviously

    reportedaveragecis-heritabilityestimatesof0.055inwholebloodand0.057inadipose40,

    whichcouldariseforseveralreasons.First,weremove68%ofthetranscriptsthatare

    lowlyexpressedineithertheYRIorEURdata.Second,estimatesfromRNA-seqanalysisof

    celllinesmightnotbedirectlycomparabletomicroarraydatafromtissue.

    Theaveragegeneticeffectcorrelationwas0.320(0.010),whiletheaveragegenetic

    impactcorrelationwas0.313(0.010).Notably,thegeneticcorrelationincreasesasthecis-

    hc2ofexpressioninbothpopulationsincreases(Figure3).Inparticular,whenthecis-hc2of

    thegeneisatleast0.2inbothpopulationsthegeneticeffectcorrelationwas0.772(0.017)

    whilethegeneticimpactcorrelationwas0.753(0.018).

    Inordertoverifythattherewerenosmall-samplesizeorconditioningbiasesinour

    analysis,weanalyzedthegeneticcorrelationofsimulatedphenotypesoverthegEUVADIS

    genotypes.Wesampledpairsofheritabilitiesfromtheestimatedexpressionheritability

    distributionandsimulatedpairsofphenotypestohavethegivenheritabilityandagenetic

    effectcorrelationof0.0overrandomlychosen4000baseregionsfromchromosome1ofthe

    gEUVADISgenotypes.Withoutconditioning,theaverageestimatedgeneticeffect

    correlationwas-0.002(0.003),indicatingthattheestimatorremainedunbiased.

    Furthermore,theaverageestimatedgeneticeffectcorrelationwasnotsignificantlydifferent

    from0.0conditionalontheestimatesofheritabilitybeingaboveacertainthreshold(Figure

    S4).

    Wefindthatwhiletheaveragegeneticcorrelationislow,thegeneticcorrelation

    increaseswiththecis-hc2ofthegene,indicatingthatascis-geneticregulationofgene

    expressionincreasesitdoessosimilarlyinbothYRIandEURpopulations.Thismayhelp

    interprettherecentobservationthatwhiletheglobalgeneticcorrelationofgeneexpression

    acrosstissuesislow40,cis-eQTL’stendtoreplicateacrosstissues41.Asthepresenceofacis-

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 9

    eQTLindicatessubstantialcis-geneticregulation,ananalysisofeQTLreplicationacross

    tissuesisimplicitlyconditioningontheheritabilityofgeneexpressionbeinghighand

    thereforemayindicatemuchhighergeneticcorrelationthanaverage.

    SummarystatisticsofRAandT2D

    Finally,wesoughttoexaminethetransethnicρgiandρgeinRAandT2Dcohortsfor

    whichrawgenotypesarenotavailable.WeobtainedsummarystatisticsofGWASfor

    rheumatoidarthritisandtype-2diabetesconductedinEuropeanandEastAsian

    populations.Weusedgenotypesfrom504EastAsianand503Europeanindividuals

    sequencedaspartofthe1000genomesprojectaspopulation-specificexternalreference

    panelsforourEASandEURsummarystatistics,respectively.WeremovedtheMHCregion

    (chromosome6,25–35Mb)fromtheRAsummarystatistics.Weestimatedthecommon-

    SNPheritabilityandgeneticcorrelationusing2,539,629SNPsgenotypedorimputedinboth

    RAstudiesand1,054,079SNPsgenotypedorimputedinbothT2Dstudieswithallele

    frequencyabove5%in1000genomesEURandEASpopulations.Thehc2andgenetic

    correlationestimatesarepresentedinTable2.OurRAhc2estimatesof0.177(0.015)and

    0.221(0.026)forEURandEAS,respectively,arelowerthanapreviouslyreportedmixed-

    modelbasedheritabilityestimatesof0.32(0.037)inEuropeans42.Similarly,ourT2Dhc2

    estimatesof0.242(0.013)and0.105(0.021)forEURandEAS,respectively,arelowerthan

    apreviouslyreportedmixed-modelbasedestimateof0.51(0.065)inEuropeans42.We

    stressthatthisdiscrepancyislikelyduetothedifferencebetweencommon-SNPheritability

    hc2andtotalnarrow-senseheritabilityhg2.Furthermore,estimatesoftheheritabilityofT2D

    fromfamilystudiescanvarysignificantly43,44.

    WefindthegeneticeffectcorrelationinRAandT2Dtobe0.463(0.058)and0.621

    (0.088),respectively,whilethegeneticimpactcorrelationisnotsignificantlydifferentat

    0.455(0.056)and0.606(0.083).Thetransethnicgeneticimpactandeffectcorrelationfor

    bothphenotypesaresignificantlydifferentfromboth1and0(Table2),showingthatwhile

    thereiscleargeneticoverlapbetweenthephenotypes,theperalleleeffectsizesdiffer

    significantlybetweenthetwopopulations.

    Discussion

    Wehavedevelopedthetransethnicgeneticeffectandgeneticimpactcorrelationand

    providedanestimatorforthesequantitiesbasedonlyonsummary-levelGWASinformation

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 10

    andsuitablereferencepanels.Wehaveappliedourestimatortoseveralphenotypes:

    rheumatoidarthritis,type-2diabetesandgeneexpression.WhilethegEUVADISdataset

    lacksthepowerrequiredtomakeinferencesaboutthegeneticcorrelationofsingleorsmall

    subsetsofgenes,wecanmakeinferencesabouttheglobalstructureofgeneticcorrelationof

    geneexpression.Wefindthattheglobalmeangeneticcorrelationislow,butthatit

    increasessubstantiallywhentheheritabilityishighinbothpopulations.Inallphenotypes

    analyzed,thegeneticcorrelationissignificantlydifferentfromboth0and1.Ourresults

    showthatglobaldifferencesinSNPeffectsizeofcomplextraitscanbelarge.Incontrast,

    effectsizesofgeneexpressionappeartobemoreconservedwherethereisstronggenetic

    regulation.

    Itisnotpossibletodrawconclusionsaboutpolygenicselectionfromestimatesof

    transethnicgeneticcorrelation.Theeffectsizesmaybeidentical(!!" = 1)whilepolygenicselectionactstochangeonlytheallelefrequencies.Similarly,theeffectsizesmaybe

    different(!!" < 1)withoutselection.DifferencesineffectsizesatcommonSNPscanresultfrommanyphenomena.Weexpectun-typedandun-imputedvariantsdifferentiallylinked

    toobservedSNPstocontributesignificantly,alongwithrareorpopulation-specificvariants

    differentiallylinkedtoobservedSNPs.Ifagene-geneorgene-environmentinteraction

    exists,butonlymarginaleffectsaretested,theobservedmarginaleffectsmaybedifferentin

    eachpopulationduetoallelefrequencydifferenceseveniftheinteractioneffectisthesame

    inbothpopulations,andthiswillresultindecreasedgeneticcorrelation.Whilewithin-locus

    (dominance)interactionsmayalsoplayarole45,themagnitudeofthiseffecthasbeen

    debated46.Weemphasizethatwecannotdifferentiatebetweentheseeffectsonthebasisof

    thisanalysisalone,andfurtherresearchisrequiredtoestablishthemagnitudeofthe

    contributionofeachoftheseeffectstointer-populationeffectsizedifferences.

    Estimatesofthetransethnicgeneticcorrelationareimportantforseveralreasons.

    Theymayhelpinformbestpracticesfortransethnicmeta-analysis,potentiallyoffering

    improvementsovercurrentmethodsthatuseFsttoclusterpopulationsforanalysis4.

    Further,thetransethnicgeneticcorrelationconstrainsthelimitofoutofsamplephenotype

    predictivepower.IfthemaximumwithinpopulationcorrelationofpredictedphenotypeP

    totruephenotypeYis ,thenthemaximumoutofpopulationcorrelationis

    (Appendix).OurobservationthatforRA,T2D,andgeneexpressionthe

    geneticcorrelationislowshowsthatoutofpopulationphenotypicpredictivepowerisquite

    ⇢maxY P

    =q

    h21

    ⇢maxY P

    = ⇢ge

    qh21

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 11

    low.Similarly,itimpliesthatdiseaseriskassessmentinnon-Europeansbasedoncurrent

    GWASresultsmaybeproblematic,necessitatingincreasedstudyofdiseaseinmany

    populationstogaininsightintodifferencesingeneticarchitectureandimproverisk

    assessment.

    Whilethegeneticcorrelationofmultiplephenotypesinonepopulationhasa

    relativelystraightforwarddefinition,extendingthistomultiplepopulationsmotivates

    multiplepossibleextensions.Inthisworkwehaveprovidedestimatorsforthecorrelation

    ofgeneticeffectandgeneticimpactbutotherquantitiesrelatedtothesharedgeneticsof

    complextraitsbetweenpopulationsincludethecorrelationofvarianceexplained

    andproportionofsharedcausalvariantsbetweenthetwo

    populations.Interestingly,whileourgoalwastoconstructanestimatorthatdeterminedthe

    extentofgeneticsharingindependentofallelefrequency,weobservethatthecorrelationof

    geneticeffectandgeneticimpactaresimilar.Furthermore,oursimulationsshowthatunder

    arandomeffectsmodelutilizingonlySNPswithallelefrequencyabove5%inboth

    populationsthetruegeneticeffectandgeneticimpactcorrelationaresimilar.Weconclude

    thatatvariantscommoninbothpopulations,differencesineffectsizeandnotallele

    frequencyaredrivingthetransethnicphenotypicdifferencesinthesetraits.

    Ourapproachtoestimatinggeneticcorrelationhastwomajoradvantagesover

    mixed-modelbasedapproaches.First,utilizingsummarystatisticsallowsapplicationofthe

    methodwithoutdata-sharingandprivacyconcernsthatcomewithrawgenotypes.Second,

    ourapproachislinearinthenumberofSNPsavoidingthecomputationalbottleneck

    requiredtoestimatethegeneticrelationshipmatrix.Conceptually,ourapproachisvery

    similartothattakenbyLDscoreregression.Indeed,thediagonaloftheLDmatrixproduct

    inonepopulationareexactlytheLD-scores( ).Onecouldignoreourlikelihood-

    basedapproachanddefinecross-populationscores inordertoexploitthe

    linearrelationship (asimilarapproachcanbetakenfor

    thegeneticeffectcorrelation).SinceLD-scoreregressionhasbeensuccessfullyusedto

    computethegeneticcorrelationoftwophenotypesinasinglepopulation,thisderivation

    canbeviewedasanextensionofLD-scoreregressiontoonephenotypeintwodifferent

    populations.Themaindifferenceinourapproachischoosingmaximumlikelihoodrather

    ⇢ge = Cor(�21�

    21 ,�

    22�

    22)

    ⌃21ii = li

    ci =X

    m

    r1imr2im

    E[Z1iZ2i] =pN1N2M

    ⇢gi

    qh21h

    22ci

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 12

    thanregressioninordertofitthemodel.Acomparisonofourmethodtotheldscsoftware

    showstheyperformsimilarlyasheritabilityestimators(FigureS5).

    Ofcourse,ourmethodisnotwithoutdrawbacks.First,itrequiresalargesample

    sizeandlargenumberofSNPstoachievestandarderrorslowenoughtogenerateaccurate

    estimates.UntilrecentlylargesampleGWAShavebeenrareinnon-Europeanpopulations,

    thoughtheyarebecomingmorecommon.Similarly,referencepanelqualitymaysufferin

    non-Europeanpopulationsandthismayimpactdownstreamanalysis47.Second,itislimited

    toanalyzingrelativelycommonSNPs,bothbecausehavinganaccuratediseasemodelis

    importantfortheanalysisofrarevariantsandbecauseeffectsizeandcorrelationcoefficient

    estimateshaveahighstandarderroratrareSNPs16.Third,ouranalysisiscurrentlylimited

    toSNPsthatarepresentinbothpopulations.Indeeditiscurrentlyunclearhowbestto

    handlepopulation-specificvariantsinthisframework.Fourth,ourestimatorofρis

    boundedbetween-1and1.Thismayinducebiaswhenthetruevalueisclosetothe

    boundaryandthesamplesizeissmall.Finally,admixedpopulationsinduceverylong-range

    LDthatisnotaccountedforinourapproachandwearethereforelimitedtoun-admixed

    populations16.

    Ouranalysisleavesopenseveralavenuesforfuturework.Weapproximately

    maximizethelikelihoodofan!×!multivariatenormaldistributionviaamethodthatusesonlythediagonalelementsofeachblock,discardingcovarianceinformationbetweenZ-

    scores.Abetterapproximationmaylowerthestandarderroroftheestimator,facilitating

    ananalysisofthegeneticcorrelationoffunctionalcategories,pathwaysandgeneticregions.

    Wewouldalsoliketoextendouranalysistoincludepopulationspecificvariantsaswellas

    variantsatfrequenciesbetween1-5%orlowerthan1%.Oursimulationsindicatethat

    havinganaccuratediseasemodelisimportantfordeterminingthedifferencebetweenthe

    geneticeffectandgeneticimpactcorrelationwhenrarevariantsareincluded.Maximum

    likelihoodapproachesarewellsuitedtodifferentgeneticarchitectures.Forexample,one

    couldestimateboththeglobalrelationshipbetweenallelefrequencyandeffectsizeandthe

    globalrelationshipbetweenper-SNPFSTandgeneticcorrelationbyincorporating

    parametersαandϒintothepriordistributionoftheeffectsizes,

    .Weexpectthatincorporating

    theseparameterswillimproveestimatesofheritabilityandgeneticcorrelationwhile

    revealingimportantbiologicalinsights.

    �1i,�2i ⇠ N✓0,

    h21�

    ↵1i ⇢ge

    ph21h

    22F

    �STi

    ⇢gep

    h21h22F

    �STi h

    22�

    ↵1i

    �◆

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • Appendix

    Consider two GWAS of a phenotype conducted in di↵erent populations populations. Assume we have N1

    individuals genotyped or imputed to M SNPs in study one and N2 individuals genotyped or imputed to

    M SNPs in study two. Let X1, X2 and Y1, Y2 be the matrices of mean-centered genotypes and phenotypes

    of the individuals in study one and two, respectively, with f1, f2 the allele frequencies of the M SNPs

    common to both populations. Assuming Hardy-Weinberg equilibrium, the allele variances are �21 = 2f1(1�

    f1),�22 = 2f2(1� f2). Let �1,�2 be the (unobserved) per-allele e↵ect sizes for each SNP in studies one and

    two, respectively. Define the genetic impact correlation ⇢gi

    = Cor(p�21�1,

    p�22�2) and the genetic e↵ect

    correlation ⇢ge

    = Cor(�1,�2). We present a maximum likelihood framework for estimating the heritability of

    the phenotype in study 1 and it’s standard error, the heritability of the phenotype in study 2 and it’s standard

    error, and the genetic e↵ect and impact correlation of the phenotype between the studies and it’s standard

    error given only the summary statistics Z1, Z2 and reference genotypes G1, G2 representing the populations

    in the studies. We assume that genotypes are drawn randomly from populations with expected correlation

    matrices ⌃1 (and similarly for study two), and that every SNP is causal with a normally distributed e↵ects

    size (though this assumption is not necessary in practice, see Figure S1).

    Genetic impact correlation

    Let X 01 =X1p�

    21

    (and similarly for study 2) be normalized genotype matrices. We consider the standard linear

    model for generation of the phenotypes, where

    Y1 = X01�1 + ✏1

    Y2 = X02�2 + ✏2

    For convenience of notation let h2ix

    = ⇢gi

    ph21h

    22. We assume the SNP e↵ects follow the infinitesimal model,

    where every SNP has an e↵ect size drawn from the normal distribution, and that the residuals are independent

    for each individual and normally distributed:

    0

    @ �1�2

    1

    A ⇠ N

    0

    @

    2

    4 0

    0

    3

    5 , 1M

    2

    4 h21IM h2

    ix

    IM

    h2ix

    IM

    h22IM

    3

    5

    1

    A (1)

    0

    @ ✏1✏2

    1

    A ⇠ N

    0

    @

    2

    4 0

    0

    3

    5 ,

    2

    4 (1� h21)IM 0

    0 (1� h22)IM

    3

    5

    1

    A (2)

    1

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • where h21, h22 are the heritability of the disease in study one and two, respectively, and ⇢gi is the genetic

    impact correlation.

    Using the above model, we compute the distribution of the observed Z scores as a function of the reference

    panel correlations and the model parameters (h21, h22, ⇢gi). Given a distribution for Z and an observation of

    Z we can then choose parameters which give the highest probability of observing Z. First, we compute the

    distribution of Z. It is well known that the Z-scores of a linear regression are normally distributed given

    � when the sample size is large enough. Since P(Z) / P(Z|�)P(�) and the product of normal distributions

    is normal, we only need to compute the unconditional mean and variance of Z to know its distribution.

    Specifically, let Z = [Z>1 , Z>2 ]

    >, then it’s mean is

    E[Z] = E

    2

    4X

    0>1 Y1pN1

    X

    0>2 Y2pN2

    3

    5 =

    2

    41pN1

    (E[X 0>1 X 01]E[�1] + E[X 0>1 ]E[✏1])1pN2

    (E[X 0>2 X 02]E[�2] + E[X 0>2 ]E[✏2])

    3

    5 = 0

    The within-population variance is:

    Cov[Z1i, Z1j ] = E[Z1iZ1j ] = EX,�,✏[E[Z1iZ1j |X,�, ✏]]

    =1

    N1EX,�,✏

    [X 0>1i (X01�1 + ✏1)(X

    01�1 + ✏1)

    >X 01j ]

    =1

    N1EX,�

    [X 0>1i X01�1�

    >1 X

    0>1 X2j ] +

    1

    N1EX,✏

    [X 0>1i ✏1✏>1 X

    01j ]

    =h21

    MN1EX

    [X 0>1i X01X

    0>1 X

    01j ] +

    1� h21N1

    EX

    [X 0>1i X01j ]

    =h21

    MN1(N1Mr1ij +N1

    MX

    m=1

    r1imr1jm +N21

    MX

    m=1

    r1imr1jm) +1� h21N1

    r1ij

    = r1ij +N1 + 1

    Mh21⌃1(i)⌃

    (j)1

    where rpij

    = ⌃pij

    is the correlation coe�cient of SNP i and j in population p. Similarly, the between-

    population variance is:

    Cov[Z1i, Z2j ] =1p

    N1N2EX,�

    [X 0>1i X01�1�

    >2 X

    0>2 X

    02j ] +

    1pN1N2

    EX,✏

    [X 0>1i ✏1✏>2 X

    02j ]

    =h2ix

    MpN1N2

    EX

    [X 0>1i X01X

    0>2 X

    02j ]

    =h2ix

    MpN1N2

    (N1N2

    MX

    m=1

    r1imr2jm)

    =

    pN1N2M

    h2ix

    ⌃1(i)⌃(j)2

    where ⌃(i) denotes the i’th row of ⌃ and ⌃(j) denotes the j’th column. The covariance of the Z-scores is

    2

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • thus

    C = Var(Z) =

    2

    4 ⌃1 +N1+1M

    h21⌃21 h

    2gx

    pN1N2

    M

    ⌃1⌃2

    h2gx

    pN1N2

    M

    ⌃2⌃1 ⌃2 +N2+1M

    h22⌃22

    3

    5 (3)

    and Z ⇠ N (0, C).

    Genetic e↵ect correlation

    Let hex

    = ⇢ge

    ph21h

    22. We modify the procedure above to use mean-centered instead of normalized genotype

    matrices and model the distribution of the e↵ect sizes as

    0

    @ �1�2

    1

    A ⇠ N

    0

    B@

    2

    4 0

    0

    3

    5 ,

    2

    64h

    21

    k�21k1IM

    h

    expk�21k1k�22k1

    IM

    h

    expk�21k1k�22k1

    IM

    h

    22

    k�22k1IM

    3

    75

    1

    CA (4)

    Notice that a linear model with e↵ects sizes acting on un-normalized genotypes is the same as a linear model

    with e↵ect sizes acting on normalized genotypes under the substitution �1,2 !q

    �21,2�1,2. Therefore the

    covariance of Z-scores on the per allele scale can be immediately inferred from the prior derivation

    C = Var(Z) =

    2

    64⌃1 +

    N1+1k�21k1

    h21⌃1�21⌃1 h

    2gx

    pN1N2p

    k�21k1k�22k1⌃1

    p�21�

    22⌃2

    h2gx

    pN1N2p

    k�21k1k�22k1⌃2

    p�22�

    21⌃1 ⌃2 +

    N2+1k�22k1

    h22⌃2�22⌃2

    3

    75

    Approximate maximum likelihood estimation

    Let C =

    2

    4 C11 C12C21 C22

    3

    5 be either of the above covariance matrices written in block form. We approximately

    optimize the above likelihood as follows: first we find h21 and h22 by maximizing the likelihood corresponding

    to C11 and C22, then we find ⇢gi or ⇢ge by maximizing the likelihood corresponding to C12:

    l(h21|Z1,⌃,�) ⇡ �MX

    i=1

    w11i

    ✓ln(C11ii) +

    Z21iC11ii

    l(h22|Z2,⌃,�) ⇡ �MX

    i=1

    w22i

    ✓ln(C22ii) +

    Z22iC22ii

    l(⇢g{i,e}|Z, ĥ21, ĥ22,⌃,�) ⇡ �

    MX

    i=1

    w12i

    ✓ln(C12ii) +

    Z1iZ2iC12ii

    Because we are discarding between-SNP covariance information (Cov(Z1i, Z1j)), highly correlated SNPs will

    be overcounted in our approximate likelihood. As a simple example, notice that two SNPs in perfect LD

    will each contribute identical terms to the approximate likelihood, and therefore should be downweighted

    by a factor of 1/2. The extent to which SNP i is over-counted is exactly the i’th entry in it’s corresponding

    3

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • LD-matrix product. Therefore we let wgijki

    = 1/ (⌃j

    ⌃k

    )ii

    and wgejki

    = 1/⇣⌃

    j

    q�2j

    �2k

    ⌃k

    ii

    to reduce the

    variance in our estimates of the parameters h21, h22, ⇢gi and ⇢ge.

    Furthermore, rather than compute the full products ⌃21, ⌃22 and ⌃1⌃2 over all M SNPs in the genome, we

    choose a window size W and approximate the product by (⌃a

    ⌃b

    )ii

    =P

    w=i+Ww=i�W raiwrbiw. Though maximum

    likelihood estimation admits a straightforward estimate of the standard error via the fisher information, we

    found these estimates to be inaccurate in practice. Instead, we use block jackknife with block size equal to

    min(100, M200 ) SNPs to ensure that blocks are large enough to remove residual correlations.

    Out of population prediction of phenotypic values

    Consider using the results of a GWAS with perfect power in population 2 to predict the phenotypic values of

    a set of individuals from population 1. This defines the upper limit of the correlation of true and predicted

    phenotypic values. Let the true values of the e↵ects sizes in population 2 be �2. Let the true phenotypes

    in population 1 be Y = X1�1 + ✏1 while the predicted phenotypes are P = X1�2. We are interested in the

    correlaiton of the predicted and true phenotypes ⇢MAXY P

    = Cor(Y, P ). Notice that, given X, the true and

    predicted phenotype of each individual is an a�ne transformation of a multivariate normal random variable

    (�) 2

    4 YiPi

    3

    5 =

    2

    4 X(i) 0M0M

    X(i)

    3

    5

    2

    4 �1�2

    3

    5+

    2

    4 ✏i0

    3

    5

    and therefore (Yi

    , Pi

    ) for individual i is multivariate normal with expected covariance matrix

    EX

    [Cov(Yi

    , Pi

    )] = EX

    2

    4 X(i) 0M0M

    X(i)

    3

    5

    2

    641

    k�21k1IM

    h

    expk�21k1k�22k1

    IM

    h

    expk�21k1k�22k1

    IM

    h

    22

    k�22k1IM

    3

    75

    2

    4 X(i) 0M0M

    X(i)

    3

    5>

    = EX

    2

    64

    Pm

    X

    2im

    k�21k1h

    ex

    Pm

    X

    2imp

    k�21k1k�22k1h

    ex

    Pm

    X

    2imp

    k�21k1k�22k1h

    22

    Pm

    X

    2im

    k�22k1

    3

    75

    =

    2

    4 1 hexq

    k�21k1k�22k1

    hex

    qk�21k1k�22k1

    h22k�21k1k�22k1

    3

    5

    and therefore the expected correlation E[Cor(Yi

    , Pi

    )] is hexph

    22

    qk�21k1k�22k1

    k�22k1k�21k1

    = ⇢ge

    ph21. The expected popula-

    tion correlation tends to the sample correlation as the number of samples increases, therefore

    ⇢MAXY P

    = Cor(Y, P ) ! ⇢ge

    qh21 (5)

    as N ! 1

    4

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 13

    DescriptionofSupplementaldata

    Supplementaldataincludesixfiguresandonetable.

    Acknowledgements

    TheauthorswouldliketoacknowledgeLiorPachterandHilaryFinucaneforinsightful

    discussionabouttheproblem.BCBissupportedbytheNSFGRFP.ALPissupportedbyNIH

    grantR01HG006399.NZissupportedbyNIHgrantK25HL121295.

    WebResources

    Popcornisavailableathttps://github.com/brielin/popcorn

    References

    1. Robinson, M.R., Hemani, G., Medina-Gomez, C., Mezzavilla, M., Esko, T., Shakhbazov, K., Powell, J.E., Vinkhuyzen, A., Berndt, S.I., Gustafsson, S., et al. (2015). Population genetic differentiation of height and body mass index across Europe. Nat. Genet. 47, 1357–1362. 2. Burt, V.L., Whelton, P., Roccella, E.J., Brown, C., Cutler, J.A., Higgins, M., Horan, M.J., and Labarthe, D. (1995). Prevalence of hypertension in the US adult population. Results from the Third National Health and Nutrition Examination Survey, 1988-1991. Hypertension 25, 305–313. 3. Coram, M.A., Candille, S.I., Duan, Q., Chan, K.H.K., Li, Y., Kooperberg, C., Reiner, A.P., and Tang, H. (2015). Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach. Am. J. Hum. Genet. 96, 740–752. 4. Morris, A.P. (2011). Transethnic Meta-Analysis of Genomewide Association Studies. Genet. Epidemiol. 35, 809–822. 5. Bustamante, C.D., De La Vega, F.M., and Burchard, E.G. (2011). Genomics for the world. Nature 475, 163–165. 6. (2011). A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nat Genet 43, 339–344. 7. Okada, Y., Wu, D., Trynka, G., Raj, T., Terao, C., Ikari, K., Kochi, Y., Ohmura, K., Suzuki, A., Yoshida, S., et al. (2014). Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381. 8. de Candia, T.R., Lee, S.H., Yang, J., Browning, B.L., Gejman, P.V., Levinson, D.F., Mowry, B.J., Hewitt, J.K., Goddard, M.E., O’Donovan, M.C., et al. (2013). Additive Genetic Variation in Schizophrenia Risk Is Shared by Populations of African and European Descent. Am. J. Hum. Genet. 93, 463–470. 9. Lee, S., Teslovich, T.M., Boehnke, M., and Lin, X. (2013). General Framework for Meta-analysis of Rare Variants in Sequencing Association Studies. Am. J. Hum. Genet. 93, 42–53. 10. Palla, L., and Dudbridge, F. (2015). A Fast Method that Uses Polygenic Scores to Estimate the Variance Explained by Genome-wide Marker Panels and the Proportion of Variants Affecting a Trait. Am. J. Hum. Genet. 97, 250–259. 11. Yang, J., Ferreira, T., Morris, A.P., Medland, S.E., Madden, P.A.F., Heath, A.C., Martin, N.G., Montgomery, G.W., Weedon, M.N., Loos, R.J., et al. (2012). Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369–375. 12. Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., Hirschhorn, J., Strachan, D.P., Patterson, N., and Price, A.L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics 30, 2906–2914.

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 14

    13. Hormozdiari, F., Kostem, E., Kang, E.Y., Pasaniuc, B., and Eskin, E. (2014). Identifying Causal Variants at Loci with Multiple Signals of Association. Genetics 198, 497–508. 14. Hormozdiari, F., Kichaev, G., Yang, W.-Y., Pasaniuc, B., and Eskin, E. (2015). Identification of causal genes for complex traits. Bioinformatics 31, i206–i213. 15. Kichaev, G., Yang, W.-Y., Lindstrom, S., Hormozdiari, F., Eskin, E., Price, A.L., Kraft, P., and Pasaniuc, B. (2014). Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-Mapping Studies. PLoS Genet. 10, e1004722. 16. Bulik-Sullivan, B.K., Loh, P.-R., Finucane, H.K., Ripke, S., Yang, J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson, N., Daly, M.J., Price, A.L., and Neale, B.M. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet 47, 291–295. 17. Finucane, H.K., Bulik-Sullivan, B., Gusev, A., Trynka, G., Reshef, Y., Loh, P.-R., Anttila, V., Xu, H., Zang, C., Farh, K., et al. (2015). Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet 47, 1228–1235. 18. Bulik-Sullivan, B., Finucane, H.K., Anttila, V., Gusev, A., Day, F.R., Loh, P.-R., ReproGen Consortium, Psychiatric Genomics Consortium, Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium 3, Duncan, L., et al. (2015). An atlas of genetic correlations across human diseases and traits. Nat Genet 47, 1236–1241. 19. Park, D.S., Brown, B., Eng, C., Huntsman, S., Hu, D., Torgerson, D.G., Burchard, E.G., and Zaitlen, N. (2015). Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses. Bioinformatics 31, i181–i189. 20. Xu, Z., Duan, Q., Yan, S., Chen, W., Li, M., Lange, E., and Li, Y. (2015). DISSCO: direct imputation of summary statistics allowing covariates. Bioinformatics 31, 2434–2442. 21. International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia that overlaps with bipolar disorder. Nature 460, 748–752. 22. Zuo, L., Zhang, C.K., Wang, F., Li, C.-S.R., Zhao, H., Lu, L., Zhang, X.-Y., Lu, L., Zhang, H., Zhang, F., et al. (2011). A Novel, Functional and Replicable Risk Gene Region for Alcohol Dependence Identified by Genome-Wide Association Study. PLoS ONE 6, e26726. 23. Fesinmeyer, M., North, K., Ritchie, M., Lim, U., Franceschini, N., Wilkens, L., Gross, M., Bůžková, P., Glenn, K., Quibrera, P., et al. (2013). Genetic risk factors for body mass index and obesity in an ethnically diverse population: results from the Population Architecture using Genomics and Epidemiology (PAGE) Study. Obes. Silver Spring Md 21, 10.1002/oby.20268. 24. Chang, M., Ned, R.M., Hong, Y., Yesupriya, A., Yang, Q., Liu, T., Janssens, A.C.J.W., and Dowling, N.F. (2011). Racial/Ethnic Variation in the Association of Lipid-Related Genetic Variants With Blood Lipids in the US Adult Population. Circ. Cardiovasc. Genet. 4, 523–533. 25. Waters, K.M., Stram, D.O., Hassanein, M.T., Le Marchand, L., Wilkens, L.R., Maskarinec, G., Monroe, K.R., Kolonel, L.N., Altshuler, D., Henderson, B.E., et al. (2010). Consistent Association of Type 2 Diabetes Risk Variants Found in Europeans in Diverse Racial and Ethnic Groups. PLoS Genet. 6, e1001078. 26. Lee, S.H., Yang, J., Goddard, M.E., Visscher, P.M., and Wray, N.R. (2012). Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics 28, 2540–2542. 27. Falconer, D.S., and Mackay, T.F.C. (1996). Introduction to quantitative genetics (Essex, England: Longman). 28. Loh, P.-R., Tucker, G., Bulik-Sullivan, B.K., Vilhjálmsson, B.J., Finucane, H.K., Salem, R.M., Chasman, D.I., Ridker, P.M., Neale, B.M., Berger, B., et al. (2015). Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290. 29. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569. 30. So, H.-C., Li, M., and Sham, P.C. (2011). Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet. Epidemiol. n/a – n/a. 31. Vattikuti, S., Guo, J., and Chow, C.C. (2012). Heritability and Genetic Correlations Explained by Common SNPs for Metabolic Syndrome Traits. PLoS Genet. 8, e1002637.

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 15

    32. ’t Hoen, P.A.C., Friedländer, M.R., Almlöf, J., Sammeth, M., Pulyakhina, I., Anvar, S.Y., Laros, J.F.J., Buermans, H.P.J., Karlberg, O., Brännvall, M., et al. (2013). Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31, 1015–1022. 33. Morris, A.P., Voight, B.F., Teslovich, T.M., Ferreira, T., Segrè, A.V., Steinthorsdottir, V., Strawbridge, R.J., Khan, H., Grallert, H., Mahajan, A., et al. (2012). Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, 981–990. 34. Cho, Y.S., Chen, C.-H., Hu, C., Long, J., Hee Ong, R.T., Sim, X., Takeuchi, F., Wu, Y., Go, M.J., Yamauchi, T., et al. (2012). Meta-analysis of genome-wide association studies identifies eight new loci for type 2 diabetes in east Asians. Nat Genet 44, 67–72. 35. Su, Z., Marchini, J., and Donnelly, P. (2011). HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27, 2304–2305. 36. Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., and Lee, J.J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4,. 37. Lee, S.H., Wray, N.R., Goddard, M.E., and Visscher, P.M. (2011). Estimating Missing Heritability for Disease from Genome-wide Association Studies. Am. J. Hum. Genet. 88, 294–305. 38. Speed, D., Hemani, G., Johnson, M.R., and Balding, D.J. (2012). Improved Heritability Estimation from Genome-wide SNPs. Am. J. Hum. Genet. 91, 1011–1021. 39. Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A.A.E., Lee, S.H., Robinson, M.R., Perry, J.R.B., Nolte, I.M., van Vliet-Ostaptchouk, J.V., et al. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet. 47, 1114–1120. 40. Price, A.L., Helgason, A., Thorleifsson, G., McCarroll, S.A., Kong, A., and Stefansson, K. (2011). Single-Tissue and Cross-Tissue Heritability of Gene Expression Via Identity-by-Descent in Related or Unrelated Individuals. PLoS Genet. 7, e1001317. 41. Gaffney, D.J. (2013). Global Properties and Functional Complexity of Human Gene Regulatory Variation. PLoS Genet. 9, e1003501. 42. Stahl, E.A., Wegmann, D., Trynka, G., Gutierrez-Achury, J., Do, R., Voight, B.F., Kraft, P., Chen, R., Kallberg, H.J., Kurreeman, F.A.S., et al. (2012). Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat. Genet. 44, 483–489. 43. Marenberg, M.E., Risch, N., Berkman, L.F., Floderus, B., and de Faire, U. (1994). Genetic susceptibility to death from coronary heart disease in a study of twins. N. Engl. J. Med. 330, 1041–1046. 44. Nora, J.J., Lortscher, R.H., Spangler, R.D., Nora, A.H., and Kimberling, W.J. (1980). Genetic--epidemiologic study of early-onset ischemic heart disease. Circulation 61, 503–508. 45. Chen, X., Kuja-Halkola, R., Rahman, I., Arpegård, J., Viktorin, A., Karlsson, R., Hägg, S., Svensson, P., Pedersen, N.L., and Magnusson, P.K.E. (2015). Dominant Genetic Variation and Missing Heritability for Human Complex Traits: Insights from Twin versus Genome-wide Common SNP Models. Am. J. Hum. Genet. 97, 708–714. 46. Zhu, Z., Bakshi, A., Vinkhuyzen, A.A.E., Hemani, G., Lee, S.H., Nolte, I.M., van Vliet-Ostaptchouk, J.V., Snieder, H., Esko, T., Milani, L., et al. (2015). Dominance Genetic Variation Contributes Little to the Missing Heritability for Human Complex Traits. Am. J. Hum. Genet. 96, 377–385. 47. Marchini, J., and Howie, B. (2010). Genotype imputation for genome-wide association studies. Nat Rev Genet 11, 499–511. 48. Ma, R.C.W., and Chan, J.C.N. (2013). Type 2 diabetes in East Asians: similarities and differences with populations in Europe and the United States: Diabetes in East Asians. Ann. N. Y. Acad. Sci. 1281, 64–91.

    FigureTitlesandLegends

    Figure1:Trueandestimatedgeneticimpactandeffectcorrelation.Allsimulations

    conductedwithsimulatedEURandEASheritabilityof0.5using4499simulatedEURand

    4836simulatedEASindividualsat248,953SNPs.

    Figure2:DistributionofgeneticcorrelationcomparisonbetweenpopcornandGCTA.

    Distributionwascomputedusingagaussiankdeonthesetofgeneticcorrelationestimates.

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • 16

    Figure3:Geneticcorrelationasafunctionofheritabilityforgeneexpression.Themeanand

    standarderrorofthegeneticcorrelationofthesetofgeneswithh12andh22exceeding

    thresholdhineachanalysis(y-axis)isplottedagainsth(x-axis).

    Tables

    MAF>0.01 MAF>0.05

    Model Independent 0.500 0.478 0.500 0.460 0.500 0.488 0.509 0.469

    Inverse 0.431 0.500 0.567 0.496 0.479 0.500 0.555 0.482

    Rare 0.500 0.467 0.382 0.863 0.500 0.496 0.998 0.756

    Common 0.500 0.500 0.522 0.493 0.500 0.500 0.502 0.496

    Difference 0.500 0.416 0.354 0.435 0.500 0.461 0.410 0.412

    Adversarial 0.710 0.604 0.525 0.651 0.714 0.667 0.601 0.675

    Table1:Trueandestimatedvaluesofthegeneticimpactandeffectcorrelationinsimulated

    EUR-likeandEAS-likegenotypes.Resultsaretheaverageof100simulationswith

    phenotypeheritabilityof0.5ineachpopulation.

    Table2:HeritabilityandgeneticcorrelationofRAandT2DbetweenEURandEAS

    populations.EURRAdatacontained8,875casesand29,367controlsforastudyprevalence

    of0.23.EASRAdatacontained4,873casesand17,642controlsforastudyprevalenceof

    0.22.RAdiseaseprevalencewasassumedtobe0.5%inbothpopulations7.T2DEURdata

    contained12171casesand56862controlsforastudyprevalenceof0.18.T2DEASdata

    ⇢ge ⇢gi ⇢̂ge ⇢̂gi ⇢ge ⇢gi ⇢̂ge ⇢̂gi

    hEUR2liability hEAS2liability ρge ρgi

    RA Est.(SE) 0.18(0.02) 0.22(0.03) 0.46(0.06) 0.46(0.06)

    95%CI [0.15,0.21] [0.16,0.28] [0.34, 0.58] [0.34, 0.58]

    p>0 3.90e-32 1.89e-17 1.37e-15 8.16e-16

    p0 2.41e-77 5.73e-7 1.70e-12 2.85e-13

    p

  • 17

    contained6952casesand11865controlsforastudyprevalenceof0.37.T2DEUR

    prevalencewasassumedtobe8%33whileT2DEASprevalencewasassumedtobe9%48.

    .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/

  • .CC-BY-NC-ND 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

    The copyright holder for this preprint (which wasthis version posted February 23, 2016. ; https://doi.org/10.1101/036657doi: bioRxiv preprint

    https://doi.org/10.1101/036657http://creativecommons.org/licenses/by-nc-nd/4.0/