Revealing the causative variant in Mendelian patient ...computation on encrypted data. These provide...

Post on 14-Oct-2020


Title

Revealing the causative variant in Mendelian patient genomes without revealing patient genomes

Authors

Karthik A. Jagadeesh (1,5), David J. Wu (1,5), Johannes A. Birgmeier (1), Dan Boneh (1,2,6), Gill Bejerano (1,3,4,6)

Affiliations

1. Department of Computer Science, Stanford University
2. Department of Electrical Engineering, Stanford University
3. Department of Developmental Biology, Stanford University
4. Department of Pediatrics (Medical Genetics), Stanford University
5. These authors contributed equally
6. Corresponding authors: dabo@cs.stanford.edu (D.B.) and bejerano@stanford.edu (G.B.)

Abstract

Given the rapidly growing utility of critical health information revealed in the human genome, secure genomic computation is essential to moving forward, especially as genome sequencing becomes commonplace. We devise and implement proof-of-principle computational operations for precisely identifying causal variants in Mendelian patients using secure multiparty computation methods based on Yao’s protocol. We show multiple real scenarios (small patient cohorts, trio analysis, two hospital collaboration) where the causal variant is discovered jointly, while keeping up to 99.7% of all participants’ most sensitive genomic information private. All similar operations performed today to diagnose such cases are done openly, keeping 0% of participants’ genomic information private. Our work will help usher in an era where genomes can be both utilized and truly protected.

Introduction

Rare diseases affect 1 in 33 babies. Exome and genome sequencing have revolutionized the diagnosis of thousands of rare Mendelian diseases, mapping them to thousands of different human genes [1–3]. Thousands of additional rare Mendelian diseases and human genes await discovery. Frequency-based filters have proven extremely effective in providing diagnosis in such cases [4]. In essence, variants found in a control population (common variants) are likely to be benign [5], while functional rare variants not found in the control population but seen in multiple affected individuals are likely to be disease causing [6–8]. These filters seek the gene or variant present in all (most) affected individuals but in no (very few) unaffected individuals.

For example, one can take a small cohort of unrelated individuals suspected of suffering from the same genetic disorder, and compare their genomes to those of tens of thousands of unaffected individuals (e.g., from the Exome Aggregation Consortium, ExAC [5]). As we show below, in multiple scenarios, the gene with rare functional mutations in most patients in our small cohorts is indeed causal of their condition.

Frequency-based computation highlights the fundamental “serve or protect” dilemma of genomic data: “Serve:” to find the root cause of a patient’s disease, one wishes to compare a patient genome to as many other genomes as possible, both affected and unaffected, related and unrelated. Thus, to advance modern medicine, all sequenced genomes should be shared. “Protect:” one’s genome continues to reveal more and more about oneself, including susceptibility to a variety of diseases [9].

bioRxiv preprint doi: https://doi.org/10.1101/103655; this version posted January 27, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.


Sharing it with others can lead to discrimination and bias. To protect its owner and next of kin, no sequenced genome should be shared.

To date, this dilemma has been solved by allowing institutions unrestricted access to all the genomes in their possession. Limited sharing between institutions is done by providing obfuscated summary statistics [10]. Current commonly adopted methods for sharing have shortcomings that make them suboptimal. Providing full access at individual institutions allows for too much information to be shared in certain situations [11]. Disease-specific beacons are prone to attack and can end up identifying individuals participating in the study [12]. Beacons also only provide allele-presence query capabilities and do not have the flexibility needed for analyzing multifactorial variant interactions within an individual [13]. It is also risky to share genomic data in the clear with third-party services specializing in genomic and disease analysis. We are unaware of any cryptographically secure method for sharing genomic data to perform computational operations that allow identifying causal variants in patients.

To better resolve this dilemma, we first note that while all of the genomic variants from all individuals are needed to perform the computation, only a handful of causal variants are ultimately of interest in the context of Mendelian patients (in the example above, just the rare variants in the single gene mutated in most patients).

We introduce here a modern, proof-of-concept cryptographic implementation which both serves and protects. The secure computation can be run on entire genomes (Serve), while no party involved in the computation learns anything about the inputs of the other participants except for the output which is computed together (Protect). We use real patient data to show that our secure implementation reveals minimal information while diagnosing patient genomes through 3 different strategies using practical amounts of compute time and memory. Cryptographic methods have been used in different genomic contexts such as microbiome analysis [14], GWAS analysis [15] and genomic alignment [16], but this is the first implementation that we are aware of that is geared towards diagnosing Mendelian patients, a timely and potent need.

Methods

Representing genomic data as vectors

Assume each individual involved in a study has private access to their exome (or genome). If we are looking to identify a causal variant, we define a variant vector (long list of length 28,413,589) of all possible rare missense/nonsense variants in the human genome from the first gene on chromosome 1 to the last gene on chromosome Y. We provide a copy of this vector to each individual (affected and unaffected), and ask them to privately denote True/False next to each variant (to indicate whether they have the specific mutation or not, respectively). If we are looking to identify a causal gene, we provide each individual a gene vector of 20,663 genes in the human genome from A1BG to ZZZ3. We ask them to write “1” next to a gene if they have one or more rare functional variants in this gene, and otherwise, they write “0”. See Supplementary Figure 1A,B.
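The vector encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-gene list, the variant identifiers and the function names are hypothetical stand-ins, not the authors' code or the real 20,663-gene and 28,413,589-variant lists.

```python
from typing import List, Set

# Hypothetical, tiny stand-in for the ordered gene list (A1BG ... ZZZ3).
GENE_ORDER: List[str] = ["A1BG", "ACTB", "DHODH", "MYH3", "ZZZ3"]

# Hypothetical variant identifiers standing in for the ordered variant list.
VARIANT_ORDER: List[str] = ["var_0001", "var_0002", "var_0003"]


def gene_vector(mutated_genes: Set[str], gene_order: List[str]) -> List[int]:
    """Return a 0/1 list aligned to gene_order: 1 if the individual carries
    one or more rare functional variants in that gene, otherwise 0."""
    return [1 if gene in mutated_genes else 0 for gene in gene_order]


def variant_vector(carried: Set[str], variant_order: List[str]) -> List[bool]:
    """Return a True/False list aligned to variant_order."""
    return [variant in carried for variant in variant_order]


patient_genes = gene_vector({"MYH3", "ACTB"}, GENE_ORDER)
print(patient_genes)  # [0, 1, 0, 1, 0]
patient_variants = variant_vector({"var_0002"}, VARIANT_ORDER)
print(patient_variants)  # [False, True, False]
```

Because every participant uses the same fixed ordering, position i means the same gene or variant in every vector, which is what lets the later operations work position by position.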

Defining computations of interest (MAX, INTERSECTION, SETDIFF)

We define three operations used for patient diagnosis (Supplementary Figure 1C). Imagine two affected individuals are represented by two rare functional variant vectors (True/False lists). Intersecting these two vectors will reveal all the rare functional variants they share. Formally, we perform a Boolean INTERSECTION (or AND) operation (x AND y = True only if x = y = True, and otherwise, it is False) between all possible patient variants. Next, if we also have access to an unaffected family member, we can further exclude any variant they share. We do this with a Boolean set difference (SETDIFF) operation (x SETDIFF y = True only if x = True and y = False, otherwise it is False). Finally, imagine we have access to a small cohort of unrelated patients sharing a set of phenotypes. We would like to find the gene affected by one or more rare functional variants in the greatest number of patients within the cohort. For this, we use the patient gene vectors (0/1 lists). We sum 0/1s across patients for each gene, and then we use the maximum (MAX) operation to find the entry (gene) with the greatest number (of affected cases; Supplementary Figure 1C).
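In the clear (that is, with no cryptography at all), the three operations amount to the following Python sketch. The function names and the tie-breaking rule in `max_gene` (the first maximum wins) are assumptions made for illustration; the secure protocol computes the same functions without exposing the input vectors.

```python
from typing import List


def intersection(x: List[bool], y: List[bool]) -> List[bool]:
    """x AND y, position by position."""
    return [a and b for a, b in zip(x, y)]


def setdiff(x: List[bool], y: List[bool]) -> List[bool]:
    """True only where x is True and y is False."""
    return [a and not b for a, b in zip(x, y)]


def max_gene(gene_vectors: List[List[int]]) -> int:
    """Sum the 0/1 entries across patients for each gene and return the
    index of the gene with the largest count (first one on ties)."""
    counts = [sum(column) for column in zip(*gene_vectors)]
    return max(range(len(counts)), key=counts.__getitem__)


# Trio: variants seen in the child but in neither parent.
child = [True, True, False, True]
mother = [True, False, False, False]
father = [False, False, False, True]
print(setdiff(setdiff(child, mother), father))  # [False, True, False, False]

# Cohort: the gene at index 3 is mutated in all three patients.
cohort = [[0, 1, 0, 1], [0, 0, 0, 1], [1, 0, 0, 1]]
print(max_gene(cohort))  # 3
```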

Remarkably, modern cryptography allows any number of individuals to jointly learn the final result of these MAX, SETDIFF, INTERSECTION operations without any of them learning anything else about each other’s genomes (or vectors).

Encryption and decryption

An important cryptographic primitive we rely on is a secret-key encryption scheme. In a secret-key encryption scheme, a secret key is used to encrypt and decrypt messages with the guarantee that the encryptions of any two messages are indistinguishable, and yet, they can be successfully decrypted (to obtain the original message) given the key (Supplementary Figure 2).

Secure multiparty computation

Multiple mathematical frameworks and computational implementations exist for secure multiparty computation on encrypted data. These provide different tradeoffs in complexity and efficiency [17]. In this work, we use Yao’s protocol to securely evaluate functions between two parties [18]. Abstractly, we write the function as $f(x_1, x_2)$ where $x_1$ denotes the input of the first party and $x_2$ denotes the input of the second party. Any function $f(x_1, x_2)$ can be represented by a combination of Boolean operations (for example, see Supplementary Figure 3). Yao’s protocol provides a way of evaluating the Boolean circuit (operator-by-operator) without revealing the inputs $x_1$, $x_2$. We illustrate this in detail in Figure 1.

While Yao’s protocol provides a simple and efficient solution for secure two-party computation, in many of the scenarios we describe, the computation occurs among multiple parties (e.g., many individuals, each with their personal genome). It is very straightforward to reduce the general problem of secure multiparty computation to that of secure two-party computation by working in the “two-cloud” model. In the two-cloud model, we assume that there are two non-colluding servers (e.g., these could be managed by two independent government agencies) that aggregate the inputs from each party in a privacy-preserving manner and then perform the computation. Each server on its own has no knowledge of the data as shown in Figure 1.7 (see Online Methods and Discussion).

Protection quotient

We define the Protection Quotient as the fraction of private information that is not exposed (to either the other participants or the entity running the computation) during the computation. Using our encryption scheme, the Protection Quotient equals the total number of patient variants withheld from the output divided by the total number of patient variants input into the computation. Standard unencrypted patient diagnosis operations have a protection quotient of 0%, because all values must be exposed to perform the computation. All our applications below have a protection quotient of 97.1-99.7%, maximizing privacy while retaining full utility.
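As a worked check of this definition, the short Python sketch below recomputes two of the protection quotients from the variant counts reported in the Results (trio: 2 of 524 variants revealed; two hospitals: 2 x 159 revealed and 10,749 protected). The function name and signature are illustrative assumptions.

```python
def protection_quotient(total_input_variants: int, revealed_variants: int) -> float:
    """Fraction of input variants that are withheld from the output."""
    if total_input_variants <= 0:
        raise ValueError("total_input_variants must be positive")
    if not 0 <= revealed_variants <= total_input_variants:
        raise ValueError("revealed_variants must lie between 0 and the total")
    return (total_input_variants - revealed_variants) / total_input_variants


# Trio (SETDIFF): 524 variants in, 2 revealed.
print(f"{protection_quotient(524, 2):.1%}")  # 99.6%

# Two hospitals (INTERSECTION): 2 x 159 revealed, 10,749 protected.
revealed = 2 * 159
print(f"{protection_quotient(10_749 + revealed, revealed):.1%}")  # 97.1%

# Unencrypted analysis exposes every input variant.
print(f"{protection_quotient(524, 524):.1%}")  # 0.0%
```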


Results

Example Mendelian applications of secure computation

To prove the pragmatic utility of our approach, we demonstrate three different secure operations over real Mendelian patients where we successfully identify the causal variants in each scenario (Table 1):

MAX identifies the causal gene in small patient cohorts with protection quotient above 99%

We use 4 small cohorts of unrelated individuals, suffering from very different rare diseases: Freeman Sheldon Syndrome (FSS), Hajdu-Cheney Syndrome (HCS), Kabuki Syndrome (KaS) and Miller Syndrome (MiS). Each individual holds a private list of 211-374 rare functional variants in 210-356 genes (total 767-2,754 variants per computation). We use the secure MAX function to reveal only the top gene mutated across patients in each cohort. In all 4 cohorts, we find that the gene mutated in most individuals is the one that has been proven to be the causal gene: MYH3 in FSS [6], NOTCH2 in HCS [19], KMT2D in KaS [8] and DHODH in MiS [7] (Table 1a).

Secure computation only reveals the variants in the most mutated gene in each cohort while protecting the remaining 764 variants in FSS, 1,845 variants in HCS, 2,746 variants in KaS and 1,055 variants in MiS. This computation has a protection quotient of 99.3-99.7% for all 4 cohort disease datasets. The computation is performed over all 20,663 genes and completes in just 5-10 seconds, with one server on the East Coast and the other on the West Coast (Table 1a). The total protocol execution time, bandwidth and compute time all grow logarithmically with the number of cohort individuals involved in the secure computation (Supplementary Figure 4A).

SETDIFF identifies the causal variant in a trio with protection quotient 99.6%

An unaffected mother and father, and an affected male child with female external genitalia, each hold a list of 164-185 (total 524) rare functional variants found in their exomes. The secure SETDIFF operation reveals to the family and test providers only 2 rare variants found in the child but in neither parent (Table 1b). Literature review provides a diagnosis based on one of these two variants, in the ACTB gene [20].

Secure computation keeps 522 variants private while sharing only 2 variants with the test provider and all individuals involved in the computation. This computation has a protection quotient of 99.6%. Because three parties are now involved, the total computation time using a single server thread on either coast is 57 minutes (Table 1b). However, the variant list can easily be split between a small computer array on either coast, such that a typical 30-node cluster brings computation time down to under 2 minutes. The protocol execution time, bandwidth and compute time all grow logarithmically with the number of family members involved in the secure computation (Supplementary Figure 4B).

INTERSECTION identifies patients of interest across 2 hospitals with protection quotient 97.1%

Two or more genome centers may want to compare their patient lists to see if together they can find multiple patients with the same rare functional mutation, and similar phenotypes, while revealing nothing else to each other. For example, we took 928 Washington Mendelian Center (WMC) patients and 282 Baylor Hopkins Center (BHC) patients. For each hospital we prepared a list of over 5,000 rare functional variants seen in one or more of their patients. Using the secure AND function, the two hospitals find a short list of just 159 variants present in both hospitals, pointing at patients who would benefit from phenotype comparison. This short list includes “positive controls” such as known disease variant NOTCH1:p.E694K, associated with partial/incomplete penetrance of aortic valve disease [21]. Indeed, the WMC and BHC patients are phenotypically characterized with left ventricular outflow defect and thoracic aortic aneurysm, respectively. The list also offers exciting novel gene-disease associations such as rare functional variant HCN3:p.R648H (with frequency $5.47 \times 10^{-5}$ and 0 in ExAC and 1000 Genomes data, respectively). HCN3 is a voltage-gated cation channel gene, whose mouse knockout causes abnormal ventricular action potential waveform [22]. Promisingly, in patients from WMC and BHC, this mutation is correlated with dilated cardiomyopathy and coarctation of the aorta, respectively.

Secure computation only reveals 2 x 159 potential causative variants while protecting the remaining 10,749 variants with a protection quotient of 97.1%. This computation is performed over all rare functional variants in the exome with a total protocol execution time of 9.4 minutes using a single server thread on either coast (Table 1c). Because every variant is evaluated independently, a 30-node compute cluster on either end will reduce total computation time to below 20 seconds. As we learn to appreciate Mendelian mutations outside of the exome, the total time, bandwidth and compute time scale linearly with the size of the variant list shared for secure computation (Supplementary Figure 4C).

Discussion

Rare diseases are cumulatively common (some estimate that 10% of the US population are affected with rare disorders). About 7,000 rare Mendelian conditions have been described to date. Of these, approximately 4,000 have been definitively diagnosed as single gene diseases, mapping to over 4,000 genes in the genome. The procedures we describe are applicable for all of these. There are only a handful of medical conditions attributed with certainty to the interactions of just 2 genes. Far less is known for diseases caused by more than 2 genes.

Personal Genomics poses a fundamental “serve or protect” dilemma: should one serve their genome in the service of better diagnosis and ultimately disease eradication, or should one protect oneself and next of kin against potential discrimination by refusing to share their genome? This dilemma is particularly evident in the field of Mendelian diseases. It is essential to develop tools and methods to effectively share genomes while maintaining their privacy and security. Because genome privacy is best served where a definitive diagnosis exists, we focus on single disease gene discovery and diagnosis. Here we present a secure approach for multiple parties to perform exact computations that diagnose Mendelian diseases, while keeping all participating genomes private.

The scenarios we present are all real. Genome privacy is extremely appealing in all of them: Complete strangers in disease cohorts (Table 1a) learn nothing about each other except their shared disease-causing gene mutations. For participants where the assay does not provide an answer, absolutely nothing is revealed. In larger family trees, more distantly related members will appreciate genome privacy. Even in a young nuclear family (e.g., a trio; Table 1b), the test provider learns almost nothing except the likely disease-causing mutation in the offspring. Moreover, they learn virtually nothing about the parents themselves. In the two hospital scenario (Table 1c), only variants that are worthwhile comparing are revealed while the vast majority of variants remain private to each institute’s researchers and patients.

In all of these cases, the quantities revealed and those that remain private are a privacy advocate’s dream come true: We predominantly reveal only the variant(s) crucial for patient diagnosis, family counseling and any potential treatment. What remains private is predominantly variants of unknown significance (VUS) that are of little value for diagnosing one’s medical condition. However, these same VUS almost certainly uniquely identify a person as a participant in an analysis, and have the potential to reveal now or in the future other personal traits that may be further cause for discrimination.

For this proof-of-concept work, we assume that the protocol participants are “honest-but-curious” (sometimes referred to as “semi-honest”); that is, we assume that the parties are properly incentivized to honestly follow the protocol, but at the end of the protocol execution, they may try to learn some additional information (about other parties’ inputs) based on the messages they receive during the protocol execution. We say that a protocol is secure if the only information any party learns by participating in the protocol can be inferred just from that party’s input and the overall output of the computation. In other words, none of the parties should be able to learn something about another party’s input other than what is explicitly revealed by the output of the function.

Yao’s protocol gives an efficient solution for secure two-party computation in the presence of semi-honest adversaries [23]. We note that there are well-established ways to extend Yao’s protocol to additionally provide security against malicious parties who deviate from the protocol description in order to compromise the privacy of other participants or corrupt the results of the computation [24]. In addition, protecting against participants that submit malicious (or malformed) inputs to the protocol can be done by ensuring that if a participant’s variant vector does not meet certain criteria, or is not accompanied by an appropriate certificate, then the computation aborts and does not produce any output. Furthermore, in this paper, we introduce an operation-specific “protection quotient”, a novel metric to assess the fraction of information secured by the computation. The protection quotient can be used to further restrict the output returned to all parties if the defined privacy requirements are not met. For instance, if a trio analysis results in more than the few expected de novo exome mutations, only an error message will be produced. This approach is preferable, for example, to differential privacy [25,26], which adds random genomic variation as noise into aggregated summary statistics to try to avoid individual identification in pooled genomics data [15].

The basic principle underlying our design is to perform exact secure computation on the complete (private) genomes of all participating individuals. This is in direct contrast to the more traditional and less effective routes of publishing obfuscated frequencies aggregated across multiple individuals. The computational resources we use to retain genomic privacy are not negligible, yet are perfectly within the capabilities of off-the-shelf modern computers to complete the operation in seconds or minutes, even when communicating between the East and West coasts. And while no security mechanism may be perfectly impenetrable, it is certainly preferable to have a security mechanism in place (especially if it allows for exact computation) where none currently exists. Many further extensions and applications of our computational framework are possible, and are sure to provide incentives for the development of more secure and faster methods. A widespread deployment of computer libraries efficiently implementing these principles will encourage individuals to securely contribute their genomes for the common good, and thus greatly fuel advances in both personal genomics and privacy in the 21st century.

On-Line Methods

Patient datasets

Whole exome sequences of patients were obtained from dbGaP studies phs000204.v1.p1 [6] (Freeman Sheldon Syndrome), phs000244.v1.p1 [7] (Miller Syndrome), phs000295.v1.p1 [8] (Kabuki Syndrome), and phs000477.v1.p1 (Hajdu-Cheney Syndrome). Pre-processed variant call format (VCF) files for patients from 2 Centers for Mendelian Genomics were obtained from dbGaP studies phs000693.v4.p1 (University of Washington), and phs000711.v3.p1 (Baylor Hopkins). Our trio family was obtained from Stanford Hospital. All human subject research was performed under guidelines approved by the Stanford Institutional Review Board.


Sequencing reads were mapped to the GRCh37/hg19 assembly of the human genome using BWA MEM v0.7.10-r789 [27]. Variants were called using GATK v3.4-46-gbc02625 following the HaplotypeCaller workflow from the GATK Best Practices [28].

Variant annotation

ANNOVAR v527 was used to annotate variants with predicted effect on protein coding genes using gene isoforms from the ENSEMBL gene set version 75 for the hg19/GRCh37 assembly of the human genome [29,30]. All canonical gene isoforms were used where the transcript start and end are marked as complete and the coding span is a multiple of three.

Cryptographic techniques

In a secure multiparty computation (MPC) protocol [18,31], a group of users (often called parties) seek to jointly compute a function over their inputs without revealing any additional information about their particular inputs. The function that the parties compute is determined based on the specific scenario. The computation consists of several rounds of interaction, where in each round, the parties exchange a series of messages. At the conclusion of the protocol, each participant learns the output of the computation evaluated on everyone’s joint input. No additional information beyond the explicit output is revealed to any party (the process is abstracted in Figure 1).

Every arithmetic computation can be expressed as a sequence of Boolean logical operations (that is, operations on bits $\{0,1\}$). This is precisely how the modern computer works. Yao’s protocol allows two users, Alice and Bob, to compute arbitrary functions over their inputs. More precisely, if Alice has an input $x$ and Bob has an input $y$, Yao’s protocol allows them to compute $f(x, y)$ in a way such that Alice learns nothing about $y$ and Bob learns nothing about $x$ other than the output value $f(x, y)$. In general, expressing a function in terms of Boolean operations greatly increases the computational cost of evaluating the function. To maximize the efficiency of Yao’s protocol, it is important to choose functionalities with simple or compact representations as Boolean circuits. An example of a Boolean circuit is shown in Supplementary Figure 3.
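To make the claim about arithmetic and Boolean operations concrete, here is a minimal Python sketch of integer addition built only from AND and XOR on single bits (a ripple-carry adder). It is a generic textbook construction offered as an illustration, not the circuit used in the paper; the helper names and the little-endian bit order are assumptions.

```python
from typing import List, Tuple


def full_adder(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """One-bit adder using only XOR (^) and AND (&).
    The two carry terms can never both be 1, so XOR can combine them."""
    partial = a ^ b
    return partial ^ carry_in, (a & b) ^ (carry_in & partial)


def add_bits(x: List[int], y: List[int]) -> List[int]:
    """Add two equal-length little-endian bit lists; the result has one extra bit."""
    carry = 0
    out = []
    for a, b in zip(x, y):
        bit, carry = full_adder(a, b, carry)
        out.append(bit)
    out.append(carry)
    return out


def to_bits(value: int, width: int) -> List[int]:
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: List[int]) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


print(from_bits(add_bits(to_bits(5, 4), to_bits(11, 4))))  # 16
```

Each `full_adder` call costs two AND gates, which illustrates why compact circuits matter: the cost of a garbled circuit grows with the number of gates that must be garbled and evaluated.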

In this work, we cast diagnosing Mendelian patients as (simple) arithmetic/logic computations that admit efficient Boolean circuit representations. We now describe how the secure computation protocols work. To do this we first introduce two standard tools from cryptography: (1) symmetric (secret-key) encryption [32] and (2) oblivious transfer [33–35].

Encryption and decryption

A secret-key encryption scheme consists of two functions: Encrypt and Decrypt. The encryption function takes a cryptographic key $k$ and a message $m$ and outputs a ciphertext $c$. The decryption function takes the cryptographic key $k$ and a ciphertext $c$ and outputs a message $m$. Intuitively, encryption and decryption are inverse operations: if we encrypt a message under a key $k$, decrypting the resulting ciphertext with the same key $k$ recovers the original message. More precisely, we can say that for any key $k$ and any message $m$, $\mathrm{Decrypt}(k, \mathrm{Encrypt}(k, m)) = m$. In a symmetric (or secret-key) encryption scheme, both the encryption and the decryption functions require knowledge of the secret cryptographic key. The key is a random string drawn from some key-space. The precise nature of the key-space varies depending on the details of the encryption scheme, and is immaterial to our presentation in this paper. An encryption scheme is considered to be secure if the ciphertext does not reveal any information about the underlying message to any user who does not possess the secret encryption key (certainly, a user who holds the secret key can decrypt and learn the message). One way to formalize this is to say that a user who does not have the encryption key is unable to tell an encryption of a message $m_0$ apart from an encryption of another message $m_1$. In other words, ciphertexts hide all information about their underlying message from all users who do not have the encryption key. We illustrate this in Supplementary Figure 2.

Under this definition, messages can also be encrypted multiple times. For instance, a message $m$ can be “double encrypted” under two keys $k_1$ and $k_2$ by first encrypting $m$ using $k_1$ and then encrypting the resulting ciphertext using the second key $k_2$. This procedure yields another ciphertext. Decryption proceeds by first decrypting with key $k_2$, and then decrypting the result (a ciphertext) with $k_1$. In particular, we can write

$$\mathrm{Decrypt}(k_1, \mathrm{Decrypt}(k_2, \mathrm{Encrypt}(k_2, \mathrm{Encrypt}(k_1, m)))) = m$$

Security of the double encryption scheme follows directly from the security of the underlying encryption scheme. In particular, a user who does not have both $k_1$ and $k_2$ cannot learn any information about the underlying message that has been doubly encrypted using $k_1$ and $k_2$. Numerous symmetric (secret-key) encryption schemes exist in the literature [32].
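The double-encryption identity can be exercised with a small Python sketch. The cipher below is a toy built from SHA-256 in counter mode purely so the example is self-contained; it is an assumption for illustration, is not the scheme used in the paper, and should not be used to protect real data (a production system would use a vetted authenticated cipher).

```python
import hashlib
import secrets


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 of the key and a running counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, message: bytes) -> bytes:
    """XOR the message with the keystream derived from the key."""
    stream = _keystream(key, len(message))
    return bytes(m ^ s for m, s in zip(message, stream))


def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR is its own inverse, so decryption repeats the same operation."""
    return encrypt(key, ciphertext)


k1 = secrets.token_bytes(16)
k2 = secrets.token_bytes(16)
m = b"rare variant"

double = encrypt(k2, encrypt(k1, m))        # encrypt under k1, then under k2
recovered = decrypt(k1, decrypt(k2, double))  # peel off k2 first, then k1
print(recovered == m)  # True
```

In this toy cipher the two layers happen to commute, so the decryption order would not matter; the code still follows the order given in the text, which is the one required for general encryption schemes.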

Oblivious transfer

An oblivious transfer (OT) protocol [33–35] is a two-party protocol between a sender and a receiver. An OT protocol enables the receiver to selectively obtain one of two possible messages from the sender without revealing to the sender which message the receiver requested. More precisely, the sender holds two messages, denoted $k_0$ and $k_1$, and the receiver holds a selection bit $b \in \{0,1\}$. At the end of the OT protocol, the receiver obtains the chosen message $k_b$ and learns nothing about the other message $k_{1-b}$. The sender does not learn anything about the receiver’s choice bit $b$. Numerous oblivious transfer protocols have been proposed in the literature [33–35].

Overview of steps for secure computation

In a secure two-party computation protocol, Alice holds an input $x \in \{0,1\}^n$ and Bob holds an input $y \in \{0,1\}^n$. We write $\{0,1\}^n$ to denote a binary input of length $n$ (for instance, $n$ could be the length of the binary representation of the variant vector or the gene vector we define in our main text). Their goal is to compute a function $f(x, y)$ on their joint input $(x, y)$. The computation is considered “secure” if at the end of the computation, the only information that Alice and Bob learn is the function value $f(x, y)$ and nothing else about the other party’s input. It is important to note here that the function output $f(x, y)$ could reveal some information about the inputs $x$ and $y$ (for example, in our trio scenario, whatever de novo variant we report in the child, we can deduce by definition does not exist in either parent). As we note in the Discussion section, we work in the honest-but-curious model where we assume that Alice and Bob follow the protocol specification as directed, but may, at the end of the protocol execution, try to infer some additional information about each other’s private input. We now describe how Yao’s protocol can be used to securely evaluate any function over two inputs in the honest-but-curious model.

To apply Yao’s protocol, it is first necessary to represent the function $f$ as a Boolean circuit on inputs $x$ and $y$. At the most basic level, the building blocks we have are AND (x AND y = True only if x = y = True, otherwise it is False) and XOR (exclusive-or, x XOR y = True only if x = True and y = False, or if x = False and y = True, otherwise it is False) gates. These basic gates can be combined to obtain circuits of arbitrary expressive functionalities [17]. In our description below, we will oftentimes refer to the concrete example of securely evaluating the AND function on single-bit inputs (above). A visualization of the complete protocol is given in Figure 1.


Overview of Yao’s secure two party computation protocol

We now describe Yao’s protocol. For ease of presentation, we present a simplified (but less efficient) description of Yao’s protocol here. Our implementation (based on the JustGarble [36] library) follows the high-level blueprint described here, but includes several optimizations, notably the free-XOR [37] and half-gate [38] optimizations.

Step 1: For each wire in the circuit, Alice chooses two keys. Recall that in a Boolean circuit, each wire in the circuit can take on two possible values (0 or 1; sometimes also referred to as False and True, respectively). Alice associates one of the keys with the wire value 0 and another key with the wire value 1. For the particular case of securely evaluating the AND function $z = x \wedge y$, Alice picks three pairs of keys and associates one pair with each of $x$, $y$, and $z$. We denote these keys $k_x^0, k_x^1, k_y^0, k_y^1, k_z^0, k_z^1$. In this example, $k_x^0$ is the key associated with the input bit $x$ taking on the value 0 and $k_z^1$ is the key associated with the output bit $z$ taking on the value 1. This step is shown in Figure 1.1.

Step 2: For each gate in the circuit, Alice constructs a “garbled” truth table. For each row in the truth table, the algorithm takes the key associated with the value of the output wire and double encrypts it using the two keys associated with the values of the two input wires. For the particular case of evaluating a single AND gate, Alice would construct the following table of ciphertexts:

• $c_1 = \mathrm{Encrypt}_{k_x^0}(\mathrm{Encrypt}_{k_y^0}(k_z^0))$

• $c_2 = \mathrm{Encrypt}_{k_x^0}(\mathrm{Encrypt}_{k_y^1}(k_z^0))$

• $c_3 = \mathrm{Encrypt}_{k_x^1}(\mathrm{Encrypt}_{k_y^0}(k_z^0))$

• $c_4 = \mathrm{Encrypt}_{k_x^1}(\mathrm{Encrypt}_{k_y^1}(k_z^1))$

For the output wires of the circuit, instead of double encrypting an encryption key, Alice directly double encrypts the value of the output wire (e.g., 0 or 1). This step is shown in Figure 1.2.

Step 3: After garbling the circuit (Steps 1-2 / Figure 1.1-1.2), the secure computation begins with Bob using the oblivious transfer protocol (above) to obtain the keys for the input wires associated with his input. The oblivious transfer protocol ensures the following: Bob only learns one of the two keys associated with each of his input wires (this will ensure that Bob can only evaluate the function on a single set of inputs), and Alice does not learn which key Bob requested (that is, Alice does not learn Bob’s input). For the particular case of evaluating a single AND gate, if Bob’s input is $b \in \{0,1\}$, then Bob would play the role of the receiver in an OT protocol with input $b$. Alice would play the role of the sender with messages $k_y^0, k_y^1$ (the keys associated with Bob’s input wire). At the end of the oblivious transfer protocol, Bob obtains $k_y^b$ (the key associated with his input), and learns nothing about the key associated with the complement of his input ($k_y^{1-b}$). Alice learns nothing about Bob’s input $b$. This step is shown in Figure 1.3.

Step 4: After Bob receives the keys associated with his input via the oblivious transfer protocol, Alice sends Bob the garbled tables associated with each gate (after randomly permuting the rows of each table). Additionally, Alice sends Bob the wire encodings of her input. For the particular case of evaluating a single AND gate, if Alice’s input is $a \in \{0,1\}$, Alice would send $k_x^a$ (the key associated with $x = a$) to Bob. This step is shown in Figure 1.4.

Step 5: With all this information, Bob can complete the function evaluation and compute the output. In particular, after Steps 3 and 4, Bob should have a single key for each of the input wires of the Boolean circuit. Then, for each gate in the circuit, Bob takes the input keys he has and attempts to decrypt the rows in the garbled table associated with that gate. Because the entries in the garbled table are double encrypted using the keys associated with the input wires to the gate and Bob only has a single key for each of the wires, Bob is only able to decrypt a single row in the garbled table as shown in Figure 1.5.

Step 6: In doing so, Bob is able to learn one of the keys associated with the gate’s output wire (moreover, by construction of the garbled table, the output key Bob obtains is precisely the one associated with the value corresponding to evaluation of the gate on the input bits). Thus, starting with the input wires, Bob is able to evaluate the circuit gate-by-gate as seen in Figure 1.6. Once Bob reaches the output layer of the circuit, he is able to decrypt the ciphertexts and obtain the value of each output wire.

Summary: In Yao’s secure two-party computation protocol, Alice begins by constructing a garbled truth table for each gate in the Boolean circuit. She does so by double encrypting each row in the truth table (using the keys associated with the input bits). She gives Bob the garbled truth tables as well as the keys associated with her input. Using oblivious transfer, Bob obtains the keys associated with his input. Armed with a single key for each of the input wires in the circuit, Bob is able to evaluate the garbled circuit gate-by-gate. For each gate, Bob takes his input keys and uses them to decrypt one of the rows of the garbled table associated with the gate. This yields the key associated with the gate’s output wire. Finally, at the end of the computation, Bob decrypts the ciphertexts associated with the output wires to learn the output of the circuit. Bob then sends the result of the computation to Alice.
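The six steps can be traced end to end for a single AND gate with the Python sketch below. It is a teaching sketch under several loudly stated assumptions, not the authors' JustGarble-based implementation: the cipher is a toy SHA-256 keystream, a block of zero bytes (`TAG`) is appended so Bob can recognise the one row that decrypts correctly, the row position is mixed into the keystream so no keystream is reused, and `ideal_ot` merely stands in for a real oblivious transfer protocol (it hands over the chosen key without providing any of OT's privacy guarantees).

```python
import hashlib
import secrets
from typing import Dict, List, Tuple

KEY_LEN = 16
TAG = bytes(KEY_LEN)  # zero padding used to recognise a correct decryption


def _stream(key: bytes, tweak: int, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + bytes([tweak]) + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, tweak: int, message: bytes) -> bytes:
    """Toy cipher: XOR with a keystream derived from the key and row position."""
    return bytes(m ^ s for m, s in zip(message, _stream(key, tweak, len(message))))


decrypt = encrypt  # XOR is its own inverse


def garble_and_gate() -> Tuple[Dict[str, Tuple[bytes, bytes]], List[bytes]]:
    """Steps 1-2 (Alice): pick two keys per input wire, build the garbled table.
    The gate's output wire is a circuit output, so the bit itself is encrypted."""
    keys = {w: (secrets.token_bytes(KEY_LEN), secrets.token_bytes(KEY_LEN)) for w in ("x", "y")}
    rows = [(x, y) for x in (0, 1) for y in (0, 1)]
    secrets.SystemRandom().shuffle(rows)  # random row permutation (Step 4)
    table = []
    for position, (x, y) in enumerate(rows):
        payload = bytes([x & y]) + TAG
        inner = encrypt(keys["y"][y], position, payload)
        table.append(encrypt(keys["x"][x], position, inner))
    return keys, table


def ideal_ot(messages: Tuple[bytes, bytes], choice_bit: int) -> bytes:
    """Placeholder for oblivious transfer (Step 3); NOT a secure OT."""
    return messages[choice_bit]


def evaluate(table: List[bytes], key_x: bytes, key_y: bytes) -> int:
    """Steps 5-6 (Bob): only one row decrypts to something ending in TAG."""
    for position, row in enumerate(table):
        plain = decrypt(key_y, position, decrypt(key_x, position, row))
        if plain[1:] == TAG:
            return plain[0]
    raise ValueError("no row decrypted correctly")


def secure_and(alice_bit: int, bob_bit: int) -> int:
    keys, table = garble_and_gate()
    key_y = ideal_ot(keys["y"], bob_bit)  # Step 3: Bob's key via (stand-in) OT
    key_x = keys["x"][alice_bit]          # Step 4: Alice sends her input key
    return evaluate(table, key_x, key_y)


print([secure_and(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
```

Bob sees only random-looking keys and ciphertexts: the key he receives for Alice's wire does not tell him whether it encodes 0 or 1, which is the point of the construction.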

Extending Yao's protocol to N parties

Yao's protocol allows two parties to securely evaluate an arbitrary function. In general, however, we want to compute across a large number of parties (e.g., study participants). While there are secure multiparty computation protocols that support more than two parties (e.g., the SPDZ [39], GMW [31], or BGW [40] protocols), a key limitation of these protocols is that they require all participating parties to be online during the protocol execution. Moreover, the number of rounds of communication in the protocol often grows with the complexity of the computation (in direct contrast with Yao's protocol, which is a two-round protocol regardless of how complicated the computation is). As a result, there are substantial engineering hurdles to deploying these general protocols for multiparty computation across a large number of parties. In some cases (e.g., BGW [40] and GMW [31]), the total bandwidth also scales quadratically in the number of parties, further limiting the practicality of these protocols.

A more efficient solution for general multiparty computation, which avoids both the requirement that participating parties be online during the protocol execution and the potential communication blowup, is to work in a "two-cloud" model. In this model, we assume there are two non-colluding cloud servers that facilitate the protocol execution. At the beginning of the protocol execution, each participating party "splits" its input into two shares and sends one share to each of the two cloud servers. As long as the two clouds do not collude with each other, they do not learn anything about the inputs to the computation. After the two cloud servers have received the inputs from each of the participating parties, they engage in a two-party secure computation protocol (such as Yao's protocol) to compute the function of interest. Notably, the parties that contributed the data do not have to be online during this step of the protocol. Moreover, communication is only necessary between the parties and the cloud servers; in particular, the parties do not have to communicate with each other. In a practical deployment, these two cloud servers might be managed by distinct governmental organizations within the NIH or WHO. Thus, by working in the two-cloud model, it is possible to transform any computation between $n$ individuals into a secure two-party computation between two non-colluding parties.

Step 7: We now describe how to securely evaluate any functionality in the two-cloud model. Suppose there are $n$ parties participating in the protocol execution and let $x_1, \dots, x_n$ denote their private inputs. To securely compute a function $f$, each of the $n$ participants chooses a random value $r_i$ and sends $r_i$ to one of the two cloud servers. They then send to the other cloud server the value $x_i - r_i$ (the subtraction is performed modulo a large integer $p$). Once every party has submitted its shares $r_i$ and $x_i - r_i$ to the two cloud servers, the first cloud server has a vector of random values $a = (r_1, \dots, r_n)$ and the second cloud server has a vector of random differences $b = (x_1 - r_1, \dots, x_n - r_n)$. Because the subtraction takes place modulo $p$, the values in $b$ are distributed uniformly and, more importantly, independently of the $x_i$'s. The pair $(r_i, x_i - r_i)$ is often referred to as an "additive secret sharing" of the input $x_i$. The property this additive secret sharing scheme satisfies is that a single share reveals no information about the input, but the two shares together completely define the input. This means that as long as the two cloud servers do not collude, they learn no information about each party's input (since they each possess just one share of the secret).
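The hiding property of additive secret sharing can be checked exhaustively for a small modulus. The sketch below is an illustration under assumed names (`share`, `reconstruct`, `second_share_distribution`) and an assumed toy modulus; a real deployment would use a much larger modulus.

```python
import secrets


def share(x: int, p: int):
    """Split x into two additive shares modulo p."""
    r = secrets.randbelow(p)          # uniformly random first share
    return r, (x - r) % p             # second share is the masked difference


def reconstruct(share_1: int, share_2: int, p: int) -> int:
    """Adding the two shares modulo p recovers the input."""
    return (share_1 + share_2) % p


def second_share_distribution(x: int, p: int):
    """All values the second share can take as r ranges over 0..p-1."""
    return sorted((x - r) % p for r in range(p))


if __name__ == "__main__":
    p = 7  # toy modulus, chosen only so the enumeration is readable
    for x in range(p):
        # Every input produces the same (uniform) set of possible second shares.
        print(x, second_share_distribution(x, p))
```

Because each input maps to exactly the same set of second shares, a server that sees only $x_i - r_i$ cannot tell which $x_i$ produced it.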

To complete the secure computation (of a function $f$), the two cloud servers simply apply Yao's protocol to the following two-party functionality:

$$g\big((r_1, \dots, r_n),\ (x_1 - r_1, \dots, x_n - r_n)\big) = f\big(r_1 + (x_1 - r_1), \dots, r_n + (x_n - r_n)\big) = f(x_1, \dots, x_n).$$

In other words, the two clouds compute the functionality that takes as input two vectors (each containing $n$ values) and outputs the function $f$ evaluated on the components of the sum of the two input vectors. Since summing the input vectors in this case reconstructs each party's input, this procedure corresponds precisely to evaluating $f$ on the parties' inputs. Moreover, the two cloud servers do not learn any additional information about any particular party's input because the evaluation of $g$ is performed using Yao's protocol (which is a secure two-party computation protocol). This procedure is shown in Figure 1.7.
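As a sanity check on the functionality the two clouds compute, the sketch below evaluates it in the clear. In the actual protocol the recombination happens inside the garbled circuit, so neither server ever sees the combined vector; the names `split_inputs` and `g` and the modulus are assumptions made for this example.

```python
import secrets

P = 2**31 - 1  # assumed public modulus, larger than any input value


def split_inputs(xs, p=P):
    """Each party i contributes r_i to one server and x_i - r_i to the other."""
    rs = [secrets.randbelow(p) for _ in xs]
    ds = [(x - r) % p for x, r in zip(xs, rs)]
    return rs, ds


def g(rs, ds, f, p=P):
    """Two-party functionality: recombine the shares, then apply f.

    Shown in the clear for illustration only; under Yao's protocol this
    whole function is the circuit being garbled.
    """
    combined = [(r + d) % p for r, d in zip(rs, ds)]
    return f(combined)


if __name__ == "__main__":
    inputs = [3, 0, 7, 2]             # private inputs of four parties
    server_1, server_2 = split_inputs(inputs)
    print(g(server_1, server_2, max))
```

Any function of the inputs can be passed as `f`; the example uses `max` because the small-cohort scenario is built around a MAX operation.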

Constructing our Boolean circuits

As described above, arbitrary Boolean circuits can be constructed using only AND and XOR gates. To efficiently represent our set-intersection-based algorithms as Boolean circuits, we first construct some intermediate building blocks from the basic AND and XOR gates. The intermediate building blocks we require include addition circuits, comparison circuits, and equality circuits. For these building blocks, we use the circuits by Kolesnikov et al. [41] (see Supplementary Figure 5).
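To make the idea of building blocks concrete, the sketch below composes an adder and an equality test from AND and XOR alone. The full adder uses a common single-AND form; whether it matches the cited circuits gate for gate is an assumption, and all function names here are illustrative.

```python
def AND(a: int, b: int) -> int:
    return a & b


def XOR(a: int, b: int) -> int:
    return a ^ b


def full_adder(a: int, b: int, c: int):
    """One-bit adder using a single AND gate; returns (sum bit, carry out)."""
    s = XOR(XOR(a, b), c)
    carry = XOR(c, AND(XOR(a, c), XOR(b, c)))
    return s, carry


def add_bits(xs, ys):
    """k-bit addition modulo 2**k on little-endian bit lists."""
    out, carry = [], 0
    for a, b in zip(xs, ys):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out  # final carry is dropped, giving arithmetic modulo 2**k


def eq_bits(xs, ys) -> int:
    """Return 1 if the two bit lists are equal, else 0."""
    differs = 0
    for a, b in zip(xs, ys):
        d = XOR(a, b)
        differs = XOR(XOR(differs, d), AND(differs, d))  # OR built from XOR and AND
    return XOR(differs, 1)  # NOT is XOR with the constant 1


def to_bits(value: int, k: int):
    return [(value >> i) & 1 for i in range(k)]


def from_bits(bits) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    k = 4
    print(from_bits(add_bits(to_bits(9, k), to_bits(5, k))))
    print(eq_bits(to_bits(9, k), to_bits(9, k)))
```

Minimizing AND gates matters in this setting because, with the free-XOR technique cited as reference 37, XOR gates add essentially no communication cost.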

Software implementation

In our implementation, we use the JustGarble library [36] for Yao's garbled circuits, and we use the Asharov et al. [42] implementation of the oblivious transfer protocols. For better performance, we also implement the half-gates optimization [38] for Yao's garbled circuits. This implementation will be released upon publication. For our benchmarks, we set up a client and server on Amazon EC2 (to simulate the two cloud providers), and measure the total compute time, bandwidth, and overall protocol execution time (taking into account the network communication). We run our experiments on two memory-optimized EC2 instances (M4.2xlarge). Each instance runs an 8-core 2.4 GHz Intel Xeon E5-2676 v3 (Haswell) processor and has 32 GB of memory. While our protocols are naturally parallelizable, we use a single thread of execution in all of our experiments, and do not take advantage of the available parallelism. To simulate the non-colluding two-cloud model, we used a wide-area network (WAN) setting where the two servers are far apart. We placed one of the servers on the West Coast (specifically, in the Northern California availability zone) and the other on the East Coast (specifically, in the Northern Virginia availability zone).

References

1. Yang, Y. et al. Clinical Whole-Exome Sequencing for the Diagnosis of Mendelian Disorders. N. Engl. J. Med. 369, 1502–1511 (2013).
2. Iglesias, A. et al. The usefulness of whole-exome sequencing in routine clinical practice. Genet. Med. 16, 922–931 (2014).
3. Lee, H., Deignan, J. L., Dorrani, N. et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA 312, 1880–1887 (2014).
4. Rehm, H. L. et al. ACMG clinical laboratory standards for next-generation sequencing. Genet. Med. Off. J. Am. Coll. Med. Genet. 15, 733–747 (2013).
5. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
6. Ng, S. B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276 (2009).
7. Ng, S. B. et al. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42, 30–35 (2010).
8. Ng, S. B. et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat. Genet. 42, 790–793 (2010).
9. Moreno-Estrada, A. et al. The genetics of Mexico recapitulates Native American substructure and affects biomedical traits. Science 344, 1280–1285 (2014).
10. Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 39, 1181–1186 (2007).
11. Siu, L. L. et al. Facilitating a culture of responsible and effective sharing of cancer genome data. Nat. Med. 22, 464–471 (2016).
12. Shringarpure, S. S. & Bustamante, C. D. Privacy Risks from Genomic Data-Sharing Beacons. Am. J. Hum. Genet. 97, 631–646 (2015).
13. Regalado, A. Networks of Genome Data Will Transform Medicine. MIT Technology Review. Available at: https://www.technologyreview.com/s/535016/internet-of-dna/. (Accessed: 30th September 2016)
14. Wagner, J., Paulson, J. N., Wang, X., Bhattacharjee, B. & Corrada Bravo, H. Privacy-preserving microbiome analysis using secure computation. Bioinformatics 32, 1873–1879 (2016).
15. Simmons, S., Sahinalp, C. & Berger, B. Enabling Privacy-Preserving GWASs in Heterogeneous Human Populations. Cell Syst. 3, 54–61 (2016).
16. Popic, V. & Batzoglou, S. Privacy-Preserving Read Mapping Using Locality Sensitive Hashing and Secure Kmer Voting. bioRxiv 046920 (2016). doi:10.1101/046920
17. Lindell, Y. & Pinkas, B. Secure Multiparty Computation for Privacy-Preserving Data Mining. J. Priv. Confidentiality 1, 59–98 (2009).
18. Yao, A. C.-C. Protocols for Secure Computations. in Annual Symposium on Foundations of Computer Science 160–164 (1982).
19. Simpson, M. A. et al. Mutations in NOTCH2 cause Hajdu-Cheney syndrome, a disorder of severe and progressive bone loss. Nat. Genet. 43, 303–305 (2011).
20. Rivière, J.-B. et al. De novo mutations in the actin genes ACTB and ACTG1 cause Baraitser-Winter syndrome. Nat. Genet. 44, 440–444, S1-2 (2012).
21. McBride, K. L. et al. NOTCH1 mutations in individuals with left ventricular outflow tract malformations reduce ligand-induced signaling. Hum. Mol. Genet. 17, 2886–2893 (2008).
22. Fenske, S. et al. HCN3 contributes to the ventricular action potential waveform in the murine heart. Circ. Res. 109, 1015–1023 (2011).
23. Lindell, Y. & Pinkas, B. A Proof of Security of Yao's Protocol for Two-Party Computation. J. Cryptol. 22, 161–188 (2009).
24. Hazay, C. & Lindell, Y. Efficient Secure Two-Party Protocols - Techniques and Constructions. (Springer, 2010).
25. Dwork, C. Differential Privacy. in ICALP 1–12 (2006).
26. Dinur, I. & Nissim, K. Revealing information while preserving privacy. in PODS 202–210 (2003).
27. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinforma. Oxf. Engl. 26, 589–595 (2010).
28. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
29. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164–e164 (2010).
30. Cunningham, F. et al. Ensembl 2015. Nucleic Acids Res. 43, D662-669 (2015).
31. Goldreich, O., Micali, S. & Wigderson, A. How to Play any Mental Game or A Completeness Theorem for Protocols with Honest Majority. in Annual ACM Symposium on Theory of Computing 218–229 (1987).
32. Katz, J. & Lindell, Y. Introduction to Modern Cryptography. (Chapman and Hall/CRC Press, 2007).
33. Rabin, M. O. How To Exchange Secrets with Oblivious Transfer. IACR Cryptol. EPrint Arch. 2005, 187 (2005).
34. Kilian, J. Founding Cryptography on Oblivious Transfer. in Proceedings of the 20th Annual ACM Symposium on Theory of Computing, May 2-4, 1988, Chicago, Illinois, USA 20–31 (1988).
35. Naor, M. & Pinkas, B. Oblivious Transfer and Polynomial Evaluation. in Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, May 1-4, 1999, Atlanta, Georgia, USA 245–254 (1999).
36. Bellare, M., Hoang, V. T., Keelveedhi, S. & Rogaway, P. Efficient Garbling from a Fixed-Key Blockcipher. in IEEE Symposium on Security and Privacy 478–492 (2013).
37. Kolesnikov, V. & Schneider, T. Improved Garbled Circuit: Free XOR Gates and Applications. in International Colloquium on Automata, Languages and Programming 486–498 (2008).
38. Zahur, S., Rosulek, M. & Evans, D. Two Halves Make a Whole - Reducing Data Transfer in Garbled Circuits Using Half Gates. in EUROCRYPT 220–250 (2015).
39. Damgard, I., Pastro, V., Smart, N. P. & Zakarias, S. Multiparty Computation from Somewhat Homomorphic Encryption. in CRYPTO 643–662 (2012).
40. Ben-Or, M., Goldwasser, S. & Wigderson, A. Completeness Theorems for Non-Cryptographic Fault-Tolerant Distributed Computation (Extended Abstract). in STOC 1–10 (1988).
41. Kolesnikov, V., Sadeghi, A.-R. & Schneider, T. Improved Garbled Circuit Building Blocks and Applications to Auctions and Computing Minima. in Cryptology and Network Security 1–20 (2009).
42. Asharov, G., Lindell, Y., Schneider, T. & Zohner, M. More efficient oblivious transfer and extensions for faster secure computation. in ACM CCS 535–548 (2013).

Author Contributions

KJ, DW, DB and GB designed the study, analyzed results and wrote the manuscript. KJ and JB processed patient data. KJ and DW wrote software for the analysis.

Acknowledgements

We thank Dr. Jon Bernstein and members of the Boneh and Bejerano labs for valuable discussions and project feedback. We also thank Stanford patients and clinicians, as well as the patients and professionals involved in the deposition of the dbGaP sets we use.

bioRxiv preprint, this version posted January 27, 2017; doi: https://doi.org/10.1101/103655. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Figures/Tables

Figure 1

[Figure 1 image: panels 0-7. Panels 0-6 illustrate Yao's protocol for f(a,b) = a AND b using locked boxes and keys: big boxes with keys AT/AF for Alice's secret value a, small boxes with keys BT/BF for Bob's secret value b, and notes T/F holding the outcome. For multi-gate functions f(A,B), Alice provides the correct unmarked key for the next step instead of an answer. Panel 7 illustrates additive secret sharing of private genomes G1, …, Gn with random numbers R1, …, Rn between Computer A (West Coast) and Computer B (East Coast): f(A,B) = f((G1-R1, …, Gn-Rn), (R1, …, Rn)) = f(G1, …, Gn), a secure computation with all n genomes.]


Table 1

Each panel lists the operation, the relevant information for that operation, and the running time measurements.

(a) MAX (over genes). Scenario: small disease cohort.

| Disease | # unrelated probands (who avoid openly sharing their data) | Rare functional variants (genes) per proband (median) | # probands with rare functional variants in gene (top 3, descending order) | Gene name | Proven causal gene for disease | Protection quotient (1 - # of variants shared of top gene / total # of variants) | Bandwidth (GB) | Compute (sec) | Network (sec) |
|---|---|---|---|---|---|---|---|---|---|
| Freeman Sheldon Syndrome | 3 | 258 (253) | 3; 2; 1 | MYH3; DBT; ACADVL | MYH3 | 1 - 3/767 = 99.6% | 0.02 | 0.15 | 4.91 |
| Hajdu-Cheney Syndrome | 7 | 278 (272) | 6; 3; 3 | NOTCH2; HLA-DRBI; MCC | NOTCH2 | 1 - 8/1853 = 99.6% | 0.03 | 0.18 | 7.29 |
| Kabuki Syndrome | 10 | 262 (257) | 8; 3; 3 | KMT2D; COL6A1; FLNB | KMT2D | 1 - 8/2754 = 99.7% | 0.04 | 0.22 | 9.59 |
| Miller Syndrome | 4 | 267 (258) | 4; 3; 2 | DHODH; DNAH5; ACOX2 | DHODH | 1 - 8/1063 = 99.3% | 0.03 | 0.18 | 7.29 |

(b) SETDIFF (over variants). Scenario: familial (trio). N/R: not revealed.

| Patient ID (avoid sharing w provider) | # rare functional variants | # proband only variants (revealed) | Gene name | Proven causal gene | Protection quotient | Bandwidth (GB) | Compute (min) | Network (min) |
|---|---|---|---|---|---|---|---|---|
| 115-f | 185 | N/R | N/R | ACTB | 1 - 2/524 = 99.6% | 18.1 | 1.7 | 56.7 |
| 115-m | 164 | N/R | N/R | | | | | |
| 115-a1 | 175 | 2 | ACTB; USH2A | | | | | |

(c) INTERSECTION (over variants). Scenario: 2 hospitals.

| Hospital | # suspicious variants (not shared) | Total intersecting variants (for patient phenotype comparison follow-up) | Protection quotient | Bandwidth (GB) | Compute (min) | Network (min) |
|---|---|---|---|---|---|---|
| Washington | 5,734 | 159 | 1 - 318/11,067 = 97.1% | 3.1 | 0.37 | 9.4 |
| Baylor | 5,333 | | | | | |
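The protection quotients in Table 1 follow directly from the fractions printed next to them, and a short calculation makes the arithmetic easy to recheck. The function name and the dictionary layout below are assumptions for illustration; the numerators and denominators are copied from the table. As printed, five of the six fractions round to the listed percentage, while the Miller Syndrome fraction, 1 - 8/1063, evaluates to roughly 99.25% rather than the listed 99.3%.

```python
def protection_quotient(shared: int, total: int) -> float:
    """Fraction of variants kept private: 1 - shared / total."""
    return 1 - shared / total


# (variants revealed, total variants) as printed in Table 1
TABLE_1_FRACTIONS = {
    "Freeman Sheldon Syndrome": (3, 767),
    "Hajdu-Cheney Syndrome": (8, 1853),
    "Kabuki Syndrome": (8, 2754),
    "Miller Syndrome": (8, 1063),
    "Trio (SETDIFF)": (2, 524),
    "Two hospitals (INTERSECTION)": (318, 11067),
}

if __name__ == "__main__":
    for scenario, (shared, total) in TABLE_1_FRACTIONS.items():
        print(f"{scenario}: {100 * protection_quotient(shared, total):.2f}%")
```

For the two-hospital row, the denominator equals the sum of the two hospitals' variant counts (5,734 + 5,333) and the numerator is twice the 159 intersecting variants, which is consistent with counting each shared variant once per hospital.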


Supplementary Figures and Tables

Supplementary Figure 1

Supplementary Figure 2

[Figure panels: (A) Genotype Data. (B) Vector Representation: Position Array (position, reference, alternate) holding True/False values, and Gene Array holding 0/1 values. (C) Small Cohort: per-gene counts are summed across probands and passed to MAX. Small Family: vectors for m, f and a1 are passed to SETDIFF. Two Hospitals: the vectors of Hospital 1 and Hospital 2 are combined with AND to give the INTERSECTION.]


Supplementary Figure 3

Supplementary Figure 4


Supplementary Figure 5


Figure and Table Legends

Figure 1. Yao's protocol for secure multiparty computation. Steps 0-7 describe the overall secure protocol for computing any function F between two or more parties, F(A,B,..,Y,Z). We first describe a secure two-party computation protocol between Alice (A) and Bob (B). Step 0: Alice and Bob are trying to compute a joint function without revealing their inputs to the other party. Step 1: Alice creates a key/box for each possible value of each input (0 or 1). Step 2: Alice double locks (double encrypts) each of the four possible outputs by placing the relevant output note in two boxes corresponding to each combination of the two inputs. Step 3: Alice gives Bob the option of choosing exactly one of two possible keys, labeled BT and BF. Step 4: Bob picks up exactly one key Bb, where b corresponds to his hidden input, which only he knows (the oblivious transfer protocol ensures that Bob can only pick up one key). After Bob makes his selection, Alice shuffles the doubly-locked boxes and hands them to Bob along with the key Aa corresponding to her input a. Steps 1-4 are repeated for each of the inputs to the function. Step 5: For each operator in the function that depends only on input values (i.e., the first "layer" of the circuit), Bob has four doubly-locked boxes and two keys Aa and Bb, but he does not know Alice's input and Alice does not know Bob's input. He uses Aa and Bb and tries to unlock all four boxes. Only one of the four doubly-locked boxes will successfully open, revealing the joint output without revealing Alice's or Bob's inputs. Step 6: The revealed output yields the key for the next operation (gate) in the circuit. Steps 5 and 6 are repeated for each operation in the function. At the end of the computation, instead of keys, Bob obtains the values that make up the output of the computation. Step 7: This secure two-party computation process can be expanded to N parties by using additive secret sharing between two non-colluding cloud servers. The N-input function is thus transformed into a two-input function.

Table 1. Summary of results for different secure genomic multiparty computation scenarios, all using real patient data.

Supplementary Figure 1. Representing genomic data as vectors for secure computation. (A) Each individual keeps their personal genome private. (B) They are asked to fill in a position array/vector with True and False values depending on whether they have a rare functional variant at the listed position, or a 0/1 value in a gene array depending on whether they have none/some rare functional variant(s) in each listed gene. (C) The resulting position/gene vectors are used to obtain the results of Table 1.
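The three vector operations can be written out in the clear to show what each scenario computes. This sketch ignores the secure protocol entirely and is meant only to pin down the input/output behavior; the function names and the small example vectors are assumptions chosen for illustration.

```python
def max_genes(gene_arrays):
    """Small cohort: sum 0/1 gene arrays and flag the gene(s) with the top count."""
    counts = [sum(column) for column in zip(*gene_arrays)]
    top = max(counts)
    return [int(count == top) for count in counts]


def setdiff(proband, relatives):
    """Small family: positions carried by the proband and by no listed relative."""
    return [
        bool(has_variant) and not any(relative[i] for relative in relatives)
        for i, has_variant in enumerate(proband)
    ]


def intersection(hospital_1, hospital_2):
    """Two hospitals: positions flagged by both (element-wise AND)."""
    return [bool(a) and bool(b) for a, b in zip(hospital_1, hospital_2)]


if __name__ == "__main__":
    cohort = [
        [0, 1, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 0, 1, 1, 0, 0, 0, 1],
        [0, 1, 0, 0, 1, 0, 1, 0, 0],
    ]
    print(max_genes(cohort))

    mother, father, child = [False, False, True], [False, False, False], [False, True, True]
    print(setdiff(child, [mother, father]))

    print(intersection([True, False, False, True], [False, False, True, True]))
```

Each operation acts on one vector position at a time apart from the final maximum, which is why the scenarios are straightforward to parallelize.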

Supplementary Figure 2. Encryption and decryption overview. A secret-key encryption scheme consists of three algorithms: (A) a setup algorithm which outputs a secret key (usually a long random string); (B) an encryption algorithm that takes in a secret key $k$ and a message $m$ and produces an encryption of $m$ (called a ciphertext); and (C) a decryption algorithm that takes in the same secret key $k$ and a ciphertext and produces the original message. We write $\mathrm{Enc}_k(m)$ to denote an encryption of the message $m$ under the secret key $k$. The correctness requirement for an encryption scheme states that decrypting the ciphertext output by $\mathrm{Enc}_k(m)$ using the secret key $k$ should yield the original message (plaintext) $m$. (D) The security requirement for a secret-key encryption scheme states that anyone who does not possess the secret key $k$ cannot distinguish an encryption of a message $m_1$ from an encryption of a message $m_2$, irrespective of the choice of messages $m_1$ and $m_2$. In other words, without the secret key, the ciphertext does not reveal any information about the encrypted message.

Supplementary Figure 3. Computation using circuits. (A) A Boolean circuit consists of a sequence of logic gates (e.g., AND gates, OR gates, and NOT gates). Each logic gate takes one or two bits as input and produces a single bit of output. In a Boolean circuit, the outputs of one logic gate can be used as the input to another logic gate. We refer to these values as the intermediate values in the computation. In the circuit depicted in the figure, the inputs to the circuit are denoted $x_1, x_2, x_3, x_4, x_5$ and the outputs of the circuit are denoted $y_1, y_2$. Specifically, this particular circuit implements a function over five input bits and produces two output bits. (B) Each gate in the Boolean circuit can be described by a truth table that specifies the mapping from each configuration of the input bits to the corresponding output bit. In the case of an AND gate, there are two input bits, and the output is 1 if and only if both input bits are 1. Otherwise, the output is 0.
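A circuit of this kind can be represented as an ordered list of gates, each evaluated through its truth table. The sketch below uses a hypothetical five-input, two-output circuit invented for the example; it is not the circuit drawn in the figure, and the wire names and the `evaluate` helper are assumptions.

```python
# Truth tables: map a tuple of input bits to the output bit.
TRUTH_TABLES = {
    "AND": {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1},
    "OR": {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1},
    "NOT": {(0,): 1, (1,): 0},
}

# Hypothetical circuit: y1 = (x1 AND x2) OR x3, y2 = NOT (x4 AND x5).
CIRCUIT = [
    ("AND", ("x1", "x2"), "t1"),   # t1 and t2 are intermediate values
    ("OR", ("t1", "x3"), "y1"),
    ("AND", ("x4", "x5"), "t2"),
    ("NOT", ("t2",), "y2"),
]


def evaluate(circuit, inputs, outputs):
    """Evaluate gates in order, looking each result up in the gate's truth table."""
    wires = dict(inputs)
    for gate, in_wires, out_wire in circuit:
        wires[out_wire] = TRUTH_TABLES[gate][tuple(wires[w] for w in in_wires)]
    return [wires[w] for w in outputs]


if __name__ == "__main__":
    bits = {"x1": 1, "x2": 1, "x3": 0, "x4": 1, "x5": 1}
    print(evaluate(CIRCUIT, bits, ["y1", "y2"]))
```

Garbling replaces each truth-table lookup with the decryption of one row, so the gate-by-gate evaluation order stays the same.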

Supplementary Figure 4. Performance scale-up for secure computation. Bandwidth, compute (CPU) time, and overall protocol execution (wall clock) time for the secure MAX, SETDIFF and INTERSECTION scenarios of Table 1, using a single thread on two servers, one located on the East Coast and the other on the West Coast. (A) When increasing the number of unrelated subjects in a small cohort study, all parameters grow logarithmically. (B) When increasing the number of family members in an affected/nonaffected scenario, parameters also grow logarithmically. (C) In the two hospital scenario, when increasing the number of genomic positions of potential interest (e.g., from the exome to the non-coding genome), all parameters grow linearly. Note that all three scenarios (A-C) perform the bulk of their computation on each element of the input vector separately (Supplementary Figure 1c). All scenarios are thus simple to parallelize for maximum speed-up using multiple threads and nodes.

Supplementary Figure 5. Boolean building blocks for composing complex functions. (A) The basic building blocks we use to build our circuits for identifying common mutations and shared de novo variants include addition, comparison, equality, and multiplexer circuits. An addition circuit $\mathrm{ADD}_k$ on $k$-bit inputs takes two $k$-bit values and outputs the $k$-bit representation of their sum (addition is performed modulo $2^k$). The $\mathrm{LT}_k$ and $\mathrm{EQ}_k$ circuits implement the less-than and equality operations, respectively, on $k$-bit inputs. The $\mathrm{MUX}_k$ circuit implements a multiplexer circuit which, on inputs a selection bit $b \in \{0,1\}$ and two $k$-bit values $x_0, x_1$, outputs $x_b$. The individual circuits can be efficiently constructed using AND gates and XOR gates, as described by Kolesnikov et al. [41]. These basic circuit building blocks can be composed to build a max circuit $\mathrm{MAX}_k$ on two inputs (each of length $k$), which in turn can be used to build a max circuit on $n$ inputs. (B) This circuit computes the argmax over $n$ additively secret-shared values $v_1, \dots, v_n$. The circuit operates by first combining the shares and then taking the max over the resulting vector of values. The argmax is represented by a bit-string of length $n$, where position $i$ has value 1 if $v_i$ is equal to the max value, and 0 otherwise. (C) This circuit computes the set of genes (represented by indices) that are present in a test vector $v_1, \dots, v_n$ but not present in a pool $w_1, \dots, w_n$. The counts of the mutations appearing in the pool are additively secret shared. The circuit first combines the shares, and then identifies the indices $i$ that appear in the test vector ($v_i = 1$) but are not present in the pool ($w_i = 0$). The circuit outputs $b_i = 1$ if the gene indexed by $i$ occurs in the target genome but not in the test pool, and 0 otherwise.
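The way the building blocks compose can be sketched at the level of integers rather than individual gates. The functions below mirror the roles of the multiplexer, the two-input max, and the circuits of panels (B) and (C), evaluated in the clear; every name, the modulus, and the example shares are assumptions for illustration, and the real circuits operate bit by bit inside the garbled computation.

```python
def mux(select: int, x0: int, x1: int) -> int:
    """Multiplexer: output x1 when the selection bit is 1, otherwise x0."""
    return x1 if select else x0


def max2(x: int, y: int) -> int:
    """Two-input max: a less-than comparison feeding a multiplexer."""
    return mux(int(x < y), x, y)


def combine(shares_1, shares_2, modulus: int):
    """Add the two additive shares of each value modulo the modulus."""
    return [(s1 + s2) % modulus for s1, s2 in zip(shares_1, shares_2)]


def argmax_bits(shares_1, shares_2, modulus: int):
    """Panel (B): bit i is 1 exactly when value i equals the maximum."""
    values = combine(shares_1, shares_2, modulus)
    top = values[0]
    for value in values[1:]:
        top = max2(top, value)          # n-input max as a chain of two-input max
    return [int(value == top) for value in values]


def setdiff_bits(test_vector, pool_shares_1, pool_shares_2, modulus: int):
    """Panel (C): bit i is 1 when the test vector has gene i and the pool count is 0."""
    pool_counts = combine(pool_shares_1, pool_shares_2, modulus)
    return [int(t == 1 and w == 0) for t, w in zip(test_vector, pool_counts)]


if __name__ == "__main__":
    modulus = 256
    # Shares of the values [1, 3, 3, 0]
    print(argmax_bits([200, 5, 250, 17], [57, 254, 9, 239], modulus))
```

Ties are reported at every position that attains the maximum, matching the bit-string output described in panel (B).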
