Revealing the causative variant in Mendelian patient ...computation on encrypted data. These provide...
Transcript of Revealing the causative variant in Mendelian patient ...computation on encrypted data. These provide...
1
TitleRevealingthecausativevariantinMendelianpatientgenomeswithoutrevealingpatientgenomes
AuthorsKarthikA.Jagadeesh1,5,DavidJ.Wu1,5,JohannesA.Birgmeier1,DanBoneh1,2,6,GillBejerano1,3,4,6
Affiliations1DepartmentofComputerScience,StanfordUniversity2DepartmentofElectricalEngineering,StanfordUniversity3DepartmentofDevelopmentalBiology,StanfordUniversity4DepartmentofPediatrics(MedicalGenetics),StanfordUniversity5Theseauthorscontributedequally6Correspondingauthors:[email protected](D.B)[email protected](G.B)
AbstractGiventherapidlygrowingutilityofcriticalhealthinformationrevealedinthehumangenome,securegenomiccomputationisessentialtomovingforward,especiallyasgenomesequencingbecomescommonplace.Wedeviseandimplementproof-of-principlecomputationaloperationsforpreciselyidentifyingcausalvariantsinMendelianpatientsusingsecuremultipartycomputationmethodsbasedonYao’sprotocol.Weshowmultiplerealscenarios(smallpatientcohorts,trioanalysis,twohospitalcollaboration)wherethecausalvariantisdiscoveredjointly,whilekeepingupto99.7%ofallparticipants’mostsensitivegenomicinformationprivate.Allsimilaroperationsperformedtodaytodiagnosesuchcasesaredoneopenly,keeping0%ofparticipants’genomicinformationprivate.Ourworkwillhelpusherinanerawheregenomescanbebothutilizedandtrulyprotected.
IntroductionRarediseasesaffect1in33babies.ExomeandgenomesequencinghaverevolutionizedthediagnosisofthousandsofrareMendeliandiseasestothousandsofdifferenthumangenes1–3.ThousandsofadditionalrareMendeliandiseasesandhumangenesawaitdiscovery.Frequency-basedfiltershaveprovenextremelyeffectiveinprovidingdiagnosisinsuchcases4.Inessence,variantsfoundinacontrolpopulation(commonvariants)arelikelytobebenign5whilefunctionalrarevariantsnotfoundinthecontrolpopulationbutseeninmultipleaffectedindividualsarelikelytobediseasecausing6–8.Thesefiltersseekthegeneorvariantpresentinall(most)affectedindividualsbutinno(veryfew)unaffectedindividuals.
Forexample,onecantakeasmallcohortofunrelatedindividualssuspectedofsufferingfromthesamegeneticdisorder,andcomparetheirgenomestothatoftensofthousandsofunaffectedindividuals(e.g.,fromtheexomeaggregationconsortium,ExAC5).Asweshowbelow,inmultiplescenarios,thegenewithrarefunctionalmutationsinmostpatientsinoursmallcohortsisindeedcausaloftheircondition. Frequency-basedcomputationhighlightsthefundamental“serveorprotect”dilemmaofgenomicdata:“Serve:”tofindtherootcauseofapatient’sdisease,onewishestocompareapatientgenometoasmanyothergenomesaspossible,bothaffectedandunaffected,relatedandunrelated.Thus,toadvancemodernmedicine,allsequencedgenomesshouldbeshared.“Protect:”one’sgenomecontinuestorevealmoreandmoreaboutoneself,includingsusceptibilitytoavarietyofdiseases9.
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
2
Sharingitwithotherscanleadtodiscriminationandbias.Toprotectitsownerandnextofkin,nosequencedgenomeshouldbeshared. Todate,thisdilemmahasbeensolvedbyallowinginstitutionsunrestrictedaccesstoallthegenomesintheirpossession.Limitedsharingbetweeninstitutionsisdonebyprovidingobfuscatedsummarystatistics10.Currentcommonlyadoptedmethodsforsharinghaveshortcomingsthatmakethemsuboptimal.Providingfullaccessatindividualinstitutionsallowsfortoomuchinformationtobesharedincertainsituations11.Disease-specificbeaconsarepronetoattackandcanendupidentifyingindividualsparticipatinginthestudy12.Beaconsalsoonlyprovideallele-presencequerycapabilitiesanddonothavetheflexibilityneededforanalyzingmultifactorialvariantinteractionswithinanindividual13.Itisalsoriskytosharegenomicdataintheclearwiththird-partyservicesspecializingingenomicanddiseaseanalysis.Weareunawareofanycryptographically-securemethodforsharinggenomicdatatoperformcomputationaloperationsthatallowidentifyingcausalvariantsinpatients. Tobetterresolvethisdilemma,wefirstnotethatwhileallofthegenomicvariantsfromallindividualsareneededtoperformthecomputation,onlyahandfulofcausalvariantsareultimatelyofinterestinthecontextofMendelianpatients(intheexampleabove,justtherarevariantsinthesinglegenemutatedinmostpatients).
Weintroducehereamodern,proof-of-conceptcryptographicimplementationwhichbothservesandprotects.Thesecurecomputationcanberunonentiregenomes(Serve),whilenopartyinvolvedinthecomputationlearnsanythingabouttheinputsoftheotherparticipantsexceptfortheoutputwhichiscomputedtogether(Protect).Weuserealpatientdatatoshowthatoursecureimplementationrevealsminimalinformationwhilediagnosingpatientgenomesthrough3differentstrategiesusingpracticalamountsofcomputetimeandmemory.Cryptographicmethodshavebeenusedindifferentgenomiccontextssuchasmicrobiomeanalysis14,GWASanalysis15andgenomicalignment16,butthisisthefirstimplementationthatweareawareofthatisgearedtowardsdiagnosingMendelianpatients,atimelyandpotentneed.
MethodsRepresentinggenomicdataasvectorsAssumeeachindividualinvolvedinastudyhasprivateaccesstotheirexome(orgenome).Ifwearelookingtoidentifyacausalvariant,wedefineavariantvector(longlistoflength28,413,589)ofallpossibleraremissense/nonsensevariantsinthehumangenomefromthefirstgeneonchromosome1tothelastgeneonchromosomeY.Weprovideacopyofthisvectortoeachindividual(affectedandunaffected),andaskthemtoprivatelydenoteTrue/Falsenexttoeachvariant(toindicatewhethertheyhavethespecificmutationornot,respectively).Ifwearelookingtoidentifyacausalgene,weprovideeachindividualagenevectorof20,663genesinthehumangenomefromA1BGtoZZZ3.Weaskthemtowrite“1”nexttoageneiftheyhaveoneormorerarefunctionalvariantsinthisgene,andotherwise,theywrite“0”.SeeSupplementaryFigure1A,B.
Definingcomputationsofinterest(MAX,INTERSECTION,SETDIFF)Wedefinethreeoperationsusedforpatientdiagnosis(SupplementaryFigure1C).Imaginetwoaffectedindividualsarerepresentedbytworarefunctionalvariantvectors(True/Falselists).Intersectingthesetwovectorswillrevealalltherarefunctionalvariantstheyshare.Formally,weperformaBooleanINTERSECTION(orAND)operation(xANDy=Trueonlyifx=y=True,andotherwise,itisFalse)betweenallpossiblepatientvariants.Next,ifwealsohaveaccesstoanunaffectedfamilymember,we
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
3
canfurtherexcludeanyvarianttheyshare.WedothiswithaBooleansetdifference(SETDIFF)operation(xSETDIFFy=Trueonlyifx=Trueandy=False,otherwiseitisFalse).Finally,imaginewehaveaccesstoasmallcohortofunrelatedpatientssharingasetofphenotypes.Wewouldliketofindthegeneaffectedbyoneormorerarefunctionalvariantsinthegreatestnumberofpatientswithinthecohort.Forthis,weusethepatientgenevectors(0/1lists).Wesum0/1sacrosspatientsforeachgene,andthenweusethemaximum(MAX)operationtofindtheentry(gene)withthegreatestnumber(ofaffectedcases;SupplementaryFigure1C).
Remarkably,moderncryptographyallowsanynumberofindividualstojointlylearnthefinalresultoftheseMAX,SETDIFF,INTERSECTIONoperationswithoutanyofthemlearninganythingelseabouteachother’sgenomes(orvectors).
EncryptionanddecryptionAnimportantcryptographicprimitivewerelyonisasecret-keyencryptionscheme.Inasecret-keyencryptionscheme,asecretkeyisusedtoencryptanddecryptmessageswiththeguaranteethattheencryptionsofanytwomessagesareindistinguishable,andyet,theycanbesuccessfullydecrypted(toobtaintheoriginalmessage)giventhekey(SupplementaryFigure2).
SecuremultipartycomputationMultiplemathematicalframeworksandcomputationalimplementationsexistforsecuremultipartycomputationonencrypteddata.Theseprovidedifferenttradeoffsincomplexityandefficiency17.Inthiswork,weuseYao’sprotocoltosecurelyevaluatefunctionsbetweentwoparties18.Abstractly,wewritethefunctionas!(#$, #&)where#$denotestheinputofthefirstpartyand#&denotestheinputofthesecondparty.Anyfunction! #$, #& canberepresentedbyacombinationofBooleanoperations(forexample,seeSupplementaryFigure3).Yao’sprotocolprovidesawayofevaluatingtheBooleancircuit(operator-by-operator)withoutrevealingtheinputs#$,#&.WeillustratethisindetailinFigure1.
WhileYao’sprotocolprovidesasimpleandefficientsolutionforsecuretwo-partycomputation,inmanyofthescenarioswedescribe,thecomputationoccursamongmultipleparties(e.g.,manyindividuals,eachwiththeirpersonalgenome).Itisverystraightforwardtoreducethegeneralproblemofsecuremultipartycomputationtothatofsecuretwo-partycomputationbyworkinginthe“two-cloud”model.Inthetwo-cloudmodel,weassumethattherearetwonon-colludingservers(e.g.,thesecouldbemanagedbytwoindependentgovernmentagencies)thataggregatetheinputsfromeachpartyinaprivacy-preservingmannerandthenperformthecomputation.EachserveronitsownhasnoknowledgeofthedataasshowninFigure1.7(seeOnlineMethodsandDiscussion).
ProtectionquotientWedefinetheProtectionQuotientasthefractionofprivateinformationthatisnotexposed(toneithertheotherparticipantsnortheentityrunningthecomputation)duringthecomputation.Usingourencryptionscheme,theProtectionQuotientequalsthetotalnumberofpatientvariantswithheldfromtheoutputdividedbythetotalnumberofpatientvariantsinputintothecomputation.Standardunencryptedpatientdiagnosisoperationshaveaprotectionquotientof0%,becauseallvaluesmustbeexposedtoperformthecomputation.Allourapplicationsbelowhaveaprotectionquotientof97.1-99.7%,maximizingprivacywhileretainingfullutility.
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
4
ResultsExampleMendelianapplicationsofsecurecomputationToprovethepragmaticutilityofourapproach,wedemonstratethreedifferentsecureoperationsoverrealMendelianpatientswherewesuccessfullyidentifythecausalvariantsineachscenario(Table1):
MAXidentifiesthecausalgeneinsmallpatientcohortswithprotectionquotientabove99%Weuse4smallcohortsofunrelatedindividuals,sufferingfromverydifferentrarediseases:FreemanSheldonSyndrome(FSS),Hadju-CheneySyndrome(HCS),KabukiSyndrome(KaS)andMillerSyndrome(MiS).Eachindividualholdsaprivatelistof211-374rarefunctionalvariantsin210-356genes(total767-2,754variantspercomputation).WeusethesecureMAXfunctiontorevealonlythetopgenemutatedacrosspatientsineachcohort.Inall4cohorts,wefindthatthegenemutatedinmostindividualsistheonethathasbeenproventobethecausalgene:MYH3inFSS6,NOTCH2inHCS19,KMT2DinKaS8andDHODHinMiS7(Table1a).
Securecomputationonlyrevealsthevariantsinthemostmutatedgeneineachcohortwhileprotectingtheremaining764variantsinFSS,1845variantsinHCS,2746variantsinKaSand1055variantsinMiS.Thiscomputationhasaprotectionquotientof99.3-99.7%forall4cohortdiseasedatasets.Thecomputationisperformedoverall20,663genesandcompletesinjust5-10seconds,withoneserverontheEastCoastandtheotherontheWestCoast(Table1a).Thetotalprotocolexecutiontime,bandwidthandcomputetimeallgrowlogarithmicallywiththenumberofcohortindividualsinvolvedinthesecurecomputation(SupplementaryFigure4A).
SETDIFFidentifiesthecausalvariantinatriowithprotectionquotient99.6%Unaffectedmotherandfather,andaffectedmalechildwithfemaleexternalgenitalia,eachholdsalistof164-185(total524)rarefunctionalvariantsfoundintheirexomes.ThesecureSETDIFFoperationrevealstothefamilyandtestprovidersonly2rarevariantsfoundinthechildbutinneitherparent(Table1b).Literaturereviewprovidesadiagnosisbasedononeofthesetwovariants:theACTBgene20.
Securecomputationkeeps522variantsprivatewhilesharingonly2variantswiththetestproviderandallindividualsinvolvedinthecomputation.Thiscomputationhasaprotectionquotientof99.6%.Becausethreepartiesarenowinvolved,thetotalcomputationtimeusingasingleserverthreadoneithercoastis57minutes(Table1b).However,thevariantlistcaneasilybesplitbetweenasmallcomputerarrayoneithercoast,suchthatatypical30-nodeclusterbringscomputationtimedowntounder2minutes.Theprotocolexecutiontime,bandwidthandcomputetimeallgrowlogarithmicallywiththenumberoffamilymembersinvolvedinthesecurecomputation(SupplementaryFigure4B).
INTERSECTIONidentifiespatientsofinterestacross2hospitalswithprotectionquotient97.1%Twoormoregenomecentersmaywanttocomparetheirpatientliststoseeiftogethertheycanfindmultiplepatientswiththesamerarefunctionalmutation,andsimilarphenotypes,whilerevealingnothingelsetoeachother.Forexamplewetook928WashingtonMendelianCenter(WMC)patientsand282BaylorHopkinsCenter(BHC)patients.Foreachhospitalwepreparedalistofover5,000rarefunctionalvariantsseeninoneormoreoftheirpatients.UsingthesecureANDfunction,thetwohospitalsfindashortlistofjust159variantspresentinbothhospitals,pointingatpatientswhowouldbenefitfromphenotypecomparison.Thisshortlistincludes“positivecontrols”suchasknowndiseasevariantNOTCH1:p.E694K,associatedwithpartial/incompletepenetranceofaorticvalvedisease21.IndeedtheWMCandBHCpatientsarephenotypicallycharacterizedwithleftventricularoutflowdefectandthoracicaorticaneurysm,respectively.Thelistalsooffersexcitingnovelgene-diseaseassociations
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
5
suchasrarefunctionalvariantHCN3:p.R648H(withfrequency5.47·10-5and0inExACand1000genomesdata,respectively).HCN3isavoltage-gatedcationchannelgene,whosemouseknockoutcausesabnormalventricularactionpotentialwaveform22.Promisingly,inpatientsfromWMCandBHC,thismutationiscorrelatedwithdilatedcardiomyopathyandcoarctationoftheaorta,respectively.
Securecomputationonlyreveals2x159potentialcausativevariantswhileprotectingtheremaining10,749variantswithaprotectionquotientof97.1%.Thiscomputationisperformedoverallrarefunctionalvariantsintheexomewithatotalprotocolexecutiontimeof9.4minutesusingasingleserverthreadoneithercoast(Table1c).Becauseeveryvariantisevaluatedindependently,a30-nodecomputeclusteroneitherendwillreducetotalcomputationtimetobelow20seconds.AswelearntoappreciateMendelianmutationsoutsideoftheexome,thetotaltime,bandwidthandcomputetimescalelinearlywiththesizeofthevariantlistsharedforsecurecomputation(SupplementaryFigure4C).
DiscussionRarediseasesarecumulativelycommon(someestimatethat10%oftheUSpopulationareaffectedwithraredisorders).About7,000rareMendelianconditionshavebeendescribedtodate.Ofthese,approximately4,000havebeendefinitivelydiagnosedassinglegenediseases,mappingtoover4,000genesinthegenome.Theprocedureswedescribeareapplicableforallofthese.Thereareonlyahandfulofmedicalconditionsdiagnosedwithcertaintytotheinteractionsofjust2genes.Farlessisknownfordiseasescausedbymorethan2genes.PersonalGenomicsposesafundamental“serveorprotect”dilemma:shouldoneservetheirgenomeintheserviceofbetterdiagnosisandultimatelydiseaseeradication,orshouldoneprotectoneselfandnextofkinagainstpotentialdiscriminationbyrefusingtosharetheirgenome.ThisdilemmaisparticularlyevidentinthefieldofMendeliandiseases.Itisessentialtodeveloptoolsandmethodstoeffectivelysharegenomeswhilemaintainingtheirprivacyandsecurity.Becausegenomeprivacyisbestservedwhereadefinitivediagnosisexists,wefocusonsinglediseasegenediscoveryanddiagnosis.HerewepresentasecureapproachformultiplepartiestoperformexactcomputationsthatdiagnoseMendeliandiseases,whilekeepingallparticipatinggenomesprivate.
Thescenarioswepresentareallreal.Genomeprivacyisextremelyappealinginallofthem:Completestrangersindiseasecohorts(Table1A)learnnothingabouteachotherexcepttheirshareddisease-causinggenemutations.Forparticipantswheretheassaydoesnotprovideananswer,absolutelynothingisrevealed.Inlargerfamilytrees,moredistantlyrelatedmemberswillappreciategenomeprivacy.Eveninayoungnuclearfamily(e.g.,atrio;Table1B),thetestproviderlearnsalmostnothingexceptthelikelydisease-causingmutationintheoffspring.Moreover,theylearnvirtuallynothingabouttheparentsthemselves.Inthetwohospitalscenario(Table1C),onlyvariantsthatareworthwhilecomparingarerevealedwhilethevastmajorityofvariantsremainprivatetoeachinstitute’sresearchersandpatients.
Inallofthesecases,thequantitiesrevealedandthosethatremainprivateareaprivacyadvocate’sdreamcometrue:Wepredominantlyrevealonlyvariant/scrucialforpatientdiagnosis,familycounselingandanypotentialtreatment.Whatremainsprivateispredominantlyvariantsofunknownsignificance(VUS)thatareoflittlevaluefordiagnosingone’smedicalcondition.However,thesesameVUSvariantsalmostcertainlyuniquelyidentifyapersonasaparticipantinananalysis,andhavethepotentialtorevealnoworinthefutureotherpersonaltraitsthatmaybefurthercausefordiscrimination.
Forthisproof-of-conceptwork,weassumethattheprotocolparticipantsare“honest-but-curious,”(sometimesreferredtoas“semi-honest”)—thatis,weassumethatthepartiesareproperly
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
6
incentivizedtohonestlyfollowtheprotocol,butattheendoftheprotocolexecution,theymaytrytolearnsomeadditionalinformation(aboutotherparties’inputs)basedonthemessagestheyreceiveduringtheprotocolexecution.Wesaythataprotocolissecureiftheonlyinformationanypartylearnsbyparticipatingintheprotocolcanbeinferredjustfromthatparties’inputandtheoveralloutputofthecomputation.Inotherwords,noneofthepartiesshouldbeabletolearnsomethingaboutanotherparties’inputotherthanwhatisexplicitlyrevealedbytheoutputofthefunction.
Yao’sprotocolgivesanefficientsolutionforsecuretwo-partycomputationinthepresenceofsemi-honestadversaries23.Wenotethattherearewell-establishedwaystoextendYao’sprotocoltoadditionallyprovidesecurityagainstmaliciouspartieswhodeviatefromtheprotocoldescriptioninordertocompromisetheprivacyofotherparticipantsorcorrupttheresultsofthecomputation24.Inaddition,protectingagainstparticipantsthatsubmitmalicious(ormalformed)inputstotheprotocolcanbedonebyensuringthatifaparticipant’svariantvectordoesnotmeetcertaincriteria,orisnotaccompaniedbyanappropriatecertificate,thenthecomputationabortsanddoesnotproduceanyoutput.Furthermore,inthispaper,weintroduceanoperation-specific“protectionquotient”,anovelmetrictoassessthefractionofinformationsecuredbythecomputation.Theprotectionquotientcanbeusedtofurtherrestricttheoutputreturnedtoallpartiesifthedefinedprivacyrequirementsarenotmet.Forinstance,ifatrioanalysisresultsinmorethanafewexpecteddenovoexomemutations,onlyanerrormessagewillbeproduced.Thisapproachispreferredforexampletodifferentialprivacy25,26whichaddsrandomgenomicvariationasnoiseintoaggregatedsummarystatisticstotryandavoidindividualidentificationinpooledgenomicsdata15.
Thebasicprincipleunderlyingourdesignistoperformexactsecurecomputationonthecomplete(private)genomesofallparticipatingindividuals.Thisisindirectcontrasttothemoretraditionalandlesseffectiveroutesofpublishingobfuscatedfrequenciesaggregatedacrossmultipleindividuals.Thecomputationalresourcesweusetoretaingenomicprivacyarenotnegligible,yetareperfectlywithinthecapabilitiesofoff-the-shelfmoderncomputerstocompletetheoperationinsecondsorminutes,evenwhencommunicatingbetweentheEastandWestcoasts.Andwhilenosecuritymechanismmaybeperfectlyimpenetrable,itiscertainlypreferabletohaveasecuritymechanisminplace(especiallyifitallowsforexactcomputation)wherenonecurrentlyexist.Manyfurtherextensionsandapplicationsofourcomputationalframeworkarepossible,andaresuretoprovideincentivesforthedevelopmentofmoresecureandfastermethods.Awidespreaddeploymentofcomputerlibrariesefficientlyimplementingtheseprincipleswillencourageindividualstosecurelycontributetheirgenomesforthecommongood,andthusgreatlyfueladvancesinbothpersonalgenomicsandprivacyinthe21stcentury.
On-LineMethodsPatientdatasetsWholeexomesequencesofpatientswereobtainedfromdbGaPstudiesphs000204.v1.p16(FreemanSheldonSyndrome),phs000244.v1.p17(MillerSyndrome),phs000295.v1.p18(KabukiSyndrome),andphs000477.v1.p1(Hajdu-CheneySyndrome).Pre-processedvariantcallformat(VCF)filesforpatientsfrom2CentersforMendelianGenomicswereobtainedfromdbGaPstudiesphs000693.v4.p1(UniversityofWashington),andphs000711.v3.p1(BaylorHopkins).OurtriofamilywasobtainedfromStanfordHospital.AllhumansubjectresearchwasperformedunderguidelinesapprovedbytheStanfordInstitutionalReviewBoard.
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
7
SequencingreadsweremappedtotheGRCh37/hg19assemblyofthehumangenomeusingBWAMEMv0.7.10-r78927.VariantswerecalledusingGATKv3.4-46-gbc02625followingtheHaplotypeCallerworkflowfromtheGATKBestPractices28.
VariantannotationANNOVARv527wasusedtoannotatevariantswithpredictedeffectonproteincodinggenesusinggeneisoformsfromtheENSEMBLgenesetversion75forthehg19/GRCh37assemblyofthehumangenome29,30.Allcanonicalgeneisoformswereusedwherethetranscriptstartandendaremarkedascompleteandthecodingspanisamultipleofthree.
CryptographictechniquesInasecuremultipartycomputation(MPC)protocol18,31,agroupofusers(oftencalledparties)seektojointlycomputeafunctionovertheirinputswithoutrevealinganyadditionalinformationabouttheirparticularinputs.Thefunctionthatthepartiescomputeisdeterminedbasedonthespecificscenario.Thecomputationconsistsofseveralroundsofinteraction,whereineachround,thepartiesexchangeaseriesofmessages.Attheconclusionoftheprotocol,eachparticipantlearnstheoutputofthecomputationevaluatedoneveryone’sjointinput.Noadditionalinformationbeyondtheexplicitoutputisrevealedtoanyparty(theprocessisabstractedinFigure1).
EveryarithmeticcomputationcanbeexpressedasasequenceofBooleanlogicaloperations(thatis,operationsonbits 0,1 ).Thisispreciselyhowthemoderncomputerworks.Yao’sprotocolallowstwousers,AliceandBob,tocomputearbitraryfunctionsovertheirinputs.Moreprecisely,ifAlicehasaninput#andBobhasaninput*,Yao’sprotocolallowsthemtocompute!(#, *)inawaysuchthatAlicelearnsnothingabout*andBoblearnsnothingabout#otherthantheoutputvalue!(#, *).Ingeneral,expressingafunctionintermsofBooleanoperationsgreatlyincreasesthecomputationalcostofevaluatingthefunction.TomaximizetheefficiencyofYao’sprotocol,itisimportanttochoosefunctionalitieswithsimpleorcompactrepresentationsasBooleancircuits.AnexampleofaBooleancircuitisshowninSupplementaryFigure3.
Inthiswork,wecastdiagnosingMendelianpatientsas(simple)arithmetic/logiccomputationsthatadmitefficientBooleancircuitrepresentations.Wenowdescribehowthesecurecomputationprotocolswork.Todothiswefirstintroducetwostandardtoolsfromcryptography:(1)symmetric(secret-key)encryption32and(2)oblivioustransfer33–35.
EncryptionanddecryptionAsecret-keyencryptionschemeconsistsoftwofunctions:EncryptandDecrypt.Theencryptionfunctiontakesacryptographickeykandamessagemandoutputsaciphertext4.Thedecryptionfunctiontakesthecryptographickey5andaciphertext4andoutputsamessage6.Intuitively,encryptionanddecryptionareinverseoperations:ifweencryptamessageunderakey5,decryptingtheresultingciphertextwiththesamekey5recoverstheoriginalmessage.Moreprecisely,wecansaythatforanykey5andanymessage6,Decrypt 5, Encrypt 5,6 = 6.Inasymmetric(orsecret-key)encryptionscheme,boththeencryptionandthedecryptionfunctionsrequireknowledgeofthesecretcryptographickey.Thekeyisarandomstringdrawnfromsomekey-space.Theprecisenatureofthekey-spacevariesdependingonthedetailsoftheencryptionscheme,andisimmaterialtoourpresentationinthispaper.Anencryptionschemeisconsideredtobesecureiftheciphertextdoesnotrevealanyinformationabouttheunderlyingmessagetoanyuserwhodoesnotpossessthesecretencryptionkey(certainly,auserwhoholdsthesecretkeycandecryptandlearnthemessage).Oneway
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
8
toformalizethisistosaythatauserwhodoesnothavetheencryptionkeyisunabletotellanencryptionofamessage68apartfromanencryptionofanothermessage6$.Inotherwords,ciphertextshideallinformationabouttheirunderlyingmessagetoalluserswhodonothavetheencryptionkey.WeillustratethisinSupplementaryFigure2.
Underthisdefinition,messagescanalsobeencryptedmultipletimes.Forinstance,amessage6canbe“doubleencrypted”undertwokeys5$and5&byfirstencrypting6using5$andthenencryptingtheresultingciphertextusingthesecondkey5&.Thisprocedureyieldsanotherciphertext.Decryptionproceedsbyfirstdecryptingwithkey5&,andthendecryptingtheresult(aciphertext)with5$.Inparticular,wecanwrite
Decrypt 5$, Decrypt 5&, Encrypt 5&, Encrypt 5$,6 = 6
Securityofthedoubleencryptionschemefollowsdirectlyfromthesecurityoftheunderlyingencryptionscheme.Inparticular,auserwhodoesnothaveboth5$and5&cannotlearnanyinformationabouttheunderlyingmessagethathasbeendoublyencryptedusing5$and5&.Numeroussymmetric(secret-key)encryptionschemesexistintheliterature32.
OblivioustransferAnoblivioustransfer(OT)protocol33–35isatwo-partyprotocolbetweenasenderandareceiver.AnOTprotocolenablesthereceivertoselectivelyobtainoneoftwopossiblemessagesfromthesenderwithoutrevealingtothesenderwhichmessagethereceiverrequested.Moreprecisely,thesenderholdstwomessages,denoted58and5$andthereceiverholdsaselectionbit: ∈ 0,1 .AttheendoftheOTprotocol,thereceiverobtainsthechosenmessage5<andlearnsnothingabouttheothermessage5$=<.Thesenderdoesnotlearnanythingaboutthereceiver’schoicebit:.Numerousoblivioustransferprotocolshavebeenproposedintheliterature33–35.
OverviewofstepsforsecurecomputationInasecuretwo-partycomputationprotocol,Aliceholdsaninput# ∈ 0,1 >andBobholdsaninput* ∈0,1 >.Wewrite 0,1 >todenoteabinaryinputoflength?(e.g.,forinstance,?couldbeofthebinaryrepresentationofthevariantvectororthegenevectorwedefineinourmaintext).Theirgoalistocomputeafunction!(#, *)ontheirjointinput #, * .Thecomputationisconsidered“secure”ifattheendofthecomputation,theonlyinformationthatAliceandBoblearnisthefunctionvalue!(#, *)andnothingelseabouttheotherparty’sinput.Itisimportanttonoteherethatthefunctionoutput!(#, *)couldrevealsomeinformationabouttheinputs#and*(forexample,inourtrioscenario,whateverdenovovariantwereportinthechild,wecandeducebydefinitiondoesnotexistineitherparent).AswenoteintheDiscussionsection,weworkinthehonest-but-curiousmodelwhereweassumethatAliceandBobfollowtheprotocolspecificationasdirected,butmay,attheendoftheprotocolexecution,trytoinfersomeadditionalinformationabouteachother’sprivateinput.WenowdescribehowYao’sprotocolcanbeusedtosecurelyevaluateanyfunctionovertwoinputsinthehonest-but-curiousmodel.
ToapplyYao’sprotocol,itisfirstnecessarytorepresentthefunction!asaBooleancircuitoninputs#and*.Atthemostbasiclevel,thebuildingblockswehaveareAND(xANDy=Trueonlyifx=y=True,otherwiseitisFalse)andXOR(exclusive-or,xXORy=Trueonlyifx=Trueandy=False,orifx=Falseandy=True,otherwiseitisFalse)gates.Thesebasicgatescanbecombinedtoobtaincircuitsofarbitraryexpressivefunctionalities17.Inourdescriptionbelow,wewilloftentimesrefertotheconcreteexampleofsecurelyevaluatingtheANDfunctiononsingle-bitinputs(above).AvisualizationofthecompleteprotocolisgiveninFigure1.
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
9
OverviewofYao’ssecuretwopartycomputationprotocolWenowdescribeYao’sprotocol.Foreaseofpresentation,wepresentasimplified(butlessefficient)descriptionofYao’sprotocolhere.Ourimplementation(basedontheJustGarble36library)followsthehigh-levelblueprintdescribedhere,butincludesseveraloptimizations,notablythefree-XOR37andhalf-gate38optimizations.Step1:Foreachwireinthecircuit,Alicechoosestwokeys.RecallthatinaBooleancircuit,eachwireinthecircuitcantakeontwopossiblevalues(0or1;sometimesalsoreferredtoasFalseandTrue,respectively).Aliceassociatesoneofthekeyswiththewirevalue0andanotherkeywiththewire1.FortheparticularcaseofsecurelyevaluatingtheANDfunction@ = # ∧ *,Alicepicksthreepairsofkeysandassociatesonepairwitheachof#,*,[email protected], 5B$, 5C8, 5C$, 5D8, 5D$.Inthisexample,5B8isthekeyassociatedwiththeinputbit#takingonthevalue0and5D$isthekeyassociatedwiththe
[email protected]:Foreachgateinthecircuit,Aliceconstructsa“garbled”truthtable.Foreachrowinthetruthtable,thealgorithmtakesthekeyassociatedwiththevalueoftheoutputwireanddoubleencryptsitusingthetwokeysassociatedwiththevaluesofthetwoinputwires.FortheparticularcaseofevaluatingasingleANDgate,Alicewouldconstructthefollowingtableofciphertexts
• E$ = EncryptFGH(EncryptFIH 5D8 )
• E& = EncryptFGH(EncryptFIJ 5D8 )
• EK = EncryptFGJ(EncryptFIH 5D8 )
• EL = EncryptFGJ(EncryptFIJ 5D$ )
Fortheoutputwiresofthecircuit,insteadofdoubleencryptinganencryptionkey,Alicedirectlydoubleencryptsthevalueoftheoutputwire(e.g.,0or1).ThisstepisshowninFigure1.2.Step3:Aftergarblingthecircuit(Steps1-2/Figure1.1-1.2),thesecurecomputationbeginswithBobusingtheoblivioustransferprotocol(above)toobtainthekeysfortheinputwiresassociatedwithhisinput.Theoblivioustransferprotocolensuresthefollowing:Bobonlylearnsoneofthetwokeysassociatedwitheachofhisinputwires(thiswillensurethatBobcanonlyevaluatethefunctiononasinglesetofinputs),andAlicedoesnotlearnwhichwireBobrequested(thatis,AlicedoesnotlearnBob’sinput).FortheparticularcaseofevaluatingasingleANDgate,ifBob’sinputis: ∈ {0,1},thenBobwouldplaytheroleofthereceiverinanOTprotocolwithinput:.Alicewouldplaytheroleofthesenderwithmessages5C8,5C$ (thekeysassociatedwithBob’sinputwire).Attheendoftheoblivioustransferprotocol,Bobobtains5C<(thekeyassociatedwithhisinput),andlearnsnothingaboutthekeyassociatedwiththecomplementofhisinput(5C$=<).AlicelearnsnothingaboutBob’sinput:.ThisstepisshowninFigure1.3.Step4:AfterBobreceivesthekeysassociatedwithhisinputviatheoblivioustransferprotocol,AlicesendsBobthegarbledtablesassociatedwitheachgate(afterrandomlypermutingtherowsofeachtable).Additionally,AlicesendsBobthewireencodingsofherinput.FortheparticularcaseofevaluatingasingleANDgate,ifAlice’sinputisO ∈ 0,1 ,Alicewouldsend5BP(thekeyassociatedwith# = O)toBob.ThisstepisalsoshowninFigure1.4.Step5:Withallthisinformation,Bobcancompletethefunctionevaluationandcomputetheoutput.Inparticular,afterSteps3and4,BobshouldhaveasinglekeyforeachoftheinputwiresoftheBooleancircuit.Then,foreachgateinthecircuit,Bobtakestheinputkeyshehasandattemptstodecrypttherowsinthegarbledtableassociatedwiththatgate.Becausetheentriesinthegarbledtablearedouble
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
10
encryptedusingthekeysassociatedwiththeinputwirestothegateandBobonlyhasasinglekeyforeachofthewires,BobisonlyabletodecryptasinglerowinthegarbledtableasshowninFigure1.5.Step6:Indoingso,Bobisabletolearnoneofthekeysassociatedwiththegate’soutputwire(moreover,byconstructionofthegarbledtable,theoutputkeyBobobtainsispreciselytheoneassociatedwiththevaluecorrespondingtoevaluationofthegateontheinputbits).Thus,startingwiththeinputwires,Bobisabletoevaluatethecircuitgate-by-gateasseeninFigure1.6.OnceBobreachestheoutputlayerofthecircuit,heisabletodecrypttheciphertextsandobtainthevalueofeachoutputwire.Summary.Tosummarize,inYao’ssecuretwo-partycomputationprotocol,AlicebeginsbyconstructingagarbledtruthtableforeachgateintheBooleancircuit.Shedoessobydoubleencryptingeachrowinthetruthtable(usingthekeysassociatedwiththeinputbits).ShegivesBobthegarbledtruthtablesaswellasthekeysassociatedwithherinput.Usingoblivioustransfer,Bobobtainsthekeysassociatedwithhisinput.Armedwithasinglekeyforeachoftheinputwiresinthecircuit,Bobisabletoevaluatethegarbledcircuitgate-by-gate.Foreachgate,Bobtakeshisinputkeysandusesthemtodecryptoneoftherowsofthegarbledtableassociatedwiththegate.Thisyieldsthekeyassociatedwiththeparticularwire.Finally,attheendofthecomputation,Bobdecryptstheciphertextsassociatedwiththeoutputwirestolearntheoutputofthecircuit.BobthensendstheresultofthecomputationtoAlice.
ExtendingYao’sprotocoltoNpartiesYao’sprotocolallowstwopartiestosecurelyevaluateanarbitraryfunction.However,ingeneral,wedesiretocomputeacrossalargenumberofparties(e.g.,studyparticipants).Whiletherearesecuremultipartycomputationprotocolsthatsupportmorethantwoparties,(e.g.,theSPDZ39,GMW31,orBGW40protocols),akeylimitationoftheseprotocolsisthattheyrequireallparticipatingpartiestobeonlineduringtheprotocolexecution.Moreover,thenumberofroundsofcommunicationintheprotocoloftengrowswiththecomplexityofthecomputation(notethatthisisindirectcontrastwithYao’sprotocolwhichisatwo-roundprotocol,regardlessofhowcomplicatedthecomputationis).Asaresult,therearesubstantialengineeringhurdlestodeployingthesegeneralprotocolsformultipartycomputationacrossalargenumberofparties.Insomecases(e.g.,BGW40andGMW31),thetotalbandwidthalsoscalesquadraticallyinthenumberofparties,furtherlimitingthepracticalityoftheseprotocols.
Amoreefficientsolutionforgeneralmultipartycomputationthatavoidsboththerequirementthatparticipatingpartiesbeonlineduringtheprotocolexecutionaswellasthepotentialcommunicationblowupistoworkina“two-cloud”model.Inthismodel,weassumetherearetwonon-colludingcloudserversthatfacilitatetheprotocolexecution.Atthebeginningoftheprotocolexecution,eachoftheparticipatingparties“split”theirinputsandshareitwiththetwocloudservers.Aslongasthetwocloudsdonotcolludewitheachother,theydonotlearnanythingabouttheinputstothecomputation.Afterthetwocloudservershavereceivedtheinputsfromeachoftheparticipatingparties,theyengageinatwo-partysecurecomputationprotocol(suchasYao’sprotocol)tocomputethefunctionofinterest.Notably,thepartiesthatcontributedthedatadonothavetobeonlineduringthisstepoftheprotocol.Andmoreover,communicationisonlynecessarybetweenthepartiesandthecloudservers;partiesinparticulardonothavetocommunicatewitheachother.Inapracticaldeployment,thesetwocloudserversmightbemanagedbydistinctgovernmentalorganizationswithintheNIHorWHO.Thus,byworkinginthetwo-cloudmodel,itispossibletotransformanycomputationbetween?individualsintoasecuretwo-partycomputationbetweentwonon-colludingparties.
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
11
Step7:Wenowdescribehowtosecureevaluateanyfunctionalityinthetwo-cloudmodel.Supposethereare?partiesparticipatingintheprotocolexecutionandlet#$, … , #>denotetheirprivateinputs.Tosecurecomputeafunction!,eachofthe?participantschoosesarandomvalueRS andsendsRS tooneofthetwocloudservers.Theythensendtotheothercloudserverthevalue#S − RS (notethatthesubtractionisperformedmoduloalargeintegerU).OnceeverypartyhassubmittedtheirinputsRS and#S − RS tothetwocloudservers,thefirstcloudserverhasavectorofrandomvaluesO = (R$, … , R>)andthesecondcloudserverhasavectorofrandomdifferences: = (#$ − R$, … , #>– R>).BecausethesubtractionistakingplacemoduloU,thevaluesin:aredistributeduniformlyandmoreimportantly,independentlyofthe#S’s.Thepair(RS, #S − RS)isoftenreferredtoasan“additivesecretsharing”oftheinput#S.Thepropertythatthisadditivesecretsharingschemesatisfiesisthatasinglesharerevealsnoinformationabouttheinput,buttwosharescompletelydefinetheinput.Thismeansthataslongasthetwocloudserversdonotcollude,theylearnnoinformationabouteachparty’sinput(sincetheyeachpossessjustoneshareofthesecret).
Tocompletethesecurecomputation(ofafunction!),thetwocloudserverssimplyapplyYao’sprotocoltothefollowingtwo-partyfunctionality:
W R$, … , R> , #$ − R$, … , #>– R> = ! R$ + #$ − R$, … , R> + #>– R> = ! #$, … , #> .Inotherwords,thetwocloudscomputethefunctionalitythattakesasinputtwovectors(eachcontaining?values)andoutputsthefunction!evaluatedonthecomponentscorrespondingtothesumofthetwoinputvectors.Sincesummingtheinputvectorsinthiscasereconstructseachparty’sinput,thisprocedurecorrespondspreciselytoevaluating!ontheparties’inputs.Moreover,thetwocloudserversdonotlearnanyadditionalinformationaboutanyparticularparty’sinputbecausetheevaluationofWisperformedusingYao’sprotocol(whichisasecuretwo-partycomputationprotocol).ThisprocedureisshowninFigure1.7.
ConstructingourBooleancircuitsAsdescribedabove,arbitraryBooleancircuitscanbeconstructedusingonlyANDandXORgates.Toefficientlyrepresentourset-intersection-basedalgorithmsasBooleancircuits,wefirstconstructsomeintermediatebuildingblocksfromthebasicANDandXORgates.Theintermediatebuildingblockswerequireincludeadditioncircuits,comparisoncircuits,andequalitycircuits.Forthesebuildingblocks,weusethecircuitsbyKolesnikovetal.41(seeSupplementaryFigure5).
SoftwareimplementationInourimplementation,weusetheJustGarblelibrary36forourimplementationofYao’sgarbledcircuits,andweusetheAsharovetal.42implementationoftheoblivioustransferprotocols.Forbetterperformance,wealsoimplementthehalf-gatesoptimization38forYao’sgarbledcircuits.Thisimplementationwillbereleaseduponpublication.Forourbenchmarks,wesetupaclientandserveronAmazonEC2(tosimulatethetwocloudproviders),andmeasurethetotalcomputetime,bandwidth,andoverallprotocolexecutiontime(takingintoaccountthenetworkcommunication).Werunourexperimentsontwomemory-optimizedEC2instances(M4.2xlarge).Eachinstancerunsan8-core2.4GHzIntelXeonE5-2676v3(Haswell)processorandhas32GBofmemory.Whileourprotocolsarenaturallyparallelizable,weuseasinglethreadofexecutioninallofourexperiments,anddonottakeadvantageoftheavailableparallelism.Tosimulatethenon-colludingtwocloudmodel,weusedawide-areanetwork(WAN)settingwherethetwoserversarefarapart.WeplacedoneoftheserversontheWestCoast(specifically,intheNorthernCaliforniaavailabilityzone)andtheotherontheEastCoast(specifically,intheNorthernVirginiaavailabilityzone).
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
12
References1. Yang,Y.etal.ClinicalWhole-ExomeSequencingfortheDiagnosisofMendelianDisorders.N.Engl.J.
Med.369,1502–1511(2013).
2. Iglesias,A.etal.Theusefulnessofwhole-exomesequencinginroutineclinicalpractice.Genet.Med.
16,922–931(2014).
3. LeeH,DeignanJL,DorraniN&etal.CLinicalexomesequencingforgeneticidentificationofrare
mendeliandisorders.JAMA312,1880–1887(2014).
4. Rehm,H.L.etal.ACMGclinicallaboratorystandardsfornext-generationsequencing.Genet.Med.
Off.J.Am.Coll.Med.Genet.15,733–747(2013).
5. Lek,M.etal.Analysisofprotein-codinggeneticvariationin60,706humans.Nature536,285–291
(2016).
6. Ng,S.B.etal.Targetedcaptureandmassivelyparallelsequencingof12humanexomes.Nature
461,272–276(2009).
7. Ng,S.B.etal.Exomesequencingidentifiesthecauseofamendeliandisorder.Nat.Genet.42,30–35
(2010).
8. Ng,S.B.etal.ExomesequencingidentifiesMLL2mutationsasacauseofKabukisyndrome.Nat.
Genet.42,790–793(2010).
9. Moreno-Estrada,A.etal.ThegeneticsofMexicorecapitulatesNativeAmericansubstructureand
affectsbiomedicaltraits.Science344,1280–1285(2014).
10. Mailman,M.D.etal.TheNCBIdbGaPdatabaseofgenotypesandphenotypes.Nat.Genet.39,
1181–1186(2007).
11. Siu,L.L.etal.Facilitatingacultureofresponsibleandeffectivesharingofcancergenomedata.Nat.
Med.22,464–471(2016).
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
13
12. Shringarpure,S.S.&Bustamante,C.D.PrivacyRisksfromGenomicData-SharingBeacons.Am.J.
Hum.Genet.97,631–646(2015).
13. Regalado,A.NetworksofGenomeDataWillTransformMedicine.MITTechnologyReviewAvailable
at:https://www.technologyreview.com/s/535016/internet-of-dna/.(Accessed:30thSeptember
2016)
14. Wagner,J.,Paulson,J.N.,Wang,X.,Bhattacharjee,B.&CorradaBravo,H.Privacy-preserving
microbiomeanalysisusingsecurecomputation.Bioinformatics32,1873–1879(2016).
15. Simmons,S.,Sahinalp,C.&Berger,B.EnablingPrivacy-PreservingGWASsinHeterogeneousHuman
Populations.CellSyst.3,54–61(2016).
16. Popic,V.&Batzoglou,S.Privacy-PreservingReadMappingUsingLocalitySensitiveHashingand
SecureKmerVoting.bioRxiv046920(2016).doi:10.1101/046920
17. Lindell,Y.&Pinkas,B.SecureMultipartyComputationforPrivacy-PreservingDataMining.J.Priv.
Confidentiality1,59–98(2009).
18. Yao,A.C.-C.ProtocolsforSecureComputations.inAnnualSymposiumonFoundationsofComputer
Science160–164(1982).
19. Simpson,M.A.etal.MutationsinNOTCH2causeHajdu-Cheneysyndrome,adisorderofsevereand
progressiveboneloss.Nat.Genet.43,303–305(2011).
20. Rivière,J.-B.etal.DenovomutationsintheactingenesACTBandACTG1causeBaraitser-Winter
syndrome.Nat.Genet.44,440–444,S1-2(2012).
21. McBride,K.L.etal.NOTCH1mutationsinindividualswithleftventricularoutflowtract
malformationsreduceligand-inducedsignaling.Hum.Mol.Genet.17,2886–2893(2008).
22. Fenske,S.etal.HCN3contributestotheventricularactionpotentialwaveforminthemurineheart.
Circ.Res.109,1015–1023(2011).
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
14
23. Lindell,Y.&Pinkas,B.AProofofSecurityofYao’sProtocolforTwo-PartyComputation.JCryptol.22,
161–188(2009).
24. Hazay,C.&Lindell,Y.EfficientSecureTwo-PartyProtocols-TechniquesandConstructions.(Springer,
2010).
25. Dwork,C.DifferentialPrivacy.inICALP1–12(2006).
26. Dinur,I.&Nissim,K.Revealinginformationwhilepreservingprivacy.inPODS202–210(2003).
27. Li,H.&Durbin,R.Fastandaccuratelong-readalignmentwithBurrows-Wheelertransform.
Bioinforma.Oxf.Engl.26,589–595(2010).
28. McKenna,A.etal.TheGenomeAnalysisToolkit:aMapReduceframeworkforanalyzingnext-
generationDNAsequencingdata.GenomeRes.20,1297–1303(2010).
29. Wang,K.,Li,M.&Hakonarson,H.ANNOVAR:functionalannotationofgeneticvariantsfromhigh-
throughputsequencingdata.NucleicAcidsRes.38,e164–e164(2010).
30. Cunningham,F.etal.Ensembl2015.NucleicAcidsRes.43,D662-669(2015).
31. Goldreich,O.,Micali,S.&Wigderson,A.HowtoPlayanyMentalGameorACompletenessTheorem
forProtocolswithHonestMajority.inAnnualACMSymposiumonTheoryofComputing218–229
(1987).
32. Katz,J.&Lindell,Y.IntroductiontoModernCryptography.(ChapmanandHall/CRCPress,2007).
33. Rabin,M.O.HowToExchangeSecretswithObliviousTransfer.IACRCryptol.EPrintArch.2005,187
(2005).
34. Kilian,J.FoundingCryptographyonObliviousTransfer.inProceedingsofthe20thAnnualACM
SymposiumonTheoryofComputing,May2-4,1988,Chicago,Illinois,USA20–31(1988).
35. Naor,M.&Pinkas,B.ObliviousTransferandPolynomialEvaluation.inProceedingsoftheThirty-First
AnnualACMSymposiumonTheoryofComputing,May1-4,1999,Atlanta,Georgia,USA245–254
(1999).
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
15
36. Bellare,M.,Hoang,V.T.,Keelveedhi,S.&Rogaway,P.EfficientGarblingfromaFixed-Key
Blockcipher.inIEEESymposiumonSecurityandPrivacy478–492(2013).
37. Kolesnikov,V.&Schneider,T.ImprovedGarbledCircuit:FreeXORGatesandApplications.in
InternationalColloquiumonAutomata,LanguagesandProgramming486–498(2008).
38. Zahur,S.,Rosulek,M.&Evans,D.TwoHalvesMakeaWhole-ReducingDataTransferinGarbled
CircuitsUsingHalfGates.inEUROCRYPT220–250(2015).
39. Damgard,I.,Pastro,V.,Smart,N.P.&Zakarias,S.MultipartyComputationfromSomewhat
HomomorphicEncryption.inCRYPTO643–662(2012).
40. Ben-Or,M.,Goldwasser,S.&Wigderson,A.CompletenessTheoremsforNon-CryptographicFault-
TolerantDistributedComputation(ExtendedAbstract).inSTOC1–10(1988).
41. Kolesnikov,V.,Sadeghi,A.-R.&Schneider,T.ImprovedGarbledCircuitBuildingBlocksand
ApplicationstoAuctionsandComputingMinima.inCryptologyandNetworkSecurity1–20(2009).
42. Asharov,G.,Lindell,Y.,Schneider,T.&Zohner,M.Moreefficientoblivioustransferandextensions
forfastersecurecomputation.inACMCCS535–548(2013).
AuthorContributionsKJ,DW,DBandGBdesignedthestudy,analyzedresultsandwrotethemanuscript.KJandJBprocessedpatientdata.KJandDWwrotesoftwarefortheanalysis.
AcknowledgementsWethankDr.JonBernsteinandmembersoftheBonehandBejeranolabsforvaluablediscussionsandprojectfeedback.WealsothankStanfordpatientsandclinicians,aswellasthepatientsandprofessionalsinvolvedinthedepositionofthedbGaPsetsweuse.
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
16
Figures/Tables
Figure1
Alice prepares:
2 types of big boxes + keys(for each value she may be holding)
2 types of small boxes + keys(for each value Bob may be holding)
2 types of notes(with the 2 possible outcomes)
The notes fit in the small boxes,that lock and fit into the big boxes,that also lock.
T F
BT BF
AT AF
Alice holds a secret value a:a = True or a = False.
Bob holds a secret value b:b = True or b= False.
Alice and Bob want to compute togetherf(a,b) = a AND bwithout Bob discovering anything about a,or Alice discovering anything about b.
0 1
Alice puts on the table the two keys (labeled BT and BF) she prepared for the small boxes and leaves the room:
Bob enters the room and picks up the appropriate key “Bb”: BT if his secret value b = True, BF if b = False. After picking up his key, he leaves the room. Alice then gives Bob all four unmarked big locked boxes. She also gives him one unmarked key “Aa”: AT if her secret value a = True, AF if a = False.
BT BF
Bob now holds 4 unmarked big locked boxes,A key from Alice Aa, and his own key Bb.
He tries to get the note from all four boxes,using Aa on the big boxes, and Bb on the small ones.
By design Bob can only reach a single note.
This note holds the correct answer for a AND b,that Alice and Bob set out to compute together.
Alice has learned nothing about Bob’s value b(she has left the room before Bob picked his key).
Bob has learned nothing about Alice’s value a(he received from Alice an unlabeled key).
3
4
5
a AND bAlice
Bob
a
b
Alice
Bob
.
.
.
.
.
.f(A,B)
Instead of providing ananswer, Alice providesthe correct unmarkedkey for the next step.
A
B
G1 G2 Gn
1 0 0 … 1PrivateGenome
0 0 1 … 1RandomNumber
1 0 1 … 0
1 1 1 … 0
1 1 0 … 0
0 0 1 … 0
…
Computer AWest Coast
Computer BEast Coast
R1R2 Rn
…G1-R1 G2-R2
Gn-Rn
…
f(A,B) = f((G1-R1,G2-R2,…,Gn-Rn),(R1,R2,…,Rn)) =f(G1-R1+R1, G1-R2+R2,…,Gn-Rn+Rn) =f(G1,G2,…Gn) = secure computation
with all n genomes
……
…
R1 R2 Rn
76
Aa
TAlice’s a Bob’s b f(a,b) = a AND b
T T T
T F F
F T F
F F F
T
F
F
F
BT AT
=
FBF AT
=
FBT AF
=
FBF AF
=
Alice prepares four big locked boxesmatching all four possible computations:
2
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
17
Table1Operation RelevantInformationforeachOperation RunningTimeMeasurements(a) MAX
(overgenes)Scenario:Smalldiseasecohort
#unrelatedprobands(whoavoid
openlysharingtheirdata)
Rarefunctionalvariants
(genes)perproband(median)
#probandswithrarefunctionalvariant/singene(top3,descending
order)
Genename Provencausalgenefordisease
Protectionquotient
(1-#ofvariantssharedoftopgene/total#ofvariants)
Bandwidth(GB)
Compute(sec)
Network(sec)
FreemanSheldonSyndrome
3 258(253)3 MYH3
MYH3 1-3/767=99.6% 0.02 .15 4.912 DBT
1 ACADVLHajdu-Cheney
Syndrome7 278(272)
6 NOTCH2NOTCH2 1-8/1853=
99.6% 0.03 .18 7.293 HLA-DRBI3 MCC
KabukiSyndrome 10 262(257)
8 KMT2DKMT2D 1–8/2754=
99.7% 0.04 .22 9.593 COL6A13 FLNB
MillerSyndrome 4 267(258)
4 DHODHDHODH 1–8/1063=
99.3% 0.03 .18 7.293 DNAH52 ACOX2
(b) SETDIFF(overvariants)
Scenario:familial
PatientID(avoidsharingwprovider)
#rarefunctionalvariants
#probandonlyvariants(revealed)
Genename Provencausalgene
Protectionquotient
Bandwidth(GB)
Compute(min)
Network(min)
Trio
115-f 185 N/R N/R
ACTB 1-2/524=99.6% 18.1 1.7 56.7115-m 164 N/R N/R
115-a1 175 2ACTBUSH2A
(c) INTERSECTION(overvariants)
Scenario:2Hospitals
#suspiciousvariants
(notshared)
Totalintersectingvariants(forpatientphenotypecomparisonfollow-up)
Protectionquotient Bandwidth
(GB)Compute(min)
Network(min)
Washington 5,734 159 1–318/11,067=97.1% 3.1 0.37 9.4
Baylor 5,333
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
18
SupplementaryFiguresandTables
SupplementaryFigure1
SupplementaryFigure2
A GenotypeData B
PositionArray
C
SmallCohort
gene
geneposition reference
alternate
GeneArray
… F F T F …
… 0 1 0 … 0 1 0 … 0 1 0 …
m f
a1
… 0 1 0 … 0 1 0 … 0 1 0 …
… 0 0 0 … 1 1 0 … 0 0 1 …
… 0 1 0 … 0 1 0 … 1 0 0 …
… 0 2 0 … 1 3 0 … 1 1 1 …
MAX
SmallFamily
… F F … T …
… F F … F …
… F T … T …
… F T … F …
TwoHospitals
… T F F T …
… F F T T …AND
Hospital1
Hospital2
… F F F T …
VectorRepresentation
m
f
a1
SETDIFF
INTERSECTION
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
19
SupplementaryFigure3
SupplementaryFigure4
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
20
SupplementaryFigure5
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
21
FigureandTableLegendsFigure1.Yao’sprotocolforsecuremultipartycomputation.Steps0-7describetheoverallsecureprotocolforcomputinganyfunctionFbetweentwoormorepartiesF(A,B,..,Y,Z).Wefirstdescribea
securetwo-partycomputationprotocolbetweenAlice(A)andBob(B).Step0:AliceandBobaretryingtocomputeajointfunctionwithoutrevealingtheirinputstotheotherparty.Step1:Alicecreatesakey/boxforeachpossiblevalueforeachinput(0or1).Step2:Alicedoublelocks(doubleencrypts)eachofthefourpossibleoutputsbyplacingtherelevantoutputnoteintwoboxescorrespondingtoeach
combinationofthetwoinputs.Step3:AlicegivesBobtheoptionofchoosingexactlyoneoftwopossiblekeys,labeledBTandBF.Step4:BobpicksupexactlyonekeyBbwherebcorrespondstohis
hiddeninputwhichonlyheknows(theoblivioustransferprotocolensuresthatBobcanonlypickuponekey).AfterBobmakeshisselection,Aliceshufflesthedoubly-lockedboxesandhandsthemtoBobalong
withthekeyAacorrespondingtoherinputa.Steps1-4isrepeatedforeachoftheinputstothefunction.
Step5:Foreachoperatorinthefunctionthatdependsonlyoninputvalues(i.e.,thefirst“layer”ofthecircuit),Bobhasfourdoubly-lockedboxesandtwokeysAaandBbbuthedoesnotknowAlice’sinput
andAlicedoesnotknowBob’sinput.HeusesAaandBbandtriestounlockallfourboxes.Onlyoneof
thefourdoubly-lockedboxeswillsuccessfullyopen,revealingthejointoutputwithoutrevealingAlice’s
orBob’sinputs.Step6:Therevealedoutputyieldsthekeyforthenextoperation(gate)inthecircuit.Steps5and6arerepeatedforeachoperationinthefunction.Attheendofthecomputation,insteadof
keys,Bobobtainsthevaluesthatmakeuptheoutputofthecomputation.Step7:Thissecuretwo-partycomputationprocesscanbeexpandedtoNpartiesbyusingadditivesecretsharingbetweentwonon-
colludingcloudservers.TheN-inputfunctionisthustransformedintoatwo-inputfunction.
Table1.Summaryofresultsfordifferentsecuregenomicmultipartycomputationscenarios,allusingrealpatientdata.
SupplementaryFigure1.Representinggenomicdataasvectorsforsecurecomputation.(A)Eachindividualholdstheirpersonalgenomeprivate.(B)Theyareaskedtofillinapositionarray/vectorwith
TrueandFalsevaluesdependingonwhethertheyhaveararefunctionalvariantatthelistedposition,or
a0/1valueinagenearraydependingonwhethertheyhavenone/somerarefunctionalvariant/sin
eachlistedgene.(C)Theresultingposition/genevectorsareusedtoobtaintheresultsofTable1.
SupplementaryFigure2.Encryptionanddecryptionoverview.Asecret-keyencryptionschemeconsists
ofthreealgorithms:(A)asetupalgorithmwhichoutputsasecretkey(usuallyalongrandomstring);(B)
anencryptionalgorithmthattakesinasecretkey!andamessage"andproducesanencryptionof"
(calledaciphertext);and(C)adecryptionalgorithmthattakesinthesamesecretkey!andaciphertextandproducestheoriginalmessage.WewriteEnc&(")todenoteanencryptionofthemessage"under
thesecretkey!.Thecorrectnessrequirementforanencryptionschemestatesthatdecryptingthe
ciphertextoutputbyEnc&(")usingthesecretkey!shouldyieldtheoriginalmessage(plaintext)".(D)
Thesecurityrequirementforasecret-keyencryptionschemestatesthatanyonewhodoesnotpossessthesecret-key!cannotdistinguishanencryptionofamessage")fromanencryptionofamessage"*,irrespectiveofthechoiceofmessages")and"*.Inotherwords,withoutthesecretkey,theciphertextdoesnotrevealanyinformationabouttheencryptedmessage.
SupplementaryFigure3.Computationusingcircuits.(A)ABooleancircuitconsistsofasequenceoflogicgates(e.g.,ANDgates,ORgates,andNOTgates).Eachlogicgatetakesoneortwobitsasinputand
producesasinglebitofoutput.InaBooleancircuit,theoutputsofonelogicgatecanbeusedasthe
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint
22
inputtoanotherlogicgate.Werefertothesevaluesastheintermediatevaluesinthecomputation.In
thecircuitdepictedinthefigure,theinputstothecircuitaredenoted+), +*, +-, +., +/andtheoutputsofthecircuitaredenoted0), 0*.Specifically,thisparticularcircuitimplementsafunctionoverfiveinput
bitsandproducestwooutputbits.(B)EachgateintheBooleancircuitcanbedescribedbyatruthtable
thatspecifiesthemappingbetweeneachconfigurationoftheinputbitstoacorrespondingoutputbit.
InthecaseofanANDgate,therearetwoinputbits,andtheoutputis1ifandonlyifbothinputbitsare1.Otherwise,theoutputis0.
SupplementaryFigure4.Performancescaleupforsecurecomputation.Bandwidth,compute(CPU)
time,andoverallprotocolexecution(wallclock)timeforthesecureMAX,SETDIFFandINTERSECTION
scenariosofTable1,usingasinglethreadontwoservers,onelocatedontheEastCoastandtheother
ontheWestCoast.(A)Whenincreasingthenumberofunrelatedsubjectsinasmallcohortstudy,all
parametersgrowlogarithmically.(B)Whenincreasingthenumberoffamilymembersinanaffected/
nonaffectedscenario,parametersalsogrowlogarithmically.(C)Inthetwohospitalscenario,when
increasingthenumberofgenomicpositionsofpotentialinterest(e.g.,fromtheexometothenon-
codinggenome),allparametersgrowlinearly.Notethatallthreescenarios(A-C)performthebulkof
theircomputationoneachelementoftheinputvectorseparately(SupplementaryFigure1c).All
scenariosarethussimpletoparallelizeformaximumspeed-upusingmultiplethreadsandnodes.
SupplementaryFigure5.Booleanbuildingblocksforcomposingcomplexfunctions.(A)Thebasicbuildingblocksweusetobuildourcircuitsforidentifyingcommonmutationsandshareddenovo
variantsincludeaddition,comparison,equality,andmultiplexercircuits.AnadditioncircuitADD& on!bitinputstakestwo!-bitvaluesandoutputsthe!-bitrepresentationoftheirsum(additionis
performedmodulo2&).TheLT& andEQ& circuitsimplementtheless-thanandequalityoperations,
respectively,on!-bitinputs.TheMUX& circuitimplementsamultiplexercircuitwhichoninputsa
selectionbit: ∈ 0,1 andtwo!-bitvalues+>, +),outputs+?.TheindividualcircuitscanbeefficientlyconstructedusingANDgatesandXORgates,asdescribedbyKolesnikovetal
41.Thesebasiccircuit
buildingblockscanbecomposedtobuildamaxcircuitMAX& ontwoinputs(eachoflength!),whichinturncanbeusedtobuildamaxcircuiton@inputs.(B)Thiscircuitcomputestheargmaxover@additivelysecret-sharedvaluesA), … , AC.Thecircuitoperatesbyfirstcombiningthesharesandthen
takingthemaxovertheresultingvectorofvalues.Theargmaxisrepresentedbyabit-stringoflength@,whereapositionDhasvalue1ifAE isequaltothemaxvalue,and0otherwise.(C)Thiscircuitcomputes
thesetofgenes(representedbyindices)thatarepresentinatestvectorA), … , ACbutnotpresentinapoolF), … , FC.Thecountsofthemutationsappearinginthepoolareadditivelysecretshared.Thecircuit
firstcombinestheshares,andthenidentifiestheindicesDthatappearinthetestvector(AE = 1),butnotpresentinthepool(FE = 0).Thecircuitoutputs:E = 1ifthegeneindexedbyDoccursinthetargetgenomebutnotinthetestpool,and0otherwise.
.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted January 27, 2017. ; https://doi.org/10.1101/103655doi: bioRxiv preprint