Introduction to PLINK
GBIO0009 Kridsadakorn Chaichoompu
University of Liege
19/10/16 GBIO0015 1
PLINK: Why PLINK?
• PLINK is a whole genome association analysis software, and it is FREE!
  http://pngu.mgh.harvard.edu/~purcell/plink/
• PLINK has a well-documented manual that explains all features
• PLINK is available for Linux, Mac OS, and MS-DOS
• PLINK has 2 versions, the stable version (1.07) and the beta version (1.9)
  – PLINK 1.9 works much faster than 1.07
  – PLINK 1.9 has many new features
• gPLINK is another version of PLINK that provides a graphical user interface. Be aware that a whole genome analysis with PLINK usually takes a long time, so it is better to use a command-line version
• We recommend using PLINK 1.07
PLINK: Let's get started
• To download PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/dist/plink-1.07-i686.zip
• In plink-1.07-xxx.zip, there is an example set of input files, which is a good starting point to explore
  – test.map contains the marker information
  – test.ped contains genotype data and sample information
• Check what is inside the example files!
    plink --file test
Example data
• Download the example data from the course website
  – TSI_JPT_chr20_case_control.bed
  – TSI_JPT_chr20_case_control.bim
  – TSI_JPT_chr20_case_control.fam
  – TSI_JPT_chr20_pheno_header.txt
  – TSI_JPT_chr20_pheno.txt
PLINK: File Formats
PLINK mainly supports 3 types of formats
• Standard text format (PED and MAP)
  Note that all files must have the same name; otherwise we need to indicate them explicitly by using --ped and --map
    plink --file test
• Binary format (BED, BIM, and FAM)
    plink --bfile test
• Transposed text format (TPED and TFAM)
  Note that all files must have the same name; otherwise we need to indicate them explicitly by using --tped and --tfam
    plink --tfile test
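To make the standard text format more concrete, a PED row can be read as six leading fields followed by two allele columns per marker. The sketch below is only an illustration in Python: `parse_ped_line` is a hypothetical helper (not part of PLINK), the sample row is invented, and it assumes the default six-column layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype).

```python
def parse_ped_line(line):
    """Split one PED row into sample fields and per-marker genotypes.

    Assumes the default layout: 6 leading columns, then 2 alleles per marker.
    """
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 columns plus an even number of alleles")
    fid, iid, pat, mat, sex, pheno = fields[:6]
    alleles = fields[6:]
    # Pair up consecutive alleles: (a1, a2) for marker 1, marker 2, ...
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"FID": fid, "IID": iid, "PAT": pat, "MAT": mat,
            "SEX": sex, "PHENO": pheno, "genotypes": genotypes}


# Hypothetical row: one individual typed at two markers.
row = parse_ped_line("FAM1 IND1 0 0 1 2 A A G T")
print(row["IID"], row["genotypes"])
```

The number of genotype pairs should equal the number of rows in the matching MAP file.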
Format conversion
• To convert to, or to write output as, text format (PED and MAP)
    plink --file test --recode --out test_ped
• To convert to, or to write output as, binary format (BED, BIM, and FAM)
    plink --file test --make-bed --out test_bin
• To convert to, or to write output as, transposed text format (TPED and TFAM)
    plink --file test --transpose --recode --out test_tp
• Alternatively, it is possible to recode data with 1/2 encoding
    plink --file test --recode12 --out test_12
• To convert to additive encoding
    plink --file test --recodeAD --out test_12
• It is possible to switch from A,C,G,T encoding to 1,2,3,4 encoding by using --allele1234, or vice versa by using --alleleACGT
Alternate phenotype files
To specify an alternate phenotype for analysis, i.e. other than the one in the *.ped file (or, if using a binary fileset, the *.fam file), use the --pheno option:
    plink --file mydata --pheno pheno.txt
where pheno.txt is a file that contains 3 columns (one row per individual):
    Family ID
    Individual ID
    Phenotype
The original PED file must still contain a phenotype in column 6, unless the --no-pheno flag is given. The order of the alternate phenotype file need not be the same as for the original file. If the phenotype file contains more than one phenotype, then use the --mpheno N option to specify that the Nth phenotype is the one to be used:
    plink --file mydata --pheno pheno2.txt --mpheno 4
where pheno2.txt contains 5 different phenotypes; this command will use the 4th for analysis (phenotype D):
    Family ID
    Individual ID
    Phenotype A
    Phenotype B
    Phenotype C
    Phenotype D
    Phenotype E
If your file is coded 0/1 to represent unaffected/affected, then use the --1 flag:
    plink --file mydata --1
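The column arithmetic behind --mpheno can be sketched in a few lines of Python. This is not PLINK code: `select_phenotype` is a hypothetical helper and the two rows are invented. It only shows that, with family and individual ID in columns 1 and 2, the Nth phenotype sits in column N + 2.

```python
def select_phenotype(lines, n=1):
    """Return {(FID, IID): value} using the n-th phenotype column (1-based).

    Columns 1-2 are family and individual ID, so phenotype n is column n + 2.
    """
    chosen = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 + n:
            raise ValueError("row has fewer than %d phenotype columns" % n)
        # 0-based index of phenotype n is (n + 2) - 1 = n + 1.
        chosen[(fields[0], fields[1])] = fields[n + 1]
    return chosen


# Hypothetical pheno2.txt content with phenotypes A..E in columns 3..7.
pheno2 = ["F1 I1 1 2 1 2 1",
          "F2 I1 2 2 1 1 2"]
print(select_phenotype(pheno2, n=4))  # phenotype D for each individual
```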
Data manipulation: SNPs (1/3)
To get a set of SNPs, you can specify a single SNP and, optionally, also ask for all SNPs in the surrounding region, with the --window option:
    plink --bfile mydata --snp rs652423 --window 20
which extracts only SNPs within +/- 20 kb of rs652423.

Based on multiple SNPs and ranges (--snps)
The --snps command will accept a comma-delimited list of SNPs, including ranges based on physical position. For example,
    plink --bfile mydata --snps rs273744-rs89883,rs12345-rs67890,rs999,rs222

Based on physical position (--from-kb, etc.)
    plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000
to select all SNPs within this 5000 kb region on chromosome 2.
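The two position-based selections above can be pictured as simple filters over (chromosome, SNP ID, base-pair position) records, such as those found in a MAP or BIM file. The Python sketch below uses invented marker names and positions, and it assumes inclusive bounds; whether PLINK treats the boundaries exactly this way is not stated on the slide.

```python
def snps_in_region(markers, chrom, from_kb, to_kb):
    """IDs of markers on `chrom` with from_kb <= position <= to_kb (in kb)."""
    lo, hi = from_kb * 1000, to_kb * 1000
    return [snp for c, snp, bp in markers if c == chrom and lo <= bp <= hi]


def snps_in_window(markers, ref_snp, window_kb):
    """IDs of markers within +/- window_kb of `ref_snp` on the same chromosome."""
    ref_chrom, ref_bp = next((c, bp) for c, snp, bp in markers if snp == ref_snp)
    span = window_kb * 1000
    return [snp for c, snp, bp in markers
            if c == ref_chrom and abs(bp - ref_bp) <= span]


# Hypothetical markers: (chromosome, SNP ID, base-pair position).
markers = [("2", "rsA", 4_999_999), ("2", "rsB", 5_000_000),
           ("2", "rsC", 7_500_000), ("2", "rsF", 7_515_000),
           ("2", "rsD", 10_000_001), ("3", "rsE", 6_000_000)]

print(snps_in_region(markers, "2", 5000, 10000))  # like --chr 2 --from-kb 5000 --to-kb 10000
print(snps_in_window(markers, "rsC", 20))         # like --snp rsC --window 20
```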
Data manipulation: SNPs (2/3)
To merge more than two standard and/or binary filesets, it is often more convenient to specify a single file that contains a list of the PED/MAP and/or BED/BIM/FAM files to merge. For example, consider that we had 4 PED/MAP filesets (labelled fA.* through fD.*) and 4 binary filesets (labelled fE.* through fH.*). Then use the command:
    plink --file fA --merge-list allfiles.txt --make-bed --out mynewdata
Data manipulation: SNPs (3/3)
To exclude some sets of SNPs
    plink --file data --exclude mysnps.txt
where the file mysnps.txt is, as for the --extract command, just a list of SNPs, one per line.
Data manipulation: individuals (1/3)
To get a set of individuals
    plink --file data --keep mylist.txt
where the file mylist.txt is, as for the --remove command, just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line. Fields can occur after the 2nd column but they will be ignored, i.e. you could use a FAM file as the parameter of the --keep command, or have comments in the file. For example
    F101  1
    F1001 2_B
    F3033 1_A   Drop this individual because of consent issues
    F4442 22
Data manipulation: individuals (2/3)
To exclude a set of individuals
    plink --file data --remove mylist.txt
where the file mylist.txt is, as for the --keep command, just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line (although, as for --keep, fields after the 2nd column are allowed but they will be ignored).
Data manipulation: individuals (3/3)
Filter some individuals
    plink --file data --filter myfile.raw 1 --freq
implies that a file myfile.raw exists which has a similar format to phenotype and cluster files: that is, the first two columns are family and individual IDs; the third column is expected to be a numeric value (although the file can have more than 3 columns), and only individuals who have a value of 1 for this would be included in any subsequent analysis or file generation procedure. E.g. if myfile.raw were
    F1 I1 2
    F2 I1 7
    F3 I1 1
    F3 I2 1
    F3 I3 3
Because filtering on cases or controls, or on sex, or on position within the family, will be common operations, there are some shortcut options that can be used instead of --filter. These are:
    --filter-cases
    --filter-controls
    --filter-males
    --filter-females
    --filter-founders
    --filter-nonfounders
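To see which individuals a filter value of 1 would retain for the example myfile.raw rows above, the selection rule can be mimicked in Python. `filter_individuals` is a hypothetical helper written for this illustration, not PLINK's implementation; it simply compares the third column with the requested value.

```python
def filter_individuals(rows, value, column=3):
    """Return (FID, IID) pairs whose 1-based `column` equals `value`."""
    keep = []
    for line in rows:
        fields = line.split()
        if fields[column - 1] == str(value):
            keep.append((fields[0], fields[1]))
    return keep


# The example rows of myfile.raw from the slide.
myfile = ["F1 I1 2", "F2 I1 7", "F3 I1 1", "F3 I2 1", "F3 I3 3"]
print(filter_individuals(myfile, 1))  # [('F3', 'I1'), ('F3', 'I2')]
```

Only the two individuals F3 I1 and F3 I2 carry the value 1, so only they would enter the subsequent --freq calculation.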
Quality control processes
• Missing genotype
• Hardy-Weinberg Equilibrium
• Minor allele frequency
• Linkage disequilibrium pruning
• Mendel errors
Missing genotype
To generate a list of genotyping/missingness rate statistics:
    plink --file data --missing
This option creates two files:
    plink.imiss
    plink.lmiss
which detail missingness by individual and by SNP (locus), respectively. For individuals, the format is:
    FID         Family ID
    IID         Individual ID
    MISS_PHENO  Missing phenotype? (Y/N)
    N_MISS      Number of missing SNPs
    N_GENO      Number of non-obligatory missing genotypes
    F_MISS      Proportion of missing SNPs
For each SNP, the format is:
    SNP         SNP identifier
    CHR         Chromosome number
    N_MISS      Number of individuals missing this SNP
    N_GENO      Number of non-obligatory missing genotypes
    F_MISS      Proportion of sample missing for this SNP
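The two F_MISS columns are just row-wise and column-wise proportions of missing calls in the genotype table. A minimal Python sketch, assuming a toy table where "0" marks a missing call and ignoring obligatory missingness (the function name and data are invented for illustration):

```python
def missingness(genotypes, missing="0"):
    """genotypes: {IID: [call per SNP]}. Returns per-individual and per-SNP F_MISS."""
    n_snps = len(next(iter(genotypes.values())))
    n_ind = len(genotypes)
    per_ind = {iid: sum(c == missing for c in calls) / n_snps
               for iid, calls in genotypes.items()}
    per_snp = [sum(calls[j] == missing for calls in genotypes.values()) / n_ind
               for j in range(n_snps)]
    return per_ind, per_snp


# Hypothetical calls for 4 individuals at 4 SNPs.
calls = {"I1": ["AA", "0", "AG", "GG"],
         "I2": ["AA", "AG", "0", "GG"],
         "I3": ["0", "0", "AG", "GG"],
         "I4": ["AA", "AG", "AG", "GG"]}
per_ind, per_snp = missingness(calls)
print(per_ind)   # analogous to F_MISS in plink.imiss
print(per_snp)   # analogous to F_MISS in plink.lmiss

# Individuals above a 30% missingness cut-off (hypothetical threshold).
too_missing = [iid for iid, f in per_ind.items() if f > 0.3]
print(too_missing)
```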
Clustering based on missing genotypes
Systematic batch effects that induce missingness in parts of the sample will induce correlation between the patterns of missing data that different individuals display. One approach to detecting correlation in these patterns, which might identify such biases, is to cluster individuals based on their identity-by-missingness (IBM).
    plink --file data --cluster-missing
which creates the files:
    plink.matrix.missing
    plink.cluster3.missing
which have similar formats to the corresponding IBS clustering files.
Missing rate per person
The initial step in all data analysis is to exclude individuals with too much missing genotype data. This option is set as follows:
    plink --file mydata --mind 0.1
which means exclude individuals with more than 10% missing genotypes. A line in the terminal output will appear, indicating how many individuals were removed due to low genotyping. If any individuals were removed, a file called
    plink.irem
will be created, listing the Family and Individual IDs of these removed individuals. Any subsequent analysis also specified on the same command line will be performed without these individuals.
Missing rate per SNP
Subsequent analyses can be set to automatically exclude SNPs on the basis of missing genotype rate, with the --geno option: the default is to include all SNPs (i.e. --geno 1). To include only SNPs with a 90% genotyping rate (10% missing) use
    plink --file mydata --geno 0.1
As with the --maf option, these counts are calculated after removing individuals with high missing genotype rates.
Hardy-Weinberg Equilibrium (1/2)
To generate a list of genotype counts and Hardy-Weinberg test statistics for each SNP, use the option:
    plink --file data --hardy
which creates a file:
    plink.hwe
This file has the following format:
    SNP     SNP identifier
    TEST    Code indicating sample
    A1      Minor allele code
    A2      Major allele code
    GENO    Genotype counts: 11/12/22
    O(HET)  Observed heterozygosity
    E(HET)  Expected heterozygosity
    P       H-W p-value
Hardy-Weinberg Equilibrium (2/2)
To exclude markers that fail the Hardy-Weinberg test at a specified significance threshold, use the option:
    plink --file mydata --hwe 0.001
By default this filter uses an exact test. The standard asymptotic test (1 df genotypic chi-squared test) can be requested with the --hwe2 option instead of --hwe. The following output will appear in the console window and in plink.log, detailing how many SNPs failed the Hardy-Weinberg test, for the sample as a whole, and (when PLINK has detected a disease phenotype) for cases and controls separately:
    Writing Hardy-Weinberg tests (founders-only) to [ plink.hwe ]
    30 markers failed HWE test ( p <= 0.05 ) and have been excluded
    34 markers failed HWE test in cases
    30 markers failed HWE test in controls
This test will only be based on founders (if family-based data are being analysed) unless the --nonfounders option is also specified.
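For intuition, the asymptotic 1 df chi-squared version of the test can be worked out by hand from the three genotype counts. The Python sketch below is an illustration of that textbook calculation, not PLINK's code, and it is not the exact test that --hwe uses by default; `hwe_chisq` and the counts are hypothetical.

```python
import math


def hwe_chisq(n11, n12, n22):
    """Asymptotic 1-df chi-squared Hardy-Weinberg test from genotype counts."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)          # frequency of allele 1
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("monomorphic marker: test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n11, n12, n22)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chisq / 2))
    return chisq, p_value


# Hypothetical counts with a strong excess of heterozygotes.
chisq, p_value = hwe_chisq(10, 80, 10)
print(chisq, p_value)
print(p_value <= 0.001)  # would this marker fail a 0.001 threshold?
```

With allele frequency 0.5 the expected counts are 25/50/25, so the observed 10/80/10 gives a chi-squared statistic of 36.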
Allele frequency
To generate a list of minor allele frequencies (MAF) for each SNP, based on all founders in the sample:
    plink --file data --freq
will create a file:
    plink.frq
with the following columns:
    CHR      Chromosome
    SNP      SNP identifier
    A1       Allele 1 code (minor allele)
    A2       Allele 2 code (major allele)
    MAF      Minor allele frequency
    NCHROBS  Non-missing allele count
Minor allele frequency
Once individuals with too much missing genotype data have been excluded, subsequent analyses can be set to automatically exclude SNPs on the basis of MAF (minor allele frequency):
    plink --file mydata --maf 0.05
means only include SNPs with MAF >= 0.05. The default value is 0.01. This quantity is based only on founders (i.e. individuals for whom the paternal and maternal individual codes are both 0). This option appropriately counts alleles for X and Y chromosome SNPs.
Linkage disequilibrium pruning (1/2)
Sometimes it is useful to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. This can be achieved via two commands: first, --indep, which prunes based on the variance inflation factor (VIF) and recursively removes SNPs within a sliding window; second, --indep-pairwise, which is similar, except that it is based only on pairwise genotypic correlation. The VIF pruning routine is performed with:
    plink --file data --indep 50 5 2
which will create the files
    plink.prune.in
    plink.prune.out
Each is a simple list of SNP IDs; both these files can subsequently be specified as the argument for an --extract or --exclude command. The parameters for --indep are: the window size in SNPs (e.g. 50), the number of SNPs to shift the window at each step (e.g. 5), and the VIF threshold. The VIF is 1/(1-R^2), where R^2 is the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously. That is, this considers the correlations between SNPs but also between linear combinations of SNPs.
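As a small worked step, the VIF threshold can be translated back into a threshold on R^2. Assuming that SNPs whose VIF exceeds the threshold are the ones removed, the value 2 in the command above corresponds to a multiple R^2 of 0.5:

```latex
\[
\mathrm{VIF} = \frac{1}{1 - R^2}
\qquad\Longrightarrow\qquad
\mathrm{VIF} > 2 \iff 1 - R^2 < \tfrac{1}{2} \iff R^2 > \tfrac{1}{2}.
\]
```

A VIF of 1 corresponds to R^2 = 0, i.e. a SNP that is not linearly predictable from the other SNPs in the window.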
Linkage disequilibrium pruning (2/2)
The second procedure is performed with:
    plink --file data --indep-pairwise 50 5 0.5
This generates the same output files as the first option; the only difference is that a simple pairwise threshold is used. The first two parameters (50 and 5) are the same as above (window size and step); the third parameter represents the r^2 threshold. To give a concrete example: the command above that specifies 50 5 0.5 would
    a) consider a window of 50 SNPs,
    b) calculate LD between each pair of SNPs in the window,
    c) remove one of a pair of SNPs if the LD is greater than 0.5,
    d) shift the window 5 SNPs forward and repeat the procedure.
To make a new, pruned file, use something like the following (in this example, we also convert the standard PED fileset to a binary one):
    plink --file data --extract plink.prune.in --make-bed --out pruneddata
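The window procedure described above can be sketched in Python on a toy data set. This is a simplified illustration, not PLINK's algorithm: genotypes are coded as 0/1/2 allele counts, the helper names and data are invented, and when a pair exceeds the threshold the sketch arbitrarily drops the later SNP, which may differ from the rule PLINK applies.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: correlation undefined, treated as 0 here
    return sxy * sxy / (sxx * syy)


def prune_pairwise(snps, window, step, threshold):
    """snps: list of (snp_id, genotype_vector) in map order. Returns (kept, pruned)."""
    removed = set()
    for start in range(0, len(snps), step):
        block = snps[start:start + window]
        for i, (id_i, geno_i) in enumerate(block):
            if id_i in removed:
                continue
            for id_j, geno_j in block[i + 1:]:
                if id_j not in removed and r_squared(geno_i, geno_j) > threshold:
                    removed.add(id_j)  # simplification: drop the later SNP of the pair
    kept = [sid for sid, _ in snps if sid not in removed]
    pruned = [sid for sid, _ in snps if sid in removed]
    return kept, pruned


# Hypothetical genotypes for 6 individuals; rs2 duplicates rs1.
snps = [("rs1", [0, 1, 2, 1, 0, 2]),
        ("rs2", [0, 1, 2, 1, 0, 2]),
        ("rs3", [2, 0, 1, 1, 0, 2])]
kept, pruned = prune_pairwise(snps, window=50, step=5, threshold=0.5)
print(kept, pruned)  # analogous to plink.prune.in and plink.prune.out
```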
Mendel errors
To generate a list of Mendel errors for SNPs and families, use the option:
    plink --file data --mendel
which will create the files:
    plink.mendel
    plink.imendel
    plink.fmendel
    plink.lmendel
The *.mendel file contains all Mendel errors (i.e. one line per error); the *.imendel file contains a summary of per-individual error rates; the *.fmendel file contains a summary of per-family error rates; the *.lmendel file contains a summary of per-SNP error rates. The *.mendel file has the following columns:
    FID    Family ID
    KID    Child individual ID
    CHR    Chromosome
    SNP    SNP ID
    CODE   A numerical code indicating the type of error
    ERROR  Description of the actual error
Association Analysis
• Case/control
• Fisher's exact
• Full model
• Quantitative trait
• Linear and logistic models
• Multiple-test correction
Manhattan plot using GWASTools
manhattanPlot(assoc$P,chromosome=assoc$CHR)
05/10/16 KC-ULg 27
QQ plot using GWASTools
qqPlot(pval=assoc$P, truncate=TRUE, main="QQ Plot of P-values")
Basic case/control association test
To perform a standard case/control association analysis, use the option:
    plink --file mydata --assoc
which generates a file
    plink.assoc
which contains the fields:
    CHR    Chromosome
    SNP    SNP ID
    BP     Physical position (base-pair)
    A1     Minor allele name (based on whole sample)
    F_A    Frequency of this allele in cases
    F_U    Frequency of this allele in controls
    A2     Major allele name
    CHISQ  Basic allelic test chi-square (1 df)
    P      Asymptotic p-value for this test
    OR     Estimated odds ratio (for A1, i.e. A2 is reference)
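The F_A, F_U, CHISQ, P and OR fields can be related to a 2-by-2 table of allele counts (A1/A2 in cases and controls). The Python sketch below is a textbook Pearson chi-squared calculation without continuity correction, offered as an illustration rather than as PLINK's exact implementation; the function name and counts are hypothetical.

```python
import math


def allelic_test(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """1-df chi-squared test and odds ratio on a 2x2 table of allele counts."""
    table = [[case_a1, case_a2], [ctrl_a1, ctrl_a2]]
    rows = [case_a1 + case_a2, ctrl_a1 + ctrl_a2]
    cols = [case_a1 + ctrl_a1, case_a2 + ctrl_a2]
    total = rows[0] + rows[1]
    chisq = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chisq += (table[i][j] - expected) ** 2 / expected
    return {
        "F_A": case_a1 / rows[0],                     # A1 frequency in cases
        "F_U": ctrl_a1 / rows[1],                     # A1 frequency in controls
        "CHISQ": chisq,
        "P": math.erfc(math.sqrt(chisq / 2)),         # chi-squared upper tail, 1 df
        "OR": (case_a1 * ctrl_a2) / (case_a2 * ctrl_a1),
    }


# Hypothetical allele counts: 100 cases and 100 controls (200 alleles each).
result = allelic_test(case_a1=60, case_a2=140, ctrl_a1=40, ctrl_a2=160)
print(result)
```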
Fisher's Exact test (allelic association)
To perform a standard case/control association analysis using Fisher's exact test to generate significance, use the option:
    plink --file mydata --fisher
which generates a file
    plink.fisher
which contains the fields:
    CHR  Chromosome
    SNP  SNP ID
    BP   Physical position (base-pair)
    A1   Minor allele name (based on whole sample)
    F_A  Frequency of this allele in cases
    F_U  Frequency of this allele in controls
    A2   Major allele name
    P    Exact p-value for this test
    OR   Estimated odds ratio (for A1)
As described below, if --fisher is specified with --model as well, PLINK will perform genotypic tests using Fisher's exact test.
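For small counts, an exact p-value for a 2-by-2 allele table can be computed directly from the hypergeometric distribution. The Python sketch below uses one common two-sided definition (summing all tables that are no more probable than the observed one); this is an assumption made for illustration and may not match PLINK's implementation in every detail. `fisher_exact_2x2` and the counts are hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are as likely as, or less likely than, the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical tiny table: rows = cases/controls, columns = A1/A2 allele counts.
p_exact = fisher_exact_2x2(3, 1, 1, 3)
print(p_exact)
```

For this table the possible top-left counts 0..4 have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided p-value is 34/70.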
Alternate/full model association tests
It is possible to perform tests of association between a disease and a variant other than the basic allelic test (which compares frequencies of alleles in cases versus controls), by using the --model option. The tests offered here are (in addition to the basic allelic test):
    Cochran-Armitage trend test
    Genotypic (2 df) test
    Dominant gene action (1 df) test
    Recessive gene action (1 df) test
The genotypic test provides a general test of association in the 2-by-3 table of disease-by-genotype. The dominant and recessive models are tests for the minor allele (which allele is the minor allele can be found in the output of either the --assoc or the --freq commands). That is, if D is the minor allele (and d is the major allele):
    Allelic:    D versus d
    Dominant:   (DD, Dd) versus dd
    Recessive:  DD versus (Dd, dd)
    Genotypic:  DD versus Dd versus dd
As mentioned above, these tests are generated with the option:
    plink --file mydata --model
which generates a file
    plink.model
which contains the following fields:
    CHR    Chromosome number
    SNP    SNP identifier
    TEST   Type of test
    AFF    Genotypes/alleles in cases
    UNAFF  Genotypes/alleles in controls
    CHISQ  Chi-squared statistic
    DF     Degrees of freedom for test
    P      Asymptotic p-value
Quantitative trait association
Quantitative traits can be tested for association also, using either asymptotic or empirical significance values. If the phenotype (column 6 of the PED file or the phenotype as specified with the --pheno option) is quantitative, then PLINK will automatically treat the analysis as a quantitative trait analysis.
    plink --file mydata --assoc
will generate the file
    plink.qassoc
with fields as follows:
    CHR    Chromosome number
    SNP    SNP identifier
    BP     Physical position (base-pair)
    NMISS  Number of non-missing genotypes
    BETA   Regression coefficient
    SE     Standard error
    R2     Regression r-squared
    T      Wald test (based on t-distribution)
    P      Wald test asymptotic p-value
If permutations were also requested, then an extra file, either plink.assoc.perm or plink.assoc.mperm, will be generated, depending on whether adaptive or max(T) permutation was used (see the next section for more details). The empirical p-values are based on the Wald statistic.
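The BETA, SE, R2 and T fields can be understood as the output of an ordinary least-squares regression of the phenotype on the allele count (0, 1 or 2) at one SNP. The Python sketch below is such a regression written out by hand under that assumption; `qassoc` is a hypothetical helper, the data are invented, and the p-value is omitted because it would need a t-distribution routine.

```python
import math


def qassoc(genotypes, phenotypes):
    """Regress phenotype on allele count; return NMISS, BETA, SE, R2 and T.

    A genotype of None marks a missing call and is skipped.
    """
    pairs = [(g, y) for g, y in zip(genotypes, phenotypes) if g is not None]
    n = len(pairs)
    mg = sum(g for g, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sgg = sum((g - mg) ** 2 for g, _ in pairs)
    sgy = sum((g - mg) * (y - my) for g, y in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    beta = sgy / sgg                          # slope of the fitted line
    rss = syy - beta * sgy                    # residual sum of squares
    se = math.sqrt(rss / (n - 2) / sgg)       # standard error of the slope
    return {"NMISS": n, "BETA": beta, "SE": se,
            "R2": sgy * sgy / (sgg * syy), "T": beta / se}


# Hypothetical data: 7 individuals, the last one with a missing genotype.
geno = [0, 0, 1, 1, 2, 2, None]
trait = [1.0, 1.2, 1.9, 2.1, 3.0, 2.8, 5.0]
fit = qassoc(geno, trait)
print(fit)
```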
Linear and logistic models
These two features allow for multiple covariates when testing for both quantitative trait and disease trait SNP association, and for interactions with those covariates. The covariates can either be continuous or binary (i.e. for categorical covariates, you must first make a set of binary dummy variables). In this section we consider:
    Basic usage
    Covariates and interactions
    Flexibly specifying the precise model
    Flexibly specifying joint tests

Basic usage
For quantitative traits, use
    plink --bfile mydata --linear
For disease traits, specify logistic regression with
    plink --bfile mydata --logistic
These commands will generate either the output file plink.assoc.linear or plink.assoc.logistic, depending on the phenotype/command used. The basic format is:
    CHR      Chromosome
    SNP      SNP identifier
    BP       Physical position (base-pair)
    A1       Tested allele (minor allele by default)
    TEST     Code for the test (see below)
    NMISS    Number of non-missing individuals included in analysis
    BETA/OR  Regression coefficient (--linear) or odds ratio (--logistic)
    STAT     Coefficient t-statistic
    P        Asymptotic p-value for t-statistic
Adjustment for multiple testing
To generate a file of adjusted significance values that correct for all tests performed and other metrics, use the option:
    plink --file mydata --assoc --adjust
which generates the file
    plink.adjust
which contains the fields
    CHR       Chromosome number
    SNP       SNP identifier
    UNADJ     Unadjusted p-value
    GC        Genomic-control corrected p-values
    BONF      Bonferroni single-step adjusted p-values
    HOLM      Holm (1979) step-down adjusted p-values
    SIDAK_SS  Sidak single-step adjusted p-values
    SIDAK_SD  Sidak step-down adjusted p-values
    FDR_BH    Benjamini & Hochberg (1995) step-up FDR control
    FDR_BY    Benjamini & Yekutieli (2001) step-up FDR control
This file is sorted by significance value rather than genomic location, the most significant results being at the top.
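Several of these adjusted columns follow from short textbook formulas applied to the sorted p-values. The Python sketch below reproduces four of them (BONF, HOLM, SIDAK_SS, FDR_BH) from their standard definitions; it is an illustration, not PLINK's code, and the function name and p-values are hypothetical. The GC, SIDAK_SD and FDR_BY columns are left out.

```python
def adjust_pvalues(pvals):
    """Return one row per test, sorted by p-value, with four adjusted columns."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    sorted_p = [pvals[i] for i in order]

    bonf = [min(1.0, p * m) for p in sorted_p]
    sidak_ss = [1 - (1 - p) ** m for p in sorted_p]

    # Holm step-down: running maximum of (m - rank) * p, capped at 1.
    holm, running = [], 0.0
    for k, p in enumerate(sorted_p):
        running = max(running, min(1.0, (m - k) * p))
        holm.append(running)

    # Benjamini-Hochberg step-up: running minimum from the largest p downwards.
    fdr_bh, running = [0.0] * m, 1.0
    for k in range(m - 1, -1, -1):
        running = min(running, sorted_p[k] * m / (k + 1))
        fdr_bh[k] = running

    return [{"index": order[k], "UNADJ": sorted_p[k], "BONF": bonf[k],
             "HOLM": holm[k], "SIDAK_SS": sidak_ss[k], "FDR_BH": fdr_bh[k]}
            for k in range(m)]


# Hypothetical unadjusted p-values for four SNPs.
rows = adjust_pvalues([0.04, 0.001, 0.03, 0.2])
for r in rows:
    print(r)
```

As in the PLINK output, the rows come back sorted by significance, with the original position of each test kept in the `index` field.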