Admixture Correction in the Outgroup f3 Statistic
Presented by Nita Tunga
In partial fulfilment of the requirements for graduation with the Dean’s Scholars Honors Degree in the Department of
Mathematics, University of Texas at Austin
May, 2018
Prof. Jennifer Mann Austin, Ph.D, Supervising Professor
Prof. Kathryn Dabbs, Ph.D, Second Reader
Table of Contents
Introduction
Chapter 1: Background
Chapter 2: Project
Chapter 3: Dataset
Chapter 4: Methods and Results
    Section 4.1: Correction Attempt 1
    Section 4.2: Correction Attempt 2
    Section 4.3: Correction Attempt 3
    Section 4.4: Correction Attempt 4
Conclusion
Appendix A: Glossary
References
Introduction
Genetic inheritance can be studied within a purely genetic scope. However, this
eliminates part of the picture. The field of genetics is often thought of as a natural science
with little in common with fields of social science. However, in human genetics and the
genetics of the organisms which humans impact, the role of cultural and societal forces
cannot be ignored. For instance, lactase is an enzyme used to digest lactose in milk. As such,
it is an enzyme whose activity reduces significantly after weaning. Nonetheless, as humans
have begun to ingest more dairy products into adulthood, lactase persistence has evolved to
enable humans to digest these dairy products.
My research involves mathematically representing the genetic similarity of two
populations accurately via the f3 statistic. The outgroup-f3 statistic is useful in
understanding a population’s genetic history and how genetically related two populations
are. It shows how close two populations are compared to a third population that is equally
distant genetically from the first two. However, if two populations share a recent genetic
interaction with another population, the outgroup-f3 statistic could show those two
populations as being closer together than they truly are. This genetic interaction of two or
more previously isolated populations interbreeding is referred to as admixture. Admixture
skews, or even inhibits, an understanding of those populations’ genetic histories.
To avoid this problem, I have attempted to devise a modified version of the
outgroup-f3 statistic to ensure an accurate representation of genetic relatedness. For my
project, artificial admixture was introduced in six unadmixed human populations.
Depending on the relationship between increased contamination and the f3 statistic, we
proposed and adjusted solutions for a corrected f3 accordingly.
I tested my proposed corrections by applying them to populations that contain
individuals with and without recent histories of genetic admixture. After correcting for the
proportion of admixture in the population, I compared this corrected outgroup-f3 statistic
to the outgroup-f3 value calculated for the original unadmixed population. The goal of this
work is to have a corrected statistic that one can apply to two populations, independent of
admixture proportions. Ultimately, this will help us to better understand the evolutionary
histories of populations. Moreover, a corrected statistic will aid other researchers as they
analyse demographic histories further in the past.
Background
F statistics were first proposed in the paper, “Reconstructing Indian population
history”, published in Nature in 2009. In this paper, Reich and colleagues outline the way f2,
f3, and f4 statistics can be used to measure genetic drift between two, three, and four taxa
respectively. The f3 statistic proposed in this paper is useful for detecting admixture
between groups. To summarise, the f3 statistic assumes a null hypothesis of no admixture,
which implies a nonnegative f3 statistic. F3 is best used to detect admixture when the time
between original split and secondary contact is large, coalescence before admixture is
unlikely, and the admixture proportion is close to 50%.
In regard to f3 statistics in particular, Reich et al. propose an equation to be used to
measure the genetic drift between three populations, Populations X, A, and B. This equation
is, in a simplified form, $f_3 = (x - a)(x - b)$, where x, a, and b represent allele
frequencies in their respective populations. By simplifying the equation, we see that there is
a proportional relationship between the f3 statistic and the genetic drift between
Populations A and X and Populations B and X. Genetic drift is defined to be the change in
allele frequency along a graph edge on a phylogenetic tree. Phylogenetic trees are graphical
representations of the genetic relationship between a group of individuals or populations
based on physical or genetic characteristics. The length of the branches on the tree often
represents the genetic distance (the number of genetic differences) between individuals or
populations.
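As a minimal sketch of the simplified equation, the product (x - a)(x - b) can be computed at each SNP and averaged. Averaging across SNPs is an assumption made here for illustration, the function name `f3_simple` and the allele frequencies are hypothetical, and the sampling-bias corrections applied by real tools are left out.

```python
from typing import Sequence


def f3_simple(x: Sequence[float], a: Sequence[float], b: Sequence[float]) -> float:
    """Average (x_i - a_i) * (x_i - b_i) over SNPs (no bias correction)."""
    if not (len(x) == len(a) == len(b)) or len(x) == 0:
        raise ValueError("need three equal-length, non-empty frequency lists")
    products = [(xi - ai) * (xi - bi) for xi, ai, bi in zip(x, a, b)]
    return sum(products) / len(products)


# Hypothetical allele frequencies at three SNPs (illustrative only, not thesis data).
x_freqs = [0.10, 0.50, 0.90]  # Population X
a_freqs = [0.30, 0.40, 0.70]  # Population A
b_freqs = [0.40, 0.30, 0.60]  # Population B

f3_value = f3_simple(x_freqs, a_freqs, b_freqs)
print(f3_value)
```

In this toy case A and B differ from X in the same direction at every SNP, so each product is positive and the average is positive as well.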
More specifically, the calculated f3 statistic is the product of the frequency
differences between those populations. This test is useful to see if certain groups have
inherited genes from different ancestries. When there has been no admixture, the f3
statistic is expected to be non-negative. When there has been admixture, the f3 statistic could be
negative. Furthermore, lower f3 values are indicative of less closely related populations,
whereas higher f3 values are indicative of more closely related populations. The farther
apart two populations are, the smaller the two terms (x-a and x-b) in the equation, and
therefore, the lower the f3 statistic. Similarly, when two populations are closer together, the
two terms in the equation are larger, resulting in a larger f3 statistic.
To better understand what the f3 statistic can be used for, we refer to the figure
below.
[Figure: tree relating Populations A, B, and X, with tips labelled PA, PB, and PX.]
Here we see that there are two populations that are closer together (Populations A
and B) than they are to the third population (Population X). In the context of the equation,
$f_3 = (x - a)(x - b)$, we see that we are comparing the allele frequencies in Populations
A and B, in relation to the allele frequencies in Population X. If we see how far Population
A’s allele frequencies are from Population X’s allele frequencies and compare this to the
distance between Population B’s allele frequencies and Population X’s, we can evaluate the
genetic distance between Population A and Population B. To think about this in a different
way, by subtracting out Population A’s allele frequencies from those of Population X, we are
seeing how much longer or shorter one branch length is compared to the other. Doing so
enables us to analyse the distance of each of the three populations in relation to the vertex
that connects all three of them. However, if we have an unknown Population Y that
integrates its DNA into both Population A and Population B, it would appear that these two
populations are closer genetically than one would expect. In terms of the equation, this
would make both terms (x-a) and (x-b) increase or decrease together. As such, the resulting
f3 value will be inordinately higher or lower. This is an interesting result if one is concerned
with the relationship of Population Y to Populations A and B. However, if you are interested
in the genetic relationship of Populations A and B before their admixture with Population Y,
this can be a confounding factor.
Nick Patterson was able to work through more of the math behind the F statistics
tests, which he documented in his paper “Ancient Admixture in Human History,” published
in Genetics in 2012. He also discusses the outgroup case, which is further discussed in
Maanasa Raghavan’s paper, “Upper Palaeolithic Siberian genome reveals dual ancestry of
Native Americans,” published in Nature in 2014. In this paper, the concept of outgroup-f3
statistics is introduced. Outgroup-f3 statistics involve comparing two populations to a third,
“outgroup,” population, which is equally genetically removed from the other two
populations. By doing so, the outgroup population serves as a reference group for
measuring genetic relatedness of the populations in question. So instead of looking for
admixture between Population X and the other populations, the outgroup-f3 statistic is a
measure of the genetic similarity between Populations A and B.
In Benjamin Peter’s paper, “Admixture, Population Structure, and F-Statistics,” he
provides a clear overview of F and D statistics (Genetics, 2016). He also makes the point that
f3 statistics can be used as a test for admixture, not just for how closely related two
populations are. He also points out that in the history of humans, many of the calculated f3
values are negative, which could show that population phylogenies are not always the best
way to discuss human evolution.
F3 statistics have been useful in determining a variety of genetic relatedness
questions and are widely used in the field of human population genetics and evolutionary
biology more broadly. For instance, outgroup-f3 statistics were used to test the relatedness
of Levantine and southern Arabian populations to African populations along the
Northern and Southern Dispersal Routes out of Africa. Humans evolved in Africa over the
past 2 million years. A major dispersal of humans out of Africa occurred around 50 thousand
years ago and led to the majority of human genetic variation we see across the world today.
Anthropologists and geneticists have long debated whether the primary route out of Africa
was the Northern Route or the Southern Route. In “Testing support for the northern and
southern dispersal routes out of Africa: an analysis of Levantine and southern Arabian
populations,” Vyas and colleagues attempted to answer that question using f3 statistics
(American Journal of Physical Anthropology, 2017). The Northern Dispersal Route led into
the Levant, whereas the Southern Dispersal Route led into southern Arabia. By using f3 statistics
to see how linked the populations were pairwise, it was found that neither dispersal route
was favoured over the other. The Mbuti, a group of people currently living in central Africa,
was used as the outgroup population for this test. The results showed that both the
Levantine and Arabian populations were equally related to the African population.
The f3 test was taken further and used to show that both the Levantine and Arabian
populations shared relatively similar relatedness to non-African populations. Within each
region, some groups had more sub-Saharan ancestry, which led to lower f3 values. Another
reason for a lower f3 statistic could be an earlier divergence from non-African populations,
which would be useful in determining which route was used by earlier populations. The
statistic was used to show that both populations were generally equally related to all the
African populations as well. Therefore, the researchers were not able to distinguish which
dispersal route was used more.
The f3 statistics have also been used in exploring the relatedness of various
subspecies of grapes. In contrast to the previous example of outgroup-f3 statistics, this test
used normal f3 statistics to see what sort of admixture has occurred in the history of the
grape. While this involves understanding how related two species of grapes are, the primary
purpose of this study was to see how to maximally utilise the genetic diversity of grapes.
The grape’s history of domestication began around 6000-8000 years ago, when the
domesticated grape, Vitis vinifera vinifera, was cultivated from the wild grape, Vitis vinifera
sylvestris. The f3 statistic was used to test for mixture between vinifera west, vinifera east,
and sylvestris west (f3 = -0.00481); f3 statistics were also used to test for mixture between
sylvestris west, vinifera west, and sylvestris east (f3 = 0.0268).
The researchers found that western vinifera is most likely a combination of eastern
vinifera and western sylvestris. Nonetheless, there does not seem to be a genetic transfer
between western vinifera and western sylvestris. This supports the view that vinifera originated in
the Near East and later received introgression from wild sylvestris in Europe.
This analysis found that little of the potential genetic diversity of the grape has been
explored. The researchers use this finding to suggest that to overcome the grape’s
significant pathogen pressures, its genetic diversity must be utilised to its advantage. The
domesticated grape contains genetic variation much larger than that of humans, thus
making it ideal to manipulate for its polymorphisms and genetic diversity.
Project
The goal of this project is to correct for admixture when calculating the outgroup-f3
statistic so it is an accurate measure of genetic relatedness. I first proposed a correction
similar to that used by Lindo et al. for the D statistic.
The D statistic can be used to test for admixture across four populations. In his
paper, “Ancient individuals from the North American Northwest Coast reveal 10,000 years
of regional genetic continuity,” John Lindo proposed a contamination correction to account
for similar admixture histories for this statistic (Proceedings of the National Academy of
Sciences of the United States of America, 2017). The contamination correction factor Lindo
proposes is based on contamination of an ancient genome with modern DNA from a
distantly related population, though the one we propose for f3 statistics will be based on
the level of artificially induced admixture. Nonetheless, Lindo used a corrected formula to
calculate a new D statistic, with admixture corrected for using the contamination correction.
In that formula, D_ShukáKáa is the contaminated sample’s D statistic; D_GBR is the D statistic, substituting an
individual representative of the population that contaminated Shuká Káa; c is the
contamination rate. For the f3 statistic, this equation would look like
$f_3^{*} = \frac{f_3 - a \cdot f_{3a}}{1 - a}$, where
f3 is the contaminated sample’s f3 statistic, f3a is the f3 statistic with an outgroup as the
population that contaminated the original group, and a is the admixture proportion.
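The f3 analogue of the contamination correction can be sketched as a one-line function, assuming it has the form (f3 - a * f3a) / (1 - a) with the symbols defined above. The function name `lindo_style_f3` and the input values are hypothetical and only illustrate the arithmetic.

```python
def lindo_style_f3(f3: float, f3a: float, a: float) -> float:
    """Contamination-style correction: (f3 - a * f3a) / (1 - a).

    f3  : f3 of the admixed sample (standard outgroup)
    f3a : f3 computed with the admixing population as the outgroup
    a   : admixture proportion, 0 <= a < 1
    """
    if not 0.0 <= a < 1.0:
        raise ValueError("admixture proportion must satisfy 0 <= a < 1")
    return (f3 - a * f3a) / (1.0 - a)


# Hypothetical inputs (not values from the thesis).
print(lindo_style_f3(0.20, 0.10, 0.0))  # no admixture: the value is unchanged
print(lindo_style_f3(0.20, 0.10, 0.5))
```

The denominator shrinks as a approaches 1, so a correction of this form becomes unstable at very high admixture proportions.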
Dataset
Our research group utilised population data from North and South American
indigenous populations. The first step of this project was to gather usable data by removing SNPs
that were missing in more than 90% of the population and pruning SNPs based on linkage
disequilibrium. I next used the ADMIXTURE program to identify individuals with evidence of
European admixture. Populations were then split into three groups: those that had no
evidence of European admixture (Cabecar, Mixe, Surui, GuaraniKW, Xaltocan, and
Xavante), those where a number of individuals were admixed and a number were not
(JaltocanHidalgo, Pima, Xaltocan), and those where the entire population had European
admixture (AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native
American).
Populations into which Admixture    Population with Admixed and    Admixed Populations on which
was Artificially Introduced         Unadmixed Individuals          to Test Correction
Cabecar                             JaltocanHidalgo                AleutRaff
Mixe                                Pima                           Algonquin
Surui                               Xaltocan                       Cree
GuaraniKW                                                          Chipewyan
Xaltocan                                                           Inupiat
Xavante                                                            Ojibwa
                                                                   Southern US Native American
Methods
For my project, I used six completely unadmixed human populations from North and
South America: Cabecar, Mixe, Surui, GuaraniKW, Xaltocan, and Xavante. I introduced
artificial admixture in constant 5% intervals from 5% to 95% admixture from a European
population. This was done via a program in R that arbitrarily replaced 5 to 95% of the
population’s genome with the corresponding segment of a European genome. Below is an
example of the code used to induce admixture in the population Cabecar using a for-loop.
    # Admixture proportions from 5% to 95% in steps of 5%
    ADM=(0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95)
    for j in "${ADM[@]}"; do
      Rscript admixer.R --file ./final_dataset_cleanest2.vcf \
        --donor Spanish --recip Cabecar --p "$j" --subs 5 \
        --out "final_admix_Cabecar_$j.vcf"
    done
After simulating admixture in these populations, I obtained outgroup-f3 values for
each of these populations and each of the admixture levels within them using the program
popstats. I also obtained an f3 statistic by swapping out the English population for the
Yoruba population, a west African group assumed to be equally distantly related to all these
populations, as the outgroup. This outgroup serves as a reference group to compare the
desired population and the ingroup to. Karitiana was used as the ingroup for both tests.
Then, we can see how increased admixture affects the statistic. This was done using the
commands below, where j spanned the admixture proportions mentioned previously:
    # f3 with Yoruba as the outgroup (braces keep the shell from reading $j_f3 as one variable name)
    python ~/Desktop/project/bin/popstats/popstats.py --file final_admix_Cabecar_$j \
      --f3 --pops C,Karitiana,Yoruba --informative > final_admix_Cabecar_${j}_f3.txt
    # f3a with English as the outgroup
    python ~/Desktop/project/bin/popstats/popstats.py --file final_admix_Cabecar_$j \
      --f3 --pops C,Karitiana,English --informative > final_admix_Cabecar_${j}_f3a.txt
Comparing these values to the admixture levels, I was able to re-evaluate the
suggested solution as needed. Then by getting an f3 statistic for these populations and
setting the outgroup as the population assumed to have contaminated them (English
population), I calculated a new f3 statistic, which was hopefully corrected for admixture.
To further test if this correction worked, I took populations that contained
individuals with and without admixed genomes. By correcting for the portion of the
population that was admixed, I checked whether this corrected f3 statistic matched the unadmixed
portion’s f3 statistic. I did this in individuals from the JaltocanHidalgo, Pima and Xaltocan
populations. I then computed a baseline f3 statistic comparing the whole populations, with
Karitiana as the ingroup, and Yoruba as the outgroup. After doing so, I got an f3 statistic
from the admixed individuals in these populations in relation to Yoruba, and then got an f3
statistic from the admixed individuals in these populations in relation to an English
population.
If the f3 statistic was successfully corrected, we could make inferences about the
genetic histories of other contaminated populations. I then applied the f3 statistic to the
populations AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US
Native American. I obtained the admixture proportion from the amount of European DNA in
these individuals. Then I corrected the f3 statistic by getting an f3 using Yoruba first, and
then using English ancestry to compare their genomes to.
Based on preliminary results, the solution could take the form of a corrected
equation for outgroup-f3 statistics. On the other hand, it could start with an equation to get
a corrected f3 value, which is then manipulated further. This is where I could come up with
a table of values that correspond to different levels of admixture. These “differences”
between the semi-corrected f3 and the baseline f3 are then to be subtracted from the semi-
corrected f3. Since others attempting to use this correction will not have a baseline f3 for
comparison, our goal is to come up with a universal set of differences that can be used
depending solely on the admixture levels.
Correction Attempt 1
Using the two f3 statistics, I posited a correction equation to get the corrected f3
values to look similar to the baseline f3 values when graphed. A correction similar to that
proposed for the D statistic by Lindo was attempted first. However, this was unsuccessful. A
new equation was then suggested and tested. This equation took the form of
$(f_3 - f_{3a}) \cdot a + f_3$, where f3 was the statistic calculated with Yoruba as the outgroup,
f3a was the statistic calculated with English as the outgroup, and a was the admixture
proportion that we introduced into the population. Using these values, I graphed the
relation between admixture proportion and the corrected f3 statistic. All the populations’
graphs exhibited similar trends. Below is a graph using Cabecar’s f3 values to be used as a
reference.
[Figure: Cabecar. Corrected f3 values and baseline f3 values (with a linear trendline through the baseline values) plotted against admixture proportion.]
Correction Attempt 2
Clearly, the two sets of points are not that similar. As such, I attempted to again
correct the equation. Looking at the trend of f3 values dipping around 20-40% admixture
levels, it seemed that perhaps I was overcorrecting the f3 values by using f3 values that
change with the admixture proportion. As such, I proposed the following equation instead:
$(f_3 - f_{3a}) \cdot a + f_{3,\text{baseline}}$, where $f_{3,\text{baseline}}$ was the value calculated for each of the
populations when there was no artificial admixture introduced. This appeared to at least
present a better correlation between admixture and corrected f3 values when graphed.
Below is a graph of the newly corrected f3 values plotted against admixture proportions
again.
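The two trial equations differ only in their final term, which a short sketch makes explicit. The grouping (f3 - f3a) * a is an assumed reading of the equations, chosen because it returns f3 unchanged when a = 0; the function names and input values are hypothetical.

```python
def attempt_1(f3: float, f3a: float, a: float) -> float:
    """First trial correction: (f3 - f3a) * a + f3."""
    return (f3 - f3a) * a + f3


def attempt_2(f3: float, f3a: float, a: float, f3_baseline: float) -> float:
    """Second trial correction: (f3 - f3a) * a + f3_baseline."""
    return (f3 - f3a) * a + f3_baseline


# Hypothetical values (not thesis data): f3 and f3a at 50% admixture,
# plus the f3 of the same population before any admixture was introduced.
f3, f3a, a, f3_baseline = 0.20, 0.10, 0.5, 0.22
print(attempt_1(f3, f3a, a))
print(attempt_2(f3, f3a, a, f3_baseline))
```

Because the second form takes the unadmixed baseline as an input, it can only be evaluated where that baseline is already known.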
[Figure: Cabecar. Corrected f3 values and baseline f3 values (with a linear trendline through the baseline values) plotted against admixture proportion.]
These new f3 values look relatively linear, and as such, I
seemed to be on the right track. To further correct the f3 values, I
attempted to find the difference between the newly corrected f3
values and the baseline f3 values. I did this for each population,
and then found the averages of the differences for each
admixture proportion. The table below shows the results.

Admixture Proportion    Average Difference
0.05                    0.004970747
0.1                     0.010104352
0.15                    0.015366869
0.2                     0.020830249
0.25                    0.026399174
0.3                     0.032173864
0.35                    0.038141322
0.4                     0.044222014
0.45                    0.050421594
0.5                     0.056698452
0.55                    0.063145399
0.6                     0.06952562
0.65                    0.076407052
0.7                     0.083150127
0.75                    0.089963764
0.8                     0.097164007
0.85                    0.104699337
0.9                     0.111876887
0.95                    0.119257046

I then plotted the admixture proportions and the average
differences, as they looked quite similar. I hoped to see if there
was a correlation using a linear relationship. The R² value was
0.9973, indicating that there is a significant relationship between
these two values. Thus, I attempted to use the equation for the
linear regression line as a correction for the f3 values. I used the
values that I had corrected using the equation $(f_3 - f_{3a}) \cdot a + f_{3,\text{baseline}}$ and then
subtracted the difference, calculated using the following equation: $y = 0.1272x - 0.005$.
Given a certain admixture proportion, I would plug that value in for x in the equation to get
the difference to be subtracted from the corrected f3 value. This resulted in a parabolic
looking graph of the f3 values plotted against the admixture proportion, shown below (again
with the baseline f3 values plotted as a reference for the desired values).
[Figure: F3 Versus Admixture. Corrected f3 values, baseline f3 values (with linear trendline), upper bound, and lower bound plotted against admixture proportion.]
Clearly, this was again not an ideal correction of the f3 values. I attempted to put this
in the perspective of the confidence intervals of the baseline f3 values. As such, the upper
bound and lower bound representative of one standard deviation above and below the
baseline f3 are shown on the graph (the standard deviation was calculated by the popstats
program used to get the baseline f3 value). Therefore, I attempted to fix the regression
equation we had gotten from the average differences. As such, I used the following power
equation instead: $y = 0.122x^{k}$ (the exponent $k$ and the accompanying R² value are illegible in the source),
indicating that this equation might work as a correction. Nonetheless, once I used this
equation with the different admixture proportions to subtract from the corrected f3 values,
I still had a graph that did not look ideal (below).
[Figure: F3 Versus Admixture. Corrected f3 values, baseline f3 values (with linear trendline), upper bound, and lower bound plotted against admixture proportion.]
Correction Attempt 3
Then, I attempted to just use the average differences to subtract from the corrected
f3. I hoped to get these differences for more admixture values where the linear regression
line/power line did not match the data well, if this attempt worked. I calculated new f3
values with this correction and got the following graph.
[Figure: F3 Versus Admixture. Corrected f3 values, baseline f3 values (with linear trendline), upper bound, and lower bound plotted against admixture proportion.]
This graph clearly looked a lot better than previous attempts. Furthermore, it was
the only solution thus far that yielded corrected f3 values within the bounds of one standard
deviation above and below a reference f3 statistic. Nonetheless, it was not a perfect fit.
To make this graph even better, I got intervals that were closer together (intervals of
1% admixture) between 75 and 85% of admixture. This was an area that looked to have a
large degree of variance between the baselines and the corrected f3 values. As such, if these
new differences that were calculated were better indicators of the difference to subtract
from the corrected f3, then I could use these values for the correction.
After finding intervals that were closer together, I noticed that this did not
significantly impact the correction factor. As such, I tried to use a second order polynomial
equation, and got the highest R² value yet (R² = 0.99999). Below is the graph when using the
quadratic equation to correct the f3 values to baseline.
[Figure: F3 versus admixture. Corrected f3 values, baseline f3 (with linear trendline), upper bound, and lower bound plotted against admixture proportion.]
When I continued with my results, I quickly ran into a snag. I had used the baseline
f3 to find a correction equation to get to the baseline f3. In other words, I used the result to
force the desired result. However, I was unable to factor out the baseline f3 values to get a
correction independent of them.
Correction Attempt 4
As such, I was back to square one and attempted to work with the initial correction
equation for f3 ($(f_3 - f_{3a}) \cdot a + f_3$). I then went back and got the differences between
the baseline f3 values and these f3 values. After doing this, I plotted the baseline f3 values
against the “corrected” f3 values. There appeared to be a fairly linear trend amongst the f3
values using the equation above, across all six populations. I also noticed that all the f3
values were less than the baseline f3, which reinforced the trend of decreasing f3 values
with increased admixture levels. Below is a sample graph from the population Cabecar (with
the baseline f3 values in orange, and the preliminarily corrected f3 values in blue). The
equation given is for the linear regression line for the preliminarily corrected f3 values.
[Figure: Baseline Versus F3 Values (Cabecar). f3 values and baseline f3 values plotted against admixture proportion; linear regression through the f3 values: y = -0.1036x + 0.221, R² = 0.99645.]
This led us to believe that we could use the differences between the baseline and
the preliminarily corrected f3. After doing this for the six populations, I got the average of
the six differences for each admixture level. For instance, I got the average difference for an
admixture proportion of 5% across all six populations. After doing so, I used the average
differences to get a newly corrected f3 by adding them to the preliminarily corrected f3. I
noticed that these new f3 values were relatively similar to the baseline f3, though they were not
ideal. As such, I decided it would be beneficial to get a 95% confidence interval for the
differences, to see if these confidence intervals of differences would give us something in an
appropriate range around the baseline when added to the preliminarily corrected f3.
To do so, I wanted to use a t-test, but the data was not approximately normally
distributed. Therefore, I used a Wilcoxon signed-rank test, which is a non-parametric
statistical hypothesis test that allows us to perform a version of the t-test without normally
distributed data. It is often referred to as the Wilcoxon T Test. Upon doing so in R, I noticed
that the 95% confidence intervals for the differences for each admixture level across the six
populations would give us a range of differences. Below is a table of these confidence
intervals.

Admixture Proportion    Wilcox Confidence Interval
0.05                    (0.00750608, 0.02297737)
0.1                     (0.01441939, 0.03037686)
0.15                    (0.02137041, 0.03316990)
0.2                     (0.02569456, 0.04308678)
0.25                    (0.03300473, 0.04942423)
0.3                     (0.03840717, 0.05224841)
0.35                    (0.04300735, 0.05772577)
0.4                     (0.04815174, 0.06442764)
0.45                    (0.05378068, 0.06854280)
0.5                     (0.06017016, 0.07539880)
0.55                    (0.06489406, 0.08105843)
0.6                     (0.07016284, 0.08537045)
0.65                    (0.07382676, 0.09035722)
0.7                     (0.07968564, 0.09463120)
0.75                    (0.08441375, 0.10056375)
0.8                     (0.08823818, 0.10527647)
0.85                    (0.09257367, 0.11016625)
0.9                     (0.09765934, 0.11534444)
0.95                    (0.1012127, 0.1194585)
When the uncorrected f3 values were added to the lower and upper bounds of the
confidence intervals, I got intervals for newly corrected f3 values. Once I did this, I noticed
that this interval of f3 values included the baseline f3 values. At first, I hoped to get the
baseline f3 values to align with the newly corrected f3 values when using one standard
deviation above and below the baseline in conjunction with the confidence interval of newly
corrected f3 values. However, the correction using these differences worked well enough
that we did not need to consider one standard deviation above and below the baseline f3.
Simply using the confidence intervals for the differences to get confidence intervals for
corrected f3 values was sufficient as a correction.
I then applied this correction to the natural populations, JaltocanHidalgo, Pima, and
Xaltocan. I did so by rounding the admixture proportion for these populations to the nearest
multiple of 0.05, such that I would be able to use the differences (since we only had these
for admixtures that were multiples of 0.05). Upon doing so, I used the confidence intervals
for the differences and added the lower and upper bounds to the initial, uncorrected f3
value. Once I did this, I noticed that the baseline f3 statistic fell in this range of new f3 values
in the Pima population and in the Xaltocan population. However, this correction did not
work for JaltocanHidalgo. The range of new f3 values ended up being (0.247343959,
0.262815249), whereas the baseline f3 value was 0.227966338.
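The interval check described above can be sketched with a few rows of the Wilcoxon confidence-interval table. The function names are hypothetical, and the uncorrected JaltocanHidalgo f3 is not stated in the text: it is back-calculated here from the reported range on the assumption that the 0.05 row was used (the width of the reported range matches that row).

```python
# Excerpt of the confidence intervals for the differences (first three table rows).
WILCOX_CI_EXCERPT = {
    0.05: (0.00750608, 0.02297737),
    0.10: (0.01441939, 0.03037686),
    0.15: (0.02137041, 0.03316990),
}


def round_to_grid(a: float, step: float = 0.05) -> float:
    """Round an admixture proportion to the nearest multiple of `step`."""
    return round(round(a / step) * step, 2)


def corrected_interval(f3_uncorrected: float, a: float) -> tuple[float, float]:
    """Add the lower and upper difference bounds to the uncorrected f3."""
    low, high = WILCOX_CI_EXCERPT[round_to_grid(a)]
    return (f3_uncorrected + low, f3_uncorrected + high)


def covers(interval: tuple[float, float], baseline: float) -> bool:
    """True when the baseline f3 lies inside the corrected interval."""
    return interval[0] <= baseline <= interval[1]


# Assumption: uncorrected f3 inferred from the reported lower end of the range.
jh_uncorrected = 0.247343959 - 0.00750608
jh_interval = corrected_interval(jh_uncorrected, 0.05)
jh_ok = covers(jh_interval, 0.227966338)
print(jh_interval, jh_ok)
```

Under that assumption the uncorrected value already sits above the baseline, so adding a positive difference can only move the interval further away, which matches the reported failure for this population.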
Regardless, I then applied this correction to the populations that had admixture,
AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native American.
I used the confidence intervals for the differences again and rounded the admixture
proportion for each population to the nearest multiple of 0.05. Upon doing so, I calculated
an interval of f3 values that the baseline f3 is presumed to fall in.
To see if I was able to get a better correction, I plotted the average differences. I was
able to use a polynomial regression line since the R² values were all above 0.99993. I then
got the equation for this curve, which I then used to get a value (using the admixture
proportion as the x value) to add to the preliminarily corrected f3. This results in f3 values
that are similar to the f3 values I got from merely adding back in the average difference for
the admixture proportions at 5% increments. However, they do not fall within one standard
deviation of the baseline f3 values, just as adding the differences at 5% admixture increments did
not yield f3 values that fell within that range either.
As such, I plotted the lower bounds and upper bounds of the 95% Wilcox confidence
intervals separately and found regression lines for each. I found that second-order
polynomial equations fit the data best (highest R² value) and was able to use these
equations to add back in the difference to the baseline f3 value. This allowed a continuous
correction of the f3 statistic, rather than just at discrete admixture intervals of 5%.
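A continuous correction of this kind can be sketched by fitting a quadratic to each column of the confidence-interval table. The fitted coefficients below are recomputed from the tabulated bounds with a small least-squares helper, so they are not the equations obtained in the original analysis; the function names and the example f3 value are hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def evaluate(coeffs, x):
    """Evaluate c0 + c1*x + c2*x**2."""
    c0, c1, c2 = coeffs
    return c0 + c1 * x + c2 * x * x


admixture = [round(0.05 * k, 2) for k in range(1, 20)]
lower = [0.00750608, 0.01441939, 0.02137041, 0.02569456, 0.03300473,
         0.03840717, 0.04300735, 0.04815174, 0.05378068, 0.06017016,
         0.06489406, 0.07016284, 0.07382676, 0.07968564, 0.08441375,
         0.08823818, 0.09257367, 0.09765934, 0.1012127]
upper = [0.02297737, 0.03037686, 0.03316990, 0.04308678, 0.04942423,
         0.05224841, 0.05772577, 0.06442764, 0.06854280, 0.07539880,
         0.08105843, 0.08537045, 0.09035722, 0.09463120, 0.10056375,
         0.10527647, 0.11016625, 0.11534444, 0.1194585]

lower_fit = fit_quadratic(admixture, lower)
upper_fit = fit_quadratic(admixture, upper)


def corrected_range(f3_value, a):
    """Range obtained by adding both fitted bounds at proportion a to f3_value."""
    return (f3_value + evaluate(lower_fit, a), f3_value + evaluate(upper_fit, a))


# Hypothetical f3 value at an admixture proportion that is not a multiple of 0.05.
print(corrected_range(0.18, 0.37))
```

Fitting the two bounds separately keeps the output as a range rather than a single value, in line with how the discrete table was applied.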
Conclusion
Through the course of this research project, I have developed a crude admixture
correction for the outgroup-f3 statistic. By first finding the f3 value of the contaminated
population, a “correction factor” can be added back in to bring that value within a ballpark
around the baseline f3 statistic. This correction factor comes in the form of a lower bound
quadratic equation and an upper bound quadratic equation. When both of these are added
to the f3 statistic, the result is a range of f3 values. Comparing these results to the baseline
f3 statistics, I conclude that this correction works within a margin of error. Since the
correction only worked in two out of the three populations with admixed and unadmixed
individuals, we cannot conclude irrefutably that this correction works.
Nonetheless, the correction worked for all admixture levels in all six of the artificially
admixed populations (6 × 19 = 114 cases). Therefore, I applied the correction to the
seven populations that were completely admixed with European DNA. This resulted in a
range of f3 values that resembled appropriate f3 values. However, there is no way to check
for which of these seven populations the correction actually worked.
In the future, researchers might be able to fine-tune our correction using data from
more populations. For instance, our confidence intervals for the Wilcoxon signed-rank test
would likely span a shorter range if there were more data to pull from. Furthermore, it is
possible that researchers might be able to further manipulate the postulated equations
mentioned previously. Given that Lindo and colleagues were able to find a neat correction
equation for the D statistic, it is possible that there exists one for the f3 statistic as well. It
was also observed during this project that certain corrections that were suggested worked
better at lower admixture proportions. Just as the normal f3 statistic is most accurate under
certain conditions, one of which is that the admixture proportion be close to 50%, it is
possible that the outgroup-f3 statistic works best at lower admixture proportions.
Regardless, this correction is useful for researchers hoping to study the genetic
relatedness of different populations. In particular, this potential solution is most useful for
those hoping to perform outgroup-f3 statistics in populations that have individuals with
genetic admixture.
Glossary
• Admixture: genetic interaction of two or more previously isolated populations
interbreeding
• D statistic: a four-population test for admixture
• F statistic: measures shared genetic drift between sets of populations
o Normal f3 statistic: tests for admixture between three populations
o Outgroup-f3 statistic: proportional to amount of shared genetic history
between two populations
• For-loop: a control flow statement that specifies iteration to execute code
repeatedly
• Genetic drift: the change in allele frequencies in a population over generations as a
mechanism of evolution
• Genetic relatedness: probability that two individuals share an allele from common
ancestry
• Linkage disequilibrium: non-random association of alleles at various loci
• Outgroup: reference group of organisms not in the populations being studied
• Phylogenetic trees: branching diagrams representing evolutionary relationships
amongst organisms
• SNPs: single nucleotide polymorphisms; change in a single nucleotide at a specific
genome position
Bibliography
Alexander, David H., et al. “Fast Model-Based Estimation of Ancestry in Unrelated
Individuals.” Genome Research, Cold Spring Harbor Lab, 22 June 2009,
genome.cshlp.org/content/early/2009/07/31/gr.094052.109.
Lindo, John, et al. “Ancient Individuals from the North American Northwest Coast Reveal
10,000 Years of Regional Genetic Continuity.” PNAS, National Academy of Sciences,
18 Apr. 2017, www.pnas.org/content/114/16/4093.
Myles, Sean, et al. “Genetic Structure and Domestication History of the Grape.” PNAS,
National Academy of Sciences, 1 Mar. 2011,
www.pnas.org/content/108/9/3530.abstract.
Patterson, Nick, et al. “Ancient Admixture in Human History.” Genetics, Genetics, 1 Nov.
2012, www.genetics.org/content/192/3/1065.
Peter, Benjamin M. “Admixture, Population Structure, and F-Statistics.” Genetics, Genetics,
1 Apr. 2016, www.genetics.org/content/202/4/1485.
Pontussk. “Pontussk/Popstats.” GitHub, GitHub, 30 July 2015,
github.com/pontussk/popstats.
Raghavan, Maanasa, et al. “Upper Palaeolithic Siberian Genome Reveals Dual Ancestry of
Native Americans.” Nature, Macmillan Publishers Limited, 2 Jan. 2014,
www.academia.edu/7110954/Upper_Palaeolithic_Siberian_genome_reveals_dual_
ancestry_of_Native_Americans.
Reich, David, et al. “Reconstructing Indian Population History.” Nature, U.S. National Library
of Medicine, 24 Sept. 2009, www.ncbi.nlm.nih.gov/pmc/articles/PMC2842210/.
Vyas, Deven N., et al. “Testing Support for the Northern and Southern Dispersal Routes out
of Africa: an Analysis of Levantine and Southern Arabian Populations.” American
Journal of Physical Anthropology, Wiley-Blackwell, 15 Sept. 2017,
onlinelibrary.wiley.com/doi/10.1002/ajpa.23312/full.