Post on 12-Jul-2020
ChalkTalk
TandyWarnowDepartmentsofComputerScienceand
BioengineeringUniversityofIllinoisatUrbana-Champaign
• Large-scale statistical phylogeny estimation • Ultra-large multiple-sequence alignment • Estimating species trees from incongruent gene trees • Supertree estimation • Genome rearrangement phylogeny • Reticulate evolution • Visualization of large trees and alignments • Data mining techniques to explore multiple optima
The Tree of Life: Multiple Challenges
Largedatasets:100,000+sequences10,000+genes“BigData”complexity
Applications areas: • metagenomics • protein structure and function prediction • trait evolution • detection of co-evolution • systems biology
The Tree of Life: Multiple Challenges
Largedatasets:100,000+sequences10,000+genes“BigData”complexity
Techniques: • Graph theory (especially chordal graphs) • Probability theory and statistics • Hidden Markov models • Combinatorial optimization • Heuristics • Supercomputing
The Tree of Life: Multiple Challenges
Largedatasets:100,000+sequences10,000+genes“BigData”complexity
Overview• Theory:combiningprobabilitytheory,graphtheory,andopLmizaLon
• SimulaLons:evaluaLngmethodsunderstochasLcmodelsofsequenceevoluLon
• Biologicaldataanalysis:refiningmethodsandenablingdiscovery
• OpensourcesoOwaredevelopment• HighperformancecompuLng• ApplicaLonsoutsidebiology(e.g.,historicallinguisLcs,bigdataproblemsingeneral)
PastWork(highlights)• GenetreeesLmaLon(theoreLcalresultsunderstochasLcmodelsofsequenceevoluLon)
• MulLplesequencealignmentonlargedatasets,andco-esLmaLonofalignmentsandtrees
• PhylogeneLcnetworksandspeciestreesfrommulL-locusdatasets
• Genomerearrangementphylogeny• Supertreemethods• Metagenomics• HistoricallinguisLcs
Futurework
Theory,methods,andempiricalstudiesfor• Genome-scalephylogenyesLmaLonaddressingmulLplesourcesforgenetreeheterogeneity
• Microbiomeanalysis• Ultra-largemulLplesequencealignmentandtreeesLmaLon
AndapplicaLonsofthesetechniquesoutsidebiology
CurrentNSFgrants
• Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977
• Mul*pleSequenceAlignment:NSFABI-1458652
• Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629
CurrentNSFgrants
• Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977
• Mul*pleSequenceAlignment:NSFABI-1458652
• Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629
MajorAreas• Phylogenomics:SpeciestreeandnetworkesLmaLonusing
wholegenomes(andgenetreeesLmaLoninthecontextofwholegenomes)
• Mul*pleSequenceAlignment:InferringrelaLonshipsbetweenleeersinmolecularsequences,especiallyonverylargedatasets(upto1,000,000sequences)
• Metagenomics:Analysisofmolecularsequencesobtainedfromenvironmentalsamples(jointwithMihaiPopandBillGropp)
• Scalingcomputa*onallyintensivemethodstolargedatasets:CombiningdiscretemathandstaLsLcalmethodstoenablehighlyaccurateanalysisofultra-largedatasets(jointwithChandraChekuriandSaLshRao)
Phylogenomics = Species trees from whole genomes
“NothinginbiologymakessenseexceptinthelightofevoluLon”-Dobhzansky
phylogenomics
2
gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000gene 1
“gene” here refers to a portion of the genome (not a functional gene)
Orangutan
Gorilla
Chimpanzee
Human
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
Gene tree discordance
3
Orang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
gene1000gene 1
IncompleteLineageSorLng(ILS)isadominantcauseofgenetreeheterogeneity
IncompleteLineageSorLng(ILS)
• ConfoundsphylogeneLcanalysisformanygroups:Hominids,Birds,Yeast,Animals,Toads,Fish,Fungi,etc.
• ThereissubstanLaldebateabouthowtoanalyzephylogenomicdatasetsinthepresenceofILS,focusedaroundstaLsLcalconsistencyguarantees(theory)andperformanceondata.
. . .
Analyzeseparately
Summary Method
MaincompeLngapproaches gene 1 gene 2 . . . gene k
. . . Concatenation
Species
StaLsLcalConsistency
error
Data
. . .
Analyzeseparately
Summary Method
MaincompeLngapproaches gene 1 gene 2 . . . gene k
. . . Concatenation
Species
Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]
• Optimization Problem (NP-Hard):
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
8
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
Set of quartet trees induced by T
a gene tree
Score(T ) =X
t2TQ(T ) \Q(t)
all input gene trees
ConstrainedMQST(MaximumQuartetSupportTree)
• Input:SetT= {t1,t2,…,tk}ofunrootedgenetrees,witheachtreeonsetSwithnspecies,andsetXofallowedbiparLLons
• Output:UnrootedtreeTonleafsetS,maximizingthetotalquartettreesimilaritytoT,subjecttoTdrawingitsbiparLLonsfromX.
Theorems(Mirarabetal.,2014):• IfXcontainsthebiparLLonsfromtheinputgenetrees(andperhaps
others),thenanexactsoluLontothisproblemisstaLsLcallyconsistentundertheMSC.
• TheconstrainedMQSTproblemcanbesolvedinO(|X|2nk)Lme.(Weusedynamicprogramming,andbuildtheunrootedtreefromtheboeom-up,basedon“allowedclades”–halvesoftheallowedbiparLLons.)
200 Estimated Gene Trees
Data: Fixed, moderate ILS rate, 50 replicates per HGT rates (1)-(6), 1 model species tree per replicate on 51 taxa, 1000 true gene trees,simulated 1000 bp gene sequences using INDELible 8, 1000 gene trees estimated from GTR simulated sequences using FastTree-27
7Price, Dehal, Arkin 20158Fletcher, Yang 2009
12
ASTRALisfairlyrobusttoHGT+ILS
Davidsonetal.,RECOMB-CG,BMCGenomics2015
16
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
1000 genes, “medium” levels of recent ILS
Tree accuracy when varying the number of species
ContribuLons(sample)MethodsforesLmaLngspeciestreesfromgenome-scaledata:
• ASTRAL(Mirarabetal.,BioinformaLcs2014,2015)andASTRID(VachaspaLandWarnow,BMCGenomics2015):polynomialLmemethodsthatarestaLsLcallyconsistentundertheMSC.Bothcananalyzeverylargedatasets(1000speciesand1000genes–ormore)withhighaccuracy.
• StaLsLcalbinning(Mirarabetal.,Science2014,Bayzidetal.PLOSOne2015)canreducegenetreeesLmaLonerror,andleadtoimprovedspeciestreeesLmaLons(topology,branchlengths,andincidenceoffalseposiLves)
• BBCA(Zimmermannetal.,BMCGenomics2014)enablesBayesianco-esLmaLonmethodstoscaletolargenumbersofgenes
• DCM-boosLng(Bayzidetal.,BMCGenomics2014)enablescomputaLonallyintensivemethodstoscaletolargenumbersofspecies
MathemaLcaltheory:
• RochandWarnow,SystemaLcBiology2015)regardingstaLsLcalconsistencyundertheMSCgivenfinitelengthsequences.
• Uricchioetal.,BMCBioinformaLcs2016,numberoflocineededtorecoverallthesplitswithhighprobability
Biologicaldataanalyses:
• Avianphylogenomicsproject(Jarvis,Mirarabetal.,Science2014)
• ThousandPlantTranscriptomeProject(Wickee,Mirarabetal.PNAS2014)
• Tarveretal.GenomeBiologyandEvoluLon2016,Mammalianphylogeny
CurrentNSFgrants
• Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977
• Mul*pleSequenceAlignment:NSFABI-1458652
• Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629
CurrentNSFgrants
• Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977
• Mul*pleSequenceAlignment:NSFABI-1458652
• Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629
Metagenomictaxonomiciden*fica*onandphylogene*cprofiling
Metagenomics,Venteretal.,ExploringtheSargassoSea:Scien*stsDiscoverOneMillionNewGenesinOceanMicrobes
1. Whatisthisfragment?(Classifyeachfragmentaswellaspossible.)
2.WhatisthetaxonomicdistribuLoninthedataset?(Note:helpfultousemarkergenes.)
3.Whataretheorganismsinthismetagenomicsampledoingtogether?
BasicQuesLons
This talk
• SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs)
• Applications of the eHMM technique to metagenomic abundance classification (TIPP, Bioinformatics 2014)
PhylogeneLcPlacement
Input:Backbonealignmentandtreeonfull-lengthsequences,andasetofhomologousquerysequences(e.g.,readsinametagenomicsampleforthesamegene)
Output:Placementofquerysequencesonbackbonetree
PhylogeneLcplacementcanbeusedinsideapipeline,aOerdeterminingthegenesforeachofthereadsinthemetagenomicsample.
Marker-based Taxon Identification
ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT
ACCGCGAGCGGGGCTTAGAGGGGGTCGAGGGCGGGG• .• .• .ACCT
Fragmentarysequencesfromsomegene
Full-lengthsequencesforsamegene,andanalignmentandatree
AlignSequence
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC
AlignSequence
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
PlaceSequence
S1
S4
S2
S3Q1
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
PhylogeneLcPlacement• Aligneachquerysequencetobackbonealignment
– HMMALIGN(Eddy,BioinformaLcs1998)– PaPaRa(BergerandStamatakis,BioinformaLcs2011)
• Placeeachquerysequenceintobackbonetree– Pplacer(Matsenetal.,BMCBioinformaLcs,2011)– EPA(BergerandStamatakis,SystemaLcBiology2011)
Note:pplacerandEPAusemaximumlikelihood
HMMERvs.PaPaRaAlignments
Increasing rate of evolution
0.0
One Hidden Markov Model for the entire alignment?
Or2HMMs?
HMM1
HMM2
HMM1
HMM3 HMM4
HMM2
Or4HMMs?
SEPPParameterExploraLon
§ Alignmentsubsetsizeandplacementsubsetsizeimpacttheaccuracy,runningLme,andmemoryofSEPP
§ 10%rule(subsetsizes10%ofbackbone)hadbestoverallperformance
SEPP(10%-rule)onsimulateddata
0.0
0.0
Increasing rate of evolution
Marker-based Taxon Identification
ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT
ACCGCGAGCGGGGCTTAGAGGGGGTCGAGGGCGGGG• .• .• .ACCT
Fragmentarysequencesfromsomegene
Full-lengthsequencesforsamegene,andanalignmentandatree
TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarb, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes
TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference
alignment 4. Modifies SEPP by considering statistical uncertainty in the
extended alignment and placement within the tree
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The distribution of the sample at the species-level is:
50% species A
20% species B
15% species C
14% species D
1% species E
Abundance Profiling
Highindeldatasetscontainingknowngenomes
Note:NBC,MetaPhlAn,andMetaPhylercannotclassifyanysequencesfromatleastoneofthehighindellongsequencedatasets,andmOTUterminateswithanerrormessageonallthehighindeldatasets.
“Novel”genomedatasets
Note:mOTUterminateswithanerrormessageonthelongfragmentdatasetsandhighindeldatasets.
TIPPvs.otherabundanceprofilers
• TIPPishighlyaccurate,eveninthepresenceofhighindelratesandnovelgenomes,andforbothshortandlongreads.
• Allothermethodshavesomevulnerability(e.g.,mOTUisonlyaccurateforshortreadsandisimpactedbyhighindelrates).
• ImprovedaccuracyisduetotheuseofeHMMs;singleHMMsdonotprovidethesameadvantages,especiallyinthepresenceofhighindelrates.
SEPPandeHMMs
AnensembleofHMMsprovidesabeeermodelofamulLplesequencealignmentthanasingleHMM,andisbeeerableto• detecthomologybetweenfulllengthsequencesandfragmentarysequences
• addfragmentarysequencesintoanexisLngalignment
especiallywhentherearemanyindelsand/orsubsLtuLons.
OurPublicaLonsusingeHMMs• S.Mirarab,N.Nguyen,andT.Warnow."SEPP:SATé-EnabledPhylogeneLc
Placement."Proceedingsofthe2012PacificSymposiumonBiocompuLng(PSB2012)17:247-258.
• N.Nguyen,S.Mirarab,B.Liu,M.Pop,andT.Warnow"TIPP:TaxonomicIdenLficaLonandPhylogeneLcProfiling."BioinformaLcs(2014)30(24):3548-3555.
• N.Nguyen,S.Mirarab,K.Kumar,andT.Warnow,"Ultra-largealignmentsusingphylogenyawareprofiles".ProceedingsRECOMB2015andGenomeBiology(2015)16:124
• N.Nguyen,M.Nute,S.Mirarab,andT.Warnow,HIPPI:HighlyaccurateproteinfamilyclassificaLonwithensemblesofHMMs.BMCGenomics(2016):17(Suppl10):765
Allcodesareavailableinopensourceformatheps://github.com/smirarab/sepp
Overview• Theory:combiningprobabilitytheory,graphtheory,andopLmizaLon
• SimulaLons:evaluaLngmethodsunderstochasLcmodelsofsequenceevoluLon
• Biologicaldataanalysis:refiningmethodsandenablingdiscovery
• OpensourcesoOwaredevelopment• HighperformancecompuLng• ApplicaLonsoutsidebiology(e.g.,historicallinguisLcs,bigdataproblemsingeneral)
Computational Phylogenomics
NP-hardproblemsLargedatasetsComplexstaLsLcalesLmaLonproblems
MetagenomicsProteinstructureandfuncLonpredicLonMedicalforensicsSystemsbiologyPopulaLongeneLcs
FutureWork-Phylogenomics• Beeertheory,addressingimpactofgenetreeesLmaLonerrorandmissingdata
• Fastgenome-scalephylogeneLctreeesLmaLon(highperformancecompuLng,staLsLcally-basedesLmaLontakingmulLplesourcesofdiscordintoaccount)
• PhylogeneLcnetworkconstrucLononlargedatasets(staLsLcalmethodswithindivide-and-conquerframework)
• BeeerstaLsLcalmodelsofsequenceevoluLon,addressingheterotachy
• Co-esLmaLonofgenetreesandspeciestrees/networks
Futurework-Metagenomics
• Improvedmarker-basedanalyses,andaddressinggenetreeheterogeneity
• RigorousmethodsfordetecLngnovelgenesandspecies
• HighthroughputanalysiswithhighsensiLvity• Metagenomeassembly• HPCimplementaLons• CollaboraLonswithbiologistsandbiomedicalresearchers
Futurework–MulLpleSequenceAlignment
• Improvedlarge-scaleMSA(e.g.,PASTAandUPP)• ExtendingstaLsLcalco-esLmaLonoftreesandMSAtolargedatasets(e.g.,NuteandWarnow2016)
• EfficientandusefulsamplingofMSAs• MSAesLmaLoninthepresenceofduplicaLonsandrearrangements(e.g.,wholegenomealignment)
• BeeerHMM+phylogenymodelsthatareusefulforesLmaLngalignmentsandtrees
Futurework-Theory
• Basicalgorithmicchallenges:– supertrees– compuLngtreesfromdistancematrices– usingchordalgraphsfordivide-and-conquer– Consensustrees
• Appliedprobability:– Trade-offbetweendataqualityandquanLty(e.g.,
staLsLcalbinning)– IdenLfiabilityoftreemodelswithnoisydata– UnderstandingensemblesofHMMs