Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al.,...

Post on 12-Jul-2020

2 views 0 download

Transcript of Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al.,...

ChalkTalk

TandyWarnowDepartmentsofComputerScienceand

BioengineeringUniversityofIllinoisatUrbana-Champaign

•  Large-scale statistical phylogeny estimation •  Ultra-large multiple-sequence alignment •  Estimating species trees from incongruent gene trees •  Supertree estimation •  Genome rearrangement phylogeny •  Reticulate evolution •  Visualization of large trees and alignments •  Data mining techniques to explore multiple optima

The Tree of Life: Multiple Challenges

Largedatasets:100,000+sequences10,000+genes“BigData”complexity

Applications areas: •  metagenomics •  protein structure and function prediction •  trait evolution •  detection of co-evolution •  systems biology

The Tree of Life: Multiple Challenges

Largedatasets:100,000+sequences10,000+genes“BigData”complexity

Techniques: •  Graph theory (especially chordal graphs) •  Probability theory and statistics •  Hidden Markov models •  Combinatorial optimization •  Heuristics •  Supercomputing

The Tree of Life: Multiple Challenges

Largedatasets:100,000+sequences10,000+genes“BigData”complexity

Overview•  Theory:combiningprobabilitytheory,graphtheory,andopLmizaLon

•  SimulaLons:evaluaLngmethodsunderstochasLcmodelsofsequenceevoluLon

•  Biologicaldataanalysis:refiningmethodsandenablingdiscovery

•  OpensourcesoOwaredevelopment•  HighperformancecompuLng•  ApplicaLonsoutsidebiology(e.g.,historicallinguisLcs,bigdataproblemsingeneral)

PastWork(highlights)•  GenetreeesLmaLon(theoreLcalresultsunderstochasLcmodelsofsequenceevoluLon)

•  MulLplesequencealignmentonlargedatasets,andco-esLmaLonofalignmentsandtrees

•  PhylogeneLcnetworksandspeciestreesfrommulL-locusdatasets

•  Genomerearrangementphylogeny•  Supertreemethods•  Metagenomics•  HistoricallinguisLcs

Futurework

Theory,methods,andempiricalstudiesfor•  Genome-scalephylogenyesLmaLonaddressingmulLplesourcesforgenetreeheterogeneity

•  Microbiomeanalysis•  Ultra-largemulLplesequencealignmentandtreeesLmaLon

AndapplicaLonsofthesetechniquesoutsidebiology

CurrentNSFgrants

•  Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977

•  Mul*pleSequenceAlignment:NSFABI-1458652

•  Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629

CurrentNSFgrants

•  Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977

•  Mul*pleSequenceAlignment:NSFABI-1458652

•  Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629

MajorAreas•  Phylogenomics:SpeciestreeandnetworkesLmaLonusing

wholegenomes(andgenetreeesLmaLoninthecontextofwholegenomes)

•  Mul*pleSequenceAlignment:InferringrelaLonshipsbetweenleeersinmolecularsequences,especiallyonverylargedatasets(upto1,000,000sequences)

•  Metagenomics:Analysisofmolecularsequencesobtainedfromenvironmentalsamples(jointwithMihaiPopandBillGropp)

•  Scalingcomputa*onallyintensivemethodstolargedatasets:CombiningdiscretemathandstaLsLcalmethodstoenablehighlyaccurateanalysisofultra-largedatasets(jointwithChandraChekuriandSaLshRao)

Phylogenomics = Species trees from whole genomes

“NothinginbiologymakessenseexceptinthelightofevoluLon”-Dobhzansky

phylogenomics

2

gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000gene 1

“gene” here refers to a portion of the genome (not a functional gene)

Orangutan

Gorilla

Chimpanzee

Human

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Gene tree discordance

3

Orang.Gorilla ChimpHuman Orang.Gorilla Chimp Human

gene1000gene 1

IncompleteLineageSorLng(ILS)isadominantcauseofgenetreeheterogeneity

IncompleteLineageSorLng(ILS)

•  ConfoundsphylogeneLcanalysisformanygroups:Hominids,Birds,Yeast,Animals,Toads,Fish,Fungi,etc.

•  ThereissubstanLaldebateabouthowtoanalyzephylogenomicdatasetsinthepresenceofILS,focusedaroundstaLsLcalconsistencyguarantees(theory)andperformanceondata.

. . .

Analyzeseparately

Summary Method

MaincompeLngapproaches gene 1 gene 2 . . . gene k

. . . Concatenation

Species

StaLsLcalConsistency

error

Data

. . .

Analyzeseparately

Summary Method

MaincompeLngapproaches gene 1 gene 2 . . . gene k

. . . Concatenation

Species

Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]

• Optimization Problem (NP-Hard):

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly

8

Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

Set of quartet trees induced by T

a gene tree

Score(T ) =X

t2TQ(T ) \Q(t)

all input gene trees

ConstrainedMQST(MaximumQuartetSupportTree)

•  Input:SetT= {t1,t2,…,tk}ofunrootedgenetrees,witheachtreeonsetSwithnspecies,andsetXofallowedbiparLLons

•  Output:UnrootedtreeTonleafsetS,maximizingthetotalquartettreesimilaritytoT,subjecttoTdrawingitsbiparLLonsfromX.

Theorems(Mirarabetal.,2014):•  IfXcontainsthebiparLLonsfromtheinputgenetrees(andperhaps

others),thenanexactsoluLontothisproblemisstaLsLcallyconsistentundertheMSC.

•  TheconstrainedMQSTproblemcanbesolvedinO(|X|2nk)Lme.(Weusedynamicprogramming,andbuildtheunrootedtreefromtheboeom-up,basedon“allowedclades”–halvesoftheallowedbiparLLons.)

200 Estimated Gene Trees

Data: Fixed, moderate ILS rate, 50 replicates per HGT rates (1)-(6), 1 model species tree per replicate on 51 taxa, 1000 true gene trees,simulated 1000 bp gene sequences using INDELible 8, 1000 gene trees estimated from GTR simulated sequences using FastTree-27

7Price, Dehal, Arkin 20158Fletcher, Yang 2009

12

ASTRALisfairlyrobusttoHGT+ILS

Davidsonetal.,RECOMB-CG,BMCGenomics2015

16

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

1000 genes, “medium” levels of recent ILS

Tree accuracy when varying the number of species

ContribuLons(sample)MethodsforesLmaLngspeciestreesfromgenome-scaledata:

•  ASTRAL(Mirarabetal.,BioinformaLcs2014,2015)andASTRID(VachaspaLandWarnow,BMCGenomics2015):polynomialLmemethodsthatarestaLsLcallyconsistentundertheMSC.Bothcananalyzeverylargedatasets(1000speciesand1000genes–ormore)withhighaccuracy.

•  StaLsLcalbinning(Mirarabetal.,Science2014,Bayzidetal.PLOSOne2015)canreducegenetreeesLmaLonerror,andleadtoimprovedspeciestreeesLmaLons(topology,branchlengths,andincidenceoffalseposiLves)

•  BBCA(Zimmermannetal.,BMCGenomics2014)enablesBayesianco-esLmaLonmethodstoscaletolargenumbersofgenes

•  DCM-boosLng(Bayzidetal.,BMCGenomics2014)enablescomputaLonallyintensivemethodstoscaletolargenumbersofspecies

MathemaLcaltheory:

•  RochandWarnow,SystemaLcBiology2015)regardingstaLsLcalconsistencyundertheMSCgivenfinitelengthsequences.

•  Uricchioetal.,BMCBioinformaLcs2016,numberoflocineededtorecoverallthesplitswithhighprobability

Biologicaldataanalyses:

•  Avianphylogenomicsproject(Jarvis,Mirarabetal.,Science2014)

•  ThousandPlantTranscriptomeProject(Wickee,Mirarabetal.PNAS2014)

•  Tarveretal.GenomeBiologyandEvoluLon2016,Mammalianphylogeny

CurrentNSFgrants

•  Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977

•  Mul*pleSequenceAlignment:NSFABI-1458652

•  Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629

CurrentNSFgrants

•  Graph-theore*cmethodstoimprovephylogenomicanalyses(jointwithChandraChekuriandSaLshRao)–NSFCCF-1535977

•  Mul*pleSequenceAlignment:NSFABI-1458652

•  Metagenomics:jointwithMihaiPopandBillGropp.NSFgrantIII:AF:1513629

Metagenomictaxonomiciden*fica*onandphylogene*cprofiling

Metagenomics,Venteretal.,ExploringtheSargassoSea:Scien*stsDiscoverOneMillionNewGenesinOceanMicrobes

1. Whatisthisfragment?(Classifyeachfragmentaswellaspossible.)

2.WhatisthetaxonomicdistribuLoninthedataset?(Note:helpfultousemarkergenes.)

3.Whataretheorganismsinthismetagenomicsampledoingtogether?

BasicQuesLons

This talk

•  SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs)

•  Applications of the eHMM technique to metagenomic abundance classification (TIPP, Bioinformatics 2014)

PhylogeneLcPlacement

Input:Backbonealignmentandtreeonfull-lengthsequences,andasetofhomologousquerysequences(e.g.,readsinametagenomicsampleforthesamegene)

Output:Placementofquerysequencesonbackbonetree

PhylogeneLcplacementcanbeusedinsideapipeline,aOerdeterminingthegenesforeachofthereadsinthemetagenomicsample.

Marker-based Taxon Identification

ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT

ACCGCGAGCGGGGCTTAGAGGGGGTCGAGGGCGGGG• .• .• .ACCT

Fragmentarysequencesfromsomegene

Full-lengthsequencesforsamegene,andanalignmentandatree

AlignSequence

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC

AlignSequence

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

PlaceSequence

S1

S4

S2

S3Q1

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

PhylogeneLcPlacement•  Aligneachquerysequencetobackbonealignment

–  HMMALIGN(Eddy,BioinformaLcs1998)–  PaPaRa(BergerandStamatakis,BioinformaLcs2011)

•  Placeeachquerysequenceintobackbonetree–  Pplacer(Matsenetal.,BMCBioinformaLcs,2011)–  EPA(BergerandStamatakis,SystemaLcBiology2011)

Note:pplacerandEPAusemaximumlikelihood

HMMERvs.PaPaRaAlignments

Increasing rate of evolution

0.0

One Hidden Markov Model for the entire alignment?

Or2HMMs?

HMM1

HMM2

HMM1

HMM3 HMM4

HMM2

Or4HMMs?

SEPPParameterExploraLon

§  Alignmentsubsetsizeandplacementsubsetsizeimpacttheaccuracy,runningLme,andmemoryofSEPP

§  10%rule(subsetsizes10%ofbackbone)hadbestoverallperformance

SEPP(10%-rule)onsimulateddata

0.0

0.0

Increasing rate of evolution

Marker-based Taxon Identification

ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT

ACCGCGAGCGGGGCTTAGAGGGGGTCGAGGGCGGGG• .• .• .ACCT

Fragmentarysequencesfromsomegene

Full-lengthsequencesforsamegene,andanalignmentandatree

TIPP (https://github.com/smirarab/sepp)

TIPP (Nguyen, Mirarb, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes

TIPP pipeline 1.  Uses BLAST to assign reads to marker genes 2.  Computes UPP/PASTA reference alignments 3.  Uses reference taxonomies, refined to binary trees using reference

alignment 4.  Modifies SEPP by considering statistical uncertainty in the

extended alignment and placement within the tree

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

For example: The distribution of the sample at the species-level is:

50% species A

20% species B

15% species C

14% species D

1% species E

Abundance Profiling

Highindeldatasetscontainingknowngenomes

Note:NBC,MetaPhlAn,andMetaPhylercannotclassifyanysequencesfromatleastoneofthehighindellongsequencedatasets,andmOTUterminateswithanerrormessageonallthehighindeldatasets.

“Novel”genomedatasets

Note:mOTUterminateswithanerrormessageonthelongfragmentdatasetsandhighindeldatasets.

TIPPvs.otherabundanceprofilers

•  TIPPishighlyaccurate,eveninthepresenceofhighindelratesandnovelgenomes,andforbothshortandlongreads.

•  Allothermethodshavesomevulnerability(e.g.,mOTUisonlyaccurateforshortreadsandisimpactedbyhighindelrates).

•  ImprovedaccuracyisduetotheuseofeHMMs;singleHMMsdonotprovidethesameadvantages,especiallyinthepresenceofhighindelrates.

SEPPandeHMMs

AnensembleofHMMsprovidesabeeermodelofamulLplesequencealignmentthanasingleHMM,andisbeeerableto•  detecthomologybetweenfulllengthsequencesandfragmentarysequences

•  addfragmentarysequencesintoanexisLngalignment

especiallywhentherearemanyindelsand/orsubsLtuLons.

OurPublicaLonsusingeHMMs•  S.Mirarab,N.Nguyen,andT.Warnow."SEPP:SATé-EnabledPhylogeneLc

Placement."Proceedingsofthe2012PacificSymposiumonBiocompuLng(PSB2012)17:247-258.

•  N.Nguyen,S.Mirarab,B.Liu,M.Pop,andT.Warnow"TIPP:TaxonomicIdenLficaLonandPhylogeneLcProfiling."BioinformaLcs(2014)30(24):3548-3555.

•  N.Nguyen,S.Mirarab,K.Kumar,andT.Warnow,"Ultra-largealignmentsusingphylogenyawareprofiles".ProceedingsRECOMB2015andGenomeBiology(2015)16:124

•  N.Nguyen,M.Nute,S.Mirarab,andT.Warnow,HIPPI:HighlyaccurateproteinfamilyclassificaLonwithensemblesofHMMs.BMCGenomics(2016):17(Suppl10):765

Allcodesareavailableinopensourceformatheps://github.com/smirarab/sepp

Overview•  Theory:combiningprobabilitytheory,graphtheory,andopLmizaLon

•  SimulaLons:evaluaLngmethodsunderstochasLcmodelsofsequenceevoluLon

•  Biologicaldataanalysis:refiningmethodsandenablingdiscovery

•  OpensourcesoOwaredevelopment•  HighperformancecompuLng•  ApplicaLonsoutsidebiology(e.g.,historicallinguisLcs,bigdataproblemsingeneral)

Computational Phylogenomics

NP-hardproblemsLargedatasetsComplexstaLsLcalesLmaLonproblems

MetagenomicsProteinstructureandfuncLonpredicLonMedicalforensicsSystemsbiologyPopulaLongeneLcs

FutureWork-Phylogenomics•  Beeertheory,addressingimpactofgenetreeesLmaLonerrorandmissingdata

•  Fastgenome-scalephylogeneLctreeesLmaLon(highperformancecompuLng,staLsLcally-basedesLmaLontakingmulLplesourcesofdiscordintoaccount)

•  PhylogeneLcnetworkconstrucLononlargedatasets(staLsLcalmethodswithindivide-and-conquerframework)

•  BeeerstaLsLcalmodelsofsequenceevoluLon,addressingheterotachy

•  Co-esLmaLonofgenetreesandspeciestrees/networks

Futurework-Metagenomics

•  Improvedmarker-basedanalyses,andaddressinggenetreeheterogeneity

•  RigorousmethodsfordetecLngnovelgenesandspecies

•  HighthroughputanalysiswithhighsensiLvity•  Metagenomeassembly•  HPCimplementaLons•  CollaboraLonswithbiologistsandbiomedicalresearchers

Futurework–MulLpleSequenceAlignment

•  Improvedlarge-scaleMSA(e.g.,PASTAandUPP)•  ExtendingstaLsLcalco-esLmaLonoftreesandMSAtolargedatasets(e.g.,NuteandWarnow2016)

•  EfficientandusefulsamplingofMSAs•  MSAesLmaLoninthepresenceofduplicaLonsandrearrangements(e.g.,wholegenomealignment)

•  BeeerHMM+phylogenymodelsthatareusefulforesLmaLngalignmentsandtrees

Futurework-Theory

•  Basicalgorithmicchallenges:–  supertrees–  compuLngtreesfromdistancematrices–  usingchordalgraphsfordivide-and-conquer–  Consensustrees

•  Appliedprobability:–  Trade-offbetweendataqualityandquanLty(e.g.,

staLsLcalbinning)–  IdenLfiabilityoftreemodelswithnoisydata–  UnderstandingensemblesofHMMs