Chalk Talk - Tandy Warnow (tandy.cs.illinois.edu/chalk-talk-cmu.pdf)

Chalk Talk

Tandy Warnow
Departments of Computer Science and Bioengineering
University of Illinois at Urbana-Champaign

•  Large-scale statistical phylogeny estimation
•  Ultra-large multiple-sequence alignment
•  Estimating species trees from incongruent gene trees
•  Supertree estimation
•  Genome rearrangement phylogeny
•  Reticulate evolution
•  Visualization of large trees and alignments
•  Data mining techniques to explore multiple optima

The Tree of Life: Multiple Challenges

Large datasets: 100,000+ sequences, 10,000+ genes, “Big Data” complexity

Application areas:
•  metagenomics
•  protein structure and function prediction
•  trait evolution
•  detection of co-evolution
•  systems biology

Techniques:
•  Graph theory (especially chordal graphs)
•  Probability theory and statistics
•  Hidden Markov models
•  Combinatorial optimization
•  Heuristics
•  Supercomputing

Overview

•  Theory: combining probability theory, graph theory, and optimization
•  Simulations: evaluating methods under stochastic models of sequence evolution
•  Biological data analysis: refining methods and enabling discovery
•  Open source software development
•  High performance computing
•  Applications outside biology (e.g., historical linguistics, big data problems in general)

Past Work (highlights)

•  Gene tree estimation (theoretical results under stochastic models of sequence evolution)
•  Multiple sequence alignment on large datasets, and co-estimation of alignments and trees
•  Phylogenetic networks and species trees from multi-locus datasets
•  Genome rearrangement phylogeny
•  Supertree methods
•  Metagenomics
•  Historical linguistics

Future work

Theory, methods, and empirical studies for
•  Genome-scale phylogeny estimation addressing multiple sources for gene tree heterogeneity
•  Microbiome analysis
•  Ultra-large multiple sequence alignment and tree estimation

And applications of these techniques outside biology

Current NSF grants

•  Graph-theoretic methods to improve phylogenomic analyses (joint with Chandra Chekuri and Satish Rao) – NSF CCF-1535977
•  Multiple Sequence Alignment: NSF ABI-1458652
•  Metagenomics: joint with Mihai Pop and Bill Gropp. NSF grant III:AF:1513629

Current NSF grants

Major Areas
•  Phylogenomics: Species tree and network estimation using whole genomes (and gene tree estimation in the context of whole genomes)
•  Multiple Sequence Alignment: Inferring relationships between letters in molecular sequences, especially on very large datasets (up to 1,000,000 sequences)
•  Metagenomics: Analysis of molecular sequences obtained from environmental samples (joint with Mihai Pop and Bill Gropp)
•  Scaling computationally intensive methods to large datasets: Combining discrete math and statistical methods to enable highly accurate analysis of ultra-large datasets (joint with Chandra Chekuri and Satish Rao)

Phylogenomics = Species trees from whole genomes

“Nothing in biology makes sense except in the light of evolution” - Dobzhansky

[Figure: phylogenomics. Example alignments for gene 1, gene 2, ..., gene 999, gene 1000, four sequences each:]

ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG

CTGAGCATCG
CTGAGC-TCG
ATGAGC-TC-
CTGA-CAC-G

AGCAGCATCGTG
AGCAGC-TCGTG
AGCAGC-TC-TG
C-TA-CACGGTG

CAGGCACGCACGAA
AGC-CACGC-CATA
ATGGCACGC-C-TA
AGCTAC-CACGGAT

“gene” here refers to a portion of the genome (not a functional gene)

[Figure labels: Orangutan, Gorilla, Chimpanzee]

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Gene tree discordance

[Figure: two gene trees on Orangutan, Gorilla, Chimpanzee, and Human, one for gene 1 and one for gene 1000]

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Incomplete Lineage Sorting (ILS)

•  Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.
•  There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Main competing approaches

[Figure labels: gene 1, gene 2, ..., gene k; Analyze separately; Summary Method; Concatenation; Species]

Statistical Consistency


Maximum Quartet Support Species Tree [Mirarab et al., ECCB, 2014]

• Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees:

$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} \left| Q(T) \cap Q(t) \right|$$

where $\mathcal{T}$ is the set of all input gene trees, $t$ is a gene tree, and $Q(T)$ is the set of quartet trees induced by $T$.

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
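As a rough illustration of the quartet support objective, the following Python sketch scores a candidate species tree by brute-force quartet enumeration. The nested-tuple tree encoding and the names `clades`, `quartets`, and `quartet_score` are assumptions made for this example and are not taken from ASTRAL; the enumeration visits every 4-subset of leaves, so it is only suitable for tiny inputs.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of leaf sets below each internal node).

    A tree is a leaf label or a tuple of subtrees (hypothetical encoding).
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartets(tree):
    """Set of resolved quartet trees ab|cd induced by the (unrooted) tree."""
    leaves, found = clades(tree)
    result = set()
    for four in combinations(sorted(leaves), 4):
        q = frozenset(four)
        for clade in found:
            side = clade & q
            # An edge whose one side holds exactly two of the four leaves
            # separates them from the other two.
            if len(side) == 2:
                result.add(frozenset([side, q - side]))
                break
    return result

def quartet_score(candidate, gene_trees):
    """Sum over gene trees t of |Q(candidate) ∩ Q(t)|."""
    q_candidate = quartets(candidate)
    return sum(len(q_candidate & quartets(t)) for t in gene_trees)

candidate = (("a", "b"), ("c", "d"))
gene_trees = [
    (("a", "b"), ("c", "d")),      # agrees with the candidate
    (("a", "c"), ("b", "d")),      # conflicts
    ((("a", "b"), "c"), "d"),      # same unrooted quartet ab|cd
]
print(quartet_score(candidate, gene_trees))  # → 2
```

Because the tree is treated as unrooted, the position of the root in the nested tuple does not change the quartets that are found.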

Constrained MQST (Maximum Quartet Support Tree)

•  Input: Set $\mathcal{T}$ = {t1, t2, …, tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions
•  Output: Unrooted tree T on leaf set S, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to T drawing its bipartitions from X.

Theorems (Mirarab et al., 2014):
•  If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.
•  The constrained MQST problem can be solved in $O(|X|^2 nk)$ time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades” – halves of the allowed bipartitions.)
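One way to picture the constraint set is to collect X directly from the input gene trees and then list the "allowed clades" as the halves of those bipartitions. The sketch below does only that; it does not implement the dynamic program. The nested-tuple tree encoding and the names `leaf_sets`, `bipartitions`, `allowed_bipartitions`, and `allowed_clades` are assumptions for illustration, and only non-trivial bipartitions (at least two leaves on each side) are kept.

```python
def leaf_sets(tree):
    """Return (all leaves, list of leaf sets below each internal node)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = leaf_sets(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def bipartitions(tree):
    """Non-trivial bipartitions of an unrooted tree, each a pair of halves."""
    leaves, found = leaf_sets(tree)
    result = set()
    for clade in found:
        other = leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            result.add(frozenset([clade, other]))
    return result

def allowed_bipartitions(gene_trees):
    """X = union of the bipartitions of all input gene trees."""
    allowed = set()
    for t in gene_trees:
        allowed |= bipartitions(t)
    return allowed

def allowed_clades(allowed):
    """Both halves of every allowed bipartition."""
    return {half for bip in allowed for half in bip}

g1 = (("a", "b"), ("c", "d"), "e")
g2 = (("a", "c"), ("b", "d"), "e")
X = allowed_bipartitions([g1, g2])
for clade in sorted(allowed_clades(X), key=lambda c: (len(c), sorted(c))):
    print(sorted(clade))
```

Storing each bipartition as an unordered pair of halves means the same split found in two gene trees is counted once, so |X| grows only with the number of distinct splits.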

ASTRAL is fairly robust to HGT+ILS

[Figure: 200 Estimated Gene Trees]

Data: Fixed, moderate ILS rate; 50 replicates per HGT rate (1)-(6); 1 model species tree per replicate on 51 taxa; 1000 true gene trees; 1000 bp gene sequences simulated using INDELible [8]; 1000 gene trees estimated from GTR simulated sequences using FastTree-2 [7]

[7] Price, Dehal, Arkin 2015
[8] Fletcher, Yang 2009

Davidson et al., RECOMB-CG, BMC Genomics 2015

Tree accuracy when varying the number of species

[Figure: two panels comparing ASTRAL-II and MP-EST; x-axis: number of species (10, 50, 100, 200, 500, 1000); 1000 genes, “medium” levels of recent ILS]

Contributions (sample)

Methods for estimating species trees from genome-scale data:
•  ASTRAL (Mirarab et al., Bioinformatics 2014, 2015) and ASTRID (Vachaspati and Warnow, BMC Genomics 2015): polynomial time methods that are statistically consistent under the MSC. Both can analyze very large datasets (1000 species and 1000 genes – or more) with high accuracy.
•  Statistical binning (Mirarab et al., Science 2014, Bayzid et al. PLOS One 2015) can reduce gene tree estimation error, and lead to improved species tree estimations (topology, branch lengths, and incidence of false positives)
•  BBCA (Zimmermann et al., BMC Genomics 2014) enables Bayesian co-estimation methods to scale to large numbers of genes
•  DCM-boosting (Bayzid et al., BMC Genomics 2014) enables computationally intensive methods to scale to large numbers of species

Mathematical theory:
•  Roch and Warnow, Systematic Biology 2015, regarding statistical consistency under the MSC given finite length sequences.
•  Uricchio et al., BMC Bioinformatics 2016, number of loci needed to recover all the splits with high probability

Biological data analyses:
•  Avian phylogenomics project (Jarvis, Mirarab et al., Science 2014)
•  Thousand Plant Transcriptome Project (Wickett, Mirarab et al. PNAS 2014)
•  Tarver et al. Genome Biology and Evolution 2016, Mammalian phylogeny

Current NSF grants

Metagenomic taxonomic identification and phylogenetic profiling

Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Basic Questions

1. What is this fragment? (Classify each fragment as well as possible.)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?

This talk

•  SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs)

•  Applications of the eHMM technique to metagenomic abundance classification (TIPP, Bioinformatics 2014)

Phylogenetic Placement

Input: Backbone alignment and tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene)

Output: Placement of query sequences on backbone tree

Phylogenetic placement can be used inside a pipeline, after determining the genes for each of the reads in the metagenomic sample.

Marker-based Taxon Identification

Fragmentary sequences from some gene:
ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT

Full-length sequences for same gene, and an alignment and a tree:
ACCGCGAGCGGGGCTTAGAGGGGGTCGAGGGCGGGG ... ACCT

Align Sequence

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC

Align Sequence

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
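The aligned row for Q1 can be sanity-checked mechanically: it should have the same width as the backbone rows, and stripping its gaps should give back the original query. The Python sketch below performs only that consistency check; it does not compute an alignment, and the function names `ungap` and `is_valid_extension` are hypothetical.

```python
def ungap(row, gap="-"):
    """Remove gap characters from an aligned row."""
    return row.replace(gap, "")

def is_valid_extension(backbone, query, aligned_query, gap="-"):
    """True if aligned_query fits the backbone width and spells out query."""
    widths = {len(row) for row in backbone.values()}
    return (
        len(widths) == 1                       # backbone rows share one width
        and len(aligned_query) in widths       # query row has that width too
        and ungap(aligned_query, gap) == query # no residues added or lost
    )

# Rows copied from the slide.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
print(is_valid_extension(backbone, "TAAAAC", "-------T-A--AAAC--------"))
```

A check like this catches truncated or padded rows, but it says nothing about whether the query was placed in the right columns.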

Place Sequence

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Phylogenetic Placement

•  Align each query sequence to backbone alignment
   –  HMMALIGN (Eddy, Bioinformatics 1998)
   –  PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
•  Place each query sequence into backbone tree
   –  Pplacer (Matsen et al., BMC Bioinformatics, 2011)
   –  EPA (Berger and Stamatakis, Systematic Biology 2011)

Note: pplacer and EPA use maximum likelihood

HMMER vs. PaPaRa Alignments

[Figure: x-axis: increasing rate of evolution]

One Hidden Markov Model for the entire alignment?

Or 2 HMMs?

Or 4 HMMs?

[Figure labels: HMM3, HMM4]

SEPP Parameter Exploration

•  Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP
•  10% rule (subset sizes 10% of backbone) had best overall performance
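To make the 10% rule concrete, the sketch below caps each subset at 10% of the backbone size and keeps splitting until every subset fits, one subset per HMM. The slides do not spell out how subsets are formed, so the simple recursive halving of an ordered list of backbone sequence names used here is an assumption standing in for whatever decomposition SEPP actually applies; the names `decompose` and `ten_percent_subsets` are hypothetical.

```python
def decompose(names, max_size):
    """Recursively halve a list until every part has at most max_size items."""
    if len(names) <= max_size:
        return [names]
    mid = len(names) // 2
    return decompose(names[:mid], max_size) + decompose(names[mid:], max_size)

def ten_percent_subsets(names):
    """Subsets whose size is capped at 10% of the backbone (at least 1)."""
    max_size = max(1, len(names) // 10)
    return decompose(names, max_size)

backbone_names = [f"seq{i}" for i in range(100)]
subsets = ten_percent_subsets(backbone_names)
print(len(subsets), max(len(s) for s in subsets))
```

Halving overshoots the cap in the sense that subsets may end up well below 10% of the backbone; a smaller cap means more, narrower HMMs, which is the accuracy versus running time and memory trade-off noted above.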

SEPP (10%-rule) on simulated data

[Figure: x-axis: increasing rate of evolution]


TIPP (https://github.com/smirarab/sepp)

TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014) is a marker-based method that only characterizes those reads that map to MetaPhyler's marker genes.

TIPP pipeline
1.  Uses BLAST to assign reads to marker genes
2.  Computes UPP/PASTA reference alignments
3.  Uses reference taxonomies, refined to binary trees using reference alignment
4.  Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

For example: The distribution of the sample at the species-level is:

50% species A

20% species B

15% species C

14% species D

1% species E
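A profile like the one above is just the relative frequency of each taxon among the classified reads. The sketch below assumes each read has already been assigned a species label (the classification step itself is not shown) and uses the hypothetical function name `abundance_profile`.

```python
from collections import Counter

def abundance_profile(assignments):
    """Map each taxon to its percentage among the classified reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders taxa from most to least abundant.
    return {taxon: 100.0 * n / total for taxon, n in counts.most_common()}

# Toy input reproducing the example distribution: 100 classified reads.
reads = (
    ["species A"] * 50
    + ["species B"] * 20
    + ["species C"] * 15
    + ["species D"] * 14
    + ["species E"] * 1
)
for taxon, pct in abundance_profile(reads).items():
    print(f"{pct:g}% {taxon}")
```

The same function works at any rank (genus, family, and so on) as long as the labels passed in are taken at that rank.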

Abundance Profiling

High indel datasets containing known genomes

Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets

Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP vs. other abundance profilers

•  TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.
•  All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).
•  Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.

SEPP and eHMMs

An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to
•  detect homology between full length sequences and fragmentary sequences
•  add fragmentary sequences into an existing alignment

especially when there are many indels and/or substitutions.

Our Publications using eHMMs

•  S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258.
•  N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555.
•  N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. "Ultra-large alignments using phylogeny aware profiles." Proceedings RECOMB 2015 and Genome Biology (2015) 16:124.
•  N. Nguyen, M. Nute, S. Mirarab, and T. Warnow. "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics (2016) 17(Suppl 10):765.

All codes are available in open source form at https://github.com/smirarab/sepp


Computational Phylogenomics

•  NP-hard problems
•  Large datasets
•  Complex statistical estimation problems

•  Metagenomics
•  Protein structure and function prediction
•  Medical forensics
•  Systems biology
•  Population genetics

Future Work - Phylogenomics

•  Better theory, addressing impact of gene tree estimation error and missing data
•  Fast genome-scale phylogenetic tree estimation (high performance computing, statistically-based estimation taking multiple sources of discord into account)
•  Phylogenetic network construction on large datasets (statistical methods within divide-and-conquer framework)
•  Better statistical models of sequence evolution, addressing heterotachy
•  Co-estimation of gene trees and species trees/networks

Future work - Metagenomics

•  Improved marker-based analyses, and addressing gene tree heterogeneity
•  Rigorous methods for detecting novel genes and species
•  High throughput analysis with high sensitivity
•  Metagenome assembly
•  HPC implementations
•  Collaborations with biologists and biomedical researchers

Future work – Multiple Sequence Alignment

•  Improved large-scale MSA (e.g., PASTA and UPP)
•  Extending statistical co-estimation of trees and MSA to large datasets (e.g., Nute and Warnow 2016)
•  Efficient and useful sampling of MSAs
•  MSA estimation in the presence of duplications and rearrangements (e.g., whole genome alignment)
•  Better HMM+phylogeny models that are useful for estimating alignments and trees

Future work - Theory

•  Basic algorithmic challenges:
   –  supertrees
   –  computing trees from distance matrices
   –  using chordal graphs for divide-and-conquer
   –  consensus trees
•  Applied probability:
   –  Trade-off between data quality and quantity (e.g., statistical binning)
   –  Identifiability of tree models with noisy data
   –  Understanding ensembles of HMMs
