Ultra-large Multiple Sequence Alignment (tandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf)
![Page 1: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/1.jpg)
Ultra-large Multiple Sequence Alignment
Tandy Warnow, Founder Professor of Engineering
The University of Illinois at Urbana-Champaign
http://tandy.cs.illinois.edu
![Page 2: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/2.jpg)
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Phylogeny (evolutionary tree)
![Page 3: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/3.jpg)
Phylogenies and Applications
Basic Biology: How did life evolve?
Applications of phylogenies to:
• protein structure and function
• population genetics
• human migrations
• metagenomics
![Page 4: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/4.jpg)
![Page 5: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/5.jpg)
Computational Phylogenetics and Metagenomics
Courtesy of the Tree of Life project
![Page 6: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/6.jpg)
Hard Computational Problems
NP-hard problems
Large datasets:
• 100,000+ sequences
• thousands of genes
“Big data” complexity:
• model misspecification
• fragmentary sequences
• errors in input data
• streaming data
![Page 7: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/7.jpg)
Warnow Research
Goal: improve accuracy, speed, robustness, or mathematical guarantees of computational methods, to enable highly accurate analyses of real datasets
Techniques: divide-and-conquer, iteration, chordal graph theory, and probability theory
Evaluation: synthetic and real data; collaborations with biologists and linguists
Examples:
• Historical linguistics, 1994-present
• Absolute fast converging methods, 1997-2002
• Phylogenetic networks, 2003-2005
• Genome rearrangements, 2000-2006
• Multiple sequence alignment, 2009-present (many papers, including SATé-1 (Science), SATé-2 (Syst Biol), PASTA (RECOMB and J Comp Biol), and UPP (Genome Biology))
• Supertree methods, 2009-present
• Metagenomic analysis, 2014-present
• Coalescent-based species tree estimation (2011-present, including Science 2014a, Science 2014b, PNAS 2014)
![Page 10: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/10.jpg)
DNA Sequence Evolution
[Figure: sequences evolving down a tree over time]
-3 mil yrs: AAGACTT
-2 mil yrs: TGGACTT, AAGGCCT
-1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT
today: TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT
![Page 11: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/11.jpg)
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT
U V W X Y
[Figure: a tree on the leaves U, V, W, X, Y]
![Page 12: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/12.jpg)
Markov Model of Site Evolution
Simplest (Jukes-Cantor, 1969):
• The model tree T is binary and has substitution probabilities p(e) on each edge e.
• The state at the root is randomly drawn from {A,C,T,G} (nucleotides).
• If a site (position) changes on an edge, it changes with equal probability to each of the remaining states.
• The evolutionary process is Markovian.
More complex models (such as the General Time Reversible model, or the General Markov model) are also considered, often with little change to the theory.
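The Jukes-Cantor bullets can be read as a small simulation. The sketch below is only an illustration under stated assumptions: the `children` dictionary, the node names, and the p(e) values are made up for the example and do not come from the talk, and the root state is drawn uniformly at random.

```python
import random

NUCLEOTIDES = "ACGT"


def evolve_site(state, p_change, rng):
    """With probability p_change, replace state by one of the three other
    nucleotides, chosen uniformly; otherwise keep it."""
    if rng.random() < p_change:
        return rng.choice([x for x in NUCLEOTIDES if x != state])
    return state


def simulate_jc(children, root, num_sites, rng):
    """Evolve num_sites independent sites down a rooted tree.

    children maps each internal node to a list of (child, p_e) pairs, where
    p_e is the substitution probability on the edge to that child.
    Returns a dict with a sequence for every node, internal or leaf.
    """
    seqs = {root: [rng.choice(NUCLEOTIDES) for _ in range(num_sites)]}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, p_edge in children.get(node, []):
            # Markov property: the child depends only on its parent's state.
            seqs[child] = [evolve_site(s, p_edge, rng) for s in seqs[node]]
            stack.append(child)
    return {name: "".join(sites) for name, sites in seqs.items()}


# Hypothetical four-leaf model tree; node names and p(e) values are made up.
toy_tree = {
    "root": [("x", 0.1), ("y", 0.1)],
    "x": [("U", 0.2), ("V", 0.05)],
    "y": [("W", 0.2), ("X", 0.05)],
}
sequences = simulate_jc(toy_tree, "root", 7, random.Random(1))
```

Only the leaf sequences (`U`, `V`, `W`, `X` here) would be observed in practice; the internal sequences are returned to make the process easy to inspect.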
![Page 13: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/13.jpg)
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: true and estimated trees with FN and FP edges marked; 50% error rate]
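One common way to compute these rates is to represent each internal edge by the bipartition of the leaf set it induces. The sketch below assumes that representation and normalizes FN by the number of internal edges in the true tree and FP by the number in the estimated tree; the two five-leaf trees are made up for illustration and are not the trees drawn on the slide.

```python
def canonical_split(side, leaves):
    """Represent a bipartition by the side that omits a fixed reference leaf."""
    side = frozenset(side)
    reference = min(leaves)
    return frozenset(leaves) - side if reference in side else side


def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) for two non-empty sets of internal edges."""
    fn = len(true_splits - est_splits) / len(true_splits)
    fp = len(est_splits - true_splits) / len(est_splits)
    return fn, fp


leaves = {"U", "V", "W", "X", "Y"}
# Made-up trees: the true tree is ((U,V),W,(X,Y)); the estimate is ((U,V),X,(W,Y)).
true_splits = {canonical_split(s, leaves) for s in [{"U", "V"}, {"X", "Y"}]}
est_splits = {canonical_split(s, leaves) for s in [{"U", "V"}, {"W", "Y"}]}
fn_rate, fp_rate = error_rates(true_splits, est_splits)
```

In this toy case one of the two true internal edges is missing from the estimate and one of the two estimated edges is wrong, so both rates are one half.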
![Page 14: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/14.jpg)
Statistical Consistency
[Figure: plot with axes labeled "error" and "Data"]
Maximum likelihood is statistically consistent under standard models (e.g., GTR)
![Page 15: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/15.jpg)
Mathematical Questions
• Is the model tree identifiable?
• Which estimation methods are statistically consistent under this model?
• How much data does the method need to estimate the model tree correctly (with high probability)?
• What is the impact of model misspecification?
• What is the computational complexity of an estimation problem?
![Page 16: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/16.jpg)
The Classical Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT
U V W X Y
[Figure: a tree on the leaves U, V, W, X, Y]
![Page 17: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/17.jpg)
Much is known about this problem from a mathematical and empirical viewpoint
![Page 18: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/18.jpg)
AGAT TAGACTT TGCACAA TGCGCTT AGGGCATGA
U V W X Y
[Figure: a tree on the leaves U, V, W, X, Y]
However…
![Page 19: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/19.jpg)
…ACGGTGCAGTTACCA…
Mutation Deletion
…ACCAGTCACCA…
Indels (insertions and deletions)
![Page 20: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/20.jpg)
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
– Reflects historical substitution, insertion, and deletion events
– Defined using transitive closure of pairwise alignments computed on edges of the true tree
…ACGGTGCAGTTACCA…
Substitution Deletion
…ACCAGTCACCTA…
Insertion
![Page 21: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/21.jpg)
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
![Page 22: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/22.jpg)
Phase 1: Alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
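A multiple sequence alignment only inserts gaps: every row has the same width, and deleting the gap characters from a row gives back the input sequence. The helper names below are hypothetical; the data are the four sequences and the alignment shown on this slide.

```python
def degap(row):
    """Remove gap characters from one alignment row."""
    return row.replace("-", "")


def is_alignment_of(alignment, unaligned):
    """True if all rows have equal width and each row degaps to its input."""
    same_width = len({len(row) for row in alignment.values()}) == 1
    same_names = alignment.keys() == unaligned.keys()
    return same_width and same_names and all(
        degap(alignment[name]) == unaligned[name] for name in unaligned
    )


unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
valid = is_alignment_of(aligned, unaligned)
```

This checks only that the alignment is well formed; it says nothing about whether the columns reflect true homology.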
![Page 23: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/23.jpg)
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

[Figure: tree with leaves S1, S4, S2, S3]
![Page 24: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/24.jpg)
Phylogenomic pipeline
• Select taxon set and markers
• Gather and screen sequence data, possibly identify orthologs
• Compute multiple sequence alignments for each locus, and construct gene trees
• Compute species tree or network:
– Combine the estimated gene trees, OR
– Estimate a tree from a concatenation of the multiple sequence alignments
• Get statistical support on each branch (e.g., bootstrapping)
• Estimate dates on the nodes of the phylogeny
• Use species tree with branch support and dates to understand biology
Coalescent-based species tree estimation!
![Page 27: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/27.jpg)
Large-scale Alignment Estimation
• Many genes are considered unalignable due to high rates of evolution
• Only a few methods can analyze large datasets
• iPlant (NSF Plant Biology Collaborative) and other projects planning to construct phylogenies with 500,000 taxa
![Page 28: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/28.jpg)
1kp: Thousand Transcriptome Project
• First study (Wickett, Mirarab, et al., PNAS 2014) had ~100 species and ~800 genes, gene trees and alignments estimated using SATé, and a coalescent-based species tree estimated using ASTRAL
• Second study: Plant Tree of Life based on transcriptomes of ~1200 species, and more than 13,000 gene families (most not single copy)
Gene Tree Incongruence
G. Ka-Shu Wong (U Alberta)
N. Wickett (Northwestern)
J. Leebens-Mack (U Georgia)
N. Matasci (iPlant)
T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin)
Plus many many other people…
Challenges:
• Species tree estimation from conflicting gene trees
• Gene tree estimation of datasets with > 100,000 sequences
![Page 29: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/29.jpg)
Multiple Sequence Alignment (MSA): a scientific grand challenge¹
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
¹ Frontiers in Massive Data Analysis, National Academies Press, 2013
![Page 30: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/30.jpg)
This talk
• “Big data” multiple sequence alignment
• SATé (Science 2009, Systematic Biology 2012) and PASTA (RECOMB and J Comp Biol 2015), methods for co-estimation of alignments and trees
• UPP (Genome Biology 2015): ultra-large multiple sequence alignment, using the “Ensemble of HMMs technique”.
• Other applications of the eHMM technique:
– phylogenetic placement (SEPP, PSB 2012)
– metagenomic taxon identification (TIPP, Bioinformatics 2014)
– protein structure and function classification
– gene binning
![Page 31: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/31.jpg)
Multiple Sequence Alignment
![Page 32: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/32.jpg)
First Align, then Compute the Tree
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

[Figure: tree with leaves S1, S4, S2, S3]
![Page 33: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/33.jpg)
Co-estimation would be much better!!!
![Page 34: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/34.jpg)
Simulation Studies
[Figure: flow diagram with the labels "True tree and alignment", "Unaligned Sequences", "Estimated tree and alignment", and "Compare"; two four-leaf trees are drawn, one with leaves S1 S2 / S3 S4 and one with leaves S1 S4 / S3 S2]

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
![Page 35: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/35.jpg)
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: true and estimated trees with FN and FP edges marked; 50% error rate]
![Page 36: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/36.jpg)
Two-phase estimation
Alignment methods
• Clustal
• POY (and POY*)
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.
RAxML: heuristic for large-scale ML optimization
![Page 37: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/37.jpg)
1000-taxon models, ordered by difficulty (Liu et al., 2009)
![Page 38: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/38.jpg)
SATé “Family” of methods
• Iterative divide-and-conquer methods
– Each iteration re-aligns the sequences using the current tree, running preferred MSA methods on small local subsets, and merging subset alignments
– Each iteration computes an ML tree on the current alignment, under the GTR (Generalized Time Reversible) Markov model of evolution
• Note: these methods are “MSA boosters”, designed to improve accuracy and/or scalability of the base method
• We show results using MAFFT-l-ins-i to align subsets
![Page 39: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/39.jpg)
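Re-aligning on a tree
[Figure: a cycle of four steps on a tree whose leaves fall into subsets A, B, C, D]
• Decompose dataset (into subsets A, B, C, D)
• Align subsets
• Merge sub-alignments (into ABCD)
• Estimate ML tree on merged alignment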
![Page 40: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/40.jpg)
SATé and PASTA Algorithms
[Figure: loop between "Tree" and "Alignment"]
• Obtain initial alignment and estimated ML tree
• Use tree to compute new alignment
• Estimate ML tree on new alignment
Repeat until termination condition, and return the alignment/tree pair with the best ML score
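The loop on this slide can be sketched as a short driver function. This is not the SATé or PASTA code: `estimate_tree` and `realign` are hypothetical callables supplied by the caller (standing in for an ML tree estimator and the tree-guided re-alignment step), and a fixed iteration count is assumed as the termination condition.

```python
def co_estimate(sequences, initial_alignment, estimate_tree, realign,
                max_iterations=3):
    """Alternate between re-alignment and tree estimation.

    estimate_tree(alignment) -> (tree, ml_score)      # hypothetical callable
    realign(sequences, tree) -> alignment             # hypothetical callable
    Returns the (alignment, tree, ml_score) triple with the best ML score.
    """
    alignment = initial_alignment
    tree, score = estimate_tree(alignment)
    best = (alignment, tree, score)
    for _ in range(max_iterations):
        # Use the current tree to compute a new alignment.
        alignment = realign(sequences, tree)
        # Estimate an ML tree on the new alignment.
        tree, score = estimate_tree(alignment)
        if score > best[2]:
            best = (alignment, tree, score)
    return best
```

Keeping the best-scoring pair rather than the last one matters because, as the slide states, the returned pair is the one with the best ML score, and later iterations are not guaranteed to improve it.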
![Page 41: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/41.jpg)
SATé-1 (Science 2009) performance
[Figure: results on 1000-taxon models, ordered by difficulty; rate of evolution generally increases from left to right]
SATé-1 24 hour analysis, on desktop machines
(Similar improvements for biological datasets)
SATé-1 can analyze up to about 8,000 sequences.
![Page 42: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/42.jpg)
SATé-1 and SATé-2 (Systematic Biology, 2012)
[Figure: results on 1000-taxon models ranked by difficulty]
SATé-1: up to 8K
SATé-2: up to ~50K
![Page 43: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/43.jpg)
SATé variants differ only in the decomposition strategy
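[Figure: a cycle of four steps on a tree whose leaves fall into subsets A, B, C, D]
• Decompose dataset (into subsets A, B, C, D)
• Align subsets
• Merge sub-alignments (into ABCD)
• Estimate ML tree on merged alignment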
![Page 44: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/44.jpg)
SATé-II: centroid edge decomposition
[Figure: a tree with subsets A, B, C, D, E, decomposed hierarchically: ABCDE into ABC and DE; ABC into AB and C; AB into A and B; DE into D and E]
SATé-II makes all subsets small (user parameter), and can analyze 50K sequences.
SATé-I decomposition produced clades and had bigger subsets; limited to 8K sequences.
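A centroid edge decomposition can be sketched as a recursive split on the edge that divides the leaves most evenly, stopping once a subset is at most the user-chosen size. The code below is a simplified illustration, not the SATé-II implementation: the adjacency-dictionary tree, the node names `x`, `y`, `z`, and the quadratic-time edge search are all assumptions made to keep the example short.

```python
def side_of(adj, nodes, u, v):
    """Nodes of the subtree `nodes` that stay with u when edge (u, v) is cut."""
    seen = {u}
    stack = [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in nodes and y not in seen and (x, y) != (u, v):
                seen.add(y)
                stack.append(y)
    return seen


def centroid_decompose(adj, taxa, max_size, nodes=None):
    """Split the taxa into subsets of at most max_size by repeatedly cutting
    the edge that divides the remaining taxa most evenly."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    nodes = set(adj) if nodes is None else nodes
    mine = set(taxa) & nodes
    if len(mine) <= max_size:
        return [mine]
    best_balance, best_side = None, None
    for u in sorted(nodes):
        for v in adj[u]:
            if v in nodes and u < v:  # visit each edge once
                side = side_of(adj, nodes, u, v)
                balance = abs(len(mine) - 2 * len(side & mine))
                if best_balance is None or balance < best_balance:
                    best_balance, best_side = balance, side
    rest = nodes - best_side
    return (centroid_decompose(adj, taxa, max_size, best_side)
            + centroid_decompose(adj, taxa, max_size, rest))


# Hypothetical unrooted tree: leaves A..E, internal nodes x, y, z.
adj = {
    "A": ["x"], "B": ["x"], "C": ["y"], "D": ["z"], "E": ["z"],
    "x": ["A", "B", "y"], "y": ["x", "C", "z"], "z": ["y", "D", "E"],
}
subsets = centroid_decompose(adj, "ABCDE", max_size=2)
```

With `max_size=2` the toy tree splits into `{A, B}`, `{C}`, and `{D, E}`. The subsets are bounded in size but, unlike a clade-based decomposition, are not required to be clades of a rooted tree.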
![Page 45: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/45.jpg)
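SATé: merger strategy
[Figure: subset alignments A, B, C, D, E merged hierarchically: A and B into AB; AB and C into ABC; D and E into DE; ABC and DE into ABCDE]
Both SATé versions use the same hierarchical merger strategy. On large (50K) datasets, the last pairwise merger can use more than 70% of the running time.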
![Page 46: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/46.jpg)
PASTA merging: Step 1
[Figure: subsets A, B, C, D, E on the tree, and a spanning tree connecting them]
Compute a spanning tree connecting alignment subsets
![Page 47: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/47.jpg)
PASTA merging: Step 2
[Figure: the spanning tree on subsets A, B, C, D, E, with pairwise-merged alignments AB, BD, CD, DE on its edges]
Use Opal (or Muscle) to merge adjacent subset alignments in the spanning tree
![Page 48: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/48.jpg)
PASTA merging: Step 3
[Figure: the spanning tree with pairwise-merged alignments AB, BD, CD, DE]
Use transitivity to merge all pairwise-merged alignments from Step 2 into a final alignment on the entire dataset
AB + BD = ABD
ABD + CD = ABCD
ABCD + DE = ABCDE
Overall: O(n log(n) + L)
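A transitivity merge such as AB + BD = ABD can be illustrated on a toy case. The sketch below assumes that the two input alignments induce exactly the same alignment on the sequences they share (here a single sequence `b`), so columns containing shared characters can be matched up directly and every other column is carried over with gaps added. The function and the three short sequences are illustrative assumptions, not PASTA's implementation or data.

```python
def columns(aln):
    """Split an alignment (name -> gapped row) into a list of column dicts."""
    width = len(next(iter(aln.values())))
    return [{name: row[j] for name, row in aln.items()} for j in range(width)]


def merge_by_transitivity(aln1, aln2):
    """Merge two alignments that agree on their shared sequences."""
    shared = [n for n in aln1 if n in aln2]
    if not shared:
        raise ValueError("alignments must share at least one sequence")
    only1 = [n for n in aln1 if n not in aln2]
    only2 = [n for n in aln2 if n not in aln1]
    cols1, cols2 = columns(aln1), columns(aln2)

    def is_anchor(col):
        # A column is an anchor if some shared sequence has a letter in it.
        return any(col[n] != "-" for n in shared)

    out, i, j = [], 0, 0
    while i < len(cols1) or j < len(cols2):
        if i < len(cols1) and not is_anchor(cols1[i]):
            # Column private to aln1: pad aln2-only sequences with gaps.
            out.append({**{n: "-" for n in only2}, **cols1[i]})
            i += 1
        elif j < len(cols2) and not is_anchor(cols2[j]):
            # Column private to aln2: pad aln1-only sequences with gaps.
            out.append({**{n: "-" for n in only1}, **cols2[j]})
            j += 1
        else:
            # Both sides are at an anchor column; they must agree on it.
            if i >= len(cols1) or j >= len(cols2) or any(
                cols1[i][n] != cols2[j][n] for n in shared
            ):
                raise ValueError("alignments disagree on shared sequences")
            out.append({**cols1[i], **cols2[j]})
            i += 1
            j += 1
    names = list(aln1) + only2
    return {n: "".join(col[n] for col in out) for n in names}


# Toy inputs: "b" is shared and degaps to AGCT in both alignments.
aln_ab = {"a": "ACG-T", "b": "A-GCT"}
aln_bd = {"b": "AG-CT", "d": "AGTC-"}
merged = merge_by_transitivity(aln_ab, aln_bd)
```

Each column is visited once, so the merge is linear in the combined alignment width, which is consistent with the slide's point that transitivity avoids the expensive final pairwise mergers of the hierarchical strategy.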
![Page 49: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/49.jpg)
PASTA vs. SATé-II profiling and scaling
[Figure: Fig. 5 from "PASTA: ultra-large multiple sequence alignment", three panels: (a) running time profiling; (b) running time (minutes) against number of sequences (10,000 to 200,000); (c) speedup against number of threads (1 to 12) for PASTA and SATe2]
Fig. 5. Running time comparison of PASTA and SATe. (a) Running time profiling on one iteration for RNASim datasets with 10K and 50K sequences (the dotted region indicates the last pairwise merge). (b) Running time for one iteration of PASTA with 12 CPUs as a function of the number of sequences (the solid line is fitted to first two points). (c) Scalability for PASTA and SATe with increased number of CPUs.
…reason SATe uses so much time is that all mergers are done hierarchically using either Opal (for small datasets) or Muscle (on larger datasets), and both are computationally expensive with increased number of sequences. For example, the last pairwise merge within SATe, shown by the dotted area in Figure 5a, is entirely serial and takes up a large chunk of the total time. PASTA solves this problem by using transitivity for all but the initial pairwise mergers, and therefore scales well with increased dataset size, as shown in Figure 5b (the sub-linear scaling is due to a better use of parallelism with increased number of sequences). Finally, Figure 5c shows that PASTA is highly parallelizable, and has a much better speed-up with increasing number of threads than SATe does. While PASTA has a much improved parallelization, it does not quite scale up linearly, because FastTree-2 does not scale up well with increased thread count.
Divide-and-Conquer strategy: impact of guide tree. We also investigated the impact of the use of the guide tree for computing the subset decomposition, and hence defining the Type 1 sub-alignments. We compared results obtained using three different decompositions: the decomposition computed by PASTA on the HMM-based starting tree, the decomposition computed by PASTA on the true (model) tree, and a random decomposition into subsets of size 200, all on the RNASim 10k dataset. PASTA alignments and trees had roughly the same accuracy when the guide tree was either the true tree or the HMM-based starting tree (Table 3). However, when based on a random decomposition, tree error increased dramatically from 10.5% to 52.3%, and alignment scores also dropped substantially. Thus, the guide-tree based dataset decomposition used by PASTA provides substantial improvements over random decompositions, and the default technique for getting the starting tree works quite well.
![Page 50: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/50.jpg)
PASTA Running Time and Scalability
[Figure: Fig. 5 from "PASTA: ultra-large multiple sequence alignment", three panels: (a) running time profiling; (b) running time (minutes) against number of sequences (10,000 to 200,000); (c) speedup against number of threads (1 to 12) for PASTA and SATe2]
Fig. 5. Running time comparison of PASTA and SATe. (a) Running time profiling on one iteration for RNASim datasets with 10K and 50K sequences (the dotted region indicates the last pairwise merge). (b) Running time for one iteration of PASTA with 12 CPUs as a function of the number of sequences (the solid line is fitted to first two points). (c) Scalability for PASTA and SATe with increased number of CPUs.
• One iteration
• Using
– 12 CPUs
– 1 node on Lonestar TACC
– Maximum 24 GB memory
• Showing wall-clock running time
– ~1 hour for 10k taxa
– ~17 hours for 200k taxa
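As a rough check on what these two timings imply, the exponent k in "time grows like n^k" can be estimated from the two points. This is only a two-point, back-of-the-envelope sketch: the function name is illustrative, and the inputs are the approximate wall-clock values quoted on the slide.

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ n**k from two (dataset size, running time) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# ~1 hour for 10k taxa and ~17 hours for 200k taxa (approximate slide values).
k = scaling_exponent(10_000, 1.0, 200_000, 17.0)
print(f"estimated exponent: {k:.2f}")
```

An exponent slightly below 1 is consistent with the near-linear (slightly sub-linear) scaling described for one PASTA iteration, but two approximate points cannot say more than that.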
![Page 51: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/51.jpg)
Tree accuracy
Figure 3.3: Tree error rates on nucleotide datasets. We show missing branch (also known as false negative or FN) rates for maximum likelihood trees estimated using FastTree-II, on the reference alignment as well as alignments computed using PASTA and other methods; results not shown indicate failure to complete within 24 hours using 12 cores on the datasets. Error bars show standard error over 10 replicates for all model conditions of the Indelible and the 10,000-sequence RNASim datasets.
1 million sequences:
• PASTA finished one iteration in 15 days
• PASTA tree had 6% error, compared to 5.6% when using true alignment
• Starting tree had 8.4% error
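The tree error figures on this slide are missing branch (FN) rates. A minimal sketch of that measure follows, assuming trees are given as nested tuples of leaf names (a representation chosen here for illustration, not the format used by the tools in the talk): each internal branch defines a bipartition of the leaves, and the FN rate is the fraction of true-tree bipartitions absent from the estimated tree.

```python
def bipartitions(tree):
    """Non-trivial leaf bipartitions of a tree given as nested tuples of names."""
    clades = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        clades.append(below)
        return below

    everything = visit(tree)
    anchor = min(everything)  # fixed leaf used to pick one side of each split
    result = set()
    for side in clades:
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(everything) - 2:
            result.add(everything - side if anchor in side else side)
    return result

def fn_rate(true_tree, estimated_tree):
    """Fraction of true bipartitions missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)

true_tree = ((("a", "b"), "c"), ("d", "e"))
estimated = ((("a", "c"), "b"), ("d", "e"))
print(fn_rate(true_tree, estimated))
```

The sketch assumes both trees have the same leaf set and that the true tree has at least one internal branch; otherwise the division is undefined.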
![Page 52: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/52.jpg)
PASTA vs. SATé-II
• Difference is how subset alignments are merged together (transitivity instead of Opal/Muscle).
• As expected, PASTA is faster and can analyze larger datasets.
• Unexpected: PASTA produces more accurate alignments and trees (on both simulated and biological data, including DNA, RNA, and AA sequences).
• Thus, transitivity applied to compatible and overlapping alignments gives a surprisingly accurate technique for merging a collection of alignments.
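The transitivity idea can be sketched as follows: if residue x is aligned with y in one alignment and y with z in an overlapping one, then x, y, and z share a column in the merged alignment. This is a simplified illustration, not PASTA's implementation. It assumes alignments are dicts mapping sequence names to equal-length gapped strings, that the inputs are compatible, and it uses a union-find plus a topological sort (names such as `merge_by_transitivity` are hypothetical).

```python
from graphlib import TopologicalSorter  # Python 3.9+

def merge_by_transitivity(alignments):
    """Merge compatible, overlapping alignments via transitive closure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    ungapped = {}
    for aln in alignments:
        counters = dict.fromkeys(aln, 0)
        width = len(next(iter(aln.values())))
        for col in range(width):
            members = []
            for name, row in aln.items():
                if row[col] != "-":
                    members.append((name, counters[name]))
                    counters[name] += 1
            for residue in members:
                # Residues sharing a column join one equivalence class.
                parent[find(residue)] = find(members[0])
        for name, row in aln.items():
            ungapped[name] = row.replace("-", "")

    # Each class must come after the class of the preceding residue.
    before = {}
    for name, seq in ungapped.items():
        for i in range(len(seq)):
            node = find((name, i))
            before.setdefault(node, set())
            if i > 0 and find((name, i - 1)) != node:
                before[node].add(find((name, i - 1)))
    # Raises graphlib.CycleError if the inputs are not compatible.
    order = list(TopologicalSorter(before).static_order())
    column = {node: k for k, node in enumerate(order)}

    merged = {}
    for name, seq in ungapped.items():
        row = ["-"] * len(order)
        for i, ch in enumerate(seq):
            row[column[find((name, i))]] = ch
        merged[name] = "".join(row)
    return merged

first = {"a": "ACG", "b": "A-G"}
second = {"b": "AG-", "c": "AGT"}
merged = merge_by_transitivity([first, second])
print(merged)
```

Here sequence b appears in both inputs and links them. When two classes are not ordered relative to each other by any sequence, the topological sort places them in an arbitrary but valid order.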
![Page 53: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/53.jpg)
PASTA and SATé-II: MSA “boosters”
• PASTA and SATé-II are techniques for improving the scalability of MSA methods to large datasets.
• We showed results here using MAFFT-L-INS-i to align small subsets with 200 sequences.
• We have also explored results using other MSA methods (e.g., Prank, Clustal, BAli-Phy), and obtain similar improvements in accuracy and/or scalability.
![Page 54: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/54.jpg)
1kp: Thousand Transcriptome Project
• Plant Tree of Life based on transcriptomes of ~1200 species
• More than 13,000 gene families (most not single copy)
Gene Tree Incongruence
G. Ka-Shu Wong (U Alberta)
N. Wickett (Northwestern)
J. Leebens-Mack (U Georgia)
N. Matasci (iPlant)
T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin)
Challenges:
• Massive gene tree conflict consistent with ILS
• Alignment of datasets with > 100,000 sequences
Plus many many other people…
![Page 55: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/55.jpg)
[Histogram of sequence lengths: Counts (0 to 12,000) versus Length (0 to 2000); mean 317, median 266.]
1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary
![Page 56: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/56.jpg)
All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.
![Page 57: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/57.jpg)
Challenge: Alignment of datasets with > 100,000 sequences with many fragmentary sequences
![Page 58: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/58.jpg)
UPP
UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”
Nguyen, Mirarab, and Warnow. Genome Biology, 2014.
Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.
![Page 59: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/59.jpg)
Uses an ensemble of HMMs
![Page 60: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/60.jpg)
Simple idea (not UPP)
• Select random subset of sequences, and build “backbone alignment”
• Construct a Hidden Markov Model (HMM) on the backbone alignment
• Add all remaining sequences to the backbone alignment using the HMM
![Page 61: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/61.jpg)
One Hidden Markov Model for the entire alignment?
![Page 62: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/62.jpg)
![Page 63: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/63.jpg)
This approach works well if the dataset is small and has low evolutionary rates, but is not very accurate otherwise.
![Page 64: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/64.jpg)
One Hidden Markov Model for the entire alignment?
HMM 1
![Page 65: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/65.jpg)
Or 2 HMMs?
HMM 1
HMM 2
![Page 66: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/66.jpg)
Or 4 HMMs?
HMM 1
HMM 2
HMM 3
HMM 4
![Page 67: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/67.jpg)
Or all 7 HMMs?
HMM 1
HMM 2
HMM 3
HMM 4
HMM 5
HMM 6
HMM 7
![Page 68: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/68.jpg)
UPP Algorithmic Approach
1. Select random subset of full-length sequences, and build “backbone alignment”
2. Construct an “Ensemble of Hidden Markov Models” on the backbone alignment
3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
![Page 69: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/69.jpg)
UPP Algorithmic Approach
1. Select random subset of full-length sequences, and build “backbone alignment” and “backbone tree”
Notes:
– Need to avoid fragments in the backbone
– We show results using PASTA for the backbone alignment and tree, but other methods can be used; we have explored BAli-Phy, a powerful Bayesian statistical method
– Random is good when taxonomic sampling is relatively uniform, but directed sampling can improve accuracy
– We explored backbones with 100 and 1000 sequences, even when the full dataset is very big (1,000,000, i.e., one million sequences)
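A minimal sketch of the backbone selection step, under a stated assumption: the slide only says to avoid fragments, so the rule used here (treat a sequence as full-length when its length is within a tolerance of the median length, 25% by default) and the names `select_backbone`, `tolerance`, and `seed` are illustrative choices rather than confirmed UPP settings.

```python
import random
from statistics import median

def select_backbone(sequences, size, tolerance=0.25, seed=0):
    """Pick a random backbone from sequences that look full-length.

    sequences maps names to unaligned strings. A sequence counts as
    full-length when its length is within `tolerance` of the median
    length (an assumed rule for this sketch).
    """
    lengths = {name: len(seq) for name, seq in sequences.items()}
    mid = median(lengths.values())
    full_length = sorted(
        name for name, n in lengths.items() if abs(n - mid) <= tolerance * mid
    )
    rng = random.Random(seed)  # seeded so the choice is reproducible
    return sorted(rng.sample(full_length, min(size, len(full_length))))

seqs = {
    "s1": "A" * 100,
    "s2": "A" * 110,
    "s3": "A" * 95,
    "s4": "A" * 105,
    "frag": "A" * 20,  # fragmentary sequence, excluded from the backbone
}
backbone = select_backbone(seqs, size=3)
print(backbone)
```

The backbone alignment and tree would then be computed on the selected sequences with an external method such as PASTA; that step is outside this sketch.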
![Page 70: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/70.jpg)
UPP Algorithmic Approach
2. Construct an “Ensemble of Hidden Markov Models” on the backbone alignment
– Technique: Create set of subsets (using the tree). Then, for each subset, build an HMM on the alignment induced on that subset.
– Note: Different subset sizes are good for different situations, and the ensemble technique is more accurate than disjoint sets
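One way to read "create set of subsets (using the tree)" is a recursive decomposition that keeps every subset it passes through, so the ensemble is nested rather than disjoint. The sketch below assumes a split on the most balanced ("centroid") edge and a stopping size `max_size`; both are assumptions for illustration, as are the function names and the adjacency-dict tree representation.

```python
def nested_leaf_subsets(adj, taxa, max_size):
    """Return every leaf subset produced by recursive balanced-edge splits.

    adj maps each node of an unrooted tree to the set of its neighbours;
    taxa is the set of node names that are real sequences.
    """
    leaves = sorted(n for n in adj if n in taxa)
    subsets = [leaves]
    if len(leaves) <= max_size:
        return subsets

    # Root anywhere; `order` lists every parent before its children.
    root = next(iter(adj))
    parent, order, stack = {root: None}, [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for nbr in adj[node]:
            if nbr != parent[node]:
                parent[nbr] = node
                stack.append(nbr)

    # Count taxa at or below each node, children before parents.
    below = {}
    for node in reversed(order):
        kids = (n for n in adj[node] if n != parent[node])
        below[node] = (node in taxa) + sum(below[k] for k in kids)

    # Pick the edge (parent[child], child) giving the most even split.
    child = min(order[1:], key=lambda n: abs(2 * below[n] - len(leaves)))
    side, stack = set(), [child]
    while stack:
        node = stack.pop()
        side.add(node)
        stack.extend(n for n in adj[node] if n != parent[node])

    for part in (side, set(adj) - side):
        sub_adj = {n: adj[n] & part for n in part}
        subsets += nested_leaf_subsets(sub_adj, taxa, max_size)
    return subsets

def induced_alignment(alignment, subset):
    """Restrict an alignment to `subset` and drop all-gap columns."""
    rows = [alignment[name] for name in subset]
    keep = [i for i in range(len(rows[0])) if any(r[i] != "-" for r in rows)]
    return {name: "".join(alignment[name][i] for i in keep) for name in subset}

edges = [("r", "u"), ("r", "v"), ("u", "p"), ("u", "q"), ("v", "s"), ("v", "t"),
         ("p", "a"), ("p", "b"), ("q", "c"), ("q", "d"),
         ("s", "e"), ("s", "f"), ("t", "g"), ("t", "h")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)
subsets = nested_leaf_subsets(adj, set("abcdefgh"), max_size=2)
print(len(subsets))
```

On this balanced eight-leaf tree the decomposition yields one subset of eight, two of four, and four of two, i.e., seven subsets, one HMM per subset. Each HMM would be built on `induced_alignment(backbone_alignment, subset)` with an external tool.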
![Page 71: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/71.jpg)
UPP Algorithmic Approach
3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
– For each of the remaining sequences s, find H, the HMM from the ensemble that has the best score (i.e., HMM maximizing Pr(s|H))
– Use HMMER code and H to add s into the backbone alignment
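The selection rule in this step is an argmax over the ensemble. The sketch below shows only that rule: the scoring function is passed in as a parameter, and `toy_score` is a deliberately crude stand-in (matches against a consensus string), not an HMM likelihood. In practice the score would come from an HMM search tool such as HMMER.

```python
def assign_to_best_hmm(queries, ensemble, score):
    """Map each query name to the name of its highest-scoring model."""
    return {
        name: max(ensemble, key=lambda model: score(seq, ensemble[model]))
        for name, seq in queries.items()
    }

def toy_score(seq, consensus):
    """Toy stand-in for Pr(s|H): number of positions matching a consensus."""
    return sum(a == b for a, b in zip(seq, consensus))

# Hypothetical two-model ensemble, each model reduced to a consensus string.
ensemble = {"HMM 1": "ACGUACGU", "HMM 2": "GGGGCCCC"}
queries = {"q1": "ACGUACGA", "q2": "GGGGCC"}
assignment = assign_to_best_hmm(queries, ensemble, toy_score)
print(assignment)
```

Each query would then be added to the backbone alignment using its selected model, as described in the second note above.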
![Page 72: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/72.jpg)
Evaluation
• Simulated datasets (some have fragmentary sequences):
– 10K to 1,000,000 sequences in RNASim (complex RNA sequence evolution simulation)
– 1000-sequence nucleotide datasets from SATé papers
– 5000-sequence AA datasets (from FastTree paper)
– 10,000-sequence Indelible nucleotide simulation
• Biological datasets:
– Proteins: largest BAliBASE and HomFam
– RNA: 3 CRW datasets up to 28,000 sequences
![Page 73: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/73.jpg)
RNASim: alignment error
Note: MAFFT was run under default settings for 10K and 50K sequences and under PartTree for 100K sequences, and fails to complete under any setting for 200K sequences. Clustal-Omega only completes on the 10K dataset.
All methods given 24 hrs on a 12-‐core machine
![Page 74: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/74.jpg)
RNASim: tree error
Note: MAFFT was run under default settings for 10K and 50K sequences and under PartTree for 100K sequences, and fails to complete under any setting for 200K sequences. Clustal-Omega only completes on the 10K dataset.
All methods given 24 hrs on a 12-‐core machine
![Page 75: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/75.jpg)
RNASim Million Sequences: alignment error
Notes:
• We show alignment error using average of SP-FN and SP-FP.
• UPP variants have better alignment scores than PASTA.
• (Not shown: Total Column Scores – PASTA more accurate than UPP)
• No other methods tested could complete on these data
• PASTA under-aligns: its alignment is 43 times wider than the true alignment (~900 Gb of disk space). UPP alignments were closer in length to the true alignment (0.93 to 1.38 times its width).
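The error measure in these notes can be sketched as follows, assuming alignments are dicts mapping sequence names to equal-length gapped strings (the function names are illustrative). SP-FN is the fraction of residue pairs aligned in the reference but not in the estimate; SP-FP is the fraction of pairs aligned in the estimate but not in the reference.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """Set of residue pairs that share a column, as ((name, index), ...)."""
    names = sorted(alignment)
    counters = dict.fromkeys(names, 0)
    pairs = set()
    for col in range(len(alignment[names[0]])):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, counters[name]))
                counters[name] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sp_errors(reference, estimated):
    """Return (SP-FN, SP-FP, their average)."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    sp_fn = len(ref - est) / len(ref)
    sp_fp = len(est - ref) / len(est)
    return sp_fn, sp_fp, (sp_fn + sp_fp) / 2

reference = {"a": "AC", "b": "AC"}
under_aligned = {"a": "AC-", "b": "A-C"}  # wider than the reference
errors = sp_errors(reference, under_aligned)
print(errors)
```

The toy estimate is wider than the reference and misses a true pair without adding a false one, which is the typical signature of under-alignment: SP-FN rises while SP-FP stays low. The sketch assumes both alignments contain at least one aligned pair.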
![Page 76: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/76.jpg)
RNASim Million Sequences: tree error
Using 12 processors:
• UPP(Fast,NoDecomp) took 2.2 days,
• UPP(Fast) took 11.9 days, and
• PASTA took 10.3 days
![Page 77: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/77.jpg)
[Figure S32 plots: (a) Average alignment error: mean alignment error (0.0 to 0.6) versus % fragmentary (0, 12.5, 25, 50) for PASTA and UPP(Default); (b) Average tree error: Delta FN tree error (0.0 to 0.4) versus % fragmentary for PASTA and UPP(Default).]
Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2 datasets.
Performance on fragmentary datasets of the 1000M2 model condition
UPP vs. PASTA: impact of fragmentation
Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolution, PASTA can still be highly accurate (data not shown). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.
![Page 78: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/78.jpg)
UPP Running Time
[Plot: wall-clock alignment time (hr, 0 to 15) versus number of sequences (50,000 to 200,000) for UPP(Fast).]
Wall-clock time used (in hours) given 12 processors
![Page 79: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/79.jpg)
Current Related Research
UPP:
• Using iteration
• Using other MSA models (not just HMMs) within the “Ensemble”
• Using structural alignments for the backbone
• Using powerful statistical methods to produce the backbone alignment and tree
PASTA:
• Using powerful statistical methods for the subset alignments
• Improving the pairwise merging technique
ENSEMBLE OF HMMS:
• Metagenomic taxon identification (collaboration with Mihai Pop, Maryland)
• Gene binning (joint with Jian Peng, UIUC, and with Jim Leebens-Mack, Georgia)
• Protein structure and function classification (collaborations with Martin Weigt, Paris, and Jian Peng, UIUC)
![Page 80: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/80.jpg)
Acknowledgments
PhD students: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD)
Undergrad: Keerthana Kumar
Current NSF grants:
• ABI-1458652 (multiple sequence alignment)
• III:AF:1513629 (metagenomics – collaborative with Mihai Pop, University of Maryland)
• DBI:1461364 (phylogenomics – collaborative with Rice and Stanford)
• CCF:1535977 (graph algorithms to improve phylogenetic estimation – collaborative with Berkeley)
Other recent or current support: Guggenheim Foundation, NSF DEB:0733029, NSF DBI:1062335, Microsoft Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), the University of Alberta (Canada), Grainger Foundation (at UIUC), and UIUC
TACC, UTCS, and UIUC computational resources
![Page 81: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/81.jpg)
Alignment Accuracy – Correct columns
[Fig. 3 plots: alignment accuracy (TC) on the RNASim datasets (10000, 50000, 100000, 200000 sequences) and the Gutell biological datasets (16S.3, 16S.T, 16S.B.ALL) for Starting Alignment, ClustalW, Mafft-Profile, Muscle, SATe2, and PASTA.]
Fig. 3. Alignment accuracy on the RNASim 10K-200K (left) and biological (right) datasets. We show the number of correctly aligned sites (top) and the average of the SP-score and modeler score (bottom). The starting alignment was incomplete on the 16S.T dataset, and so no result is shown for the starting alignment on that dataset.
were recovered entirely correctly. Another interesting trend is that as the number of sequences increases, the alignment accuracy decreased for MAFFT-profile but not for PASTA.
The biological datasets are smaller (see Table 1) and so are not as challenging. On the 16S.T dataset, the starting alignment did not return an alignment with all the sequences on the 16S.T dataset because HMMER considered one of the sequences unalignable. However, the starting alignment technique had good SP-scores for the other two datasets. Of the remaining methods, PASTA has the best sum-of-pairs scores (bottom panel), and MAFFT-profile has only slightly poorer scores; the other methods are substantially poorer. With respect to TC scores, on 16S.B.ALL and 16S.T, PASTA is in first place and SATe is in second place, but they swap positions on 16S.3. TC scores for the other methods are clearly less accurate, though Muscle does fairly well on the 16S.B.ALL dataset.
Comparison to SATe on 50,000 taxon dataset. SATe could not finish even one iteration on the RNASim with 50,000 sequences running for 24 hours and given 12 CPUs on TACC. However, we were able to run two iterations of SATe on a separate machine with no running time limits (12 Quad-Core AMD Opteron(tm) processors, 256GB of RAM memory). Given 12 CPUs, each iteration of SATe takes roughly 70 hours, compared to 5 hours for PASTA, and as shown next, the
![Page 82: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/82.jpg)
TIPP: high accuracy taxonomic identification of metagenomic data
TIPP: taxon identification and phylogenetic profiling (Bioinformatics, 2014)
Technique: combines UPP alignments and phylogenetic placement algorithms, and considers statistical uncertainty.
Results: better accuracy than all current methods, even for sequencing technologies producing high indel rates
Research funded by new NSF grant III:AF:1513629 (collaborative with Mihai Pop at University of Maryland)
![Page 83: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/83.jpg)
Metagenomic taxonomic identification and phylogenetic profiling
Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes
![Page 84: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/84.jpg)
1. What is this fragment? (Classify each fragment as well as possible.)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?
Basic Questions
![Page 85: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/85.jpg)
Scientific challenges:
• Ultra-large multiple-sequence alignment
• Alignment-free phylogeny estimation
• Supertree estimation
• Estimating species trees from many gene trees
• Genome rearrangement phylogeny
• Reticulate evolution
• Visualization of large trees and alignments
• Data mining techniques to explore multiple optima
• Theoretical guarantees under Markov models of evolution
Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics
The Tree of Life: Multiple Challenges
![Page 86: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/86.jpg)
High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.
![Page 87: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/87.jpg)
TIPP vs. other abundance profilers
• TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.
• All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).
![Page 88: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/88.jpg)
Metagenomic Taxon Identification
Objective: classify short reads in a metagenomic sample
![Page 89: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/89.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The distribution of the sample at the species-level is:
50% species A
20% species B
15% species C
14% species D
1% species E
Abundance Profiling
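As a minimal sketch of what an abundance profile is, the snippet below turns per-read species labels into percentages. The function name `abundance_profile` and the toy read labels are illustrative assumptions; this is not how TIPP or any profiler named in this talk computes its output, only the final tallying step under the assumption that every read has already been classified.

```python
from collections import Counter

def abundance_profile(read_labels):
    """Return {taxon: percentage of classified reads}, largest first."""
    counts = Counter(read_labels)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.most_common()}

# Toy input (hypothetical): 100 reads already assigned to species A..E.
reads = ["A"] * 50 + ["B"] * 20 + ["C"] * 15 + ["D"] * 14 + ["E"] * 1
profile = abundance_profile(reads)
for taxon, pct in profile.items():
    print(f"{pct:.0f}% species {taxon}")
```

The same tally works at any rank (genus, family, and so on) if the labels are given at that rank.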
![Page 90: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/90.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
Leading techniques:
PhymmBL (Brady & Salzberg, Nature Methods 2009)
NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland
MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
mOTU (Bork et al., Nature Methods 2013)
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).
Marker genes are single-copy, universal, and resistant to horizontal transmission.
Abundance Profiling
![Page 91: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/91.jpg)
“Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.
![Page 92: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/92.jpg)
Summary
• SATé-1 (Science 2009), SATé-2 (Systematic Biology 2012), co-estimation of alignments and trees. SATé-2 is well established in the biology community.
• PASTA (RECOMB 2014 and J Comp Biol 2015) is the replacement for SATé. PASTA can analyze up to 1,000,000 sequences.
• UPP (ultra-large multiple sequence alignment), Genome Biology 2015. Uses a collection of HMMs to represent a “backbone alignment”. Improves alignment and also detection of remote homology compared to a single HMM. UPP produces highly accurate alignments, even in the presence of fragmentary sequences. Can analyze datasets with 1,000,000 sequences.
• Other applications of the Ensemble of HMMs technique
• TIPP (metagenomic taxon identification and abundance profiling), Bioinformatics 2014.
• SEPP (phylogenetic placement), PSB 2012.
• Protein sequence analysis (collaborations with Martin Weigt, Paris, and Jian Peng, UIUC)
![Page 93: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/93.jpg)
SEPP
• SEPP: SATé-enabled Phylogenetic Placement, by Mirarab, Nguyen, and Warnow. Pacific Symposium on Biocomputing, 2012, special session on the Human Microbiome
• Objective:
– phylogenetic analysis of single-gene datasets with fragmentary sequences
• Introduces “HMM Family” technique
![Page 94: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/94.jpg)
Phylogenetic Placement
ACT..TAGA..A AGC...ACA TAGA...CTT TAGC...CCA AGG...GCAT
ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG • . • . • . ACCT
Fragmentary sequences from some gene
Full-length sequences for same gene, and an alignment and a tree
![Page 95: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/95.jpg)
Step 1: Align each query sequence to backbone alignment
Step 2: Place each query sequence into backbone tree, using extended alignment
Phylogenetic Placement
![Page 96: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/96.jpg)
HMMER vs. PaPaRa Alignments
Increasing rate of evolution
![Page 97: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/97.jpg)
Align Sequence
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC
![Page 98: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/98.jpg)
Align Sequence
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
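The aligned query above can be sanity-checked mechanically. The sketch below is only a consistency check on the slide's example, not an aligner: it assumes the extended alignment keeps the backbone's width (no new insertion columns), and the helper name `is_valid_extension` is made up for illustration.

```python
# Backbone alignment and query copied from the slide.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
query = "TAAAAC"
aligned_query = "-------T-A--AAAC--------"

def is_valid_extension(backbone, query, aligned_query):
    """True if the aligned row fits the backbone and spells the query."""
    widths = {len(row) for row in backbone.values()}
    same_width = widths == {len(aligned_query)}
    # Removing gap characters must give back the original query letters.
    same_letters = aligned_query.replace("-", "") == query
    return same_width and same_letters

print(is_valid_extension(backbone, query, aligned_query))
```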
![Page 99: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/99.jpg)
Phylogenetic Placement
• Align each query sequence to backbone alignment
– HMMALIGN (Eddy, Bioinformatics 1998)
– PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
• Place each query sequence into backbone tree
– pplacer (Matsen et al., BMC Bioinformatics 2011)
– EPA (Berger and Stamatakis, Systematic Biology 2011)
Note: pplacer and EPA use maximum likelihood, and are reported to have the same accuracy.
![Page 100: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/100.jpg)
Place Sequence
S1
S4
S2
S3 Q1
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
![Page 101: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/101.jpg)
HMMER+pplacer:
1) Build one HMM for the entire alignment
2) Align fragment to the HMM, and insert into alignment
3) Insert fragment into tree to optimize likelihood
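Steps 1 and 2 can be illustrated with a deliberately simplified stand-in. The sketch below is not HMMER and not a profile HMM: it builds a per-column nucleotide frequency profile (no insert or delete states) and slides the fragment along it without gaps. The names `build_profile` and `best_ungapped_offset`, the pseudocount, and the example fragment are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def build_profile(rows, pseudocount=1.0):
    """Per-column log-probabilities of each nucleotide (gaps are ignored)."""
    profile = []
    for column in zip(*rows):
        counts = {c: pseudocount for c in ALPHABET}
        for ch in column:
            if ch in counts:
                counts[ch] += 1
        total = sum(counts.values())
        profile.append({c: math.log(counts[c] / total) for c in ALPHABET})
    return profile

def best_ungapped_offset(profile, fragment):
    """Slide the fragment along the profile; return (offset, log-score) of the best fit."""
    best = None
    for offset in range(len(profile) - len(fragment) + 1):
        score = sum(profile[offset + i][ch] for i, ch in enumerate(fragment))
        if best is None or score > best[1]:
            best = (offset, score)
    return best

# Backbone rows from the slides; the fragment is a hypothetical example.
rows = [
    "-AGGCTATCACCTGACCTCCA-AA",
    "TAG-CTATCAC--GACCGC--GCA",
    "TAG-CT-------GACCGC--GCT",
    "TAC----TCAC--GACCGACAGCT",
]
offset, score = best_ungapped_offset(build_profile(rows), "GACCGC")
print(offset, round(score, 3))
```

A real profile HMM also models insertions and deletions, which is what lets it align fragments such as Q1 with gaps; step 3 (likelihood-based placement) is left to tools such as pplacer or EPA.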
![Page 102: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/102.jpg)
Using SEPP for taxon identification
ACT..TAGA..A AGC...ACA TAGA...CTT TAGC...CCA AGG...GCAT
ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG • . • . • . ACCT
Fragmentary sequences from some gene
Full-length sequences for same gene, and an alignment and a tree
![Page 103: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/103.jpg)
![Page 104: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/104.jpg)
SEPP(10%), based on ~10 HMMs
Increasing rate of evolution
![Page 105: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/105.jpg)
![Page 106: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/106.jpg)
• SEPP produced more accurate phylogenetic placements than HMMER+pplacer.
• The only difference is the use of a Family of HMMs instead of one HMM.
• The biggest differences are for datasets with high rates of evolution.
SEPP vs. HMMER+pplacer
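A Family of HMMs needs a way to split the backbone into subsets, one HMM per subset. The slides do not spell out the rule, so the sketch below assumes one: repeatedly cut the tree edge that best balances the number of leaves on each side, until every subset has at most `max_leaves` leaves (10% of the backbone would give roughly 10 subsets, in the spirit of "SEPP(10%)"). The function names and the toy tree are hypothetical, and the rule should be checked against the SEPP paper before relying on it.

```python
def component(adj, start, blocked_edge):
    """Nodes reachable from start without crossing blocked_edge."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if {u, v} == set(blocked_edge) or v in seen:
                continue
            seen.add(v)
            stack.append(v)
    return seen

def decompose(adj, leaves, max_leaves):
    """Split the leaf set by recursively cutting the most balanced edge."""
    if len(leaves) <= max_leaves:
        return [sorted(leaves)]
    best = None
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                side = component(adj, u, (u, v)) & leaves
                balance = abs(len(side) - (len(leaves) - len(side)))
                if best is None or balance < best[0]:
                    best = (balance, u, v)
    _, u, v = best
    part_u = component(adj, u, (u, v))
    part_v = set(adj) - part_u
    subsets = []
    for part in (part_u, part_v):
        sub_adj = {n: [m for m in adj[n] if m in part] for n in part}
        subsets += decompose(sub_adj, leaves & part, max_leaves)
    return subsets

# Toy unrooted backbone tree ((a,b),(c,d)) with internal nodes x and y.
tree = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
    "x": ["a", "b", "y"], "y": ["c", "d", "x"],
}
subsets = decompose(tree, {"a", "b", "c", "d"}, max_leaves=2)
print(subsets)
```

Each resulting subset would then get its own HMM, and a query is aligned using whichever HMM scores it best.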
![Page 107: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/107.jpg)
Scientific challenges:
• Ultra-large multiple-sequence alignment
• Gene tree estimation
• Metagenomic classification
• Alignment-free phylogeny estimation
• Supertree estimation
• Estimating species trees from many gene trees
• Genome rearrangement phylogeny
• Reticulate evolution
• Visualization of large trees and alignments
• Data mining techniques to explore multiple optima
• Theoretical guarantees under Markov models of evolution
Techniques: applied probability theory, graph theory, supercomputing, and heuristics
Testing: simulations and real data
The Tree of Life: Multiple Challenges
![Page 108: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/108.jpg)
![Page 109: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/109.jpg)
SATé-II running time profiling
![Page 110: Ultralarge)Mul,ple)Sequence)Alignmenttandy.cs.illinois.edu/warnow-maryland-msa-v2.pdf• Other applications of the eHMM technique: – phylogenetic placement (SEPP, PSB 2012) – metagenomic](https://reader034.fdocuments.in/reader034/viewer/2022051904/5ff57ba491b1c6056d3d3375/html5/thumbnails/110.jpg)
FIG. 1. Algorithmic design of PASTA. The first six boxes show the steps involved in one iteration of PASTA. The last two boxes show the meaning of transitivity for homologies defined by a column of an MSA, and how the concept of transitivity can be used to merge two compatible and overlapping alignments. MSA, multiple sequence alignment.
Figure from Mirarab et al., J. Computational Biology 2014
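The transitivity idea in the caption can be sketched with a union-find structure: if residue x is aligned with y in one alignment and y with z in another, then x, y, and z belong to one homology class. The code below is a minimal illustration under that reading, not PASTA's implementation; it only computes the classes and does not check compatibility or order the merged columns. All names and the two toy alignments are assumptions.

```python
def columns(alignment):
    """Yield each column as a list of (sequence name, residue index) pairs."""
    next_index = {name: 0 for name in alignment}
    width = len(next(iter(alignment.values())))
    for col in range(width):
        column = []
        for name, row in alignment.items():
            if row[col] != "-":
                column.append((name, next_index[name]))
                next_index[name] += 1
        yield column

def merge_by_transitivity(*alignments):
    """Return homology classes implied by all columns, closed under transitivity."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for aln in alignments:
        for column in columns(aln):
            for residue in column:
                find(residue)  # register every residue, even singletons
            for residue in column[1:]:
                parent[find(residue)] = find(column[0])

    classes = {}
    for residue in parent:
        classes.setdefault(find(residue), set()).add(residue)
    return list(classes.values())

# Two toy alignments that overlap on sequence s2 (hypothetical data).
aln1 = {"s1": "AC-G", "s2": "ACTG"}
aln2 = {"s2": "ACTG", "s3": "A-TG"}
classes = merge_by_transitivity(aln1, aln2)
for cls in classes:
    print(sorted(cls))
```

Here s1 and s3 never appear in the same input alignment, yet their residues end up in shared classes because both are aligned to s2.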