New Multiple Sequence Alignment Methods
description
Transcript of New Multiple Sequence Alignment Methods
![Page 1: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/1.jpg)
New Multiple Sequence Alignment Methods
Tandy WarnowThe Department of Computer Science
The University of Texas at Austin
![Page 2: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/2.jpg)
The “Tree of Life”
![Page 3: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/3.jpg)
Avian Phylogenomics ProjectG Zhang, BGI
• Approx. 50 species, whole genomes• 8000+ genes, UCEs• Gene sequence alignments and trees computed using SATé
MTP Gilbert,Copenhagen
S. Mirarab Md. S. Bayzid UT-Austin UT-Austin
T. WarnowUT-Austin
Plus many many other people…
Erich Jarvis,HHMI
Challenges: Maximum likelihood tree estimation on multi-million-site
sequence alignmentsMassive gene tree incongruence
![Page 4: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/4.jpg)
1kp: Thousand Transcriptome Project
Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy)Gene Tree Incongruence
G. Ka-Shu WongU Alberta
N. WickettNorthwestern
J. Leebens-MackU Georgia
N. MatasciiPlant
T. Warnow, S. Mirarab, N. Nguyen, Md. S.BayzidUT-Austin UT-Austin UT-Austin UT-Austin
Challenge: Alignment of datasets with > 100,000 sequencesGene tree incongruence
Plus many many other people…
![Page 5: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/5.jpg)
Multiple Sequence Alignment (MSA): another grand challenge1
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--…Sn = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGC …Sn = TCACGACCGACA
Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation
1 Frontiers in Massive Data Analysis, National Academies Press, 2013
![Page 6: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/6.jpg)
DNA Sequence Evolution
AAGACTT
TGGACTTAAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTTAAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT
![Page 7: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/7.jpg)
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y
![Page 8: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/8.jpg)
AGAT TAGACTT TGCACAA TGCGCTTAGGGCATGA
U V W X Y
U
V W
X
Y
The “real” problem
![Page 9: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/9.jpg)
…ACGGTGCAGTTACCA…
MutationDeletion
…ACCAGTCACCA…
Indels (insertions and deletions)
![Page 10: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/10.jpg)
…ACGGTGCAGTTACC-A……AC----CAGTCACCTA…
• The true multiple alignment – Reflects historical substitution, insertion, and deletion events– Defined using transitive closure of pairwise alignments computed on
edges of the true tree
…ACGGTGCAGTTACCA…
SubstitutionDeletion
…ACCAGTCACCTA…
Insertion
![Page 11: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/11.jpg)
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
![Page 12: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/12.jpg)
Phase 1: Alignment
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
![Page 13: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/13.jpg)
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
S1
S4
S2
S3
![Page 14: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/14.jpg)
Simulation Studies
S1 S2
S3S4
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--S3 = TAG-C--T-----GACCGC--S4 = T---C-A-CGACCGA----CA
Compare
True tree and alignment
S1 S4
S3S2
Estimated tree and alignment
Unaligned Sequences
![Page 15: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/15.jpg)
Quantifying Error
FN: false negative (missing edge)FP: false positive (incorrect edge)
50% error rate
FN
FP
![Page 16: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/16.jpg)
Two-phase estimationAlignment methods• Clustal• POY (and POY*)• Probcons (and Probtree)• Probalign• MAFFT• Muscle• Di-align• T-Coffee • Prank (PNAS 2005, Science 2008)• Opal (ISMB and Bioinf. 2007)• FSA (PLoS Comp. Bio. 2009)• Infernal (Bioinf. 2009)• Etc.
Phylogeny methods• Bayesian MCMC • Maximum
parsimony • Maximum likelihood • Neighbor joining• FastME• UPGMA• Quartet puzzling• Etc.RAxML: heuristic for large-scale ML optimization
![Page 17: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/17.jpg)
1000-taxon models, ordered by difficulty (Liu et al., 2009)
![Page 18: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/18.jpg)
Problems with the two-phase approach
• Current alignment methods fail to return reasonable alignments on large datasets with high rates of indels and substitutions.
• Manual alignment is time consuming and subjective.
• Systematists discard potentially useful markers if they are difficult to align.
This issues seriously impact large-scale phylogeny estimation (and Tree of Life projects)
![Page 19: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/19.jpg)
SATé
SATé (Simultaneous Alignment and Tree Estimation)• Liu et al., Science 2009• Liu et al., Systematic Biology 2012• Public distribution (open source software) and
user-friendly GUI
![Page 20: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/20.jpg)
1000-taxon models, ordered by difficulty (Liu et al., 2009)
![Page 21: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/21.jpg)
Re-aligning on a treeA
B D
C
Merge sub-
alignments
Estimate ML tree on merged
alignment
Decompose dataset
A BC D Align
subproblems
A BC D
ABCD
![Page 22: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/22.jpg)
SATé Algorithm
Tree
Obtain initial alignment and estimated ML tree
![Page 23: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/23.jpg)
SATé Algorithm
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
![Page 24: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/24.jpg)
SATé Algorithm
Estimate ML tree on new alignment
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
![Page 25: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/25.jpg)
SATé Algorithm
Estimate ML tree on new alignment
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
If new alignment/tree pair has worse ML score, realign using a different decomposition
Repeat until termination condition (typically, 24 hours)
![Page 26: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/26.jpg)
1000-taxon models, ordered by difficulty (Liu et al., 2009)
![Page 27: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/27.jpg)
1000 taxon models, ordered by difficulty
24 hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)
![Page 28: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/28.jpg)
1000 taxon models ranked by difficulty
![Page 29: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/29.jpg)
LimitationsA
B D
C
Merge sub-
alignments
Estimate ML tree on merged
alignment
Decompose dataset
A BC D Align
subproblems
A BC D
ABCD
![Page 30: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/30.jpg)
LimitationsA
B D
C
Mergesub-
alignments
Estimate ML tree on merged
alignment
Decompose dataset
A BC D Align
subproblems
A BC D
ABCD
![Page 31: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/31.jpg)
PASTA Algorithm (Improving SATé)
Estimate ML tree on new alignment
Tree
Obtain initial alignment and estimated ML tree
New Re-Alignment Technique
Alignment
If new alignment/tree pair has worse ML score, realign using a different decomposition
Repeat until termination condition (typically, 24 hours)
![Page 32: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/32.jpg)
Each PASTA iterationA
B D
C
New technique
for merging alignments
Estimate ML tree on merged
alignment
Decompose dataset
A BC D Align
subproblems
A BC D
ABCD
![Page 33: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/33.jpg)
PASTA vs. SATé: better alignments and trees, better scalability
Analyses of Gutell’s16S datasets with curatedstructural alignments andreference trees usingmaximum likelihoodwith bootstrapping.
PASTA can analyzedatasets with up to200,000 sequencesefficiently.
SATé maxes out around50,000 sequences.
![Page 34: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/34.jpg)
1KP dataset: Large and Highly Fragmentary!
Cytochrome dataset sequence length distribution
>100,000 AA sequencesTranscriptome data
![Page 35: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/35.jpg)
UPP: basic idea
Input: set S of unaligned sequencesOutput: alignment on S
• Select random subset X of S• Estimate “backbone” alignment A and tree T on X• Independently align each sequence in S-X to A• Use transitivity to produce multiple sequence
alignment A* for entire set S
![Page 36: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/36.jpg)
Input: Unaligned Sequences
S1 = AGGCTATCACCTGACCTCCAATS2 = TAGCTATCACGACCGCGCTS3 = TAGCTGACCGCGCTS4 = TACTCACGACCGACAGCTS5 = TAGGTACAACCTAGATCS6 = AGATACGTCGACATATC
![Page 37: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/37.jpg)
Step 1: Pick random subset(backbone)
S1 = AGGCTATCACCTGACCTCCAATS2 = TAGCTATCACGACCGCGCTS3 = TAGCTGACCGCGCTS4 = TACTCACGACCGACAGCTS5 = TAGGTACAACCTAGATCS6 = AGATACGTCGACATATC
![Page 38: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/38.jpg)
Step 2: Compute backbone alignment
S1 = -AGGCTATCACCTGACCTCCA-ATS2 = TAG-CTATCAC--GACCGC--GCTS3 = TAG-CT-------GACCGC--GCTS4 = TAC----TCAC—-GACCGACAGCTS5 = TAGGTAAAACCTAGATCS6 = AGATAAAACTACATATC
![Page 39: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/39.jpg)
Step 3: Align each remaining sequence to backbone
S1 = -AGGCTATCACCTGACCTCCA-AT-S2 = TAG-CTATCAC--GACCGC--GCT-S3 = TAG-CT-------GACCGC—-GCT-S4 = TAC----TCAC--GACCGACAGCT-S5 = TAGG---T-A—CAA-CCTA--GATC
First we add S5 to the backbone alignment
![Page 40: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/40.jpg)
Step 3: Align each remaining sequence to backbone
S1 = -AGGCTATCACCTGACCTCCA-AT-S2 = TAG-CTATCAC--GACCGC--GCT-S3 = TAG-CT-------GACCGC--GCT-S4 = TAC----TCAC—-GACCGACAGCT-S6 = -AG---AT-A-CGTC--GACATATC
Then we add S6 to the backbone alignment
![Page 41: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/41.jpg)
Step 4: Use transitivity to obtain MSA on entire set
S1 = -AGGCTATCACCTGACCTCCA-AT--S2 = TAG-CTATCAC--GACCGC--GCT--S3 = TAG-CT-------GACCGC--GCT--S4 = TAC----TCAC--GACCGACAGCT--S5 = TAGG---T-A—CAA-CCTA--GATC-S6 = -AG---AT-A-CGTC--GACATAT-C
![Page 42: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/42.jpg)
UPP: details
Input: set S of unaligned sequencesOutput: alignment on S
• Select random subset X of S• Estimate “backbone” alignment A and tree T on X• Independently align each sequence in S-X to A• Use transitivity to produce multiple sequence
alignment A* for entire set S
![Page 43: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/43.jpg)
UPP: details
Input: set S of unaligned sequencesOutput: alignment on S
• Select random subset X of S• Estimate “backbone” alignment A and tree T on X• Independently align each sequence in S-X to A• Use transitivity to produce multiple sequence
alignment A* for entire set S
![Page 44: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/44.jpg)
How to align sequences to a backbone alignment?
Standard machine learning technique: Build HMM (Hidden Markov Model) for backbone alignment, and use it to align remaining sequences
We use HMMER (Eddy, HHMI) for this purpose
![Page 45: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/45.jpg)
1: Build Hidden Markov Model (HMM) for backbone MSA2: Use HMM to align remaining sequences (one by one)3: Use transitivity to infer MSA on entire dataset
![Page 46: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/46.jpg)
Using HMMER
Using HMMER works well…• …except when the dataset has a high
evolutionary diameter.
![Page 47: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/47.jpg)
Using HMMER
Using HMMER works well…except when the dataset is big!
![Page 48: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/48.jpg)
One Hidden Markov Model for the entire alignment?
![Page 49: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/49.jpg)
Or 2 HMMs?
![Page 50: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/50.jpg)
Or 4 HMMs?(or 8 or 16 or…)
![Page 51: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/51.jpg)
Or what about many different HMMs(HMMs on subsets of size 10, 20, 40, …, n)?
![Page 52: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/52.jpg)
Evaluation
• Simulated datasets (some have fragmentary sequences):– 10K to 1,000,000 sequences in RNASim (Sheng Guo, Li-San
Wang, and Junhyong Kim, arxiv)– 1000-sequence nucleotide datasets from SATe papers– 5000-sequence AA datasets (from FastTree paper)– 10,000-sequence Indelible nucleotide simulation
• Biological datasets:– Proteins: largest BaliBASE and HomFam – RNA: 3 CRW (Gutell) datasets up to 28,000 sequences
![Page 53: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/53.jpg)
RNASim: Tree error
Maximum Likelihoodtrees computedusing FastTree (Price)
![Page 54: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/54.jpg)
One Million Sequences: Tree Error
UPP(100,100): 1.6 days using 12 processors
UPP(100,10): 7 days using 12 processors
Note: UPP decomposition improves accuracy
![Page 55: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/55.jpg)
UPP vs. MAFFT-profile Running Time
![Page 56: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/56.jpg)
AA Sequence Alignment Error (13 HomFam datasets)
Homfam: Sievers et al., MSB 2011 10,000 to 46,000 sequences per dataset; 5-14 sequences in each seed alignmentSequence lengths from 46-364 AA
![Page 57: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/57.jpg)
1KP Cytochrome dataset sequence length distribution (>100K seqs)
![Page 58: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/58.jpg)
Performance on fragmentary 1000M2 with varying degrees of fragmentation.
![Page 59: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/59.jpg)
Impact of rate of evolution on performance on fragmentary sequence data
![Page 60: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/60.jpg)
Impact of rate of evolution on performance on fragmentary sequence data
![Page 61: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/61.jpg)
MSA and Phylogeny Estimation
• MSA estimation impacts phylogenetic accuracy, and many datasets are hard to align:
– Large datasets– Datasets with high rates of evolution– Datasets with fragmentary sequences
• SATé (Liu et al., Science 2009 and Systematic Biology 2012) provides very good alignments and trees even for high rates of evolution and large datasets
• PASTA (RECOMB 2014) is a direct improvement on SATé (matches or improves accuracy on small datasets, better accuracy on large datasets, can analyze very large datasets – up to 200,000 sequences so far).
• UPP – different approach, highly robust to fragmentary sequences, more accurate alignments than PASTA (but in some cases less accurate trees), can analyze even larger datasets, highly parallelizable. (Not yet available).
![Page 62: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/62.jpg)
UPP Algorithmic Parameters
• Decompose or not? Use a Family of HMMs or just a single HMM?
• Use a small (100 sequence) or large (1000 sequence) backbone?
• Others:– How to compute backbone alignment and tree– How to select backbone dataset
![Page 63: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/63.jpg)
![Page 64: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/64.jpg)
Observations
• Decomposing (using a family of HMMs) improves tree accuracy, but has little impact on alignment accuracy.
• Using large backbones rather than small improves alignment accuracy but has only a small impact on tree accuracy.
• Alignment error metrics and tree error metrics are not that well correlated.
![Page 65: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/65.jpg)
SEPP, TIPP, and UPP
• SEPP: SATe-enabled Phylogenetic Placement (Mirarab, Nugyen and Warnow, PSB 2012)
• TIPP: Taxon identification and phylogenetic profiling (Nguyen, Mirarab, Liu, Pop, and Warnow, submitted)
• UPP: Ultra-large multiple sequence alignment using SEPP (Nguyen, Mirarab, Kumar, Wang, Guo, Kim, and Warnow, in preparation)
![Page 66: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/66.jpg)
Future Work
• Extending TIPP to non-marker genes• Using the new HMM Family technique in SEPP
and UPP• Using external seed alignments in SEPP, TIPP,
and UPP• Boosting statistical co-estimation methods by
using them in UPP for the backbone alignment and tree
![Page 67: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/67.jpg)
Warnow Laboratory
PhD students: Siavash Mirarab*, Nam Nguyen, and Md. S. Bayzid**Undergrad: Keerthana KumarLab Website: http://www.cs.utexas.edu/users/phylo
Funding: Guggenheim Foundation, Packard, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and the University of Alberta (Canada)
TACC and UTCS computational resources* Supported by HHMI Predoctoral Fellowship** Supported by Fulbright Foundation Predoctoral Fellowship
![Page 68: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/68.jpg)
Indelible 10K: Alignment error
Note: ClustalO and Mafft-Profile(Default)fail to generate an alignmentunder this model condition.
![Page 69: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/69.jpg)
Indelible 10K : Tree error
Note: ClustalO and Mafft-Profile(Default)fail to generate an alignmentunder this model condition.
![Page 70: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/70.jpg)
FastTree AA: Alignment error
![Page 71: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/71.jpg)
FastTree AA: Tree error
![Page 72: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/72.jpg)
Large fragmentary: Tree error
![Page 73: New Multiple Sequence Alignment Methods](https://reader033.fdocuments.in/reader033/viewer/2022051117/56815f5a550346895dce3cf1/html5/thumbnails/73.jpg)
Large fragmentary: Alignment error
10000-sequence dataset from RNASim16S.3 and 16S.T from Gutell’s Comparative Ribosomal Website (CRW)