Phyloinformatics or How to analyze LOTS of sequences
Heath Blackmon, University of Texas at Arlington
Bioinformatics – Spring 2014
Phyloinformatic workflow
Retrieve Sequences
• Phylota
• GenBank
Align
• MAFFT
Evaluate Alignment
• LAST
• Gblocks / Guidance
Infer Phylogeny
www.phylota.net
Select and Download Data
• Find a sequence cluster with:
  > 500 sequences
  < 2000 base pairs
- Tetrapoda
- Teleostei
- eudicotyledons
- arthropoda
Download the example file of 18S sequences from the class Google Drive: 18S.fa
Alignment Programs
• ProbCons
• T-Coffee
• Clustal
• Muscle
• Kalign
• PRRN
• DIALIGN-T
• MAFFT
• Clustal Omega
• BAli-Phy
• DECIPHER
Balance Between Scalability & Accuracy

Method (version)        Score    CPU time (s)
Consistency-based methods
  MAFFT 5.662           86.91           6,000
  ProbCons 1.10         87.25          43,000
  T-Coffee 2.46         84.56         210,000
Iterative refinement methods
  Muscle 3.52           81.67           3,400
  PRRN 3.11             82.61         250,000
  MAFFT 3.89            82.16           3,600
  ClustalW 2.0          76.67          58,000
Progressive methods
  Kalign 1.0            80.25             480
  MAFFT 5.662           78.63             140
  Muscle 3.52           77.63             160
  ClustalW 1.83         75.34           2,000
MAFFT
• Align 1,000s of sequences in minutes/hours
• Progressive and iterative methods supported
• Multiple scoring schemes
• Install locally or run on the CBRC servers
Go ahead and try aligning the 18S.fa file that you downloaded from the class Google Drive.
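If MAFFT is installed locally, the alignment step can be scripted. The sketch below is one possible way to do it from Python, not the only one. It assumes the `mafft` executable is on your PATH and uses the `--auto` option so that MAFFT picks a strategy based on the size of the data set. The helper names and the output file name are hypothetical.

```python
import shutil
import subprocess


def build_mafft_command(fasta_path, threads=1):
    """Build the argument list for a MAFFT run (helper name is hypothetical)."""
    # --auto lets MAFFT choose between its fast and accurate strategies.
    return ["mafft", "--auto", "--thread", str(threads), fasta_path]


def run_mafft(fasta_path, out_path, threads=1):
    """Align fasta_path with MAFFT and write the alignment to out_path."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on PATH")
    # MAFFT writes the alignment to standard output, so redirect it to a file.
    with open(out_path, "w") as out:
        subprocess.run(build_mafft_command(fasta_path, threads),
                       stdout=out, check=True)
    return out_path
```

For the class example this could be called as `run_mafft("18S.fa", "18S_aligned.fa")`, where the output name is just an illustration.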
Dot Plot
[Figure: four dot plots. One axis carries the sequence CTAAATACGAGCATAACA in every panel; the other axis carries the sequence listed for each panel.]
• Same sequence on both axes: CTAAATACGAGCATAACA
• DELETION / INSERTION: CTAAATACGCATAACA
• INVERSION: CTAAATACACAATACGAG
• INVERSION: CTAAATACTGTTATGCTC
Legend: Matches between same strand; Matches between opposite strand
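To make the two kinds of matches concrete, here is a minimal sketch of how a dot plot could be computed by exact word matching. It is a teaching illustration only; LAST uses a far more sensitive seeding and scoring scheme. The function names and the default word size of 4 are assumptions chosen for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def word_matches(x, y, word=4):
    """Return (i, j) pairs where x[i:i+word] equals y[j:j+word]."""
    index = {}
    for j in range(len(y) - word + 1):
        index.setdefault(y[j:j + word], []).append(j)
    return [(i, j)
            for i in range(len(x) - word + 1)
            for j in index.get(x[i:i + word], [])]


def dot_plot(x, y, word=4):
    """Return (same_strand, opposite_strand) lists of dot coordinates."""
    same = word_matches(x, y, word)
    rc = reverse_complement(y)
    # A hit starting at j in the reverse complement covers
    # y[len(y) - j - word : len(y) - j] on the original strand.
    opposite = [(i, len(y) - j - word) for i, j in word_matches(x, rc, word)]
    return same, opposite


reference = "CTAAATACGAGCATAACA"
inverted = "CTAAATACTGTTATGCTC"  # last 10 bases are reverse-complemented
same, opposite = dot_plot(reference, inverted)
```

Plotting `same` and `opposite` in two colors gives the pattern from the slides: a diagonal for the shared start of the two sequences and an anti-diagonal for the inverted region.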
Evaluating the 18S alignment
• Look at your dot plots first. What is wrong with the sequences?
• How would you fix/prevent this problem?
Evaluating Sites in an Alignment
• Bootstrapping - Guidance
• Identify regions with strong support - Gblocks
Gblocks
9 W residues, 6 I residues, 8 F residues
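Residue counts like these are what a conservation filter works from. The sketch below counts the most common residue in each column and labels the column by how dominant that residue is. It is a simplified illustration of the idea, not Gblocks itself: the function name, the labels, and the 0.5 and 0.85 thresholds are assumptions for this example, and the real program also considers block lengths and flanking positions.

```python
from collections import Counter


def classify_columns(msa, conserved=0.5, highly=0.85):
    """Label each alignment column by the share of its most common residue."""
    n_seqs = len(msa)
    labels = []
    for column in zip(*msa):  # iterate over columns of the alignment
        if "-" in column:
            labels.append("gap")
            continue
        top_count = Counter(column).most_common(1)[0][1]
        share = top_count / n_seqs
        if share >= highly:
            labels.append("highly conserved")
        elif share > conserved:
            labels.append("conserved")
        else:
            labels.append("nonconserved")
    return labels


example = ["WIF-", "WIFA", "WLYA", "WIFC"]
labels = classify_columns(example)
```

A filter would then keep runs of conserved columns and drop the rest.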
Bootstrapping
The scores across the bottom, scaled between 0 and 1, report the proportion of alignments that agree with the assignment of nucleotides in the original MSA.
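One way to picture such a score is shown in the sketch below. For every column of the original MSA it reports the fraction of alternative alignments that contain exactly the same column, where a column is identified by which residue of each sequence it holds. This is a simplified column score written for illustration. It is not the Guidance implementation, and the function names are hypothetical.

```python
def column_signatures(msa):
    """Describe each column by the residue index it holds per sequence."""
    next_residue = [0] * len(msa)
    signatures = []
    for col in range(len(msa[0])):
        signature = []
        for row, seq in enumerate(msa):
            if seq[col] == "-":
                signature.append(None)  # gap: no residue in this column
            else:
                signature.append(next_residue[row])
                next_residue[row] += 1
        signatures.append(tuple(signature))
    return signatures


def column_support(original, alternatives):
    """Fraction of alternative MSAs that reproduce each original column."""
    alt_sets = [set(column_signatures(alt)) for alt in alternatives]
    return [sum(sig in alt for alt in alt_sets) / len(alt_sets)
            for sig in column_signatures(original)]


original = ["AC-GT", "ACAGT"]
alternatives = [["AC-GT", "ACAGT"], ["ACG-T", "ACAGT"]]
support = column_support(original, alternatives)
```

Here the two middle columns are reproduced by only one of the two alternative alignments, so they score 0.5 while the other columns score 1.0.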
Try The Data You Downloaded
• Make an alignment
• Check the dot plots
• Use Gblocks to remove uncertain sites
  – How many sites in initial alignment?
  – How many sites in filtered alignment?
  – Did you lose any taxa?
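The three questions above can be answered with a few lines of Python. The sketch below assumes both alignments are in FASTA format and treats a taxon as lost if it is missing from the filtered file or is left with nothing but gaps. Spaces inside sequence lines are removed, since some filtered output is written in space-separated blocks. The function names are hypothetical.

```python
def parse_fasta(lines):
    """Parse FASTA lines into a dict of name -> sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID, drop any description
            records[name] = []
        elif name is not None:
            records[name].append(line.replace(" ", ""))
    return {key: "".join(parts) for key, parts in records.items()}


def site_count(records):
    """Return the alignment length, checking all rows agree."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    return lengths.pop() if lengths else 0


def lost_taxa(initial, filtered):
    """Taxa absent from the filtered alignment or reduced to gaps only."""
    missing = set(initial) - set(filtered)
    gap_only = {name for name, seq in filtered.items() if not seq.strip("-")}
    return sorted(missing | gap_only)


initial = parse_fasta([">a", "AC-GT", ">b", "ACAGT", ">c", "--A--"])
filtered = parse_fasta([">a", "AC GT", ">b", "AC GT", ">c", "-- --"])
```

With real files, pass an open file handle to `parse_fasta`, for example `parse_fasta(open("18S_aligned.fa"))`, where the file name is only an illustration.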
• Treat your alignment as a model parameter!
• BAli-Phy: Estimates phylogenetic trees across all possible alignments without conditioning on a single alignment being “true”
• Thanks for listening to me!