Post on 11-Jan-2016
Phylogenomics
Phylogenetics
Phylogenomics
reconstruction of phyletic relationships based on the analysis of:
- several (to several dozens) genes
- complete genetic information (ideal)
- several dozens to hundreds of coding sequences (phylotranscriptomics)
Why?
a vast amount of genetic information should significantly improve the prediction of phylogenetic relationships and eliminate signal noise
... and sometimes it really works
Adl et al., 2012
... but sometimes it doesn’t
possible sources of error:
- incorrect sequence annotation
- paralogues
- sins of the past: L/HGT
- sins of the past: EGT
[figure labels: OCT, ATC; “LEUCA”, ANIMALS/FUNGI, PLANTS, RHODOPHYTES, EGT]
18(16)S rRNA
+
- combination of variable and conserved regions
- zero L/HGT
- exhaustive taxon sampling
- known secondary structure
- hundreds of copies per cell - single-cell PCR
- cost per nt + speed
- ‘18S is always right’
-
- ~1800bp
- intraindividual paralogues
- lower branching support
MULTI-PROTEIN DATASETS
+
- large amount of information
- modular
- robust branching support (although often false)
-
- limited sampling
- variable quality of phylogenetic signal
- L/H(E)GT
- still costly and slow
- HW demanding analysis
- stability of topologies (or lack thereof)
DATABASE → PURIFIED DATABASE → HOMOLOGUES → DATASETS → MSA → SGP → CONCATENATION → MGP
(the bracketed part of the workflow is repeated n times)
- lots of redundancy in dbs (duplicates, close paralogues...)
- usually it is better to get rid of them
sequence clustering
+ speed, relatively HW friendly, accuracy
- accuracy, black-box
CD-HIT
USEARCH
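As a minimal illustration of the redundancy problem, the sketch below removes only exact duplicate sequences from FASTA text. It is not what CD-HIT or USEARCH do (they cluster by similarity threshold); the function names `parse_fasta` and `drop_exact_duplicates` and the toy records are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record seen for every distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        kept.append((header, seq))
    return kept


# Hypothetical toy database: seq3 duplicates seq1 (different case only).
example = """>seq1
MKNRIVVALGG
>seq2
MKLVYTVASAF
>seq3
mknrivvalgg
"""

unique = drop_exact_duplicates(parse_fasta(example))
for header, seq in unique:
    print(f">{header}\n{seq}")
```

Near-identical sequences and close paralogues survive this filter, which is exactly why similarity-based clustering tools are used on real databases.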
DB editing
FASTA – universal and simple, but header formats are not unified
NCBI:
>gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAMSQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVVASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINYGKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK
JGI:
>jgi|Dappu1|290510|JCO_fgenesh1_kg.C_scaffold_4000019
MKLVYTVASAFLVVLIAQSAYASEKLSAQDYAYNSTCLNHLRSHIKRELQAAVTYLAMGAWANHYSVQRPGLANFFFDSASEEREHGLKLLGYLRMRGHNDLDILPSSLEPLNGKYEWENSLSALRQALKMEKDVTESIKKIIDYCADAEDHQLADYLTGDFMEEQLKGQRNVAGLANTLQGVLRKQPRLGEWIFDNNLSKSMAV
manual editing is fine for several sequences, but for several thousands?
GBs of RAM
robust OS and text editor
!Regular expressions!
>gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAMSQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVVASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINYGKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK
Find: >\w+\|\d+\|\w+\|(\w+).*\[(\w+\s\w+).*
Replace: >\2_\1
>Sebaldella termitidis_YP_003308454
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAMSQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVVASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINYGKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK
extremely powerful, easy to learn, fun to use:
!Regular expressions!
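The same find/replace can be scripted when a text editor is not enough. This is a small sketch using Python's `re` module with the pattern from the slide; the helper name `rename_header` is an assumption, and headers that do not fit the NCBI-style pattern (for example the JGI one) are simply left untouched.

```python
import re

# Pattern from the slide: capture the accession (group 1) and the first
# two words of the organism name inside the square brackets (group 2).
NCBI_HEADER = re.compile(r">\w+\|\d+\|\w+\|(\w+).*\[(\w+\s\w+).*")


def rename_header(line: str) -> str:
    """Rewrite an NCBI-style FASTA header as '>Genus species_ACCESSION'."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    return NCBI_HEADER.sub(r">\2_\1", line)


ncbi = ">gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]"
jgi = ">jgi|Dappu1|290510|JCO_fgenesh1_kg.C_scaffold_4000019"

print(rename_header(ncbi))  # >Sebaldella termitidis_YP_003308454
print(rename_header(jgi))   # unchanged, the pattern does not match
```

Applied line by line to a whole FASTA file, this renames thousands of records in one pass; each database format needs its own pattern.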
BLAST vs annotation
BLAST (plus relatives) is the only reliable way to identify homologues, do not rely on annotation!
the more the better
beware of close paralogues! Meticulous SGP analysis is necessary
possible source of error: paralogues [figure labels: OCT, ATC]
commercial vs. free
- both (shiny GUI/command-line scripts) will get you there relatively fast and easily, but... beware of possible errors, there is no universal solution
Multiple alignment
- an important and necessary step in the identification and definition of DNA or protein domains, oligonucleotide design, phylogenetic analyses...
- most of the modern algorithms are iterative (can self-improve during the iterations) and work reasonably well (really, don’t use Clustal unless you have to); some of the most used are: MAFFT, MUSCLE, Kalign, ProbCons (none of them miraculous, each makes mistakes, but it’s not that bad)
- all of the above are accessible on-line or can be run locally... nevertheless, you’ll need an alignment viewer/editor to visualize the results
- several free options (depending on what OS you use):
MS Windows: BioEdit - ‘the living legend’, extensive features, user-friendly, can import from GenBank, align (also translation alignment, although with ), edit, annotate, translate, do phylogeny...
Mac: MacClade - great editing features and then some more, user-friendly, but doesn’t align, nor does phylogenies; currently works only up to OS X 10.6, not (Mountain) Lion
Multi-platform:
MEGA – good for alignment, phylogenetic and molecular evolution analyses
Jalview – excellent for proteomics, passable alignment editor
SeaView – great aligner/editor (although it takes time to get used to it), excellent features for phylogenetics (inclusion sets, translation alignment), but there’s no UNDO button!
... and then again, if you have access to/can afford Geneious (student licenses are cheap), you can skip everything listed above
Editing
- remember: the tree is only as good as the alignment; crap-in-crap-out!
- the goal is to keep only unambiguously aligned regions and relevant OTUs (remove duplicates or long-branchers)
site selection: AUTOMATED vs. MANUAL
automated: good as a starting point, reproducible, ‘objective’, transparent, but... crude
manual: subjective, often non-reproducible, needs ‘expertise’, but... better (usually), can be fine-tuned to each respective dataset
Example - SeaView
open the dataset (in this case apicomplexa_ssu1.fas) and align it... you already know how, right?
Some regions are conserved (i.e., they show little divergence), so there’s little doubt about the correctness of the alignment. They should be kept for analysis as they carry vital information.
On the other hand, some are pretty variable and could be aligned in several ways. Because we cannot be sure the information they contain is correct, we should exclude these prior to analysis in order not to introduce error (remember, crap-in-crap-out).
In some situations, especially when you’re new to the problem, it is not so clear which parts of the alignment should be kept and which excluded from analysis. Gblocks (or similar SW) can help you. Luckily, it is also implemented in SeaView; as it tends to remove too much, let’s keep the parameters the least strict.
regions with X are kept, those with dashes are excluded from the selection
you can edit the selection afterwards and save it using Files-Save selection
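To make the X/dash selection line concrete, here is a deliberately crude sketch of automated site selection: a column is kept when its gap fraction does not exceed a threshold. This is not the Gblocks algorithm (which also considers conservation and block length); `column_mask`, `apply_mask`, the 0.5 threshold and the toy alignment are all assumptions for illustration.

```python
from typing import List


def column_mask(alignment: List[str], max_gap_fraction: float = 0.5) -> str:
    """Return a mask string: 'X' for kept columns, '-' for excluded ones."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    mask = []
    for col in range(length):
        gaps = sum(1 for seq in alignment if seq[col] == "-")
        keep = gaps / len(alignment) <= max_gap_fraction
        mask.append("X" if keep else "-")
    return "".join(mask)


def apply_mask(alignment: List[str], mask: str) -> List[str]:
    """Keep only the columns marked 'X' in the mask."""
    return ["".join(c for c, m in zip(seq, mask) if m == "X") for seq in alignment]


# Hypothetical toy alignment; column 5 is mostly gaps.
aln = ["ACGT-ACGT",
       "ACG--ACGT",
       "ACGTTAC-T",
       "AC---ACGT"]

mask = column_mask(aln)
trimmed = apply_mask(aln, mask)
print(mask)      # XXXX-XXXX
print(trimmed)
```

Because the mask is a plain string, it can be inspected and edited by hand before it is applied, which mirrors the "automated as a starting point, manual fine-tuning afterwards" workflow.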
you can also directly perform phylogenetic analysis by clicking on Trees
you can choose from three different methods; PhyML represents maximum likelihood
the default settings are a reasonable compromise between speed and precision, so you can leave them as they are
for publication, you will also have to assess branching support
and you may want to use a more thorough tree-search algorithm (check ‘Best of NNI and SPR’)
then hit Run ... and wait ... the time depends on the method and the size of the dataset (obviously, the bigger the longer).
[tree figure labels]
INGROUP
OUTGROUP (root)
branch scale bar (substitutions per site – the longer the branch, the more divergent the sequence)
node (represents the hypothetical ancestor of all taxa/branches stemming from the node, also defines a clade)
clade (group of sequences sharing a common ancestor/stemming from a single node)
sister taxa (two taxa forming a clade)
sister clades
SeaView also includes a very decent tree viewer/editor
you can also create several subsets of the alignment (inclusion sets) by clicking Sites-Create set
and giving the set a name
the parts of the sequences above the X marks (highlighted) are included in the selection. You can select the sites by a combination of right- and left-clicks (a single left-click selects or unselects a single site, a right-click removes the selection between two unselected regions, and by holding the left button and moving the mouse you can re-select whole regions). I know, it sounds awkward... TRY TO PRACTICE IT!
you can then duplicate and rename the set to create different inclusion sets, and save just the selection, not the whole alignment. This feature can be extremely useful in phylogenies and sets SeaView apart from the other alignment editors (we will get to it next time)
Coding sequences should be aligned in ‘translation’ mode: temporarily translated into amino acids, aligned as amino acids, and back-translated into nucleotides while keeping the alignment positions
in SeaView click Props-View as proteins
align the sequences, then uncheck View as proteins
now the sequences are aligned according to the ORF
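The back-translation step of a translation alignment can be sketched as threading each unaligned coding sequence onto its aligned protein, three nucleotides per residue and three dashes per gap. The sketch assumes the protein alignment was produced elsewhere, that each CDS is in frame and has no stop codon; `back_translate` and the toy sequences are hypothetical.

```python
from typing import Dict


def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(codons) != residues:
        raise ValueError("number of codons does not match number of residues")
    codon_iter = iter(codons)
    # A gap in the protein alignment becomes a three-nucleotide gap.
    return "".join("---" if aa == "-" else next(codon_iter) for aa in aligned_protein)


# Hypothetical toy data: protein alignment plus the matching unaligned CDSs.
aligned_proteins: Dict[str, str] = {"taxonA": "MK-L", "taxonB": "MKVL"}
coding_seqs: Dict[str, str] = {"taxonA": "ATGAAACTG", "taxonB": "ATGAAAGTTCTG"}

codon_alignment = {
    name: back_translate(aligned_proteins[name], coding_seqs[name])
    for name in aligned_proteins
}
for name, seq in codon_alignment.items():
    print(name, seq)
```

Gaps always fall between codons, so the reading frame is preserved throughout the nucleotide alignment.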
Phylogenetic inference
You don’t have to use state-of-the-art phylogenetic methods for the initial analyses, whose purpose is to (quickly) identify redundancy (duplicates and very similar sequences), aberrant and very divergent sequences, or the need to extend the dataset (quite often you realize you should’ve added some other taxa). For that, a simple neighbour-joining tree based on the J-C, K2P or HKY model, or a stripped-down maximum likelihood run (without gamma categories and branching support), will suffice and do the job quickly even on older computers.
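For a feel of what the simplest of these models does, the sketch below computes the observed proportion of differing sites p between two aligned nucleotide sequences and the Jukes-Cantor corrected distance d = -3/4 ln(1 - 4p/3). The function names and the two toy sequences are assumptions; sites with gaps or ambiguous characters are skipped, which is one of several possible conventions.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites among positions where both have A/C/G/T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (equal length)")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance in expected substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return float("inf")  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical toy sequences: 1 difference over 8 comparable sites.
a = "ACGTACGT"
b = "ACGTACGA"
print(p_distance(a, b))
print(round(jc69_distance(a, b), 4))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site; a matrix of such pairwise distances is the input to neighbour joining.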
On the other hand, for publication (or if you want to be sure), once you’ve polished your dataset, you should use the best (possible) methods. That usually means maximum likelihood with gamma correction and the GTR (nucleotides) or LG or WAG (amino acids) substitution matrices (or models of evolution, if you wish... these matrices tell the computer how probable a change from one state to another is). But it all depends on the dataset... if the sequences are similar and/or there are just a few of them, it may be preferable to use simpler matrices/models. There are also models dedicated to organellar genomes and/or specific taxonomic groups (like mtArt, which is tailored for the analysis of mitochondrial genes of arthropods). There are programs that tell you which model suits your dataset best (for example jModelTest for nucleotides and ProtTest for amino acids, the latter also available as a server).
The credibility of the topology should be ‘tested’ using (non-parametric) bootstrap analysis, during which the software creates pseudoreplicates by randomly resampling alignment sites with replacement (all taxa are included) and infers the topology from these replicates instead of the original dataset. For publication, 100 replicates are the bare minimum; a reviewer will probably require 300 or more, though. If the analysis is meant just for you (or your boss), 100 is plenty (in my opinion); alternatively, you can use an even faster method called the ‘approximate Likelihood-Ratio test’ (aLRT, implemented in some software).
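The resampling step of the non-parametric bootstrap can be sketched as follows: alignment columns are drawn with replacement until the replicate is as long as the original alignment, and every taxon is kept. Tree inference on each replicate is left out; `bootstrap_replicate`, the seed and the toy alignment are assumptions for illustration.

```python
import random
from typing import Dict


def bootstrap_replicate(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement, keeping all taxa."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    # The same column indices are used for every taxon, so columns stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Hypothetical toy alignment.
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "ATGTAC"}
rng = random.Random(42)  # fixed seed only to make the sketch reproducible

replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be analysed with the same method as the original dataset, and the support of a branch is the percentage of replicate trees in which it appears.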
Nowadays, most reviewers/editors will also require another type of phylogenetic analysis called Bayesian inference. Here, you use the same (or similar) models, but the method of topology search is totally different; also, the branching support is expressed as a posterior probability (ranging from 0 to 1) instead of bootstrap values. Be careful with the interpretation of these two values. In bootstrap, everything higher than 50 (meaning the branching appeared in at least 50% of the replicates) is considered to be supported (although weakly); the closer you get to 100, the more confident you can be in the branching. OTOH, for posterior probability, anything below 0.95 (some go to 0.90) should be considered unsupported! Only nodes with a PP value of 1.0 (or 0.99) are considered to be strongly supported.
Phylogenetic inference - software
A surprisingly large amount of software is available (given the obscurity of the topic; almost-exhaustive lists can be found online), but most programs are too specialized, slow, obsolete, or not worth using for other reasons. Unfortunately, most (like 99%) are command-line based, without any user-friendly graphical interface. But some of the good/passable ones are implemented in SW with a GUI (like SeaView or Geneious) or at least have a server version. So, here is a short list of recommended phylogenetic software:
Ambiguous regions detection/removal: several SW, but nothing exciting, try Aliscore or Gblocks (server)
Distance methods: PAUP (commercial), Phylip, BioNJ
Maximum Parsimony: PAUP (commercial), Phylip
Maximum likelihood: RAxML (server), PhyML (server), FastTree (REALLY fast, great for preliminary analyses), GARLI
Bayesian Inference: MrBayes, PhyloBayes
Tree Viewer/Editor: NJplot (improved version also implemented in SeaView), FigTree, TreeView
this list is far from exhaustive, but the SW noted above should suit a general audience (like you) in terms of purpose and performance.
meticulous analysis of SGP is necessary!!!
you could also use the automated approach (Phylosorter), but the risk of error is quite significant and the parameters should be as strict as possible
‘clean’ datasets can be merged (concatenated) into a supermatrix
Scafos, phyutility, SeaView, MacClade, Bioedit?...
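Concatenation itself is simple bookkeeping, as the sketch below shows: single-gene alignments are joined end to end, and a taxon missing from a gene is padded with gap characters so that all rows of the supermatrix keep the same length. This is not how any of the tools named above are implemented; `concatenate`, the partition naming and the toy alignments are assumptions.

```python
from typing import Dict, List, Tuple


def concatenate(
    gene_alignments: List[Dict[str, str]], missing: str = "-"
) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Merge single-gene alignments into a supermatrix plus partition ranges."""
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    rows: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Tuple[str, int, int]] = []
    start = 1
    for index, aln in enumerate(gene_alignments, start=1):
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError(f"gene {index} is not a proper alignment")
        for taxon in taxa:
            # Taxa without this gene get a block of missing-data characters.
            rows[taxon].append(aln.get(taxon, missing * length))
        partitions.append((f"gene{index}", start, start + length - 1))
        start += length
    return {taxon: "".join(parts) for taxon, parts in rows.items()}, partitions


# Hypothetical toy data: taxon B lacks gene 2, taxon C lacks gene 1.
gene1 = {"A": "MKV", "B": "MRV"}
gene2 = {"A": "LL-P", "C": "LLAP"}

supermatrix, partitions = concatenate([gene1, gene2])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
print(partitions)
```

Keeping the partition coordinates (1-based, inclusive here) makes it possible to assign separate models to individual genes later or to drop a gene again without rebuilding everything.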
- both SW and HW demanding
- due to the amount of data, the most complex models are necessary; prone to errors and time consuming
+ SHOULD produce robust results
Multi-Gene Phylogenies
phylogenetic artifacts
why?
- poor taxon sampling
- too weak/strong phylogenetic signal
- violation of the model assumptions (different base composition, mutation rates...)
- inappropriate model used
Long-Branch Attraction (LBA)
- the most (in)famous and common artifact
- high evolutionary rates cause artificial grouping of long-branching taxa
Artifacts elimination
- adding more genes (adding genes to MGP)
  - [figure labels: 2012 - 258, 2009 - 127, 2008 - 135; same author – different datasets]
- adding more taxa
  - poor taxon sampling is considered to be the most common reason
  - ideally, all taxa should be included
  - reasonably, all relevant and available taxa should be included
  - realistically, we have to work with the few available
- removal of problematic (fast-evolving) taxa
  - analysis of the dataset with different combinations of taxa and comparison of the resulting topologies
  - efficient way to overcome the LBA
- improving methodology
  - current HW and SW enable application of state-of-the-art models
  - LG4M, LG4X (RAxML)
  - CAT(+GTR): each position of the alignment has specific equilibrium and model parameters
  - covarion, non-homogeneous: each taxon has a specific rate of evolution
  - HW and time demanding!
- removal of fast-evolving genes
  - simple and fast way to reduce signal noise
  - for each gene, we compute the overall ML distance and remove the most divergent genes
  - TREEPUZZLE, RAxML
- removal of fast-evolving sites
  - usually more efficient
  - each site of the alignment is assigned to a specific rate category (usually 8/16)
  - the highest category(ies) are removed
  - dependent on topology/model
  - TREEPUZZLE, AIRremover
- recoding of aa
  - for datasets with a large proportion of saturated sites
  - amino acids are recoded according to their biochemical properties into four categories (Dayhoff matrix)
- selection of genes with congruent signal
  - clever, but is it kosher? ... doesn’t work that well anyway
  - concaterpillar
Phylogenomics is (not) surprisingly hard to publish; usually you have to combine at least a few of the approaches above to satisfy the reviewers!
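The amino-acid recoding into four categories mentioned above can be sketched in a few lines. The grouping used here (AGPST, DENQ, HKR, FYWILMV, with cysteine treated as missing) is one four-state Dayhoff-style scheme and is an assumption: the slides do not say which grouping was meant, and other four-state groupings are in use. The state labels 0-3 and the function names are also made up for this example.

```python
from typing import Dict

# ASSUMPTION: one common four-state Dayhoff-style grouping; C is left out
# and therefore recoded as missing data.
DAYHOFF4_GROUPS: Dict[str, str] = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "FYWILMV",
}


def build_table(groups: Dict[str, str]) -> Dict[str, str]:
    """Invert {state: members} into {amino acid: state}."""
    return {aa: state for state, members in groups.items() for aa in members}


def recode(sequence: str, table: Dict[str, str], unknown: str = "?") -> str:
    """Recode an aligned protein sequence; gaps are kept, anything else unknown."""
    return "".join(
        char if char == "-" else table.get(char.upper(), unknown)
        for char in sequence
    )


table = build_table(DAYHOFF4_GROUPS)
print(recode("MKV-CAD", table))  # 323-?01
```

Substitutions within a group become invisible after recoding, which is the point: the fastest, most saturated changes are collapsed, at the price of discarding part of the signal.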
So... is it worth it when quite often you get the same topology as with SSU rRNA?