Phylogenetic Analysis

What do I need to do?

Phylogenetics
- Get related sequences of interest
- Perform multiple sequence alignments
- Edit alignment
- Estimate phylogenetic relationships
- Interpret results correctly

Get related sequences of interest

So you have a sequence, now what?

MKILLLCIIFLYYVNAFKNTQKDGVSLQILK
KKRSNQVNFLNRKNDYNLIKNKNPSSSLK
STFDDIKKIISKQLSVEEDKIQMNSNFTKDL
GADSLDLVELIMALEEKFNVTISDQDALKI
NTVQDAIDYIEKNNKQ

#1: What is it?
Does the source organism have its own genome database?
- Unknown/No: PubMed
- Yes: genome database (GeneDB, PlasmoDB, etc.)

Why start with a genome-specific database?
- BLAST
- Expression data
- Strain variability
- Genome location/structure
- Pathway data

[Screenshot slides: PubMed BLAST; Blastp; Protein families; Conserved Domains; BLAST Hits; Downloading sequences (FASTA format); Getting sequences (FASTA format); Saving and editing FASTA files]

Perform multiple sequence alignments

Pair-wise sequence alignment

Global:
GYTSLLLSRQNED--G
G--SLLLSHK-D-HTG

Local:
TSLLLSR
TSLLLSH

Overlap:
GYTSLLLSRQNEDG--
--GSLLLSHK-D-HTG

Aligning 2 sequences globally
Slide label: Smith-Waterman (the local-alignment algorithm; its global dynamic-programming counterpart is Needleman-Wunsch).
[Figure: dynamic-programming matrix with - Y T S L L L S R Q along one axis and - Y A S L L W R Q A along the other]

Unaligned:
YTSLLLSRQ
YASLLWRQA

Aligned:
YTSLLLSRQ-
YASLLW-RQA

Multiple sequence alignment: Progressive
- Align the 2 closest sequences:
YTSLLLSRQ-
YASLLW-RQA
- Add in the next closest sequence:
YTSLLLSRQ-
YASLLW-RQA
PASIILSRQA
- Continue adding. The result is highly dependent on the initial matches.
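The global pairwise alignment that a progressive method starts from can be sketched with dynamic programming. This is a minimal Needleman-Wunsch-style illustration, not what ClustalX or any other program actually runs: the function name `needleman_wunsch` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for readability, where real tools use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not program defaults.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("YTSLLLSRQ", "YASLLWRQA")
    print(total)
    print(top)
    print(bottom)
```

Several alignments can share the best score, so the gap placement printed here may differ from the one on the slide; which one is reported depends on the tie-breaking order in the traceback.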
Progressive alignment of all four sequences:
YTSLLLSRQ-
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM

Multiple sequence alignment: Iterative
- Start from an initial MSA with a low score
- Optimize the MSA score
- Probabilistic methods don't always generate the same answer

Initial MSA:
YTTSLLLSRQ--
YATSLLW-RQ-A
PA-SIILSRQ-A
GRTSIVLTRQMA

Optimized MSA:
YTTSLLLSRQ--
YATSLLWRQA--
PASIILSRQA--
GRTSIVLTRQMA

Multiple sequence alignment programs
[Figure: programs classified by pair-wise alignment type (local or global) and MSA alignment type (progressive or iterative): ClustalX, T-Coffee, HMMs, GAs, Dialign, POA]

Multiple Sequence Alignments
- CLUSTAL: progressive, global
- POAVIZ: progressive, local

[Screenshot slides: POAVIZ; CLUSTALX Parameters; CLUSTALX]

CLUSTALX Protein Weight Matrices
1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches.
2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
3) GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger dataset.
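The "MSA score" that an iterative method tries to improve needs a concrete definition. One common choice is the sum-of-pairs score, sketched below. The function name `sum_of_pairs` and the scores (match +1, mismatch -1, gap -1, gap against gap 0) are assumptions for illustration; real programs typically score residue pairs with one of the weight matrices listed above.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-1):
    """Score an MSA by summing a simple pair score over every column.

    alignment is a list of equal-length strings that use '-' for gaps.
    The scoring values are illustrative assumptions.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # two gaps facing each other contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    # The two four-sequence alignments from the iterative slide.
    initial = ["YTTSLLLSRQ--", "YATSLLW-RQ-A", "PA-SIILSRQ-A", "GRTSIVLTRQMA"]
    optimized = ["YTTSLLLSRQ--", "YATSLLWRQA--", "PASIILSRQA--", "GRTSIVLTRQMA"]
    print(sum_of_pairs(initial), sum_of_pairs(optimized))
```

An iterative aligner would repeatedly propose changes to the alignment and keep those that raise a score of this kind; the ranking of two alignments can change with the scoring scheme.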
BLOSUM (BLOcks SUbstitution Matrix)
[Figure: scale running from >99% identity to >62% identity, ending at BLOSUM62]
BLOSUM62
- Substitution rates are counted directly from aligned blocks of related proteins, with sequences at least 62% identical clustered together
- Pros: best bet for distantly divergent sequences

PAM (point accepted mutation)
(# point mutations / 100 amino acids)
[Figure: scale running from 99% identity (PAM1) to 20% identity (PAM250)]
- Gather the substitution rates for PAM1 (99% identical sequences)
- Assuming that those substitution rates are consistent over time, extrapolate them to more divergent sequences (up to PAM250)
- Pros: very good for closely related sequences
- Cons: rare mutations are under-represented; substitution rates are not constant over time (both are problems for phylogenetic estimation)

[Screenshot slides: CLUSTALX; CLUSTALX - Aligning; CLUSTALX Alignment view; POAVIZ; CLUSTAL vs POAVIZ (global vs local)]

Edit alignment

BioEdit Alignment manipulation
- Open the .aln file.
- Select Edit from the mode dropdown. The back-colored view gives more contrast.
- Select Insert so that you don't accidentally lose part of your sequence.
- Then select the unaligned beginning (or end) sequence and delete it.
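The trimming step described for BioEdit, deleting the unaligned beginning or end of the alignment, can also be done in a script. The sketch below is one possible rule and not BioEdit's behavior: it drops leading and trailing columns until it reaches a column where every sequence has a residue, and it leaves interior gaps alone. The function name `trim_ragged_ends` and the dictionary input are assumptions for illustration.

```python
def trim_ragged_ends(alignment):
    """Remove leading/trailing alignment columns that contain any gap.

    alignment maps sequence names to equal-length aligned strings ('-' = gap).
    Interior columns are kept even if they contain gaps.
    """
    seqs = list(alignment.values())
    if not seqs:
        raise ValueError("alignment is empty")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    # full[i] is True when no sequence has a gap in column i.
    full = [all(s[i] != "-" for s in seqs) for i in range(length)]
    if True not in full:
        return {name: "" for name in alignment}
    start = full.index(True)
    end = length - full[::-1].index(True)
    return {name: s[start:end] for name, s in alignment.items()}


if __name__ == "__main__":
    example = {"seq_a": "--ACG-T-", "seq_b": "TTACGGTA"}
    for name, seq in trim_ragged_ends(example).items():
        print(name, seq)
```

Whether ragged ends should be trimmed this aggressively is a judgment call; inspecting the alignment by eye, as the slides do, remains a reasonable check on any automated rule.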
BioEdit Alignment manipulation
- Now save the result under a different name as a .fasta file.

Estimate phylogenetic relationships

Tree terminology
- lineage (branch, edge)
- root
- outgroup
- common ancestor (node, branch point)
- operational taxonomic units (OTUs, leaves)
- branch length
[Figure: example tree with leaves A, B, C, D, E, F, G]

[Figure: Topology 1, Topology 2 and Topology 3, three arrangements of taxa A to G illustrating monophyletic, paraphyletic and polyphyletic groups]

Sequence homology: orthologues and paralogues
[Figure: an ancestral gene is duplicated into genes A and B; speciation at the last common ancestor then yields Human A, Human B, Rat A and Rat B]
- Orthologues are separated by speciation (for example Human A and Rat A).
- Paralogues are separated by duplication (for example Human A and Human B).

Methods of estimating phylogenetic relationships
- Character-based: Maximum Parsimony (MP)
- Distance-based: Neighbor-Joining (NJ), Minimum Evolution (ME)
- Probabilistic: Maximum likelihood (ML), Bayesian inference

Maximum Parsimony (MP)
Taxa1 AAG
Taxa2 AAA
Taxa3 GGA
Taxa4 AGA
[Figure: alternative trees for the four taxa with inferred ancestral sequences; the best tree requires 3 changes, the alternative requires 4 changes]

Distance-based methods

Neighbor-Joining (NJ) Method
- The NJ method involves clustering of neighbor species that are joined by one node.
- It does not evaluate all the possible tree topologies.
- It is not guaranteed to obtain the optimal tree.

Minimum Evolution (ME) Method
- Estimates the total branch length of each topology exhaustively, then chooses the topology with the least total branch length.
- Time intensive for large numbers of taxa.
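The change counts in the maximum parsimony example (Taxa1 AAG, Taxa2 AAA, Taxa3 GGA, Taxa4 AGA) can be reproduced with a small sketch of Fitch's counting procedure. The nested-tuple tree encoding and the function names `fitch_site` and `parsimony_length` are assumptions for illustration; the slides do not say which algorithm produced their counts, and each unrooted four-taxon tree is rooted arbitrarily on its internal branch here, which does not affect the count.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one alignment column.

    tree is either a leaf name (str) or a (left, right) tuple of subtrees.
    states maps each leaf name to its character at this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no extra change is needed.
        return common, left_cost + right_cost
    # The children disagree: count one change at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, sequences):
    """Sum the minimum number of changes over all columns."""
    n_sites = len(next(iter(sequences.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in sequences.items()}
        total += fitch_site(tree, column)[1]
    return total


if __name__ == "__main__":
    seqs = {"Taxa1": "AAG", "Taxa2": "AAA", "Taxa3": "GGA", "Taxa4": "AGA"}
    topologies = [
        (("Taxa1", "Taxa2"), ("Taxa3", "Taxa4")),
        (("Taxa1", "Taxa3"), ("Taxa2", "Taxa4")),
        (("Taxa1", "Taxa4"), ("Taxa2", "Taxa3")),
    ]
    for topology in topologies:
        print(topology, parsimony_length(topology, seqs))
    # The first topology needs 3 changes; the other two need 4 each.
```

With only four taxa there are three unrooted topologies, so all of them can be scored; the number of topologies grows very quickly with more taxa, which is why exhaustive searches become expensive.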
Probabilistic methods

Maximum likelihood (ML)
- Prob ( data | model + tree )
- Search all possible topologies to optimize the probability
- The more likely topology is found

Bayesian inference
- Prior information
- Model for selection

need both for everyone in the class

Estimating Phylogenetic Relationships
- MEGA
- MrBayes

MEGA: Molecular Evolutionary Genetics Analysis

MEGA: making a MEGA formatted file
- First we have to get a MEGA formatted file made.
- Select All Files [ ] from the dropdown Files of Type menu, then choose the .aln file you just made.
- MEGA recognizes that you didn't enter a MEGA formatted file. Click OK.
- Now click on the Convert to MEGA format button at the top left-hand side of the screen.
- Now we have to make sure that the file looks good before starting any analysis. Make sure that the file is the right one and that the formatting is correct. Click OK.
- Make sure all sequences are the same length.
- Remove all traces of the consensus marks.
- When the file looks good, save it and close both text formatter windows.

MEGA: input a MEGA formatted file
- Now try Activating the data file again, this time with the .meg file you just made.
- Make sure that the correct sequence type is selected.
- Make sure that the correct characters are selected for missing data and gaps.
- You should now see the sequence data explorer. Minimize this window and you can begin analyzing your data.

MEGA: choose an algorithm
- From the phylogeny window you can choose an appropriate algorithm. In this case we'll use Minimum Evolution.
MEGA: set parameters
- There are two major things to think about first: Model and Rates among Sites.
- In this example, I'll use the Poisson model with gamma (parameter = 2.0) rate variation.

Substitution models (nucleic acid)
- Identity
- Substitution rates
  - Base frequencies (B): variable (V) or equal (E)
  - Transition (si) and/or transversion (sv) frequencies: variable (V) or equal (E)
  - Symmetrical substitution (G->A = A->G)
- Rate variation across sites
  - Gamma (Γ) distribution of rate variation among sites
  - Proportion of Invariable Sites (I)

Examples as given on the slide:
- Kimura 2-parameter: B(E), si(V), sv(V)
- Tamura-Nei: B(V), si(V), sv(E)
- Kimura 3-parameter: B(V), si(E), sv(V)
- General Time Reversible (GTR): B(V), Sym
- Γ + I + GTR

Substitution models (amino acid)
Models, in order of increasing sophistication:
- Identity
- PAM
- JTT
- mtREV
- Poisson
- Mixture models
[Figure: sophistication axis labeled, in order: no model; extrapolation of observed substitution rates; probabilistic substitution rates; site-specific residue frequencies]
Mixture models:
- High dimensional model, but requires a large dataset
- Each site can choose its own substitution model, coupled with maximum likelihood probability estimations or MCMC/Bayesian methods

MEGA: choose tree test options
- Now switch over to the Test of Phylogeny tab.
- In order to determine the validity of your tree you'll need to bootstrap it. Since our sequence isn't very long, only a couple hundred replications are needed.
- Now click the check button, then click Compute in the main window.

MEGA: edit your tree
- Your tree should appear. Not a very good one in this case. Why? Because the sequences were too similar.
- The icons on the left allow you to reroot, flip branches, etc. You can also change the format of the tree.
- But let's also compute a condensed tree (select that from the Compute menu) using a cutoff of 50%.

MEGA: interpret the tree
- Four of the sequences cluster indistinguishably together, while a single other sequence stands out.
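One way to see why very similar sequences cluster indistinguishably is to compute pairwise distances from the alignment: near-identical sequences give distances close to zero, leaving a distance method little signal to resolve. The sketch below uses the textbook Poisson correction, d = -ln(1 - p), and its gamma-corrected form, d = a[(1 - p)^(-1/a) - 1], where p is the proportion of differing sites and a is the gamma parameter (2.0 in the example above). The function names are hypothetical, and whether MEGA computes exactly these quantities internally is an assumption.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson-corrected distance: d = -ln(1 - p)."""
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log(1.0 - p)


def gamma_distance(p, shape=2.0):
    """Gamma-corrected distance: d = a * ((1 - p) ** (-1 / a) - 1)."""
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)


if __name__ == "__main__":
    # Two rows from the example alignment used earlier in the slides.
    p = p_distance("YTSLLLSRQ-", "YASLLW-RQA")
    print(round(p, 3), round(poisson_distance(p), 3), round(gamma_distance(p), 3))
    # -> 0.25 0.288 0.309
```

Both corrections increase the raw proportion of differences to account for multiple substitutions at the same site, and the gamma form increases it more when rates vary strongly among sites.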
Looking back at our alignments, we could have predicted this clustering.

Estimating Phylogenetic Relationships
- MEGA
- MrBayes

MrBayes
- Making a NEXUS (.nex) file
- Running MrBayes

Interpret results correctly
- Sequence similarity (think Goldilocks)
- Quality of aligned sequences
- Use an appropriate model
- Use an appropriate estimation method
- Determine the validity of each part of your tree
  - One bad egg
- Develop a model to explain your tree
  - How does it square with known information?
  - What can you learn from your sequences?
  - What can't you learn from your analysis?
- Use appropriate parameters
- Try different things and compare results wisely

The Intelligent Consumer
(You don't have to completely understand everything in order to use it properly, but it helps to have a rough idea.)
BLAST
- stochastic processes
- random walks
Sequence alignments
- Markov processes
- dynamic programming
- Viterbi, Forward, and Backward algorithms
Bayesian phylogenetic inference
- Bayes theorem
- Bayesian inference
- Metropolis algorithm

Many uses for multiple sequence analysis
- Protein family analysis: multiple sequence alignment, then profile, then profile HMM (hidden Markov model), used to find new proteins with the same domains
- RNA secondary structure prediction
- Protein secondary structure prediction
- Protein structure prediction (homology modeling): protein sequence with known structure, aligned sequences with unknown structure
- Comparative genomics